Explore how clustered indexing optimizes conditional branch predictors. Understand the impact of branches on pipelined architectures, learn about prediction solutions, and discover how clusters and subtables in prediction tables improve the accuracy and efficiency of predicting branch outcomes.
Clustered Indexing for Conditional Branch Predictors — Veerle Desmet, Ghent University, Belgium
Conditional Branches

How frequently do conditional branches occur? Roughly one instruction in eight is a conditional branch. Examples:

for (i=0; i<50; i++) { /* a loop... */ }
/* next statements */

if (i > 0) /* something */
else /* something else */
Program Execution

- Fetch = take the next instruction
- Decode = analyze the type and read the operands
- Execute = perform the computation
- Write Back = write the result

Example: for R1=R2+R3, decode identifies an addition and reads the operands 4 and 3; execute performs the computation; write back stores 7 in R1.
Pipelined Architectures

(Pipeline diagram: independent instructions such as R1=R2+R3, R4=R3-1, R5=R2+1, R7=2*R1 and R5=R6 occupy the Fetch, Decode, Execute and Write Back stages simultaneously.)

Parallel versus sequential execution:
- a constant flow of instructions is possible
- applications run faster
- limited by conditional branches
Problem: Branches

(Pipeline diagram: after "if R1>0" enters the pipeline, fetch cannot know whether to continue with the then-path (R2=R2-1, R7=0) or the else-path (R4=R3-1), leaving "?" bubbles in the pipeline until the condition is resolved.)

- Branches introduce bubbles
- This reduces pipeline throughput
Solution: Prediction

(Pipeline diagram: the predicted path, e.g. the then-path R2=R2-1, is fetched immediately after "if R1>0", so the pipeline stays full.)

- Fetch those instructions that are likely to be executed
- Correct prediction = gain; misprediction = penalty
Today's Architecture

(Diagram of a superscalar pipeline: instruction cache → fetch → decode → register rename → dispatch → instruction window → functional units, with a register file and reorder logic; the branch predictor steers the fetch stage. Throughput is measured in IPC, instructions per cycle.)
Bimodal Branch Predictor

- Predicts the outcome of a condition (e.g. if vs. else) based on the branch address alone
- The low-order k bits of the branch address index the prediction table
- After the branch resolves, the prediction table is updated
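As an illustrative sketch (not the paper's implementation; all names here are invented), a bimodal predictor with one 2-bit saturating counter per table entry can be simulated like this:

```python
class BimodalPredictor:
    """Bimodal predictor: one 2-bit saturating counter (0..3) per entry,
    indexed by the low-order k bits of the branch address."""
    def __init__(self, k):
        self.k = k
        self.table = [1] * (1 << k)  # start weakly not-taken

    def index(self, branch_addr):
        return branch_addr & ((1 << self.k) - 1)

    def predict(self, branch_addr):
        return self.table[self.index(branch_addr)] >= 2  # True = taken

    def update(self, branch_addr, taken):
        i = self.index(branch_addr)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 49 times, then not taken once at loop exit:
p = BimodalPredictor(k=3)
outcomes = [True] * 49 + [False]
correct = 0
for taken in outcomes:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(correct)  # 48 of 50: one warm-up misprediction plus the loop exit
```

The saturating counter is what lets the predictor tolerate the single not-taken outcome per loop execution without flipping its prediction.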
Global History Branch Predictor

- Predicts the outcome of a condition (e.g. a for loop) based on the global history: a shift register of recent branch outcomes such as 111101111011110
- The most recent k history bits index the prediction table
- After the branch resolves, both the prediction table and the global history are updated
Gshare Branch Predictor [McFarling]

- The original index is the XOR of the global history and the branch address, truncated to k bits
- This combines per-branch and history information in a single prediction table
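The gshare index computation itself is one line; as a sketch (function name invented):

```python
# Gshare index: XOR the global history with the branch address,
# then keep the low-order k bits (McFarling's scheme).
def gshare_index(branch_addr, global_history, k):
    return (branch_addr ^ global_history) & ((1 << k) - 1)

# The same static branch uses different table entries
# when it is reached with a different global history:
print(gshare_index(0b101100, 0b000000, 4))  # 12 (binary 1100)
print(gshare_index(0b101100, 0b001010, 4))  # 6  (binary 0110)
```

XOR-ing rather than concatenating lets the full k bits carry both address and history information.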
Misprediction Rate: Gshare (SPEC INT 2000)

(Figure: misprediction rate (%) of gshare versus predictor size, 10 B to 1 MB on a log scale; lower is better.)
Aliasing

- Resource limitations: with 8 entries, the index is only 3 bits
- Two different branches A and B can map to the same index (e.g. both to 101) and then share the same prediction information
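A quick sketch of the collision (the two addresses are made up for illustration):

```python
# With a 3-bit index, any two addresses whose low 3 bits agree alias:
k = 3
mask = (1 << k) - 1
a = 0x1A5  # binary ...101 — hypothetical branch address A
b = 0x2BD  # binary ...101 — hypothetical branch address B
print(a & mask, b & mask)  # both map to entry 5 (binary 101)
```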
Aliasing (SPEC INT 2000)

(Figure: alias rate (%) versus predictor size, 16 B to 512 KB, split into destructive, constructive and neutral aliasing; the alias rate drops as the predictor grows.)
Basic Observations

- Branches with similar behavior can share prediction information, e.g. two branches that both produce the outcome stream 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 over time
- Such branches can use the same table entry, e.g. one with outcomes 1 1 1 1 0 0 0 0 and one with 1 1 1 1 0 1 0
Time-Varying Behavior

Execution is divided into phases; per phase, each branch's taken rate is recorded (NE = not executed in that phase):

A: 100%  0%   100%  50%
B: 100%  0%   100%  60%
C: 100%  25%  0%    NE
D: NE    NE   100%  33%
Branch Clustering

- Each branch's per-phase taken rates make it a point in N-dimensional space (one dimension per phase)
- Clusters of similarly behaving branches are formed by the k-means algorithm
k-Means Clustering Algorithm

1. Choose initial centers
2. Assign each point to its nearest center
3. Recompute each center as the mean of its assigned points
4. Restart with the new centers

Iterating steps 2 and 3 converges to a stable solution in which the assignments no longer change.
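The steps above can be sketched in a few lines (a plain k-means, not the paper's exact setup; the NE encoding of -1 is one of several possible choices and an assumption here):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain k-means on tuples of per-phase taken rates."""
    centers = random.sample(points, k)        # step 1: initial centers
    clusters = []
    for _ in range(iters):
        # step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[nearest].append(p)
        # step 3: recompute each center as the mean of its points
        new_centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:            # stable solution reached
            break
        centers = new_centers                 # step 4: restart
    return centers, clusters

# Branches A-D as points of per-phase taken rates (NE encoded as -1):
points = [(1.0, 0.0, 1.0, 0.5), (1.0, 0.0, 1.0, 0.6),
          (1.0, 0.25, 0.0, -1.0), (-1.0, -1.0, 1.0, 0.33)]
random.seed(0)
centers, clusters = kmeans(points, 2)
```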
Determining k of k-Means

- k is chosen by the BIC score (Bayesian Information Criterion)
- This trades off k against the goodness of the clustering; e.g. given stable solutions with k=2 and k=3, BIC decides which is best
Branch Clustering (SPEC INT 2000)

- Between 8 and 33 clusters per benchmark: mcf has 8; gcc and parser have 33
- Each branch belongs to exactly one cluster
Subtables

- The cluster ID replaces the top bits of the prediction-table index, so each cluster gets its own subtable
- Example: 8 entries → 3 index bits; 4 clusters → 2 cluster bits; for original index 101, the cluster ID supplies the upper 2 bits and the remaining bit comes from the original index
- On SPEC INT 2000, 3 to 6 bits suffice for the cluster ID
- Subtabling can be used in every predictor scheme
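The subtable index construction, sketched (function name invented):

```python
# The cluster ID replaces the top bits of the original index, so each
# cluster gets its own contiguous region (subtable) of the table.
def subtable_index(original_index, cluster_id, k, cluster_bits):
    offset_bits = k - cluster_bits
    offset = original_index & ((1 << offset_bits) - 1)  # keep low bits only
    return (cluster_id << offset_bits) | offset

# 8 entries (k=3), 4 clusters (2 bits), original index 101, cluster 1:
print(bin(subtable_index(0b101, cluster_id=1, k=3, cluster_bits=2)))
# '0b11' — entry 3: cluster 1's subtable, offset 1
```

Branches in the same cluster now alias only with each other, which is exactly the constructive sharing the clustering was built to create.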
Subtables for Bimodal

(Figure: misprediction rate versus predictor size, 10 B to 1 MB, for the original bimodal predictor and the clustered variant whose index concatenates cluster bits with branch-address bits.)
Subtables for Gshare

(Figure: misprediction rate versus predictor size, 10 B to 1 MB, for the original gshare predictor and the clustered variant indexed by cluster bits, global history and branch address.)

Clustered gshare is 19% better for SMALL predictors.
Why Clustered Indexing Works

- Subtabling effectively uses smaller per-cluster predictors
- More aliasing is therefore expected... but the extra aliasing is more often constructive
Hashing: an Alternative to Subtables

- Keeps the original global-history length (no index bits are sacrificed for the cluster ID)
- The cluster ID is hashed into the gshare index together with the global history and the branch address
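One simple way to fold the cluster ID into the index is a further XOR; the paper's actual hash function may differ, so treat this as an assumed illustration:

```python
# Hashed clustered index sketch: the cluster ID is XOR-folded into the
# gshare index, so the full k-bit history still participates.
def hashed_index(branch_addr, global_history, cluster_id, k):
    return (branch_addr ^ global_history ^ cluster_id) & ((1 << k) - 1)

print(hashed_index(0b101100, 0b001010, cluster_id=0b11, k=4))  # 5
```

Unlike subtabling, no index bits are reserved for the cluster, which is why this variant pays off at large predictor sizes where long histories matter most.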
Hashing for Gshare

(Figure: misprediction rate versus predictor size for the original gshare, clustered with subtables, and clustered with hashing; a detail plot covers 1 KB to 1 MB.)

Hashing is 5% better for LARGE predictors.
Self Profile-Based Clustering

- Limit study: the identified clusters are optimal for the given execution, since clustering and evaluation use the same run

(Diagram: branches A-D with their per-phase taken rates, each assigned to a cluster.)
Cross Profile-Based Clustering

- Clusters are identified on the SPEC train inputs and then applied to the evaluated run
- An additional cluster collects branches unseen during profiling

(Diagram: branches A-E profiled on the train inputs; profiled branches map to clusters matching the self-clustering, while the unseen branch E falls into the additional cluster.)
Cross Profile-Based Clustering: Results

(Figures: misprediction rate versus predictor size, 10 B to 1 MB, for bimodal and gshare, comparing the original, self-clustered and cross-clustered variants; a gshare detail plot also separates subtables from hashing.)

Cross-clustered indexing remains effective. For gshare at small budgets, subtables give 12.3% fewer mispredictions (19% with self clustering); at large budgets, hashing is 3% better (5% with self clustering).
Conclusion

- Small branch predictors suffer from aliasing, which is frequently destructive
- Clustering branches exploits constructive aliasing instead
- Implementation: subtables (usable in all branch-prediction schemes) or hashing (specific to gshare)
- Gshare misprediction rate: reduced by 19% (self) and 12.3% (cross) at 1 KiB; by 5% (self) and 3% (cross) at 256 KiB