Dynamic Branch Prediction Ali Azarpeyvand
Tomasulo Review • Reservation stations: renaming to larger set of registers + buffering source operands • Prevents registers as bottleneck • Avoids WAR, WAW hazards of Scoreboard • Allows loop unrolling in HW • Not limited to basic blocks (integer unit gets ahead, beyond branches) • Lasting Contributions • Dynamic scheduling • Register renaming • Load/store disambiguation • 360/91 descendants include Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
Outline • Dynamic Branch Prediction • Branch prediction buffer or branch history table • Correlating branch predictors • Tournament predictors • Branch target buffers • Integrated Instruction fetch unit • Return address predictors
Dynamic Branch Prediction • Performance = ƒ(accuracy, cost of misprediction) • Branch History Table (branch-prediction buffer) is simplest • Lower bits of PC address index table of 1-bit values • Says whether or not branch taken last time • No address check • Problem: in a loop, 1-bit BHT will cause two mispredictions per pass through the loop (example: a loop branch taken 9 times and then not taken once on exit is predicted with only 80% accuracy, although it is taken 90% of the time) • Solution: 2-bit prediction scheme
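The 80% figure for the 1-bit scheme can be checked with a small simulation. This is a minimal sketch for a single branch with its own table entry; the function name `one_bit_accuracy` and the starting value of the history bit are assumptions made for illustration.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Fraction of correct predictions for one branch under a 1-bit predictor."""
    last = initial            # the single history bit: what the branch did last time
    correct = 0
    for taken in outcomes:
        if last == taken:     # predict "same as last time"
            correct += 1
        last = taken          # always overwrite the bit with the real outcome
    return correct / len(outcomes)

# Loop branch: taken 9 times, then not taken once on exit; the loop is re-entered 100 times.
# The bit starts at "not taken", as the previous loop exit would have left it.
loop_pass = [True] * 9 + [False]
accuracy = one_bit_accuracy(loop_pass * 100)
print(f"{accuracy:.3f}")
```

Each pass mispredicts twice: once on the first iteration (the bit still says "not taken" from the last exit) and once on the exit itself.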
Dynamic Branch Prediction • Solution: 2-bit scheme where the prediction changes only after it mispredicts twice • Dark: stop, not taken • Light: go, taken
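A common way to realize a 2-bit scheme is a saturating counter. The sketch below assumes that variant (states 0 and 1 predict not taken, 2 and 3 predict taken), which may differ in detail from the state diagram on the slide; the function name is hypothetical.

```python
def two_bit_accuracy(outcomes, state=3):
    """Accuracy for one branch under a 2-bit saturating counter (assumed variant)."""
    correct = 0
    for taken in outcomes:
        predict_taken = state >= 2          # upper half of the counter range means "taken"
        if predict_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# Same loop branch as before: taken 9 times, then not taken once, repeated 100 times.
loop_pass = [True] * 9 + [False]
accuracy = two_bit_accuracy(loop_pass * 100)
print(f"{accuracy:.3f}")
```

With the counter starting at "strongly taken", only the loop exit is mispredicted, because a single not-taken outcome moves the counter to "weakly taken" rather than flipping the prediction.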
BHT Accuracy • Mispredict because either: • Wrong guess for that branch • Got branch history of wrong branch when indexing the table • With a 4096-entry table, misprediction rates vary by program from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% • 4096 entries about as good as an infinite table (in Alpha 21164) • Branch penalty and branch frequency are also important
BHT Accuracy 4096 entry, two bit prediction
Correlating Branches • Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table • In general, (m,n) predictor means record last m branches to select between 2^m history tables each with n-bit counters • Old 2-bit BHT is then a (0,2) predictor
Examples • Code from eqntott from SPEC92 • b3 has correlation with b1, b2
Branch Prediction Result 1 bit predictor, (d is 0 or 2)
Correlating Prediction Performance One bit predictor with one bit correlation
Correlating Branches • (2,2) predictor • Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction • Simple implementation: • Global history can be stored in a shift register • Branch address is concatenated with global branch history and used to index the table
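The shift-register implementation can be sketched as follows. This is an illustrative model, not a description of any particular processor: the class name, the 10 index bits, the 4-byte instruction alignment, and the "weakly not taken" initial counter value are all assumptions.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select among 2^m n-bit counters per entry."""

    def __init__(self, m=2, n=2, index_bits=10):
        self.m = m
        self.index_bits = index_bits
        self.history = 0                          # m-bit global shift register
        self.max_count = (1 << n) - 1             # largest n-bit counter value
        # One flat table of 2^(index_bits + m) counters, all starting weakly not taken.
        self.counters = [self.max_count // 2] * (1 << (index_bits + m))

    def _index(self, pc):
        # Low PC bits (byte offset dropped) concatenated with the global history.
        low_pc = (pc >> 2) & ((1 << self.index_bits) - 1)
        return (low_pc << self.m) | self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] > self.max_count // 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        # Shift the outcome into the history register, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_mispredictions(predictor, pc, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 100
corr_wrong = count_mispredictions(CorrelatingPredictor(m=2, n=2), 0x40, alternating)
plain_wrong = count_mispredictions(CorrelatingPredictor(m=0, n=2), 0x40, alternating)
print(corr_wrong, plain_wrong)
```

On this alternating pattern the (2,2) configuration mispredicts only during warm-up, while the (0,2) configuration, from this starting state, keeps flipping between its two weak states and mispredicts every time.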
Number of Stored Bits • For an (m,n) predictor: • 2^m * n * Number of prediction entries • Example: • 2-bit predictor with 4096 entries: • 2^0 * 2 * 4K = 8K bits • (2,2) predictor, how many entries for the same 8K bits: • 2^2 * 2 * x = 8K, so x = 1K • Comparison in the next slide
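The storage arithmetic on this slide can be reproduced directly; the helper name `predictor_bits` is hypothetical.

```python
def predictor_bits(m, n, entries):
    """Total state bits of an (m, n) predictor with the given number of entries."""
    return (2 ** m) * n * entries

# (0,2) predictor with 4096 entries.
bht_bits = predictor_bits(0, 2, 4096)

# Number of entries a (2,2) predictor can have within the same bit budget.
entries_22 = bht_bits // ((2 ** 2) * 2)

print(bht_bits, entries_22)
```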
Accuracy of Different Schemes • [Figure: frequency of mispredictions (axis from 0% to 18%) for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT]
Tournament Branch Predictor • Used in Alpha 21264: tracks both “local” and global history • Intended for mixed types of applications • Global history: T/NT history of past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T) • [Figure: block diagram with PC, global history, local predictor, global predictor, choice predictor, and a mux producing the NT/T prediction]
Tournament Branch Predictor • Local predictor: uses 10-bit local history and 3-bit counters • PC indexes a local history table (1K x 10); the 10-bit history indexes counters (1K x 3) that give NT/T • Global and choice predictors: • 12-bit global history indexes 2-bit counters (4K x 2) that give NT/T • The same 12-bit global history indexes a second set of 2-bit counters (4K x 2) that gives the local/global choice
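The table sizes on this slide can be put together in a small behavioral model. This is a sketch under stated assumptions, not the Alpha 21264 design itself: the initial counter values, the 4-byte instruction alignment, and the rule of training the choice counter only when the two predictors disagree are illustrative choices, and all names are hypothetical.

```python
def bump(counter, up, maximum):
    """Saturating increment or decrement."""
    return min(counter + 1, maximum) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024   # 1K x 10-bit local histories, indexed by PC
        self.local_ctr = [3] * 1024    # 1K x 3-bit counters, taken if >= 4
        self.global_ctr = [1] * 4096   # 4K x 2-bit counters, taken if >= 2
        self.choice_ctr = [1] * 4096   # 4K x 2-bit counters, use global if >= 2
        self.ghist = 0                 # 12-bit global history

    def _lookup(self, pc):
        li = (pc >> 2) & 1023                      # local history table index
        lh = self.local_hist[li]                   # 10-bit history of this branch
        local_pred = self.local_ctr[lh] >= 4
        global_pred = self.global_ctr[self.ghist] >= 2
        use_global = self.choice_ctr[self.ghist] >= 2
        return li, lh, local_pred, global_pred, use_global

    def predict(self, pc):
        _, _, local_pred, global_pred, use_global = self._lookup(pc)
        return global_pred if use_global else local_pred   # the mux

    def update(self, pc, taken):
        li, lh, local_pred, global_pred, _ = self._lookup(pc)
        g = self.ghist
        if local_pred != global_pred:
            # Assumed policy: move the chooser toward whichever predictor was right.
            self.choice_ctr[g] = bump(self.choice_ctr[g], global_pred == taken, 3)
        self.local_ctr[lh] = bump(self.local_ctr[lh], taken, 7)
        self.global_ctr[g] = bump(self.global_ctr[g], taken, 3)
        self.local_hist[li] = ((lh << 1) | int(taken)) & 1023
        self.ghist = ((g << 1) | int(taken)) & 4095


# One branch with a repeating taken, taken, not-taken pattern.
tp = TournamentPredictor()
pattern = [True, True, False] * 100
late_wrong = 0
for i, taken in enumerate(pattern):
    if i >= 270 and tp.predict(0x400) != taken:
        late_wrong += 1            # count mispredictions only after a long warm-up
    tp.update(0x400, taken)
print(late_wrong)
```

Once the histories have filled, each history value in this pattern is always followed by the same outcome, so the counters it selects saturate in the right direction.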
Reducing Branch Stalls • In MIPS, branch predicted as taken • We need the target address • High Performance Instruction Delivery • Branch target buffer • integrated instruction fetch unit • predicting return addresses
Need Address at Same Time as Prediction • Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
Example • Prediction accuracy: 90% (for instructions in the buffer) • Hit rate in the buffer: 90% (for branches predicted taken) • Taken branch frequency: 60% • Probability (branch in buffer, but actually not taken) = Percent buffer hit rate × Percent incorrect predictions = 90% × 10% = 0.09 • Probability (branch not in buffer, but actually taken) = 10% × 60% = 0.06 • Branch penalty = (0.09 + 0.06) × 2 cycles = 0.30 cycles
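The same arithmetic, written out so the inputs can be varied. The variable names are illustrative, and the 2-cycle penalty per misprediction is the factor of 2 used on the slide.

```python
prediction_accuracy = 0.90   # for branches found in the buffer
hit_rate = 0.90              # fraction of branches found in the buffer
taken_frequency = 0.60       # fraction of branches that are taken
penalty_cycles = 2           # cycles lost per misprediction

# Branch is in the buffer but the prediction turns out wrong.
p_hit_mispredicted = hit_rate * (1 - prediction_accuracy)

# Branch is not in the buffer but is actually taken.
p_miss_taken = (1 - hit_rate) * taken_frequency

branch_penalty = (p_hit_mispredicted + p_miss_taken) * penalty_cycles
print(f"{branch_penalty:.2f} cycles per branch")
```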
Branch Folding • Idea: store one or more target instructions instead of, or in addition to, the predicted target address • Advantages: • Allows the branch-target buffer access to take longer than the time between successive instruction fetches • Allows us to perform an optimization called branch folding • Branch Folding: • Zero-cycle unconditional branches, and sometimes zero-cycle conditional branches
Branch Target Buffer (summary) • Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken) • Note: must check for branch match now, since can’t use wrong branch address • Example: BTB combined with BHT • [Figure: the PC of the instruction in FETCH is compared (=?) with the stored branch PC; each entry holds a branch PC, a predicted PC, and extra prediction state bits. Yes: instruction is a branch, use predicted PC as next PC. No: branch not predicted, proceed normally (Next PC = PC+4)]
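The lookup with the branch-match check can be sketched as below. This is a simplified, hypothetical model: it is direct-mapped, assumes 4-byte instructions, keeps only branches last seen taken, and omits the extra prediction state bits.

```python
class BranchTargetBuffer:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries      # each entry: (branch_pc, predicted_pc)

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def next_pc(self, pc):
        entry = self.table[self._index(pc)]
        # The stored branch PC must match, otherwise another branch's target would be used.
        if entry is not None and entry[0] == pc:
            return entry[1]                # hit: fetch from the predicted PC
        return pc + 4                      # miss: proceed normally

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)   # remember the taken branch and its target
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None           # branch no longer taken: drop the entry


btb = BranchTargetBuffer()
miss_pc = btb.next_pc(0x100)               # nothing stored yet
btb.update(0x100, taken=True, target=0x200)
hit_pc = btb.next_pc(0x100)                # same branch: predicted target
alias_pc = btb.next_pc(0x1100)             # same index, different branch PC
print(hex(miss_pc), hex(hit_pc), hex(alias_pc))
```

The third lookup maps to the same entry as the first branch but fails the address check, so fetch falls through to PC+4 instead of using the wrong target.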
Return Address Prediction • Register indirect branch: target address is hard to predict • Branch prediction buffer techniques do not work well in this situation: • Many callers, one callee • Jump to multiple return addresses from a single address (no PC-target correlation) • In SPEC89, 85% of such branches are procedure returns • Use stack discipline for procedures: save return address in a small buffer that acts like a stack; 8 to 16 entries give a small miss rate
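A return address stack can be modeled in a few lines. This sketch assumes a depth of 8 and that the oldest entry is discarded on overflow; both choices and all names are illustrative.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: lose the oldest return address
        self.stack.append(return_address)

    def predict_return(self):
        # Predict the most recent unmatched call; None if the stack is empty.
        return self.stack.pop() if self.stack else None


# One callee reached from two different call sites.
ras = ReturnAddressStack()
ras.on_call(0x1004)                        # call from site A
first = ras.predict_return()               # return goes back to site A
ras.on_call(0x2004)                        # call from site B
second = ras.predict_return()              # same return instruction, different target
print(hex(first), hex(second))
```

The return instruction sits at one address but goes to a different target each time, which a table indexed by PC cannot capture and a stack can.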
Short Seminar • Section 2.10 on Pentium 4, Branch prediction • Pentium 4 Tomasulo
Dynamic Branch Prediction Summary • Prediction is becoming an important part of scalar execution. • Branch History Table: 2 bits for loop accuracy. • Correlation: recently executed branches are correlated with the next branch. • Either different branches. • Or different executions of the same branch. • Tournament Predictor: give more resources to competing predictors and pick between them. • Branch Target Buffer: includes branch address & prediction. • Return address stack for prediction of indirect jumps.