Computer Architecture Principles, Dr. Mike Frank. CDA 5155, Summer 2003. Module #22: Dynamic Branch Prediction
Dynamic Branch Prediction (3.4) • As the amount of ILP exploited increases (and CPI decreases), the impact of control stalls grows: • branches are encountered more often per unit of useful work, and • an n-cycle delay postpones more instructions. • Dynamic hardware branch prediction • "Learns" which branches are taken, or not. • Makes the right guess (most of the time) about whether a branch is taken. • The delay depends on whether the prediction is correct, and on whether the branch is taken.
Branch-Prediction Buffers (BPB) • Also called “branch history table” • Low-order n bits of branch address used to index a table of branch history data. • May have “collisions” between distant branches. • Associative tables also possible • In each entry, k bits of information about history of that branch are stored. • Common values of k: 1, 2, and larger • Entry is used to predict what branch will do. • Actual behavior of branch will update the entry.
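A minimal C sketch (not from the slides) of how such a buffer might be indexed; the 4,096-entry size matches the later slides, but the byte-wide entries and the exact address bits used are assumptions:

    #include <stdint.h>

    #define BPB_ENTRIES 4096               /* power-of-two table size (from the slides) */

    static uint8_t bpb[BPB_ENTRIES];       /* k bits of history per entry (one byte here) */

    /* Index the table with the low-order bits of the branch address.
     * Distant branches whose addresses share those bits collide and
     * overwrite each other's history. */
    static unsigned bpb_index(uint32_t branch_pc) {
        return (branch_pc >> 2) & (BPB_ENTRIES - 1);   /* drop byte offset, keep low bits */
    }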
1-bit Branch Prediction • The entry for a branch has only two states: • Bit = 1 • "The last time this branch was encountered, it was taken. I predict it will be taken next time." • Bit = 0 • "The last time this branch was encountered, it was not taken. I predict it will not be taken next time." • Will mispredict twice each time a loop is executed: • on the first iteration (the bit still records the previous exit as not taken) and on the last iteration (the exit is predicted taken). • May always mispredict in pathological cases!
2-bit Branch Prediction • 2 bits → 4 states • Commonly used to code the most recent branch outcome, & the most recent run of 2 consecutive identical outcomes. • Strategy: • Prediction mirrors the most recent run of 2. • Only 1 misprediction per loop execution, after the first time the loop is reached. • On the last iteration.
State Transition Diagram (figure): states #3 (11), #2 (10), #1 (01), #0 (00).
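One common realization of this 4-state scheme is a saturating counter; a minimal C sketch follows (the state encoding 00..11 matches the diagram, but the exact transitions on the original slide may differ slightly):

    /* 2-bit predictor state: 0 = strongly not taken, 1 = weakly not taken,
     * 2 = weakly taken, 3 = strongly taken. */
    typedef unsigned char state2_t;

    static int predict_taken(state2_t s) {
        return s >= 2;                       /* states 10 and 11 predict taken */
    }

    /* Saturating update: move one step toward the actual outcome. */
    static state2_t update(state2_t s, int taken) {
        if (taken) return (s < 3) ? (state2_t)(s + 1) : (state2_t)3;
        else       return (s > 0) ? (state2_t)(s - 1) : (state2_t)0;
    }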
Misprediction rate for 2-bit BPB (with 4,096 entries)
n-bit Branch Prediction One commonly tried scheme: • Each entry contains an integer in [0, 2^n − 1]. • After branch execution, if the branch was taken, • then: entry ← min(entry+1, 2^n − 1) ; increment • else: entry ← max(entry−1, 0) ; decrement • If entry < ½·2^n, then predict not taken, • else predict taken. • Effectively does the following: • Averages branch behavior over a long time, and • Predicts the more frequently occurring outcome. • Empirically, not much better than 2-bit!
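The same idea parameterized over n, following the formulas above (a sketch; the counter width and the single-entry simplification are assumptions):

    #define N_BITS  3                            /* assumed counter width n */
    #define MAX_CNT ((1u << N_BITS) - 1u)        /* 2^n - 1 */

    static unsigned counter;                     /* one entry; a real BPB holds many */

    static int nbit_predict_taken(void) {
        return counter >= (1u << (N_BITS - 1));  /* taken iff counter >= (1/2)*2^n */
    }

    static void nbit_update(int taken) {
        if (taken) counter = (counter < MAX_CNT) ? counter + 1 : MAX_CNT;
        else       counter = (counter > 0)       ? counter - 1 : 0;
    }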
Ideal Branch Prediction? • More sophisticated schemes are certainly theoretically possible… • Could recognize simple patterns of branches • E.g., TNTNTNTNTN (T=Taken, N=Not taken) • However, totally general, optimal prediction is essentially equivalent to the general learning problem in its difficulty! • Ideal branch prediction is uncomputable statically. • It is apparently impossible to even objectively define “ideal” dynamic prediction, • and it’s intractable to compute it under many specific (and subjectively motivated) definitions.
Implementing Branch Histories • Two options: a separate "cache" accessed during IF, or extra bits in the instruction cache. • Problem with either approach, in the simple RISC pipeline we've been studying: • After fetch, we don't know whether the instruction is really a branch or not (until decoding), • and we don't know the target address either. • By the time you know these things (in ID), you already know whether the branch is really taken, • so no time has been saved! • Branch-Target Buffers can fix this problem (later)...
Branch-Prediction Performance • Contribution to cycle count depends on: • Branch frequency & misprediction frequency • Freqs. of taken/not taken, predicted/mispredicted. • Delay of taken/not taken, predicted/mispredicted. • How to reduce misprediction frequency? • Increase buffer size to avoid collisions. • Empirically, has little effect beyond ~4,096 entries. • Increase prediction accuracy • Increase # of bits/entry (little effect beyond 2) • Use a different prediction scheme • e.g., correlated predictors, which we will now discuss…
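As an illustrative back-of-the-envelope estimate (the numbers here are assumed, not from the slides): if 20% of instructions are branches, 10% of those branches are mispredicted, and each misprediction costs 3 cycles, then branch mispredictions add roughly 0.20 × 0.10 × 3 ≈ 0.06 cycles per instruction to the CPI.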
Correlated Prediction - example • Code fragment from eqntott:
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) { …
• Simple RISC code (aa=R1, bb=R2):
        SUBUI R3,R1,#2   ;(aa-2)
        BNEZ  R3,L1      ;branch b1 (aa!=2)
        ADD   R1,R0,R0   ; aa=0
    L1: SUBUI R3,R2,#2   ;(bb-2)
        BNEZ  R3,L2      ;branch b2 (bb!=2)
        ADD   R2,R0,R0   ; bb=0
    L2: SUBU  R3,R1,R2   ;(aa-bb)
        BEQZ  R3,L3 …    ;branch b3 (aa==bb)
Note that if b1 and b2 are both untaken, b3 will be taken.
Even simpler example: b1 untaken implies b2 untaken • C code:
    if (d==0) d=1;
    if (d==1) …
• Simple RISC code (d=R1):
        BNEZ  R1,L1      ;b1: d!=0
        ADDI  R1,R0,#1   ;d=1
    L1: SUBUI R3,R1,#1   ;(d-1)
        BNEZ  R3,L2      ;b2: d!=1
Behavior with a 1-bit predictor • Suppose the initial value of d alternates between 2 and 0. • Then b1 and b2 each alternate taken / not taken, so a 1-bit predictor, which always predicts the previous outcome, is wrong every time: all branches are mispredicted!
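A small C simulation of this example (a sketch; each branch gets its own 1-bit predictor that simply remembers its last outcome, and the initial "not taken" predictions are an assumption):

    #include <stdio.h>

    int main(void) {
        int pred_b1 = 0, pred_b2 = 0;       /* 1 = predict taken; initially predict not taken */
        int miss = 0, total = 0;

        for (int i = 0; i < 10; i++) {
            int d = (i % 2 == 0) ? 2 : 0;   /* d alternates between 2 and 0 */

            int b1 = (d != 0);              /* b1 taken skips d = 1 */
            if (b1 != pred_b1) miss++;
            total++;
            pred_b1 = b1;                   /* 1-bit predictor: remember the last outcome */
            if (!b1) d = 1;

            int b2 = (d != 1);              /* b2 taken skips the second if-body */
            if (b2 != pred_b2) miss++;
            total++;
            pred_b2 = b2;
        }
        printf("%d of %d branches mispredicted\n", miss, total);   /* prints 20 of 20 */
        return 0;
    }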
Correlating Predictors • Keep different predictions for the current branch depending on whether the previously executed branch instruction was taken or not. • Notation: X/Y, where X is the prediction to use if the last branch was NOT taken and Y is the prediction to use if the last branch was taken (in the original slide, the prediction actually used is shown in bold).
(m,n) correlated predictors • Uses the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch. • Each of these predictors records n bits of history information for any given branch. • On the previous slide we saw a (1,1) predictor. • Easy to implement: • Behavior of last m branches: an m-bit shift register. • Branch-prediction buffer: accessed with the low-order bits of the branch address, concatenated with the shift register.
Correlated predictor schematic (figure): a prediction table addressed by a row select and a column select, with each new branch outcome shifted into the history shift register.
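A C sketch of this structure for a (2,2) configuration (the table size, the number of address bits used, and the exact index layout are assumptions; the slides specify only the general organization):

    #include <stdint.h>

    #define M         2                            /* bits of global history (m) */
    #define ADDR_BITS 10                           /* assumed low-order branch-address bits */

    static uint8_t table[1u << (ADDR_BITS + M)];   /* 2^m two-bit counters per address index */
    static unsigned history;                       /* m-bit shift register of recent outcomes */

    static unsigned index_of(uint32_t pc) {
        /* Concatenate low-order address bits with the global history. */
        unsigned addr = (pc >> 2) & ((1u << ADDR_BITS) - 1);
        return (addr << M) | (history & ((1u << M) - 1));
    }

    static int corr_predict_taken(uint32_t pc) {
        return table[index_of(pc)] >= 2;           /* 2-bit counter: predict taken if >= 2 */
    }

    static void corr_update(uint32_t pc, int taken) {
        uint8_t *e = &table[index_of(pc)];
        if (taken) { if (*e < 3) (*e)++; }
        else       { if (*e > 0) (*e)--; }
        history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);  /* shift in outcome */
    }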
Branch-Target Buffers (BTB) • How can we know the address of the next instruction as soon as the current instruction is fetched? • Normally, an extra (ID) cycle is needed to: • determine that the fetched instruction is a branch, • determine whether the branch is taken, and • compute the target address (PC + offset). • Branch prediction alone doesn't help DLX here. • What if, instead, the next instruction's address could be fetched at the same time as the current instruction? That is what a BTB provides.
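A C sketch of a direct-mapped BTB consulted during IF (the size, tag scheme, and field names are assumptions):

    #include <stdint.h>

    #define BTB_ENTRIES 1024

    typedef struct {
        int      valid;
        uint32_t tag;       /* branch PC stored to confirm the match */
        uint32_t target;    /* predicted next PC if the branch is predicted taken */
    } btb_entry_t;

    static btb_entry_t btb[BTB_ENTRIES];

    /* Consulted in IF, in parallel with the instruction fetch itself:
     * on a hit, the predicted target can be fetched on the very next cycle,
     * before ID has even determined that the instruction is a branch. */
    static uint32_t next_fetch_pc(uint32_t pc) {
        btb_entry_t *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
        if (e->valid && e->tag == pc)
            return e->target;       /* predicted-taken branch: redirect the fetch */
        return pc + 4;              /* default: sequential fetch */
    }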
Penalties in different cases when using a BTB (summarized in a table on the original slide). • If the instruction is not in the BTB and the branch is not taken (a case not shown in the table), the penalty is 0.
Branch-Target Buffer Variants • Store target instructions instead of their addresses! • Saves on fetch time. • Permits branch folding - zero-cycle branches! • Substitute destination instruction for branch in pipeline! • Predicting register/indirect branches • E.g., abstract function calls, switch statements, procedure returns. • CPU-internal return-address stack
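A C sketch of the CPU-internal return-address stack mentioned above (the depth and the wrap-around overflow policy are assumptions):

    #include <stdint.h>

    #define RAS_DEPTH 16                    /* assumed stack depth */

    static uint32_t ras[RAS_DEPTH];
    static unsigned ras_top;                /* wraps around, silently overwriting on overflow */

    /* On a predicted call: remember where the matching return should go. */
    static void ras_push(uint32_t return_pc) {
        ras[ras_top] = return_pc;
        ras_top = (ras_top + 1) % RAS_DEPTH;
    }

    /* On a predicted return: the top of the stack is the predicted target.
     * A BTB alone handles returns poorly, since one return instruction
     * can have many different destinations. */
    static uint32_t ras_pop(void) {
        ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
        return ras[ras_top];
    }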
Branch Prediction Styles • Local predictors (e.g., simple 2-bit) look only at the history of the particular branch in question • Global (e.g., correlating) predictors also look at other events that have happened in context • e.g., history of recent branch outcomes • Tournament predictors operate several branch predictors in parallel, • e.g., 1 local and 1 global, • and dynamically learn which one performs best for a given branch. • Tournament predictors are one type of multilevel branch predictors • These have 2 or more levels of branch-prediction tables
FSM for Tournament Predictor (figure): counter values 0 through 3. Predictor 1/2 result status: 1 = prediction correct, 0 = prediction incorrect. If predictor 1 is correct, counter = min(counter+1, 3); if predictor 2 is correct, counter = max(counter−1, 0).
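A C sketch of the selector described by this FSM (here the counter moves only when exactly one predictor is correct, a common refinement; the threshold and the tie handling are assumptions):

    /* 2-bit selector: values 0-1 favor predictor 2, values 2-3 favor predictor 1. */
    static unsigned selector = 2;

    static int use_predictor_1(void) {
        return selector >= 2;
    }

    static void selector_update(int p1_correct, int p2_correct) {
        if (p1_correct && !p2_correct) {        /* predictor 1 wins this round */
            if (selector < 3) selector++;
        } else if (p2_correct && !p1_correct) { /* predictor 2 wins this round */
            if (selector > 0) selector--;
        }                                       /* both right or both wrong: no change */
    }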
When the best predictor fails... • Even the best branch predictors have a non-zero miss rate! • What else can you do to improve these cases? • Another approach: reduce the miss penalty to zero. • One way to reduce the miss penalty for branches: • Take both paths simultaneously! (Parallel speculative execution.) • Fetch (or pre-fetch) both possible next instructions. • Begin executing both in parallel until the outcome is known. • May only work for constant (immediate/PC-relative) branches: • a branch to a computed effective address may have too many destinations. • May have a large penalty in energy, area, clock speed…