EENG 449bG/CPSC 439bG Computer Systems, Lecture 16: Instruction Level Parallelism II, Dynamic Branch Prediction. March 24, 2005. Prof. Andreas Savvides, Spring 2005. http://www.eng.yale.edu/courses/2005s/eeng449b
Announcements • Reading for this lecture: Chapter 3, sections 3.4 & 3.5 • Homework #2
Why do we Need Dynamic Hardware Prediction? • Basic blocks are short, and we have already optimized them with dynamic scheduling in Tomasulo’s algorithm • Now the bottleneck is control dependences • Branches disrupt sequential flow of execution • Need to find ways to avoid stalls from branches • Need to predict 2 things • Branch outcome • Branch target address (what is the next address we should execute code from?)
Static Prediction Strategies • Several static strategies can apply • Predict all branches as NOT TAKEN • Predict all branches as TAKEN • Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN • Predict all forward branches as NOT TAKEN and all backward branches as TAKEN • Opcodes have default predictions that the compiler may reverse at compile time
Dynamic Branch Prediction • Builds on the premise that history matters • Observe the behavior of branches in previous instances and use it to predict future branch behavior • Try to predict the outcome of a branch early on in order to avoid stalls • Branch prediction is critical for multiple-issue processors • In an n-issue processor, branches arrive up to n times faster than in a single-issue processor
Branch Prediction Metrics • To evaluate the effectiveness of branch prediction you need to consider • Prediction accuracy • Penalties associated with branch taken and branch not taken • The associated penalties are artifacts of • Pipeline design • Type of predictor • Branch frequency • Strategy to deal with the misprediction
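As a rough illustration of how these factors combine, the sketch below uses a simplified model that is an assumption here, not something stated on the slide: every misprediction costs the same fixed number of stall cycles and a correct prediction costs nothing. The function name `effective_cpi` and the sample numbers are hypothetical.

```python
def effective_cpi(base_cpi, branch_frequency, misprediction_rate, penalty_cycles):
    """CPI under the assumed model: stalls come only from mispredicted branches."""
    stall_cycles_per_instruction = branch_frequency * misprediction_rate * penalty_cycles
    return base_cpi + stall_cycles_per_instruction


# Hypothetical workload: 20% branches, 10% of them mispredicted, 3-cycle penalty.
print(effective_cpi(1.0, 0.20, 0.10, 3))
```

A pipeline with different penalties for taken and not-taken branches would need one term per case rather than the single penalty used here.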
Basic Branch Predictor • Use a 1-bit branch prediction buffer, also called a branch history table • 1 bit of memory stating whether the branch was recently taken or not • Indexed by the lower portion of the address of the branch instruction • Bit entry updated each time the branch instruction is executed • Problem with 1-bit prediction • On a loop branch it gives the wrong prediction twice per execution of the loop • Imagine executing a loop repeatedly • Predictor will be wrong on the first and last iteration
A One-Bit Predictor • [State diagram: State 1 (Predict Taken) and State 0 (Predict Not Taken); a taken outcome (T) leads to State 1, a not-taken outcome (NT) leads to State 0] • Predictor misses twice on typical loop branches • Once at the end of the loop • Once at the end of the 1st iteration of the next execution of the loop • The alternating outcome sequence NT-T-NT-T makes it miss every time
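The two claims about the one-bit predictor can be checked with a small simulation. This is a minimal sketch: the function name, the 10-iteration loop, and the choice to start in the Predict Taken state are assumptions made for illustration.

```python
def one_bit_mispredictions(outcomes, initial_prediction=True):
    """Count misses of a 1-bit predictor; True means taken, False means not taken."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the single bit simply remembers the last outcome
    return misses


# Assumed loop: the closing branch is taken 9 times, then falls through once.
loop = [True] * 9 + [False]

print(one_bit_mispredictions(loop * 3))          # three executions of the loop
print(one_bit_mispredictions([False, True] * 4))  # NT-T-NT-T-... pattern
```

Starting in Predict Taken, the first execution of the loop only misses at the exit; each later execution misses at both its first and last iteration.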
A Two-Bit Predictor • [State diagram: States 3 and 2 predict Taken, States 1 and 0 predict Not Taken; transitions are labeled T (taken) and NT (not taken)] • A four-state Moore machine • Predictor misses once on typical loop branches, hence popular • The outcome sequence NT-NT-T-T-NT-NT-T-T makes it miss every time
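The sketch below simulates one possible four-state machine. The transition table is an assumption, since the diagram's arrows did not survive in this transcript: it is the variant in which a misprediction in a weak state jumps directly to the opposite strong state, chosen because it reproduces both claims on the slide. The names `TRANSITIONS` and `two_bit_mispredictions` are hypothetical.

```python
# Assumed transitions: state -> (next state if taken, next state if not taken).
# States 3 and 2 predict taken; states 1 and 0 predict not taken.
TRANSITIONS = {
    3: (3, 2),
    2: (3, 0),
    1: (3, 0),
    0: (1, 0),
}


def two_bit_mispredictions(outcomes, state=3):
    """Count misses of the assumed two-bit predictor; True means taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        state = TRANSITIONS[state][0 if taken else 1]
    return misses


loop = [True] * 9 + [False]              # assumed 10-iteration loop branch
pathological = [False, False, True, True] * 2

print(two_bit_mispredictions(loop * 3))
print(two_bit_mispredictions(pathological))
```

With these transitions, each execution of the loop costs a single miss at the exit, while the NT-NT-T-T pattern keeps the machine one step behind the outcomes.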
Branch Prediction Implementation Implications • Branch predictors held in branch predictor buffers • Implemented as small caches accessed with instruction address at the IF phase of a pipeline • OR it could be implemented as a pair of bits attached to each block in the instruction cache • This branch prediction scheme does not help in the basic 5-stage pipeline • The decision whether a branch is taken and the target address are computed at the same stage…
Prediction is Program Dependent: Branch Prediction Accuracy on SPEC 89 Benchmark • Using 2-bit prediction, 4KB cache • [Chart: results shown separately for FP programs and integer programs]
Performance of SPEC 98 Benchmark • Remember • To evaluate performance you need to know the branch frequencies and misprediction penalties • FP programs typically come from scientific applications and are more loop based • Branches are harder to predict in integer programs • Integer programs typically have a higher branch frequency • How can this be improved? • Perhaps increase the size of the prediction buffer • Increase the effectiveness of the predictor
Effects of Cache Buffer Size • Increasing the size of the branch predictor buffer has little impact on branch prediction accuracy
Correlating Bit Predictors • Need to change predictor structure • What about considering the behavior of branches other than the one we are trying to predict? • The branch outcome may be predicted based on the outcomes of the previous k branches • Goal: Use correlating or 2-level predictors to exploit the correlation between consecutive branches…
Branch Correlation Example
C code:
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb){
MIPS code:
        DSUBUI R3, R1, #2
        BNEZ   R3, L1      ; branch b1
        DADD   R1, R0, R0
    L1: DSUBUI R3, R2, #2
        BNEZ   R3, L2      ; branch b2
        DADD   R2, R0, R0
    L2: DSUBU  R3, R1, R2
        BEQZ   R3, L3      ; branch b3
• Branch b3 is correlated with b1 and b2
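The correlation can be made concrete by enumerating outcomes. The sketch below mirrors the branch conditions in the assembly, reading R1 as aa and R2 as bb (an interpretation of the code, not stated on the slide); `branch_outcomes` is a hypothetical helper name.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3), each True when that branch is taken."""
    b1_taken = aa != 2   # BNEZ R3, L1 skips "aa = 0" when aa != 2
    if not b1_taken:
        aa = 0
    b2_taken = bb != 2   # BNEZ R3, L2 skips "bb = 0" when bb != 2
    if not b2_taken:
        bb = 0
    b3_taken = aa == bb  # BEQZ R3, L3 is taken when aa - bb == 0
    return b1_taken, b2_taken, b3_taken


for aa in range(4):
    for bb in range(4):
        print(aa, bb, branch_outcomes(aa, bb))
```

Whenever b1 and b2 are both not taken, aa and bb have both been cleared to 0, so b3 is always taken. A predictor that only looks at b3's own history cannot see this.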
Correlated Branch Example
Consider the following code:
    if (d==0) d=1;
    if (d==1)
MIPS code:
        BNEZ   R1, L1      ; branch b1
        DADDUI R1, R0, #1
    L1: DADDUI R3, R1, #-1
        BNEZ   R3, L2      ; branch b2
        …
    L2:
• What are the possible execution sequences when d=0,1,2?
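One way to answer the question is to trace the two branches for each starting value. This is a small sketch that reads R1 as d, as the assembly suggests; the function name `execution_sequence` is hypothetical.

```python
def execution_sequence(d):
    """Return (b1_taken, b2_taken) for one pass through the code fragment."""
    b1_taken = d != 0   # BNEZ R1, L1: taken when d != 0
    if not b1_taken:
        d = 1           # DADDUI R1, R0, #1 runs only when b1 falls through
    b2_taken = d != 1   # BNEZ R3, L2 with R3 = d - 1: taken when d != 1
    return b1_taken, b2_taken


for d in (0, 1, 2):
    b1, b2 = execution_sequence(d)
    print(f"d={d}: b1 {'T' if b1 else 'NT'}, b2 {'T' if b2 else 'NT'}")
```

The useful observation is that if b1 is not taken, then d has just been set to 1, so b2 is not taken either.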
Using a 1-bit Predictor • Consider a sequence of d=2,0,2,0 and a 1-bit predictor for each branch • P = prediction, A = action, NP = new prediction
    d=?    P. b1   A. b1   NP. b1   P. b2   A. b2   NP. b2
    d=2    NT      T       T        NT      T       T
    d=0    T       NT      NT       T       NT      NT
    d=2    NT      T       T        NT      T       T
    d=0    T       NT      NT       T       NT      NT
• All branches are mispredicted!
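The table can be reproduced with a short simulation. This is a sketch under the table's own starting condition (both predictors begin as not taken); the function name `run_one_bit` is hypothetical.

```python
def run_one_bit(d_values):
    """Count mispredictions of independent 1-bit predictors for b1 and b2."""
    prediction = {"b1": False, "b2": False}  # False = NT, the table's initial state
    misses = 0
    for d in d_values:
        b1_taken = d != 0
        if not b1_taken:
            d = 1
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            if prediction[name] != taken:
                misses += 1
            prediction[name] = taken  # 1-bit update: remember the last outcome
    return misses


print(run_one_bit([2, 0, 2, 0]))
```

Each value of d flips both branches relative to the previous iteration, so a predictor that only remembers the last outcome is wrong on all eight branch executions.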
Using a 1-bit Predictor with 1-bit Correlation • Each branch has a pair of prediction bits, written X/X • The first is the prediction used if the last branch was NOT taken • The second is the prediction used if the last branch was taken • NOTE: last branch refers to the preceding branch instruction, not the previous execution of the current branch instruction
Using a 1-bit Predictor with 1-bit Correlation • Consider a sequence of d=2,0,2,0 and a 1-bit predictor with 1 bit of correlation • P = prediction, A = action, NP = new prediction
    d=?    P. b1   A. b1   NP. b1   P. b2   A. b2   NP. b2
    d=2    NT/NT   T       T/NT     NT/NT   T       NT/T
    d=0    T/NT    NT      T/NT     NT/T    NT      NT/T
    d=2    T/NT    T       T/NT     NT/T    T       NT/T
    d=0    T/NT    NT      T/NT     NT/T    NT      NT/T
• Misprediction only on the first iteration of d=2! • This is called a (1,1) predictor
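The (1,1) behavior can be simulated in a few lines. This sketch assumes, as the first table row implies, that every prediction bit starts as not taken and that the "last branch" before the first iteration counts as not taken; `run_one_one` is a hypothetical name.

```python
def run_one_one(d_values):
    """Simulate a (1,1) predictor for b1 and b2; return (misses, final table)."""
    # table[branch][last_branch_taken] -> predicted taken?  All start as NT.
    table = {
        "b1": {False: False, True: False},
        "b2": {False: False, True: False},
    }
    last_taken = False  # assumed outcome of the branch preceding the first b1
    misses = 0
    for d in d_values:
        b1_taken = d != 0
        if not b1_taken:
            d = 1
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            if table[name][last_taken] != taken:
                misses += 1
            table[name][last_taken] = taken  # update only the selected bit
            last_taken = taken               # becomes the 1-bit global history
    return misses, table


misses, table = run_one_one([2, 0, 2, 0])
print(misses, table)
```

In X/X notation the final table reads T/NT for b1 and NT/T for b2, matching the last row above; the only misses are the two in the first iteration.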
(m,n) Predictors • Use the behavior of the last m branches to choose from 2^m branch predictors • Each is an n-bit predictor for a single branch • Ex. A (2,2) branch predictor • Why do we have 4 2-bit values per line?
Example • How many branch-selected entries are in a (2,2) predictor that has a total of 8K bits in the prediction buffer? • 2^2 x 2 x (number of prediction entries) = 8K bits => 1K prediction entries selected by the branch
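The same arithmetic works for any (m,n) configuration. A minimal sketch, with `prediction_entries` as a hypothetical helper name and K taken to mean 1024:

```python
def prediction_entries(total_bits, m, n):
    """Entries selected by the branch address in an (m,n) predictor."""
    bits_per_entry = (2 ** m) * n  # 2^m predictors of n bits each per entry
    return total_bits // bits_per_entry


print(prediction_entries(8 * 1024, m=2, n=2))  # the (2,2) example above
print(prediction_entries(8 * 1024, m=0, n=2))  # plain 2-bit predictor, same budget
```

For a fixed bit budget, each extra bit of global history halves the number of entries selected by the branch address.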
Tournament Predictors • N-bit predictors – use local information • (m,n) predictors – use global information • Tournament predictors • Local + global – enhanced performance • Example of tournament predictors • Multilevel branch predictors • Uses several levels of branch prediction table • Has an algorithm to select from multiple predictors • Advantage: Select the right predictor for the right branch
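To make the selection idea concrete, here is a minimal sketch of one possible tournament arrangement. Everything in it is an assumption for illustration rather than the design on the slide: 2-bit saturating counters as building blocks, a per-branch local counter, a global table indexed by 2 bits of history, and a per-branch chooser that is only trained when the two components disagree.

```python
class Counter2:
    """2-bit saturating counter (assumed building block); >= 2 means 'yes'."""

    def __init__(self, value=1):
        self.value = value

    def predict(self):
        return self.value >= 2

    def update(self, up):
        self.value = min(3, self.value + 1) if up else max(0, self.value - 1)


class TournamentPredictor:
    def __init__(self, history_bits=2):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.global_table = [Counter2() for _ in range(1 << history_bits)]
        self.local_table = {}  # branch address -> Counter2
        self.chooser = {}      # branch address -> Counter2, >= 2 trusts global

    def predict_and_update(self, pc, taken):
        """Return the prediction for this branch, then train on the outcome."""
        local = self.local_table.setdefault(pc, Counter2())
        chooser = self.chooser.setdefault(pc, Counter2())
        global_counter = self.global_table[self.history]

        local_pred = local.predict()
        global_pred = global_counter.predict()
        prediction = global_pred if chooser.predict() else local_pred

        # Train the chooser only when the components disagree.
        if local_pred != global_pred:
            chooser.update(global_pred == taken)
        local.update(taken)
        global_counter.update(taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
        return prediction


predictor = TournamentPredictor()
outcomes = [True, False] * 10  # one branch alternating T, NT, T, NT, ...
misses = sum(predictor.predict_and_update(0x40, t) != t for t in outcomes)
print(misses, "misses out of", len(outcomes))
```

On this alternating stream the per-branch counter keeps flipping, while the history-indexed table learns the pattern, so the chooser drifts toward the global side after a short warm-up.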
High Performance Instruction Delivery • What else can be done besides branch prediction? • Need to have high-bandwidth instruction delivery • Modern multiple-issue processors require 4-8 instructions per clock cycle • To achieve that we consider • Branch Target Buffers • Integrated Instruction Fetch Units • Branch Target Cache
Branch-Target Buffers (BTB) • How can we further reduce branch penalty? • We need to know what the next instruction is by the end of IF • If the instruction is a branch and we already know the next PC, the penalty can be zero • Branch-target buffer: stores the predicted address of the next instruction after a branch • Advantage for a 5-stage pipeline • Know the predicted instruction address 1 cycle earlier, in the IF stage instead of the ID stage
BTB has a cache structure • Entries represent the addresses of known branches • Note that only predicted-taken branches need to be stored
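A minimal sketch of the lookup and update behavior follows. It makes several simplifying assumptions that are not on the slide: an unbounded dictionary stands in for the finite cache, instructions are 4 bytes wide, and an entry is removed when its branch turns out not to be taken. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a predicted-taken branch to its target."""

    INSTRUCTION_SIZE = 4  # assumed fixed-width instructions

    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def next_pc(self, pc):
        """Predicted next fetch address, available during IF."""
        # Hit: treat as a taken branch. Miss: fall through sequentially.
        return self.entries.get(pc, pc + self.INSTRUCTION_SIZE)

    def update(self, pc, taken, target):
        """Called once the branch resolves later in the pipeline."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # only predicted-taken branches are kept


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))  # unknown address: sequential fetch
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.next_pc(0x100)))  # known taken branch: fetch from the target
```

A real BTB would also bound the number of entries and compare tags, which this sketch leaves out.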
Integrated Instruction Fetch Units • Instead of using instruction fetch as one of the pipeline phases, use a more advanced instruction fetch unit • To support the demands of multiple-issue processors • Integrated IF has 3 main units • Integrated Branch Prediction • Instruction Prefetch • autonomously fetches instructions ahead of when they are needed • Instruction memory access and buffering • Tries to hide the overhead associated with fetching instructions from multiple cache lines by buffering instructions
Return Address Predictors • Predict the targets of jumps whose destination address is not known at compile time • For example, returns from procedure calls • Procedures get called from different points in the code, so the return address varies • Use a small stack of return addresses • When a procedure is called, push the return address on the stack; pop the stack on return • If the stack has enough depth: optimal prediction
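The push-on-call, pop-on-return idea can be sketched with a bounded stack. The class name, the method names, and the policy of discarding the oldest entry on overflow are assumptions for illustration.

```python
from collections import deque


class ReturnAddressStack:
    """Toy return address predictor with a fixed number of entries."""

    def __init__(self, depth):
        # A deque with maxlen silently drops the oldest entry when full.
        self.stack = deque(maxlen=depth)

    def on_call(self, return_address):
        self.stack.append(return_address)

    def predict_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for return_address in (0x100, 0x200, 0x300):  # three nested calls
    ras.on_call(return_address)
print([ras.predict_return() for _ in range(3)])
```

With three nested calls and only two entries, the two innermost returns are predicted correctly and the outermost one is lost, which is why prediction is only optimal when the stack is deep enough for the call nesting.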
Prediction Stack Performance Results based on a number of SPEC benchmarks
Recap • So far we have seen • Dynamic Scheduling: reduce stalls from data dependences • Tomasulo’s algorithm • Dynamic Branch Prediction: reduce stalls from control dependences • N-bit predictors, (m,n) predictors, Tournament Predictors • Achieve an ideal CPI of 1 • Branch target buffer, integrated IF, return address prediction
Next Lecture • Multiple issue processors • Speculation • Completion of Ch. 3