CSE 420/598 Computer Architecture Lec 9 – Chapter 2 - Branch Prediction Sandeep K. S. Gupta School of Computing and Informatics Arizona State University Based on Slides by David Patterson, Al Davis, and Luddy Harrison
Agenda
• Dynamic Branch Prediction
• 1-Bit Predictor
• 2-Bit Predictor
• Correlating Predictor
• Tournament Predictor
• Programming Assignment 1: Case Study 2 on p. 149, Modeling a Branch Predictor in C or Java
CSE420/598
Need for Better than Static Branch Prediction Techniques
[Figure with Integer and Floating Point panels; not reproduced]
Dynamic Branch Prediction
• Why does prediction work?
  – The underlying algorithm has regularities
  – The data being operated on has regularities
  – The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
• Is dynamic branch prediction better than static branch prediction?
  – Seems to be
  – A small number of important branches in programs have dynamic behavior
Control Hazard (Recap)
• In the 5-stage in-order processor: assume always taken or assume always not taken; if the branch goes the other way, squash mis-fetched instructions
• Modern out-of-order processors: dynamic branch prediction
• Branch predictor: a cache of recent branch outcomes
Pipeline without Branch Predictor
[Pipeline diagram labels: PC, IF (br), Reg Read, Compare, Br-target, PC + 4]
In the 5-stage pipeline, a branch completes in two cycles
If the branch went the wrong way, one incorrect instruction is fetched
One stall cycle per incorrect branch
Pipeline with Branch Predictor
[Pipeline diagram labels: PC, IF (br), Reg Read, Compare, Br-target, Branch Predictor]
In the 5-stage pipeline, a branch completes in two cycles
If the branch went the wrong way, one incorrect instruction is fetched
One stall cycle per incorrect branch
Branch Mispredict Penalty
• Performance = ƒ(accuracy, cost of misprediction)
• Assume: no data or structural hazards; only control hazards; every 5th instruction is a branch; branch predictor accuracy is 90%
• Slowdown = 1 / (1 + stalls per instruction)
• Stalls per instruction = % branches × % mispredictions × penalty = 20% × 10% × 1 = 0.02
• Slowdown = 1/1.02; if penalty = 20, stalls per instruction = 20% × 10% × 20 = 0.4 and slowdown = 1/1.4
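The arithmetic on this slide can be sketched as two small C helpers. The function names are illustrative only, and the model assumes an ideal CPI of 1 with control hazards as the only source of stalls, as the slide does.

```c
/* Stall cycles added per instruction by mispredicted branches.
   branch_fraction: fraction of instructions that are branches (0.20 on the slide)
   mispredict_rate: fraction of branches mispredicted (0.10 on the slide)
   penalty_cycles:  stall cycles per misprediction (1 or 20 on the slide) */
double stalls_per_instruction(double branch_fraction,
                              double mispredict_rate,
                              double penalty_cycles)
{
    return branch_fraction * mispredict_rate * penalty_cycles;
}

/* The slide's "slowdown": performance relative to a stall-free pipeline
   with CPI = 1 (an assumption of this sketch). */
double slowdown(double stalls)
{
    return 1.0 / (1.0 + stalls);
}
```

With the slide's numbers, stalls_per_instruction(0.20, 0.10, 1) gives 0.02 and slowdown(0.02) is 1/1.02; raising the penalty to 20 cycles gives 0.4 stalls per instruction and a slowdown of 1/1.4, about 0.71.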
Dynamic Branch Prediction – 1 Bit Prediction
• Branch History Table (BHT): the lower bits of the branch PC index a table of 1-bit values
  – Each bit says whether or not the branch was taken last time
  – No address check, so the entry may belong to a different branch with the same low-order PC bits
• For each branch, keep track of what happened last time and use that outcome as the prediction
1-bit BHT, a.k.a. Branch Prediction Buffer (BPB)
Predict:
• If BPB entry is 0, fetch PC+1
• If BPB entry is 1, fetch L
Update:
• If branch is taken, BPB := 1
• If branch is not taken, BPB := 0
State Diagram of 1-bit Predictor
Twice Mispredicted Loop Branches
M:  ADD R1, R2, R3
L:  ADD R4, R5, R6
    MUL R7, R8, R9
    SUB R11, R11, #1
    BNE L
    SUB R10, R10, #1
    BNE M
Sequence of Predictions
Problem with 1-bit BHT
• What are prediction accuracies for branches 1 and 2?
    while (1) {
      for (i=0;i<10;i++) { branch-1 … }
      for (j=0;j<20;j++) { branch-2 … }
    }
• Problem: in a loop, a 1-bit BHT causes two mispredictions per pass (avg is 9 iterations before exit):
  – At the end of the loop, when the branch exits instead of looping as before
  – On the first iteration of the next pass through the code, when it predicts exit instead of looping
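One way to check the two-mispredictions claim is to simulate a single 1-bit entry. The sketch below assumes that branch-1 and branch-2 are the loop-closing branches (taken on every iteration except the last), that each has its own BHT entry, and that the entry starts at "not taken". The function name is hypothetical.

```c
#include <stdbool.h>

/* Counts mispredictions of a 1-bit predictor on one loop-closing branch.
   Each pass executes the branch `iterations` times: taken on all but the
   last execution, then not taken once when the loop exits. */
int one_bit_mispredictions(int iterations, int passes)
{
    bool last_taken = false;   /* assumed initial state: predict not taken */
    int mispredictions = 0;

    for (int p = 0; p < passes; p++) {
        for (int i = 0; i < iterations; i++) {
            bool taken = (i < iterations - 1);
            if (taken != last_taken)
                mispredictions++;
            last_taken = taken;   /* 1-bit update: remember the last outcome */
        }
    }
    return mispredictions;
}
```

Under these assumptions each pass costs two mispredictions, so one_bit_mispredictions(10, 100) is 200 out of 1000 executions (80% accuracy for branch-1) and one_bit_mispredictions(20, 100) is 200 out of 2000 (90% accuracy for branch-2).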
2-Bit Prediction
• For each branch, maintain a 2-bit saturating counter:
  – if the branch is taken: counter = min(3,counter+1)
  – if the branch is not taken: counter = max(0,counter-1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)
• Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
• Can be easily extended to N bits (in most processors, N=2)
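The counter rules above can be sketched directly in C. The type and function names are illustrative, and the starting value of 0 in the loop experiment is an assumption; the loop model is a branch that is taken on every iteration except the last.

```c
#include <stdbool.h>

typedef unsigned char counter2_t;   /* holds a value in 0..3 */

/* Predict taken when the counter is in the upper half of its range. */
bool counter2_predict(counter2_t c)
{
    return c >= 2;
}

/* Saturating update: min(3, c+1) on taken, max(0, c-1) on not taken. */
counter2_t counter2_update(counter2_t c, bool taken)
{
    if (taken)
        return (counter2_t)(c < 3 ? c + 1 : 3);
    return (counter2_t)(c > 0 ? c - 1 : 0);
}

/* Mispredictions of one 2-bit counter on a loop-closing branch that is
   taken iterations-1 times and then not taken, repeated `passes` times. */
int two_bit_mispredictions(int iterations, int passes)
{
    counter2_t c = 0;   /* assumed initial state: strongly not taken */
    int mispredictions = 0;

    for (int p = 0; p < passes; p++) {
        for (int i = 0; i < iterations; i++) {
            bool taken = (i < iterations - 1);
            if (counter2_predict(c) != taken)
                mispredictions++;
            c = counter2_update(c, taken);
        }
    }
    return mispredictions;
}
```

For a 10-iteration loop run 100 times, this model mispredicts twice while warming up and then once per pass at the loop exit, 102 in total, roughly half of what a 1-bit entry would incur on the same pattern.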
Dynamic Branch Prediction
• Solution: a 2-bit scheme that changes the prediction only after two mispredictions in a row
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
[State diagram: two Predict Taken states and two Predict Not Taken states, connected by T and NT transitions]
Bimodal Predictor
14 bits of the branch PC index a table of 16K entries of 2-bit saturating counters
BHT Accuracy
• Mispredict because either:
  – Wrong guess for that branch
  – Got the history of the wrong branch when indexing the table
• 4096-entry table:
[Figure with Integer and Floating Point panels; not reproduced]
Correlating Predictors
• Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch PC bits to index into one of 1024 counters); this captures the recent “common case” for each branch
• Can we take advantage of additional information?
  – If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case?
  – If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case?
• Hence, build correlating predictors
Local/Global Predictors
• Instead of maintaining a counter for each branch to capture the common case, maintain a counter for each branch and surrounding pattern
• If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
• If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor
Global Predictor
• A single register keeps track of recent history for all branches (e.g., 00110101)
• The history register and bits of the branch PC (labeled 8 bits and 6 bits in the diagram, 14 bits in total) together index a table of 16K entries of 2-bit saturating counters
• Also referred to as a two-level predictor
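A minimal sketch of such a global two-level predictor follows. The split of the 14 index bits is an assumption: it uses 8 branch PC bits and 6 global history bits, as in the tournament example later in these slides. The bit ordering inside the index, the dropped low 2 PC bits (4-byte instructions), and all names are also assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define G_HISTORY_BITS 6   /* assumed split: 6 history bits ... */
#define G_PC_BITS      8   /* ... plus 8 PC bits = 14-bit index */
#define G_TABLE_SIZE   (1u << (G_HISTORY_BITS + G_PC_BITS))   /* 16K */

typedef struct {
    uint8_t  counters[G_TABLE_SIZE];   /* 2-bit saturating counters, 0..3 */
    uint32_t history;                  /* newest outcome in bit 0 */
} global_predictor;

/* Concatenate PC bits (high) with global history bits (low). */
static uint32_t global_index(const global_predictor *gp, uint32_t pc)
{
    uint32_t pc_part   = (pc >> 2) & ((1u << G_PC_BITS) - 1);
    uint32_t hist_part = gp->history & ((1u << G_HISTORY_BITS) - 1);
    return (pc_part << G_HISTORY_BITS) | hist_part;
}

bool global_predict(const global_predictor *gp, uint32_t pc)
{
    return gp->counters[global_index(gp, pc)] >= 2;
}

/* Update the selected counter, then shift the outcome into the history. */
void global_update(global_predictor *gp, uint32_t pc, bool taken)
{
    uint8_t *c = &gp->counters[global_index(gp, pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    gp->history = ((gp->history << 1) | (taken ? 1u : 0u))
                  & ((1u << G_HISTORY_BITS) - 1);
}

/* Demo: one branch (hypothetical address) alternating taken / not taken. */
int global_alternating_mispredictions(int executions)
{
    global_predictor gp = { {0}, 0 };
    int mispredictions = 0;
    for (int k = 0; k < executions; k++) {
        bool taken = (k % 2 == 0);
        if (global_predict(&gp, 0x1000) != taken)
            mispredictions++;
        global_update(&gp, 0x1000, taken);
    }
    return mispredictions;
}
```

The demo illustrates why history helps: a strictly alternating branch defeats a single 2-bit counter about half the time, while here the history distinguishes the two phases and mispredictions stop after a short warm-up.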
Local Predictor
• Also a two-level predictor, but one that uses only local histories at the first level
• Use 6 bits of the branch PC to index into the local history table: a table of 64 entries of 14-bit histories, each for a single branch (e.g., 10110111011001)
• The 14-bit history indexes into the next level: a table of 16K entries of 2-bit saturating counters
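The two levels described on this slide can be sketched as follows, using the slide's sizes (64 histories of 14 bits, 16K counters). Dropping the low 2 PC bits, the zero initial state, and all names are assumptions of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define L_PC_BITS   6    /* 64 local history entries */
#define L_HIST_BITS 14   /* 14-bit histories -> 16K counters */

typedef struct {
    uint16_t histories[1u << L_PC_BITS];   /* level 1: per-branch history */
    uint8_t  counters[1u << L_HIST_BITS];  /* level 2: 2-bit counters */
} local_predictor;

static uint32_t local_slot(uint32_t pc)
{
    return (pc >> 2) & ((1u << L_PC_BITS) - 1);   /* assumes 4-byte instructions */
}

bool local_predict(const local_predictor *lp, uint32_t pc)
{
    uint16_t h = lp->histories[local_slot(pc)];
    return lp->counters[h] >= 2;
}

/* Update the counter chosen by the old history, then extend the history. */
void local_update(local_predictor *lp, uint32_t pc, bool taken)
{
    uint16_t *h = &lp->histories[local_slot(pc)];
    uint8_t  *c = &lp->counters[*h];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    *h = (uint16_t)(((*h << 1) | (taken ? 1 : 0)) & ((1u << L_HIST_BITS) - 1));
}

/* Demo: a loop-closing branch (hypothetical address), taken iterations-1
   times then not taken; counts mispredictions only after the warm-up passes. */
int local_loop_mispredictions(int iterations, int warmup_passes, int measured_passes)
{
    local_predictor lp = { {0}, {0} };
    int mispredictions = 0;
    for (int p = 0; p < warmup_passes + measured_passes; p++) {
        for (int i = 0; i < iterations; i++) {
            bool taken = (i < iterations - 1);
            if (p >= warmup_passes && local_predict(&lp, 0x2000) != taken)
                mispredictions++;
            local_update(&lp, 0x2000, taken);
        }
    }
    return mispredictions;
}
```

Because a 14-bit history covers a full 10-iteration loop, the history pattern just before the exit is unique, so after warm-up this model predicts even the loop exit correctly, which a single 2-bit counter cannot do.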
Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  – Thus, the old 2-bit BHT is a (0,2) predictor
• Global Branch History: m-bit shift register keeping the T/NT status of the last m branches
• Each entry in the table has 2^m n-bit predictors
Correlating Branches
• (2,2) predictor
  – Behavior of recent branches selects between four predictions of the next branch, updating just that prediction
[Diagram: the branch address selects an entry holding four 2-bit predictors; the 2-bit global branch history selects one of the four as the prediction]
Accuracy of Different Schemes
[Figure: bar chart of Frequency of Mispredictions (0% to 20%) for nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li. Series: 4,096 entries, 2 bits per entry; unlimited entries, 2 bits per entry; 1,024 entries, (2,2). Bar values are not recoverable from the extracted text.]
Tournament Predictors
• A local predictor might work well for some branches or programs, while a global predictor might work well for others
• Provide one of each and maintain another predictor to identify which predictor is best for each branch
[Diagram: the Branch PC indexes the tournament predictor, a table of 2-bit saturating counters, which drives a MUX that selects between the Local Predictor and the Global Predictor]
Global Predictor – Example
What is the total capacity of this branch predictor?
• A single register keeps track of recent history for all branches
• Table of 2-bit saturating counters
• Also referred to as a two-level predictor
[Diagram labels: 00110101, 10 bits, 4 bits, Branch PC]
Local Predictor – Example
What is the total capacity of this branch predictor?
• Use 8 bits of the branch PC to index into the local history table: a table of 8-bit histories, each for a single branch (e.g., 10110111)
• Table of 2-bit saturating counters
Example
Consider the following tournament branch predictor: Fourteen bits of the PC are used to index into a table of 3-bit saturating counters that predict whether we should use a local or global prediction. The global predictor concatenates 8 bits of branch PC and 6 bits of global history to index into 2-bit saturating counters. The local predictor uses 8 bits of branch PC to select an 8-bit local history that then indexes into a table of 2-bit saturating counters. What is the capacity of each structure in this branch predictor?
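One reading of this example is that every table has 2^(index bits) entries, so each structure's capacity is entries times bits per entry. The sketch below works that out; the function names are illustrative, and the 6-bit global history register itself is not counted.

```c
/* Storage, in bits, of a table with 2^index_bits entries of entry_bits each. */
unsigned long table_bits(unsigned index_bits, unsigned entry_bits)
{
    return (1ul << index_bits) * entry_bits;
}

/* Sum of the four tables in the tournament predictor of this example. */
unsigned long tournament_example_bits(void)
{
    unsigned long chooser       = table_bits(14, 3);     /* 16K x 3 bits */
    unsigned long global_ctrs   = table_bits(8 + 6, 2);  /* 16K x 2 bits */
    unsigned long local_history = table_bits(8, 8);      /* 256 x 8 bits */
    unsigned long local_ctrs    = table_bits(8, 2);      /* 256 x 2 bits */
    return chooser + global_ctrs + local_history + local_ctrs;
}
```

Under this reading the chooser holds 48 Kb, the global counters 32 Kb, the local history table 2 Kb, and the local counters 512 bits, for a total of 84,480 bits.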
Tournament Predictors
• Multilevel branch predictor
• Use an n-bit saturating counter to choose between predictors
• The usual choice is between global and local predictors
Tournament Predictors
Tournament predictor using, say, 4K 2-bit counters indexed by local branch address. Chooses between:
• Global predictor
  – 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  – Each entry is a standard 2-bit predictor
• Local predictor
  – Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  – The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
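The selection step can be sketched with 4K 2-bit chooser counters. The update policy shown, which moves a counter toward whichever component was correct and only when the two components disagree, is a common choice but is an assumption here; the slides do not spell it out. The encoding (counter >= 2 selects global), the index function, and all names are also assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHOOSER_ENTRIES 4096   /* 4K 2-bit counters, as on the slide */

typedef struct {
    uint8_t choice[CHOOSER_ENTRIES];   /* 0..1 favor local, 2..3 favor global */
} chooser;

static uint32_t chooser_index(uint32_t pc)
{
    return (pc >> 2) % CHOOSER_ENTRIES;   /* assumes 4-byte instructions */
}

/* Final prediction: pick the component the chooser currently favors. */
bool tournament_predict(const chooser *ch, uint32_t pc,
                        bool local_pred, bool global_pred)
{
    return ch->choice[chooser_index(pc)] >= 2 ? global_pred : local_pred;
}

/* Train the chooser only when the components disagree. */
void tournament_update(chooser *ch, uint32_t pc,
                       bool local_pred, bool global_pred, bool taken)
{
    uint8_t *c = &ch->choice[chooser_index(pc)];
    if (local_pred == global_pred)
        return;                         /* no evidence either way */
    if (global_pred == taken) { if (*c < 3) (*c)++; }
    else                      { if (*c > 0) (*c)--; }
}

/* Demo: global is right and local is wrong n times in a row for one branch
   (hypothetical address); returns true if global is selected afterwards. */
bool selects_global_after(int n)
{
    chooser ch = { {0} };
    for (int k = 0; k < n; k++)
        tournament_update(&ch, 0x3000, false, true, true);
    /* With local_pred = false and global_pred = true, the result equals
       true exactly when the chooser selects the global component. */
    return tournament_predict(&ch, 0x3000, false, true);
}
```

Starting from a counter of 0, this model keeps using the local component after one disagreement won by the global component and switches after two, the same hysteresis as the 2-bit direction counters.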
Comparing Predictors (Fig. 2.8)
• The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
  – Particularly crucial for integer benchmarks
  – A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
• SPECint: 6% misprediction rate per branch (19% of INT instructions are branches)
• SPECfp: 2% misprediction rate per branch (5% of FP instructions are branches)
[Figure: chart grouped by SPECint2000 and SPECfp2000; not reproduced]
Branch Target Prediction
• In addition to predicting the branch direction, we must also predict the branch target address
• The branch PC indexes into a predictor table; indirect branches might be problematic
• Most common indirect branch: return from a procedure, which can be easily handled with a stack of return addresses
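The stack of return addresses mentioned above can be sketched as a small circular buffer. The depth of 16, the overwrite-oldest overflow behavior, the 0 return value for an empty stack, and all names are assumptions of this sketch, not details from the slides.

```c
#include <stdint.h>

#define RAS_DEPTH 16   /* hypothetical depth */

typedef struct {
    uint32_t entries[RAS_DEPTH];
    unsigned top;     /* next free slot */
    unsigned count;   /* valid entries, at most RAS_DEPTH */
} return_stack;

/* On a call: remember the address of the instruction after the call. */
void ras_push(return_stack *ras, uint32_t return_address)
{
    ras->entries[ras->top] = return_address;
    ras->top = (ras->top + 1) % RAS_DEPTH;   /* overflow overwrites the oldest */
    if (ras->count < RAS_DEPTH)
        ras->count++;
}

/* On a return: the predicted target, or 0 if no prediction is available. */
uint32_t ras_pop(return_stack *ras)
{
    if (ras->count == 0)
        return 0;
    ras->top = (ras->top + RAS_DEPTH - 1) % RAS_DEPTH;
    ras->count--;
    return ras->entries[ras->top];
}

/* Demo: two nested calls with hypothetical return addresses; returns 1 if
   both returns are predicted in last-in, first-out order. */
int ras_nested_calls_ok(void)
{
    return_stack ras = { {0}, 0, 0 };
    ras_push(&ras, 0x104);   /* outer call */
    ras_push(&ras, 0x208);   /* inner call */
    return ras_pop(&ras) == 0x208 && ras_pop(&ras) == 0x104;
}
```

Returns are a good fit for a stack because calls and returns nest: the most recent call is the first to return, so the top entry is the predicted target regardless of how many call sites reach the same procedure.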
Summary
• When comparing branch predictors, ensure that they are of the same “size”
• Correlating predictors predict branch direction based on the behavior of neighboring branches
• Tournament predictors select between global and local predictors
• Integer benchmarks benefit greatly from global and correlating predictors
• Next class: BTB, dynamic scheduling of instructions