Explore the limits of Instruction Level Parallelism (ILP) with advanced branch prediction methods, from 1-bit prediction to correlating predictors, local/global predictors, and tournament predictors. Understand the challenges and advantages of different prediction schemes.
Lecture 8: Instruction Fetch, ILP Limits • Today: advanced branch prediction, limits of ILP • (Sections 3.4-3.5, 3.8-3.14)
1-Bit Prediction • For each branch, keep track of what happened last time and use that outcome as the prediction • What are the prediction accuracies for branches 1 and 2 below?

  while (1) {
    for (i=0;i<10;i++) {    // branch-1
      …
    }
    for (j=0;j<20;j++) {    // branch-2
      …
    }
  }
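To make the exercise concrete, here is a minimal sketch (not from the lecture) of a 1-bit predictor driven by a loop shaped like branch-1, modeling one branch execution per iteration. In steady state it mispredicts twice per pass (the loop exit and the re-entry on the next pass), so branch-1 is right about 80% of the time and, by the same argument, branch-2 about 90%. Names such as predict_1bit are illustrative.

    #include <stdbool.h>
    #include <stdio.h>

    /* One 1-bit predictor entry: remembers the last outcome of the branch. */
    typedef struct { bool last_taken; } OneBitEntry;

    /* Predict: reuse whatever the branch did last time. */
    static bool predict_1bit(const OneBitEntry *e) { return e->last_taken; }

    /* Update: overwrite the stored bit with the actual outcome. */
    static void update_1bit(OneBitEntry *e, bool taken) { e->last_taken = taken; }

    int main(void) {
        /* Model branch-1: per pass, taken 9 times and not taken once on exit.
           The 1-bit scheme mispredicts the exit and the next pass's first
           iteration, i.e. 2 mispredictions per 10 executions. */
        OneBitEntry b1 = { false };
        int mispredicts = 0, total = 0;
        for (int pass = 0; pass < 100; pass++) {
            for (int i = 0; i < 10; i++) {
                bool taken = (i < 9);           /* loop branch taken except on exit */
                if (predict_1bit(&b1) != taken) mispredicts++;
                update_1bit(&b1, taken);
                total++;
            }
        }
        printf("branch-1 accuracy: %.1f%%\n", 100.0 * (total - mispredicts) / total);
        return 0;
    }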
2-Bit Prediction • For each branch, maintain a 2-bit saturating counter: • if the branch is taken: counter = min(3,counter+1) • if the branch is not taken: counter = max(0,counter-1) • If (counter >= 2), predict taken, else predict not taken • Advantage: a few atypical branches will not influence the • prediction (a better measure of “the common case”) • Especially useful when multiple branches share the same • counter (some bits of the branch PC are used to index • into the branch predictor) • Can be easily extended to N-bits (in most processors, N=2)
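A minimal sketch of the 2-bit scheme above, with the counter table indexed by low branch-PC bits as the slide suggests; the 1024-entry table size and the `>> 2` PC alignment are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_ENTRIES 1024                   /* one counter per entry, indexed by 10 PC bits */
    uint8_t counters[PRED_ENTRIES];             /* each holds a value 0..3 */

    /* Predict taken when the counter is in one of the two "taken" states (2 or 3). */
    bool predict_2bit(uint32_t branch_pc) {
        return counters[(branch_pc >> 2) & (PRED_ENTRIES - 1)] >= 2;
    }

    /* Saturating update: counter = min(3, c+1) if taken, max(0, c-1) if not taken. */
    void update_2bit(uint32_t branch_pc, bool taken) {
        uint8_t *c = &counters[(branch_pc >> 2) & (PRED_ENTRIES - 1)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }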
Correlating Predictors • Basic branch prediction: maintain a 2-bit saturating • counter for each entry (or use 10 branch PC bits to index • into one of 1024 counters) – captures the recent • “common case” for each branch • Can we take advantage of additional information? • If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case? • If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case? Hence, build correlating predictors
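One way to realize the idea, sketched as a hypothetical (2,2) correlating predictor: two bits of global history select one of four 2-bit counters within each branch entry. The table size and indexing are assumptions for illustration, not a specific processor's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define CORR_ENTRIES 1024
    uint8_t corr_ctr[CORR_ENTRIES][4];          /* [branch index][2-bit global history pattern] */
    uint8_t corr_hist;                          /* outcomes of the two most recent branches */

    bool corr_predict(uint32_t pc) {
        return corr_ctr[(pc >> 2) & (CORR_ENTRIES - 1)][corr_hist & 3] >= 2;
    }

    void corr_update(uint32_t pc, bool taken) {
        uint8_t *c = &corr_ctr[(pc >> 2) & (CORR_ENTRIES - 1)][corr_hist & 3];
        if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
        corr_hist = (uint8_t)(((corr_hist << 1) | (taken ? 1u : 0u)) & 3);   /* shift in newest outcome */
    }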
Local/Global Predictors • Instead of maintaining a single counter per branch to capture its common case, maintain a counter for each combination of branch and surrounding history pattern • If the surrounding pattern consists of the recent outcomes of the branch being predicted itself, the predictor is referred to as a local predictor • If the surrounding pattern includes the outcomes of neighboring branches, the predictor is referred to as a global predictor
Global Predictor • A single history register tracks the recent outcomes of all branches • The 8 most recent history bits are concatenated with 6 bits of the branch PC to index a table of 16K entries of 2-bit saturating counters • Also referred to as a two-level predictor • [Figure: 8-bit global history (e.g., 00110101) plus 6 branch-PC bits forming the index into the 16K-entry counter table]
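A sketch of the lookup just described, assuming the 8-bit global history is simply concatenated with 6 branch-PC bits to form the 14-bit index into the 16K-entry table; function names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define GP_ENTRIES (1 << 14)                /* 16K 2-bit counters */
    uint8_t gp_table[GP_ENTRIES];
    uint8_t gp_history;                         /* outcomes of the 8 most recent branches */

    unsigned gp_index(uint32_t pc) {
        return ((unsigned)gp_history << 6) | ((pc >> 2) & 0x3F);   /* 8 history bits + 6 PC bits */
    }

    bool gp_predict(uint32_t pc) { return gp_table[gp_index(pc)] >= 2; }

    void gp_update(uint32_t pc, bool taken) {
        uint8_t *c = &gp_table[gp_index(pc)];
        if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
        gp_history = (uint8_t)((gp_history << 1) | (taken ? 1u : 0u));
    }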
Local Predictor • Also a two-level predictor, but it uses only local (per-branch) histories at the first level • 6 bits of the branch PC index into a table of 64 entries, each holding a 14-bit history of that branch's recent outcomes • The 14-bit history (e.g., 10110111011001) then indexes into a second-level table of 16K entries of 2-bit saturating counters • [Figure: branch PC → 64-entry local history table → 16K-entry counter table]
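A corresponding sketch of the local predictor's two-level lookup, using the sizes on the slide (64 histories of 14 bits each, 16K counters); the update ordering and PC alignment are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define LP_HISTORIES 64                     /* level 1: per-branch 14-bit histories */
    #define LP_COUNTERS  (1 << 14)              /* level 2: 16K 2-bit counters */
    uint16_t lp_hist[LP_HISTORIES];
    uint8_t  lp_ctr[LP_COUNTERS];

    unsigned lp_hidx(uint32_t pc) { return (pc >> 2) & (LP_HISTORIES - 1); }

    bool lp_predict(uint32_t pc) {
        return lp_ctr[lp_hist[lp_hidx(pc)]] >= 2;    /* the stored history already fits in 14 bits */
    }

    void lp_update(uint32_t pc, bool taken) {
        uint16_t *h = &lp_hist[lp_hidx(pc)];
        uint8_t  *c = &lp_ctr[*h];
        if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
        *h = (uint16_t)(((*h << 1) | (taken ? 1u : 0u)) & (LP_COUNTERS - 1));   /* keep 14 bits */
    }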
Tournament Predictors • A local predictor might work well for some branches or programs, while a global predictor might work well for others • Provide one of each and maintain another predictor to identify which predictor is best for each branch • Example: Alpha 21264 – local predictor with 1K entries in level-1 and 1K entries in level-2; global predictor with 4K entries indexed by a 12-bit global history; tournament (chooser) table of 4K 2-bit saturating counters; a mux picks the local or global prediction for each branch PC • Total capacity: ?
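A sketch of the chooser ("tournament") mechanism, assuming component predictors such as the local and global sketches above exist elsewhere. Indexing the chooser by branch PC and the exact update rule are simplifying assumptions for illustration, not the 21264's precise design.

    #include <stdbool.h>
    #include <stdint.h>

    /* Component predictors assumed to be defined elsewhere. */
    bool lp_predict(uint32_t pc);
    bool gp_predict(uint32_t pc);

    #define CHOOSER_ENTRIES 4096
    uint8_t chooser[CHOOSER_ENTRIES];           /* 2-bit counters: >= 2 means "trust the global predictor" */

    bool tournament_predict(uint32_t pc) {
        unsigned i = (pc >> 2) & (CHOOSER_ENTRIES - 1);
        return (chooser[i] >= 2) ? gp_predict(pc) : lp_predict(pc);
    }

    void tournament_update(uint32_t pc, bool taken) {
        unsigned i = (pc >> 2) & (CHOOSER_ENTRIES - 1);
        bool local_ok  = (lp_predict(pc) == taken);   /* assumes components not yet updated this cycle */
        bool global_ok = (gp_predict(pc) == taken);
        /* Move the chooser only when exactly one component was right. */
        if (global_ok && !local_ok) { if (chooser[i] < 3) chooser[i]++; }
        if (local_ok && !global_ok) { if (chooser[i] > 0) chooser[i]--; }
        /* The component predictors themselves would also be updated here. */
    }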
Predictor Comparison • Note that predictors must be compared at equal capacity • The sizes of each level have to be selected to optimize prediction accuracy • Influencing factors: the degree of interference between branches, and whether the program is more likely to benefit from local or global history
Branch Target Prediction • In addition to predicting the branch direction, we must • also predict the branch target address • Branch PC indexes into a predictor table; indirect branches • might be problematic • Most common indirect branch: return from a procedure – • can be easily handled with a stack of return addresses
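A minimal return-address-stack sketch for the last bullet; the depth and the wrap-around overflow policy are illustrative choices.

    #include <stdint.h>

    #define RAS_DEPTH 16
    uint32_t ras[RAS_DEPTH];
    unsigned ras_top;                           /* count of pushes; top slot is (ras_top - 1) % RAS_DEPTH */

    /* On a call: push the address of the instruction after the call. */
    void ras_push(uint32_t return_addr) {
        ras[ras_top % RAS_DEPTH] = return_addr; /* on overflow, the oldest entry is overwritten */
        ras_top++;
    }

    /* On a return: pop the top entry and use it as the predicted target. */
    uint32_t ras_pop(void) {
        if (ras_top == 0) return 0;             /* empty stack: no prediction available */
        ras_top--;
        return ras[ras_top % RAS_DEPTH];
    }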
Multiple Instruction Issue • The out-of-order processor implementation can be easily • extended to have multiple instructions in each pipeline stage • Increased complexity (lower clock speed!): • more reads and writes per cycle to register map table • more read and write ports in issue queue • more tags being broadcast to issue queue every cycle • higher complexity for bypassing/forwarding among FUs • more register read and write ports • more ports in the LSQ • more ports in the data cache • more ports in the ROB
ILP Limits • The perfect processor: • Infinite registers (no WAW or WAR hazards) • Perfect branch direction and target prediction • Perfect memory disambiguation • Perfect instruction and data caches • Single-cycle latencies for all ALUs • Infinite ROB size (window of in-flight instructions) • No limit on number of instructions in each pipeline stage • The last instruction may be scheduled in the first cycle • The only constraint is a true dependence (register or • memory RAW hazards) (with value prediction, how would • the perfect processor behave?)
Effect of Window Size • Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc. • We will use a window size of 2K instrs and a max issue rate of 64 for subsequent experiments
Imperfect Branch Prediction • Note: there is no branch mispredict penalty; a mispredict only restricts the effective window size • Assume a large tournament predictor for subsequent experiments
Effect of Name Dependences • More registers → fewer WAR and WAW constraints (usually, register file size goes hand in hand with in-flight window size) • 256 int and fp registers for subsequent experiments
Limits of ILP – Summary • Int programs are more limited by branches, memory • disambiguation, etc., while FP programs are limited most • by window size • We have not yet examined the effect of branch mispredict • penalty and imperfect caching • All of the studied factors have relatively comparable • influence on CPI: window/register size, branch prediction, • memory disambiguation • Can we do better? Yes: better compilers, value prediction, • memory dependence prediction, multi-path execution
Pentium III (P6 Microarchitecture) Case Study • 14-stage pipeline: 8 stages for fetch/decode/dispatch, 3+ for out-of-order execution, 3 for commit → branch mispredict penalty of 10-15 cycles • Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations • Each x86 instruction gets converted into RISC-like micro-ops – on average, one CISC instr → 1.37 micro-ops • Three instructions in each pipeline stage → 3 instructions can simultaneously leave the pipeline → ideal CPmI (cycles per micro-op) = 0.33 → ideal CPI = 0.33 × 1.37 ≈ 0.45
Branch Prediction • 512-entry global two-level branch predictor and 512-entry BTB → 20% combined mispredict rate • For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!) • Mispredict penalty is 10-15 cycles
Where is Time Lost? • Branch mispredict stalls • Cache miss stalls (dominated by L1D misses) • Instruction fetch stalls (these happen often because subsequent stages are stalled, and occasionally because of an I-cache miss)
CPI Performance • Owing to stalls, the processor can fall behind (no instructions are committed for 55% of all cycles), but it then recovers with multi-instruction commits (31% of all cycles) → average CPI = 1.15 (Int) and 2.0 (FP) • Overlap of different stalls → CPI is not the sum of the individual stall contributions • IPC is also an attractive metric