CENG 450 Computer Systems and Architecture, Lecture 11. Amirali Baniasadi, amirali@ece.uvic.ca
This Lecture • Branch Prediction • Multiple Issue
Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Buffer (BTB)
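To make the BTB idea concrete, here is a minimal sketch of a direct-mapped branch target buffer in C. The entry layout, table size, and function names are illustrative assumptions, not a description of any particular processor.

#include <stdint.h>

#define BTB_ENTRIES 512                 /* hypothetical size, power of two */

typedef struct {
    uint32_t tag;                       /* upper PC bits identifying the branch */
    uint32_t target;                    /* predicted target address */
    int      valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Look up a fetch PC; returns 1 and sets *target on a hit. */
int btb_lookup(uint32_t pc, uint32_t *target)
{
    uint32_t index = (pc >> 2) & (BTB_ENTRIES - 1);   /* drop byte offset, take low bits */
    if (btb[index].valid && btb[index].tag == (pc >> 2)) {
        *target = btb[index].target;
        return 1;                       /* redirect fetch to the predicted target */
    }
    return 0;                           /* miss: fall through to PC + 4 */
}

/* Install or refresh an entry when a taken branch resolves. */
void btb_update(uint32_t pc, uint32_t target)
{
    uint32_t index = (pc >> 2) & (BTB_ENTRIES - 1);
    btb[index].tag    = pc >> 2;
    btb[index].target = target;
    btb[index].valid  = 1;
}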
Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve
Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program
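As a tiny illustration of the BTFNT rule (a sketch, not actual hardware or compiler output), the sign of the branch displacement decides the static prediction:

/* Backwards Taken, Forward Not Taken: loop-closing branches jump backwards,
   so a negative displacement is predicted taken. */
int btfnt_predict(int branch_offset)
{
    return branch_offset < 0;   /* 1 = predict taken, 0 = predict not taken */
}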
What happens when a branch is mispredicted? • No speculative state may commit • Squash the wrong-path instructions in the pipeline • Speculative stores must not be allowed to update memory • Stores that would not have executed on the correct path cannot be allowed to commit • Even with good branch predictors, more than half of the fetched instructions are squashed
A Generic Branch Predictor • At fetch, a function f(PC, x) predicts T or NT, producing the predicted instruction stream • When the branch resolves at execution, the actual outcome is known and the actual stream continues in execution order • What is f(PC, x)? • x can be any relevant information gathered so far • For the static schemes above, x was empty
Bimodal Branch Predictors • Dynamically store information about the branch behaviour • Branches tend to behave in a fixed way • Branches tend to behave in the same way across program execution • Index a Pattern History Table using the branch address • 1 bit: branch behaves as it did last time • Saturating 2 bit counter: branch behaves as it usually does
Saturating-Counter Predictors • Consider a strongly biased branch with an infrequent outcome • TTTTTTTTNTTTTTTTTNTTTT • Last-outcome will mispredict twice per infrequent outcome encounter: • TTTTTTTTNTTTTTTTTNTTTT • Idea: remember the most frequent case • Saturating counter: hysteresis • Often called a bi-modal predictor • Captures temporal bias
Bimodal Prediction • Table of 2-bit saturating counters • Predict the most common direction • Advantages: simple, cheap, “good” accuracy • Bimodal will mispredict once per infrequent outcome encounter: TTTTTTTTNTTTTTTTTNTTTT
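A minimal sketch of a bimodal predictor built from 2-bit saturating counters; the table size, indexing, and function names are illustrative assumptions. The hysteresis is what limits the damage of the infrequent outcome to a single misprediction.

#include <stdint.h>

#define PHT_ENTRIES 4096                 /* hypothetical table size, power of two */

/* 2-bit counters: 0,1 predict not taken; 2,3 predict taken. */
static uint8_t pht[PHT_ENTRIES];

static uint32_t pht_index(uint32_t pc)
{
    return (pc >> 2) & (PHT_ENTRIES - 1);
}

int bimodal_predict(uint32_t pc)
{
    return pht[pht_index(pc)] >= 2;      /* predict the usual direction */
}

void bimodal_update(uint32_t pc, int taken)
{
    uint8_t *ctr = &pht[pht_index(pc)];
    if (taken  && *ctr < 3) (*ctr)++;    /* saturate at 3 (strongly taken) */
    if (!taken && *ctr > 0) (*ctr)--;    /* saturate at 0 (strongly not taken) */
}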
Correlating Predictors • From the program's perspective, different branches may be correlated • if (aa == 2) aa = 0; • if (bb == 2) bb = 0; • if (aa != bb) then … • If the first two branches are both taken, aa and bb are both 0, so the third branch is not taken • Can be viewed as a pattern detector • Instead of keeping aggregate history information (i.e., the most frequent outcome) • Keep exact history information: the pattern of the n most recent outcomes • Example: • BHR: n most recent branch outcomes • Use PC and BHR (xor?) to access the prediction table
Pattern-based Prediction • Nested loops: for i = 0 to N, for j = 0 to 3, … • Branch outcome stream for the j-loop branch: • 11101110111011101110 • Patterns: • 111 -> 0 • 110 -> 1 • 101 -> 1 • 011 -> 1 • 100% accuracy • Learning time: 4 instances • Table index: (PC, 3-bit history)
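A sketch of the same pattern idea for a single branch: a 3-bit local history selects one of eight 2-bit counters. Sizes and names are illustrative assumptions.

#include <stdint.h>

#define HIST_BITS 3
#define PATTERNS  (1 << HIST_BITS)

static uint8_t local_history;            /* last 3 outcomes of this one branch */
static uint8_t pattern_table[PATTERNS];  /* one 2-bit counter per 3-bit pattern */

int pattern_predict(void)
{
    return pattern_table[local_history] >= 2;
}

void pattern_update(int taken)
{
    uint8_t *ctr = &pattern_table[local_history];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    /* shift the new outcome into the history register */
    local_history = ((local_history << 1) | (taken ? 1 : 0)) & (PATTERNS - 1);
}

Once trained on the 1110 stream, pattern 111 maps to a not-taken counter and 110/101/011 to taken counters, matching the 100% accuracy claim above.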
Two-level Branch Predictors • A branch outcome depends on the outcomes of previous branches • First level: Branch History Registers (BHR) • Global history / Branch correlation: past executions of all branches • Self history / Private history: past executions of the same branch • Second level: Pattern History Table (PHT) • Use first level information to index a table • Possibly XOR with the branch address • PHT: Usually saturating 2 bit counters • Also private, shared or global
Gshare Predictor (McFarling) • The global BHR and the PC are combined by a function f to index the branch history table of 2-bit counters, which supplies the prediction • PC and BHR can be • concatenated • completely overlapped • partially overlapped • xored, etc. • How deep should the BHR be? • Really depends on the program • Deeper increases learning time • May increase the quality of information
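A sketch of the gshare indexing scheme described above, XORing the global history with low PC bits; the history depth and table size are assumptions for illustration.

#include <stdint.h>

#define GSHARE_BITS    12
#define GSHARE_ENTRIES (1 << GSHARE_BITS)

static uint16_t ghr;                         /* global branch history register */
static uint8_t  gshare_pht[GSHARE_ENTRIES];  /* 2-bit saturating counters */

static uint32_t gshare_index(uint32_t pc)
{
    /* XOR history with PC bits to spread different branches over the table */
    return ((pc >> 2) ^ ghr) & (GSHARE_ENTRIES - 1);
}

int gshare_predict(uint32_t pc)
{
    return gshare_pht[gshare_index(pc)] >= 2;
}

void gshare_update(uint32_t pc, int taken)
{
    uint8_t *ctr = &gshare_pht[gshare_index(pc)];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    ghr = ((ghr << 1) | (taken ? 1 : 0)) & (GSHARE_ENTRIES - 1);
}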
Hybrid Prediction • Combining branch predictors: two or more predictor components combined (e.g., gshare and bimodal) • Use two different branch predictors and access both in parallel • A third table (the selector) determines which prediction to use • Different branches benefit from different types of history
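A sketch of the selector (chooser) table for a hybrid gshare/bimodal predictor, reusing the component predictors sketched earlier; the chooser training policy shown is one common choice, used here only for illustration.

#include <stdint.h>

#define SEL_ENTRIES 4096

/* 2-bit chooser per entry: >= 2 means "trust gshare", < 2 means "trust bimodal". */
static uint8_t selector[SEL_ENTRIES];

int hybrid_predict(uint32_t pc, int gshare_pred, int bimodal_pred)
{
    uint32_t i = (pc >> 2) & (SEL_ENTRIES - 1);
    return (selector[i] >= 2) ? gshare_pred : bimodal_pred;
}

/* Train the chooser toward whichever component was right when they disagree. */
void hybrid_update(uint32_t pc, int taken, int gshare_pred, int bimodal_pred)
{
    uint32_t i = (pc >> 2) & (SEL_ENTRIES - 1);
    int g_ok = (gshare_pred == taken), b_ok = (bimodal_pred == taken);
    if (g_ok && !b_ok && selector[i] < 3) selector[i]++;
    if (b_ok && !g_ok && selector[i] > 0) selector[i]--;
}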
Issues Affecting Accurate Branch Prediction • Aliasing • More than one branch may use the same BHT/PHT entry • Constructive • Prediction that would have been incorrect, predicted correctly • Destructive • Prediction that would have been correct, predicted incorrectly • Neutral • No change in the accuracy
More Issues • Training time • Need to see enough branches to uncover pattern • Need enough time to reach steady state • “Wrong” history • Incorrect type of history for the branch • Stale state • Predictor is updated after information is needed • Operating system context switches • More aliasing caused by branches in different programs
Performance Metrics • Misprediction rate: mispredicted branches per executed branch • Unfortunately the most commonly reported metric • Instructions per mispredicted branch • Gives a better idea of the program behaviour, since branches are not evenly spaced
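The two metrics are related, as the small calculation below shows (all counts are made-up, hypothetical values):

#include <stdio.h>

int main(void)
{
    long instructions = 100000000;
    long branches     =  20000000;   /* roughly 1 branch every 5 instructions */
    long mispredicted =   1000000;

    double mispredict_rate = (double)mispredicted / branches;       /* per branch */
    double insts_per_miss  = (double)instructions / mispredicted;   /* per misprediction */

    printf("misprediction rate: %.1f%%\n", mispredict_rate * 100.0);        /* 5.0% */
    printf("instructions per mispredicted branch: %.0f\n", insts_per_miss); /* 100  */
    return 0;
}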
Upper Limit to ILP: Ideal Machine • Amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies • IPC (instructions that could theoretically be issued per cycle): FP: 75 - 150, Integer: 18 - 60
Impact of Realistic Branch Prediction • Limiting the type of branch prediction lowers the achievable IPC: FP: 15 - 45, Integer: 6 - 12
Pentium III • Dynamic branch prediction • 512-entry BTB predicts direction and target, 4-bit history used with PC to derive direction • Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles
AMD Athlon K7 • 10-stage integer, 15-stage fp pipeline, predictor accessed in fetch • 2K-entry bimodal, 2K-entry BTB • Branch Penalties: • Mispredict penalty: at least 10 cycles
Multiple Issue • Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. • Superscalar processors • Very Long Instruction Word (VLIW) processors
1990’s: Superscalar Processors • Bottleneck: CPI >= 1 • Limit on scalar performance (single instruction issue) • Hazards • Superpipelining? Diminishing returns (hazards + overhead) • How can we make the CPI = 0.5? • Multiple instructions in every pipeline stage (superscalar), two per cycle:
Cycle:    1    2    3    4    5    6    7
Inst0:   IF   ID   EX  MEM   WB
Inst1:   IF   ID   EX  MEM   WB
Inst2:        IF   ID   EX  MEM   WB
Inst3:        IF   ID   EX  MEM   WB
Inst4:             IF   ID   EX  MEM   WB
Inst5:             IF   ID   EX  MEM   WB
Superscalar Vs. VLIW • Religious debate, similar to RISC vs. CISC • Wisconsin + Michigan (superscalar) vs. Illinois (VLIW) • Q. Who can schedule code better, hardware or software?
Hardware Scheduling • High branch prediction accuracy • Dynamic information on latencies (cache misses) • Dynamic information on memory dependences • Easy to speculate (& recover from mis-speculation) • Works for generic, non-loop, irregular code • Ex: databases, desktop applications, compilers • Limited reorder buffer size limits “lookahead” • High cost/complexity • Slow clock
Software Scheduling • Large scheduling scope (full program), large “lookahead” • Can handle very long latencies • Simple hardware with fast clock • Only works well for “regular” codes (scientific, FORTRAN) • Low branch prediction accuracy • Can improve by profiling • No information on latencies like cache misses • Can improve by profiling • Pain to speculate and recover from mis-speculation • Can improve with hardware support
Superscalar Processors • Pioneer: IBM (America => RIOS, RS/6000, Power-1) • Superscalar instruction combinations • 1 ALU or memory or branch + 1 FP (RS/6000) • Any 1 + 1 ALU (Pentium) • Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II) • Impact of superscalar • More opportunity for hazards (why?) • More performance loss due to hazards (why?)
Superscalar Processors • Issues a varying number of instructions per clock • Scheduling: static (by the compiler) or dynamic (by the hardware) • Superscalar processors issue a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo) • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
Elements of Advanced Superscalars • High performance instruction fetching • Good dynamic branch and jump prediction • Multiple instructions per cycle, multiple branches per cycle? • Scheduling and hazard elimination • Dynamic scheduling • Not necessarily: Alpha 21064 & Pentium were statically scheduled • Register renaming to eliminate WAR and WAW • Parallel functional units, paths/buses/multiple register ports • High performance memory systems • Speculative execution
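A minimal sketch of the rename-map idea that eliminates WAR and WAW hazards; the sizes and the bump allocator are simplifications for illustration (real hardware keeps a free list and recycles physical registers at commit).

#define ARCH_REGS 32
#define PHYS_REGS 128

static int rename_map[ARCH_REGS];   /* architectural -> physical register */
static int next_free;               /* simplified bump allocator          */

void rename_init(void)
{
    for (int r = 0; r < ARCH_REGS; r++)
        rename_map[r] = r;          /* identity mapping at start */
    next_free = ARCH_REGS;
}

/* Rename one instruction "rd = rs1 op rs2": sources read the current map,
   the destination gets a fresh physical register, so a later writer can
   never overwrite a value that an older, not-yet-issued reader still needs. */
void rename(int rd, int rs1, int rs2, int *pd, int *ps1, int *ps2)
{
    *ps1 = rename_map[rs1];
    *ps2 = rename_map[rs2];
    *pd  = next_free++;             /* real hardware recycles freed registers */
    rename_map[rd] = *pd;
}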
SS + DS + Speculation • Superscalar + Dynamic scheduling + Speculation Three great tastes that taste great together • CPI >= 1? • Overcome with superscalar • Superscalar increases hazards • Overcome with dynamic scheduling • RAW dependences still a problem? • Overcome with a large window • Branches a problem for filling large window? • Overcome with speculation
The Big Picture • Static program -> Fetch & branch predict -> issue -> execute -> reorder & commit
Readings • New paper on branch prediction is online. READ. • The material will be used in the THIRD quiz.