ELEC 669 Low Power Design Techniques, Lecture 2. Amirali Baniasadi, amirali@ece.uvic.ca
How to write a review? • Think Critically. • What if? • Next Step? • Any other applications?
Branches • Instructions which can alter the flow of instruction execution in a program
Motivation • Pipelined execution • A new instruction enters the pipeline every cycle... • ...but still takes several cycles to execute • Control flow changes • Two possible paths after a branch is fetched • Introduces pipeline "bubbles" • Branch delay slots • Prediction offers a chance to avoid these bubbles
[Figure: pipeline diagram (F, D, A, M, W stages); a branch is fetched but takes N cycles to execute, creating a pipeline bubble]
Techniques for handling branches • Stalling • Branch delay slots • Relies on programmer/compiler to fill • Depends on being able to find suitable instructions • Ties resolution delay to a particular pipeline
Why aren’t these techniques acceptable? • Branches are frequent: 15-25% of instructions • Today’s pipelines are deeper and wider • Higher performance penalty for stalling • Misprediction penalty = issue width * resolution delay cycles • A lot of cycles can be wasted!
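A hedged worked example with assumed numbers (not from the slides): for a 4-wide issue machine with a 10-cycle branch resolution delay,

$$\text{Misprediction Penalty} = \text{issue width} \times \text{resolution delay} = 4 \times 10 = 40 \ \text{wasted issue slots per misprediction.}$$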
Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Buffer (BTB)
Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve
Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program
What happens when a branch is predicted? • On misprediction: • No speculative state may commit • Squash instructions in the pipeline • Must not allow stores in the pipeline to occur • Cannot allow stores which would not have happened to commit • Even for good branch predictors more than half of the fetched instructions are squashed
Instruction traffic due to misprediction
[Figure: per-benchmark instruction traffic due to misprediction] Half of the fetched instructions are wasted; more waste in the front-end.
Energy Loss due to Mispredictions
[Figure: per-benchmark energy loss due to mispredictions] 21% average energy loss; more energy waste in the integer benchmarks.
Simple Static Predictors • Simple heuristics • Always taken • Always not taken • Backwards taken / Forward not taken • Relies on the compiler to arrange the code following this assertion • Certain opcodes taken • Programmer provided hints • Profiling
Dynamic Hardware Predictors • Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will it be taken or not? • The hardware can look for clues based on the instructions, or it can use past history; we will discuss both of these directions.
A Generic Branch Predictor • At fetch, the predictor computes f(PC, x) = T or NT and produces the predicted stream; at resolve, the actual stream becomes known • What is f(PC, x)? • x can be any relevant information gathered thus far; so far x was empty
[Figure: fetch/resolve loop showing the predicted and actual instruction streams in execution order]
Bimodal Branch Predictors • Dynamically store information about the branch behaviour • Branches tend to behave in a fixed way • Branches tend to behave in the same way across program execution • Index a Pattern History Table using the branch address • 1 bit: branch behaves as it did last time • Saturating 2 bit counter: branch behaves as it usually does
Saturating-Counter Predictors • Consider a strongly biased branch with an infrequent outcome • TTTTTTTTNTTTTTTTTNTTTT • Last-outcome will mispredict twice per infrequent outcome encounter: • TTTTTTTTNTTTTTTTTNTTTT • Idea: remember the most frequent case • Saturating counter: hysteresis • Often called a bimodal predictor • Captures temporal bias
Bimodal Prediction • Table of 2-bit saturating counters • Predict the most common direction • Advantages: simple, cheap, “good” accuracy • Bimodal will mispredict once per infrequent outcome encounter: TTTTTTTTNTTTTTTTTNTTTT
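A minimal C sketch of the bimodal scheme above, assuming a 4096-entry table and a simple PC-based index (both illustrative choices, not from the slides):

```c
#include <stdint.h>
#include <string.h>

#define PHT_ENTRIES 4096              /* assumed table size (power of two) */

static uint8_t pht[PHT_ENTRIES];      /* 2-bit counters: 0..3 (strong NT .. strong T) */

void bimodal_init(void) {
    memset(pht, 2, sizeof pht);       /* start weakly taken */
}

/* Predict taken when the counter is in the upper half (2 or 3). */
int bimodal_predict(uint32_t pc) {
    return pht[(pc >> 2) % PHT_ENTRIES] >= 2;
}

/* Saturating update: one infrequent outcome nudges the counter but
   cannot flip a strongly biased branch, giving the hysteresis above. */
void bimodal_update(uint32_t pc, int taken) {
    uint8_t *c = &pht[(pc >> 2) % PHT_ENTRIES];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```

On the TTTTTTTTNTTT... stream above, the counter saturates at 3, dips to 2 on the lone N, and still predicts taken afterwards, so only the infrequent outcome itself is mispredicted.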
Correlating Predictors • From program perspective: • Different Branches may be correlated • if (aa == 2) aa = 0; • if (bb == 2) bb = 0; • if (aa != bb) then … • Can be viewed as a pattern detector • Instead of keeping aggregate history information • I.e., most frequent outcome • Keep exact history information • Pattern of n most recent outcomes • Example: • BHR: n most recent branch outcomes • Use PC and BHR (xor?) to access prediction table
Pattern-based Prediction • Nested loops: for i = 0 to N for j = 0 to 3 … • Branch Outcome Stream for j-for branch • 11101110111011101110 • Patterns: • 111 -> 0 • 110 -> 1 • 101 -> 1 • 011 -> 1 • 100% accuracy • Learning time 4 instances • Table Index (PC, 3-bit history)
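A minimal sketch of this pattern-based idea for a single branch (the inner j-loop branch), assuming a 3-bit local history indexing eight 2-bit counters; all sizes are illustrative:

```c
#include <stdint.h>

static uint8_t local_history;      /* last 3 outcomes of this branch, LSB = most recent */
static uint8_t pattern_table[8];   /* one 2-bit counter per 3-bit history pattern */

int pattern_predict(void) {
    return pattern_table[local_history & 0x7] >= 2;
}

void pattern_update(int taken) {
    uint8_t *c = &pattern_table[local_history & 0x7];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the newest outcome into the history */
    local_history = ((local_history << 1) | (taken ? 1 : 0)) & 0x7;
}
```

After a few iterations of the 1110 1110 ... stream, the entries for patterns 111, 110, 101, and 011 settle on the outcomes listed above and the branch is predicted with 100% accuracy.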
Two-level Branch Predictors • A branch outcome depends on the outcomes of previous branches • First level: Branch History Registers (BHR) • Global history / Branch correlation: past executions of all branches • Self history / Private history: past executions of the same branch • Second level: Pattern History Table (PHT) • Use first level information to index a table • Possibly XOR with the branch address • PHT: Usually saturating 2 bit counters • Also private, shared or global
Gshare Predictor (McFarling) • PC and BHR can be • concatenated • completely overlapped • partially overlapped • xored, etc. • How deep should the BHR be? • Really depends on the program • But deeper history increases learning time • May increase the quality of information
[Figure: global BHR and PC combined by f to index the branch history table and produce the prediction]
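A minimal gshare sketch, assuming a 12-bit global history XORed with the shifted PC to index a table of 2-bit counters; the history depth and hash are assumptions for illustration:

```c
#include <stdint.h>

#define GSHARE_BITS    12
#define GSHARE_ENTRIES (1 << GSHARE_BITS)

static uint16_t ghr;                    /* global branch history register */
static uint8_t  gpht[GSHARE_ENTRIES];   /* 2-bit saturating counters */

static uint32_t gshare_index(uint32_t pc) {
    return ((pc >> 2) ^ ghr) & (GSHARE_ENTRIES - 1);   /* the "xored" option above */
}

int gshare_predict(uint32_t pc) {
    return gpht[gshare_index(pc)] >= 2;
}

void gshare_update(uint32_t pc, int taken) {
    uint8_t *c = &gpht[gshare_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the newest outcome into the global history */
    ghr = ((ghr << 1) | (taken ? 1 : 0)) & (GSHARE_ENTRIES - 1);
}
```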
Hybrid Prediction • Combining branch predictors: two or more predictor components combined • Use two different branch predictors • Access both in parallel • A third table (the selector) determines which prediction to use • Different branches benefit from different types of history
[Figure: PC indexes gshare and bimodal in parallel; a selector chooses between their T/NT predictions]
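A minimal sketch of such a combined predictor, reusing the bimodal and gshare sketches above; the selector update rule (train only on disagreement) is a common convention assumed here, not taken from the slides:

```c
#include <stdint.h>

/* prototypes from the earlier sketches */
int  bimodal_predict(uint32_t pc);
void bimodal_update(uint32_t pc, int taken);
int  gshare_predict(uint32_t pc);
void gshare_update(uint32_t pc, int taken);

#define SEL_ENTRIES 4096
static uint8_t selector[SEL_ENTRIES];    /* 2-bit counters: >= 2 means "prefer gshare" */

int hybrid_predict(uint32_t pc) {
    int p_bim = bimodal_predict(pc);     /* both components accessed in parallel in hardware */
    int p_gsh = gshare_predict(pc);
    int use_gshare = selector[(pc >> 2) % SEL_ENTRIES] >= 2;
    return use_gshare ? p_gsh : p_bim;
}

void hybrid_update(uint32_t pc, int taken) {
    /* re-derive both component predictions here for brevity; a real
       pipeline would carry the predictions made at fetch time */
    int p_bim = bimodal_predict(pc);
    int p_gsh = gshare_predict(pc);
    uint8_t *s = &selector[(pc >> 2) % SEL_ENTRIES];

    if (p_bim != p_gsh) {                /* train selector only when the components disagree */
        if (p_gsh == taken) { if (*s < 3) (*s)++; }
        else                { if (*s > 0) (*s)--; }
    }
    bimodal_update(pc, taken);
    gshare_update(pc, taken);
}
```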
Issues Affecting Accurate Branch Prediction • Aliasing • More than one branch may use the same BHT/PHT entry • Constructive • Prediction that would have been incorrect, predicted correctly • Destructive • Prediction that would have been correct, predicted incorrectly • Neutral • No change in the accuracy
More Issues • Training time • Need to see enough branches to uncover pattern • Need enough time to reach steady state • “Wrong” history • Incorrect type of history for the branch • Stale state • Predictor is updated after information is needed • Operating system context switches • More aliasing caused by branches in different programs
Performance Metrics • Misprediction rate • Mispredicted branches per executed branch • Unfortunately the one most commonly reported • Instructions per mispredicted branch • Gives a better idea of the program behaviour • Branches are not evenly spaced
Impact of Realistic Branch Prediction • Limiting the type of branch prediction lowers the attainable IPC: FP: 15 - 45, Integer: 6 - 12
[Figure: IPC per benchmark with realistic branch prediction]
BPP: Power-Aware Branch Predictor • Combined Predictors • Branch Instruction Behavior • BPP (Branch Predictor Prediction) • Results
Combined Predictors • Different behaviors, different sub-predictors • Selector picks the sub-predictor • Improved performance over processors using only one sub-predictor • Consequence: extra power (~50%)
[Figure: combined predictor built from bimodal, gshare, and a selector]
Branch Predictors & Power • Direct effect: up to 10% • Indirect effect: wrong-path instructions • Smaller/less complex predictors mean more wasted energy • Power-aware predictors MUST be highly accurate
Branch Instruction Behavior • Branches tend to keep using the same sub-predictor:
[Figure: per-benchmark breakdown of sub-predictor usage]
BPP (Branch Predictor Prediction) • A small BPP buffer, indexed by branch PC, holds hints about the next two branches • How? Hint encoding: 11: mispredicted branch; 00: branch used bimodal last time; 01: branch used gshare last time
[Figure: BPP buffer of (branch PC, hint) entries feeding the branch predictor]
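A hedged sketch of how these 2-bit hints might drive gating; the field names and the gating interface are illustrative assumptions, not the actual BPP implementation:

```c
#include <stdint.h>

enum { HINT_BIMOD = 0x0, HINT_GSHARE = 0x1, HINT_MISPRED = 0x3 };   /* encoding above */

struct gate_decision {
    int enable_bimod;      /* keep the bimodal table active?       */
    int enable_gshare;     /* keep the gshare table active?        */
    int enable_selector;   /* selector only needed when both run   */
};

/* Decide, one cycle ahead of the branch, which predictor structures
   must stay on.  A branch that mispredicted last time (11) keeps
   everything on, since no single sub-predictor is trusted yet. */
struct gate_decision bpp_gate(uint8_t hint) {
    struct gate_decision d = {1, 1, 1};                 /* default: all on */
    if (hint == HINT_BIMOD)  { d.enable_gshare = 0; d.enable_selector = 0; }
    if (hint == HINT_GSHARE) { d.enable_bimod  = 0; d.enable_selector = 0; }
    return d;
}
```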
BPP example: first appearance of the code sequence A B C D E F
[Figure: on the first pass, the BPP buffer is filled with a hint per instruction: bimodal (BMD), gshare (GSH), mispredicted, or non-branch]
BPP example: second appearance of the code sequence • The stored hints are read one cycle ahead of each branch: depending on the hint, the next cycle gates the selector and bimodal, or the selector and gshare; for non-branch instructions, do nothing
[Figure: BPP buffer lookups during the second pass over A..F, showing the gating decisions]
Results • Power (Total & Branch Predictor’s) and Performance. • Compared to three base cases: • A) Non-Gated Combined (CMB) • B) Bimodal (BMD) • C) Gshare (GSH) • Reported for 32k entry Banked Predictors.
Performance • Within 0.4% of CMB; better than BMD (by 7%) and GSH (by 3%)
Branch Predictor’s Energy • 13% less than CMB; more than BMD (by 35%) and GSH (by 22%)
Total Energy • 0.3%, 4.5%, and 1.8% less than CMB, BMD, and GSH respectively
ILP, benefits and costs? • How can we extract more ILP? • What are the costs?
Upper Limit to ILP: Ideal Machine • Amount of parallelism when there are no branch mispredictions and we’re limited only by data dependencies • IPC (instructions that could theoretically be issued per cycle): FP: 75 - 150, Integer: 18 - 60
[Figure: IPC per benchmark for the ideal machine]
Complexity-Effective Designs • History: “Brainiacs” and “Speed demons” • Brainiacs: maximizing the number of instructions issued per clock cycle • Speed demons: simpler implementation with a very fast clock • Complexity-Effective • A complexity-effective architecture combines the benefits of complex issue schemes with the benefits of a simpler implementation and a fast clock cycle • Complexity measurement: delay of the critical path • Proposed Architecture • High performance (high IPC) with a very high clock frequency
Extracting More Parallelism • Want: high IPC + fast clock + low power
[Figure: today’s machines vs. possible future ones, with issue widths of 4 and 8 and window sizes of 128 and 256; higher IPC, but what happens to clock and power?]
Generic pipeline description • Baseline superscalar model • Criteria for sources of complexity (delay): • structures whose delay is a function of issue window size and issue width • structures which tend to rely on broadcast operations over long wires
Sources of complexity • Register renaming logic • Translates logical register designators to physical register designators • Wakeup logic • Responsible for waking up instructions waiting for their source operands to become available • Selection logic • Responsible for selecting instructions for execution from the pool of ready instructions • Bypass logic • Bypasses operand values from instructions that have completed execution • Other structures not considered here: • Access time of the register file varies with the number of registers and the number of ports • Access time of a cache is a function of the size of the cache and its associativity
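A hedged sketch of the wakeup operation only (not the paper’s circuit model): a completed instruction’s destination tag is broadcast to every issue-window entry, which compares it against its source tags; sizes and field names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define WINDOW_SIZE 32                   /* assumed issue window size */

struct window_entry {
    bool     valid;
    uint16_t src_tag[2];                 /* physical register tags of the two sources */
    bool     src_ready[2];
};

static struct window_entry window[WINDOW_SIZE];

/* Broadcast one result tag across the window: this is the long-wire
   operation whose delay grows with window size and issue width. */
void wakeup(uint16_t result_tag) {
    for (int i = 0; i < WINDOW_SIZE; i++) {
        if (!window[i].valid) continue;
        for (int s = 0; s < 2; s++)
            if (window[i].src_tag[s] == result_tag)
                window[i].src_ready[s] = true;
    }
}

/* An entry becomes eligible for the selection logic once both sources
   have been woken up. */
bool entry_ready(int i) {
    return window[i].valid && window[i].src_ready[0] && window[i].src_ready[1];
}
```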
Delay analysis for rename logic • Delay analysis for the RAM scheme • The RAM scheme operates like a standard RAM • Issue width affects delay through its impact on wire lengths: increasing issue width increases the number of bit/word lines, so the delay of the rename logic is a linear function of issue width • Spice results • Total delay and each component delay increase linearly with issue width • Bit-line and word-line delay worsens as the feature size is reduced (logic delay shrinks linearly with feature size, but wire delay falls at a slower rate)