
ELEC 669 Low Power Design Techniques Lecture 2



Presentation Transcript


  1. ELEC 669 Low Power Design Techniques Lecture 2 Amirali Baniasadi amirali@ece.uvic.ca

  2. How to write a review? • Think Critically. • What if? • Next Step? • Any other applications?

  3. Branches • Instructions which can alter the flow of instruction execution in a program

  4. Motivation • Pipelined execution • A new instruction enters the pipeline every cycle... • ...but still takes several cycles to execute • Control flow changes • Two possible paths after a branch is fetched • Introduces pipeline "bubbles" • Branch delay slots • Prediction offers a chance to avoid these bubbles • [Figure: instructions overlapped in a five-stage pipeline (F, D, A, M, W). A branch is fetched but takes N cycles to execute, leaving a pipeline bubble.]

  5. Techniques for handling branches • Stalling • Branch delay slots • Relies on programmer/compiler to fill • Depends on being able to find suitable instructions • Ties resolution delay to a particular pipeline

  6. Why aren’t these techniques acceptable? • Branches are frequent - 15-25% • Today’s pipelines are deeper and wider • Higher performance penalty for stalling • Misprediction Penalty = issue width * resolution delay cycles • A lot of cycles can be wasted!!!
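
The penalty formula on this slide can be tried out with a few lines of Python. This is only a sketch: the function name and the two sample machines (1-wide with a 3-cycle delay, 4-wide with a 10-cycle delay) are assumptions for illustration, not figures from the lecture.

```python
def misprediction_penalty(issue_width: int, resolution_delay_cycles: int) -> int:
    """Issue slots lost per mispredicted branch, using the slide's formula."""
    return issue_width * resolution_delay_cycles


# Hypothetical machines: a narrow, shallow pipeline vs. a wide, deep one.
narrow = misprediction_penalty(issue_width=1, resolution_delay_cycles=3)
wide = misprediction_penalty(issue_width=4, resolution_delay_cycles=10)
print(narrow, wide)  # 3 40
```

The wider and deeper machine loses far more issue slots per misprediction, which is the slide's argument against simply stalling.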

  7. Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Buffer (BTB)

  8. Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve

  9. Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program

  10. What happens when a branch is predicted? • On misprediction: • No speculative state may commit • Squash instructions in the pipeline • Must not allow stores in the pipeline to occur • Cannot allow stores which would not have happened to commit • Even for good branch predictors more than half of the fetched instructions are squashed

  11. Instruction traffic due to misprediction • Half of fetched instructions wasted. • More waste in front-end. • [Figure: chart with a "better" direction marker]

  12. Energy Loss due to Mispredictions • 21% average energy loss. • More energy waste in integer benchmarks. • [Figure: chart with a "better" direction marker]

  13. Simple Static Predictors • Simple heuristics • Always taken • Always not taken • Backwards taken / Forward not taken • Relies on the compiler to arrange the code following this assertion • Certain opcodes taken • Programmer provided hints • Profiling
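
The "backwards taken / forward not taken" heuristic needs only the branch address and its target. A minimal sketch follows; the function name and the sample addresses are hypothetical, and treating a branch to its own address as backward is an assumption.

```python
def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """Static BTFNT: predict taken (True) when the target is not ahead of the branch.

    Backward branches usually close loops, so they are predicted taken.
    Assumption: a branch to its own address counts as backward.
    """
    return target_pc <= branch_pc


# Hypothetical addresses: a loop-closing branch and a forward skip.
print(predict_btfnt(branch_pc=0x400, target_pc=0x3F0))  # True
print(predict_btfnt(branch_pc=0x400, target_pc=0x420))  # False
```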

  14. Simple Static Predictors

  15. Dynamic Hardware Predictors • Dynamic Branch Prediction is the ability of the hardware to make an educated guess about which way a branch will go - will the branch be taken or not. • The hardware can look for clues based on the instructions, or it can use past history - we will discuss both of these directions.

  16. A Generic Branch Predictor • Fetch: f(PC, x) = T or NT produces the predicted stream (PC, T or NT) • Resolve: produces the actual stream, in execution order • What’s f(PC, x)? • x can be any relevant info • Thus far x was empty

  17. Bimodal Branch Predictors • Dynamically store information about the branch behaviour • Branches tend to behave in a fixed way • Branches tend to behave in the same way across program execution • Index a Pattern History Table using the branch address • 1 bit: branch behaves as it did last time • Saturating 2 bit counter: branch behaves as it usually does

  18. Saturating-Counter Predictors • Consider a strongly biased branch with an infrequent outcome • TTTTTTTTNTTTTTTTTNTTTT • Last-outcome will mispredict twice per infrequent outcome encounter: • TTTTTTTTNTTTTTTTTNTTTT • Idea: Remember most frequent case • Saturating-Counter: Hysteresis • often called bi-modal predictor • Captures Temporal Bias

  19. Bimodal Prediction • Table of 2-bit saturating counters • Predict the most common direction • Advantages: simple, cheap, “good” accuracy • Bimodal will mispredict once per infrequent outcome encounter: TTTTTTTTNTTTTTTTTNTTTT
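
The "twice versus once" claim on the last two slides can be checked by replaying the slide's outcome stream through both schemes. This is a sketch for a single branch: the function names, the initial prediction of taken, and the initial counter value of 3 (strongly taken) are assumptions.

```python
def last_outcome_mispredictions(outcomes: str, initial: str = "T") -> int:
    """1-bit scheme: predict whatever the branch did last time."""
    last, misses = initial, 0
    for actual in outcomes:
        if last != actual:
            misses += 1
        last = actual
    return misses


def two_bit_mispredictions(outcomes: str, counter: int = 3) -> int:
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 not taken."""
    misses = 0
    for actual in outcomes:
        predicted = "T" if counter >= 2 else "N"
        if predicted != actual:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return misses


stream = "TTTTTTTTNTTTTTTTTNTTTT"  # outcome stream from the slides
print(last_outcome_mispredictions(stream))  # 4 (two per N)
print(two_bit_mispredictions(stream))       # 2 (one per N)
```

The hysteresis of the 2-bit counter absorbs each isolated N: the counter drops from 3 to 2, which still predicts taken for the next T.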

  20. Bimodal Branch Predictors

  21. Correlating Predictors • From program perspective: • Different Branches may be correlated • if (aa == 2) aa = 0; • if (bb == 2) bb = 0; • if (aa != bb) then … • Can be viewed as a pattern detector • Instead of keeping aggregate history information • I.e., most frequent outcome • Keep exact history information • Pattern of n most recent outcomes • Example: • BHR: n most recent branch outcomes • Use PC and BHR (xor?) to access prediction table

  22. Pattern-based Prediction • Nested loops: for i = 0 to N for j = 0 to 3 … • Branch Outcome Stream for j-for branch • 11101110111011101110 • Patterns: • 111 -> 0 • 110 -> 1 • 101 -> 1 • 011 -> 1 • 100% accuracy • Learning time 4 instances • Table Index (PC, 3-bit history)
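
The pattern table on this slide can be reproduced with a small simulation. The sketch below tracks one branch only, so the PC part of the table index is omitted; the function name, the 1-bit "last outcome" table entries, and the default prediction of taken for unseen patterns are all assumptions.

```python
def run_pattern_predictor(outcomes: str, history_bits: int = 3):
    """Predict each outcome from the previous `history_bits` outcomes of the branch."""
    table = {}          # history pattern -> outcome seen last time after it
    history = ""
    correct_flags = []  # one flag per prediction actually made
    for actual in outcomes:
        if len(history) == history_bits:
            predicted = table.get(history, "1")  # assumption: default to taken
            correct_flags.append(predicted == actual)
            table[history] = actual
        history = (history + actual)[-history_bits:]
    return table, correct_flags


stream = "11101110111011101110"  # outcome stream of the inner-loop branch
table, flags = run_pattern_predictor(stream)
print(table)           # learned patterns match the slide
print(all(flags[4:]))  # True: no mispredictions once the four patterns are learned
```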

  23. Two-level Branch Predictors • A branch outcome depends on the outcomes of previous branches • First level: Branch History Registers (BHR) • Global history / Branch correlation: past executions of all branches • Self history / Private history: past executions of the same branch • Second level: Pattern History Table (PHT) • Use first level information to index a table • Possibly XOR with the branch address • PHT: Usually saturating 2 bit counters • Also private, shared or global

  24. Gshare Predictor (McFarling) • PC and BHR can be • concatenated • completely overlapped • partially overlapped • xored, etc. • How deep should the BHR be? • Really depends on program • But, deeper increases learning time • May increase quality of information • [Figure: PC and global BHR are combined by a function f to index the Branch History Table, which supplies the prediction]
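
A minimal gshare sketch, using XOR as the combining function f. The class name, the 10-bit index, dropping the two low PC bits, the weakly-taken initial counters, and the sample PC are assumptions for illustration rather than parameters from the lecture.

```python
class Gshare:
    """Global history XORed with the PC indexes a table of 2-bit counters."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.table = [2] * (1 << index_bits)  # assumption: start weakly taken
        self.ghr = 0                          # global branch history register

    def _index(self, pc: int) -> int:
        # Assumption: instructions are 4-byte aligned, so drop the low 2 bits.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


# Replay a repeating T,T,T,N loop branch at a hypothetical PC.
predictor, results = Gshare(), []
for step in range(100):
    taken = step % 4 != 3
    results.append(predictor.predict(0x4000) == taken)
    predictor.update(0x4000, taken)
print(all(results[-40:]))  # True once the history register and counters have trained
```

A deeper history register takes longer to fill and train, which is the learning-time cost the slide mentions.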

  25. Two-level Branch Predictors (II)

  26. Hybrid Prediction • Combining branch predictors • Use two different branch predictors • Access both in parallel • A third table determines which prediction to use • Two or more predictor components combined • Different branches benefit from different types of history • [Figure: the PC indexes GSHARE and Bimodal in parallel; each outputs T/NT and a Selector picks the final T/NT]
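
The "third table" can be sketched as a table of 2-bit counters that moves toward whichever component was right when the two disagree. The class name, table size, initial bias toward component B, and the update rule are assumptions; the component predictions are passed in as booleans to keep the example short.

```python
class Selector:
    """Per-branch chooser between two component predictions (A and B)."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.table = [2] * (1 << index_bits)  # >= 2 selects B (e.g. gshare)

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def choose(self, pc: int, pred_a: bool, pred_b: bool) -> bool:
        return pred_b if self.table[self._index(pc)] >= 2 else pred_a

    def update(self, pc: int, pred_a: bool, pred_b: bool, taken: bool) -> None:
        a_ok, b_ok = pred_a == taken, pred_b == taken
        i = self._index(pc)
        if b_ok and not a_ok:
            self.table[i] = min(3, self.table[i] + 1)
        elif a_ok and not b_ok:
            self.table[i] = max(0, self.table[i] - 1)
        # If both are right or both are wrong, the counter stays unchanged.


sel = Selector()
pc = 0x4000  # hypothetical branch address
print(sel.choose(pc, pred_a=True, pred_b=False))  # False: B is trusted initially
sel.update(pc, pred_a=True, pred_b=False, taken=True)
print(sel.choose(pc, pred_a=True, pred_b=False))  # True: A was right, so A is now chosen
```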

  27. Hybrid Branch Predictors (II)

  28. Issues Affecting Accurate Branch Prediction • Aliasing • More than one branch may use the same BHT/PHT entry • Constructive • Prediction that would have been incorrect, predicted correctly • Destructive • Prediction that would have been correct, predicted incorrectly • Neutral • No change in the accuracy
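
Destructive aliasing can be demonstrated with two oppositely biased branches that share one counter. In this sketch the addresses, table sizes, and helper names are hypothetical, and the table is indexed by PC bits only.

```python
def pht_index(pc: int, index_bits: int) -> int:
    """Index a prediction table with the low PC bits (4-byte instructions assumed)."""
    return (pc >> 2) & ((1 << index_bits) - 1)


def count_misses(branches, index_bits: int, rounds: int = 10) -> int:
    """branches: list of (pc, always_taken) pairs executed in turn each round."""
    table = [2] * (1 << index_bits)  # 2-bit counters, weakly taken
    misses = 0
    for _ in range(rounds):
        for pc, taken in branches:
            i = pht_index(pc, index_bits)
            if (table[i] >= 2) != taken:
                misses += 1
            table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return misses


# One always-taken and one never-taken branch, 64 bytes apart.
branches = [(0x1000, True), (0x1040, False)]
print(count_misses(branches, index_bits=4))  # 10: both map to entry 0 and interfere
print(count_misses(branches, index_bits=5))  # 1: separate entries, one training miss
```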

  29. More Issues • Training time • Need to see enough branches to uncover pattern • Need enough time to reach steady state • “Wrong” history • Incorrect type of history for the branch • Stale state • Predictor is updated after information is needed • Operating system context switches • More aliasing caused by branches in different programs

  30. Performance Metrics • Misprediction rate • Mispredicted branches per executed branch • Unfortunately the most commonly reported • Instructions per mispredicted branch • Gives a better idea of the program behaviour • Branches are not evenly spaced
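
The two metrics are computed from the same counters but answer different questions. The counts below are made-up numbers for illustration (20% branches, in line with the 15-25% frequency quoted earlier); the function names are also assumptions.

```python
def misprediction_rate(mispredicted: int, branches: int) -> float:
    """Mispredicted branches per executed branch."""
    return mispredicted / branches


def instructions_per_misprediction(instructions: int, mispredicted: int) -> float:
    """Average number of instructions between two mispredicted branches."""
    return instructions / mispredicted


# Hypothetical run: 1,000,000 instructions, 200,000 branches, 10,000 mispredictions.
rate = misprediction_rate(10_000, 200_000)
ipm = instructions_per_misprediction(1_000_000, 10_000)
print(rate, ipm)
```

Two programs with the same misprediction rate can differ widely in instructions per misprediction if their branch densities differ, which is why the second metric says more about how often the pipeline is actually disrupted.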

  31. Impact of Realistic Branch Prediction • Limiting the type of branch prediction. • FP: 15 - 45 IPC • Integer: 6 - 12 IPC

  32. BPP:Power-Aware Branch Predictor • Combined Predictors • Branch Instruction Behavior • BPP (Branch Predictor Prediction) • Results

  33. Combined Predictors • Different Behaviors, Different Sub-Predictors • Selector Picks Sub-Predictor. • Improved Performance over processors using only one sub-predictor • Consequence: Extra Power (~50%) • [Figure: Bimodal and Gshare sub-predictors feeding a Selector]

  34. Branch Predictors & Power • Direct Effect: Up to 10%. • Indirect Effect: Wrong Path Instructions: • Smaller/Less Complex Predictors, More Wasted Energy. • Power-Aware Predictors MUST be Highly Accurate.

  35. Branch Instruction Behavior • Branches use the same sub-predictor:

  36. Branch Predictor Prediction • BPP buffer: each entry pairs a branch PC with a hint • Hints on next two branches. • HOW? • 11: Mispredicted branch • 00: Branch used Bimod last time • 01: Branch used Gshare last time

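  37. BPP: example • Code sequence: first appearance • [Figure: code sequence and BPP buffer with Branch PC and HINT columns; labels as extracted: A B C A B C D D E F; hint bits as extracted: 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 1; legend: BMD, NON-BRANCH, GSH, MISS-PREDICTED]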

  38. BPP: example • Code sequence: second appearance • NEXT CYCLE: Gate Selector and Bimod • DO NOTHING • NEXT CYCLE: Gate Selector and Gshare • [Figure: code sequence and BPP buffer with Branch PC and HINT columns; labels as extracted: A A B C D B C D E F; hint bits as extracted: 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 1; legend: BRANCH, NON-BRANCH]

  39. Results • Power (Total & Branch Predictor’s) and Performance. • Compared to three base cases: • A) Non-Gated Combined (CMB) • B) Bimodal (BMD) • C) Gshare (GSH) • Reported for 32k entry Banked Predictors.

  40. Performance • Within 0.4% of CMB, better than BMD (7%) and GSH (3%)

  41. Branch Predictor’s Energy • 13% less than CMB, more than BMD (35%) and GSH (22%)

  42. Total Energy • 0.3%, 4.5% and 1.8% less than CMB, BMD and GSH, respectively

  43. ILP, benefits and costs? • How can we extract more ILP? • What are the costs?

  44. Upper Limit to ILP: Ideal Machine • Amount of parallelism when there are no branch mispredictions and we’re limited only by data dependencies. • FP: 75 - 150 IPC • Integer: 18 - 60 IPC • IPC: instructions that could theoretically be issued per cycle.

  45. Complexity-Effective Designs • History: “Brainiacs” and “Speed demons” • Brainiacs – maximizing the # of instructions issued per clock cycle • Speed demons – simpler implementation with a very fast clock • Complexity-Effective • A complexity-effective architecture combines the benefits of complex issue schemes with those of a simpler implementation and a fast clock cycle • Complexity measurement: delay of the critical path • Proposed Architecture • High performance (high IPC) with a very high clock frequency

  46. Extracting More Parallelism • Higher IPC • Clock, Power? • Want: High IPC + Fast Clock + Low Power • [Figure: "Today" vs. "Future?" machine; numeric labels as extracted: 8, 8, 4, 128, 256]

  47. Generic pipeline description • Baseline superscalar model • Criteria for sources of complexity (delay) • structures whose delay is a function of issue window size and issue width • structures that tend to rely on broadcast operations over long wires

  48. Sources of complexity • Register renaming logic • Translates logical register designators to physical register designators • Wakeup logic • Responsible for waking up instructions waiting for their source operands to become available • Selection logic • Responsible for selecting instructions for execution from the pool of ready instructions • Bypass logic • Bypasses the operand values from instructions that have completed execution • Other structures not to be considered here • Access time of the register file varies with the # of registers and the # of ports. • Access time of a cache is a function of the size of the cache and the associativity of the cache
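
To make the renaming step concrete, here is a minimal sketch of a map table indexed by logical register number. The class name, register counts, and the free list are assumptions, and the sketch leaves out freeing registers at commit and renaming several instructions per cycle.

```python
from collections import deque


class RenameTable:
    """Maps logical register numbers to physical register numbers."""

    def __init__(self, num_logical: int = 8, num_physical: int = 16):
        # Assumption: logical register i starts out mapped to physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))

    def rename(self, dest: int, srcs: list) -> tuple:
        # Sources read the current mapping before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        new_dest = self.free.popleft()
        self.map[dest] = new_dest
        return new_dest, phys_srcs


rt = RenameTable()
print(rt.rename(dest=1, srcs=[2, 3]))  # (8, [2, 3])
print(rt.rename(dest=4, srcs=[1, 1]))  # (9, [8, 8]): reads the renamed r1
```

A wider machine has to perform several such lookups and updates in the same cycle, which means more table ports and longer wires, and the delay of the structure grows accordingly.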

  49. Register rename logic complexity

  50. Delay analysis for rename logic • Delay analysis for RAM scheme • RAM scheme operates like a standard RAM • Issue width affects delay through its impact on wire lengths - Increasing issue width increases the # of bit/word lines - Delay of rename logic is a linear function of the issue width. • Spice result • Total delay & each component delay increase linearly with IW • Bit line & word line delay worsens as the feature size is reduced. (Logic delay is reduced linearly as the feature size is reduced, but wire delay falls at a slower rate.)
