CENG 450 Computer Systems and Architecture, Lecture 10 • Amirali Baniasadi, amirali@ece.uvic.ca
This Lecture • Tomasulo • Branch Prediction
Tomasulo’s Algorithm • Developed for IBM 360/91 ~3 years after CDC 6600 (1966) • Goal: high performance without special compilers • Differences between IBM 360 & CDC 6600 ISA • IBM has only 2 register specifiers per instruction vs. 3 in CDC 6600 • IBM has 4 FP registers vs. 8 in CDC 6600 • IBM has long memory access delays, long FP delays • Why study it? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …
Tomasulo’s Algorithm • Avoid RAW hazards • Execute an instruction only when its operands are available • Has a scheme to track when operands are available • Avoid WAR and WAW hazards • Supports register renaming • Renames all destination registers: an out-of-order write does not affect any instruction that depends on an earlier value of an operand • Example (S and T are temporary registers; original code on the left, renamed code on the right):
    DIVD F0, F2, F4        DIVD F0, F2, F4
    ADDD F6, F0, F8        ADDD S, F0, F8
    SD   F6, 0(R1)         SD   S, 0(R1)
    SUBD F8, F10, F14      SUBD T, F10, F14
    MULD F6, F10, F8       MULD F6, F10, T
• WAR: SUBD writes F8, which the earlier ADDD reads • WAW: MULD writes F6, which the earlier ADDD also writes • Supports the overlapped execution of multiple iterations of a loop
Tomasulo Algorithm vs. Scoreboard • Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard (with bypassing) • FU buffers are called reservation stations; they hold pending operands • Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming • Avoids WAR, WAW hazards • More reservation stations than registers, so can do optimizations compilers can’t • Results go to FUs from RSs, not through registers, over the Common Data Bus, which broadcasts results to all FUs • Loads and stores are treated as FUs with reservation stations as well
Three Stages of Tomasulo’s Algorithm • Issue: get instruction from FP Op Queue • If a reservation station is free (no structural hazard), issue the instruction and operand values (if they are in the registers) • If no reservation station is free, the instruction stalls • If operands are not in the registers, rename registers (eliminating WAR, WAW hazards) and keep track of the functional units producing the operands • Execution: operate on operands (EX) • If both operands are ready, execute • If not ready, watch the Common Data Bus for the result (avoids RAW hazards) • Write result: finish execution (WB) • Write on the Common Data Bus to all units; mark reservation station available • Normal data bus: data + destination (“go to” bus) • Common data bus: data + source (“come from” bus); broadcasts • Each stage can take a different number of clock cycles
Reservation Station Components • Op • Operation to perform in the unit (e.g., + or –) • Vj, Vk • Values of the source operands • Store buffers have a V field with the result to be stored • Qj, Qk • Reservation stations producing the source operands (Qj, Qk = 0 => ready) • Busy • Indicates the reservation station or FU is busy • Qi: register result status • Indicates which functional unit (if any) will write to the register • ‘0’ when no pending instruction will write to this register
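The fields above, together with the issue and write-result stages, can be sketched in a few lines of Python. This is a minimal illustration rather than the 360/91 design: the names `ReservationStation`, `issue`, and `broadcast` are hypothetical, tags are assumed to be the 1-based station index (so 0 means "ready", as on the slide), and the execute stage is left out, so the caller supplies the result value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of first source operand
    vk: Optional[float] = None  # value of second source operand
    qj: int = 0                 # tag of the RS producing Vj (0 => Vj is ready)
    qk: int = 0                 # tag of the RS producing Vk (0 => Vk is ready)

    def ready(self) -> bool:
        return self.busy and self.qj == 0 and self.qk == 0


def issue(stations, regs, reg_status, op, dest, src1, src2) -> int:
    """Issue into the first free station; return its tag, or 0 to signal a stall."""
    for idx, rs in enumerate(stations):
        if not rs.busy:
            tag = idx + 1
            rs.busy, rs.op = True, op
            # Sources: copy the value if available, otherwise remember the producer tag.
            rs.qj = reg_status.get(src1, 0)
            rs.vj = regs[src1] if rs.qj == 0 else None
            rs.qk = reg_status.get(src2, 0)
            rs.vk = regs[src2] if rs.qk == 0 else None
            reg_status[dest] = tag  # Qi: this station will write dest
            return tag
    return 0


def broadcast(stations, regs, reg_status, tag, value) -> None:
    """Write result: put (tag, value) on the common data bus and free the station."""
    for rs in stations:
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, 0
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, 0
    for reg, q in reg_status.items():
        if q == tag:  # only the most recent writer updates the register
            regs[reg] = value
            reg_status[reg] = 0
    done = stations[tag - 1]
    done.busy, done.op, done.vj, done.vk = False, None, None, None


# Hypothetical register values; MULT F0,F2,F4 followed by a dependent DIVD F10,F0,F6.
regs = {"F0": 0.0, "F2": 3.0, "F4": 2.0, "F6": 4.0, "F10": 0.0}
reg_status = {}
stations = [ReservationStation("Mult1"), ReservationStation("Mult2")]
t1 = issue(stations, regs, reg_status, "MULT", "F0", "F2", "F4")
t2 = issue(stations, regs, reg_status, "DIVD", "F10", "F0", "F6")
divd_ready_before = stations[1].ready()  # DIVD waits on the tag of Mult1
broadcast(stations, regs, reg_status, t1, regs["F2"] * regs["F4"])
divd_ready_after = stations[1].ready()
```

Reading the source status before setting the destination status matters for instructions such as ADDD F6, F6, F2, where the same register is both a source and the destination.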
Example
    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULT  F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2
Latencies (clock cycles)
    LD            1
    MULT          10
    DIVD          40
    ADDD, SUBD    2
Review: Tomasulo • Prevents registers from becoming a bottleneck • Avoids the WAR, WAW hazards of the scoreboard • Allows loop unrolling in HW • Not limited to basic blocks (relies on branch prediction to go beyond branches) • Lasting contributions • Dynamic scheduling • Register renaming • Load/store disambiguation • 360/91 descendants are PowerPC 604, 620; MIPS R10000; HP PA-8000; Intel Pentium Pro
Example of WAR hazards in Tomasulo’s Algorithm • Example: LF F6, 34(R2); DIVF F10, F6, F0; ADDF F6, F8, F2 • ADDF can safely finish before DIVF has read register F6 because: • DIVF has renamed register F6 to point at LF’s functional unit • LF broadcasts its result on the Common Data Bus
Register Renaming • Register renaming • Change register names to eliminate WAR/WAW hazards • Hardware renaming: most beautiful thing in architecture • Key: think of architectural registers as names, not locations • Can have more locations than names • Dynamically map names to locations • Map table: hardware structure holds current mappings • Writes allocate new location, note in map table • Reads find location of most recent write by looking at map table • Must de-allocate locations appropriately
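The map-table idea can be sketched as follows, applied to the five-instruction sequence from the earlier renaming slide. This is a simplified illustration under stated assumptions: the function `rename` and the physical names `P0`, `P1`, ... are hypothetical, every destination gets a fresh location (the slide renames only two registers, to S and T), and de-allocation of locations is omitted.

```python
from itertools import count


def rename(program):
    """Rename destinations with a map table.

    program: list of (op, dest_or_None, [sources]) tuples.
    Returns the renamed program in the same shape.
    """
    map_table = {}                          # architectural name -> current location
    fresh = (f"P{i}" for i in count())      # unlimited supply of new locations (assumption)
    renamed = []
    for op, dest, srcs in program:
        # Reads: find the location of the most recent write to each source.
        new_srcs = [map_table.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            # Writes: allocate a new location and note it in the map table.
            new_dest = next(fresh)
            map_table[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIVD", "F0", ["F2", "F4"]),
    ("ADDD", "F6", ["F0", "F8"]),
    ("SD", None, ["F6", "R1"]),      # store reads F6 and the address register
    ("SUBD", "F8", ["F10", "F14"]),
    ("MULD", "F6", ["F10", "F8"]),
]
renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, SUBD writes a location that ADDD never reads (no WAR) and MULD writes a different location than ADDD (no WAW), while the true dependences (ADDD on DIVD, SD on ADDD, MULD on SUBD) are preserved.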
Tomasulo Register Renaming • Creating operation maps destination register • On dispatch, register renamed to tag of allocated RS • Register table entry := RS number • On completion, register written • Register table entry := 0 • Subsequent operation looks up sources in register table • Entry == 0 -> register has already been written • Copy register value to RS • Eliminates WAR hazards (private valid copy of register in RS) • Entry != 0 -> register value not ready, some RS will provide it • Copy entry (== RS tag) to RS, monitor CDB for that tag • CDB: Common Data Bus
Branches • Instructions which can alter the flow of instruction execution in a program
Motivation • Pipelined execution • A new instruction enters the pipeline every cycle... • …but still takes several cycles to execute • Control flow changes • Two possible paths after a branch is fetched • Introduces pipeline "bubbles" • Branch delay slots • Prediction offers a chance to avoid these bubbles • [Figure: four overlapped instructions moving through pipeline stages F, D, A, M, W; a branch is fetched but takes N cycles to execute, leaving a pipeline bubble]
Techniques for handling branches • Stalling • Branch delay slots • Relies on programmer/compiler to fill • Depends on being able to find suitable instructions • Ties resolution delay to a particular pipeline
Why aren’t these techniques acceptable? • Branches are frequent - 15-25% • Today’s pipelines are deeper and wider • Higher performance penalty for stalling • Misprediction Penalty = issue width * resolution delay cycles • A lot of cycles can be wasted!!!
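A quick worked example of the penalty formula above, as a sketch. The function names and all numbers below (a 4-wide machine, a 10-cycle resolution delay, a 10% misprediction rate) are illustrative assumptions; only the formula and the 15-25% branch frequency range come from the slide.

```python
def misprediction_penalty(issue_width: int, resolution_delay_cycles: int) -> int:
    """Issue slots lost per mispredicted branch, per the slide's formula."""
    return issue_width * resolution_delay_cycles


def wasted_slots_per_instruction(branch_freq, mispredict_rate, issue_width, delay):
    """Average issue slots lost per instruction (assumes penalties do not overlap)."""
    return branch_freq * mispredict_rate * misprediction_penalty(issue_width, delay)


penalty = misprediction_penalty(issue_width=4, resolution_delay_cycles=10)
average_loss = wasted_slots_per_instruction(0.20, 0.10, 4, 10)
print(penalty)       # 40 issue slots per misprediction
print(average_loss)  # roughly 0.8 slots lost per instruction
```

Doubling either the width or the resolution delay doubles the loss, which is why deeper and wider pipelines make stalling less acceptable.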
Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Buffer (BTB)
Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve
Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program
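The static strategies listed above can be compared on a small trace. This is a minimal sketch: the function names are hypothetical, and the trace (a backward loop branch taken 9 times out of 10, plus a forward branch taken 2 times out of 10) is invented for illustration.

```python
def always_taken(pc: int, target: int) -> bool:
    return True


def always_not_taken(pc: int, target: int) -> bool:
    return False


def btfnt(pc: int, target: int) -> bool:
    """Backwards Taken, Forward Not Taken: predict taken only if the target is behind the PC."""
    return target < pc


def correct_predictions(predictor, trace) -> int:
    """trace: list of (pc, target, taken) tuples; returns how many the predictor got right."""
    return sum(predictor(pc, target) == taken for pc, target, taken in trace)


# Hypothetical trace: loop-closing branch at 0x40 jumping back to 0x20,
# and a forward branch at 0x60 skipping ahead to 0x80.
loop_branch = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
forward_branch = [(0x60, 0x80, True)] * 2 + [(0x60, 0x80, False)] * 8
trace = loop_branch + forward_branch

results = {
    "always taken": correct_predictions(always_taken, trace),
    "always not taken": correct_predictions(always_not_taken, trace),
    "BTFNT": correct_predictions(btfnt, trace),
}
print(results, "out of", len(trace))
```

On this particular trace BTFNT does best because loop-closing branches are usually taken; a differently shaped trace could favor another heuristic, which is the motivation for profile-driven and dynamic prediction.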
What happens when a branch is predicted? • On misprediction: • No speculative state may commit • Squash instructions in the pipeline • Must not allow stores in the pipeline to occur • Cannot allow stores which would not have happened to commit • Even for good branch predictors more than half of the fetched instructions are squashed
Simple Static Predictors • Simple heuristics • Always taken • Always not taken • Backwards taken / Forward not taken • Relies on the compiler to arrange the code following this assertion • Certain opcodes taken • Programmer provided hints • Profiling
Dynamic Hardware Predictors • Dynamic Branch Prediction is the ability of the hardware to make an educated guess about which way a branch will go - will the branch be taken or not. • The hardware can look for clues based on the instructions, or it can use past history - we will discuss both of these directions.
A Generic Branch Predictor • f(PC, x) = T or NT • What’s f(PC, x)? • x can be any relevant info • Thus far x was empty • [Figure: Fetch applies f(PC, x) to produce the predicted stream (PC, T or NT); Resolve produces the actual stream in execution order]
Bimodal Branch Predictors • Dynamically store information about the branch behaviour • Branches tend to behave in a fixed way • Branches tend to behave in the same way across program execution • Index a Pattern History Table using the branch address • 1 bit: branch behaves as it did last time • Saturating 2 bit counter: branch behaves as it usually does
Saturating-Counter Predictors • Consider a strongly biased branch with an infrequent outcome • TTTTTTTTNTTTTTTTTNTTTT • Last-outcome will mispredict twice per infrequent outcome encounter: • TTTTTTTTNTTTTTTTTNTTTT • Idea: remember the most frequent case • Saturating counter: hysteresis • Often called a bimodal predictor • Captures temporal bias
Bimodal Prediction • Table of 2-bit saturating counters • Predict the most common direction • Advantages: simple, cheap, “good” accuracy • Bimodal will mispredict once per infrequent outcome encounter: TTTTTTTTNTTTTTTTTNTTTT
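The "twice versus once" claim can be checked by simulating both predictors on the slide's outcome string. This is a small sketch for a single branch; the function names and the initial states (last outcome starts as taken, counter starts at 3, "strongly taken") are assumptions.

```python
def last_outcome_mispredictions(outcomes: str, initial: str = "T") -> int:
    """1-bit predictor: predict whatever the branch did last time."""
    last, misses = initial, 0
    for actual in outcomes:
        if actual != last:
            misses += 1
        last = actual
    return misses


def two_bit_mispredictions(outcomes: str, counter: int = 3) -> int:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for actual in outcomes:
        predicted = "T" if counter >= 2 else "N"
        if predicted != actual:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return misses


stream = "TTTTTTTTNTTTTTTTTNTTTT"
print(last_outcome_mispredictions(stream))  # 4: two per N (the N itself and the T after it)
print(two_bit_mispredictions(stream))       # 2: one per N
```

The hysteresis comes from the counter needing two consecutive surprises before it flips its prediction, so a single N only moves it from 3 to 2.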
Correlating Predictors • From program perspective: • Different Branches may be correlated • if (aa == 2) aa = 0; • if (bb == 2) bb = 0; • if (aa != bb) then … • Can be viewed as a pattern detector • Instead of keeping aggregate history information • I.e., most frequent outcome • Keep exact history information • Pattern of n most recent outcomes • Example: • BHR: n most recent branch outcomes • Use PC and BHR (xor?) to access prediction table
Pattern-based Prediction • Nested loops: for i = 0 to N { for j = 0 to 3 { … } } • Branch outcome stream for the j-loop branch • 11101110111011101110 • Patterns: • 111 -> 0 • 110 -> 1 • 101 -> 1 • 011 -> 1 • 100% accuracy • Learning time: 4 instances • Table index: (PC, 3-bit history)
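The pattern table above can be reproduced with a short simulation. This is a sketch under simplifying assumptions: only one branch is modeled, so the PC part of the table index is dropped; each entry stores just the last outcome seen after that history; and patterns not yet seen default to "taken". The name `run_pattern_predictor` is hypothetical.

```python
def run_pattern_predictor(stream: str, n: int = 3, default: str = "1"):
    """Predict each outcome from the n most recent outcomes.

    Returns (table, results), where table maps history -> next outcome and
    results[i] says whether the i-th prediction (made once n outcomes are known) was right.
    """
    table = {}
    history = ""
    results = []
    for actual in stream:
        if len(history) == n:
            predicted = table.get(history, default)
            results.append(predicted == actual)
            table[history] = actual          # learn what followed this pattern
        history = (history + actual)[-n:]    # shift the newest outcome in
    return table, results


stream = "1110" * 5  # the slide's outcome stream for the inner-loop branch
table, results = run_pattern_predictor(stream)
print(table)
print(results.count(False), "misprediction(s) out of", len(results))
```

Once each of the four patterns has been seen, every later prediction is correct, matching the slide's 100% accuracy after a learning time of 4 instances.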
Two-level Branch Predictors • A branch outcome depends on the outcomes of previous branches • First level: Branch History Registers (BHR) • Global history / Branch correlation: past executions of all branches • Self history / Private history: past executions of the same branch • Second level: Pattern History Table (PHT) • Use first level information to index a table • Possibly XOR with the branch address • PHT: Usually saturating 2 bit counters • Also private, shared or global
Gshare Predictor (McFarling) • PC and BHR can be • concatenated • completely overlapped • partially overlapped • XORed, etc. • How deep should the BHR be? • Really depends on the program • But deeper increases learning time • May increase quality of information • [Figure: PC and global BHR combined by a function f to index the Branch History Table, which gives the prediction]
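A gshare-style predictor, using the XOR option from the list above, might look like the following sketch. The class name, table size, history depth, counter initialization (2, "weakly taken"), and dropping the two low PC bits are all assumptions made for illustration, not details taken from the lecture.

```python
class Gshare:
    """Global history XORed with the PC indexes a table of 2-bit saturating counters."""

    def __init__(self, index_bits: int = 10, history_bits: int = 10):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.table = [2] * (1 << index_bits)  # all counters start weakly taken
        self.bhr = 0                          # global branch history register

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.bhr) & self.index_mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        # Shift the actual outcome into the history after updating the counter.
        self.bhr = ((self.bhr << 1) | int(taken)) & self.history_mask


# Replay the inner-loop pattern 1110... from one branch at a hypothetical PC.
predictor = Gshare()
pc = 0x400
stream = "1110" * 50
late_misses = 0
for n, bit in enumerate(stream):
    taken = bit == "1"
    if n >= 100 and predictor.predict(pc) != taken:
        late_misses += 1  # count only mispredictions after warm-up
    predictor.update(pc, taken)
print(late_misses)
```

With a 10-bit history the warm-up is longer than with the 3-bit table of the previous slide, which illustrates the trade-off between history depth and learning time.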
Hybrid Prediction • Combining branch predictors • Use two different branch predictors • Access both in parallel • A third table determines which prediction to use • Two or more predictor components combined • Different branches benefit from different types of history • [Figure: PC indexes GSHARE, Bimodal, ... in parallel; each outputs T/NT and a Selector picks the final T/NT]
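The selector mechanism can be sketched by combining a bimodal and a gshare component. Everything here is an illustrative assumption rather than a specific hardware design: the class names, table sizes, initial counter values, and the policy of updating the selector only when exactly one component was correct.

```python
def bump(counter: int, up: bool) -> int:
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class Bimodal:
    def __init__(self, bits: int = 10):
        self.mask = (1 << bits) - 1
        self.table = [2] * (1 << bits)

    def predict(self, pc: int) -> bool:
        return self.table[(pc >> 2) & self.mask] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = (pc >> 2) & self.mask
        self.table[i] = bump(self.table[i], taken)


class Gshare:
    def __init__(self, bits: int = 10):
        self.mask = (1 << bits) - 1
        self.table = [2] * (1 << bits)
        self.bhr = 0

    def predict(self, pc: int) -> bool:
        return self.table[((pc >> 2) ^ self.bhr) & self.mask] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = ((pc >> 2) ^ self.bhr) & self.mask
        self.table[i] = bump(self.table[i], taken)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


class Hybrid:
    def __init__(self, bits: int = 10):
        self.bimodal, self.gshare = Bimodal(bits), Gshare(bits)
        self.mask = (1 << bits) - 1
        self.selector = [2] * (1 << bits)  # >= 2 means "trust gshare"

    def predict(self, pc: int) -> bool:
        use_gshare = self.selector[(pc >> 2) & self.mask] >= 2
        return self.gshare.predict(pc) if use_gshare else self.bimodal.predict(pc)

    def update(self, pc: int, taken: bool) -> None:
        g_ok = self.gshare.predict(pc) == taken
        b_ok = self.bimodal.predict(pc) == taken
        if g_ok != b_ok:  # train the selector only when the components disagree
            i = (pc >> 2) & self.mask
            self.selector[i] = bump(self.selector[i], g_ok)
        self.bimodal.update(pc, taken)
        self.gshare.update(pc, taken)


def count_late_misses(predictor, stream: str, pc: int = 0x400, skip: int = 300) -> int:
    misses = 0
    for n, bit in enumerate(stream):
        taken = bit == "1"
        if n >= skip and predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


stream = "1110" * 100
bimodal_misses = count_late_misses(Bimodal(), stream)
hybrid_misses = count_late_misses(Hybrid(), stream)
print(bimodal_misses, hybrid_misses)
```

On this pattern the bimodal component keeps predicting taken and misses every loop exit, while the selector learns to trust the gshare component for this branch; a strongly biased branch with no useful history would push the selector the other way.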