Instruction Level Parallelism and Tomasulo’s Approach
CSCI 620 NOTE8
Instruction Level Parallelism
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Reduce stalls, reduce CPI
• Reduce CPI, increase IPC
• Instruction-level parallelism (ILP) seeks to reduce stalls
• The importance of ILP is most visible in loop-level parallelism:
    for (i=1; i<1000; i=i+1)
    {
        x[i] = x[i] + y[i];
    }
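As a rough illustration of the CPI equation above, the sketch below plugs in made-up stall rates to show how removing data hazard stalls lowers CPI and raises IPC. The function names `pipeline_cpi` and `ipc` and all the numbers are assumptions chosen for illustration; they do not come from the notes.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + stall cycles per instruction from each source."""
    return ideal_cpi + structural + data_hazard + control


def ipc(cpi):
    """IPC is the reciprocal of CPI."""
    return 1.0 / cpi


# Hypothetical stall rates, in stall cycles per instruction.
before = pipeline_cpi(ideal_cpi=1.0, structural=0.125, data_hazard=0.5, control=0.25)
# Same machine after ILP techniques remove most of the data hazard stalls.
after = pipeline_cpi(ideal_cpi=1.0, structural=0.125, data_hazard=0.125, control=0.25)

print(f"before: CPI={before:.3f} IPC={ipc(before):.3f}")
print(f"after:  CPI={after:.3f} IPC={ipc(after):.3f}")
```

With these assumed numbers, only the data hazard term changes, so the drop in CPI is exactly the stall cycles that were removed.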
Major Techniques to increase ILP
Instruction Level Parallelism
• ILP is exploited by SW (static) or HW (dynamic) techniques
• HW-intensive ILP dominates the desktop and server markets
• SW (compiler-intensive) approaches are more likely to be seen in embedded systems, although IA-64 also uses this approach
Dependences
• Two instructions are parallel if they can execute simultaneously in a pipeline without causing any stalls (assuming no structural hazards) and can be reordered
• Two instructions that are dependent are not parallel and cannot be reordered; they must execute in order, even though they can be partially overlapped
• Three types of dependences
  – Data dependences (= true data dependences)
  – Name dependences
  – Control dependences
Dependences
• Dependences are properties of programs
• Whether a dependence results in an actual hazard (and how long the stall is) is a property of the pipeline organization
• A dependence
  – Indicates the potential for a hazard
  – Determines the order in which results must be calculated
  – Sets an upper bound on ILP
• Problems caused by dependences can be handled by:
  – Avoiding the hazard by rescheduling the code
  – Eliminating the dependence by transforming (altering) the code
• The compiler is concerned with dependences in the program; whether a HW hazard occurs depends on the given pipeline
Review of Data Hazards
• Consider instructions i and j, where i occurs before j in program order.
• RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value
• WAW (write after write): j tries to write an operand before it is written by i, so the writes happen in the wrong order and the value from i is left in the destination (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled)
• WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline)
(1) Data Dependences
• (True) data dependences: instruction j is data dependent on instruction i if either
  – Instruction i produces a result used by instruction j (directly), or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (indirectly)
• Example (j reads F0, which i writes):
    i: ADD.D F0, F2, F4
    j: SUB.D F6, F0, F8
• Easy to determine for registers (fixed names)
• Harder to determine for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R4) = 20(R4)?
• Will see hardware technique in chap 2
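To make the ADD.D / SUB.D pair concrete, here is a minimal sketch that runs the two instructions in program order and in swapped order on a toy register file. The helper `run`, the tuple encoding of instructions, and the initial register values are assumptions made up for this illustration.

```python
def run(order, regs):
    """Execute (dest, op, src1, src2) tuples in the given order on a copy of regs."""
    regs = dict(regs)
    for dest, op, a, b in order:
        regs[dest] = regs[a] + regs[b] if op == "+" else regs[a] - regs[b]
    return regs


# Hypothetical initial register contents.
init = {"F0": 0.0, "F2": 1.0, "F4": 2.0, "F6": 0.0, "F8": 0.5}
i = ("F0", "+", "F2", "F4")   # i: ADD.D F0, F2, F4
j = ("F6", "-", "F0", "F8")   # j: SUB.D F6, F0, F8

in_order = run([i, j], init)  # j sees the F0 produced by i
swapped = run([j, i], init)   # j reads the stale F0: the RAW hazard

print(in_order["F6"], swapped["F6"])
```

In the swapped run, j computes with the old F0, which is exactly the wrong result a RAW hazard would produce if the hardware let j read too early.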
(2) Name Dependences
• The second type of dependence is the name dependence: two instructions use the same name (same register or memory location) but do not exchange data
• Antidependence
  – Instruction j writes a register or memory location that instruction i reads, and instruction i must be executed first; if not, a WAR hazard results
      i: ADD.D F0, F2, F4
      j: SUB.D F2, F6, F8
• Output dependence
  – Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved; if not, a WAW hazard results
      i: ADD.D F0, F2, F4
      j: SUB.D F0, F6, F8
• Name dependences are harder to handle for memory accesses
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R4) = 20(R4)?
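The register cases of true, anti, and output dependence can be checked mechanically by comparing destination and source names. The sketch below is a simplified classifier for a pair of instructions; the `Instr` type and the `dependences` function are hypothetical names, and the check covers registers only (it ignores memory addresses and any instructions in between).

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Classify the register dependences of a later instruction j on an earlier i."""
    found = set()
    if i.dest in j.srcs:
        found.add("true (RAW)")     # j reads what i writes
    if j.dest in i.srcs:
        found.add("anti (WAR)")     # j overwrites a name that i reads
    if j.dest == i.dest:
        found.add("output (WAW)")   # both write the same name
    return found


i = Instr("ADD.D", "F0", ("F2", "F4"))
print(dependences(i, Instr("SUB.D", "F6", ("F0", "F8"))))  # true dependence
print(dependences(i, Instr("SUB.D", "F2", ("F6", "F8"))))  # antidependence
print(dependences(i, Instr("SUB.D", "F0", ("F6", "F8"))))  # output dependence
```

The memory case is harder precisely because this name comparison no longer works: 100(R4) and 20(R6) are different names that may still refer to the same address.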
Register Renaming eliminates WAR & WAW
• Assuming temporary registers S and T:
    Original                  Renamed
    DIV.D F0, F2, F4          DIV.D F0, F2, F4
    ADD.D F6, F0, F8          ADD.D S, F0, F8
    S.D   F6, 0(R1)           S.D   S, 0(R1)
    SUB.D F8, F10, F14        SUB.D T, F10, F14
    MUL.D F6, F10, F8         MUL.D F6, F10, T
• Dependences in the original code:
  – (True) data dependences: (1) DIV.D to ADD.D (F0), (2) ADD.D to S.D (F6), (3) SUB.D to MUL.D (F8)
  – Antidependences (WAR): ADD.D to SUB.D (F8); the S.D to MUL.D pair on F6 is also an antidependence
  – Output dependences (WAW): ADD.D to MUL.D (F6)
• Which dependences are eliminated by renaming?
  – Any subsequent use of F8 must be replaced by T
  – How about F6? Later uses do not need to be replaced the way F8 was, because MUL.D writes F6 and later instructions should see that value
• WAR & WAW are eliminated by register renaming, which will be implemented in hardware
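The renaming idea can be sketched in a few lines. This version is more aggressive than the slide: instead of introducing only S and T, it gives every destination a fresh name (P0, P1, ...), which is one common way to describe renaming. The names `rename` and `name_hazards`, the tuple encoding, and the P-register names are assumptions for illustration; the pairwise check ignores memory and intervening writes.

```python
from itertools import count


def name_hazards(program):
    """Return (earlier, later, kind) for every pairwise WAR/WAW register name dependence."""
    found = []
    for a in range(len(program)):
        _, dest_a, srcs_a = program[a]
        for b in range(a + 1, len(program)):
            _, dest_b, _ = program[b]
            if dest_b is None:          # stores write no register
                continue
            if dest_b in srcs_a:
                found.append((a, b, "WAR"))
            if dest_b == dest_a:
                found.append((a, b, "WAW"))
    return found


def rename(program):
    """Give each destination a fresh name; sources follow the latest mapping."""
    mapping = {}                              # architectural name -> current name
    fresh = (f"P{n}" for n in count())
    out = []
    for op, dest, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            mapping[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


original = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),     # S.D F6, 0(R1): reads F6 and R1
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(original)
print(name_hazards(original))
print(name_hazards(renamed))
```

On the original sequence the check reports the ADD.D to SUB.D antidependence, the ADD.D to MUL.D output dependence, and the S.D to MUL.D antidependence on F6; on the renamed sequence it reports none, while the true dependences (through P0, P1, and P2) are preserved.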
(3) Control Dependence
• The final kind of dependence is the control dependence
• Example:
    if p1 {S1;};
    if p2 {S2;}
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
• Note that S2 could be data dependent on S1.
Control Dependences
• Two (obvious) constraints on control dependences:
  – An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
  – An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
• Example for the first constraint (S1 may not be moved above "if p1"):
    if p1 {S1;};
    if p2 {S2;}
• Example for the second constraint (S3 may not be moved inside either "if"):
    if p1 {S1;};
    S3;
    if p2 {S2;}
Limitations of Scoreboarding (scoreboard hardware on next slide)
• No forwarding hardware
• Limited to instructions in a basic block (small window)
• Small number of functional units (structural hazards), especially integer/load/store units (only one each)
• Cannot issue if there is a structural or WAW hazard
• Must wait until WAR hazards are resolved
• Imprecise exceptions due to out-of-order execution
Improvement? Tomasulo’s Approach
Scoreboard Hardware: centralized control by the scoreboard
[Figure A.50: The basic structure of a MIPS processor with a scoreboard. Labels: Registers, Data buses, FP mult (two units), FP divide, FP add, Integer unit, Scoreboard, Control/status flows.]
• Scoreboard originally proposed in the CDC 6600 (Seymour Cray, 1964)
Scoreboard: Functional Unit Status Fields
• Busy – Indicates whether the unit is busy or not
• Op – Operation to perform in the unit (e.g., add or subtract)
• Fi – Destination register
• Fj, Fk – Source-register numbers
• Qj, Qk – Functional units producing source registers Fj, Fk
• Rj, Rk – Flags indicating when Fj, Fk are available and not yet read
Tomasulo’s Algorithm
• Developed for the IBM 360/91, about 3 years after the CDC 6600 (late 1960s)
• Goal: high performance without special compilers
• Differences between Tomasulo’s algorithm and the scoreboard (similar to scoreboarding, but with register renaming added)
  – Control & buffers (called “reservation stations”) are distributed with the functional units rather than centralized in the scoreboard: scoreboard/instruction buffer => reservation stations for each FU
  – Registers in instructions are replaced by pointers to reservation station buffers
  – HW renaming of registers avoids WAR and WAW hazards
  – A common data bus (CDB) broadcasts results to the functional units
  – Loads and stores are treated as functional units as well
• Very importantly
  – Tomasulo’s algorithm has been adopted in many modern CPUs: Alpha 21264, HP PA-8000, MIPS R10K, Pentium III, Pentium 4, PowerPC 604, etc.
Key concept: Reservation Stations (RS)
• Distributed (rather than centralized) control scheme
  – Bypassing (data goes directly to an RS rather than via registers) is provided by the Common Data Bus (CDB)
  – Register renaming eliminates WAR/WAW hazards
• Scoreboard/instruction buffer => reservation stations
  – Fetch and buffer operands as soon as they are available
  – Eliminates the need to always get values from registers at execute
  – Pending instructions designate the reservation stations that will provide their inputs
  – Successive writes to a register cause only the last one to update the register
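The last point, that only the most recent pending writer updates a register, can be sketched with a small register-status table. The class `RegisterFile`, its method names, and the station tags "Add1" and "Mult1" are assumptions for illustration, not a description of any real hardware interface.

```python
class RegisterFile:
    """Toy register file with a per-register tag naming the pending writer."""

    def __init__(self):
        self.value = {}    # register -> last value actually written
        self.status = {}   # register -> tag of the station that will write it

    def issue(self, dest, tag):
        # A newer instruction writing dest replaces any older pending writer.
        self.status[dest] = tag

    def broadcast(self, tag, result):
        # Accept a result only in registers still waiting for this tag.
        for reg in [r for r, t in self.status.items() if t == tag]:
            self.value[reg] = result
            del self.status[reg]


rf = RegisterFile()
rf.issue("F6", "Add1")      # older instruction: ADD.D F6, ...
rf.issue("F6", "Mult1")     # newer instruction: MUL.D F6, ...
rf.broadcast("Mult1", 6.0)  # newer result arrives first and is written
rf.broadcast("Add1", 5.0)   # older result arrives late and is ignored by F6
print(rf.value["F6"])
```

Because the register remembers only the latest tag, the late result from the older instruction cannot overwrite F6, which is how the WAW hazard disappears without stalling.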
[Figure: MIPS floating-point unit using Tomasulo’s algorithm]
Details
• Each reservation station holds an instruction that has been issued and is waiting for execution. The instruction either already has all its operands or holds the name(s) of the RSs or load buffers that will provide them. These name fields are called “tags”: 4 bits each, enough to denote one of the 5 RSs and 6 load buffers. RSs are used for renaming
• Load buffers and store buffers behave almost exactly like RSs
• All results from the FUs and from memory are sent on the Common Data Bus, which is connected everywhere except the load buffers
Three Stages of Tomasulo’s Algorithm
• 1. Issue: get the next instruction from the FP operation queue (FIFO)
  – If a reservation station is free, issue the instruction and send the operands if they are available in registers; otherwise record the name of the reservation station (FU) that will produce them, which is the renaming step. If no reservation station is free, stall (structural hazard). Avoids WAR & WAW
• 2. Execution: operate on operands (EX)
  – When both operands are ready (already in Vj/Vk or captured from the CDB), execute; if not ready, watch the common data bus for the result. RAW avoided
• 3. Write result: finish execution (WB)
  – Write on the common data bus so that all waiting units can pick up the result; mark the reservation station as available
• Common data bus: 64 bits of data + 4 bits of source (“come from”)
Data Buses in Tomasulo’s Algorithm
• Normal data bus: data + destination (“go to” bus)
• CDB (Common Data Bus): data + source (“come from” bus)
  – 64 bits of data + 4 bits of functional unit source address (the RS’s number)
  – Any receiving unit (store buffer, RSs, FP registers) accepts (writes) the data if the RS number on the bus matches the number it is waiting for
Reservation Station Components
• Op – Operation to perform in the unit (e.g., + or – )
• Qj, Qk – Names of the reservation stations that will produce the source operands; no values are stored here
• Vj, Vk – Values of the source operands; these act as the temporary registers for renaming
• Busy – Indicates that the reservation station and its FU are busy
• Register result status – Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
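The three stages and the reservation station fields can be put together in a small cycle-level model. This is a simplified sketch under stated assumptions, not the IBM 360/91 design: it handles only two-operand FP arithmetic (no loads, stores, or branches), issues at most one instruction per cycle, lets any number of stations write results in the same cycle, and uses the latencies quoted elsewhere in these notes (add 2, multiply 10, divide 40). The names `Station`, `simulate`, `LATENCY`, `APPLY`, and `UNIT` are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = {"ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}
APPLY = {"ADD.D": lambda a, b: a + b, "SUB.D": lambda a, b: a - b,
         "MUL.D": lambda a, b: a * b, "DIV.D": lambda a, b: a / b}
UNIT = {"ADD.D": "Add", "SUB.D": "Add", "MUL.D": "Mult", "DIV.D": "Mult"}


@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # operand values
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations producing the operands
    qk: Optional[str] = None
    left: int = 0                # execution cycles remaining


def simulate(program, regs):
    """Run (op, dest, src1, src2) tuples; updates regs and returns the cycle count."""
    stations = [Station(n) for n in ("Add1", "Add2", "Add3", "Mult1", "Mult2")]
    status = {}                  # register result status: register -> tag
    pc, cycle = 0, 0
    while pc < len(program) or any(s.busy for s in stations):
        cycle += 1
        # 3. Write result: broadcast (tag, value) on the CDB; listeners match on tag.
        for s in stations:
            if s.busy and s.qj is None and s.qk is None and s.left == 0:
                result = APPLY[s.op](s.vj, s.vk)
                for t in stations:
                    if t.qj == s.name:
                        t.vj, t.qj = result, None
                    if t.qk == s.name:
                        t.vk, t.qk = result, None
                for reg in [r for r, tag in status.items() if tag == s.name]:
                    regs[reg] = result
                    del status[reg]
                s.busy = False
        # 2. Execute: only stations holding both operand values make progress.
        for s in stations:
            if s.busy and s.qj is None and s.qk is None and s.left > 0:
                s.left -= 1
        # 1. Issue: in order, and only if a station of the right kind is free.
        if pc < len(program):
            op, dest, src1, src2 = program[pc]
            free = next((s for s in stations
                         if not s.busy and s.name.startswith(UNIT[op])), None)
            if free is not None:
                free.busy, free.op, free.left = True, op, LATENCY[op]
                free.qj, free.vj = (status[src1], None) if src1 in status else (None, regs[src1])
                free.qk, free.vk = (status[src2], None) if src2 in status else (None, regs[src2])
                status[dest] = free.name   # renaming: dest now names this station
                pc += 1
    return cycle


# Hypothetical initial values; the code is the renaming example without the store.
regs = {"F2": 8.0, "F4": 2.0, "F8": 1.0, "F10": 3.0, "F14": 1.0}
program = [("DIV.D", "F0", "F2", "F4"), ("ADD.D", "F6", "F0", "F8"),
           ("SUB.D", "F8", "F10", "F14"), ("MUL.D", "F6", "F10", "F8")]
cycles = simulate(program, regs)
print(cycles, regs["F0"], regs["F6"], regs["F8"])
```

In this run SUB.D and MUL.D finish long before ADD.D, which is waiting on the divide. ADD.D still uses the old F8 because it copied the value at issue (no WAR problem), and its late result is not written to F6 because the register status already names the MUL.D station (no WAW problem), so the final registers match sequential execution.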
Load & Store require 2 steps:
• Step 1: Compute the effective address (ea)
• Step 2: Place the ea in the buffer
Execution (load or store) can start when the memory unit is not busy
Wait until DIVD finishes. Divide takes 40 cycles.
Tomasulo vs. Scoreboard
Assuming (for the scoreboard): add takes 2 clock cycles, multiply 10, divide 40
• Why does it take longer on the scoreboard of the CDC 6600?
  – Structural hazards
  – Lack of forwarding
• Both use in-order issue and out-of-order execution
• The scoreboard cannot eliminate WAR & WAW hazards; it stalls on them
• Tomasulo can, with register renaming
• Both will stall on a branch instruction; later we will see Tomasulo with speculation
Let’s try this site: http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo/AppletTomasulo.html
Tomasulo’s Algorithm: A Loop-Based Example
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
• Multiply takes 4 clocks
• Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit); on a cache miss, a block (several words) is brought into the cache
• Reality: integer instructions run ahead
Cache miss occurs, so LD must wait for 8 cycles