Chapter 2: ILP and Its Exploitation • Review simple static pipeline • ILP Overview • Dynamic branch prediction • Dynamic scheduling, out-of-order execution • Multiple issue (superscalar) • Hardware-based speculation • ILP limitation • Intel P6 microarchitecture
Dynamic Scheduling • If an instruction is stalled, later instructions that do not depend on any stalled instruction need not stall, i.e. out-of-order execution • Example: DIVD F0,F2,F4 (long-running); ADDD F10,F0,F8 (depends on DIVD); SUBD F12,F8,F14 (independent of both) • The ADDD is stalled before execution, but the SUBD can go ahead. • Out-of-order execution introduces WAW and WAR hazards
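A minimal Python sketch of the example above, assuming a simple register-name check for read-after-write dependences. The `Instr` tuple and the `depends_on` helper are illustrative names, not part of the slides.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str

def depends_on(later: Instr, earlier: Instr) -> bool:
    """True if `later` reads the register that `earlier` writes (RAW)."""
    return earlier.dest in (later.src1, later.src2)

program = [
    Instr("DIVD", "F0", "F2", "F4"),    # long-running
    Instr("ADDD", "F10", "F0", "F8"),   # reads F0, produced by DIVD
    Instr("SUBD", "F12", "F8", "F14"),  # reads neither F0 nor F10
]

# Index 0 (the DIVD) is treated as stalled; anything that depends on a
# blocked instruction, directly or transitively, is blocked as well.
blocked = {0}
for i, ins in enumerate(program[1:], start=1):
    if any(depends_on(ins, program[j]) for j in blocked):
        blocked.add(i)

can_proceed = [ins.op for i, ins in enumerate(program) if i not in blocked]
print(can_proceed)
```

Only the SUBD remains outside the blocked set, which is the instruction the slide says can go ahead.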
Splitting Instruction Decode • Single “Instruction Decode” stage split into 2 parts: • Instruction Issue or dispatch (in-order) • Determine instruction type • Check for structural hazards • Read Operands (can be out-of-order) • Stall instruction until no data hazards • Read operands • Release instruction to begin execution • Need some sort of queue or buffer to hold instructions till their operands are ready. • Note: Out-of-order completion makes precise exception handling difficult! How to handle? • (Figure: Instruction Decode split into Issue, Queue, and Read Operand.)
Tomasulo’s Algorithm • Tomasulo’s algorithm: • Another approach for dynamic scheduling, out-of-order execution • First used in IBM 360/91 FPU, many years ago • Based on key concept of dynamic register renaming • Like static renaming we used in loop-unroll example • Some features: • Copes with long-latency operations (FPU or mem.) • Eliminates WAR & WAW hazards without stalling • Instructions begin execution as soon as their operands are ready; results are forwarded directly, bypassing the register file • Distributed hazard detection and execution control
Tomasulo’s Algorithm • Key differences (from Scoreboarding): • Hazard detection & instruction issue are done per execution unit • Data results go straight to where they are needed, via the CDB • Loads/stores get their own execution units • Use reservation stations for register renaming • (Figure: block diagram with Instruction Fetch, Instruction Queue, Issue Logic/Control Unit, Register File, Common Data Bus (CDB), and a Reservation Station in front of each of Execution unit 1, Execution unit 2, …)
Components of a Tomasulo Unit • Reservation stations (RSs) • Buffer the operands of pending instructions while they wait to enter the execution units. • Issue logic • Redirects (renames) instructions’ register outputs to reservation-station slots. • Results go directly to RSs rather than through the register file. • Distributed hazard detection • Handled separately by each functional unit • Load & store buffers (can be combined with RS) • Queue up memory access requests
Major Steps in Tomasulo (Fig 2.12) • Issue • Get instruction from FP instruction queue • If a slot in the appropriate RS (or load-store buffer) is available, send the instruction there; else stall it (structural hazard). • Send operand values to the RS if already available; otherwise, record the names of the RSs that will produce the operands • Execute • While operands are not yet available, monitor the CDB for them. • When all operands are in the RS, begin executing the instruction. • Write result • When the result is available & the CDB is free, write the result to the CDB, then to registers & RS/store slots for receiving instructions. • Update register status, RS’s value, flag, busy state, etc.
Example for Tomasulo’s Algorithm • We will go through the same code fragment to see how Tomasulo’s Algorithm handles out-of-order execution. • 1. LD F6,34(R2) • 2. LD F2,45(R3) • 3. MULTD F0,F2,F4 • 4. SUBD F8,F6,F2 • 5. DIVD F10,F0,F6 • 6. ADDD F6,F8,F2 • (Figure annotations mark the data dependences, anti-dependences, and output dependences among these instructions.)
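The three kinds of dependence in this fragment can be listed with a short Python sketch. It is a simplified pairwise check that compares register names only and ignores intervening redefinitions, which is sufficient for this six-instruction sequence. The tuple layout and the `find_dependences` name are assumptions made for illustration.

```python
# (op, destination register, source registers), numbered 1..6 as on the slide
program = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]

def find_dependences(prog):
    """Return (kind, earlier, later, register) for every ordered pair."""
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(prog):
        for i in range(j):
            _, dest_i, srcs_i = prog[i]
            if dest_i in srcs_j:      # later reads what earlier writes
                found.append(("RAW", i + 1, j + 1, dest_i))
            if dest_j in srcs_i:      # later overwrites what earlier reads
                found.append(("WAR", i + 1, j + 1, dest_j))
            if dest_j == dest_i:      # both write the same register
                found.append(("WAW", i + 1, j + 1, dest_j))
    return found

deps = find_dependences(program)
for kind in ("RAW", "WAR", "WAW"):
    print(kind, [(a, b, r) for k, a, b, r in deps if k == kind])
```

RAW is the data dependence, WAR the anti-dependence, and WAW the output dependence; the WAR and WAW entries all involve F6, the register the later slides focus on.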
Reservation Station Fields • In each slot: • Op - The operation to perform on operands S1 & S2 • Qj, Qk - The RS slots that will produce S1, S2 • Vj, Vk - The values of S1 & S2. • Busy - RS & its execution unit are occupied • In register file entries & store buffer slots: • Qi - The RS slot containing the op whose result should be stored here. • In load and store buffers (combined in RS): • A - Holds the effective address for the load or store.
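Using these fields, the issue, execute, and write-result steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the 360/91 design: there is no timing, no load/store `A` field, and one station per unit. The names `Station`, `OPS`, `regs`, and `reg_status` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Station:
    name: str                    # tag used for renaming, e.g. "Add1"
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations that will produce them
    qk: Optional[str] = None

OPS: Dict[str, Callable[[float, float], float]] = {
    "ADDD": lambda a, b: a + b,
    "DIVD": lambda a, b: a / b,
}
regs: Dict[str, float] = {"F0": 0.0, "F2": 6.0, "F4": 2.0, "F8": 1.0, "F10": 0.0}
reg_status: Dict[str, str] = {}  # Qi: register -> tag of its pending producer

def issue(pool: List[Station], op: str, dest: str, s1: str, s2: str) -> Optional[Station]:
    rs = next((r for r in pool if not r.busy), None)
    if rs is None:
        return None              # structural hazard: the instruction stalls
    rs.busy, rs.op = True, op
    # Copy the value if the register is current, else record the producer's tag.
    rs.qj, rs.vj = (reg_status[s1], None) if s1 in reg_status else (None, regs[s1])
    rs.qk, rs.vk = (reg_status[s2], None) if s2 in reg_status else (None, regs[s2])
    reg_status[dest] = rs.name   # rename the destination to this station
    return rs

def ready(rs: Station) -> bool:
    return rs.busy and rs.qj is None and rs.qk is None

def execute(rs: Station) -> float:
    return OPS[rs.op](rs.vj, rs.vk)

def write_result(rs: Station, value: float, stations: List[Station]) -> None:
    for other in stations:       # CDB broadcast to every waiting station
        if other.qj == rs.name:
            other.vj, other.qj = value, None
        if other.qk == rs.name:
            other.vk, other.qk = value, None
    for reg in [r for r, tag in reg_status.items() if tag == rs.name]:
        regs[reg] = value        # only registers still renamed to this tag
        del reg_status[reg]
    rs.busy = False

mult_pool, add_pool = [Station("Mult1")], [Station("Add1")]
everyone = mult_pool + add_pool

divd = issue(mult_pool, "DIVD", "F0", "F2", "F4")
addd = issue(add_pool, "ADDD", "F10", "F0", "F8")
waiting_on = addd.qj             # the ADDD holds a tag, not a value, for F0
ready_before = ready(addd)
write_result(divd, execute(divd), everyone)
ready_after = ready(addd)
write_result(addd, execute(addd), everyone)
print(waiting_on, ready_before, ready_after, regs["F10"])
```

The ADDD issues while the DIVD is still pending and becomes ready only when the DIVD's tag appears on the broadcast.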
Tomasulo Example • (Figure: initial state, showing the instruction stream, 3 load buffers, 3 FP adder R.S., 2 FP multiplier R.S., the FU countdown timers, and the clock cycle counter.)
Cycle 2 • Note: Can have multiple loads outstanding
Cycle 3 • Note: register names are removed (“renamed”) in Reservation Stations; MULTD issued • Load1 completing; what is waiting for Load1?
Cycle 4 • Load2 completing; what is waiting for Load2?
Cycle 5 • Timer starts down for Add1, Mult1
Cycle 6 • Issue ADDD here despite name dependency on F6?
Cycle 7 • Add1 (SUBD) completing; what is waiting for it?
Cycle 10 • Add2 (ADDD) completing; what is waiting for it?
Cycle 11 • Write result of ADDD here? • All quick instructions complete in this cycle!
Cycle 15 • Mult1 (MULTD) completing; what is waiting for it?
Cycle 16 • Just waiting for Mult2 (DIVD) to complete
Cycle 56 • Mult2 (DIVD) is completing; what is waiting for it?
Cycle 57 • Once again: In-order issue, out-of-order execution, and out-of-order completion.
Tomasulo’s Two Major Advantages • Distribution of the hazard detection logic • distributed reservation stations and the CDB • If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB • If a centralized register file were used, the units would have to read their results from the registers when register buses are available • Elimination of stalls for WAW and WAR hazards
Elimination of WAR Hazards • Note the potential WAR hazard between DIVD and ADDD involving F6. • But, as soon as DIVD enters the RS, it becomes independent of the ADDD! • The 2nd source operand no longer refers to F6, but stores the value of F6 produced earlier by the LD. • If the LD had not yet completed, the 2nd operand would then refer to its R.S., but still not to F6! • So, ADDD can write its new value for F6 before DIVD executes, without messing it up!
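The operand capture described above can be sketched in Python. The numeric values and the dictionary layout are made up for illustration; only the field names Qj/Vj/Qk/Vk and the station tag "Mult1" follow the slides.

```python
# Register file after the first LD has written F6 (values are made up).
regs = {"F6": 2.0, "F8": 5.0, "F2": 1.0}

# DIVD F10,F0,F6 issues: F0 is still being computed by Mult1, so the
# station records a tag for it, but copies the current VALUE of F6.
divd = {"Qj": "Mult1", "Vj": None, "Qk": None, "Vk": regs["F6"]}

# ADDD F6,F8,F2 finishes first and overwrites F6.
regs["F6"] = regs["F8"] + regs["F2"]

# Later the MULTD result is broadcast on the CDB with its tag.
cdb_tag, cdb_value = "Mult1", 8.0
if divd["Qj"] == cdb_tag:
    divd["Vj"], divd["Qj"] = cdb_value, None

# DIVD executes with the buffered operand, not the overwritten register.
quotient = divd["Vj"] / divd["Vk"]
print(quotient, regs["F6"])
```

Had the DIVD read F6 from the register file at execution time, it would have divided by the ADDD's result instead of the value loaded by the LD.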
Elimination of WAW Hazards • Note the potential WAW hazard between the first LD and the last ADDD involving F6. • But as soon as the ADDD is issued, the register status table is updated with F6 assigned to “Add2”. • So the LD, when it completes, will not update F6, which eliminates the WAW hazard.
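A small Python sketch of the register status check, assuming the hypothetical case where the LD is still outstanding when the ADDD issues. The values and the `write_back` helper are illustrative.

```python
regs = {"F6": 0.0}
reg_status = {"F6": "Load1"}   # LD F6,... issued: F6 waits on Load1

# ADDD F6,F8,F2 issues before the load completes: F6 is re-assigned.
reg_status["F6"] = "Add2"

def write_back(tag: str, value: float) -> None:
    """Update only registers whose status entry still names this producer."""
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        regs[reg] = value
        del reg_status[reg]

write_back("Load1", 1.5)       # late load result: no register names Load1
after_load = regs["F6"]
write_back("Add2", 9.0)        # ADDD result: F6 is updated
print(after_load, regs["F6"])
```

The late load result is dropped for F6 because the status table no longer names Load1, so the final value is the ADDD's regardless of completion order.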
Tomasulo Drawbacks • Complexity • delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon! • Many associative stores (CDB) at high speed • Performance limited by Common Data Bus • Each CDB must go to multiple functional units → high capacitance, high wiring density • Number of functional units that can complete per cycle limited to one! • Multiple CDBs → more FU logic for parallel associative stores • Non-precise interrupts! • this will be addressed later
Overlapping Loop Iterations • Register renaming • Multiple iterations use different physical destinations for registers (dynamic loop unrolling). • Reservation stations • Permit instruction issue to advance past integer control flow operations • Also buffer old values of registers, totally avoiding the WAR stall • Other perspective: Tomasulo builds the data-flow dependency graph on the fly • Note, branch prediction is still needed!
Dynamic Loop Scheduling • Loop example: • Loop: LD F0,0(R1); MULTD F4,F0,F2; SD 0(R1),F4; SUBI R1,R1,#8; BNEZ R1,Loop • Note data dependences can span loop iterations. • But, using Tomasulo & predict-taken, multiple iterations can issue and begin execution simultaneously! • Like dynamic loop unrolling by the HW.
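The renaming across iterations can be illustrated with a Python sketch that issues two iterations of the FP part of the loop. It assumes unlimited stations, a correctly predicted taken branch, and leaves out the integer SUBI/BNEZ; the tag scheme ("Load1", "Mult1", "Store1") and the `issue` helper are hypothetical.

```python
from itertools import count

UNIT = {"LD": "Load", "MULTD": "Mult", "SD": "Store"}
counters = {unit: count(1) for unit in UNIT.values()}
reg_status = {}   # register -> tag of the station that will produce it
issued = []       # (tag, op, renamed sources), in issue order

def issue(op, dest, srcs):
    tag = f"{UNIT[op]}{next(counters[UNIT[op]])}"
    # A source with a pending producer is replaced by that producer's tag.
    renamed = [reg_status.get(s, s) for s in srcs]
    if dest is not None:
        reg_status[dest] = tag
    issued.append((tag, op, renamed))
    return tag

for iteration in range(2):
    issue("LD", "F0", ["R1"])
    issue("MULTD", "F4", ["F0", "F2"])
    issue("SD", None, ["F4", "R1"])

for entry in issued:
    print(entry)
```

Although both iterations name F0 and F4, the second MULTD waits on Load2 rather than Load1, so the two iterations share no FP register dependence and can overlap.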