
Chapter 2: ILP and Its Exploitation



Presentation Transcript


  1. Chapter 2: ILP and Its Exploitation • Review simple static pipeline • ILP Overview • Dynamic branch prediction • Dynamic scheduling, out-of-order execution • Multiple issue (superscalar) • Hardware-based speculation • ILP limitation • Intel P6 microarchitecture

  2. Dynamic Scheduling • If an instruction is stalled, there’s no need to stall later instructions that don’t depend on any of the stalled instructions; they can execute out of order • Example: DIVD F0,F2,F4 (long-running); ADDD F10,F0,F8 (depends on DIVD); SUBD F12,F8,F14 (independent of both) • The ADDD is stalled before execution, but the SUBD can go ahead. • Out-of-order execution introduces WAW and WAR hazards

  3. Splitting Instruction Decode • Single “Instruction Decode” stage split into 2 parts: • Instruction Issue or dispatch (in-order) • Determine instruction type • Check for structural hazards • Read Operands (can be out-of-order) • Stall instruction until no data hazards • Read operands • Release instruction to begin execution • Need some sort of queue or buffer to hold instructions till their operands are ready. • Note: Out-of-order completion makes precise exception handling difficult! How to handle? [Figure labels: Instruction Decode; Issue; Queue; Read Operand]

  4. Tomasulo’s Algorithm • Tomasulo’s algorithm: • Another approach for dynamic scheduling, out-of-order execution • First used in IBM 360/91 FPU, many years ago • Based on key concept of dynamic register renaming • Like static renaming we used in loop-unroll example • Some features: • Copes with long-latency operations (FPU or mem.) • Eliminates WAW & WAR hazards without stalling • Instructions execute as soon as their operands are ready, with results forwarded directly, bypassing the register file • Distributed hazard detection and execution control

  5. Tomasulo’s Algorithm • Key differences (from Scoreboarding): • Hazard detection & instruction issue are done per execution unit • Data results go straight to where they are needed, via the CDB • Loads/stores get their own execution units • Use Reservation Stations for register renaming [Figure labels: Instruction Fetch; Instruction Queue; Issue Logic / Control Unit; Register File; Reservation Station and Execution unit 1; Reservation Station and Execution unit 2; …; Common Data Bus (CDB)]

  6. Components of a Tomasulo Unit • Reservation stations (RSs) • Buffer pending instructions and their operands while they wait for the remaining operands before entering the execution units. • Issue logic • Redirects (renames) instructions’ register outputs to reservation-station slots. • Results go directly to RSs rather than through the register file. • Distributed hazard detection • Handled separately by each functional unit • Load & store buffers (can be combined with RS) • Queue up memory access requests

  7. Simple FPU using Tomasulo’s Algorithm

  8. Major Steps in Tomasulo (Fig 2.12) • Issue • Get instruction from FP instruction queue • If a slot in appropriate RS (or load-store buffer) is available, send instruction there; else stall it (structural hazard). • Send operand values to the RS if already available; otherwise, just record the names of the RSs that will produce the operands • Execute • While operands are not yet available, monitor CDB for them. • When all operands are in the RS, begin executing instruction. • Write result • When result is available & CDB is free, write result to CDB, then to registers & RS/store slots waiting for it. • Update register status, RS’s value, flag, busy state, etc.

  9. Example for Tomasulo’s Algorithm • We will go through the same code fragment to see how Tomasulo’s Algorithm handles out-of-order execution. • 1. LD F6,34(R2) • 2. LD F2,45(R3) • 3. MULTD F0,F2,F4 • 4. SUBD F8,F6,F2 • 5. DIVD F10,F0,F6 • 6. ADDD F6,F8,F2 [Figure legend: data dependence; anti-dependence; output dependence]
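
The three kinds of dependence in this fragment can be enumerated mechanically. The following is a minimal Python sketch, not part of the original slides: the `Instr` tuple and `find_dependences` function are hypothetical names, the base registers of the loads are treated as sources, and the check is a plain pairwise comparison that ignores intervening redefinitions (which does not matter for this fragment).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


# The six-instruction fragment from the slide, in program order.
PROGRAM = [
    Instr("LD", "F6", ("R2",)),
    Instr("LD", "F2", ("R3",)),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
]


def find_dependences(program):
    """Return (kind, earlier, later, register) tuples, numbering from 1."""
    deps = []
    for i, early in enumerate(program):
        for j in range(i + 1, len(program)):
            late = program[j]
            if early.dest in late.srcs:      # later reads what earlier writes
                deps.append(("RAW", i + 1, j + 1, early.dest))
            if late.dest in early.srcs:      # later writes what earlier reads
                deps.append(("WAR", i + 1, j + 1, late.dest))
            if late.dest == early.dest:      # both write the same register
                deps.append(("WAW", i + 1, j + 1, late.dest))
    return deps


deps = find_dependences(PROGRAM)
for dep in deps:
    print(dep)
```

Under these assumptions the only anti-dependences are SUBD→ADDD and DIVD→ADDD on F6, and the only output dependence is the first LD→ADDD on F6; these are the name dependences that renaming removes.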

  10. Reservation Station Fields • In each slot: • Op - The operation to perform on operands S1 & S2 • Qj, Qk - The RS slots that will produce S1, S2 • Vj, Vk - The values of S1 & S2. • Busy - RS & its execution unit are occupied • In register file entries & store buffer slots: • Qi - The RS slot containing the op whose result should be stored here. • In load and store buffers (combined in RS): • A - Holds the effective address for a load or store.
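
As an illustration of how these fields are used by the issue and write-result steps, here is a minimal Python sketch. It is an assumption-laden simplification, not the 360/91 design: `RSEntry`, `issue`, and `write_result` are hypothetical names, the A field and execution latencies are omitted, and the register values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RSEntry:
    """One reservation-station slot, using the field names from the slide."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of S1, once known
    vk: Optional[float] = None   # value of S2, once known
    qj: Optional[str] = None     # slot that will produce S1 (None if vj is valid)
    qk: Optional[str] = None     # slot that will produce S2 (None if vk is valid)

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def issue(slot, op, dest, src1, src2, stations, reg_value, reg_qi) -> bool:
    rs = stations[slot]
    if rs.busy:
        return False                       # structural hazard: stall
    rs.busy, rs.op = True, op
    # Copy each operand if its register is up to date, else note its producer.
    rs.vj, rs.qj = (None, reg_qi[src1]) if src1 in reg_qi else (reg_value[src1], None)
    rs.vk, rs.qk = (None, reg_qi[src2]) if src2 in reg_qi else (reg_value[src2], None)
    reg_qi[dest] = slot                    # rename: dest now comes from this slot
    return True


def write_result(tag, value, stations, reg_value, reg_qi) -> None:
    for rs in stations.values():           # every waiting slot snoops the CDB
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, q in reg_qi.items() if q == tag]:
        reg_value[reg] = value             # Qi matched: register takes the result
        del reg_qi[reg]
    if tag in stations:
        stations[tag] = RSEntry()          # free the producing slot


stations: Dict[str, RSEntry] = {"Add1": RSEntry(), "Mult1": RSEntry()}
reg_value = {"F2": 0.0, "F4": 4.0, "F6": 6.0}    # made-up values
reg_qi = {"F2": "Load2"}                          # F2 is still pending from Load2

issue("Mult1", "MULTD", "F0", "F2", "F4", stations, reg_value, reg_qi)
issue("Add1", "SUBD", "F8", "F6", "F2", stations, reg_value, reg_qi)
waiting = [s.ready() for s in stations.values()]
write_result("Load2", 10.0, stations, reg_value, reg_qi)
released = [s.ready() for s in stations.values()]
print(waiting, released)
```

In this sketch a single broadcast of Load2's result makes both the MULTD and the SUBD ready at once, since each already holds its other operand.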

  11. Tomasulo Example [Figure annotations: instruction stream; 3 load buffers; FU count down; 3 FP adder R.S.; 2 FP mult R.S.; clock cycle counter]

  12. Cycle 1

  13. Cycle 2 Note: Can have multiple loads outstanding

  14. Cycle 3 • Note: register names are removed (“renamed”) in Reservation Stations; MULT issued • Load1 completing; what is waiting for Load1?

  15. Cycle 4 • Load2 completing; what is waiting for Load2?

  16. Cycle 5 • Timer starts counting down for Add1, Mult1

  17. Cycle 6 • Issue ADDD here despite name dependency on F6?

  18. Cycle 7 • Add1 (SUBD) completing; what is waiting for it?

  19. Cycle 8

  20. Cycle 9

  21. Cycle 10 • Add2 (ADDD) completing; what is waiting for it?

  22. Cycle 11 • Write result of ADDD here? • All quick instructions complete in this cycle!

  23. Cycle 12

  24. Cycle 13

  25. Cycle 14

  26. Cycle 15 • Mult1 (MULTD) completing; what is waiting for it?

  27. Cycle 16 • Just waiting for Mult2 (DIVD) to complete

  28. Cycle 55 (after skipping cycles…)

  29. Cycle 56 • Mult2 (DIVD) is completing; what is waiting for it?

  30. Cycle 57 • Once again: In-order issue, out-of-order execution, and out-of-order completion.

  31. Tomasulo’s Two Major Advantages • Distribution of the hazard detection logic • distributed reservation stations and the CDB • If multiple instructions are waiting on a single result, & each already has its other operand, then they can all be released simultaneously by the broadcast on the CDB • If a centralized register file were used, the units would have to read their results from the registers when register buses are available • Elimination of stalls for WAW and WAR hazards

  32. Elimination of WAR Hazards • Note the potential WAR hazard between DIVD and ADDD involving F6. • But, as soon as DIVD enters the RS, it becomes independent of the ADDD! • The 2nd source operand no longer refers to F6, but stores the value of F6 produced earlier by the LD. • If the LD had not yet completed, the 2nd operand would then refer to its R.S., but still not to F6! • So, ADDD can write its new value for F6 before DIVD executes, without messing it up!
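
A small Python sketch of this argument follows. The register contents and the value broadcast by Mult1 are made up for illustration, and the dictionary layout is a hypothetical stand-in for a reservation-station slot.

```python
# Hypothetical register contents; F0 is still being computed by Mult1.
regs = {"F2": 0.5, "F6": 3.0, "F8": 1.0}
reg_qi = {"F0": "Mult1"}

# Issue DIVD F10,F0,F6: F0 is pending, so record its producer;
# F6 is ready, so copy its current value into the slot.
div_rs = {"op": "DIVD", "Qj": reg_qi["F0"], "Vj": None, "Qk": None, "Vk": regs["F6"]}

# ADDD F6,F8,F2 completes first and overwrites F6 in the register file.
regs["F6"] = regs["F8"] + regs["F2"]

# Later, Mult1 broadcasts its result (12.0 here) and DIVD picks it up from the CDB.
cdb_tag, cdb_value = "Mult1", 12.0
if div_rs["Qj"] == cdb_tag:
    div_rs["Vj"], div_rs["Qj"] = cdb_value, None

# DIVD executes from its private copies: 12.0 / 3.0, not 12.0 / 1.5.
regs["F10"] = div_rs["Vj"] / div_rs["Vk"]
print(regs["F6"], regs["F10"])
```

The slot never names F6 after issue, so the order in which ADDD writes and DIVD executes no longer matters.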

  33. Elimination of WAW Hazards • Note the potential WAW hazard between the first LD and the last ADDD involving F6. • But, as soon as the ADDD is issued, the register status table is updated with F6 assigned to “adder2” • So when the LD completes it will not update F6, thus eliminating the WAW hazard
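
The register status check can be sketched in a few lines of Python. This assumes a hypothetical scenario in which the LD is slow enough to finish after the ADDD; `issue_dest` and `write_result` are illustrative names and the values are made up.

```python
reg_value = {"F6": 0.0}
reg_qi = {}          # register status table: register -> tag of pending producer


def issue_dest(reg, tag):
    """At issue, the newest writer takes over the register's status entry."""
    reg_qi[reg] = tag


def write_result(tag, value):
    """Write only the registers whose status entry still names this producer."""
    written = []
    for reg in [r for r, q in reg_qi.items() if q == tag]:
        reg_value[reg] = value
        del reg_qi[reg]
        written.append(reg)
    return written


issue_dest("F6", "Load1")    # LD F6,34(R2) issues
issue_dest("F6", "Add2")     # ADDD F6,F8,F2 issues later and replaces the tag

by_add = write_result("Add2", 7.0)     # ADDD completes first and writes F6
by_load = write_result("Load1", 99.0)  # late LD result matches no entry
print(by_add, by_load, reg_value["F6"])
```

The late load result is still broadcast to any reservation stations waiting on Load1; it simply never reaches the register file, so F6 keeps the value from the last writer in program order.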

  34. Tomasulo Drawbacks • Complexity • delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon! • Many associative stores (CDB) at high speed • Performance limited by Common Data Bus • Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density • Number of functional units that can complete per cycle limited to one! • Multiple CDBs ⇒ more FU logic for parallel associative stores • Non-precise interrupts! • this will be addressed later

  35. Overlap Loop Interactions • Register renaming • Multiple iterations use different physical destinations for registers (dynamic loop unrolling). • Reservation stations • Permit instruction issue to advance past integer control flow operations • Also buffer old values of registers, totally avoiding the WAR stall • Other perspective: Tomasulo builds the data-flow dependency graph on the fly • Note, branch prediction is still needed!

  36. Dynamic Loop Scheduling • Loop example: • Loop: LD F0,0(R1); MULTD F4,F0,F2; SD 0(R1),F4; SUBI R1,R1,#8; BNEZ R1,Loop • Note data dependences can span loop iterations. • But, using Tomasulo, & predict-taken, multiple iterations can issue and begin execution simultaneously! • Like dynamic loop unrolling by the HW.
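
To see why renaming behaves like loop unrolling, the following Python sketch renames two predicted-taken iterations of the loop. It is a simplification under stated assumptions: `rename` is a hypothetical helper, tags `T1`, `T2`, … stand in for reservation-station or buffer names, and the integer instructions are renamed the same way as the FP ones, which the slides do not claim.

```python
# (op, destination register or None, source registers) for one loop iteration.
LOOP_BODY = [
    ("LD", "F0", ["R1"]),
    ("MULTD", "F4", ["F0", "F2"]),
    ("SD", None, ["R1", "F4"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
]


def rename(body, iterations):
    """Give every result a fresh tag; sources refer to the latest producer."""
    latest = {}      # architectural register -> tag of its most recent writer
    counter = 0
    out = []
    for it in range(1, iterations + 1):
        for op, dest, srcs in body:
            renamed_srcs = [latest.get(s, s) for s in srcs]
            tag = None
            if dest is not None:
                counter += 1
                tag = f"T{counter}"
                latest[dest] = tag
            out.append((it, op, tag, renamed_srcs))
    return out


renamed = rename(LOOP_BODY, 2)
for row in renamed:
    print(row)
```

In the output, the two MULTDs read different tags for F0 and write different tags for F4, so the second iteration's LD and MULTD do not have to wait for the first iteration's MULTD and SD; only the true dependence through R1 links the iterations.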

  37. Check Figure 2.13
