Chapter 2: ILP and Its Exploitation

Chapter 2: ILP and Its Exploitation • Review simple static pipeline • ILP Overview • Dynamic branch prediction • Dynamic scheduling, out-of-order execution • Multiple issue (superscalar) • Hardware-based speculation • ILP limitation • Intel P6 microarchitecture

Dynamic Scheduling • If an instruction is stalled, there’s no need to stall later instructions that don’t depend on any of the stalled instructions, i.e. out-of-order execution • Example: DIVD F0,F2,F4 (long-running); ADDD F10,F0,F8 (depends on DIVD); SUBD F12,F8,F14 (independent of both) • The ADDD is stalled before execution, but the SUBD can go ahead. • Out-of-order execution introduces WAW and WAR hazards
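
The three-instruction example above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `can_bypass` helper are hypothetical names, and only true (RAW) dependences through FP registers are considered.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination, two source registers.
Instr = namedtuple("Instr", "op dest src1 src2")

program = [
    Instr("DIVD", "F0", "F2", "F4"),     # long-running
    Instr("ADDD", "F10", "F0", "F8"),    # reads F0, so it depends on DIVD
    Instr("SUBD", "F12", "F8", "F14"),   # reads neither F0 nor F10
]

def depends_on(later, earlier):
    """True if `later` reads the register that `earlier` writes (RAW)."""
    return earlier.dest in (later.src1, later.src2)

def can_bypass(program, slow_index):
    """Ops after slow_index with no direct or transitive RAW dependence on it."""
    blocked = {slow_index}
    free = []
    for i in range(slow_index + 1, len(program)):
        if any(depends_on(program[i], program[j]) for j in blocked):
            blocked.add(i)               # must wait behind the slow instruction
        else:
            free.append(program[i].op)   # may execute out of order
    return free

print(can_bypass(program, 0))
```

With the DIVD treated as the slow instruction, only the SUBD is reported as free to go ahead, matching the slide.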

Splitting Instruction Decode • Single “Instruction Decode” stage split into 2 parts: • Instruction Issue or dispatch (in-order) • Determine instruction type • Check for structural hazards • Read Operands (can be out-of-order) • Stall instruction until no data hazards • Read operands • Release instruction to begin execution • Need some sort of queue or buffer to hold instructions until their operands are ready. • Note: Out-of-order completion makes precise exception handling difficult! How to handle? • (Figure labels: Instruction Decode; Issue; Queue; Read Operand.)

Tomasulo’s Algorithm • Tomasulo’s algorithm: • Another approach for dynamic scheduling, out-of-order execution • First used in IBM 360/91 FPU, many years ago • Based on key concept of dynamic register renaming • Like the static renaming we used in the loop-unrolling example • Some features: • Copes with long-latency operations (FPU or mem.) • Eliminates WAW & WAR hazards without stalling • Instructions begin execution as soon as their operands are ready; results are forwarded directly, bypassing the register file • Distributed hazard detection and execution control

Tomasulo’s Algorithm • Key differences (from Scoreboarding): • Hazard detection & instruction issue are done per execution unit • Data results go straight to where they are needed, via the CDB • Loads/stores get their own execution units • Reservation stations are used for register renaming • (Figure labels: Instruction Fetch; Instruction Queue; Issue Logic / Control Unit; Register File; Common Data Bus (CDB); Reservation Station + Execution unit 1; Reservation Station + Execution unit 2; …)

Components of a Tomasulo Unit • Reservation stations (RSs) • Buffer pending instructions and their operands until all operands are available and the instruction can enter the execution unit. • Issue logic • Redirects (renames) instructions’ register outputs to reservation-station slots. • Results go directly to RSs rather than through the register file. • Distributed hazard detection • Handled separately by each functional unit • Load & store buffers (can be combined with RS) • Queue up memory access requests

Simple FPU using Tomasulo’s Algorithm

Major Steps in Tomasulo (Fig 2.12) • Issue • Get instruction from FP instruction queue • If a slot in the appropriate RS (or load-store buffer) is available, send the instruction there; else stall it (structural hazard). • Send operand values to the RS if already available; otherwise, just record the names of the RSs that will produce the operands • Execute • While operands are not yet available, monitor the CDB for them. • When all operands are in the RS, begin executing the instruction. • Write result • When the result is available & the CDB is free, write the result to the CDB, and from there to the registers & RS/store slots waiting for it. • Update register status, RS values, flags, busy state, etc.
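
The three steps above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the 360/91 design or the textbook's figure: the `Tomasulo` class, its method names, and the register values are all hypothetical, and timing, load/store buffers, and CDB arbitration are left out.

```python
import operator

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}

class Tomasulo:
    def __init__(self, regs, stations):
        self.regs = dict(regs)                    # register values
        self.qi = {}                              # register -> tag of producing RS
        self.rs = {name: None for name in stations}   # RS tag -> entry (None = free)

    def issue(self, op, dest, s1, s2, kind):
        """Issue: claim a free RS of the right kind, or return None (structural stall)."""
        tag = next((t for t, e in self.rs.items()
                    if e is None and t.startswith(kind)), None)
        if tag is None:
            return None
        entry = {"op": op}
        for v, q, src in (("vj", "qj", s1), ("vk", "qk", s2)):
            entry[q] = self.qi.get(src)           # producer tag if value is pending
            entry[v] = None if entry[q] else self.regs[src]
        self.rs[tag] = entry
        self.qi[dest] = tag                       # rename dest to this RS
        return tag

    def execute(self, tag):
        """Execute: returns None while an operand is still outstanding."""
        e = self.rs[tag]
        if e["qj"] or e["qk"]:
            return None
        return OPS[e["op"]](e["vj"], e["vk"])

    def write_result(self, tag, value):
        """Write result: broadcast (tag, value) to waiting RSs and registers."""
        for e in self.rs.values():
            if e is None:
                continue
            if e["qj"] == tag:
                e["vj"], e["qj"] = value, None
            if e["qk"] == tag:
                e["vk"], e["qk"] = value, None
        for reg, q in list(self.qi.items()):
            if q == tag:
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[tag] = None                       # free the station

# Made-up register values, chosen only to make the arithmetic easy to follow.
m = Tomasulo({"F0": 0.0, "F2": 6.0, "F4": 2.0, "F6": 3.0, "F10": 0.0},
             ["Add1", "Add2", "Mult1", "Mult2"])
t_mul = m.issue("MULTD", "F0", "F2", "F4", "Mult")
t_div = m.issue("DIVD", "F10", "F0", "F6", "Mult")
stalled = m.execute(t_div)                        # DIVD still waits on Mult1
m.write_result(t_mul, m.execute(t_mul))
m.write_result(t_div, m.execute(t_div))
```

In this run the DIVD cannot execute until the MULTD broadcasts its result, after which both registers are updated and both stations are freed.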

Example for Tomasulo’s Algorithm • We will go through the same code fragment to see how Tomasulo’s Algorithm handles out-of-order execution. • 1. LD F6,34(R2) • 2. LD F2,45(R3) • 3. MULTD F0,F2,F4 • 4. SUBD F8,F6,F2 • 5. DIVD F10,F0,F6 • 6. ADDD F6,F8,F2 • (Figure legend: data dependence, anti-dependence, output dependence.)
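
The dependences in the six-instruction fragment above can be enumerated with a short script. This is a sketch with hypothetical names (`Instr`, `find_hazards`); it compares every pair of instructions on FP registers only, ignores the integer base registers R2 and R3, and does not account for intervening redefinitions (which do not occur in this fragment).

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")

code = [
    Instr("LD", "F6", ()),                # 1. LD    F6,34(R2)
    Instr("LD", "F2", ()),                # 2. LD    F2,45(R3)
    Instr("MULTD", "F0", ("F2", "F4")),   # 3. MULTD F0,F2,F4
    Instr("SUBD", "F8", ("F6", "F2")),    # 4. SUBD  F8,F6,F2
    Instr("DIVD", "F10", ("F0", "F6")),   # 5. DIVD  F10,F0,F6
    Instr("ADDD", "F6", ("F8", "F2")),    # 6. ADDD  F6,F8,F2
]

def find_hazards(code):
    """Return (kind, earlier#, later#, register) for each pairwise dependence."""
    hazards = []
    for j, later in enumerate(code):
        for i in range(j):
            earlier = code[i]
            if earlier.dest in later.srcs:      # later reads what earlier writes
                hazards.append(("RAW", i + 1, j + 1, earlier.dest))
            if later.dest in earlier.srcs:      # later writes what earlier reads
                hazards.append(("WAR", i + 1, j + 1, later.dest))
            if later.dest == earlier.dest:      # both write the same register
                hazards.append(("WAW", i + 1, j + 1, later.dest))
    return hazards

hazards = find_hazards(code)
for h in hazards:
    print(h)
```

The only name dependences found all involve F6 and the final ADDD: anti-dependences from SUBD and DIVD, and an output dependence from the first LD.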

Reservation Station Fields • In each slot: • Op - The operation to perform on operands S1 & S2 • Qj, Qk - The RS slots that will produce S1, S2 • Vj, Vk - The values of S1 & S2 • Busy - RS & its execution unit are occupied • In register file entries & store buffer slots: • Qi - The RS slot containing the op whose result should be stored here • In load and store buffers (combined in RS): • A - Holds the effective address for the load or store
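
One way to picture these fields is as a small record type. The sketch below is an assumption-laden illustration: the class name, the `ready` and `capture` helpers, and the operand values are hypothetical; only the field names (Op, Qj, Qk, Vj, Vk, Busy, A) come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False             # Busy: slot and its execution unit are occupied
    op: Optional[str] = None       # Op: operation to perform
    vj: Optional[float] = None     # Vj, Vk: operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None       # Qj, Qk: RS names that will produce the operands
    qk: Optional[str] = None
    a: Optional[int] = None        # A: effective address (load/store buffers only)

    def ready(self):
        """An occupied slot may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag, value):
        """Take a value broadcast under `tag` if this slot is waiting for it."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

# MULTD F0,F2,F4 issued while F2 is still being loaded by Load2 (values made up).
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", qj="Load2", vk=2.5)
ready_before = mult1.ready()
mult1.capture("Load2", 4.0)
ready_after = mult1.ready()
```

A slot holds either a value (V) or a producer name (Q) for each operand, never both, which is why `capture` clears the Q field when it fills the V field.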

Tomasulo Example • (Figure labels: instruction stream; 3 load buffers; FU countdown; 3 FP adder R.S.; 2 FP mult R.S.; clock cycle counter.)

Cycle 1

Cycle 2 Note: Can have multiple loads outstanding

Cycle 3 • Note: register names are removed (“renamed”) in Reservation Stations; MULT issued • Load1 completing; what is waiting for Load1?

Cycle 4 • Load2 completing; what is waiting for Load2?

Cycle 5 • Timers start counting down for Add1, Mult1

Cycle 6 • Issue ADDD here despite name dependency on F6?

Cycle 7 • Add1 (SUBD) completing; what is waiting for it?

Cycle 8

Cycle 9

Cycle 10 • Add2 (ADDD) completing; what is waiting for it?

Cycle 11 • Write result of ADDD here? • All quick instructions complete in this cycle!

Cycle 12

Cycle 13

Cycle 14

Cycle 15 • Mult1 (MULTD) completing; what is waiting for it?

Cycle 16 • Just waiting for Mult2 (DIVD) to complete

Cycle 55 (after skipping cycles…)

Cycle 56 • Mult2 (DIVD) is completing; what is waiting for it?

Cycle 57 • Once again: In-order issue, out-of-order execution, and out-of-order completion.

Tomasulo’s Two Major Advantages • Distribution of the hazard detection logic • Distributed reservation stations and the CDB • If multiple instructions are waiting on a single result, & each already has its other operand, then they can be released simultaneously by the broadcast on the CDB • If a centralized register file were used, the units would have to read their results from the registers when register buses are available • Elimination of stalls for WAW and WAR hazards

Elimination of WAR Hazards • Note the potential WAR hazard between DIVD and ADDD involving F6. • But, as soon as DIVD enters the RS, it becomes independent of the ADDD! • The 2nd source operand no longer refers to F6, but stores the value of F6 produced earlier by the LD. • If the LD had not yet completed, the 2nd operand would then refer to its R.S., but still not to F6! • So, ADDD can write its new value for F6 before DIVD executes, without messing it up!
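
The WAR argument above comes down to the DIVD holding its own copy of F6. A minimal sketch, with made-up register values and a plain dictionary standing in for the reservation station:

```python
# Hypothetical register values; F0 is still being computed by Mult1.
regs = {"F2": 4.0, "F6": 3.0, "F8": 1.0}

# Issue DIVD F10,F0,F6: F0 is pending (tag Mult1), F6 is copied by value.
divd = {"op": "DIVD", "qj": "Mult1", "vj": None, "qk": None, "vk": regs["F6"]}

# ADDD F6,F8,F2 finishes first and overwrites F6 in the register file.
regs["F6"] = regs["F8"] + regs["F2"]

# Later, Mult1 broadcasts its result (assumed to be 12.0) and DIVD captures it.
divd["vj"], divd["qj"] = 12.0, None

# DIVD executes from its buffered operands, so it still sees the old F6.
quotient = divd["vj"] / divd["vk"]
print(quotient, regs["F6"])
```

The DIVD divides by the value of F6 captured at issue, even though the register file now holds the ADDD's newer value.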

Elimination of WAW Hazards • Note the potential WAW hazard between the first LD and the final ADDD involving F6. • But, as soon as the ADDD is issued, the register status table is updated with F6 assigned to Add2 • So, if the LD completes after that, it will not update F6, thus eliminating the WAW hazard
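
The register status table behaviour described above can be sketched as follows. The `write_back` helper and the values are hypothetical, and the scenario assumes the LD is still outstanding when the ADDD issues.

```python
def write_back(regs, qi, tag, value):
    """Update only the registers whose status entry still names `tag`."""
    for reg, producer in list(qi.items()):
        if producer == tag:
            regs[reg] = value
            del qi[reg]

regs = {"F6": 0.0}
qi = {}                      # register status table: register -> producing RS

qi["F6"] = "Load1"           # LD   F6,34(R2) issues
qi["F6"] = "Add2"            # ADDD F6,F8,F2 issues and takes over F6

write_back(regs, qi, "Load1", 7.0)   # LD completes: F6 no longer names Load1
after_load = regs["F6"]

write_back(regs, qi, "Add2", 5.0)    # ADDD completes: F6 gets the final value
print(after_load, regs["F6"])
```

Because the status entry for F6 names only the most recent writer, the older LD result is dropped from the register file and F6 ends up with the ADDD's value.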

Tomasulo Drawbacks • Complexity • delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon! • Many associative stores (CDB) at high speed • Performance limited by Common Data Bus • Each CDB must go to multiple functional units: high capacitance, high wiring density • Number of functional units that can complete per cycle limited to one! • Multiple CDBs: more FU logic for parallel associative stores • Non-precise interrupts! • This will be addressed later

Overlapping Loop Iterations • Register renaming • Multiple iterations use different physical destinations for registers (dynamic loop unrolling). • Reservation stations • Permit instruction issue to advance past integer control flow operations • Also buffer old values of registers, totally avoiding the WAR stall • Other perspective: Tomasulo builds the data-flow dependency graph on the fly • Note: branch prediction is still needed!

Dynamic Loop Scheduling • Loop example: • Loop: LD F0,0(R1); MULTD F4,F0,F2; SD 0(R1),F4; SUBI R1,R1,#8; BNEZ R1,Loop • Note that data dependences can span loop iterations. • But, using Tomasulo and predicting the branch taken, multiple iterations can issue and begin execution simultaneously! • Like dynamic loop unrolling by the HW.
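
The "dynamic loop unrolling" effect can be illustrated by renaming the FP destinations of successive iterations. This is a simplified sketch: the `rename` function and the tags T1, T2, … are hypothetical stand-ins for reservation-station names, the supply of tags is assumed unlimited, the branch is assumed predicted taken, and the integer instructions (SUBI, BNEZ) and address registers are omitted.

```python
from itertools import count

def rename(iterations):
    """Rename FP destinations so each loop iteration gets fresh tags."""
    body = [("LD", "F0", []),                 # LD    F0,0(R1)
            ("MULTD", "F4", ["F0", "F2"]),    # MULTD F4,F0,F2
            ("SD", None, ["F4"])]             # SD    0(R1),F4
    tags = count(1)
    latest = {}                               # register -> tag of latest producer
    renamed = []
    for _ in range(iterations):
        for op, dest, srcs in body:
            new_srcs = [latest.get(s, s) for s in srcs]   # read latest producer
            tag = None
            if dest is not None:
                tag = f"T{next(tags)}"
                latest[dest] = tag            # later readers see this producer
            renamed.append((op, tag, new_srcs))
    return renamed

for line in rename(2):
    print(line)
```

After renaming, the second iteration's LD and MULTD write T3 and T4 rather than reusing the first iteration's T1 and T2, so the two iterations no longer conflict on F0 or F4.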

Check Figure 2.13
