Computer Architecture Principles
Dr. Mike Frank
CDA 5155, Summer 2003
Module #20: Dynamic Scheduling with Tomasulo’s Algorithm
Tomasulo’s Algorithm
• Earlier, we talked about scoreboarding.
• Tomasulo’s algorithm:
  • Another approach to dynamic scheduling
  • First used in the IBM 360/91 FPU, many years ago
  • Based on the concept of dynamic register renaming
    • Like the static renaming we used in the loop-unrolling example
• Some features:
  • Copes with long-latency operations (FPU or memory)
  • Eliminates WAR & WAW hazards without stalling
  • Instructions begin executing as soon as their operands are ready
  • Supports overlapping loop iterations
Scoreboarding Review
• Some limitations:
  • Structural & WAW hazards and instruction issue are administered centrally
  • Results must go through the register file
[Figure: block diagram of a scoreboarded pipeline. Labeled blocks: Instruction Fetch, Instruction Decode, Pre-issue buffer, Instruction Issue, Pre-execution buffers, Read operands, Execution units 1, 2, …, Post-execution buffers, Write results, Scoreboard / Control Unit.]
Tomasulo’s Algorithm is Similar
• Key differences:
  • Hazard detection & instruction issue are done per execution unit
  • Data results go straight to where they are needed
  • Loads/stores get their own execution units
[Figure: block diagram of a Tomasulo pipeline. Labeled blocks: Instruction Fetch, Instruction Queue, Issue Logic / Control Unit, Register File, Reservation Stations, Execution units 1, 2, …, Common Data Bus (CDB). Caption on the issue stage: “The Activity Formerly Known as Instruction Decode & Register Fetch”.]
Components of a Tomasulo unit
• Reservation stations (RSs)
  • Buffer the operands of pending instructions while they wait to enter the execution units.
  • Effectively provide extra, non-programmer-visible “renaming” registers, and so dynamically avoid WAW/WAR hazards.
• Issue logic
  • Redirects (renames) instructions’ register outputs to reservation-station slots.
  • Results go directly to RSs rather than through the register file.
• Distributed hazard detection
  • Handled separately by each functional unit
• Load & store buffers
  • Queue up memory access requests
Major Steps in Tomasulo
• 1. Issue
  • Get the next instruction from the FP op queue.
  • If a slot in the appropriate RS (or load/store buffer) is available, send the instruction there; otherwise stall it (structural hazard).
  • Send operand values to the RS if they are already available; otherwise, just note the names of the operands’ producers in the RS.
• 2. Execute
  • While operands are not yet available, monitor the CDB for them.
  • When all operands are in the RS, begin executing the instruction.
• 3. Write result
  • When the result is available & the CDB is free, write the result to the CDB, and from there to the registers and to any RS or store-buffer slots waiting for it.
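The issue step above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the dictionary layout, the function name `issue`, the station name `Add1`, and the register values are all assumptions made for the example.

```python
def issue(instr, stations, reg_value, reg_tag):
    """Try to issue (op, dest, src1, src2).

    Returns the name of the reservation station used, or None if every
    station is occupied (structural hazard, so the instruction stalls).
    """
    op, dest, src1, src2 = instr
    free = next((name for name, slot in stations.items() if slot is None), None)
    if free is None:
        return None  # no free slot: stall
    slot = {"op": op}
    for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        tag = reg_tag.get(src)          # which station will produce src, if any
        slot[q] = tag                   # note the producer's name...
        slot[v] = None if tag else reg_value[src]  # ...or copy the value now
    stations[free] = slot
    reg_tag[dest] = free                # rename: dest now comes from this station
    return free


# Hypothetical state: one adder station, F6 still pending from "Load1".
stations = {"Add1": None}
reg_value = {"F2": 1.0, "F6": 3.0, "F8": 0.0}
reg_tag = {"F6": "Load1"}

first = issue(("SUBD", "F8", "F6", "F2"), stations, reg_value, reg_tag)
second = issue(("ADDD", "F6", "F8", "F2"), stations, reg_value, reg_tag)
```

Here the SUBD takes the only adder slot, recording `Load1` as the producer of its first operand and copying F2's value directly; the ADDD then stalls because no slot is free.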
Reservation Station Fields
• In each slot:
  • Op: the operation to perform on operands S1 & S2
  • Qj, Qk: the RS slots that will produce S1, S2
  • Vj, Vk: the values of S1 & S2, if already obtained
  • Busy: this RS & its execution unit are occupied
• In register file entries & store buffer slots:
  • Qi: the RS slot containing the operation whose result should be stored here
• In load and store buffers:
  • Busy: this slot is in use
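As a rough illustration, the fields listed above map naturally onto two small record types. The class names, the use of `None` to mean "no pending producer", and the station names `Mult1` and `Load2` are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                    # tag other slots use to refer to this one
    busy: bool = False           # Busy: slot and its execution unit occupied
    op: Optional[str] = None     # Op: operation to perform on S1 and S2
    vj: Optional[float] = None   # Vj: value of S1, if already obtained
    vk: Optional[float] = None   # Vk: value of S2, if already obtained
    qj: Optional[str] = None     # Qj: station that will produce S1 (None = Vj valid)
    qk: Optional[str] = None     # Qk: station that will produce S2 (None = Vk valid)

    def ready(self) -> bool:
        """An occupied slot may start executing once no operand is pending."""
        return self.busy and self.qj is None and self.qk is None


@dataclass
class RegisterEntry:
    value: float = 0.0
    qi: Optional[str] = None     # Qi: station whose result should be stored here


# MULTD waiting on a load for its first operand.
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", qj="Load2", vk=2.5)
ready_before = mult1.ready()

# The load's result arrives: fill Vj and clear Qj.
mult1.vj, mult1.qj = 4.0, None
ready_after = mult1.ready()
```

Keeping a tag (Qj/Qk) and a value (Vj/Vk) side by side is what lets a slot either hold an operand or name the producer it is still waiting for.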
Code Example (revisited)
• We will go through the same code fragment that was used in the scoreboarding example:
  1. LD    F6,34(R2)
  2. LD    F2,45(R3)
  3. MULTD F0,F2,F4
  4. SUBD  F8,F6,F2
  5. DIVD  F10,F0,F6
  6. ADDD  F6,F8,F2
[Figure legend: arrows on the slide mark data dependences, anti-dependences, and output dependences.]
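The dependences in this fragment can be enumerated mechanically. The sketch below is a simplified pairwise check over floating-point and integer register names only; it ignores dependences through memory and does not account for intervening redefinitions (none matter in this fragment). The tuple encoding and the function name `dependences` are assumptions for illustration.

```python
# (opcode, destination register, source registers) for the six instructions.
code = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]


def dependences(code):
    """Return (kind, earlier instr #, later instr #, register) tuples."""
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(code):
        for i in range(j):
            _, dest_i, srcs_i = code[i]
            if dest_i in srcs_j:      # later reads what earlier writes
                found.append(("RAW", i + 1, j + 1, dest_i))
            if dest_j in srcs_i:      # later writes what earlier reads
                found.append(("WAR", i + 1, j + 1, dest_j))
            if dest_i == dest_j:      # both write the same register
                found.append(("WAW", i + 1, j + 1, dest_i))
    return found


deps = dependences(code)
for kind, i, j, reg in deps:
    print(f"{kind}: instr {i} -> instr {j} on {reg}")
```

Among the results are the anti-dependence from DIVD (5) to ADDD (6) on F6 and the output dependence from the first LD (1) to ADDD (6) on F6, which are the hazards that renaming removes.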
Elimination of WAR hazards
• Note the potential WAR hazard between DIVD and ADDD involving F6.
• But as soon as DIVD enters the RS, it becomes independent of the ADDD!
  • Its 2nd source operand no longer refers to F6; instead, the RS stores the value of F6 produced earlier by the LD.
  • If the LD had not yet completed, the 2nd operand would refer to the LD’s RS, but still not to F6!
• So ADDD can write its new value for F6 before DIVD executes, without corrupting DIVD’s operand!
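A small sketch of why the copy made at issue time removes the WAR hazard. It assumes, for simplicity, that both of DIVD's operands are already available when it issues; the register values are made up for the example.

```python
# Hypothetical register contents after the LDs and MULTD have written back.
reg_value = {"F0": 6.0, "F2": 2.0, "F6": 4.0, "F8": 1.0}

# Issue DIVD F10,F0,F6: operand VALUES are copied into its reservation station,
# so the slot no longer refers to register F6 at all.
divd_slot = {"op": "DIVD", "Vj": reg_value["F0"], "Vk": reg_value["F6"]}

# ADDD F6,F8,F2 finishes first and overwrites F6 in the register file.
reg_value["F6"] = reg_value["F8"] + reg_value["F2"]

# DIVD executes later, using the values captured at issue time.
divd_result = divd_slot["Vj"] / divd_slot["Vk"]
print(divd_result, reg_value["F6"])
```

DIVD still divides by the old F6 (4.0), even though the register file already holds ADDD's new value (3.0).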
Approximate Timing of Tomasulo Example
(RAW hazards causing stalls are shown as ~)
LD    F6,34(R2)   rs ex w
LD    F2,45(R3)   rs ex w
MULTD F0,F2,F4    rs~~~~ multiplier w
SUBD  F8,F6,F2    rs~~~ sub w
DIVD  F10,F0,F6   rs~~~~~~~~~~~~~~~ divide………… w
ADDD  F6,F8,F2    rs~~~~~~~ add w
[Time axis: markers #1 and #2; labels “Fig. 3.3 time” and “Fig. 3.4 time”. The horizontal alignment of the rows was lost in extraction.]
• rs: instruction has been issued, and is in a reservation station or buffer
• ex, add, sub, multiplier, divide: instruction is executing
• w: instruction is in the write-back stage
• Note:
  • As in scoreboarding, MULTD & SUBD can start concurrently, right after the 2nd LD completes.
  • ADDD can complete before DIVD reads its operands!
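More Details of the Algorithm
• Notation: D/S1/S2 = destination/sources; r/x = station of this instruction / any station; Register/RS/Store = data structures.
[Table: per-step bookkeeping, not recovered in extraction. Its annotations were:]
• Import source operands
• Tell the register it should expect to hear from us later
• Write the destination register (if still expecting)
• Write to expectant RS slots
• Write to expectant store buffer slots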
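The write-result annotations can be sketched as a single broadcast routine. This is an illustrative reconstruction, not the slide's table: the dictionary layout, the store-buffer field `V`, and the station names are assumptions.

```python
def write_result(tag, result, reg_value, reg_tag, stations, store_buffers):
    """Broadcast `result` from station `tag` to everything still expecting it."""
    # Write the destination register, but only if it is still expecting us;
    # a later instruction may have renamed it to a different station.
    for reg, q in reg_tag.items():
        if q == tag:
            reg_value[reg] = result
            reg_tag[reg] = None
    # Write to expectant RS slots.
    for slot in stations.values():
        if slot is None:
            continue
        if slot["Qj"] == tag:
            slot["Vj"], slot["Qj"] = result, None
        if slot["Qk"] == tag:
            slot["Vk"], slot["Qk"] = result, None
    # Write to expectant store buffer slots.
    for buf in store_buffers:
        if buf["Qi"] == tag:
            buf["V"], buf["Qi"] = result, None
    # Free the producing station, if it is one of the tracked stations.
    if tag in stations:
        stations[tag] = None


# Hypothetical state: F6 was renamed to Add1 after Load1 issued,
# so Load1's result must NOT land in register F6.
reg_value = {"F6": 0.0}
reg_tag = {"F6": "Add1"}
stations = {
    "Mult2": {"op": "DIVD", "Vj": None, "Qj": "Mult1", "Vk": None, "Qk": "Load1"},
    "Add1": {"op": "ADDD", "Vj": 1.0, "Qj": None, "Vk": 2.0, "Qk": None},
}
store_buffers = [{"Qi": "Load1", "V": None}]

write_result("Load1", 7.0, reg_value, reg_tag, stations, store_buffers)
```

After the call, the DIVD slot and the store buffer hold 7.0, while register F6 is untouched and still waits for `Add1`. That check on the register's tag is what makes WAW hazards harmless.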
Dynamic Loop Scheduling
• Loop example:
  Loop: LD    F0,0(R1)
        MULTD F4,F0,F2
        SD    0(R1),F4
        SUBI  R1,R1,#8
        BNEZ  R1,Loop
• Note that data dependences on the loop variable (here, R1) can extend from one loop iteration to the next.
• But, using Tomasulo & predict-taken, multiple iterations can still issue and begin execution simultaneously!
• Like dynamic loop unrolling, but done by the hardware.
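To make the "dynamic unrolling" concrete, the sketch below tracks only the renaming of F0 and F4 across iterations. It assumes the branch is predicted taken, that a fresh load buffer, multiplier station, and store buffer are available for every iteration, and it ignores the integer instructions; the names `Load1`, `Mult1`, `Store1` and the function name are assumptions.

```python
def rename_iterations(n_iters):
    """Return (iteration, instruction, own slot, slot it waits on) tuples."""
    reg_tag = {}   # register name -> slot that will produce its next value
    trace = []
    for it in range(1, n_iters + 1):
        load = f"Load{it}"
        reg_tag["F0"] = load                       # LD F0,0(R1)
        trace.append((it, "LD F0", load, None))

        mult = f"Mult{it}"
        waits_on = reg_tag["F0"]                   # MULTD reads this iteration's F0
        reg_tag["F4"] = mult                       # MULTD F4,F0,F2
        trace.append((it, "MULTD F4", mult, waits_on))

        trace.append((it, "SD F4", f"Store{it}", reg_tag["F4"]))  # SD 0(R1),F4
    return trace


trace = rename_iterations(2)
for row in trace:
    print(row)
```

Each iteration's MULTD waits on its own load buffer and each SD on its own multiplier slot, so the two iterations share the names F0 and F4 without sharing any storage, much as a compiler would rename registers when unrolling the loop statically.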
Drawbacks to Tomasulo
• Complex; requires a lot of hardware
  • Less important today, as the number of transistors per die increases
  • May be important again when considering low-power processors and chip multiprocessors (CMPs)
• Difficult to perform associative access to many RS entries at high speed
• The CDB can be a limiting factor:
  • Multiple CDBs are possible, but they add overhead in RS write ports.
When Tomasulo is most useful
• When one anticipates a need to run binaries compiled for earlier pipeline implementations
• When the code is difficult to schedule statically
  • e.g., when there are many dynamically resolved dependences through memory
• When there aren’t enough programmer-visible registers in the ISA to do static register renaming
• When there are many functional units available, and scoreboarding would be too much of a bottleneck due to its poor handling of WAR/WAW hazards