Tomasulo Dynamic Scheduling
Dynamic Issue
• In the IBM 360/91, about 3 years after the CDC 6600 (1966)
• Goal: High performance without special compilers
• Things to remember about the 60’s:
  • No caches, no RISC, very few registers, no precise exceptions
• Differences between IBM 360 & CDC 6600 ISA
  • IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  • IBM has 4 FP registers vs. 8 in CDC 6600
• Why study? Led to Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …
Dynamic Issue
• Goal: take advantage of multiple function units and deal with long memory latencies
• Advantages:
  • Speed
• Problems: multiple execution latencies
  • Result is out-of-order completion
  • Forwarding and hazard control become more difficult
  • Precise exceptions would later amplify the problem (non-issue in the ’60s)
• Answer: HW to issue instructions when hazards clear
Dynamic Issue
• Hazards = data, structural, control
  • Data: RAW (true data dependence), WAR (anti-dependence), WAW (output dependence)
  • Structural: Are the required resources available?
  • Control: Is this instruction supposed to execute or not?
• Implementation: 2 early approaches
  • Control flow: CDC 6600 (scoreboard) (1964)
  • Data flow: Tomasulo, IBM 360/91 (1967)
    • Simple idea: when the opcode and operands are ready, and the appropriate set of resources is ready, launch the “execution packet”
    • Interesting wrinkle: does not use named registers for intermediate storage
    • Implicit introduction of register renaming
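The three data hazard classes above can be checked mechanically from destination and source registers. The following is a minimal sketch, not the 360/91 logic: the `Instr` tuple and `data_hazards` helper are hypothetical names, and the operand choices are illustrative.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str                 # register written
    srcs: Tuple[str, ...]     # registers read


def data_hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the data hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")      # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")      # later overwrites what earlier still reads
    if later.dest == earlier.dest:
        found.add("WAW")      # both write the same register
    return found


# Illustrative operands (assumed, not taken from the slides' tables).
multd = Instr("MULTD", "F0", ("F2", "F4"))
divd = Instr("DIVD", "F10", ("F0", "F6"))
addd = Instr("ADDD", "F6", ("F8", "F2"))

print(data_hazards(multd, divd))  # {'RAW'}
print(data_hazards(divd, addd))   # {'WAR'}
```

Only the RAW case is a true dependence; the other two come from reusing register names, which is what renaming removes.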
Tomasulo vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
  • FU buffers are called “reservation stations”; they hold pending operands
• Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  • Avoids WAR, WAW hazards
  • More reservation stations than registers, so can do optimizations compilers can’t
• Results go to FUs from RSs, not through registers, over the Common Data Bus, which broadcasts results to all FUs
• Loads and stores are treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
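To see why replacing register names with reservation station pointers removes WAR and WAW hazards, here is a small sketch of a renaming pass. It is an illustration under assumptions: the `rename` function and the `RS1`, `RS2`, … tag names are made up, and a source with no pending producer keeps its register name here, where real hardware would copy the value at issue.

```python
def rename(program):
    """Rename each destination to a fresh tag; sources follow the latest tag.

    program: list of (op, dest, src1, src2) tuples in program order.
    """
    latest = {}      # architectural register -> tag of its newest producer
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(program, start=1):
        # Sources read the newest producer, or the register itself if none.
        srcs = tuple(latest.get(s, s) for s in (src1, src2))
        tag = f"RS{n}"
        latest[dest] = tag
        renamed.append((op, tag) + srcs)
    return renamed


program = [
    ("DIVD", "F10", "F0", "F6"),   # reads F6
    ("ADDD", "F6", "F8", "F2"),    # writes F6: WAR with DIVD
    ("SUBD", "F8", "F6", "F2"),    # reads new F6 (RAW), writes F8 (WAR with ADDD)
]

for line in rename(program):
    print(line)
```

After renaming, no instruction writes a name that an earlier instruction reads, while the true dependence of SUBD on ADDD survives as a reference to `RS2`.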
Tomasulo Organization
[Figure: block diagram with FP Op Queue, FP Registers, Load Buffer, Store Buffer, FP Add Reservation Stations, FP Multiply Reservation Stations, and the Common Data Bus]
Reservation Station Duties
• Snarf sources off the CDB when they appear
  • CDB results are tagged with where they came from
• When all operands are present, enable the associated FU to execute
• Since values aren’t really written to registers (until later), no WAR or WAW hazards are possible
• Structural hazards are checked at two points
  • At dispatch: a free reservation station of the right type must be available
  • When the execution packet is ready: multiple reservation stations may compete for a shared FU
  • Program order is used as the basis for arbitration if required
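The two structural checks can be sketched as two small selection functions. This is a simplified model, not the 360/91 circuitry: the `Station` class, its `seq` program-order field, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Station:
    name: str
    busy: bool = False
    seq: int = 0                 # program-order number (hypothetical field)
    qj: Optional[str] = None     # tag still awaited for operand j, if any
    qk: Optional[str] = None     # tag still awaited for operand k, if any


def free_station(stations: List[Station]) -> Optional[Station]:
    """Check 1 (dispatch): is a station of this type free?"""
    return next((s for s in stations if not s.busy), None)


def select_for_fu(stations: List[Station]) -> Optional[Station]:
    """Check 2 (launch): among ready stations, the oldest wins the shared FU."""
    ready = [s for s in stations if s.busy and s.qj is None and s.qk is None]
    return min(ready, key=lambda s: s.seq, default=None)


adders = [
    Station("Add1", busy=True, seq=5),
    Station("Add2", busy=True, seq=3),
    Station("Add3", busy=True, seq=1, qj="Mult1"),  # oldest, but not ready
]

print(free_station(adders))          # None: dispatch must stall
print(select_for_fu(adders).name)    # Add2
```

Add3 is oldest but still waiting on a tag, so the oldest ready station, Add2, gets the unit.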
Virtual Registers
• Tag field associated with data
  • Tag field is a virtual register ID
  • Corresponds to reservation station and load buffer names
• Motivation: the 360’s register weakness
  • Had only 4 FP regs
  • The 9 renamed regs (reservation station slots) were a significant bonus
• Intel’s x86 architecture is also register-poor
  • With renamed registers they can get around this
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
  • 64 bits of data + 4 bits of Functional Unit source address
  • Write if matches expected Functional Unit (produces result)
  • Does the broadcast
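The “come from” behavior of the write-result stage can be sketched as a broadcast of a (source tag, value) pair that every waiting consumer compares against the tag it expects. The dictionary layout, the station names `Mult1`/`Mult2`, and the `write_result` function are assumptions for illustration only.

```python
def write_result(tag, value, stations, reg_value, reg_status):
    """Broadcast (source tag, value) on a modelled Common Data Bus."""
    # Waiting stations capture the value if they expect this source tag.
    for rs in stations.values():
        for v, q in (("vj", "qj"), ("vk", "qk")):
            if rs[q] == tag:
                rs[v], rs[q] = value, None
    # A register is written only if this tag is still its pending producer.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_value[reg] = value
            del reg_status[reg]
    # The producing station becomes available again.
    stations[tag]["busy"] = False


stations = {
    "Mult1": {"busy": True, "op": "MULTD", "vj": 2.0, "vk": 4.0,
              "qj": None, "qk": None},
    "Mult2": {"busy": True, "op": "DIVD", "vj": None, "vk": 6.0,
              "qj": "Mult1", "qk": None},
}
reg_value = {"F0": 1.0, "F10": 0.0}
reg_status = {"F0": "Mult1", "F10": "Mult2"}

write_result("Mult1", 8.0, stations, reg_value, reg_status)
print(stations["Mult2"]["vj"], reg_value["F0"], reg_status)
# 8.0 8.0 {'F10': 'Mult2'}
```

Because consumers match on the producer's tag, one broadcast updates every waiting station and the register file without the producer knowing who is listening.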
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Values of the source operands
  • Store buffers have a V field: the result to be stored
Qj, Qk: Reservation stations producing the source registers (value to be written)
  • Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
  • Store buffers only have Qi, for the RS producing the result
Busy: Indicates reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
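As a rough sketch of how these fields get filled at issue time, the code below uses the slide's convention that Q = 0 means the value is already in V. It is simplified on purpose: the `RS` class and `issue` function are hypothetical, stations are numbered from 1, and all operations share one pool instead of separate add and multiply stations.

```python
from dataclasses import dataclass


@dataclass
class RS:
    busy: bool = False
    op: str = ""
    vj: float = 0.0
    vk: float = 0.0
    qj: int = 0      # 0 => Vj already holds the operand
    qk: int = 0      # 0 => Vk already holds the operand


def issue(stations, reg_value, reg_status, op, dest, src1, src2):
    """Fill a free station; return its 1-based number, or 0 if none is free."""
    for number, rs in enumerate(stations, start=1):
        if not rs.busy:
            break
    else:
        return 0     # structural hazard: every station is busy
    rs.busy, rs.op = True, op
    # Each source becomes either a value (Q = 0) or a pointer to its producer.
    rs.qj = reg_status.get(src1, 0)
    rs.vj = reg_value[src1] if rs.qj == 0 else 0.0
    rs.qk = reg_status.get(src2, 0)
    rs.vk = reg_value[src2] if rs.qk == 0 else 0.0
    reg_status[dest] = number    # register result status now names this station
    return number


stations = [RS(), RS()]
reg_value = {"F0": 1.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F10": 0.0}
reg_status = {}   # blank: no pending writers

first = issue(stations, reg_value, reg_status, "MULTD", "F0", "F2", "F4")
second = issue(stations, reg_value, reg_status, "DIVD", "F10", "F0", "F6")
third = issue(stations, reg_value, reg_status, "MULTD", "F8", "F2", "F4")
print(first, second, third)    # 1 2 0
print(stations[1].qj, stations[1].vk)   # 1 6.0
```

The second instruction records station 1 in Qj because F0 has a pending writer, while its F6 operand is copied as a value right away.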
Tomasulo Example Cycle 4
Sort of like Figure 4.9 in your text
Tomasulo Example Cycle 8
• Note: ADDD can execute (and complete) before DIVD executes, because the old version of F6 is already stored in DIVD’s reservation station, which avoids the WAR hazard
Tomasulo Example Cycle 14
This is Figure 4.10 in the text
Tomasulo Example Cycle 16
• Now do 38 more DIVD cycles and then write back F10 to finish
Review: Tomasulo
• Prevents registers from becoming the bottleneck
  • Where’s the new bottleneck?
• Avoids WAR, WAW hazards of Scoreboard
• If we assume branch prediction (next subject…)
  • Allows loop unrolling in HW
  • Not limited to basic blocks
• Lasting Contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
    • Out of order is OK if addresses don’t match
• 360/91 descendants are PowerPC 604, 620; MIPS R10000; HP PA-8000; Intel Pentium Pro
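The load/store disambiguation rule (“out of order is OK if addresses don’t match”) reduces to an address comparison against earlier pending stores. A minimal sketch follows; the function name is hypothetical, and treating a not-yet-computed store address (`None`) as a possible match is a conservative assumption added here, not something the slides state.

```python
from typing import Iterable, Optional


def load_may_bypass(load_addr: int,
                    earlier_store_addrs: Iterable[Optional[int]]) -> bool:
    """True if a load may execute ahead of all earlier pending stores."""
    for addr in earlier_store_addrs:
        # Assumption: an unknown store address (None) might alias the load.
        if addr is None or addr == load_addr:
            return False
    return True


print(load_may_bypass(0x100, [0x200, 0x300]))  # True
print(load_may_bypass(0x100, [0x200, 0x100]))  # False
```

A matching address means the load must wait for, or take its data from, the earlier store; otherwise it can proceed out of order.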