CENG 450 Computer Systems and Architecture, Lecture 9
Amirali Baniasadi, amirali@ece.uvic.ca
This Lecture
• Scoreboard: Review
• Tomasulo
• Branch Prediction
Instruction Buffer
• Trick: instruction buffer (many names for this buffer)
• Accumulate decoded instructions in the buffer
• Buffer sends instructions down the rest of the pipe out of order
[Figure: pipeline stages IF, ID1, ID2, EX, MEM, WB with an instruction buffer]
Scoreboard State/Steps
• Confusion in the community about which stage is which
[Figure: instruction buffer and pipeline stages IF, IS, RO, EX, WB, with ID labelled]
[Figure: Structure. Labels: Data Bus, Registers, EX (three units), Control/Status, Scoreboard]
Scoreboard
• Operands for an instruction are read only when both operands are available in the register file
• Scoreboard does not take advantage of forwarding
• Instructions write to the register file as soon as they complete execution (assuming no WAR hazards) and do not wait for a write slot
• One additional cycle of latency, because the write-result and read-operand stages cannot overlap
• Bus structure
  • The limited number of buses to the register file represents a structural hazard
Scoreboard
• Limitations
  • No forwarding (RAW dependences handled through registers)
  • In-order issue for WAW/structural hazards limits scheduling flexibility
  • WAR stalls limit dynamic loop unrolling (no register renaming)
• Performance
  • 1.7X for FORTRAN programs
  • 2.5X for hand-coded assembly
• Hardware
  • Scoreboard is cheap
  • Buses are not
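The stall conditions listed above can be sketched as three predicates over the scoreboard's bookkeeping. This is a minimal sketch, not the CDC 6600 logic: the field names (`fi`, `fj`, `fk`, `rj`, `rk`) follow the usual textbook layout of the functional-unit status table and are assumptions, as are the unit names in the usage note.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """One functional-unit row of a scoreboard (assumed textbook-style fields)."""
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    rj: bool = False           # True: Fj is ready and has not been read yet
    rk: bool = False           # True: Fk is ready and has not been read yet


def can_issue(fu: FUStatus, dest: str, pending_writer: Dict[str, str]) -> bool:
    # Stall on a structural hazard (unit busy) or a WAW hazard
    # (an active instruction already has `dest` as its destination).
    return not fu.busy and pending_writer.get(dest) is None


def can_read_operands(fu: FUStatus) -> bool:
    # No forwarding: both operands must be available in the register file.
    return fu.rj and fu.rk


def can_write_result(name: str, units: Dict[str, FUStatus]) -> bool:
    # WAR check: stall the write while another active unit still has to
    # read the old value of our destination register.
    dest = units[name].fi
    for other_name, other in units.items():
        if other_name == name or not other.busy:
            continue
        if (other.fj == dest and other.rj) or (other.fk == dest and other.rk):
            return False
    return True
```

For instance, with a divide unit holding `DIVD F10, F0, F6` (F0 not yet produced, F6 ready but unread) and an add unit holding `ADDD F6, F8, F2`, `can_write_result("Add", units)` is false until the divide has read F6. That is the WAR stall that renaming later removes.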
Tomasulo’s Algorithm
• Developed for IBM 360/91 ~3 years after CDC 6600 (1966)
• Goal: high performance without special compilers
• Differences between IBM 360 and CDC 6600 ISA
  • IBM has only 2 register specifiers per instruction vs. 3 in CDC 6600
  • IBM has 4 FP registers vs. 8 in CDC 6600
  • IBM has long memory access delays, long FP delays
• Why study? Led to Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …
Tomasulo’s Algorithm
• Avoid RAW hazards
  • Execute an instruction only when its operands are available
  • Has a scheme to track when operands are available
• Avoid WAR and WAW hazards
  • Supports register renaming
  • Renames all destination registers: an out-of-order write does not affect any instruction that depends on an earlier value of an operand

    Original                Renamed (S and T are temporary registers)
    DIVD F0, F2, F4         DIVD F0, F2, F4
    ADDD F6, F0, F8         ADDD S, F0, F8
    SD   F6, 0(R1)          SD   S, 0(R1)
    SUBD F8, F10, F14       SUBD T, F10, F14
    MULD F6, F10, F8        MULD F6, F10, T

    Hazards in the original column: WAR on F8 (ADDD reads, SUBD writes) and on F6 (SD reads, MULD writes); WAW on F6 (ADDD and MULD both write).
• Supports the overlapped execution of multiple iterations of a loop
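To check the claim that renaming removes only the name dependences, here is a small sketch that classifies hazards between instruction pairs and runs it on both columns of the example. The tuple encoding `(opcode, destination, sources)` and the function name `hazards` are assumptions made for illustration. The check is pairwise and conservative: it ignores intervening writes, which does not matter for this five-instruction sequence.

```python
from itertools import combinations

# Each instruction: (opcode, destination or None, tuple of source registers).
original = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F6", ("F0", "F8")),
    ("SD",   None, ("F6", "R1")),
    ("SUBD", "F8", ("F10", "F14")),
    ("MULD", "F6", ("F10", "F8")),
]
renamed = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "S",  ("F0", "F8")),
    ("SD",   None, ("S", "R1")),
    ("SUBD", "T",  ("F10", "F14")),
    ("MULD", "F6", ("F10", "T")),
]


def hazards(program):
    """Return a set of (kind, earlier_index, later_index, register)."""
    found = set()
    for (i, (_, d1, s1)), (j, (_, d2, s2)) in combinations(enumerate(program), 2):
        if d1 is not None and d1 in s2:
            found.add(("RAW", i, j, d1))   # later reads what earlier writes
        if d2 is not None and d2 in s1:
            found.add(("WAR", i, j, d2))   # later writes what earlier reads
        if d1 is not None and d1 == d2:
            found.add(("WAW", i, j, d1))   # both write the same register
    return found


for label, program in (("original", original), ("renamed", renamed)):
    print(label, sorted(hazards(program)))
```

The original column reports three RAW, two WAR and one WAW hazard; the renamed column keeps the three RAW hazards (the true data flow) and nothing else.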
Register Renaming
Four copies of the loop (the body below appears four times in a row):
  LD   F0, 0(R1)
  ADDD F4, F0, F2
  SD   F4, 0(R1)
  SUBI R1, R1, #8
  BNE  R1, R2, Loop
Four-iteration code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F0, F2
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        ADDD F8, F6, F2
        SD   F8, -8(R1)
        LD   F10, -16(R1)
        ADDD F12, F10, F2
        SD   F12, -16(R1)
        LD   F14, -24(R1)
        ADDD F16, F14, F2
        SD   F16, -24(R1)
        SUBI R1, R1, #32
        BNE  R1, R2, Loop
Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with function units (FUs), with bypassing, vs. centralized in scoreboard
  • FU buffers are called reservation stations (RS); they hold pending operands
• Registers in instructions replaced by values or pointers to reservation stations; called register renaming
  • Avoids WAR, WAW hazards
  • More reservation stations than registers, so can do optimizations compilers can’t
• Results go to FUs from RS, not through registers, over the Common Data Bus, which broadcasts results to all FUs
• Loads and stores treated as FUs with reservation stations as well
Three Stages of Tomasulo’s Algorithm
• Issue: get instruction from FP Op Queue
  • If a reservation station is free (no structural hazard), issue the instruction and operand values (if they are in the registers)
  • If no reservation station is free, the instruction stalls
  • If operands are not in the registers, rename registers (eliminates WAR, WAW hazards) and keep track of the functional units producing the operands
• Execution: operate on operands (EX)
  • If both operands are ready, execute
  • If not ready, watch the Common Data Bus for the result (avoids RAW hazards)
• Write result: finish execution (WB)
  • Write on the Common Data Bus to all units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common Data Bus: data + source (“come from” bus); broadcasts
• Each stage can take a different number of clock cycles
Reservation Station Components
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
  • Store buffers have a V field with the result to be stored
• Qj, Qk: reservation stations producing the source operands (Qj, Qk = 0 => ready)
• Busy: indicates the reservation station or FU is busy
• Qi (register result status): indicates which functional unit, if any, will write to the register
  • ‘0’ when no pending instruction will write to this register
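The fields above map naturally onto a small record type. The sketch below is one possible encoding, not the 360/91 hardware: `None` stands in for the slide's 0 in Qj/Qk, and the station names (`Mult1`, `Mult2`, `Load1`) and operand values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                    # tag this station uses on the Common Data Bus
    busy: bool = False
    op: Optional[str] = None     # operation to perform
    vj: Optional[float] = None   # source operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of producing stations; None plays the role of 0
    qk: Optional[str] = None

    def ready(self) -> bool:
        # An occupied station may start executing once it waits on no producer.
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag: str, value: float) -> None:
        """Snoop a CDB broadcast and grab any operand this station waits for."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


# DIVD F10, F0, F6: F0 is still being produced by Mult1, F6 is already known.
div = ReservationStation("Mult2", busy=True, op="DIVD", qj="Mult1", vk=3.0)
print(div.ready())            # still waiting on Mult1
div.capture("Mult1", 6.0)     # Mult1 broadcasts its result
print(div.ready(), div.vj, div.vk)
```

The station matches on the producer's tag rather than on a register name, which is the "come from" behaviour of the Common Data Bus.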
Example
  LD   F6, 34(R2)
  LD   F2, 45(R3)
  MULT F0, F2, F4
  SUBD F8, F6, F2
  DIVD F10, F0, F6
  ADDD F6, F8, F2
Latencies (clock cycles)
• LD: 1
• MULT: 10
• DIVD: 40
• ADDD, SUBD: 2
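As a rough sanity check on this example, the sketch below computes when each instruction could finish if only true (RAW) dependences and the latencies above mattered. It is a dataflow lower bound under stated assumptions, not the cycle-by-cycle Tomasulo table: it ignores issue order, reservation-station and CDB contention, and the separate issue and write-result cycles. The function name and tuple encoding are illustrative.

```python
LATENCY = {"LD": 1, "MULT": 10, "DIVD": 40, "ADDD": 2, "SUBD": 2}

# (opcode, destination, source registers) in program order.
PROGRAM = [
    ("LD",   "F6",  ("R2",)),
    ("LD",   "F2",  ("R3",)),
    ("MULT", "F0",  ("F2", "F4")),
    ("SUBD", "F8",  ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6",  ("F8", "F2")),
]


def dataflow_finish_times(program, latency):
    """Earliest finish time of each instruction, limited only by RAW dependences."""
    last_write = {}   # register -> finish time of its most recent earlier writer
    finish = []
    for op, dest, sources in program:
        # Registers nobody in the program has written yet are available at time 0.
        start = max((last_write.get(r, 0) for r in sources), default=0)
        done = start + latency[op]
        finish.append(done)
        last_write[dest] = done
    return finish


print(dataflow_finish_times(PROGRAM, LATENCY))   # [1, 1, 11, 3, 51, 5]
```

ADDD could finish at time 5 while DIVD, which reads the old F6, cannot finish before 51. An implementation therefore has to let ADDD write F6 early without disturbing DIVD's operand, which is the WAR case that renaming handles.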
Review: Tomasulo
• Prevents registers from becoming a bottleneck
• Avoids WAR, WAW hazards of Scoreboard
• Allows loop unrolling in HW
• Not limited to basic blocks (provided branch prediction is available)
• Lasting contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
• 360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA 8000; Intel Pentium Pro
Example of WAR hazards in Tomasulo’s Algorithm
Example:
  LF   F6, 34(R2)
  DIVF F10, F6, F0
  ADDF F6, F8, F2
• ADDF can safely finish before DIVF has read register F6 because:
  • DIVF has renamed register F6 to point at LF’s functional unit
  • LF broadcasts its result on the Common Data Bus
Register Renaming
• Register renaming
  • Change register names to eliminate WAR/WAW hazards
  • Hardware renaming: most beautiful thing in architecture
• Key: think of architectural registers as names, not locations
  • Can have more locations than names
  • Dynamically map names to locations
• Map table: hardware structure that holds current mappings
  • Writes allocate a new location and note it in the map table
  • Reads find the location of the most recent write by looking at the map table
  • Must de-allocate locations appropriately
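The map-table idea can be sketched in a few lines. This is an illustration under simplifying assumptions: physical locations are named `P0`, `P1`, ... (hypothetical names), every write takes the next free location, and de-allocation is left out entirely, so the sketch raises `IndexError` once the free list is empty.

```python
def rename(program, arch_regs, num_physical):
    """Rename (op, dest, sources) instructions onto physical locations."""
    free = [f"P{n}" for n in range(num_physical)]
    # Every architectural name starts out mapped to some location.
    map_table = {r: free.pop(0) for r in arch_regs}
    renamed = []
    for op, dest, sources in program:
        # Reads: find the location of the most recent write via the map table.
        new_sources = tuple(map_table[s] for s in sources)
        new_dest = None
        if dest is not None:
            new_dest = free.pop(0)       # writes: allocate a new location
            map_table[dest] = new_dest   # and note it in the map table
        renamed.append((op, new_dest, new_sources))
    return renamed, map_table


# Two back-to-back iterations of the loop body from the earlier slide
# (branch omitted; SD has no register destination).
body = [
    ("LD",   "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD",   None, ("F4", "R1")),
    ("SUBI", "R1", ("R1",)),
]
code, table = rename(body * 2, ["F0", "F2", "F4", "R1"], num_physical=16)
for instr in code:
    print(instr)
```

Both iterations write F0 and F4 by name, but after renaming each write lands in its own location, so the second iteration's load no longer conflicts with the first iteration's add or store.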
Tomasulo Register Renaming
• Creating operation maps destination register
  • On dispatch, register renamed to tag of allocated RS
    • Register table entry := RS number
  • On completion, register written
    • Register table entry := 0
• Subsequent operation looks up sources in register table
  • Entry == 0 -> register has already been written
    • Copy register value to RS
    • Eliminates WAR hazards (private valid copy of register in RS)
  • Entry != 0 -> register value not ready, some RS will provide it
    • Copy entry (== RS tag) to RS, monitor CDB for that tag
CDB: Common Data Bus
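The register-table protocol above can be traced on the earlier LF / DIVF / ADDF example. The sketch below is a simplification, not the 360/91 design: the class name `TomasuloRenamer`, the string tags (`Load1`, `Div1`, `Add1`) and all numeric values are assumptions, the load's address operand is ignored, and the completion order is deliberately contrived so that ADDF finishes before LF.

```python
class TomasuloRenamer:
    def __init__(self, regs):
        self.regs = dict(regs)              # register file values
        self.table = {r: 0 for r in regs}   # register table: 0 = already written
        self.stations = {}                  # RS tag -> list of operands

    def issue(self, tag, dest, sources):
        operands = []
        for r in sources:
            if self.table[r] == 0:
                operands.append(self.regs[r])              # private copy of the value
            else:
                operands.append(("wait", self.table[r]))   # copy the tag, watch the CDB
        self.stations[tag] = operands
        self.table[dest] = tag              # destination renamed to this RS

    def complete(self, tag, value):
        """Broadcast (tag, value) on the CDB and retire the station."""
        for operands in self.stations.values():
            operands[:] = [value if s == ("wait", tag) else s for s in operands]
        for r, entry in self.table.items():
            if entry == tag:                # only the latest writer updates the register
                self.regs[r], self.table[r] = value, 0
        self.stations.pop(tag, None)


m = TomasuloRenamer({"F0": 2.0, "F2": 1.0, "F6": 0.0, "F8": 4.0, "F10": 0.0})
m.issue("Load1", "F6", [])               # LF   F6, 34(R2)
m.issue("Div1", "F10", ["F6", "F0"])     # DIVF F10, F6, F0  (waits on Load1)
m.issue("Add1", "F6", ["F8", "F2"])      # ADDF F6, F8, F2   (F6 now maps to Add1)
m.complete("Add1", 5.0)                  # ADDF finishes first and writes F6
m.complete("Load1", 7.0)                 # LF broadcasts; only Div1 takes the value
print(m.stations["Div1"], m.regs["F6"])  # [7.0, 2.0] 5.0
```

DIVF receives the loaded value through its tag, and the late LF broadcast does not overwrite F6, because the table entry for F6 no longer names `Load1`.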