EECS 470

EECS 470 Dynamic Scheduling – Part I Lecture 9 Coverage: Chapter 3

Reorder Buffer Recap
• Reorder Buffer (ROB)
  • Circular queue of spec state
  • May contain multiple definitions of same register
• @ Alloc
  • Allocate result storage at Tail
• @ Sched
  • Get inputs (ROB Tail-to-Head, then ARF)
  • Wait until all inputs ready
• @ WB
  • Write results/fault to ROB
  • Indicate result is ready
• @ CT
  • Wait until inst @ Head is done
  • If fault, initiate handler
  • Else, write results to ARF
  • Deallocate entry from ROB
[Figure: pipeline stages IF, ID, Alloc, Sched, EX, MEM, CT, marked in-order at the front and at CT and any-order in between; ROB with Head and Tail pointers and per-entry fields PC, Dst regID, Dst value, Except?; ARF]
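The recap above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference design: the class and method names (`ReorderBuffer`, `alloc`, `read_source`, `writeback`, `commit`) and the 8-register ARF are assumptions made for the example, and sources are assumed to be read before the instruction allocates its own entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dst_reg: int                 # architectural destination register ID
    value: Optional[int] = None  # result, written at WB
    done: bool = False           # set at WB
    fault: bool = False          # set at WB if the instruction faulted


class ReorderBuffer:
    """Circular queue of speculative state (hypothetical sketch)."""

    def __init__(self, size, num_regs=8):
        self.entries = [None] * size
        self.head = 0            # oldest in-flight instruction
        self.tail = 0            # next free entry
        self.count = 0
        self.arf = [0] * num_regs

    def alloc(self, dst_reg):
        """@ Alloc: claim the entry at Tail; None means the ROB is full."""
        if self.count == len(self.entries):
            return None
        idx = self.tail
        self.entries[idx] = RobEntry(dst_reg)
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def read_source(self, reg):
        """@ Sched: search Tail-to-Head for the newest definition, else ARF.

        Returns (ready, value).
        """
        n = len(self.entries)
        for i in range(1, self.count + 1):
            e = self.entries[(self.tail - i) % n]
            if e.dst_reg == reg:
                return (e.done, e.value)
        return (True, self.arf[reg])

    def writeback(self, idx, value, fault=False):
        """@ WB: record result (or fault) and mark the entry ready."""
        e = self.entries[idx]
        e.value, e.done, e.fault = value, True, fault

    def commit(self):
        """@ CT: retire the Head entry if it is done."""
        if self.count == 0 or not self.entries[self.head].done:
            return "stall"
        e = self.entries[self.head]
        if e.fault:
            return "fault"       # a real design would start the handler here
        self.arf[e.dst_reg] = e.value
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return "commit"
```

With two in-flight definitions of the same register, `read_source` returns the younger one, while `commit` still updates the ARF in program order.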

Register Renaming Recap
• Rename Table
  • Indexed with regID
  • Returns (valid, robIDX)
  • If valid, ROB does/will contain value of register
  • If invalid, ARF holds value (no instruction in flight defines this register)
• @ REN
  • Index table with source operand regID to locate ROB/ARF entry
  • Update table with dest regID with ROB assigned for dest
• @ Sched
  • Get inputs from ROB/ARF entry specified by REN
  • Wait until all inputs ready
• @ CT
  • Wait until inst @ Head is done
  • If fault, initiate handler
  • Else, write results to ROB/ARF entry specified by REN
  • Deallocate entry from ROB
  • Invalidate rename table entry @ dest regID iff the entry still points to ROB entry being deallocated (Why?)
[Figure: pipeline stages IF, ID, Alloc, REN, Sched, EX, MEM, CT, marked in-order at the front and at CT and any-order in between; ROB; ARF; Rename Table indexed by regID with fields v and robIDX]
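A small sketch of the rename table may help with the "Why?" on the conditional invalidate. The names `RenameTable`, `lookup`, `rename_dest`, and `retire` are hypothetical. The point it illustrates: if a younger in-flight instruction has already re-renamed the same register, clearing the valid bit unconditionally at commit would send later readers to the ARF for a stale value.

```python
class RenameTable:
    """Maps regID -> (valid, robIDX); hypothetical sketch."""

    def __init__(self, num_regs):
        self.valid = [False] * num_regs
        self.rob_idx = [0] * num_regs

    def lookup(self, reg):
        """@ REN (sources): valid means the ROB holds or will hold the value."""
        return (self.valid[reg], self.rob_idx[reg])

    def rename_dest(self, reg, rob_idx):
        """@ REN (dest): point the register at its newest producer."""
        self.valid[reg] = True
        self.rob_idx[reg] = rob_idx

    def retire(self, reg, rob_idx):
        """@ CT: invalidate only if the table still points at the retiring entry."""
        if self.valid[reg] and self.rob_idx[reg] == rob_idx:
            self.valid[reg] = False


rt = RenameTable(4)
rt.rename_dest(2, 0)   # older instruction writes r2 via ROB entry 0
rt.rename_dest(2, 1)   # younger instruction writes r2 via ROB entry 1
rt.retire(2, 0)        # older one commits: mapping must survive
print(rt.lookup(2))    # (True, 1)
```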

Putting It All Together: Out-of-Order Issue
• Goal: use ILP to get more work done, thus shorten run-time
  • Possible at compile time or run-time…trade-offs?
  • Most effective around branches, stores, and with few registers
• H/W uses dynamic scheduler
  • Invented at IBM in the mid-1960s
  • Also called Tomasulo’s Algorithm
• Instructions in reservation station
  • Wake up when sources ready
  • Select instructions each cycle
[Figure: Program Order vs. Out-of-Order Schedule for instructions I1, I2, I3, I4]

Dynamic Instruction Scheduling
• Reservation Stations (RS)
  • Associative storage indexed by phyID of dest, returns insts ready to execute
  • phyID is ROB index of inst that will compute operand (used to match on broadcast)
  • Value contains actual operand
  • Valid bits set when operand is available (after broadcast)
• @ Alloc
  • Allocate ROB storage at Tail
  • Allocate RS for instruction
• @ REG
  • Get inputs from ROB/ARF entry specified by REN
  • Write instruction with available operands into assigned RS
• @ WB
  • Write result into ROB entry
  • Broadcast result into RS with phyID of dest register
  • Deallocate RS entry (requires maintenance of an RS free map)
[Figure: pipeline stages IF, ID, Alloc, REN, REG, EX, MEM, WB, CT, marked in-order at the front and at CT and any-order in between; RS entry fields Op, dstID, and per source V, phyID, Value; ROB; ARF]
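The RS behavior described above (tag match on broadcast, valid bits, free slots) can be sketched as follows. This is an illustrative model under assumptions: `Operand`, `RsEntry`, and `ReservationStations` are hypothetical names, every instruction has exactly two sources, and a `None` slot stands in for the RS free map.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    phy_id: int                  # ROB index of the producing instruction
    valid: bool = False          # set once the value is available
    value: Optional[int] = None


@dataclass
class RsEntry:
    op: str
    dst_id: int                  # phyID this instruction will broadcast
    src1: Operand
    src2: Operand

    def ready(self):
        return self.src1.valid and self.src2.valid


class ReservationStations:
    def __init__(self, size):
        self.slots = [None] * size      # None == free (acts as the free map)

    def allocate(self, entry):
        """@ Alloc/REG: place the instruction in the first free slot."""
        for i, s in enumerate(self.slots):
            if s is None:
                self.slots[i] = entry
                return i
        return None                     # RS full: stall

    def broadcast(self, phy_id, value):
        """@ WB: every waiting source whose tag matches captures the value."""
        for e in self.slots:
            if e is None:
                continue
            for src in (e.src1, e.src2):
                if not src.valid and src.phy_id == phy_id:
                    src.valid, src.value = True, value

    def ready_slots(self):
        return [i for i, e in enumerate(self.slots)
                if e is not None and e.ready()]

    def deallocate(self, i):
        """Return the slot to the free pool."""
        self.slots[i] = None
```

In hardware the tag comparison in `broadcast` happens in parallel across all entries; the loop here only models the result.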

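The Wakeup-Select-Execute Loop
[Figure: RS entries, each holding src1/val1, src2/val2, and dest. Comparators (=) on each source match the broadcast dstID and capture the result. Ready entries send req to the Selection Logic, which returns grant. The granted instruction goes to EX/MEM, and at WB its dstID and result are broadcast back to the RS.]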

Dynamic Scheduling Example
(x = operand valid bit; sel = order in which the instruction is selected)

Frame 1
  instruction        V  src1  val1  V  src2  val2  dest  sel
  p41 = p52 + p43    x  p52   8        p43         p41
  p42 = p41 + p43       p41            p43         p42
  p43 = p51 + p50    x  p51   1     x  p50   2     p43   1

Frame 2 (p43 = 3 broadcast)
  instruction        V  src1  val1  V  src2  val2  dest  sel
  p41 = p52 + p43    x  p52   8     x  p43   3     p41   2
  p42 = p41 + p43       p41         x  p43   3     p42
  p43 = p51 + p50    x  p51   1     x  p50   2     p43   1

Frame 3 (p41 = 11 broadcast)
  instruction        V  src1  val1  V  src2  val2  dest  sel
  p41 = p52 + p43    x  p52   8     x  p43   3     p41   2
  p42 = p41 + p43    x  p41   11    x  p43   3     p42   3
  p43 = p51 + p50    x  p51   1     x  p50   2     p43   1
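The three-instruction example can be replayed with a short simulation. This is a sketch under assumptions: one instruction is selected per step, every operation is an add, and selection simply takes the first ready entry. The initial values (p52 = 8, p51 = 1, p50 = 2) come from the example; the final result for p42 is what the code computes from them.

```python
# Each RS entry: destination tag plus two [tag, value] sources.
# A value of None means the operand has not been broadcast yet.
rs = [
    {"dest": "p41", "srcs": [["p52", 8], ["p43", None]]},
    {"dest": "p42", "srcs": [["p41", None], ["p43", None]]},
    {"dest": "p43", "srcs": [["p51", 1], ["p50", 2]]},
]


def step(rs):
    """One wakeup-select-execute iteration; returns (dest, result) or None."""
    ready = [e for e in rs if all(v is not None for _, v in e["srcs"])]
    if not ready:
        return None
    inst = ready[0]                             # select: first ready entry
    result = sum(v for _, v in inst["srcs"])    # execute: all ops are adds
    rs.remove(inst)                             # deallocate the RS entry
    for e in rs:                                # wakeup: broadcast dest tag
        for src in e["srcs"]:
            if src[0] == inst["dest"] and src[1] is None:
                src[1] = result
    return (inst["dest"], result)


trace = []
while (r := step(rs)) is not None:
    trace.append(r)

print(trace)  # [('p43', 3), ('p41', 11), ('p42', 14)]
```

The instructions issue in dataflow order (p43, then p41, then p42) even though that is not the order in which they appear in the RS.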

Selection Logic
• Why do we need it?
  • More instructions may “wake up” than we have resources to execute them
• Which is the best instruction to choose?
  • The best one
    • The inst that will result in the shortest run-time, i.e., the one on the program's critical path
    • Computationally infeasible to identify this instruction
  • Random
    • A suitable baseline
  • The one closest to the left/right side of RS pool
    • Simple to implement, only requires priority select logic
    • Similar to random, due to out-of-order deallocation of RS entries
  • Oldest First (inst closest to the Head of the ROB)
    • Slightly more complicated to implement than random techniques
    • Usually a good choice: tends to favor long-latency insts and insts with many dependents
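Two of the policies above, sketched as plain functions. The function names, the `slot_to_rob` mapping (RS slot to ROB index), and the `width` parameter (instructions issued per cycle) are assumptions for illustration. Age is measured as distance from the ROB Head, modulo the ROB size, so it stays correct when the circular queue wraps.

```python
def select_position(ready_slots, width):
    """Priority select: pick the ready entries closest to slot 0."""
    return sorted(ready_slots)[:width]


def select_oldest(ready_slots, slot_to_rob, head, rob_size, width):
    """Oldest first: pick the ready entries closest to the ROB Head."""
    def age(slot):
        # Distance from Head in the circular ROB; 0 is the oldest instruction.
        return (slot_to_rob[slot] - head) % rob_size
    return sorted(ready_slots, key=age)[:width]


# Three ready instructions; the ROB (size 8) has wrapped, with Head at 6.
ready = [0, 2, 3]
slot_to_rob = {0: 1, 2: 6, 3: 7}
print(select_position(ready, 1))                    # [0]
print(select_oldest(ready, slot_to_rob, 6, 8, 1))   # [2]
```

Here the position-based policy issues the instruction in slot 0, which is actually the youngest of the three, while oldest-first issues the instruction at the ROB Head.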

Window Size vs. Clock Speed
• Increasing the number of RS [Brainiac]
  • Longer broadcast paths
  • Thus more capacitance, and slower signal propagation
  • More ILP extracted
• Decreasing the number of RS [Speed Demon]
  • Shorter broadcast paths
  • Thus less capacitance, and faster signal propagation
  • Less ILP extracted
• Which approach is better and when?

Cross-cutting Issue: Misspeculation
• What are the impacts of misspeculation or exceptions?
  • When instructions are flushed from the pipeline, reclaim the RS entries they held
  • Otherwise, storage leaks in the microarchitecture
• Typical recovery approach
  • Checkpoint free map at potential fault/misspeculation points
  • Recover the RS free map associated with recovery PC
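A rough sketch of the checkpoint-and-recover idea for the RS free map. The class `RsFreeMap` and its methods are hypothetical, and keying checkpoints by PC is a simplification. One detail is an assumption of this sketch rather than something the slide states: when an older instruction releases its entry after a checkpoint was taken, the entry is also marked free in the saved checkpoints, so a later recovery does not resurrect it as busy.

```python
class RsFreeMap:
    """RS free map with checkpoints (hypothetical sketch)."""

    def __init__(self, size):
        self.free = [True] * size
        self.checkpoints = {}            # recovery PC -> saved free map

    def allocate(self):
        for i, is_free in enumerate(self.free):
            if is_free:
                self.free[i] = False
                return i
        return None                      # no free RS entry: stall

    def release(self, i):
        """Normal deallocation at WB; also patch live checkpoints."""
        self.free[i] = True
        for saved in self.checkpoints.values():
            saved[i] = True

    def checkpoint(self, pc):
        """Snapshot at a potential fault/misspeculation point."""
        self.checkpoints[pc] = list(self.free)

    def recover(self, pc):
        """Flush: restore the map saved for the recovery PC."""
        self.free = self.checkpoints.pop(pc)


fm = RsFreeMap(4)
older = fm.allocate()        # entry 0, before the branch
fm.checkpoint(0x40)          # branch at (hypothetical) PC 0x40
fm.allocate()                # entry 1, speculative
fm.allocate()                # entry 2, speculative
fm.release(older)            # older instruction completes normally
fm.recover(0x40)             # branch mispredicted: flush younger insts
print(fm.free)               # [True, True, True, True]
```

Without the patch in `release`, entry 0 would come back as busy after recovery even though no instruction owns it, which is exactly the storage leak the slide warns about.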

Discussion Points
• What about memory dependencies?
• We can deallocate RS out-of-order (which improves RS utilization), why not allocate them out-of-order as well?
• If we didn’t rename the registers, would the dynamic scheduler still work?
• Could the wakeup-select-execute loop be reduced to a wakeup-select loop with parallel execute?
