ECE 4100/6100 Advanced Computer Architecture Lecture 8 Dynamic Scheduling (II)

Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

Modern Processors
• Branch prediction results in speculative execution
• Speculative instructions (if wrongly speculated) must not alter the architectural state
  • Architectural registers
  • Memory
• Precise exceptions/interrupts are required

Modern Out-of-Order Core
• ALLOC: allocates instructions
• RAT: Register Alias Table, renames architectural registers
• RS: Reservation Station, issues instructions to functional units
• ROB: Reorder Buffer, maintains state information (physical registers) for precise interrupts and speculative execution
• ARF: Architectural Register File
• LSQ: Load Store Queue, maintains memory access ordering

Register Renaming
Architected registers: R0 to R7. Physical registers: T0 to Tn-1.

Original Code        Renamed Code
R2 = R1 + R3         T1 = R1 + R3
R4 = R2 - R6         R4 = T1 - R6
…                    …
R2 = R7 / R5         T20 = R7 / R5
BEQ R2, #1           BEQ T20, #1
…                    …
R2 = R4 * R1         T7 = R4 * R1
R6 = Load [R2]       R6 = Load [T7]

• The WAW and WAR hazards on R2 in the original code disappear after renaming: No False Dependencies!
• Sandy Bridge: 160 PRs for INT, 144 PRs for FP
Adapted from Prof. G. Loh’s Slides

Register Renaming: Mapping Mechanism
• Original instruction: Dest = Src1 op Src2
• Src1 → TagS1, Src2 → TagS2
• Dest → TagD, taken from the unmapped physical registers
• Renamed instruction: TagD = TagS1 op TagS2
• Repeat for each instruction
Adapted from Prof. G. Loh’s Slides

P6 Style Register Renaming (also used by HP PA-8000 and PPC604)
Register Alias Table (RAT)
• Use a lookup table for renaming
• One entry per architectural register
• Each entry maps to the most recent version of the architectural register, which could be in
  • Physical register file
  • Architectural register file
[Figure: RAT with entries EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP; ROB (40 entries) with Data and Status fields; RRF]

RAT Example
                                   RAT after renaming
Instruction     Renamed            R0  R1  R2  R3  R4  R5  R6  R7   Free PRegs
(initial)                          -   -   -   -   -   -   -   -    T13, T14, T15, T16
R1 = R2 + R3    T13 = R2 + R3      -   13  -   -   -   -   -   -    T14, T15, T16
R5 = R4 – R1    T14 = R4 – T13     -   13  -   -   -   14  -   -    T15, T16
R1 = R1 * R5    T15 = T13 * T14    -   15  -   -   -   14  -   -    T16
R2 = R5 / R1    T16 = T14 / T15    -   15  16  -   -   14  -   -    (none)
Adapted from Prof. G. Loh’s Slides
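The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the hardware organization: the function name `rename`, the tuple layout `(dest, src1, src2)`, and the use of a dictionary for the RAT are assumptions made for illustration. Registers without a RAT entry are read from the ARF under their architectural name.

```python
from collections import deque

def rename(instructions, free_pregs):
    """Rename (dest, src1, src2) triples one at a time using a RAT (sketch)."""
    rat = {}                    # architectural register -> physical tag
    free = deque(free_pregs)    # unmapped physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources use the current mapping; unmapped registers stay in the ARF.
        tag_s1 = rat.get(src1, src1)
        tag_s2 = rat.get(src2, src2)
        # The destination always receives a fresh physical register.
        tag_d = free.popleft()
        rat[dest] = tag_d
        renamed.append((tag_d, tag_s1, tag_s2))
    return renamed, rat, list(free)

# The four instructions of the RAT example (operators omitted).
program = [("R1", "R2", "R3"), ("R5", "R4", "R1"),
           ("R1", "R1", "R5"), ("R2", "R5", "R1")]
renamed, rat, free = rename(program, ["T13", "T14", "T15", "T16"])
for row in renamed:
    print(row)
```

The sketch never returns registers to the free pool; a real design frees a physical register once the instruction that overwrote its architectural register has committed.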

Superscalar Rename
Instruction       Source tags (from RAT)   Dest tag (from free register pool)
R1 = R2 + R3      T16, T23                 T10
R4 = R5 – R7      T39, T7                  T31
R3 = R0 / R2      T14, T16                 T19
R5 = Ld 12[R6]    T5, X                    T6
• Don’t rename immediates (X)
• For an N-wide superscalar: 2N RAT read ports, N RAT write ports

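Intra-Group Dependencies
Instruction       Source tags (from RAT)   Dest tag (from free register pool)
R2 = R2 + R3      T16, T23                 T10
R4 = R5 – R7      T39, T7                  T31
R3 = R0 / R2      T14, T16                 T19
R5 = Ld 12[R6]    T5, X                    T6
• R3 = R0 / R2 reads T16 for R2 from the RAT: this is the wrong version of R2
• It should be using the version of R2 written by R2 = R2 + R3 in the same group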

Intra-Group Dependencies
Instruction       Source tags read from RAT   Result of sequential renaming (correct)
R1 = R2 + R1      T16, T34                    T10 = T16 + T34
R2 = R1 – R2      T34, T16                    T31 = T10 – T16
R1 = R2 / R1      T16, T34                    T19 = T31 / T10
R1 = R2 >> R1     T16, T34                    T6 = T31 >> T19
Destination tags T10, T31, T19, T6 come from the free register pool.

Resolving Intra-Group Dependencies
[Figure: Inst 0 to Inst 3 each present Src L, Src R, and Dest to the RAT and to the Intra-Group Dependency Checker. The RAT supplies tags T0L/T0R, T1L/T1R, T2L/T2R, T3L/T3R; Pdst0, Pdst1, Pdst2 come from the free register pool.]
Adapted from Prof. G. Loh’s Slides

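Intra-Group Dependency Checking
[Figure: comparators (=) match the source registers of later instructions (src1L, src1R, src2L, src2R, src3L, src3R) against the destination registers of earlier instructions (dst0, dst1, dst2). Multiplexers then choose between the RAT tags (T1L, T1R, T2L, T2R, T3L, T3R) and Pdst0, Pdst1, Pdst2 to produce R1L, R1R, R2L, R2R, R3L, R3R.]
Adapted from Prof. G. Loh’s Slides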

Mapping Selection
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1    (only this mapping for R1 should be written into the RAT)
• Condition: use a mapping only if the instruction is the last writer to the register
[Figure: comparators (!=) check dst0, dst1, dst2 against the dsts of later instructions; a priority encoder produces "use pdst0", "use pdst1", "use pdst2"; "use pdst3" is tied to 1]
Adapted from Prof. G. Loh’s Slides
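The dependency check and the last-writer rule can be modeled together. The following is a rough software sketch under stated assumptions: the function name `rename_group`, the tuple layout `(dest, src_l, src_r)`, and the initial mappings R1→T34 and R2→T16 (read off the earlier intra-group example) are illustrative; immediates are not handled, and the loops stand in for what hardware does with parallel comparators.

```python
def rename_group(group, rat, free_tags):
    """Rename a group of (dest, src_l, src_r) instructions in one step (sketch)."""
    # 1. All RAT reads see the mapping from before the group.
    looked_up = [(rat[sl], rat[sr]) for _, sl, sr in group]
    # 2. Each instruction takes one destination tag from the free pool.
    pdst = list(free_tags[:len(group)])
    renamed = []
    for i, (dest, src_l, src_r) in enumerate(group):
        tags = list(looked_up[i])
        for k, src in enumerate((src_l, src_r)):
            # Dependency checker: the nearest older writer in the group wins.
            for j in range(i - 1, -1, -1):
                if group[j][0] == src:
                    tags[k] = pdst[j]
                    break
        renamed.append((pdst[i], tags[0], tags[1]))
    # 3. Mapping selection: only the last writer of a register updates the RAT.
    new_rat = dict(rat)
    for i, (dest, _, _) in enumerate(group):
        if all(group[j][0] != dest for j in range(i + 1, len(group))):
            new_rat[dest] = pdst[i]
    return renamed, new_rat

group = [("R1", "R2", "R1"), ("R2", "R1", "R2"),
         ("R1", "R2", "R1"), ("R1", "R2", "R1")]
renamed, new_rat = rename_group(group, {"R1": "T34", "R2": "T16"},
                                ["T10", "T31", "T19", "T6"])
```

With these inputs the renamed sources match the sequential-renaming result, and only the mappings R1→T6 and R2→T31 reach the RAT.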

Issue with Imprecise Interrupt
• Add instructions take one cycle
• E.g.,
  • Load (left side) induces a “data page fault”
  • Add (right side) induces an “instruction page fault”
• If out-of-order completion is allowed
  • r10, r12 (or r2, r4) … will be modified
  • Wrong values will be used by the re-issued load
• Interrupt classes
  • Program interrupts (exceptions or traps)
  • External interrupts (asynchronous)

Left side:                 Right side:
lw  r5, 8(r10)             L1: add r3, r1, r2
add r10, r9, r8                add r4, r1, r4
add r12, r10, r7               add r2, r4, r4
Right-side figure labels: End of Non-Resident Page X; Instruction Page Fault; Start of Resident Page X+1

Precise Interrupts
• To reflect a sequential architecture model → serially correct (think about a single-issue, non-pipelined processor)
• Keep a “precise state” of execution
  • All instructions before the interrupted instruction must be completed
  • The state should appear as if no instruction after the interrupted instruction has been issued
  • The interrupted PC should be presented to the interrupt handler (restartable)
• Similar to branch misprediction handling
• Out-of-order execution makes the ordering hard
  • Undo what comes after an interrupt

Why Support Precise Interrupts
• Need to maintain a precise state (for recovery)
  • Software debugging
  • I/O or timer interrupts
  • Virtual memory (page fault)
  • Instruction emulation
  • Virtual machines

Supporting Precise Interrupts
• Buffer results
• Can reconstruct the scenario (state) as sequential execution
• Restart from saved PC with saved PC state

Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88]
• Architecture Register File keeps the “in-order state”
• Reorder Buffer (ROB)
  • A circular buffer
  • Contains all in-flight instructions
  • Buffers the “lookahead state”
  • In-order allocation/deallocation with head/tail pointers
• When an exception occurs
  • Halt instruction issue
  • Revert to the in-order state using the RF and discard ROB results
• Also used for branch misprediction recovery
• Pentium Pro/II/III integrates the physical register file within the ROB
• Pentium 4 decouples the ROB and the physical register file

Reorder Buffer (with physical registers)
• Fields per entry: PC, V, Spec?, Done?, Exp event, RegDst, Data (physical register)
• Head: oldest instruction
• Tail: next instruction to be allocated
• Sandy Bridge: 168-entry ROB

Handling Precise Interrupts
ROB entry fields: PC, V, Spec?, Done?, Exp event, RegDst, Data (physical register). Initial ARF: R1=1, R2=2, R3=3, R4=4.
1. xA000 R1=R1+10, xA004 R2=R2*2, and xA008 FR1=FR2/0.0 are allocated. R1=R1+10 completes (Done?=1, Data=11) and retires from the Head: ARF R1=11.
2. xA00C R3=R3+1 is allocated at the Tail. The Head is now xA004.
3. R3=R3+1 completes out of order (Done?=1, Data=4). xA010 R4=R4*2 is allocated.
4. R2=R2*2 completes (Data=4), FR1=FR2/0.0 records an exception (Exp event=0010), R4=R4*2 completes (Data=8), and xA014 FR4=FR4*2.0 is allocated.
5. R2=R2*2 retires from the Head: ARF R2=4.
6. FR1=FR2/0.0 reaches the Head: exception detected.

ROB when the exception is detected:
PC      Instruction    V  Spec?  Done?  Exp event  RegDst  Data
xA008   FR1=FR2/0.0    1  0      0      0010       FR1             (Head)
xA00C   R3=R3+1        1  0      1      0000       R3      4
xA010   R4=R4*2        1  0      1      0000       R4      8
xA014   FR4=FR4*2.0    1  0      0      0000       FR4
ARF: R1=11, R2=4, R3=3, R4=4

• The values 4 (R3) and 8 (R4) were not committed into the RF
• Back up the “PC” and the current RF
• Depending on the exception, the process will either abort or execution will resume from the excepting instruction
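The in-order commit behavior in this walkthrough can be sketched in Python. This is a simplified model under stated assumptions: the class and field names (`ReorderBuffer`, `RobEntry`, `commit`) are hypothetical, a `deque` stands in for the circular buffer with head/tail pointers, and an exception is acted on only when its entry reaches the head.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    pc: int
    dest: str
    done: bool = False
    exception: bool = False
    value: Optional[int] = None

class ReorderBuffer:
    def __init__(self, arf):
        self.arf = dict(arf)      # in-order (committed) state
        self.entries = deque()    # left end = head (oldest), right end = tail

    def allocate(self, pc, dest):
        entry = RobEntry(pc, dest)
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire finished entries from the head; return the faulting PC, if any."""
        while self.entries and (self.entries[0].done or self.entries[0].exception):
            head = self.entries[0]
            if head.exception:
                self.entries.clear()      # discard all lookahead state
                return head.pc
            self.arf[head.dest] = head.value
            self.entries.popleft()
        return None

rob = ReorderBuffer({"R1": 1, "R2": 2, "R3": 3, "R4": 4})
i0 = rob.allocate(0xA000, "R1")
i1 = rob.allocate(0xA004, "R2")
i2 = rob.allocate(0xA008, "FR1")
i3 = rob.allocate(0xA00C, "R3")
i4 = rob.allocate(0xA010, "R4")

i0.done, i0.value = True, 11
rob.commit()                       # R1 retires
i3.done, i3.value = True, 4        # completes out of order
i4.done, i4.value = True, 8
i2.exception = True                # FR1 = FR2 / 0.0
blocked = rob.commit()             # head (R2=R2*2) is not done yet
i1.done, i1.value = True, 4
fault_pc = rob.commit()            # R2 retires, then the exception is seen
```

After the last call the ARF holds R1=11 and R2=4, while the completed results for R3 and R4 were discarded with the rest of the ROB.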

Handling Speculative Execution
Initial ARF: R1=1, R2=2, R3=3, R4=4.
1. xB000 R1=R1+10 and xB004 BEQ R1, R0, L1 are allocated (Spec?=0).
2. BEQ R1, R0, L1 is predicted TAKEN. Instructions fetched after it are allocated with Spec?=1:
   PC      Instruction       V  Spec?  Done?  Exp event  RegDst  Data
   xB000   R1=R1+10          1  0      0      0000       R1              (Head)
   xB004   BEQ R1, R0, L1    1  0      0      0000
   xC100   R2=R3 << 2        1  1      1      0000       R2      32
   xC104   R1=R2*R3          1  1      0      0000       R1
   xD2AC   BEQ R3, R0, L1    1  1      0      0000
   xD2B0   R1=R7+1           1  1      1      0000       R1      28
3. R1=R1+10 retires (ARF R1=11). BEQ R1, R0, L1 is resolved: actually NOT TAKEN, a misprediction.
4. Retire the branch and clear all entries after the mis-speculated branch.
5. Continue execution from the correct path (fall through in this case): xB008 R2=R5 << 4 is allocated with Spec?=0.
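The flush in this sequence amounts to truncating the ROB just after the mispredicted branch. Below is a minimal sketch, assuming the ROB is a Python list ordered oldest first and that the PCs of in-flight entries are unique; the function name `flush_wrong_path` and the dictionary keys are hypothetical.

```python
def flush_wrong_path(rob, branch_pc):
    """Remove and return all entries allocated after the branch at branch_pc."""
    idx = next(i for i, e in enumerate(rob) if e["pc"] == branch_pc)
    squashed = rob[idx + 1:]
    del rob[idx + 1:]
    return squashed

# ROB contents when the misprediction is discovered (oldest first).
rob = [
    {"pc": 0xB004, "inst": "BEQ R1, R0, L1", "spec": False},
    {"pc": 0xC100, "inst": "R2=R3 << 2", "spec": True},
    {"pc": 0xC104, "inst": "R1=R2*R3", "spec": True},
    {"pc": 0xD2AC, "inst": "BEQ R3, R0, L1", "spec": True},
    {"pc": 0xD2B0, "inst": "R1=R7+1", "spec": True},
]
squashed = flush_wrong_path(rob, 0xB004)
rob.pop(0)                                   # retire the resolved branch
rob.append({"pc": 0xB008, "inst": "R2=R5 << 4", "spec": False})
```

Because the squashed entries never reached the head, none of their results (32 and 28 in the example) were written to the ARF.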

RAT Recovery
• ARF state corresponds to the state prior to the oldest non-committed instruction
• As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
• On a branch misprediction, wrong-path instructions are flushed from the machine
• The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Adapted from Prof. G. Loh’s Slide

Solution: Stall and Drain
• Correct-path instructions arrive from fetch (starting with foo) but can’t be renamed because the RAT is wrong
• Allow all instructions to execute and commit; the ARF corresponds to the last committed instruction
• The ARF now corresponds to the state right before the next instruction to be renamed (foo)
• Reset the RAT so that all mappings refer to the ARF
• Resume renaming the new correct-path instructions from fetch
• Pros: very simple to implement
• Cons: performance loss due to stalls

Another Solution: Checkpointing
• At each branch, make a copy of the RAT (the register mapping at the time of the branch)
• On a misprediction:
  1. Flush wrong-path instructions
  2. Deallocate RAT checkpoints
  3. Recover RAT from checkpoint
  4. Resume renaming
[Figure: each in-flight branch (br) holds a RAT copy taken from the Checkpoint Free Pool; foo is the first correct-path instruction]
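The checkpoint scheme can be sketched as a small Python class. The class and method names (`CheckpointedRat`, `on_branch`, `on_mispredict`) and the branch identifiers are assumptions for illustration; the sketch copies whole dictionaries, ignores the limited size of the checkpoint pool, and does not model free-list recovery.

```python
class CheckpointedRat:
    def __init__(self, mapping):
        self.rat = dict(mapping)
        self.checkpoints = []          # (branch_id, RAT copy), oldest first

    def rename_dest(self, arch_reg, tag):
        self.rat[arch_reg] = tag

    def on_branch(self, branch_id):
        # Snapshot the mapping as it stands when the branch is renamed.
        self.checkpoints.append((branch_id, dict(self.rat)))

    def on_correct(self, branch_id):
        # A correctly predicted branch no longer needs its checkpoint.
        self.checkpoints = [c for c in self.checkpoints if c[0] != branch_id]

    def on_mispredict(self, branch_id):
        idx = next(i for i, c in enumerate(self.checkpoints) if c[0] == branch_id)
        self.rat = self.checkpoints[idx][1]   # recover RAT from checkpoint
        del self.checkpoints[idx:]            # drop this and younger checkpoints

rat = CheckpointedRat({"R1": "T1", "R2": "T2"})
rat.on_branch("br0")
rat.rename_dest("R1", "T7")      # wrong-path rename
rat.on_branch("br1")
rat.rename_dest("R2", "T8")      # wrong-path rename
rat.on_mispredict("br0")
```

Unlike stall and drain, renaming can resume immediately after `on_mispredict`, at the cost of storing one RAT copy per in-flight branch.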

Modern Instruction Scheduler
• At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo’s algorithm)
• Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast)
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (wakeup and select)
[Figure: Fetch & Dispatch, ARF, PRF/ROB, Instruction Scheduler, Functional Units, with Bypass and Physical register update paths]
Adapted from Prof. G. Loh’s Slide

Instruction Scheduling: Wakeup and Select
• Wakeup logic
  • Notifies waiting instructions when the data dependences of their input operands are resolved
  • Wakes up instructions with zero outstanding input dependences
• Select logic
  • Chooses and fires ready instructions
  • Deals with structural hazards
• Wakeup-select is likely on the critical path
  • Associative match
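A wakeup-select loop can be modeled in a few lines. This is a toy sketch under stated assumptions: the tags, the dictionary layout of a reservation station entry, single-cycle execution latency, and a first-ready select policy are all hypothetical choices, not taken from the slides.

```python
def wakeup(entries, broadcast_tags):
    """Associative match: clear every source tag that was just broadcast."""
    for entry in entries:
        entry["waiting"] -= set(broadcast_tags)

def select(entries, issue_width):
    """Pick up to issue_width ready entries and remove them from the RS."""
    ready = [e for e in entries if not e["waiting"]]
    chosen = ready[:issue_width]
    for e in chosen:
        entries.remove(e)
    return chosen

rs = [
    {"dest": "T50", "waiting": {"T39"}},
    {"dest": "T51", "waiting": {"T39", "T6"}},
    {"dest": "T52", "waiting": {"T50"}},
]
issued = []
broadcast = ["T39"]                  # tag produced by an earlier instruction
for cycle in range(3):
    wakeup(rs, broadcast)
    chosen = select(rs, issue_width=1)
    issued.append([e["dest"] for e in chosen])
    broadcast = [e["dest"] for e in chosen]   # assumes 1-cycle latency
```

Each issued instruction broadcasts its destination tag, which is what lets a dependent instruction issue in the very next cycle; this loop is why wakeup and select sit on the critical path.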

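Scalar Scheduler (Issue Width = 1)
[Figure: reservation station entries compare (=) their source tags against the single Tag Broadcast Bus; the Select Logic picks one ready entry and sends it to the Execute Logic. Tags shown in the snapshot: T14, T16, T8, T17, T39, T6, T42, T15.]
From Prof. G. Loh’s Slide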

Superscalar Scheduler (Issue Width = 4)
[Figure: snapshot of the RS (only 4 entries shown). Each source tag has four comparators (=), one per line of Tag Broadcast Bus [3..0]; the Select Logic sends ready entries to the Execute Logic. Tags shown in the snapshot: T14, T16, T8, T17, T39, T6, T42, T15.]
Adapted from Prof. G. Loh’s Slide

Selection Logic
• Select ready instructions to be issued
• Goal: reduce the height of the data-flow graph (DFG)
• Methods
  • Location-based (e.g., leftmost ready first)
    • Allows simple, faster hardware
  • Oldest ready first
    • Can use location-based (in-order issue) with “compaction”
    • Can be slow and complex
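The two selection methods can be contrasted in a short Python sketch. The function names, the request and age lists, and the convention that a smaller age value means an older instruction are assumptions for illustration only.

```python
def select_leftmost(requests):
    """Location-based: grant the lowest-index slot that is requesting."""
    for slot, req in enumerate(requests):
        if req:
            return slot
    return None

def select_oldest(requests, ages):
    """Oldest ready first: grant the requesting slot with the smallest age."""
    ready = [slot for slot, req in enumerate(requests) if req]
    return min(ready, key=lambda s: ages[s]) if ready else None

def compact(slots):
    """Shift valid entries toward slot 0 while preserving their order."""
    valid = [s for s in slots if s is not None]
    return valid + [None] * (len(slots) - len(valid))

requests = [False, True, False, True]
ages = [3, 7, 1, 2]                       # hypothetical: smaller = older
by_position = select_leftmost(requests)
by_age = select_oldest(requests, ages)
packed = compact(["i5", None, "i7", None])
```

If new entries are always inserted after the last valid slot and the window is compacted, position encodes age, so the simple leftmost-first rule then also picks the oldest ready instruction.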

Simple Select Logic Implementation [Palacharla ISCA’97]
[Figure: tree-like arbitrated selection logic over the Reservation Station. Each arbiter cell takes Req0 to Req3 and an Enable input, reports AnyQueue to its parent, and uses a priority decoder to produce Grant0 to Grant3 (Grt0 to Grt3). The Enable input at the root of the tree is tied to 1.]

Distributed Instruction Windows (e.g., MIPS R10000 or Alpha 21264)
• Separate reservation stations issue to distinct functional units
• Faster to have separate instruction schedulers for different instruction types

Dual Issue to Multiple Units (e.g., 2 Adders) [Palacharla Dissertation]
[Figure: two select blocks, each with inputs Req0 to Req3 and outputs Grant0 to Grant3]

Memory Disambiguation
• Can we “undo” stores?
• Stores cannot be committed to memory until they are marked ready to retire
• Completed stores are queued and wait in a store queue or store buffer
• Disambiguate (and resolve) memory dependences dynamically

Memory Ordering
• A younger load to address X bypassing an older load to the same address X violates certain memory consistency models (e.g., sequential consistency)
• Load-load order trap replays
Source: Alpha 21264 HRM

Load Store Queue (LSQ)
• Memory instructions are allocated into the LSQ in program order
• LSQ manages memory reference ordering
• Unified LSQ vs. split LSQ
• Sandy Bridge: 64 load buffers, 36 store buffers
[Figure: ALLOC, RS, age-ordered ROB, and a split LSQ (Store Queue and Load Queue)]

Issuing a Load for Execution
• Each load checks against older stores
  • Associative search
  • A scalability (performance) issue
[Figure: Store Queue entries with Issued?, age, address, and data fields; Load Queue entries with Issued?, age, and address fields; one load is issued to memory for execution while another (???) is still undecided]

Issuing a Load for Execution: Store-to-Load Forwarding
• Implementation dependent: comprehensive size matching can be prohibitively expensive
• Simple method: forward when a larger store (word) precedes a smaller load (half)
[Figure: Store Queue entries with Issued?, age, address, and data fields; Load Queue entries with Issued?, age, and address fields]
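The older-store check and forwarding decision can be sketched as follows. This is a conservative toy model, not the policy of any particular processor: the function name `issue_load`, the queue layout, and the example addresses and data are hypothetical, all accesses are assumed to be the same size, and a matching store is assumed to have its data available.

```python
def issue_load(load_age, load_addr, store_queue):
    """Decide how a load is serviced: 'forward', 'stall', or 'memory'."""
    older = [s for s in store_queue if s["age"] < load_age]
    # Search youngest-first so the most recent matching store wins.
    for store in sorted(older, key=lambda s: s["age"], reverse=True):
        if store["addr"] is None:
            return ("stall", None)        # unknown address may alias
        if store["addr"] == load_addr:
            return ("forward", store["data"])
    return ("memory", None)

store_queue = [
    {"age": 0, "addr": 0xA0, "data": 0x00000001},
    {"age": 1, "addr": 0xB0, "data": 0x12340000},
    {"age": 3, "addr": None, "data": None},   # address not computed yet
]
hit = issue_load(2, 0xB0, store_queue)     # older store to 0xB0 exists
miss = issue_load(2, 0xC0, store_queue)    # no older store matches
wait = issue_load(4, 0xA0, store_queue)    # blocked by the unresolved store
```

The linear scan stands in for the associative search over the store queue; in hardware every older entry is compared at once, which is the source of the scalability concern noted above.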
