Lecture 5. Dynamic Scheduling II

COM515 Advanced Computer Architecture Lecture 5. Dynamic Scheduling II Prof. Taeweon Suh Computer Science Education Korea University

Modern Processors
• Branch prediction results in speculative execution
• Speculative instructions (if wrongly speculated) must not alter the architectural state
  • Architectural registers
  • Memory
• Requirement of precise exceptions/interrupts
Prof. Sean Lee’s Slide

Modern Out-of-Order Core
• ALLOC: allocates instructions
• RAT (Register Alias Table): renames architectural registers
• RS (Reservation Station): issues instructions to functional units
• ROB (Reorder Buffer): maintains state information (physical registers) for precise interrupts and speculative execution
• ARF: architectural register file
• LSQ (Load Store Queue): maintains memory access ordering
Prof. Sean Lee’s Slide

Register Renaming
Architectural registers: R0 to R7. Physical registers: T0 to Tn-1.
Original Code        Renamed Code
R2 = R1 + R3         T1 = R1 + R3
R4 = R2 - R6         R4 = T1 - R6
…                    …
R2 = R7 / R5         T20 = R7 / R5
BEQ R2, #1           BEQ T20, #1
…                    …
R2 = R4 * R1         T7 = R4 * R1
R6 = Load [R2]       R6 = Load [T7]
• The WAW and WAR dependencies on R2 in the original code disappear after renaming: no false dependencies!
• Sandy Bridge: 160 PRs for INT, 144 PRs for FP
Adapted from Prof. G. Loh’s Slides

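Register Renaming
For each instruction Dest = Src1 op Src2:
• Look up the sources in the mapping mechanism: Src1 → TagS1, Src2 → TagS2
• Take TagD from the unmapped physical registers and record Dest → TagD
• The renamed instruction is TagD = TagS1 op TagS2
Repeat for each instruction
Adapted from Prof. G. Loh’s Slides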

Register Alias Table (RAT)
P6 Style Register Renaming (also used by HP PA-8000, PPC604)
• Use a lookup table for renaming
• One entry per architectural register (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP)
• Each entry maps to the most recent version of the architectural register, which could be in
  • Physical register file
  • Architectural register file
Figure labels: RAT, ROB (40 entries; Data, Status), RRF (Retirement Register File)
Prof. Sean Lee’s Slide

RAT Example
Initially every RAT entry (R0 to R7) is “-”, i.e., the register is read from the ARF.
Free physical regs (before)   Original code    Renamed code       RAT update
T13, T14, T15, T16            R1 = R2 + R3     T13 = R2 + R3      R1 → T13
T14, T15, T16                 R5 = R4 - R1     T14 = R4 - T13     R5 → T14
T15, T16                      R1 = R1 * R5     T15 = T13 * T14    R1 → T15
T16                           R2 = R5 / R1     T16 = T14 / T15    R2 → T16
Adapted from Prof. G. Loh’s Slides
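
The renaming steps in the RAT example can be replayed with a short Python sketch. This is only an illustration, not the lecture's implementation: the function name `rename`, the tuple encoding of instructions, and the convention that an unmapped register keeps its architectural name (meaning "read it from the ARF") are assumptions made here.

```python
def rename(instrs, rat, free_regs):
    """Rename (dest, src1, op, src2) tuples one instruction at a time."""
    rat = dict(rat)          # architectural register -> physical tag
    free = list(free_regs)   # unmapped physical registers, allocated in order
    renamed = []
    for dest, src1, op, src2 in instrs:
        # Sources use the current mapping; an unmapped register is read from the ARF.
        tag1 = rat.get(src1, src1)
        tag2 = rat.get(src2, src2)
        tag_d = free.pop(0)  # take a fresh physical register for the destination
        rat[dest] = tag_d    # later readers of dest now see the new tag
        renamed.append(f"{tag_d} = {tag1} {op} {tag2}")
    return renamed, rat, free


program = [
    ("R1", "R2", "+", "R3"),
    ("R5", "R4", "-", "R1"),
    ("R1", "R1", "*", "R5"),
    ("R2", "R5", "/", "R1"),
]
renamed, final_rat, free_left = rename(program, {}, ["T13", "T14", "T15", "T16"])
for line in renamed:
    print(line)
```

The sketch never returns a physical register to the free pool; in a real core that happens when an instruction retires.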

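Superscalar Rename
Group: R1 = R2 + R3; R4 = R5 - R7; R3 = R0 / R2; R5 = Ld 12[R6]
• Source tags are read from the RAT (T16, T39, T14, T5, T23, T7, T16 in the figure); X marks the immediate: don’t rename immediates
• Destination tags come from the free register pool (T10, T31, T19, T6)
• For an N-wide superscalar: 2N RAT read ports, N RAT write ports
Prof. Sean Lee’s Slide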

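Intra-Group Dependencies
Group: R2 = R2 + R3; R4 = R5 - R7; R3 = R0 / R2; R5 = Ld 12[R6]
• Tags read from the RAT in the figure: T16, T39, T14, T5, T23, T7, T16, X; tags from the free register pool: T10, T31, T19, T6
• The third instruction (R3 = R0 / R2) gets its R2 tag from the RAT: this is the wrong version of R2
• It should be using the version of R2 produced by the first instruction in the group (R2 = R2 + R3)
Prof. Sean Lee’s Slide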

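Intra-Group Dependencies
RAT before the group: R1 → T34, R2 → T16. Tags from the free register pool: T10, T31, T19, T6.
Original code      Tags read from the RAT    Result of sequential renaming
R1 = R2 + R1       T16, T34                  T10 = T16 + T34
R2 = R1 - R2       T34, T16                  T31 = T10 - T16
R1 = R2 / R1       T16, T34                  T19 = T31 / T10
R1 = R2 >> R1      T16, T34                  T6 = T31 >> T19
The right-hand column shows the correct final renamed registers.
Modified from Prof. Sean Lee’s Slide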

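Resolving Intra-Group Dependencies
• Inst 0 to Inst 3 each present Src L, Src R, and Dest to the RAT and to the intra-group dependency checker
• The RAT supplies the source tags T0L, T0R, T1L, T1R, T2L, T2R, T3L, T3R
• The free register pool supplies the destination tags Pdst0, Pdst1, Pdst2 (and Pdst3)
• The checker overrides a RAT tag with the Pdst of an older instruction in the same group when that instruction writes the source register
Adapted from Prof. G. Loh’s Slides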

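Intra-Group Dependency Checking
• Comparators (=) match each source of Inst 1 to Inst 3 (src1L, src1R, src2L, src2R, src3L, src3R) against the destinations of the older instructions in the group (dst0, dst1, dst2): 12 comparators for a 4-wide group
• A multiplexer per source selects either the RAT tag (T1L, T1R, T2L, T2R, T3L, T3R) or the matching Pdst0, Pdst1, Pdst2, producing the final source tags R1L, R1R, R2L, R2R, R3L, R3R
• The sources of Inst 0 (src0L, src0R) always use the RAT tags
Adapted from Prof. G. Loh’s Slides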

Mapping Selection
Group: R1 = R2 + R1; R2 = R1 - R2; R1 = R2 / R1; R1 = R2 >> R1
• Condition: use a mapping only if the instruction is the last writer to the register
  • use pdst0 if dst0 != dst1, dst0 != dst2, and dst0 != dst3
  • use pdst1 if dst1 != dst2 and dst1 != dst3
  • use pdst2 if dst2 != dst3
  • always (1) use pdst3
• In the example, only the mapping of the last instruction for R1 should be written into the RAT
Adapted from Prof. G. Loh’s Slides
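
The dependency check and the last-writer rule can be combined into one small Python sketch that renames a whole group "in one cycle". It is a behavioural model under stated assumptions, not the hardware: `rename_group` and the `(dest, srcL, srcR)` tuple encoding are hypothetical, and the initial mappings R1 → T34, R2 → T16 are read off the intra-group dependency figure.

```python
def rename_group(group, rat, free_regs):
    """Rename a group of (dest, src_l, src_r) instructions against one RAT snapshot."""
    pdst = free_regs[:len(group)]      # one new tag per instruction, in program order
    renamed = []
    for i, (dest, src_l, src_r) in enumerate(group):
        tags = []
        for src in (src_l, src_r):
            tag = rat[src]             # RAT read, possibly stale within the group
            # Dependency checker: the nearest older writer in the group wins.
            for j in range(i):
                if group[j][0] == src:
                    tag = pdst[j]
            tags.append(tag)
        renamed.append((pdst[i], tags[0], tags[1]))
    new_rat = dict(rat)
    for i, (dest, _, _) in enumerate(group):
        # Mapping selection: write the RAT only for the last writer of each register.
        if all(group[k][0] != dest for k in range(i + 1, len(group))):
            new_rat[dest] = pdst[i]
    return renamed, new_rat


group = [("R1", "R2", "R1"), ("R2", "R1", "R2"), ("R1", "R2", "R1"), ("R1", "R2", "R1")]
renamed, new_rat = rename_group(group, {"R1": "T34", "R2": "T16"}, ["T10", "T31", "T19", "T6"])
```

The inner `for j` loop plays the role of the comparators, and the `all(...)` test plays the role of the `!=` chain in the mapping selection logic.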

Issue with Imprecise Interrupt
lw  r5, 8(r10)
add r10, r9, r8
add r12, r10, r7
• add instructions take one cycle
• E.g., the load (lw) induces a “data page fault”
• If out-of-order completion is allowed
  • r10 and r12 will be modified
  • Wrong values will be used by the re-issued load
• Interrupt classes
  • Program interrupts (exceptions or traps)
  • External interrupts (asynchronous)
Modified from Prof. Sean Lee’s Slide

Precise Interrupts
• To reflect a sequential architecture model → serially correct (think about a single-issue, non-pipelined processor)
• Keep the “precise state” of an execution
  • All instructions before the interrupted instruction must be completed
  • The state should appear as if no instruction was issued after the interrupted instruction
  • The interrupted PC should be presented to the interrupt handler (restartable)
• Similar to branch misprediction handling
  • Out-of-order execution makes the ordering hard
  • Undo what comes after an interrupt
Prof. Sean Lee’s Slide

Why Support Precise Interrupts
• Need to maintain a precise state (for recovery)
  • Software debugging
  • I/O or timer interrupts
  • Virtual memory (page fault)
  • Instruction emulation
  • Virtual machines
Prof. Sean Lee’s Slide

Support Precise Interrupt
• Buffer results
• Can reconstruct the scenario (state) as in sequential execution
• Restart from the saved PC with the saved state
Prof. Sean Lee’s Slide

Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88]
• The architectural register file keeps the “in-order state”
• Reorder Buffer (ROB)
  • A circular buffer
  • Contains all in-flight instructions
  • Buffers the “lookahead state”
  • In-order allocation/deallocation with head/tail pointers
• When an exception occurs
  • Halt instruction issue
  • Revert to the in-order state using the RF and discard the ROB results
• Also used for branch misprediction recovery
• Pentium Pro/II/III integrates the physical register file within the ROB
• Pentium 4 decouples the ROB and the physical register file
Modified from Prof. Sean Lee’s Slide

ROB (with physical registers)
• Entry fields: PC, V, Spec?, Done?, Exp event, RegDst, Data (physical register)
• Head: oldest instruction
• Tail: next instruction to be allocated
• Sandy Bridge: 168-entry ROB
Prof. Sean Lee’s Slide

Handling Precise Interrupts
      PC     Instruction   V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xA000  R1=R1+10      1 0     1      0000       R1      11
      xA004  R2=R2*2       1 0     0      0000       R2
      xA008  FR1=FR2/0.0   1 0     0      0000       FR1
ARF: R1 = 1 (overwritten with 11 when R1=R1+10 commits), R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Precise Interrupts
      PC     Instruction   V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xA004  R2=R2*2       1 0     0      0000       R2
      xA008  FR1=FR2/0.0   1 0     0      0000       FR1
      xA00C  R3=R3+1       1 0     0      0000       R3
ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Precise Interrupts
      PC     Instruction   V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xA004  R2=R2*2       1 0     0      0000       R2
      xA008  FR1=FR2/0.0   1 0     0      0000       FR1
      xA00C  R3=R3+1       1 0     1      0000       R3      4
      xA010  R4=R4*2       1 0     0      0000       R4
ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Precise Interrupts
      PC     Instruction   V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xA004  R2=R2*2       1 0     1      0000       R2      4
      xA008  FR1=FR2/0.0   1 0     0      0010       FR1
      xA00C  R3=R3+1       1 0     1      0000       R3      4
      xA010  R4=R4*2       1 0     1      0000       R4      8
      xA014  FR4=FR4*2.0   1 0     0      0000       FR4
ARF: R1 = 11, R2 = 2 (overwritten with 4 when R2=R2*2 commits), R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Precise Interrupts
      PC     Instruction   V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xA004  R2=R2*2       1 0     1      0000       R2      4
      xA008  FR1=FR2/0.0   1 0     0      0010       FR1
      xA00C  R3=R3+1       1 0     1      0000       R3      4
      xA010  R4=R4*2       1 0     1      0000       R4      8
      xA014  FR4=FR4*2.0   1 0     0      0000       FR4
ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Precise Interrupts: exception detected
      PC     Instruction   V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xA008  FR1=FR2/0.0   1 0     0      0010       FR1
      xA00C  R3=R3+1       1 0     1      0000       R3      4
      xA010  R4=R4*2       1 0     1      0000       R4      8
      xA014  FR4=FR4*2.0   1 0     0      0000       FR4
ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4, …, R31
• The values R3 = 4 and R4 = 8 were not committed into the RF
• Back up the “PC” and the current RF
• Depending on the exception, the process will either abort or execution will be resumed from the excepting instruction
Prof. Sean Lee’s Slide
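
The walkthrough above can be condensed into a small Python model of in-order commit. It is a sketch under simplifying assumptions: the class `ROB`, its field names, and the use of a `deque` in place of a circular buffer with head/tail pointers are all illustrative choices, and the event code is reduced to a boolean `exc` flag.

```python
from collections import deque


class ROB:
    def __init__(self, arf):
        self.arf = dict(arf)      # in-order (architectural) state
        self.entries = deque()    # lookahead state, oldest entry at the left (head)

    def allocate(self, pc, dest):
        entry = {"pc": pc, "dest": dest, "data": None, "done": False, "exc": False}
        self.entries.append(entry)   # in-order allocation at the tail
        return entry

    def commit(self):
        """Retire finished entries from the head; return the faulting PC, if any."""
        while self.entries and (self.entries[0]["done"] or self.entries[0]["exc"]):
            head = self.entries[0]
            if head["exc"]:
                self.entries.clear()          # discard all buffered results
                return head["pc"]             # PC handed to the exception handler
            self.arf[head["dest"]] = head["data"]
            self.entries.popleft()
        return None


rob = ROB({"R1": 1, "R2": 2, "R3": 3, "R4": 4})
e = [rob.allocate(pc, dest) for pc, dest in
     [(0xA000, "R1"), (0xA004, "R2"), (0xA008, "FR1"), (0xA00C, "R3"), (0xA010, "R4")]]
e[3].update(data=4, done=True)      # R3=R3+1 finishes early
e[4].update(data=8, done=True)      # R4=R4*2 finishes early
early = rob.commit()                # head not done yet: nothing retires
e[2].update(exc=True)               # FR1=FR2/0.0 raises an exception
e[0].update(data=11, done=True)
e[1].update(data=4, done=True)
fault_pc = rob.commit()
```

After the second `commit()`, the ARF holds R1 = 11 and R2 = 4 while R3 and R4 keep their old values, which matches the final slide of the walkthrough: results younger than the excepting instruction never reach the register file.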

Handling Speculative Execution
      PC     Instruction    V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xB000  R1=R1+10       1 0     0      0000       R1
      xB004  BEQ R1,R0,L1   1 0     0      0000
ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Speculative Execution
BEQ R1, R0, L1 is predicted TAKEN
      PC     Instruction    V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xB000  R1=R1+10       1 0     0      0000       R1
      xB004  BEQ R1,R0,L1   1 0     0      0000
      xC100  R2=R3<<2       1 1     1      0000       R2      12
      xC104  R1=R2*R3       1 1     0      0000       R1
      xC108  BEQ R3,R0,L1   1 1     0      0000
      xD2B0  R1=R7+1        1 1     1      0000       R1      8
ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4, …, R31
Modified from Prof. Sean Lee’s Slide

Handling Speculative Execution: BEQ misprediction
BEQ R1, R0, L1 is resolved, actually NOT TAKEN!
      PC     Instruction    V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xB004  BEQ R1,R0,L1   1 0     0      0000
      xC100  R2=R3<<2       1 1     1      0000       R2      12
      xC104  R1=R2*R3       1 1     0      0000       R1
      xC108  BEQ R3,R0,L1   1 1     0      0000
      xD2B0  R1=R7+1        1 1     1      0000       R1      8
ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Speculative Execution
Retire the branch and clear all entries after the mis-speculated branch
           PC     Instruction    V Spec? Done?  Exp event  RegDst  Data (physical register)
Head/Tail  xB004  BEQ R1,R0,L1   1 0     0      0000
ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide

Handling Speculative Execution
Continue execution from the correct path (fall-through in this case)
      PC     Instruction    V Spec? Done?  Exp event  RegDst  Data (physical register)
Head  xB008  R2=R5<<4       1 0     0      0000       R2
ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31
Prof. Sean Lee’s Slide
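
Misprediction recovery in the ROB amounts to cutting the buffer right after the branch. The Python sketch below illustrates that step only; the helper `squash_wrong_path`, the dictionary fields, and the 4-byte instruction size used to compute the fall-through PC are assumptions for the example.

```python
def squash_wrong_path(rob, branch_pc):
    """Split ROB entries (oldest first) into kept and flushed at a mispredicted branch."""
    idx = next(i for i, entry in enumerate(rob) if entry["pc"] == branch_pc)
    return rob[: idx + 1], rob[idx + 1 :]


arf = {"R1": 11, "R2": 2, "R3": 3, "R4": 4}
rob = [
    {"pc": 0xB004, "inst": "BEQ R1,R0,L1", "spec": False},
    {"pc": 0xC100, "inst": "R2=R3<<2", "spec": True, "data": 12},
    {"pc": 0xC104, "inst": "R1=R2*R3", "spec": True},
    {"pc": 0xC108, "inst": "BEQ R3,R0,L1", "spec": True},
    {"pc": 0xD2B0, "inst": "R1=R7+1", "spec": True, "data": 8},
]
kept, flushed = squash_wrong_path(rob, 0xB004)
next_fetch_pc = 0xB004 + 4   # branch was actually not taken: fetch the fall-through path
```

Because the speculative results (12 and 8) lived only in ROB entries, dropping `flushed` is enough; the ARF never saw them.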

RAT Recovery
• The ARF state corresponds to the state prior to the oldest non-committed instruction
• As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
• On a branch misprediction, wrong-path instructions are flushed from the machine
• The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Adapted from Prof. G. Loh’s Slide

Solution: Stall and Drain
• Correct-path instructions (foo) arrive from fetch but can’t be renamed because the RAT is wrong
• Allow all instructions to execute and commit; the ARF then corresponds to the last committed instruction
• The ARF now corresponds to the state right before the next instruction to be renamed (foo)
• Reset the RAT so that all mappings refer to the ARF
• Resume renaming the new correct-path instructions from fetch
• Pros: very simple to implement
• Cons: performance loss due to stalls
Prof. Sean Lee’s Slide

Another Solution: Checkpointing
• At each branch, make a copy of the RAT (the register mapping at the time of the branch)
• Checkpoints are taken from a checkpoint free pool
• On a misprediction:
  1. Flush wrong-path instructions
  2. Deallocate RAT checkpoints
  3. Recover the RAT from the checkpoint
  4. Resume renaming
Prof. Sean Lee’s Slide
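
A minimal Python sketch of RAT checkpointing follows. The class `CheckpointedRAT` and its method names are hypothetical, the checkpoint pool is modelled as an unbounded list, and free-list recovery for the physical registers is left out.

```python
class CheckpointedRAT:
    def __init__(self, mapping):
        self.mapping = dict(mapping)   # architectural register -> physical tag
        self.checkpoints = []          # (branch_id, RAT copy), oldest branch first

    def rename_dest(self, arch_reg, tag):
        self.mapping[arch_reg] = tag

    def on_branch(self, branch_id):
        # Copy the register mapping as it is at the time of the branch.
        self.checkpoints.append((branch_id, dict(self.mapping)))

    def on_correct(self, branch_id):
        # Correctly predicted branch: its checkpoint goes back to the free pool.
        self.checkpoints = [(b, m) for b, m in self.checkpoints if b != branch_id]

    def on_mispredict(self, branch_id):
        idx = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        self.mapping = self.checkpoints[idx][1]   # recover the RAT from the checkpoint
        del self.checkpoints[idx:]                # younger checkpoints are wrong-path too


rat = CheckpointedRAT({"R1": "T1", "R2": "T2"})
rat.on_branch("br0")
rat.rename_dest("R1", "T7")      # renamed after br0
rat.on_branch("br1")
rat.rename_dest("R2", "T8")      # renamed after br1
rat.on_mispredict("br0")         # br0 resolves as mispredicted
```

After the call to `on_mispredict`, renaming can resume immediately with the restored mapping, which is the advantage over stall-and-drain; the cost is the storage for one RAT copy per in-flight branch.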

Modern Instruction Scheduler
• At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo’s algorithm)
• Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast)
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (wakeup and select)
Figure blocks: Fetch & Dispatch, ARF, PRF/ROB, Instruction Scheduler, Functional Units, Bypass, Physical register update
Adapted from Prof. G. Loh’s Slide

Instruction Scheduling: Wakeup and Select
• Wakeup logic
  • Notifies waiting instructions that a data dependency on an input operand has been resolved
  • Wakes up instructions with zero remaining input dependencies
• Select logic
  • Chooses and fires ready instructions
  • Deals with structural hazards
• Wakeup-select is likely on the critical path
  • Associative match
Prof. Sean Lee’s Slide

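Scalar Scheduler (Issue Width = 1)
Figure: each RS entry holds two source tags (the snapshot shows T14, T16, T39, T8, T6, T42, T17, T15), and each source tag has one comparator (=) against the Tag Broadcast Bus. The Select Logic picks a ready entry and sends it to the execute logic.
From Prof. G. Loh’s Slide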

Superscalar Scheduler (Issue Width = 4)
Figure: snapshot of the RS (only 4 entries shown) with the same source tags as the scalar case (T14, T16, T39, T8, T6, T42, T17, T15). Each source tag now has four comparators (= = = =), one per lane of the Tag Broadcast Bus [3..0]. The Select Logic sends the chosen entries to the execute logic.
Adapted from Prof. G. Loh’s Slide
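
The wakeup step can be modelled as set subtraction: every broadcast tag is compared against every waiting source tag. The sketch below is illustrative only; `RSEntry`, `wakeup`, and the concrete tag assignment per entry are assumptions, since the per-entry layout of the figure is not recoverable from the transcript.

```python
class RSEntry:
    def __init__(self, name, src_tags, ready_tags):
        self.name = name
        # Only operands that were not available at dispatch need to be captured.
        self.waiting = {tag for tag in src_tags if tag not in ready_tags}

    @property
    def ready(self):
        return not self.waiting


def wakeup(entries, broadcast_tags):
    """Apply one cycle of tag broadcast; return names of entries that request selection."""
    for entry in entries:
        # Models the comparators: one match per (source tag, broadcast lane) pair.
        entry.waiting -= set(broadcast_tags)
    return [entry.name for entry in entries if entry.ready]


already_ready = {"T14"}
entries = [
    RSEntry("i0", ["T14", "T39"], already_ready),
    RSEntry("i1", ["T39", "T8"], already_ready),
    RSEntry("i2", ["T6", "T42"], already_ready),
]
after_first = wakeup(entries, ["T39"])          # scalar: one tag per cycle
after_second = wakeup(entries, ["T8", "T6"])    # superscalar: several tags per cycle
```

With issue width N, `broadcast_tags` carries up to N tags per cycle, which is why the hardware needs N comparators per source tag.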

Selection Logic
• Select ready instructions to be issued
• Goal: to reduce the height of the data-flow graph (DFG)
• Methods
  • Location-based (e.g., leftmost ready first)
    • Allows simple, faster hardware
  • Oldest ready first
    • Can use location-based selection (in-order issue) with “compaction”
    • Compact the issue window to the left every time instructions are issued, and insert new instructions at the right end
    • Can be slow and complex
Prof. Sean Lee’s Slide

Simple Select Logic Implementation: Tree-like Arbitrated Selection Logic
Figure: a tree of arbiter cells above the reservation station. Each cell has inputs Req0 to Req3 and Enable, and outputs Grant0 to Grant3 and AnyReq. Policy: leftmost ready first.
• The Enable signal to the root cell is high whenever the functional unit is ready to execute an instruction
• The AnyReq signal is raised if any of the input Req signals is high
Modified from Prof. Sean Lee’s Slide [Palacharla Dissertation]

Simple Select Logic Implementation
Figure: the same arbiter tree above the reservation station, with one cell expanded. Inside a cell, a priority decoder takes Req0 to Req3 and Enable and produces Grt0 to Grt3; AnyReq is passed up to the parent cell.
Prof. Sean Lee’s Slide [Palacharla Dissertation]

Simple Select Logic Implementation: Multiple Ready Instruction Request
Figure: the same arbiter tree (Req0 to Req3, Grant0 to Grant3, AnyReq, Enable per cell), with several reservation station entries raising their Req signals at the same time.
Prof. Sean Lee’s Slide [Palacharla Dissertation]

Simple Select Logic Implementation: Selective Issue for One FU
Figure: the same arbiter tree (Req0 to Req3, Grant0 to Grant3, AnyReq, Enable per cell); of the requesting entries, exactly one receives a grant for the functional unit.
Prof. Sean Lee’s Slide [Palacharla Dissertation]
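
The arbiter tree can be modelled recursively in Python: AnyReq signals flow up, Enable/Grant signals flow down. This is a behavioural sketch, not a gate-level description; the function name `select` and the `fanin` parameter are assumptions, with `fanin=4` chosen to match the four Req inputs per cell in the figure.

```python
def select(reqs, enable=True, fanin=4):
    """Return a grant vector with at most one grant, leftmost ready first."""
    n = len(reqs)
    if n <= fanin:
        # Leaf arbiter cell: grant the leftmost request, but only when enabled.
        grants, granted = [], False
        for req in reqs:
            grant = enable and req and not granted
            grants.append(grant)
            granted = granted or grant
        return grants
    # Inner cell: split the requests into at most `fanin` groups.
    size = -(-n // fanin)                       # ceiling division
    groups = [reqs[i:i + size] for i in range(0, n, size)]
    any_req = [any(group) for group in groups]  # AnyReq raised by each child
    child_enable = select(any_req, enable, fanin)
    grants = []
    for group, en in zip(groups, child_enable):
        grants.extend(select(group, en, fanin)) # a child's grant input is its Enable
    return grants


reqs = [False] * 16
reqs[5] = reqs[9] = reqs[14] = True             # three ready instructions
grants = select(reqs)                           # functional unit is free
blocked = select(reqs, enable=False)            # functional unit is busy
```

With 16 entries and four inputs per cell the tree is two levels deep, so the select delay grows with the logarithm of the window size rather than linearly.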

Issues to Distinct Functional Units
• Distributed instruction windows (e.g., MIPS R10000 or Alpha 21264)
• Separate reservation stations feed the integer unit and the FPU
• Faster to have separate instruction schedulers for different instruction types
Prof. Sean Lee’s Slide

Dual Issues to Multiple Units (e.g., 2 Adders)
Figure: Req0 to Req3 feed the selection logic for Adder0, which produces Grant0 to Grant3; a second set of Req0 to Req3 feeds the selection logic for Adder1, which produces its own Grant0 to Grant3.
Prof. Sean Lee’s Slide [Palacharla Dissertation]
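
One common way to build such a dual selector is to cascade two single-grant selectors, masking the first grant out of the requests seen by the second. The Python sketch below assumes that arrangement (the figure residue does not show the wiring), and the names `leftmost_ready` and `dual_select` are hypothetical.

```python
def leftmost_ready(reqs):
    """Single-grant selector: grant the leftmost request, if any."""
    grants = [False] * len(reqs)
    if any(reqs):
        grants[reqs.index(True)] = True
    return grants


def dual_select(reqs):
    """Pick up to two ready instructions, one per adder."""
    grant_adder0 = leftmost_ready(reqs)
    # Adder1's selector must not see the request already granted to Adder0.
    remaining = [req and not grant for req, grant in zip(reqs, grant_adder0)]
    grant_adder1 = leftmost_ready(remaining)
    return grant_adder0, grant_adder1


g0, g1 = dual_select([False, True, True, True])
h0, h1 = dual_select([False, False, False, True])
```

The cascade puts the two selectors in series, so the second grant is available later than the first; that added delay is the price of sharing one window between two units.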

Memory Disambiguation
• Can we “undo” stores?
  • Stores cannot be committed to memory until they are marked ready to retire
  • Completed stores are queued and wait in a store queue or store buffer
• Disambiguate (and resolve) memory dependencies dynamically
Prof. Sean Lee’s Slide

Memory Ordering
• A younger load to address X bypassing an older load to the same address X violates certain memory consistency models (e.g., sequential consistency)
• Load-load order traps trigger replays (Source: Alpha 21264 HRM)
Prof. Sean Lee’s Slide

Load Store Queue (LSQ)
• Memory instructions are allocated into the LSQ in program order
• The LSQ manages memory reference ordering
• Unified LSQ vs. split LSQ
• Sandy Bridge: 64 load buffers, 36 store buffers
Figure blocks: ALLOC, RS, age-ordered ROB, and a split LSQ (Store Queue, Load Queue)
Prof. Sean Lee’s Slide

Issuing a Load for Execution
• Each load checks against older stores
  • Associative search
  • A scalability and performance issue
Figure: snapshot of the Load Queue (Issued?, age, address) and the Store Queue (Issued?, age, address, data) with addresses A, B, C, D and one entry whose address is not yet known (???). Store data values shown: 00000001, 12340000, FFFF1111, FFFFFF00. A load with no conflicting older store is issued to memory for execution.
Prof. Sean Lee’s Slide

Issuing a Load for Execution: Store-to-Load Forwarding
• Implementation dependent: comprehensive size matching can be prohibitively expensive
• Simple method: forward when a larger store (word) precedes a smaller load (half)
Figure: snapshot of the Load Queue (Issued?, age, address) and the Store Queue (Issued?, age, address, data) with addresses A, B, C, D and one entry whose address is not yet known (???). Store data values shown: 00000001, 12340000, FFFF1111, FFFFFF00. A load that matches an older store receives its data from the Store Queue.
Prof. Sean Lee’s Slide
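
The load-side check can be sketched in Python as a search over older stores, youngest first. This is one conservative policy among several, written under assumptions: the queue entries are plain dictionaries, `issue_load` is a hypothetical name, a load stalls on any older store with an unknown address, and only the simple case (the store fully covers the load) is forwarded.

```python
def issue_load(load, store_queue):
    """Decide how a load gets its data: 'forward', 'memory', or 'stall'."""
    older = sorted((s for s in store_queue if s["age"] < load["age"]),
                   key=lambda s: s["age"], reverse=True)   # youngest older store first
    l_lo, l_hi = load["addr"], load["addr"] + load["size"]
    for store in older:
        if store["addr"] is None:
            return ("stall", None)        # unknown address: cannot disambiguate yet
        s_lo, s_hi = store["addr"], store["addr"] + store["size"]
        if s_lo <= l_lo and l_hi <= s_hi:
            offset = l_lo - s_lo          # larger store precedes a smaller load
            return ("forward", store["data"][offset:offset + load["size"]])
        if l_lo < s_hi and s_lo < l_hi:
            return ("stall", None)        # partial overlap: too complex to forward
    return ("memory", None)               # no older store conflicts


store_queue = [
    {"age": 0, "addr": 0x1000, "size": 4, "data": bytes([0x00, 0xFF, 0xFF, 0xFF])},
    {"age": 2, "addr": None, "size": 4, "data": None},    # address not computed yet
]
half_hit = issue_load({"age": 1, "addr": 0x1002, "size": 2}, store_queue)
miss = issue_load({"age": 1, "addr": 0x2000, "size": 4}, store_queue)
partial = issue_load({"age": 1, "addr": 0x1002, "size": 4}, store_queue)
blocked = issue_load({"age": 3, "addr": 0x1002, "size": 2}, store_queue)
```

The loop over `older` stands in for the associative search of the store queue; in hardware every store entry is compared in parallel, which is where the scalability cost comes from.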

Issuing a Load for Execution: Speculative Issue
• Loads can be issued speculatively to shorten latency (Alpha 21264, Pentium 4 (Prescott))
• A store, when its address is ready, checks newer loads in the Load Queue
• A “replay” is needed if the speculation turns out to be incorrect (e.g., Alpha’s store-load replay)
Figure: snapshot of the Load Queue (Issued?, age, address) and the Store Queue (Issued?, age, address, data) with addresses A, B, C, D, K and one store whose address is not yet known (???). Store data values shown: 00000001, 12340000, FFFF1111, FFFFFF00. A load is speculatively issued for execution past the unresolved store.
Modified from Prof. Sean Lee’s Slide
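
The store-side check for speculative loads can be sketched as follows. The function `loads_to_replay`, the dictionary fields, and the exact-address match (access sizes are ignored) are simplifying assumptions for illustration.

```python
def loads_to_replay(store, load_queue):
    """When a store's address resolves, find newer loads that issued too early."""
    return [
        load["id"]
        for load in load_queue
        if load["issued"]                   # only loads that already executed
        and load["age"] > store["age"]      # newer than the store in program order
        and load["addr"] == store["addr"]   # same address: the load read stale data
    ]


load_queue = [
    {"id": "ld0", "age": 1, "addr": 0xD0, "issued": True},
    {"id": "ld1", "age": 3, "addr": 0xC0, "issued": True},    # speculated past the store
    {"id": "ld2", "age": 4, "addr": 0xC0, "issued": False},   # has not executed yet
    {"id": "ld3", "age": 5, "addr": 0xA0, "issued": True},
]
store = {"age": 2, "addr": 0xC0}            # address just became known
replay = loads_to_replay(store, load_queue)
```

Only `ld1` must replay: `ld0` is older than the store, `ld2` has not issued and will see the store through the normal load check, and `ld3` touches a different address.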
