
Lecture 5. Dynamic Scheduling II

COM515 Advanced Computer Architecture. Lecture 5. Dynamic Scheduling II. Prof. Taeweon Suh Computer Science Education Korea University. Modern Processors. Branch Prediction results in speculative execution







  1. COM515 Advanced Computer Architecture Lecture 5. Dynamic Scheduling II Prof. Taeweon Suh Computer Science Education Korea University

  2. Modern Processors • Branch prediction results in speculative execution • Speculative instructions (if wrongly speculated) must not alter the architectural state • Architectural registers • Memory • Requirement of precise exceptions/interrupts Prof. Sean Lee’s Slide

  3. Modern Out-of-Order Core • ALLOC: allocates instructions • RAT (Register Alias Table): renames architecture registers • RS (Reservation Station): issues instructions to functional units • ROB (Reorder Buffer): maintains state information (physical registers) for precise interrupts and speculative execution • ARF: architectural register file • LSQ (Load Store Queue): maintains memory access ordering Prof. Sean Lee’s Slide

  4. Register Renaming • Architectural registers: R0 to R7 • Physical registers: T0, T1, …, Tn-2, Tn-1 • Original code: R2 = R1+R3; R4 = R2 - R6; …; R2 = R7 / R5; BEQ R2, #1; …; R2 = R4 * R1; R6 = Load [R2] • Renamed code: T1 = R1+R3; R4 = T1 - R6; …; T20 = R7 / R5; BEQ T20, #1; …; T7 = R4 * R1; R6 = Load [T7] • WAW and WAR hazards on R2 are removed: No False Dependencies! • Sandy Bridge: 160 PRs for INT, 144 PRs for FP Adapted from Prof. G. Loh’s Slides

  5. Register Renaming • Instruction: Dest = Src1 op Src2 • Mapping mechanism: Src1 → TagS1, Src2 → TagS2 • Dest → TagD, where TagD is taken from the unmapped physical registers • Renamed instruction: TagD = TagS1 op TagS2 • Repeat for each instruction Adapted from Prof. G. Loh’s Slides

  6. P6 Style Register Renaming (also used by HP PA-8000, PPC604) • Register Alias Table (RAT) • Use a lookup table for renaming • One entry per architectural register (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) • Each entry maps to the most recent version of the architectural register, which could be in • Physical register file (ROB, 40 entries, with Data and Status fields) • Architectural register file (RRF, Retirement Register File) Prof. Sean Lee’s Slide

  7. RAT Example • Initially: RAT entries R0 to R7 are all unmapped (-); free physical regs: T13, T14, T15, T16 • R1 = R2 + R3 → T13 = R2 + R3 (RAT: R1 → T13; free: T14, T15, T16) • R5 = R4 – R1 → T14 = R4 – T13 (RAT: R5 → T14; free: T15, T16) • R1 = R1 * R5 → T15 = T13 * T14 (RAT: R1 → T15; free: T16) • R2 = R5 / R1 → T16 = T14 / T15 (RAT: R2 → T16; free: none) Adapted from Prof. G. Loh’s Slides
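
The RAT example above can be replayed with a short sketch. This is only an illustration under simple assumptions: one instruction is renamed at a time, unmapped registers are read from the architectural register file by name, and the function and variable names (`rename`, `rat`, `free`) are made up for this example rather than taken from the slides.

```python
from collections import deque

def rename(instrs, free_regs):
    """Rename (dest, src1, op, src2) tuples one at a time (scalar rename)."""
    rat = {}                  # architectural register -> physical register
    free = deque(free_regs)   # free physical register pool
    renamed = []
    for dest, src1, op, src2 in instrs:
        # Sources read the RAT first; an unmapped register still lives in the ARF.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        # The destination takes a fresh physical register, then the RAT is updated.
        tag = free.popleft()
        rat[dest] = tag
        renamed.append(f"{tag} = {s1} {op} {s2}")
    return renamed, rat, list(free)

program = [
    ("R1", "R2", "+", "R3"),
    ("R5", "R4", "-", "R1"),
    ("R1", "R1", "*", "R5"),
    ("R2", "R5", "/", "R1"),
]
renamed, rat, free = rename(program, ["T13", "T14", "T15", "T16"])
for line in renamed:
    print(line)
```

Reading the sources before updating the destination matters: for R1 = R1 * R5 the source R1 must still resolve to T13, not to the newly allocated T15.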

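  8. Superscalar Rename • Group renamed together: R1 = R2 + R3; R4 = R5 – R7; R3 = R0 / R2; R5 = Ld 12[R6] • Left source tags read from the RAT: T16, T39, T14, T5 • Right source tags read from the RAT: T23, T7, T16, X (don’t rename immediates) • Destination tags from free register pool: T10, T31, T19, T6 • For N-wide superscalar: 2N RAT read-ports, N RAT write-ports Prof. Sean Lee’s Slide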

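  9. Intra-Group Dependencies • Group renamed together: R2 = R2 + R3; R4 = R5 – R7; R3 = R0 / R2; R5 = Ld 12[R6] • Destination tags from free register pool: T10, T31, T19, T6 • R3 = R0 / R2 reads T16 for R2 from the RAT: this is the wrong version of R2 • It should be using T10, the version of R2 written by the first instruction in the group Prof. Sean Lee’s Slide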

  10. Intra-Group Dependencies • Group: R1 = R2 + R1; R2 = R1 – R2; R1 = R2 / R1; R1 = R2 >> R1 • Destination tags from free register pool: T10, T31, T19, T6 • Source tags from the RAT lookup alone: (T16, T34), (T34, T16), (T16, T34), (T16, T34) • Result of sequential renaming (correct final renamed registers): (T16, T34), (T10, T16), (T31, T10), (T31, T19) Modified from Prof. Sean Lee’s Slide

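  11. Resolving Intra-Group Dependencies • [Diagram: Inst 0 to Inst 3 read their source tags (T0L/T0R, T1L/T1R, T2L/T2R, T3L/T3R) from the RAT; the Intra-Group Dependency Checker selects between those tags and Pdst0, Pdst1, Pdst2 from the free register pool to produce Src L, Src R and Dest for each instruction] Adapted from Prof. G. Loh’s Slides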

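  12. Intra-Group Dependency Checking • [Diagram: comparators (=) match the sources of each instruction (src1L, src1R, src2L, src2R, src3L, src3R) against the destinations of older instructions in the group (dst0, dst1, dst2); multiplexers then choose between the RAT tags (T1L, T1R, T2L, T2R, T3L, T3R) and Pdst0, Pdst1, Pdst2] Adapted from Prof. G. Loh’s Slides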

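  13. Mapping Selection • Group: R1 = R2 + R1; R2 = R1 – R2; R1 = R2 / R1; R1 = R2 >> R1 • Comparators (!=) check each of dst0, dst1, dst2 against the destinations of later instructions: use pdst0, pdst1 or pdst2 only if no later instruction writes the same register; pdst3 is always used • Only the mapping from the last instruction (R1 = R2 >> R1) should be written into the RAT for R1 • Condition: use mapping if instruction is last writer to the register Adapted from Prof. G. Loh’s Slides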
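
A rough software model of the group rename may help tie the dependency checker and the mapping selection together. It is a sketch, not the hardware: the starting RAT (R1 → T34, R2 → T16) is an assumption chosen to be consistent with the tags shown on slide 10, and `rename_group` is a hypothetical helper name.

```python
def rename_group(group, rat, pdsts):
    """Rename a group of (dest, srcL, srcR) as if in one cycle.

    Every RAT read sees the RAT as it was before the group."""
    renamed = []
    for i, (dest, src_l, src_r) in enumerate(group):
        tags = []
        for src in (src_l, src_r):
            tag = rat[src]                 # parallel RAT lookup
            # Dependency check: the youngest older instruction in the group
            # that writes this register overrides the RAT tag.
            for j in range(i):
                if group[j][0] == src:
                    tag = pdsts[j]
            tags.append(tag)
        renamed.append((pdsts[i], tags[0], tags[1]))
    # Mapping selection: only the last writer of a register updates the RAT.
    new_rat = dict(rat)
    for i, (dest, _, _) in enumerate(group):
        if all(group[k][0] != dest for k in range(i + 1, len(group))):
            new_rat[dest] = pdsts[i]
    return renamed, new_rat

group = [
    ("R1", "R2", "R1"),   # R1 = R2 + R1
    ("R2", "R1", "R2"),   # R2 = R1 - R2
    ("R1", "R2", "R1"),   # R1 = R2 / R1
    ("R1", "R2", "R1"),   # R1 = R2 >> R1
]
rat = {"R1": "T34", "R2": "T16"}   # assumed mapping before the group
renamed, new_rat = rename_group(group, rat, ["T10", "T31", "T19", "T6"])
for dest, left, right in renamed:
    print(dest, left, right)
print(new_rat)
```

In hardware the inner loops are comparators and multiplexers evaluated in parallel; the loops here only express the priority (youngest older writer wins, last writer updates the RAT).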

  14. Issue with Imprecise Interrupt • Example: lw r5, 8(r10); add r10, r9, r8; add r12, r10, r7 • add instructions take one cycle • The load induces a “data page fault” • If out-of-order completion is allowed • r10 and r12 will be modified • Wrong values will be used by the re-issued load • Interrupt classes • Program interrupts (exceptions or traps) • External interrupts (asynchronous) Modified from Prof. Sean Lee’s Slide

  15. Precise Interrupts • To reflect a sequential architecture model → Serially correct (think about a single-issue, non-pipelined processor) • Keep “Precise State” of an execution • All instructions before the interrupted instruction must be completed • The state should appear as if no instruction after the interrupted instruction has issued • The interrupted PC should be presented to the interrupt handler (restartable) • Similar to branch misprediction handling • Out-of-order execution makes the ordering hard • Undo what comes after an interrupt Prof. Sean Lee’s Slide

  16. Why Support Precise Interrupts • Need to maintain a precise state (for recovery) • Software debugging • I/O or timer interrupts • Virtual memory (page fault) • Instruction emulation • Virtual machines Prof. Sean Lee’s Slide

  17. Support Precise Interrupt • Buffer results • Can reconstruct the scenario (state) as sequential execution • Restart from saved PC with saved PC state Prof. Sean Lee’s Slide

  18. Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88] • Architecture Register File keeps “In-order state” • Reorder Buffer (ROB) • A circular buffer • Contains all in-flight instructions • Buffers the “Lookahead state” • In-order allocation/deallocation with head/tail pointers • When an exception occurs • Halt instruction issue • Revert to in-order state using RF and discard ROB results • Also used for branch misprediction recovery • Pentium Pro/II/III integrates physical register file within ROB • Pentium 4 decouples ROB and physical register file Modified from Prof. Sean Lee’s Slide

  19. ROB (with physical registers) • Entry fields: Exp event, Spec?, Done?, PC, V, RegDst, Data (physical register) • Head: oldest instruction • Tail: next inst to be allocated • Sandy Bridge: 168-entry ROB Prof. Sean Lee’s Slide

  20. Handling Precise Interrupts • ROB fields: Exp event, Spec?, Done?, PC, V, RegDst, Data (physical register) • ROB entries, Head to Tail: xA000 R1=R1+10 (RegDst R1, Exp 0000, Done, Data 11); xA004 R2=R2*2 (RegDst R2, Exp 0000); xA008 FR1=FR2/0.0 (RegDst FR1, Exp 0000) • ARF: R1 = 1 (becoming 11 as the head entry retires), R2 = 2, R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  21. Handling Precise Interrupts • ROB entries, Head to Tail: xA004 R2=R2*2 (RegDst R2, Exp 0000); xA008 FR1=FR2/0.0 (RegDst FR1, Exp 0000); xA00C R3=R3+1 (RegDst R3, Exp 0000) • ARF: R1 = 11 (previously 1), R2 = 2, R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  22. Handling Precise Interrupts • ROB entries, Head to Tail: xA004 R2=R2*2 (RegDst R2, Exp 0000); xA008 FR1=FR2/0.0 (RegDst FR1, Exp 0000); xA00C R3=R3+1 (RegDst R3, Exp 0000, Done, Data 4); xA010 R4=R4*2 (RegDst R4, Exp 0000) • ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  23. Handling Precise Interrupts • ROB entries, Head to Tail: xA004 R2=R2*2 (RegDst R2, Exp 0000, Done, Data 4); xA008 FR1=FR2/0.0 (RegDst FR1, Exp 0010); xA00C R3=R3+1 (RegDst R3, Exp 0000, Done, Data 4); xA010 R4=R4*2 (RegDst R4, Exp 0000, Done, Data 8); xA014 FR4=FR4*2.0 (RegDst FR4, Exp 0000) • ARF: R1 = 11, R2 = 2 (becoming 4), R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  24. Handling Precise Interrupts • ROB entries, Head to Tail: xA004 R2=R2*2 (RegDst R2, Exp 0000, Done, Data 4); xA008 FR1=FR2/0.0 (RegDst FR1, Exp 0010); xA00C R3=R3+1 (RegDst R3, Exp 0000, Done, Data 4); xA010 R4=R4*2 (RegDst R4, Exp 0000, Done, Data 8); xA014 FR4=FR4*2.0 (RegDst FR4, Exp 0000) • ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  25. Handling Precise Interrupts • Exception detected at the head: xA008 FR1=FR2/0.0 (RegDst FR1, Exp 0010) • Younger ROB entries: xA00C R3=R3+1 (Done, Data 4); xA010 R4=R4*2 (Done, Data 8); xA014 FR4=FR4*2.0 • These values were not committed into RF • ARF: R1 = 11, R2 = 4, R3 = 3, R4 = 4, …, R31 • Back up “PC” and current RF • Depending on the exception, the process will either abort or execution will resume from this excepting instruction Prof. Sean Lee’s Slide
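
The walk-through on slides 20 to 25 can be approximated with a small reorder-buffer model. This is a sketch under stated assumptions: the class and field names are invented for illustration, results are written by hand instead of by functional units, and an exception is handled only once the faulting entry reaches the head.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    pc: int
    dest: str
    done: bool = False
    value: Optional[float] = None
    exception: bool = False

class ReorderBuffer:
    def __init__(self, arf):
        self.arf = arf           # in-order (committed) state
        self.entries = deque()   # left end = head (oldest), right end = tail

    def allocate(self, pc, dest):
        entry = RobEntry(pc, dest)
        self.entries.append(entry)   # in-order allocation at the tail
        return entry

    def commit(self):
        """Retire from the head in program order; return a faulting PC or None."""
        while self.entries and (self.entries[0].done or self.entries[0].exception):
            head = self.entries[0]
            if head.exception:
                self.entries.clear()   # discard all lookahead state
                return head.pc         # PC handed to the interrupt handler
            self.arf[head.dest] = head.value
            self.entries.popleft()
        return None

arf = {"R1": 1, "R2": 2, "R3": 3, "R4": 4}
rob = ReorderBuffer(arf)
e = [rob.allocate(pc, dest) for pc, dest in
     [(0xA000, "R1"), (0xA004, "R2"), (0xA008, "FR1"), (0xA00C, "R3"), (0xA010, "R4")]]

# Out-of-order completion: R1, R3 and R4 finish first; the divide faults.
e[0].done, e[0].value = True, 11
e[3].done, e[3].value = True, 4
e[4].done, e[4].value = True, 8
e[2].exception = True
first = rob.commit()        # retires only xA000; xA004 is not done yet

e[1].done, e[1].value = True, 4
fault_pc = rob.commit()     # retires xA004, then finds the exception at the head
print(first, hex(fault_pc), arf)
```

The results 4 and 8 computed for R3 and R4 never reach the ARF, which is the property that makes the interrupt precise.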

  26. Handling Speculative Execution • ROB fields: Exp event, Spec?, Done?, PC, V, RegDst, Data (physical register) • ROB entries, Head to Tail: xB000 R1=R1+10 (RegDst R1, Exp 0000); xB004 BEQ R1,R0,L1 (Exp 0000) • ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  27. Handling Speculative Execution • BEQ R1, R0, L1 is predicted TAKEN • ROB entries, Head to Tail: xB000 R1=R1+10 (RegDst R1, Exp 0000); xB004 BEQ R1,R0,L1 (Exp 0000); xC100 R2=R3<<2 (RegDst R2, Spec, Done, Data 12); xC104 R1=R2*R3 (RegDst R1, Spec); xC108 BEQ R3,R0,L1 (Spec); xD2B0 R1=R7+1 (RegDst R1, Spec, Done, Data 8) • ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4, …, R31 Modified from Prof. Sean Lee’s Slide

  28. Handling Speculative Execution • BEQ Misprediction: BEQ R1, R0, L1 is resolved, actually NOT TAKEN !! • ROB entries, Head to Tail: xB004 BEQ R1,R0,L1 (Exp 0000); xC100 R2=R3<<2 (RegDst R2, Spec, Done, Data 12); xC104 R1=R2*R3 (RegDst R1, Spec); xD2AC BEQ R3,R0,L1 (Spec); xD2B0 R1=R7+1 (RegDst R1, Spec, Done, Data 8) • ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31 Prof. Sean Lee’s Slide

  29. Handling Speculative Execution • ROB entries, Head to Tail: xB004 BEQ R1,R0,L1 (Exp 0000) • ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31 • Retire branch, clear all entries after the mis-speculated branch Prof. Sean Lee’s Slide

  30. Handling Speculative Execution • ROB entries, Head to Tail: xB008 R2=R5<<4 (RegDst R2, Exp 0000) • ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4, …, R31 • Continue execution from the correct path (fall-through in this case) Prof. Sean Lee’s Slide

  31. RAT Recovery • ARF state corresponds to state prior to oldest non-committed instruction • As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction • On a branch misprediction, wrong-path instructions are flushed from the machine • The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state Adapted from Prof. G. Loh’s Slide

  32. Solution: Stall and Drain • Correct-path instructions (foo) arrive from fetch but can’t be renamed because the RAT is wrong • Allow all instructions to execute and commit; ARF corresponds to last committed instruction • ARF now corresponds to the state right before the next instruction to be renamed (foo) • Reset RAT so that all mappings refer to the ARF • Resume renaming the new correct-path instructions from fetch • Pros: Very simple to implement • Cons: Performance loss due to stalls Prof. Sean Lee’s Slide

  33. Another Solution: Checkpointing • At each branch, make a copy of the RAT (register mapping at the time of the branch) • Checkpoints are taken from a Checkpoint Free Pool • On a misprediction: 1. flush wrong-path instructions 2. deallocate RAT checkpoints 3. recover RAT from checkpoint 4. resume renaming Prof. Sean Lee’s Slide
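
A minimal sketch of RAT checkpointing follows, assuming a dictionary-based RAT and branch identifiers supplied by the caller. The class `CheckpointedRAT` and its method names are hypothetical, and the model ignores the finite checkpoint pool and the recovery of the free physical register list.

```python
class CheckpointedRAT:
    def __init__(self, mapping):
        self.rat = dict(mapping)
        self.checkpoints = []   # (branch_id, RAT copy), oldest first

    def rename_dest(self, arch_reg, phys_reg):
        self.rat[arch_reg] = phys_reg

    def take_checkpoint(self, branch_id):
        # Copy the register mapping as it stands at the time of the branch.
        self.checkpoints.append((branch_id, dict(self.rat)))

    def branch_correct(self, branch_id):
        # Correct prediction: the checkpoint is no longer needed.
        self.checkpoints = [c for c in self.checkpoints if c[0] != branch_id]

    def recover(self, branch_id):
        # Misprediction: restore the mapping saved at this branch and
        # deallocate this checkpoint plus those of younger (wrong-path) branches.
        idx = next(i for i, c in enumerate(self.checkpoints) if c[0] == branch_id)
        self.rat = self.checkpoints[idx][1]
        del self.checkpoints[idx:]

rat = CheckpointedRAT({"R1": "T1", "R2": "T2"})
rat.take_checkpoint("br1")
rat.rename_dest("R1", "T5")     # renamed after br1
rat.take_checkpoint("br2")
rat.rename_dest("R2", "T6")     # renamed after br2
rat.recover("br1")              # br1 mispredicted
print(rat.rat, rat.checkpoints)
```

Compared with stall-and-drain, renaming can resume as soon as the checkpoint is restored, at the cost of storage for one RAT copy per in-flight branch.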

  34. Modern Instruction Scheduler • At dispatch, instructions read all available operands from the register files and store a copy in the scheduler (Tomasulo’s algorithm) • Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast) • When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (Wakeup and select) • [Diagram: Fetch & Dispatch feeds the Instruction Scheduler with operands from the ARF and PRF/ROB; the scheduler issues to the Functional Units, whose results return through the Bypass and the physical register update] Adapted from Prof. G. Loh’s Slide

  35. Instruction Scheduling: Wakeup and Select • Wakeup Logic • Notifies waiting instructions when the data dependencies of their input operands are resolved • Wakes up instructions with zero outstanding input dependencies • Select Logic • Chooses and fires ready instructions • Deals with structural hazards • Wakeup-select is likely on the critical path • Associative match Prof. Sean Lee’s Slide
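
The wakeup and select loop can be sketched for a scalar scheduler (issue width 1). The sketch assumes single-cycle execution so that a destination tag is broadcast in the same step in which its instruction is selected; `RSEntry` and `issue_cycle` are illustrative names, not part of the slides.

```python
class RSEntry:
    def __init__(self, dest, srcs, ready_tags):
        self.dest = dest
        # Keep only the source tags whose values are still outstanding.
        self.waiting = {s for s in srcs if s not in ready_tags}

    def wakeup(self, tag):
        # Associative match of a broadcast tag against the source tags.
        self.waiting.discard(tag)

    @property
    def ready(self):
        return not self.waiting

def issue_cycle(rs):
    """Select the leftmost ready entry, then broadcast its destination tag."""
    for i, entry in enumerate(rs):
        if entry.ready:
            issued = rs.pop(i)
            for other in rs:
                other.wakeup(issued.dest)   # tag broadcast wakes dependents
            return issued.dest
    return None   # nothing ready this cycle

ready_tags = {"T1", "T2"}
rs = [
    RSEntry("T5", ["T4", "T3"], ready_tags),   # waits for T4 and T3
    RSEntry("T4", ["T3", "T1"], ready_tags),   # waits for T3
    RSEntry("T3", ["T1", "T2"], ready_tags),   # ready at dispatch
]
order = []
while rs:
    order.append(issue_cycle(rs))
print(order)
```

Every broadcast is compared against every waiting source tag, which is why the associative match tends to sit on the critical path as the window grows.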

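  36. Scalar Scheduler (Issue Width = 1) • [Diagram: each reservation station entry compares (=) its source tags against the single Tag Broadcast Bus; the Select Logic picks one ready entry to send To Execute Logic. Tags shown in the entries: T14, T39, T16, T8, T6, T42, T17, T15] From Prof. G. Loh’s Slide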

  37. Superscalar Scheduler (Issue Width = 4) • [Diagram: snapshot of the RS (only 4 entries shown); each source tag is compared (= = = =) against all four tags on the Tag Broadcast Bus [3..0]; the Select Logic sends ready entries To Execute Logic. Tags shown in the entries: T14, T16, T39, T8, T6, T42, T17, T15] Adapted from Prof. G. Loh’s Slide

  38. Selection Logic • Select ready instructions to be issued • Goal: to reduce the height of the data-flow graph (DFG) • Methods • Location-based (e.g., leftmost ready first) • Allows simpler, faster hardware • Oldest ready first • Can use location-based (in-order issue) with “compaction” • Compact the issue window to the left every time instructions are issued, inserting new instructions at the right end • Can be slow and complex Prof. Sean Lee’s Slide

  39. Simple Select Logic Implementation • Tree-like Arbitrated Selection Logic over the Reservation Station • Each arbiter cell has inputs Req0 to Req3 and Enable, and outputs Grant0 to Grant3 and AnyReq • Leftmost ready first • The Enable signal to the root cell is high whenever the functional unit is ready to execute an instruction • The AnyReq signal is raised if any of the input Req signals is high Modified from Prof. Sean Lee’s Slide [Palacharla Dissertation]
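
The tree arbiter can be modelled in a few lines. This is a behavioural sketch only, assuming a two-level tree of 4-input cells and a single functional unit; the function names are made up, and real select logic is combinational rather than sequential code.

```python
def arbiter_cell(reqs, enable):
    """One cell: grant the leftmost request if enabled; report AnyReq upward."""
    any_req = any(reqs)
    grants = [False] * len(reqs)
    if enable and any_req:
        grants[reqs.index(True)] = True   # leftmost ready first
    return grants, any_req

def select(ready, fu_free, fan_in=4):
    """Two-level tree over the reservation station's ready bits."""
    groups = [ready[i:i + fan_in] for i in range(0, len(ready), fan_in)]
    # Upward pass: each leaf cell raises AnyReq, which becomes a Req of the root.
    leaf_any = [any(g) for g in groups]
    # Root cell: Enable is high when the functional unit can accept an instruction.
    root_grants, _ = arbiter_cell(leaf_any, fu_free)
    # Downward pass: each root Grant is the Enable of the matching leaf cell.
    grants = []
    for group, enable in zip(groups, root_grants):
        leaf_grants, _ = arbiter_cell(group, enable)
        grants.extend(leaf_grants)
    return grants

ready = [False] * 16
for i in (5, 6, 12):
    ready[i] = True               # multiple ready instructions request issue
grants = select(ready, fu_free=True)
print(grants.index(True))         # entry selected for the single functional unit
```

The delay of such a tree grows with its depth, so it scales logarithmically with the number of reservation station entries rather than linearly.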

  40. Simple Select Logic Implementation • Reservation Station feeding a tree of arbiter cells • Inside each cell, a Priority Decoder maps Req0 to Req3 onto Grt0 to Grt3 (gated by Enable) and produces AnyReq Prof. Sean Lee’s Slide [Palacharla Dissertation]

  41. Simple Select Logic Implementation • Multiple Ready Instruction Request • [Diagram: several Reservation Station entries raise Req at the same time; the requests propagate up the arbiter tree as AnyReq] Prof. Sean Lee’s Slide [Palacharla Dissertation]

  42. Simple Select Logic Implementation • Selective Issue for One FU • [Diagram: Enable propagates down the arbiter tree and a single Grant reaches one Reservation Station entry] Prof. Sean Lee’s Slide [Palacharla Dissertation]

  43. Issues to Distinctive Functional Units • Distributed Instruction Windows (e.g., MIPS R10000 or Alpha 21264) • Separate Reservation Stations feed the Integer Unit and the FPU • Faster to have separate instruction schedulers for different instruction types Prof. Sean Lee’s Slide

  44. Dual Issues to Multiple Units (e.g., 2 Adders) • [Diagram: Req0 to Req3 enter the Selection Logic for Adder0, which produces Grant0 to Grant3; a second set of Req0 to Req3 enters the Selection Logic for Adder1, which produces its own Grant0 to Grant3] Prof. Sean Lee’s Slide [Palacharla Dissertation]

  45. Memory Disambiguation • Can we “undo” stores? • Stores cannot be committed to memory until they are marked ready to retire • Completed stores are queued and waiting in a store queue or store buffer • Disambiguate (and resolve) memory dependency dynamically Prof. Sean Lee’s Slide

  46. Memory Ordering • A younger load to address X bypassing an older load to the same address X violates certain memory consistency models (e.g., sequential consistency) • Load-load order trap replays Source: Alpha 21264 HRM Prof. Sean Lee’s Slide

  47. Load Store Queue (LSQ) • Memory instructions are allocated into LSQ in program order • LSQ manages memory reference ordering • Unified LSQ vs. Split LSQ • Split LSQ: separate, age-ordered Load Queue and Store Queue, allocated by ALLOC alongside the RS and ROB • Sandy Bridge: 64 Load buffers, 36 Store buffers Prof. Sean Lee’s Slide

  48. Issuing a Load for Execution • Each load checks against older stores • Associative search • A performance issue of scalability • [Diagram: Load Queue (Issued?, age, address) and Store Queue (Issued?, age, address, data); store entries shown: A = 00000001, B = 12340000, C = FFFF1111 with data FFFFFF00; a load that matches no older store is issued to memory for execution] Prof. Sean Lee’s Slide

  49. Issuing a Load for Execution • Store-to-load forwarding • Implementation dependent: comprehensive size matching can be prohibitively expensive • Simple method: forward when a larger store (word) precedes a smaller load (half) • [Diagram: Load Queue (Issued?, age, address) and Store Queue (Issued?, age, address, data); store entries shown: A = 00000001, B = 12340000, C = FFFF1111 with data FFFFFF00; a load whose address matches an older store takes its data from the Store Queue] Prof. Sean Lee’s Slide
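
The check of a load against older stores, including forwarding, might look like the sketch below. It makes several simplifying assumptions that are not in the slides: all accesses are full words of the same size, a load stalls whenever it meets an older store whose address or data is unknown, and the names `Store` and `issue_load` are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Store:
    age: int                    # smaller = older in program order
    addr: Optional[int]         # None = address not computed yet
    data: Optional[int] = None  # None = data not available yet

def issue_load(load_age, load_addr, store_queue, memory):
    """Return (action, value), where action is 'forward', 'memory' or 'stall'."""
    older = [s for s in store_queue if s.age < load_age]
    # Search youngest-first so the most recent matching store wins.
    for store in sorted(older, key=lambda s: s.age, reverse=True):
        if store.addr is None:
            return "stall", None          # conservative: might alias the load
        if store.addr == load_addr:
            if store.data is None:
                return "stall", None      # match found, data not ready
            return "forward", store.data  # store-to-load forwarding
    return "memory", memory.get(load_addr, 0)

memory = {0xC: 7}
sq = [Store(1, 0xA, 0x00000001), Store(2, 0xB, 0x12340000)]
hit = issue_load(3, 0xB, sq, memory)      # matches the store to B
miss = issue_load(3, 0xC, sq, memory)     # no older store to C
sq.append(Store(3, None))                 # older store, address unknown
blocked = issue_load(4, 0xC, sq, memory)
print(hit, miss, blocked)
```

The sorted scan stands in for the associative search; in hardware every store queue entry is compared in parallel, which is the scalability concern noted on the previous slide.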

  50. Issuing a Load for Execution • Can speculatively issue loads for shortening latency (Alpha 21264, Pentium 4 (Prescott)) • Store, when address ready, checks newer loads in the Load Queue • “Replay” needed if speculation turns out to be incorrect (e.g., Alpha’s store-load replay) • [Diagram: Load Queue (Issued?, age, address) and Store Queue (Issued?, age, address, data); a load is speculatively issued for execution while an older store’s address is still unknown (???)] Modified from Prof. Sean Lee’s Slide
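
The store-side check behind a replay can be sketched as follows. This is an assumption-laden illustration: ages are plain integers, addresses are compared exactly, and `store_address_ready` is a hypothetical helper, not the mechanism of any particular processor.

```python
from dataclasses import dataclass

@dataclass
class Load:
    age: int              # smaller = older in program order
    addr: int
    issued: bool = False  # True if the load already executed speculatively

def store_address_ready(store_age, store_addr, load_queue):
    """Return the ages of younger loads that must be replayed.

    A load needs a replay if it is younger than the store, has already
    issued, and read the address the store is about to write."""
    return [ld.age for ld in load_queue
            if ld.age > store_age and ld.issued and ld.addr == store_addr]

load_queue = [
    Load(age=2, addr=0xC, issued=True),    # speculated past the store
    Load(age=3, addr=0xD, issued=True),    # different address, safe
    Load(age=4, addr=0xC, issued=False),   # not issued yet, nothing to undo
]
replays = store_address_ready(store_age=1, store_addr=0xC, load_queue=load_queue)
print(replays)
```

Only the load that actually consumed a stale value is squashed; the unissued load will see the store when it performs its own check.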
