
Presentation Transcript


  1. Decoupled Store Completion / Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors
  Andrew Hilton, Amir Roth
  University of Pennsylvania, {adhilton, amir}@cis.upenn.edu
  ISCA-36 :: June 23, 2009

  2. Brief Overview
  • Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04]
  • Scalable load & store queues: SVW/SQIP [Roth05, Sha05]
  • Scalable load & store queues for latency-tolerant processors: SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]
  • Dynamically scheduled superscalar processors: DKIP, FMC [Pericas06, Pericas07]
  • Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
  • This talk: Decoupled Store Completion & Silent Deterministic Replay

  3. Outline
  • Background: CPR/CFP, SVW/SQIP
  • The granularity mismatch problem
  • DSC/SDR
  • Evaluation

  4. CPR/CFP
  • Latency-tolerant: scale the key window structures under an LL$ miss (issue queue, regfile, load & store queues)
  • CFP (Continual Flow Pipeline) [Srinivasan04]: scales the issue queue & regfile by "slicing out" miss-dependent insns
  • CPR (Checkpoint Processing & Recovery) [Akkary03]: scales the regfile by limiting recovery to pre-created checkpoints, allowing aggressive reclamation of non-checkpoint registers
  • Unintended consequence? Checkpoint-granularity "bulk commit"
  • Companion scalable queues: SA-LQ (Set-Associative Load Queue) [Akkary03], HSQ (Hierarchical Store Queue) [Akkary03]
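
To make the checkpoint-granularity recovery constraint concrete, here is a minimal sketch (not from the paper; function and variable names are illustrative): a ROB machine can roll back to the mis-speculated instruction itself, while a CPR machine can only roll back to the nearest older checkpoint.

    # Sketch of recovery granularity: ROB vs. CPR (illustrative names).
    def rob_recovery_point(mispredict_seqnum):
        # Instruction granularity: recover exactly to the faulting instruction.
        return mispredict_seqnum

    def cpr_recovery_point(mispredict_seqnum, checkpoint_seqnums):
        # Checkpoint granularity: recover to the youngest pre-created
        # checkpoint at or before the faulting instruction.
        older = [c for c in checkpoint_seqnums if c <= mispredict_seqnum]
        return max(older)

    # Example: with checkpoints at sequence numbers 0 and 100, a branch
    # mis-prediction at 130 rolls back to 100, squashing 30 good instructions.
    assert cpr_recovery_point(130, [0, 100]) == 100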

  5. Baseline Performance (& Area)
  • ASSOC (baseline): 64/48-entry fully-associative load/store queues
  • 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
  • Load queue: area is fine, poor performance (set conflicts)
  • Store queue: performance is fine, area inefficient (large CAM)

  6. SQIP
  • SQIP (Store Queue Index Prediction) [Sha05]: scales the store queue/buffer by eliminating associative search
  • @dispatch: a load predicts the store queue position of its forwarding store
  • @execute: the load indexes the store queue at that position
  [Diagram: instruction window A:St … B:Ld … P:St Q:St R:Ld S:+ T:Br with store addresses and SSNs <4>, <8>, <9>; commit at <ssn=4>, dispatch at <ssn=9>]
  • Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
    • Stores are named by monotonically increasing sequence numbers
    • Low-order bits are store queue/buffer positions
    • Global SSNs track dispatch, commit, and (store) completion
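
To make the SSN bookkeeping concrete, here is a minimal behavioral sketch (class and method names are mine, not the paper's): stores receive monotonically increasing SSNs at dispatch, the low-order SSN bits serve as store queue positions, and a load indexes the store queue at its predicted position instead of searching it.

    # Illustrative sketch of SSNs and SQIP-style indexed forwarding.
    SQ_SIZE = 256  # store queue/buffer entries; low-order SSN bits index it

    class StoreQueue:
        def __init__(self):
            self.entries = [None] * SQ_SIZE   # each entry: (address, value) or None
            self.ssn_dispatched = 0           # global SSN of youngest dispatched store
            self.ssn_committed = 0            # global SSN of youngest committed store
            self.ssn_completed = 0            # global SSN of youngest store in the D$

        def dispatch_store(self):
            # Name the store with the next monotonically increasing SSN.
            self.ssn_dispatched += 1
            self.entries[self.ssn_dispatched % SQ_SIZE] = None
            return self.ssn_dispatched

        def execute_store(self, ssn, address, value):
            self.entries[ssn % SQ_SIZE] = (address, value)

        def forward_to_load(self, predicted_ssn, load_address):
            # SQIP: the load indexes the queue at its predicted position rather
            # than searching it; a wrong prediction is caught later by SVW.
            entry = self.entries[predicted_ssn % SQ_SIZE]
            if entry is not None and entry[0] == load_address:
                return entry[1]               # forwarded store value
            return None                       # no forwarding: read the D$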

  7. SVW
  • Store Vulnerability Window (SVW) [Roth05]: scales the load queue by eliminating associative search
  • Load verification by in-order re-execution prior to commit
  • Highly filtered: <1% of loads actually re-execute
  [Diagram: same instruction window; SSBF (SSN Bloom Filter) maps x?0 → x20 <9> and x?8 → x18 <8>; commit <9>, complete <3>]
  • Address-indexed SSBF tracks [addr, SSN] of committed stores
  • @commit: loads check the SSBF, re-execute if possibly incorrect
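
A matching sketch of the SVW check (again with names of my own choosing): committed stores write their SSN into an address-indexed SSBF, and a committing load re-executes only if the SSBF shows a store younger than the one it took its value from at its address.

    # Illustrative sketch of SVW verification with an address-indexed SSBF.
    SSBF_ENTRIES = 1024  # assumed table size; direct-mapped for simplicity

    class SSBF:
        def __init__(self):
            self.last_store_ssn = [0] * SSBF_ENTRIES

        def _index(self, address):
            return address % SSBF_ENTRIES

        def record_store(self, address, ssn):
            # Called as stores commit: remember the SSN of the last store to
            # (the set containing) this address.
            self.last_store_ssn[self._index(address)] = ssn

        def load_must_reexecute(self, address, forwarding_ssn):
            # forwarding_ssn is the SSN of the store the load got its value
            # from (0 if it read the D$). If a younger store to the same
            # address has committed since, the load's value may be stale and
            # the load must re-execute before it can verify and commit.
            return self.last_store_ssn[self._index(address)] > forwarding_ssn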

  8. SVW–NAIVE
  • SVW: 512-entry indexed load queue, 256-entry store queue
  • Slowdowns over 8SA-LQ (mesa, wupwise)
  • Some slowdowns even over ASSOC (bzip2, vortex)
  • Why? Not forwarding mis-predictions … store-load serialization
  • Load Y can't verify until older store X completes to the D$

  9. Store-Load Serialization: ROB
  • SVW/SQIP example: an SSBF verification "hole"
  • Load R forwards from store <4> → vulnerable to stores <5>–<9>
  • No SSBF entry for address [x10] → must replay
  • Can't search the store buffer → wait until stores <5>–<8> are in the D$
  • In a ROB processor … <8> (P) will complete (and usually quickly)
  • In a CPR processor …
  [Diagram: instruction window with store P <8> and load R; complete <3> → <8>, verify/commit <9>; SSBF entries x?0 → x20 <9>, x?8 → x18 <8>]

  10. Store-Load Serialization: CPR
  • P will complete … unless it is in the same checkpoint as R
  • Deadlock: load R can't verify → store P can't complete
  • Resolve: squash (ouch) and, on re-execution, create a checkpoint before R, so that P and R end up in separate checkpoints
  • Better: learn, and create checkpoints before future instances of R
  • This is SVW–TRAIN
  [Diagram: same window; commit stalled at the checkpoint, complete <3>, verify <9>; SSBF entries x?0 → x20 <9>, x?8 → x18 <8>]
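
The serialization above reduces to a simple condition; this sketch (types and names are mine) shows why the pair deadlocks under checkpoint-granularity commit and what the SVW–TRAIN fix changes.

    # Illustrative sketch of the CPR store-load deadlock.
    from dataclasses import dataclass

    @dataclass
    class Insn:
        seqnum: int
        checkpoint_id: int
        completed: bool = False   # for a store: data already drained to the D$

    def deadlocked(load: Insn, older_store: Insn) -> bool:
        # The vulnerable load must re-execute against the D$, so it waits for
        # the older store to drain; under bulk commit the store drains only
        # after its checkpoint commits, which needs the load to verify first.
        load_waits_on_store = not older_store.completed
        store_waits_on_load = older_store.checkpoint_id == load.checkpoint_id
        return load_waits_on_store and store_waits_on_load

    # Slide 10's scenario: store P <8> and load R share a checkpoint.
    assert deadlocked(Insn(seqnum=9, checkpoint_id=3),
                      Insn(seqnum=8, checkpoint_id=3))
    # SVW-TRAIN: create a checkpoint before future instances of R, so the pair
    # lands in different checkpoints and the store can complete first.
    assert not deadlocked(Insn(seqnum=9, checkpoint_id=4),
                          Insn(seqnum=8, checkpoint_id=3))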

  11. SVW–TRAIN
  • Better than SVW–NAÏVE
  • But worse in some cases (art, mcf, vpr)
  • Over-checkpointing holds too many registers
  • A checkpoint may not be available for branches

  12. What About Set-Associative SSBFs?
  • Higher associativity helps (reduces hole frequency), but …
  • We are just replacing store queue associativity with SSBF associativity
  • That is exactly the kind of structure we were trying to avoid
  • We want a better solution …

  13. DSC (Decoupled Store Completion)
  • No fundamental reason we cannot complete stores <4>–<9>: all older instructions have completed
  • What's stopping us? The definition of commit & architected state
    • CPR: commit = oldest register checkpoint (checkpoint granularity)
    • ROB: commit = SVW-verify (instruction granularity)
  • Restore the ROB definition: allow stores to complete past the oldest checkpoint
  • This is DSC (Decoupled Store Completion); see the sketch below
  [Diagram: same window; complete <3> → <8>, verify <6>, verify/commit <9>]
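
The change can be viewed as restoring the ROB-style completion rule for stores only; a minimal sketch of the two rules (parameter names are illustrative, not the paper's):

    # Illustrative sketch of store completion under baseline CPR vs. under DSC.
    def store_may_complete_cpr(store_checkpoint_id, oldest_live_checkpoint_id):
        # Baseline CPR (bulk commit): a store may drain to the D$ only after
        # its checkpoint has become the oldest and committed as a unit.
        return store_checkpoint_id < oldest_live_checkpoint_id

    def store_may_complete_dsc(store_seqnum, verified_frontier_seqnum):
        # DSC: instruction-granularity rule, as in a ROB machine -- a store may
        # drain as soon as every older instruction has completed/SVW-verified,
        # even if the enclosing register checkpoint has not yet been reclaimed.
        return store_seqnum <= verified_frontier_seqnum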

  14. DSC: What About Mis-Speculations?
  • DSC: architected state is now younger than the oldest checkpoint
  • What about mis-speculation (e.g., branch T mis-predicted)? We can only recover to a checkpoint
  • Squash committed instructions? Squash stores already visible to other processors? Etc.
  • How do we recover architected state?
  [Diagram: same window; branch T mis-predicted; complete <8>, verify/commit <9>]

  15. SDR (Silent Deterministic Replay)
  • Reconstruct architected state on demand: squash to the oldest checkpoint and replay …
    • Deterministically: re-produce the committed values
    • Silently: without generating coherence events
  • How? Discard committed stores at rename (their data is already in the SB or D$)
  • How? Read committed load values from the load queue
    • Avoids WAR hazards with younger stores from the same thread (e.g., load B vs. younger store Q) or a different thread (coherence)
  [Diagram: replay of the same window after the squash; complete <8>, verify/commit <9>]
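
A behavioral sketch of the replay loop (data structures and helper names are assumptions, not the paper's): committed stores are dropped at rename because their data is already in the store buffer or D$, and committed loads reuse the values recorded in the load queue, so the replay is both deterministic and invisible to the coherence protocol.

    # Illustrative sketch of Silent Deterministic Replay after a squash to the
    # oldest checkpoint.
    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class ReplayInsn:
        seqnum: int
        is_store: bool = False
        is_load: bool = False
        was_completed: bool = False   # store: data already in the SB or D$
        was_verified: bool = False    # load: already SVW-verified/committed
        result: Optional[int] = None

    def silent_deterministic_replay(window, load_queue_values: Dict[int, int],
                                    reexecute: Callable[[ReplayInsn], int]):
        for insn in window:
            if insn.is_store and insn.was_completed:
                continue                       # silent: do not re-perform the store
            elif insn.is_load and insn.was_verified:
                # Deterministic: reuse the recorded value instead of re-reading
                # memory, avoiding WAR hazards and coherence events.
                insn.result = load_queue_values[insn.seqnum]
            else:
                insn.result = reexecute(insn)  # ordinary re-execution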

  16. Outline
  • Background
  • DSC/SDR (yes, that was it)
  • Evaluation: performance, performance-area trade-offs

  17. Performance Methodology
  • Workloads: SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
  • Cycle-level simulator configuration:
    • 4-way superscalar out-of-order CPR/CFP processor
    • 8 checkpoints, 32/32 INT/FP issue queue entries
    • 32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
    • 400-cycle memory, 4Byte/cycle memory bus
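
For reference, the same configuration expressed as a parameter dictionary (key names are mine; values are taken from the slide):

    # Simulated configuration from the methodology slide (key names illustrative).
    SIM_CONFIG = {
        "pipeline":        "4-way superscalar out-of-order CPR/CFP",
        "checkpoints":     8,
        "int_issue_queue": 32,
        "fp_issue_queue":  32,
        "dcache":          "32 KByte",
        "l2":              "2 MByte, 15-cycle",
        "prefetchers":     "8 stream prefetchers, 8 entries each",
        "memory_latency":  "400 cycles",
        "memory_bus":      "4 Bytes/cycle",
    }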

  18. SVW+DSC/SDR
  • Outperforms SVW–NAÏVE and SVW–TRAIN
  • Outperforms 8SA-LQ on average (by a lot)
  • Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
  • These are due to forwarding mis-speculation

  19. Smaller, Less-Associative SSBFs
  • Does DSC/SDR make set-associative SSBFs unnecessary?
  • You can bet your associativity on it

  20. Fewer Checkpoints
  • DSC/SDR reduce the need for large numbers of checkpoints
  • Don't need checkpoints to serialize store/load pairs
  • Efficient use of D$ bandwidth even with widely spaced checkpoints
  • Good: checkpoints are expensive

  21. … And Less Area
  • Area methodology: CACTI-4 [Tarjan04], 45nm
  • Sum the areas of the load/store queues (plus the SSBF & predictor where needed)
  • E.g., 512-entry 8SA-LQ / 256-entry HSQ
  • High performance / low area: 6.6% speedup, 0.91 mm²

  22. How Performance/Area Was Won
  • SVW load queue: big performance gain (no conflicts) & small area loss
  • SQIP store queue: small performance loss & big area gain (no CAM)
  • The big SVW performance gain offsets the small SQIP performance loss
  • The big SQIP area gain offsets the small SVW area loss
  • DSC/SDR: big performance gain & small area gain

  23. DSC/SDR Performance/Area
  • DSC/SDR improve SVW/SQIP IPC and reduce its area
  • No new structures, just new ways of using existing structures
    • No SSBF checkpoints
    • No checkpoint-creation predictor
  • More tolerant of reductions in checkpoint count and SSBF size

  24. Pareto Analysis
  • SVW/SQIP+DSC/SDR dominates all other designs
  • SVW/SQIP are low-area (no CAMs)
  • DSC/SDR is needed to match the IPC of a fully-associative load queue (FA-LQ)

  25. Related Work
  • SRL (Store Redo Log) [Gandhi05]
    • Large associative store queue → FIFO buffer + forwarding cache
    • Expands the store queue only under LL$ misses → under-performs HSQ
  • Unordered late-binding load/store queues [Sethumadhavan08]
    • Entries only for executed loads and stores
    • Poor match for centralized latency-tolerant processors
  • Cherry [Martinez02]
    • "Post-retirement" checkpoints
    • No large load/store queues, but may benefit from DSC/SDR
  • Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]

  26. Conclusions
  • Checkpoint granularity …
    • … for register management: good
    • … for store commit: somewhat painful
  • DSC/SDR: the good parts of the checkpoint world
    • Checkpoint-granularity registers + instruction-granularity stores
    • Key 1: disassociate commit from the oldest register checkpoint
    • Key 2: reconstruct architected state silently on demand (committed load values are available in the load queue)
  • Allow a checkpoint processor to use SVW/SQIP load/store queues
    • Performance and area advantages
    • Simplify multi-processor operation for checkpoint processors

