
Eliminating the Store Queue


Presentation Transcript


  1. Eliminating the Store Queue
  Stephen Somogyi
  CALCM Seminar, April 3, 2007

  2. Out-of-order Execution
  • Extracts instruction-level parallelism
  • Independent instructions run out of program order
  • Hides 10’s of cycles of latency
  • Effectiveness dependent on window size
  • Larger window → higher performance
  • Instruction window difficult to scale
  • ROB, LSQ, wakeup, select…

  3. Store Queue
  • Part of load-store queue
  • Enforces ordering between loads and stores
  • Forwards data from stores to loads (same address)
  • In-order commit to memory system
  • Structure: CAM
  • Age-based search
  [Diagram: queue entries from newest to oldest: Load A, Store C, Store A, Load A, Load B, Store A; stores drain to the cache from the oldest end]
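
For contrast with what the rest of the talk removes, here is a minimal Python sketch of the associative, age-ordered search a conventional store queue performs; the list-of-tuples representation and function name are illustrative only, not from the talk.

    def sq_forward(store_queue, load_addr):
        """Model of the age-based CAM search in a conventional store queue.

        store_queue: list of (address, value) pairs for in-flight stores that
        are older than the load, ordered oldest to newest.
        """
        # The CAM matches every entry with the same address; age-based priority
        # picks the youngest older store, modeled here by scanning newest-first.
        for addr, value in reversed(store_queue):
            if addr == load_addr:
                return value      # store-to-load forwarding hit
        return None               # no match: the load reads the data cache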

  4. Solution: No Store Queue!
  • Store-load forwarding very repetitive
  • Between pairs of instructions
  • Predict & forward value directly from store to load
  • In-order re-execution for mispredictions
  • Commit already in program order
  • Buffer store values in register file until commit
  • No need for store queue

  5. Contributions [MICRO '06]
  • Fire-and-Forget [Subramaniam and Loh, Georgia Tech]
  • Pushes store's value into load queue
  • NoSQ [Sha, Martin and Roth, UPenn]
  • Pulls store's value from register file
  • Both eliminate the store queue (solve scalability)
  • No performance loss

  6. Outline
  • Store Vulnerability Window
  • Fire-and-Forget
  • NoSQ
  • Results
  • Conclusion

  7. Store Vulnerability Window (SVW) [Roth, ISCA '05]
  • Motivation:
  • Frequent load re-execution hurts performance
  • Need to know which addresses were written long ago
  • Solution:
  • Maintain dynamic store sequence number (SSN)
  • Track most recent store to each address
  • Using SSNs, can determine if a load is vulnerable

  8. SVW Operation
  • At store commit:
  • Track SSN in store sequence Bloom filter (SSBF): SSBF[address] = SSN_commit
  • SSN_commit++
  [Diagram: Store A commits with SSN_commit = 11; the SSBF entry for A becomes SSN 11]

  9. SVW Operation
  • At load dispatch:
  • Remember SSN_nvul = SSN_commit
  • Prior to load commit:
  • Lookup SSN_check = SSBF[address]
  • Load is vulnerable only if SSN_check > SSN_nvul
  • Bloom filter aliasing
  • Possible false positives, but no false negatives
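
A minimal Python sketch of the SVW bookkeeping on slides 8-9, assuming a direct-mapped SSBF indexed by a hash of the address; the table size, hash, and function names are illustrative. In this sketch SSN_commit counts committed stores, so the counter value is the SSN of the youngest committed store (the slide increments after the table update; the off-by-one does not change the idea).

    SSBF_ENTRIES = 1024
    ssbf = [0] * SSBF_ENTRIES      # store sequence Bloom filter: SSN of the last
                                   # committed store that mapped to each index
    ssn_commit = 0                 # SSN of the youngest committed store

    def ssbf_index(addr):
        return hash(addr) % SSBF_ENTRIES   # aliasing gives false positives only

    def store_commit(addr):
        global ssn_commit
        ssn_commit += 1                    # this store's SSN
        ssbf[ssbf_index(addr)] = ssn_commit

    def load_dispatch():
        # SSN_nvul: stores with SSN <= this value have already written the cache.
        return ssn_commit

    def load_is_vulnerable(addr, ssn_nvul):
        # Checked prior to load commit; a hit means the load must re-execute.
        ssn_check = ssbf[ssbf_index(addr)]
        return ssn_check > ssn_nvul

Because of Bloom-filter aliasing, load_is_vulnerable may return true spuriously (a harmless re-execution), but it never returns false for a genuinely vulnerable load.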

  10. Outline
  • Store Vulnerability Window
  • Fire-and-Forget
  • NoSQ
  • Results
  • Conclusion

  11. Fire-and-Forget
  • Pushes store value to load
  • Maintain dynamic load sequence number (LSN)
  • Incremented on every load
  • Stored in load queue

  12. FnF Operation (1)
  • A store tracks where it would have been in the LQ
  • Take LSN_store at rename
  • Save in store PC table (SPCT) on commit
  [Diagram: load queue entries from newest (LSN 9) to oldest (LSN 5); Store A at PC1 records (A, PC1, LSN 9) in the SPCT]

  13. FnF Operation (2)
  • A load re-executes, finds incorrect value
  • Looks up most recent store to its address
  • Distance = LSN_current – LSN_store
  • Saves distance in load distance predictor (LDP)
  [Diagram: Load A at PC2 re-executes with LSN 23; the SPCT entry (A, PC1, LSN 9) gives distance 23 – 9 = 14, recorded in the LDP as (PC1, dist 14)]

  14. FnF Operation (3)
  • Also on load re-execution
  • Set load consumption predictor (LCP) flag for load
  [Diagram: re-executing Load A at PC2 (LSN 23) sets the LCP entry (PC2, useFwd: 1)]
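
A rough Python sketch of the FnF training path on slides 12-14, with the SPCT, LDP, and LCP modeled as unbounded dicts (the real structures are finite tables indexed by address and PC bits); names are illustrative.

    spct = {}   # store PC table: address -> (store PC, LSN the store saw at rename)
    ldp  = {}   # load distance predictor: store PC -> predicted LSN distance
    lcp  = {}   # load consumption predictor: load PC -> expect a forwarded value?

    def fnf_store_commit(addr, store_pc, lsn_store):
        # Slide 12: record where this store "would have been" in the load queue.
        spct[addr] = (store_pc, lsn_store)

    def fnf_load_reexecution_mispredict(addr, load_pc, lsn_load):
        # Slides 13-14: a re-executing load found an incorrect value.
        entry = spct.get(addr)
        if entry is None:
            return                             # no recent store recorded for this address
        store_pc, lsn_store = entry
        ldp[store_pc] = lsn_load - lsn_store   # e.g. 23 - 9 = 14 in the slides
        lcp[load_pc] = True                    # this load should wait for forwarding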

  15. Firing and Forgetting
  • When ready, store consults LDP
  • Predicted LQ entry = LSN_current + LDP_dist
  • Write value into predicted entry
  • At dispatch, load consults LCP
  • If flag set, use forwarding (may have to wait)
  • If not set, use cache
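
Continuing the sketch above, the forwarding path on slide 15 might look like the following; the circular load queue and the reading of LSN_current as the store's LSN reference from rename are assumptions of this sketch.

    def fnf_store_fire(store_pc, lsn_current, value, load_queue):
        # Slide 15: push the store's value into the predicted load queue entry.
        # lsn_current: the store's LSN reference point (assumed to be LSN_store
        # captured at rename, per slide 12).
        dist = ldp.get(store_pc)
        if dist is not None:
            load_queue[(lsn_current + dist) % len(load_queue)] = value

    def fnf_load_dispatch(load_pc):
        # Slide 15: wait for a forwarded value, or read the cache as usual.
        return "use_forwarding" if lcp.get(load_pc, False) else "use_cache"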

  16. Final Detail
  • The SQ would still be needed to buffer store values in program order
  • Instead, use the ROB
  • May need to allocate a physical register to hold the value
  • Then the SQ serves no remaining purpose

  17. Outline
  • Store Vulnerability Window
  • Fire-and-Forget
  • NoSQ
  • Results
  • Conclusion

  18. No Store Queue (NoSQ)
  • Pulls store value to load
  • Opposite of FnF
  • Does not dispatch stores to OoO engine
  • Uses dynamic store sequence numbers (SSNs)
  • Separate counters for rename and commit

  19. General NoSQ Forwarding
  • For loads far from forwarding store:
  • Forward through cache
  • For loads close to forwarding store:
  • Forward through register file
  • Map the store's register input as the load's register output
  • Speculative Memory Bypassing [Moshovos and Sohi, MICRO '97]
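
To make the register-file path concrete: speculative memory bypassing can be expressed at rename by aliasing the load's architectural destination to the physical register that already holds the store's data. A tiny sketch, with the rename map modeled as a dict; names are hypothetical and the details are assumptions of this sketch.

    def bypass_load_at_rename(rename_map, load_dest_areg, store_src_preg):
        # Instead of allocating a new physical register and issuing a memory
        # access, map the load's destination to the store's source register.
        rename_map[load_dest_areg] = store_src_preg
        # Later instructions that read load_dest_areg now read the store's data
        # directly; SVW-based re-execution catches a wrong bypass prediction.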

  20. Stores (1)
  • At rename:
  • Update store register queue (SRQ)
  • SSN_rename++
  [Diagram: Store A with source register 43 and SSN_rename = 8 adds SRQ entry (SSN 8, reg 43)]

  21. Stores (2)
  • At commit:
  • Normal operation for SVW
  • Update SSBF[address] = SSN_commit
  • SSN_commit++
  [Diagram: Store A commits with SSN_commit = 11; the SSBF entry for A becomes SSN 11]

  22. Loads (1)
  • At commit:
  • Train store-load bypassing predictor (SLBP)
  • With distance from most recent store
  • Distance = SSN_commit – SSBF[address]
  [Diagram: Load A at PC2 commits with SSN_commit = 27; SSBF[A] = 11, so the SLBP records (PC2, dist 16)]
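
A sketch of the SLBP training step on slide 22, reusing ssbf and ssbf_index from the SVW sketch above; the SLBP is shown as a dict keyed by load PC, whereas the hardware would use a finite PC-indexed table.

    slbp = {}   # store-load bypassing predictor: load PC -> predicted SSN distance

    def nosq_train_at_load_commit(load_pc, addr, ssn_commit_now):
        # Slide 22: distance from the committing load back to the most recent
        # (possibly aliased) store to the same address.
        ssn_last_store = ssbf[ssbf_index(addr)]
        if ssn_last_store:                                   # 0 means no store seen yet
            slbp[load_pc] = ssn_commit_now - ssn_last_store  # e.g. 27 - 11 = 16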

  23. Loads (2)
  • At rename:
  • Lookup predictor, calculate distance, lookup SRQ
  • Compare SSN_pred with SSN_commit
  • Determine if most recent store already in cache
  [Diagram: Load A at PC2 with SSN_rename = 47; the SLBP predicts dist 16, so SSN_pred = 47 – 16 = 31, matching SRQ entry (SSN 31, reg 22)]

  24. Bypass During Rename
  At rename, a load may:
  • Miss in predictor
  • Normal OoO execution; goes to cache
  • Hit in predictor, store already committed
  • Normal OoO execution; goes to cache
  • Hit in predictor, store not committed
  • Forward through register file
  • Non-trivial forwarding (e.g., narrow write)
  • Delay load until store commits
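
A sketch of the three-way decision on slide 24, using the slbp above and an SRQ modeled as a dict from SSN to the store's source physical register (a real SRQ is a small circular buffer); the narrow-write delay case is reduced to a comment.

    def nosq_rename_load(load_pc, ssn_rename, ssn_commit_now, srq):
        dist = slbp.get(load_pc)
        if dist is None:
            return ("cache", None)            # predictor miss: normal OoO load
        ssn_pred = ssn_rename - dist          # e.g. 47 - 16 = 31 in the slides
        if ssn_pred <= ssn_commit_now:
            return ("cache", None)            # predicted store already committed
        src_preg = srq.get(ssn_pred)
        if src_preg is None:
            return ("cache", None)            # conservative fallback in this sketch
        # Non-trivial forwarding (e.g., a narrow write) would instead delay the
        # load until the predicted store commits.
        return ("bypass_register", src_preg)  # forward through the register file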

  25. Similarities
  • Rely on in-order back end
  • To correct mispredictions
  • SVW is the enabling idea
  • To update cache in program order
  • Additional register file pressure
  • Use dynamic instruction sequence numbers
  • For distance prediction
  • Most bypassed loads never access cache

  26. Outline
  • Store Vulnerability Window
  • Fire-and-Forget
  • NoSQ
  • Results
  • Conclusion

  27. Methodology
  • SimpleScalar
  • Aggressive out-of-order configurations
  • ROB: 512-entry (NoSQ), 128- & 256-entry (FnF)
  • SPEC CPU2000 & MediaBench
  • Additional pointer/game apps for FnF

  28. Performance Results
  • Expectation:
  • Lower performance due to mis-speculations
  • Result:
  • Performance improves 1%–3% (both NoSQ & FnF)
  • Explanation:
  • Reduced issue queue / cache port contention

  29. Selected Other Results
  • Power for Fire-and-Forget
  • CACTI 4.1, 65nm technology
  • FnF uses 50% of the power of a traditional associative SQ
  • However, the SQ accounts for only 2% of total power
  • Data cache read bandwidth for NoSQ
  • Store-load forwarding reduces cache pressure
  • Load re-execution increases pressure
  • On average, ~10% reduction in read bandwidth

  30. Related Work
  • Non-Associative Load Queue [Cain & Lipasti, ISCA '04]
  • Load re-execution to address power in LQ
  • Stores never forward, only write to cache
  • Store Queue Index Prediction [Sha et al., MICRO '05]
  • Extension to SVW
  • Direct-mapped indexing instead of associative SQ
  • Much work on reducing complexity
  • Partitioned / hierarchical designs
  • Reduce, but do not eliminate, age-based search

  31. Conclusion
  • Enabling idea: store vulnerability window
  • Lightweight detection for load re-execution
  • NoSQ and Fire-and-Forget
  • Predict store-load forwarding
  • Based on dynamic load/store instruction counts
  • Eliminate need for store queue structure
  • No performance loss relative to traditional SQ
