Duplicating and Deconstructing Virtual Load/Store Queues


  1. Duplicating and Deconstructing Virtual Load/Store Queues
     Vikas Garg, Sonal Agarwal
     5th Annual Workshop on Duplicating, Deconstructing and Debunking

  2. Motivation
     • Large instruction window and load/store queue to achieve high performance
     • Speculative execution of memory instructions
     • Replay traps due to re-ordering of memory accesses
     • Pipeline flushes to handle replay traps
     • Wasted pipeline operations (power)
     • Excessive L1 accesses (power and locality)

  3. Motivation
     • Virtual Load/Store Queue (VLSQ) proposal [Jaleel, HPCA'05]
     • Use a large load/store queue for the front end
     • Throttle memory instructions at the issue stage
     • Reduces the re-ordering of memory instructions
     • Helps in avoiding replay traps
     • Saves power
     • No big performance drop
     Does a VLSQ really work? What if we simply reduce the LSQ size?

  4. Outline
     • Motivation
     • VLSQ Introduction
     • Simulation Setup
     • VLSQ Results
     • VLSQ vs. LSQ
     • Conclusions

  5. VLSQ Introduction
     [Figure: a 16-entry load/store queue (LD/ST 0 to LD/ST 15) between the FRONT END and ISSUE, with four pointers: LSQ Head, Virtual Head, Virtual Tail, and LSQ Tail. Legend for entry states: ISSUED, NOT READY, BLOCKED, EMPTY.]
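The idea in the figure above can be sketched in a few lines of Python. This is a simplified illustration and not the authors' simulator code or the exact mechanism from the original proposal: it assumes the virtual window simply covers the oldest `window_size` entries of the physical queue, and the names `MemOp`, `VirtualLSQ`, `lsq_size`, and `window_size` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MemOp:
    seq: int              # program-order sequence number
    ready: bool = False   # operands/address available
    issued: bool = False  # already sent to the memory pipeline


class VirtualLSQ:
    """Physical LSQ with a smaller virtual window visible to issue.

    Assumption: the window covers the oldest `window_size` entries.
    The real pointer-update policy may differ.
    """

    def __init__(self, lsq_size: int, window_size: int):
        self.lsq_size = lsq_size
        self.window_size = window_size
        self.entries = deque()  # oldest entry at the left

    def allocate(self, op: MemOp) -> bool:
        """Front end: returns False (stall) when the physical LSQ is full."""
        if len(self.entries) >= self.lsq_size:
            return False
        self.entries.append(op)
        return True

    def issue_candidates(self) -> list:
        """Only ready, unissued entries inside the virtual window may issue."""
        window = list(self.entries)[: self.window_size]
        return [op for op in window if op.ready and not op.issued]

    def issue(self, op: MemOp) -> None:
        op.issued = True

    def commit(self):
        """Retire the oldest entry if it has issued; the window slides forward."""
        if self.entries and self.entries[0].issued:
            return self.entries.popleft()
        return None
```

Entries outside the window correspond to the BLOCKED state in the legend: they occupy the physical queue, so the front end keeps running, but the issue stage cannot select them until older entries retire.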

  6. VLSQ Pipeline Operation
     [Figure: pipeline diagram with stages Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue; two "Stall" labels mark where stalls occur.]

  7. Outline
     • Motivation
     • VLSQ Introduction
     • Simulation Setup
     • VLSQ Results
     • VLSQ vs. LSQ
     • Conclusion

  8. Simulation Setup
     • Alpha 21264 simulator (sim-alpha)
     • I-Cache (64 KB, 1 cycle); D-Cache (64 KB, 3 cycles)
     • L2 cache (2 MB, 15 cycles)
     • 1.3 GB/s DDR SDRAM (DRAMsim)
     • 1024-entry store-wait table
     • 2048-line 2-level bimodal branch predictor
     • Pipeline width: Fetch (8); Issue (8/4); Commit (11)
     • Functional units: Int (4), Int-Mul (4), FP (1), FP-Mul (1)
     • Subset of SPEC 2000 benchmarks
     • FP: applu, art, mgrid, swim; INT: gcc, gzip, mcf, twolf
     • Warm-up: 2 billion instructions; data: 500 million instructions
     • Reference input

  9. Simulation Setup (Continued)
     • Baseline out-of-order configurations
     • For VLSQ: use the baseline LSQ with VLSQ sizes of infinity, 64, 32, 16, 8, 4, and 2
     • For LSQ: use a VLSQ of infinity with LSQ sizes of 64, 32, 16, 8, 4, and 2
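The two sweeps described above can be enumerated as follows. This is a minimal sketch of the experiment grid, not the authors' scripts: the function names and dictionary keys are hypothetical, and the baseline LSQ size is not given in the transcript, so it is left as a caller-supplied value.

```python
import math

# Window/queue sizes listed on the slide; math.inf stands for "Inf".
SIZES = [math.inf, 64, 32, 16, 8, 4, 2]


def vlsq_sweep(baseline_lsq):
    """VLSQ experiments: fixed baseline LSQ, varying virtual window size."""
    return [{"lsq": baseline_lsq, "vlsq": size} for size in SIZES]


def lsq_sweep():
    """LSQ experiments: unlimited virtual window, varying physical LSQ size."""
    return [{"lsq": size, "vlsq": math.inf} for size in SIZES if size != math.inf]


if __name__ == "__main__":
    # "baseline" is a placeholder; the actual baseline size is not stated here.
    for config in vlsq_sweep("baseline") + lsq_sweep():
        print(config)
```

Setting the VLSQ to infinity in the second sweep removes the issue-stage throttle, so any effect comes from the physical queue size alone.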

  10. Outline
      • Motivation
      • VLSQ Introduction
      • Simulation Setup
      • VLSQ Results
      • VLSQ vs. LSQ
      • Conclusion

  11. VLSQ - Performance

  12. VLSQ - Trap Overhead

  13. VLSQ - Map/Rename Stalls

  14. VLSQ Pipeline Operation (Continued)
      [Figure: pipeline diagram with stages Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue; four "Stall" labels show the stall propagating through the pipeline.]

  15. VLSQ Summary
      • Reduces speculation and replay traps
      • Not a big performance drop
      • Saves power
      • Stall propagates backwards
      • Needs many memory-independent instructions
      • On the critical path?
      VLSQ works! What if we simply reduce the LSQ size?

  16. Outline
      • Motivation
      • VLSQ Introduction
      • Simulation Setup
      • VLSQ Results
      • VLSQ vs. LSQ
      • Conclusion

  17. Small Load/Store Queue
      [Figure: pipeline diagram with stages Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue; two "Stall" labels mark where stalls occur.]

  18. VLSQ vs. LSQ (Map/Rename Stalls)

  19. VLSQ vs. LSQ (Performance)

  20. VLSQ vs. LSQ (Trap Overhead)

  21. VLSQ vs. LSQ (Summary)
      ROB size: 512; VLSQ size: 16; LSQ size: 16

  22. LSQ Summary
      • Reduces speculation and replay traps
      • Performance vs. power tradeoff better than that for VLSQ
      • Simpler than VLSQ
      • Not on the critical path
      • Additional power saving from a smaller LSQ
      VLSQ works, BUT reducing the LSQ size is better than using a VLSQ!

  23. Dynamic Throttling
      • Easy to do dynamic throttling using VLSQ
      • Just need to tweak the VLSQ window size
      • Might be better to just vary the LSQ size
      • Maybe we can just shut down parts of the LSQ
      • Better to throttle in the issue stage using just-in-time instruction delivery [Karkhanis, ISLPED'02]
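As an illustration of "tweaking the window size", the sketch below adjusts a window size once per sampling interval. The policy is entirely hypothetical and is not taken from the slides: it assumes the controller halves the window when replay traps in the last interval exceed a threshold and doubles it otherwise, clamped to the range of sizes swept in the experiments (2 to 64).

```python
def next_window_size(current, replay_traps, trap_threshold,
                     min_size=2, max_size=64):
    """Return the window size to use for the next interval.

    Hypothetical policy: shrink on too many replay traps, grow otherwise.
    """
    if replay_traps > trap_threshold:
        # Too much harmful re-ordering: throttle harder.
        return max(min_size, current // 2)
    # Few traps: relax the throttle to recover performance.
    return min(max_size, current * 2)


if __name__ == "__main__":
    size = 16
    # Made-up per-interval trap counts, for demonstration only.
    for traps in [10, 10, 0, 0, 0]:
        size = next_window_size(size, traps, trap_threshold=5)
        print(size)
```

The same function could drive the physical LSQ size instead, which is the alternative the slide suggests; only the resource being resized changes.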

  24. Conclusions
      • Speculative execution of memory instructions leads to wasted power due to replay traps
      • VLSQ helps to reduce memory re-ordering and replay traps
      • LSQ is more effective
      • For power saving it is better to throttle earlier in the pipeline

  25. Duplicating and Deconstructing Virtual Load/Store Queues
      Questions?