Duplicating and Deconstructing Virtual Load/Store Queues. Vikas Garg, Sonal Agarwal. 5th Annual Workshop on Duplicating, Deconstructing and Debunking
Motivation • Large instruction window and load/store queue to achieve high performance • Speculative execution of memory instructions • Replay traps due to reordering of memory accesses • Pipeline flushes to handle replay traps • Wasted pipeline operations (power) • Excessive L1 accesses (power and locality)
Motivation • Virtual Load/Store Queue (VLSQ) proposal [Jaleel, HPCA '05] • Uses a large load/store queue for the front end • Throttles memory instructions at the issue stage • Reduces reordering of memory instructions • Helps avoid replay traps • Saves power • No big performance drop. Does a VLSQ really work? What if we simply reduce the LSQ size?
Outline • Motivation • VLSQ Introduction • Simulation Setup • VLSQ Results • VLSQ vs. LSQ • Conclusions
VLSQ Introduction [Figure: a 16-entry load/store queue (LD/ST 0 through LD/ST 15) sitting between the front end and issue. Pointers: LSQ Head, Virtual Head, Virtual Tail, LSQ Tail. Entry states in the legend: ISSUED, NOT READY, BLOCKED, EMPTY.]
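The queue structure pictured on this slide can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the mechanism from [Jaleel, HPCA '05]: the class name `VirtualLSQ`, its fields, and its methods are hypothetical, and the virtual window is assumed to start at the LSQ head and cover the oldest `window` entries. The real design may place the virtual head differently (for example at the oldest unissued entry).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VirtualLSQ:
    """Toy model: a large physical LSQ with a smaller issue window."""
    lsq_size: int   # physical entries, filled by the front end
    window: int     # assumed virtual window size (entries allowed to issue)
    entries: deque = field(default_factory=deque)  # oldest entry first

    def allocate(self, op_id):
        """Front end inserts at the LSQ tail; fails when physically full."""
        if len(self.entries) >= self.lsq_size:
            return False
        self.entries.append(op_id)
        return True

    def issuable(self):
        """Only the oldest `window` entries are candidates for issue."""
        return list(self.entries)[: self.window]

    def retire_oldest(self):
        """Commit at the LSQ head; the virtual window slides forward."""
        return self.entries.popleft()


q = VirtualLSQ(lsq_size=16, window=4)
for i in range(16):
    q.allocate(i)
print(q.issuable())      # the four oldest memory operations
q.retire_oldest()
print(q.issuable())      # window has slid forward by one entry
```

In this sketch the front end keeps filling all 16 entries, while issue only ever sees four of them, which is the throttling effect the slide describes.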
VLSQ Pipeline Operation [Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue blocks; two "Stall" labels.]
Simulation Setup • Alpha 21264 simulator (sim-alpha) • I-cache (64 KB, 1 cycle); D-cache (64 KB, 3 cycles) • L2 cache (2 MB, 15 cycles) • 1.3 GB/s DDR SDRAM (DRAMsim) • 1024-entry store-wait table • 2048-line 2-level bimodal branch predictor • Pipeline width: Fetch (8); Issue (8/4); Commit (11) • Functional units: Int (4), Int-Mul (4), FP (1), FP-Mul (1) • Subset of SPEC 2000 benchmarks • FP: applu, art, mgrid, swim; INT: gcc, gzip, mcf, twolf • Warm-up: 2 billion instructions; Data: 500 million instructions • Reference inputs
Simulation Setup (Continued…) • Baseline Out-of-Order Configurations • VLSQ experiments: baseline LSQ with VLSQ sizes of infinity, 64, 32, 16, 8, 4, and 2 • LSQ experiments: infinite VLSQ with LSQ sizes of 64, 32, 16, 8, 4, and 2
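The two experiment sweeps above can be written out as explicit configuration lists. The sketch below is illustrative only: `sweep_configs` and the dictionary keys are hypothetical names, and the baseline LSQ size is left as a caller-supplied parameter because the baseline configuration table is not reproduced in this transcript.

```python
import math

# Sizes swept in both experiments, as listed on the slide.
SIZES = [64, 32, 16, 8, 4, 2]


def sweep_configs(baseline_lsq):
    """Return (vlsq_runs, lsq_runs) as lists of {'lsq', 'vlsq'} dicts.

    baseline_lsq: physical LSQ size of the baseline machine (not given
    in the transcript, so the caller must supply it).
    """
    # VLSQ sweep: physical LSQ fixed at baseline, virtual window varies.
    vlsq_runs = [{"lsq": baseline_lsq, "vlsq": v} for v in [math.inf] + SIZES]
    # LSQ sweep: no virtual throttling (infinite VLSQ), physical size varies.
    lsq_runs = [{"lsq": s, "vlsq": math.inf} for s in SIZES]
    return vlsq_runs, lsq_runs
```

An infinite VLSQ means the virtual window never restricts issue, so the first entry of the VLSQ sweep is simply the unthrottled baseline machine.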
VLSQ - Performance
VLSQ - Trap Overhead
VLSQ - Map/Rename Stalls
VLSQ Pipeline Operation (Continued…) [Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue blocks; four "Stall" labels.]
VLSQ Summary • Reduces speculation and replay traps • Not a big performance drop • Saves power • Stall propagates backwards • Needs many memory-independent instructions • On the critical path? VLSQ works! What if we simply reduce the LSQ size?
Small Load/Store Queue [Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and Load/Store Queue blocks; two "Stall" labels.]
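One way to picture why a small physical LSQ stalls the front end early is a static snapshot in which nothing retires. The sketch below is a toy under that assumption, not the simulator used in the talk: `renamed_before_stall` is a hypothetical helper, and the instruction stream is just a string with "M" for a memory instruction and "O" for any other instruction.

```python
def renamed_before_stall(stream, lsq_size):
    """Count instructions renamed before the LSQ fills (no retirement).

    stream:   program-order string, 'M' = load/store, 'O' = other
    lsq_size: number of physical LSQ entries
    """
    in_lsq = 0
    for n, kind in enumerate(stream):
        if kind == "M":
            if in_lsq == lsq_size:
                return n  # rename stalls at this memory instruction
            in_lsq += 1
    return len(stream)  # the whole stream fits without stalling


stream = "MO" * 16  # 32 instructions, every other one a memory op
print(renamed_before_stall(stream, lsq_size=4))   # small LSQ
print(renamed_before_stall(stream, lsq_size=16))  # larger LSQ
```

With 4 entries the front end stops at the fifth memory instruction, while a 16-entry queue lets the whole stream through. A real pipeline retires entries continuously, so this only shows where the stall originates, not how often it occurs.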
VLSQ vs. LSQ (Map/Rename Stalls)
VLSQ vs. LSQ (Performance)
VLSQ vs. LSQ (Trap Overhead)
VLSQ vs. LSQ (Summary) ROB size: 512; VLSQ size: 16; LSQ size: 16
LSQ Summary • Reduces speculation and replay traps • Better performance vs. power tradeoff than VLSQ • Simpler than VLSQ • Not on the critical path • Additional power saving from a smaller LSQ. VLSQ works, BUT reducing the LSQ size is better than using a VLSQ!
Dynamic Throttling • Dynamic throttling is easy with a VLSQ • Just need to tweak the VLSQ window size • Might be better to just vary the LSQ size • Maybe we can just shut down parts of the LSQ • Better to throttle in the issue stage using just-in-time instruction delivery [Karkhanis, ISLPED '02]
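The slide does not specify a throttling policy, so the following is only one possible sketch of "tweaking the VLSQ window size". The function `next_window`, the replay-trap thresholds, and the halve/double rule are all assumptions chosen for illustration: the window shrinks when an interval sees many replay traps and grows again when traps are rare.

```python
def next_window(window, traps, *, high=8, low=1, min_window=2, max_window=64):
    """Pick the VLSQ window size for the next sampling interval.

    window: current window size
    traps:  replay traps observed in the last interval
    high, low, min_window, max_window: hypothetical tuning knobs
    """
    if traps > high:
        # Too much reordering: throttle harder by halving the window.
        return max(min_window, window // 2)
    if traps < low:
        # Few traps: allow more memory-level parallelism.
        return min(max_window, window * 2)
    return window  # trap rate in the acceptable band: leave unchanged


window = 16
for traps in [20, 12, 4, 0, 0]:  # made-up per-interval trap counts
    window = next_window(window, traps)
print(window)
```

The same rule could drive the physical LSQ size instead of the virtual window, which is the alternative the slide suggests may be preferable.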
Conclusions • Speculative execution of memory instructions wastes power through replay traps • VLSQ helps reduce memory reordering and replay traps • Reducing the LSQ size is more effective • For power saving, it is better to throttle earlier in the pipeline
Duplicating and Deconstructing Virtual Load/Store Queues. Questions?