Duplicating and Deconstructing Virtual Load/Store Queues

Vikas Garg, Sonal Agarwal
5th Annual Workshop on Duplicating, Deconstructing and Debunking

Motivation
• Large instruction window and load/store queue to achieve high performance
• Speculative execution of memory instructions
• Replay traps due to re-ordering of memory accesses
• Pipeline flushes to handle replay traps
• Wasted pipeline operations (power)
• Excessive L1 accesses (power and locality)

Motivation
• Virtual Load/Store Queue (VLSQ) proposal [Jaleel, HPCA’05]
• Uses a large load/store queue for the front end
• Throttles memory instructions at the issue stage
• Reduces the re-ordering of memory instructions
• Helps avoid replay traps
• Saves power
• No big performance drop
Does a VLSQ really work?
What if we simply reduce the LSQ size?

Outline
• Motivation
• VLSQ Introduction
• Simulation Setup
• VLSQ Results
• VLSQ vs. LSQ
• Conclusions

VLSQ Introduction
[Figure: a 16-entry load/store queue (LD/ST 0 through LD/ST 15) marked with LSQ Head, Virtual Head, Virtual Tail and LSQ Tail pointers; the FRONT END and ISSUE sides of the queue are labelled. Legend: ISSUED, NOT READY, BLOCKED, EMPTY.]
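
To make the roles of the virtual head and virtual tail concrete, here is a minimal Python sketch of how entries might be labelled. It is an illustration under stated assumptions, not the mechanism from [Jaleel, HPCA’05]: the function name, its parameters, and the rule that only entries between the virtual head and virtual tail may issue are all hypothetical, and "CAN ISSUE" is a label added here that does not appear in the figure legend.

```python
def classify_entries(ready, issued, virtual_head, window):
    """Label each occupied LSQ slot, oldest first.

    ready[i]     -- operands of memory op i are available
    issued[i]    -- memory op i has already issued
    virtual_head -- index of the first slot inside the virtual queue (assumed)
    window       -- number of slots in the virtual queue (assumed)
    """
    virtual_tail = virtual_head + window  # exclusive upper bound
    labels = []
    for i in range(len(ready)):
        if issued[i]:
            labels.append("ISSUED")
        elif not (virtual_head <= i < virtual_tail):
            # Outside the virtual window: held back even if operands are ready.
            labels.append("BLOCKED")
        elif ready[i]:
            labels.append("CAN ISSUE")
        else:
            labels.append("NOT READY")
    return labels


labels = classify_entries(
    ready=[True, True, False, True, True],
    issued=[True, False, False, False, False],
    virtual_head=1,
    window=2,
)
```

In this example the last two entries are ready but fall outside the two-entry virtual window, so they are labelled BLOCKED; that is the throttling effect the slide describes.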

VLSQ Pipeline Operation
[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory and Load/Store Queue blocks; two "Stall" markers.]


Simulation Setup
• Alpha 21264 simulator (sim-alpha)
• I-Cache (64 KB, 1 cycle); D-Cache (64 KB, 3 cycles)
• L2 Cache (2 MB, 15 cycles)
• 1.3 GB/s DDR SDRAM (DRAMsim)
• 1024-entry store-wait table
• 2048-line 2-level bimodal branch predictor
• Pipeline width: Fetch (8); Issue (8/4); Commit (11)
• Functional units: Int (4), Int-Mul (4), FP (1), FP-Mul (1)
• Subset of the SPEC 2000 benchmarks
• FP: applu, art, mgrid, swim; INT: gcc, gzip, mcf, twolf
• Warm-up: 2 billion instructions; Data: 500 million instructions
• Reference input

Simulation Setup (Continued…)
• Baseline Out-of-Order Configurations
• VLSQ experiments: baseline LSQ size, with VLSQ sizes of infinity, 64, 32, 16, 8, 4, and 2
• LSQ experiments: infinite VLSQ, with LSQ sizes of 64, 32, 16, 8, 4, and 2
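
The two sweeps can be written down as a small Python sketch that enumerates the runs. The function names and dictionary keys are hypothetical, and the baseline LSQ size is passed in by the caller because its value is not given on the slide.

```python
import math

# Finite queue sizes swept in both experiments, as listed on the slide.
SWEEP_SIZES = [64, 32, 16, 8, 4, 2]


def vlsq_runs(baseline_lsq):
    """VLSQ sweep: LSQ fixed at the baseline size, VLSQ size varied.

    baseline_lsq is supplied by the caller; the slide does not state it.
    """
    return [{"lsq": baseline_lsq, "vlsq": v} for v in [math.inf] + SWEEP_SIZES]


def lsq_runs():
    """LSQ sweep: VLSQ left unbounded (no throttling), LSQ size varied."""
    return [{"lsq": s, "vlsq": math.inf} for s in SWEEP_SIZES]
```

An infinite VLSQ means the virtual window never restricts issue, so the second sweep isolates the effect of the physical LSQ size.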

VLSQ - Performance

VLSQ - Trap Overhead

VLSQ - Map/Rename Stalls

VLSQ Pipeline Operation (Continued…)
[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory and Load/Store Queue blocks; four "Stall" markers.]

VLSQ Summary
• Reduces speculation and replay traps
• Not a big performance drop
• Saves power
• Stall propagates backwards
• Needs many memory-independent instructions
• On the critical path?
VLSQ works!
What if we simply reduce the LSQ size?

Small Load/Store Queue
[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory and Load/Store Queue blocks; two "Stall" markers.]

VLSQ vs. LSQ (Map/Rename Stalls)

VLSQ vs. LSQ (Performance)

VLSQ vs. LSQ (Trap Overhead)

VLSQ vs. LSQ (Summary)
ROB size: 512; VLSQ size: 16; LSQ size: 16

LSQ Summary
• Reduces speculation and replay traps
• Performance vs. power tradeoff better than that for VLSQ
• Simpler than VLSQ
• Not on the critical path
• Additional power saving from a smaller LSQ
VLSQ works BUT…
Reducing LSQ size is better than using VLSQ!

Dynamic Throttling
• Easy to do dynamic throttling using VLSQ
• Just need to tweak the VLSQ window size
• Might be better to just vary the LSQ size
• Maybe we can just shut down parts of the LSQ
• Better to throttle in the issue stage using just-in-time instruction delivery [Karkhanis, ISLPED’02]
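
As an illustration of "tweaking the VLSQ window size", the following Python sketch adjusts the window once per sampling interval. The policy itself is an assumption made for this example and is not taken from the talk: the function name, the use of a replay-trap count as the control signal, and the `high` and `low` thresholds are all hypothetical. Only the 2 to 64 size range echoes the sweep used in the simulations.

```python
def next_window(window, traps, high=8, low=2, min_window=2, max_window=64):
    """Pick the VLSQ window size for the next interval (hypothetical policy).

    window -- current window size
    traps  -- replay traps observed in the last interval (assumed signal)
    high, low -- illustrative thresholds, not values from the talk
    """
    if traps > high:
        # Too many replay traps: shrink the window to cut re-ordering.
        return max(min_window, window // 2)
    if traps < low:
        # Few replay traps: widen the window to recover performance.
        return min(max_window, window * 2)
    return window
```

The same controller could drive the physical LSQ size instead, which is the alternative the slide suggests may be preferable.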

Conclusions
• Speculative execution of memory instructions leads to wasted power due to replay traps
• VLSQ helps to reduce memory re-ordering and replay traps
• Reducing the LSQ size is more effective than using a VLSQ
• For power saving it is better to throttle earlier in the pipeline

Questions?
