Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification
Alok Garg, M. W. Rashid, and Michael Huang
Department of Electrical & Computer Engineering, University of Rochester
Motivation
• Out-of-order execution needs efficient memory dependence enforcement logic
• Conventional approach: complex, hard to scale
• Tightly coupled forwarding and enforcement
• We use two decoupled components to simplify the task
• Opportunistic forwarding using L0 cache
• Verification against in-order re-execution
• Slackened memory dependence enforcement (SMDE)
"Slackened Memory Dependence Enforcement", Alok Garg, ISCA 2006
LSQ: complex & hard to scale
• Needs priority CAMs
• Forwarding from LSQ on timing critical path
• Serialized with address translation
• Design further complicated by
• Coherence and consistency considerations
• Corner cases: e.g., partial overlap of operands
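To make "priority" concrete, here is a minimal software sketch of the search a conventional store queue performs for each load. It is an illustration only and does not come from the slides: the function name `forward_from_store_queue` and the `(age, addr, value)` tuple layout are hypothetical. The point is that an address match is not enough. The queue must pick the youngest matching store that is still older than the load, and that selection is what the priority logic implements in hardware.

```python
def forward_from_store_queue(store_queue, load_age, load_addr):
    """Return the value of the youngest store older than the load that
    matches its address, or None if no such store is in the queue.

    store_queue: iterable of (age, addr, value); smaller age = older.
    Hypothetical layout, chosen only for this illustration.
    """
    best = None  # (age, value) of the best candidate seen so far
    for age, addr, value in store_queue:
        matches = addr == load_addr and age < load_age
        if matches and (best is None or age > best[0]):
            best = (age, value)
    return None if best is None else best[1]


if __name__ == "__main__":
    sq = [(1, 0x10, 11), (3, 0x10, 33), (5, 0x10, 55)]
    # A load with age 4 must see the store with age 3, not 1 or 5.
    print(forward_from_store_queue(sq, load_age=4, load_addr=0x10))
```

In hardware this comparison runs against every entry in parallel, which is why the structure becomes harder to build as the queue grows.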
Highlights of prior work
• Two-level load store queue [sethumadhavan03], [akkary03], [baugh04], [roth04], [torres05], [gandhi05]
• Reducing search frequency using clever filtering and prediction mechanisms [park03], [sethumadhavan03]
• Memory dependence prediction [moshovos.isca97], [moshovos.micro97], [sha05], [stone05]
• Value-based re-execution [cain04], [roth04], [sha05]
(more detailed contrast in paper)
Outline
• Overview of SMDE
• Optional performance optimizations
• Evaluation
• Conclusion
Overview of SMDE
Decoupled execution
• LSQ: competing requirements
• Front-end execution: little memory dependence enforcement
• Back-end execution: detect violations (memory access only)
• Memory B/W: naturally handled
[Figure: pipeline stages Fetch/Decode/Dispatch, Execution (out-of-order), Commit; front-end execution and back-end execution; LSQ, MUX, L0, L1, memory hierarchy]
Why it works: two perspectives
• Back-end execution is the only one required
• Totally in-order, preserving dependence
• Any front-end execution is OK
• L0 is effectively a slow but accurate value predictor
• Front-end execution is correct most of the time
• Common case: 99% of loads happen at the right time
• Speculation is on the timing of load-store pairs
• Two-level LSQs speculate on the scope of stores
• Relatively expensive replays are OK
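The two perspectives above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the authors' simulator: the names `MemOp`, `front_end`, and `back_end` are hypothetical, caches are plain dictionaries, and unwritten memory reads as 0. The front end executes memory operations in whatever order they issue against L0, with no dependence checks. The back end re-executes them strictly in program order against L1 and flags the first load whose front-end value disagrees.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    age: int        # program order; smaller = older
    kind: str       # "ST" or "LD"
    addr: int
    value: int = 0  # store data, or the value a load saw in the front end


def front_end(ops_in_issue_order, l0):
    """Opportunistic execution against L0, with no dependence enforcement."""
    for op in ops_in_issue_order:
        if op.kind == "ST":
            l0[op.addr] = op.value
        else:
            op.value = l0.get(op.addr, 0)  # may be premature


def back_end(ops, l1):
    """In-order re-execution against L1.

    Returns the age of the first load whose front-end value was wrong
    (the replay point), or None if every load verified.
    """
    for op in sorted(ops, key=lambda o: o.age):
        if op.kind == "ST":
            l1[op.addr] = op.value
        elif l1.get(op.addr, 0) != op.value:
            return op.age
    return None


if __name__ == "__main__":
    st = MemOp(age=1, kind="ST", addr=0x10, value=7)
    ld = MemOp(age=2, kind="LD", addr=0x10)
    front_end([ld, st], l0={})        # the load issues before the older store
    print(back_end([st, ld], l1={}))  # the back end flags the load (age 2)
```

Because the back end alone decides correctness, the front end in this sketch is free to return any value for a load. A wrong value costs a replay but never produces a wrong result.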
Advantage: simplicity
• No priority CAM
• Decoupled design: flexible, modular
• Front end: large degree of freedom
• No need for address translation
• Soft errors can be ignored (ECC not needed)
• Corner cases: handle partial overlaps naturally
• Can ignore coherence invalidations
Performance of naïve design
[Figure; LQ: 64, SQ: 48]
Optional performance optimizations
Reducing replay frequency
• Major replay cause: RAW violations
• 48% of replays are due to RAW violations
• Replays indirectly cause more replays
• Often the store address is available (the data is not)
• Fuzzy disambiguation queue (FDQ)
• Reject known premature loads
• Best effort is enough, no need to guarantee anything
• Conventional LSQ handles this (e.g., POWER4)
FDQ: How it works
[Figure: ROB entries 1 to 6 ordered old to new, containing stores (ST) and loads (LD); the Fuzzy Disambiguation Queue holds (Address, AGE) entries that are compared against a load's Address and AGE]
FDQ not complex
• Very different from conventional SQ
• Does not have priority logic
• No need to merge with cache data path
• Small queue is sufficient: no scalability pressure
• Stores do not stay in FDQ for the entire lifetime
• Flexible replacement
• A “local” technique
• The only support needed is load rejection
• No need to augment issue logic to enforce predicted dependence
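The FDQ behaviour described on these slides can be sketched as follows. This is a minimal model under assumptions that the slides do not spell out: a store enters the queue when its address is known and leaves once it has executed, and the replacement policy evicts the oldest-inserted entry. The class and method names are hypothetical. Because the queue is best effort, dropping an entry is always safe. The worst outcome is that a premature load slips through and is caught later by verification.

```python
from collections import OrderedDict


class FuzzyDisambiguationQueue:
    """Best-effort filter that rejects loads known to be premature."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()  # store age -> store address

    def store_address_ready(self, age, addr):
        """Record a store whose address is known but whose data is not."""
        if age not in self.entries and len(self.entries) >= self.capacity:
            # Assumed policy: evict the oldest-inserted entry.
            self.entries.popitem(last=False)
        self.entries[age] = addr

    def store_executed(self, age):
        """The store has written its data; loads no longer need rejecting."""
        self.entries.pop(age, None)

    def should_reject(self, load_age, load_addr):
        """True if any recorded older store targets the load's address.

        No priority selection is needed: a single match is enough.
        """
        return any(age < load_age and addr == load_addr
                   for age, addr in self.entries.items())


if __name__ == "__main__":
    fdq = FuzzyDisambiguationQueue(capacity=2)
    fdq.store_address_ready(age=1, addr=0x40)
    print(fdq.should_reject(load_age=3, load_addr=0x40))  # rejected for now
    fdq.store_executed(age=1)
    print(fdq.should_reject(load_age=3, load_addr=0x40))  # may proceed
```

Note the contrast with a conventional store queue: the check only answers yes or no, so it never has to identify which matching store is the youngest.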
Write buffer at the back-end
• Temporarily holds not yet committed stores
• Allows back-end execution of loads and stores to start early
• A few entries are sufficient to streamline back-end execution
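A rough sketch of how such a write buffer could behave, under assumptions of my own: the class name `BackEndWriteBuffer` and its methods are hypothetical, L1 is a dictionary, unwritten memory reads as 0, and a squash simply discards buffered stores. The linear search is for illustration only and says nothing about how a hardware lookup would be organized.

```python
from collections import deque


class BackEndWriteBuffer:
    """Holds back-end stores that have executed but not yet committed."""

    def __init__(self, l1, capacity=8):
        self.l1 = l1
        self.capacity = capacity
        self.pending = deque()  # (addr, value) in program order, oldest first

    def store(self, addr, value):
        """Buffer a store; returns False if the buffer is full (stall)."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((addr, value))
        return True

    def load(self, addr):
        """Back-end load: newest buffered store wins, otherwise read L1."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.l1.get(addr, 0)

    def commit_oldest(self):
        """Drain the oldest buffered store into L1 when it commits."""
        addr, value = self.pending.popleft()
        self.l1[addr] = value

    def squash(self):
        """Discard uncommitted stores, e.g. on a replay."""
        self.pending.clear()


if __name__ == "__main__":
    l1 = {0x10: 1}
    wb = BackEndWriteBuffer(l1, capacity=2)
    wb.store(0x10, 5)
    print(wb.load(0x10), l1[0x10])  # the load sees 5 while L1 still holds 1
    wb.commit_oldest()
    print(l1[0x10])                 # L1 is updated only at commit
```

Since back-end operations run in program order, every buffered store is older than the load that searches the buffer, so no age comparison is needed in this sketch.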
Evaluation
Evaluation environment
• Simulator strives to model SMDE faithfully
• Load speculation, load rejection, and store-load replay
• Data values in the caches
• Scheduling replays
• No load queue entry is allocated for prefetches
• SPEC CPU2000 benchmark suite
• System configuration
• ROB/Register (INT, FP): 512/(400, 400)
• LSQ (LQ, SQ): 112 (64, 48)
• L0 speculative cache: 16KB, 2-way, 1 cycle
Impact of 8-entry write buffer
Replay frequency reduction
[Figure: (a) Integer applications. (b) Floating-point applications.]
Replay breakdown
Performance improvement
Scalability test
[Figure: memory dependence logic unchanged; ROB, RFs, IQs doubled]
Other details in paper
• Scope of replay
• Detailed study on replay causes
• Replay suppression technique
• Age-based filtering
• Discussion on L0 flush policy
• Understanding write buffer
• Membership test for write buffer
* “Implementation Issues of Slackened Memory Dependence Enforcement”, A. Garg, M. Rashid, and M. Huang, Technical Report.
Conclusions
• Common-case forwarding and the correctness guarantee are handled separately
• Decoupled execution allows modular design, verification, and optimization
• Forwarding logic is simple to design and incurs minimal interference on execution
• Scales very well
• Can achieve close to ideal performance
Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification
Alok Garg, M. W. Rashid, and Michael Huang
Department of Electrical & Computer Engineering, University of Rochester
Link to technical report: http://www.ece.rochester.edu/~garg/documents/isca06tr.pdf
Streamlining back-end execution
[Figure: timing diagram over cycles 1 to 7 for ROB entries ordered by age, old to new; labels: verification, commit, ST, LD, bubble, reload]
Streamlining back-end execution
• Insert write buffer at the commit stage
[Figure: timing diagram over cycles 1 to 7 for ROB entries ordered by age, old to new; labels: ST, LD, WB, CT, RL]