
A Performance-Correctness Explicitly-Decoupled Architecture


Presentation Transcript


  1. A Performance-Correctness Explicitly-Decoupled Architecture
     Alok Garg and Michael Huang
     Department of Electrical & Computer Engineering, University of Rochester

  2. Motivation
     • Performance optimization in a monolithic micro-architecture is difficult
     • Conservativeness in design reduces the common-case efficacy
     • Want to explicitly decouple correctness & performance
     • Optimization 1 (e.g. branch prediction)
     • Optimization 2 (e.g. out-of-order execution)
     [Figure: pipeline stages IF, ID, EX, MEM, WB]
     "A Performance-Correctness Explicitly-Decoupled Architecture (EDA)", Alok Garg, MICRO 2008

  3. Explicitly-Decoupled Architecture (EDA)
     • Design separated into performance and correctness domains
     • Implementation decoupled as well
     • Optimistic design of entire system stack
     • Economic correctness guarantee
     • Custom software-hardware interface
     [Figure: system stack (software layer, architectural layer, device layer) split into a performance domain with an optimistic core and a correctness domain with a simple, throughput-oriented correctness core; hints pass between the domains]

  4. ILP lookahead using EDA
     • Autonomy
     • Managing deviance
     [Figure: performance domain (lookahead agent: optimistic core running the skeleton, with an L0 cache) and correctness domain (throughput engine: correctness core running the program (semantic) binary, with L1 and L2 caches), linked by the Branch Outcome Queue; the skeleton is derived from the program (semantic) binary by static binary transformation; minimal mutual dependence]
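The Branch Outcome Queue on this slide can be pictured as a bounded FIFO: the optimistic core pushes branch outcomes and the correctness core consumes them as predictions. The following is a minimal sketch under stated assumptions, not the authors' design. The class and function names are hypothetical, the 512-entry default comes from the evaluation configuration later in the deck, and the "flush the queue on a mismatch" policy is an assumed stand-in for whatever the slide means by "managing deviance".

```python
from collections import deque


class BranchOutcomeQueue:
    """Bounded FIFO of branch outcomes (hypothetical model of the BOQ)."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self._q = deque()

    def push(self, taken):
        """Optimistic core side: returns False when the queue is full."""
        if len(self._q) >= self.capacity:
            return False  # lookahead has to wait for the correctness core
        self._q.append(bool(taken))
        return True

    def pop_hint(self):
        """Correctness core side: oldest outcome, or None if the queue is empty."""
        return self._q.popleft() if self._q else None

    def flush(self):
        """Drop all pending hints (assumed recovery action)."""
        self._q.clear()


def run_correctness_core(boq, actual_outcomes):
    """Consume hints as predictions; flush and count a recovery on a mismatch."""
    recoveries = 0
    for actual in actual_outcomes:
        hint = boq.pop_hint()
        if hint is not None and hint != actual:
            boq.flush()  # assumption: remaining hints are no longer trusted
            recoveries += 1
    return recoveries
```

Usage: fill a small queue from the lookahead side, then pass the real outcomes to `run_correctness_core` to see how many hints were wrong.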

  5. Outline
     • Architectural and software support needed
     • Performance optimization opportunities
     • Complexity reduction opportunities
     • Evaluation
     • Conclusion

  6. Avoiding L2-miss stalls in lookahead
     • Feed arbitrary value
     • Exact value may not matter
     • Conventional mechanism
     • Planning against contingency
     • Tagging entire dependency chain as invalid
     • State check-point and recovery
     • Type of value substitution
     • Value predictor
     • Explicitly flush the dependence chain of load
     • Opportunity: simple "0" value substitution
     • Only used when optimistic core is not too far ahead
     • Zero is the most frequently occurring value
     [Figure: example comparison, compare (x>f0)]
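The "0" value substitution above can be sketched as a load routine for the optimistic core. This is an illustrative sketch only. The function name, the dictionary-based cache and memory, the `lead` measure of how far the optimistic core is ahead, and the `max_lead` threshold are all assumptions introduced here; the slide states only that zero is substituted and that this happens when the optimistic core is not too far ahead.

```python
def lookahead_load(addr, cache, memory, lead, max_lead=64):
    """Model a load in the optimistic core.

    Returns (value, substituted). `lead` and `max_lead` are hypothetical:
    they stand for how far the optimistic core is ahead of the correctness
    core and the cutoff below which a miss is not worth waiting for.
    """
    if addr in cache:
        return cache[addr], False  # hit: use the real value
    if lead <= max_lead:
        # Miss while not far ahead: feed 0 instead of stalling.
        # The exact value often does not matter for lookahead.
        return 0, True
    # Far enough ahead: the core can afford to wait for the real value.
    value = memory[addr]
    cache[addr] = value
    return value, False
```

Because the optimistic core's results are only hints, a wrong substituted value affects performance rather than correctness, which is why no dependence-chain tagging or checkpointing appears in this sketch.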

  7. Purging stale data
     • Source of stale data
     • Performance optimizations
     • Binary optimizations
     • Potential solutions
     • Timer based eviction mechanism
     • Selective L0 invalidations from skeleton
     • Choice: do nothing
     • Simply rely on cache replacement
     [Figure: OC and CC with the L0, L1, and L2 caches]

  8. Complexity reduction
     • Optimistic core – trade off complexity to improve performance
     • E.g., Load Store Queue
     • Correctness core – throughput oriented design
     • Accurate branch prediction from OC
     • No check-pointing and selective pipeline flush required
     • Cache misses are significantly mitigated
     • Latency of various operations is less critical

  9. Load-Hit speculation
     [Figure: processor pipeline (Issue, Reg, Reg, Ex, Ex) showing a load (ld) and a load miss]

  10. Outline
     • Architectural and software support needed
     • Performance optimization opportunities
     • Complexity reduction opportunities
     • Evaluation
     • Conclusion

  11. Evaluation Environment
     • Simulation strives to model EDA very faithfully
     • Value driven execution for optimistic core
     • Data values in the caches
     • Faithful simulation of branches
     • Scheduling replays
     • Prefetch modeling fidelity
     • Stream prefetcher
     • Power modeling – both switching and leakage
     • SPEC CPU2000 and SPLASH(2) benchmark suite
     • System configuration – loosely based on Power4
         • ROB/Register (INT, FP) – 128/(32, 32)
         • L0 cache – 16KB, 4-way, 2 cycle
         • L1 cache – 32KB, 4-way, 2 cycle
         • L2 cache – 1MB, 8-way, 400 cycle
         • BOQ – 512 entry
         • Register copy latency – 32 cycles

  12. Performance gain of optimism
     [Figure: two charts, each with a speedup axis]

  13. Effect on explicitly parallel programs
     [Figure: speedup chart]
     Exploiting ILP is not guaranteed to be less effective than exploiting thread level parallelism

  14. Energy Implications
     Reasons
     • Skeleton not the entire program
     • Few wrong path instructions in CC
     • Smaller cache hierarchy in OC
     • Reduce energy waste due to idling

  15. Performance impact with reduction in in-flight capacity

  16. Impact of simplifications and conservativeness
     • Removing Load-hit speculation
     • Making out-of-order INT issue queue in-order
     • 10% clock freq. reduction

  17. Other details in the paper
     • Related work discussion
     • Quantitative comparison with past works
     • Details on skeleton construction
     • Eliminating useless branches
     • Delayed release of prefetches
     • Understanding sensitivity to performance domain errors
     • System diagnosis
     * More details left in the technical report version

  18. Conclusion
     • Performance-correctness explicitly-decoupled arch.
     • Independent focus on performance and correctness goals
     • Each goal can be achieved more efficiently with less complexity
     • Demonstrated a concrete design with efficient lookahead
     • Achieves good performance boosting
     • Does not consume excessive energy
     • Better tolerance to conservatism
     • Future work
     • Optimization beyond ILP lookahead
     • Custom design of optimistic and correctness core

  19. A Performance-Correctness Explicitly-Decoupled Architecture
     Alok Garg and Michael Huang
     Department of Electrical & Computer Engineering, University of Rochester
     Link to technical report: http://www.ece.rochester.edu/~garg/documents/micro08tr.pdf

  20. Related Work
     • Dynamic verification using DIVA checker [austin99]
     • Lookahead techniques
     • Two-pass execution [sundaramoorthy00], [purser00], [zhou05], [barnes03], [mesa-martinez07], [greskamp07]
     • Helper-threading [dubois98], [annavaram01], [luk01], [zilles01], [chappell99], [collins01], [roth01], [moshovos01], [farcy98]
     • Enhancing processor’s capability to buffer more in-flight instructions [balasubramonian00], [lebeck02], [torres05], [gandhi05], [akkary03], [sethumadhavan03]
     • Runahead execution [mutlu03], [dundas97], [ceze04], [kirman05]
     • Parallelization oriented techniques [zilles02], [balakrishnan06]

  21. Differences from DIVA
     [Figure: DIVA decoupling (traditional core feeding the DIVA checker & commit) contrasted with explicit decoupling (EDA). Labels in slide order: communication of decoded instruction, input and output values; have to produce correct output; frequent repairment; free to perform risky optimizations; low bandwidth hints; infrequent repairment]

  22. Comparison with DCE

  23. Sensitivity to performance domain circuit errors

  24. Load-Store queue simplification
     • Load queue removed
     • Store-load replay support not required
     • Priority logic replaced with simpler forwarding logic
     [Figure: load queue (… ld7) and store queue (… st5, st3, st1) ordered by age from youngest to oldest, with dispatch and a store-load replay between ld7 and st5]
