A Performance-Correctness Explicitly-Decoupled Architecture
Alok Garg and Michael Huang
Department of Electrical & Computer Engineering, University of Rochester

Motivation
• Performance optimization in a monolithic micro-architecture is difficult
• Conservativeness in design reduces the common-case efficacy
• Want to explicitly decouple correctness & performance
• Optimization 1 (e.g. branch prediction)
• Optimization 2 (e.g. out-of-order execution)
[Figure: pipeline stages IF, ID, EX, MEM, WB]
"A Performance-Correctness Explicitly-Decoupled Architecture (EDA)", Alok Garg, MICRO 2008

Explicitly-Decoupled Architecture (EDA)
• Design separated into performance and correctness domains
• Implementation decoupled as well
• Optimistic design of entire system stack
• Economic correctness guarantee
• Custom software-hardware interface
[Figure: software, architectural, and device layers spanning a performance domain (optimistic core) and a correctness domain (simple, throughput-oriented correctness core), connected by hints]

ILP lookahead using EDA
• Autonomy
• Managing deviance
[Figure: in the performance domain, the optimistic core (lookahead agent) runs a skeleton derived from the program (semantic) binary by static binary transformation; in the correctness domain, the correctness core (throughput engine) runs the program (semantic) binary. Components shown: Branch Outcome Queue, L0, L1, L2. Label: minimal mutual dependence]

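The Branch Outcome Queue named on this slide can be pictured as a bounded FIFO: the optimistic core pushes branch outcomes as it runs ahead, and the correctness core consumes them in order in place of its own predictions. The sketch below is only an illustration under that assumption. The class name, the method names, and the flush-on-deviation policy are hypothetical; the 512-entry default mirrors the BOQ size listed in the evaluation setup.

```python
from collections import deque


class BranchOutcomeQueue:
    """Bounded FIFO of branch outcomes (hypothetical model, not the authors' design)."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self._entries = deque()

    def is_full(self):
        return len(self._entries) >= self.capacity

    def push(self, taken):
        """Optimistic core records one outcome; returns False if the queue is full."""
        if self.is_full():
            return False
        self._entries.append(bool(taken))
        return True

    def pop(self):
        """Correctness core takes the oldest outcome as its prediction, or None if empty."""
        if not self._entries:
            return None
        return self._entries.popleft()

    def flush(self):
        """Drop all pending outcomes, e.g. after the lookahead deviates (assumed policy)."""
        self._entries.clear()
```

A full queue bounds how far the optimistic core can run ahead, and an empty queue means the correctness core has caught up and must fall back on some other prediction source.
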
Outline
• Architectural and software support needed
• Performance optimization opportunities
• Complexity reduction opportunities
• Evaluation
• Conclusion

Avoiding L2-miss stalls in lookahead
• Feed arbitrary value
• Exact value may not matter
• Conventional mechanism
• Planning against contingency
• Tagging entire dependency chain as invalid
• State check-point and recovery
• Type of value substitution
• Value predictor
• Explicitly flush the dependence chain of load
• Opportunity: simple “0” value substitution
• Only used when optimistic core is not too far ahead
• Zero is the most frequently occurring value
compare (x>f0)

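The “0” value substitution idea can be sketched as follows: a lookahead load that misses on chip returns 0 instead of stalling, but only while the optimistic core is not too far ahead. This is a minimal model, not the authors' mechanism. The function name, the dict-based cache and memory, the `lead` metric, and the threshold of 64 are all assumptions made for illustration.

```python
def lookahead_load(addr, cache, memory, lead, max_lead_for_substitution=64):
    """Return (value, stalled) for a load issued by the optimistic core.

    cache:  dict addr -> value for data present on chip (stand-in for L0/L1/L2)
    memory: dict addr -> value for the backing store
    lead:   how far the optimistic core is ahead (assumed metric, in instructions)
    """
    if addr in cache:
        return cache[addr], False      # hit: real value, no stall
    if lead < max_lead_for_substitution:
        return 0, False                # miss with little slack: substitute 0, keep going
    value = memory.get(addr, 0)        # miss with plenty of slack: wait for the real value
    cache[addr] = value
    return value, True
```

Because the correctness core re-executes the program, a wrong substituted value in this model can only cost lookahead accuracy, never correctness.
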
Purging stale data
• Source of stale data
• Performance optimizations
• Binary optimizations
• Potential solutions
• Timer based eviction mechanism
• Selective L0 invalidations from skeleton
• Choice: do nothing
• Simply rely on cache replacement
[Figure: OC and CC with L0, L1, and L2 caches]

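The "do nothing" choice relies on ordinary replacement to age stale lines out of the L0. The sketch below shows that effect with a tiny LRU cache. It is an illustration only: the class is hypothetical, and it is fully associative for brevity, whereas the evaluated L0 is 4-way set-associative.

```python
from collections import OrderedDict


class L0Cache:
    """Tiny fully associative LRU cache (hypothetical stand-in for the L0)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self._lines = OrderedDict()  # addr -> value, least recently used first

    def read(self, addr):
        """Return the cached value, or None on a miss."""
        if addr not in self._lines:
            return None
        self._lines.move_to_end(addr)  # mark as most recently used
        return self._lines[addr]

    def write(self, addr, value):
        """Install or update a line, evicting the least recently used one if full."""
        if addr in self._lines:
            self._lines.move_to_end(addr)
        elif len(self._lines) >= self.num_lines:
            self._lines.popitem(last=False)
        self._lines[addr] = value
```

No timer or invalidation message appears anywhere in this model; a stale line disappears only once enough newer lines have displaced it.
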
Complexity reduction
• Optimistic core – trade off complexity to improve performance
• E.g., Load Store Queue
• Correctness core – throughput-oriented design
• Accurate branch prediction from OC
• No check-pointing and selective pipeline flush required
• Cache misses are significantly mitigated
• Latency of various operations is less critical

Load-hit speculation
[Figure: processor pipeline stages (Issue, Reg, Ex) for a load (ld), with a load miss]

Outline
• Architectural and software support needed
• Performance optimization opportunities
• Complexity reduction opportunities
• Evaluation
• Conclusion

Evaluation Environment
• Simulation strives to model EDA very faithfully
• Value-driven execution for optimistic core
• Data values in the caches
• Faithful simulation of branches
• Scheduling replays
• Prefetch modeling fidelity
• Stream prefetcher
• Power modeling – both switching and leakage
• SPEC CPU2000 and SPLASH-2 benchmark suites
• System configuration – loosely based on Power4
• ROB/Register (INT, FP) – 128/(32, 32)
• L0 cache – 16KB, 4-way, 2 cycle
• L1 cache – 32KB, 4-way, 2 cycle
• L2 cache – 1MB, 8-way, 400 cycle
• BOQ – 512 entry
• Register copy latency – 32 cycles

Performance gain of optimism
[Figure: two panels, each with axis label "speedup"]

Effect on explicitly parallel programs
[Figure: chart with axis label "speedup"]
Exploiting ILP is not guaranteed to be less effective than exploiting thread-level parallelism

Energy Implications
Reasons:
• Skeleton not the entire program
• Few wrong path instructions in CC
• Smaller cache hierarchy in OC
• Reduce energy waste due to idling

Performance impact with reduction in in-flight capacity

Impact of simplifications and conservativeness
[Figure: panels titled "Removing load-hit speculation", "Making out-of-order INT issue queue in-order", and "10% clock freq. reduction"]

Other details in the paper
• Related work discussion
• Quantitative comparison with past works
• Details on skeleton construction
• Eliminating useless branches
• Delayed release of prefetches
• Understanding sensitivity to performance domain errors
• System diagnosis
* More details left in the technical report version

Conclusion
• Performance-correctness explicitly-decoupled architecture
• Independent focus on performance and correctness goals
• Each goal can be achieved more efficiently with less complexity
• Demonstrated a concrete design with efficient lookahead
• Achieves good performance boosting
• Does not consume excessive energy
• Better tolerance to conservatism
• Future work
• Optimization beyond ILP lookahead
• Custom design of optimistic and correctness core

A Performance-Correctness Explicitly-Decoupled Architecture
Alok Garg and Michael Huang
Department of Electrical & Computer Engineering, University of Rochester
Link to technical report: http://www.ece.rochester.edu/~garg/documents/micro08tr.pdf

Related Work
• Dynamic verification using DIVA checker [austin99]
• Lookahead techniques
• Two-pass execution [sundaramoorthy00], [purser00], [zhou05], [barnes03], [mesa-martinez07], [greskamp07]
• Helper-threading [dubois98], [annavaram01], [luk01], [zilles01], [chappell99], [collins01], [roth01], [moshovos01], [farcy98]
• Enhancing processor’s capability to buffer more in-flight instructions [balasubramonian00], [lebeck02], [torres05], [gandhi05], [akkary03], [sethumadhavan03]
• Runahead execution [mutlu03], [dundas97], [ceze04], [kirman05]
• Parallelization oriented techniques [zilles02], [balakrishnan06]

Differences from DIVA
[Figure: comparison of DIVA decoupling (traditional core with DIVA checker & commit) and explicit decoupling (EDA). Labels: communication of decoded instruction, input and output values; have to produce correct output; frequent repair; free to perform risky optimizations; low bandwidth hints; infrequent repair]

Comparison with DCE

Sensitivity to performance domain circuit errors

Load-Store queue simplification
• Load queue removed
• Store-load replay support not required
• Priority logic replaced with simpler forwarding logic
[Figure: load queue (… ld7) and store queue (… st5, st3, st1) ordered by age from youngest to oldest at dispatch, with a store-load replay between ld7 and st5]

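One possible reading of the simplified design is a store queue that only forwards values and never tracks loads, so there is nothing to replay. The sketch below models that reading. The class and method names are hypothetical, and the youngest-match forwarding rule is an assumption, since the slide does not specify the forwarding logic.

```python
from collections import deque


class SimpleStoreQueue:
    """Store queue with forwarding only: no load queue, no replay (illustrative model)."""

    def __init__(self):
        self._stores = deque()  # oldest on the left, youngest on the right

    def dispatch_store(self, addr, value):
        """Queue a store until it is retired to memory."""
        self._stores.append((addr, value))

    def load(self, addr, memory):
        """Forward from the youngest queued store to addr, else read memory."""
        for st_addr, st_value in reversed(self._stores):
            if st_addr == addr:
                return st_value
        return memory.get(addr, 0)

    def retire_oldest(self, memory):
        """Write the oldest store to memory and remove it from the queue."""
        if self._stores:
            addr, value = self._stores.popleft()
            memory[addr] = value
```

Loads in this model are never recorded, so a load that runs before an older store to the same address simply keeps its possibly stale value. That is acceptable only in a core whose results are hints, such as the optimistic core.
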