Motivation. High power has implications beyond energy and heat
Wear-out: exponentially related to temperature
Timing: temperature dependent
Infant mortality: high leakage impacts burn-in
Many reliability-enhancing methods increase power
Redundant execution
Many power-reduction methods decrease reliability
Aggressive design; e.g., voltage under-scaling
Adaptive methods cause thermal cycling
Must treat power and reliability together.
1. Reliability: The Next Frontier for Power-Efficient Design Sarita V. Adve
University of Illinois at Urbana-Champaign
with Vikram Adve, Pradip Bose, Man-Lap Li, Pradeep Ramachandran, Jude Rivers, Jayanth Srinivasan, Yuanyuan Zhou
UIUC/IBM Watson
3. Where to Address the Problem? Traditional approach
Device-level, circuit-level, manufacturing solutions
Application-oblivious, worst-case solutions
Architecture-level
Exploit application variability
Differentiate microarchitecture level structures
Software-level
Even more application-aware, customizable, flexible
Holy grail: Cross-layer solution for all constraints
4. Two Reliability Solutions and Implications Architecture-level solution for wear-out
Adapts to avoid faults
Focused on wear-out
Co-designed solution for reliability
Allows faults to propagate to software
Detects and recovers from errors
Handles all faults through common mechanisms
Both
allow better than worst-case design
have implications for power management
5. Lifetime Reliability or Wear-Out Failures from wear-out due to normal operation
Gate oxide breakdown
Negative bias temperature instability (NBTI)
Electromigration
Stress migration
Thermal cycling
Aggravated by
High temperature, small feature dimensions, high voltage and electric field, power management (thermal cycling)
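For intuition on the exponential temperature dependence, electromigration lifetime is commonly modeled with a Black's-equation / Arrhenius form (this is the standard textbook model, not a result from this talk):

\[ \mathrm{MTTF_{EM}} \;\propto\; J^{-n}\, e^{E_a / (k T)} \]

where \(J\) is current density, \(E_a\) an activation energy, \(k\) Boltzmann's constant, and \(T\) absolute temperature. For typical activation energies, a sustained rise of roughly 10-15 °C can roughly halve the expected lifetime, which is why temperature dominates the wear-out mechanisms listed above.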
6. Lifetime Reliability-Aware Architecture Dynamic Reliability Management (DRM)
Design for expected operation (temp, voltage, activity)
Unexpected case: adapt to maintain target reliability
If operating worse than expected: temporarily switch off a core, reschedule apps, reduce voltage/frequency, …
If operating better than expected: switch on cores, increase voltage/frequency, …
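A minimal sketch of such a DRM control step, in Python (the wear-out model, thresholds, and interface are toy assumptions, not from the talk): estimate how fast the reliability budget is being consumed and decide whether to throttle, boost, or hold the current configuration.

```python
# Illustrative DRM control step: compare an estimated wear-out rate against the
# qualified FIT budget. Model and thresholds are toy values.

FIT_TARGET = 100.0   # reliability budget the part is qualified for (failures per 10^9 hours)

def estimated_fit(temperature_c: float, voltage: float, activity: float) -> float:
    """Toy wear-out model: at the expected design point (85 C, 1.0 V, full activity)
    the budget is exactly consumed; the rate rises steeply with temperature and voltage."""
    return FIT_TARGET * (1.08 ** (temperature_c - 85.0)) * (voltage ** 2) * activity

def drm_decision(temperature_c: float, voltage: float, activity: float) -> str:
    fit_now = estimated_fit(temperature_c, voltage, activity)
    if fit_now > FIT_TARGET:
        # Hotter/busier than the expected case: give back performance,
        # e.g. reduce voltage/frequency, reschedule apps, or switch off a core.
        return "throttle"
    if fit_now < 0.9 * FIT_TARGET:
        # Cooler/idler than expected: spend the reliability slack,
        # e.g. switch on cores or increase voltage/frequency.
        return "boost"
    return "hold"

# A hot, fully active core exceeds the budget; a cool one leaves slack.
print(drm_decision(100.0, 1.0, 1.0))   # -> throttle
print(drm_decision(70.0, 1.0, 1.0))    # -> boost
```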
7. Dynamic Reliability Management (DRM) For a fixed target FIT rate, DRM not only provides cost-performance tradeoffs, but also enables us to meet reliability targets.
8. Implications for Power Management Dynamic thermal and energy management are similar to DRM, but not the same
Leverage common mechanisms
But optimize different objectives
Dynamic reliability vs. thermal management
Both control temperature
But DRM and DTM reach different configurations for the same thermal design point / reliability qualification temperature (see the sketch after this list)
Dynamic reliability vs. energy management
Both can be viewed as power related resource management
But different objectives
Open problem: how to optimize everything together?
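A toy illustration (numbers are assumptions, not from the talk) of why DTM and DRM can disagree even though both watch temperature: DTM checks the instantaneous thermal cap, while DRM checks whether the wear-out rate stays within what the reliability qualification assumed.

```python
# Toy comparison of DTM vs DRM decisions for the same operating point.
# Thresholds and the wear-out model are illustrative assumptions.

T_MAX_C = 85.0        # thermal design point enforced by DTM
T_QUAL_C = 75.0       # temperature assumed when qualifying lifetime reliability

def dtm_ok(temperature_c: float) -> bool:
    # DTM: keep the die below the thermal cap right now.
    return temperature_c <= T_MAX_C

def drm_ok(temperature_c: float) -> bool:
    # DRM: toy Arrhenius-like factor (~8% faster wear-out per degree C), so
    # sustained operation above T_QUAL_C consumes the lifetime budget too fast.
    return 1.08 ** (temperature_c - T_QUAL_C) <= 1.0

# An 80 C operating point satisfies DTM but not DRM, so the two managers would
# settle on different voltage/frequency configurations.
print(dtm_ok(80.0), drm_ok(80.0))   # -> True False
```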
9. Some Projects on Unified Adaptation Component performance + energy [ASPLOS’02,’04, ISCA’05]
Minimize storage energy with guaranteed performance
Minimize processor energy for multimedia for given performance
Performance + energy for multicore [Sigmetrics’07]
Account for the impact of adaptations on synchronization delays
Cross-component performance + energy [TACO’07]
Minimize total processor and memory energy for a given performance target
Cross-layer performance + energy + n/w reliability for multimedia
GRACE project [MMCN’03, Trans on Mobile Computing’06, …]
Coordinated adaptations in hardware, network, OS, applications
Hierarchical control algorithms
Adapt at multiple time scales and system granularities
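A minimal sketch of hierarchical, multi-time-scale adaptation in the spirit of the bullets above (all names and numbers are illustrative, not the GRACE interfaces): a global controller splits an energy budget across layers at a coarse granularity, and each layer tunes its own knobs at a fine granularity within its share.

```python
# Illustrative hierarchical adaptation: coarse-grained global allocation,
# fine-grained per-layer adaptation. Names and numbers are examples only.

def global_allocate(total_budget: float, predicted_demand: dict) -> dict:
    """Coarse time scale (e.g., once per application frame): split the energy
    budget proportionally to each layer's predicted demand."""
    total = sum(predicted_demand.values())
    return {layer: total_budget * d / total for layer, d in predicted_demand.items()}

def local_adapt(layer: str, share: float, measured_use: float) -> str:
    """Fine time scale (e.g., per task): CPU DVFS, network transmit power,
    application quality settings, and so on."""
    return f"{layer}: {'scale back' if measured_use > share else 'within budget'}"

shares = global_allocate(10.0, {"cpu": 5.0, "network": 3.0, "application": 2.0})
print(local_adapt("cpu", shares["cpu"], measured_use=6.0))          # over its share
print(local_adapt("network", shares["network"], measured_use=2.0))  # within budget
```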
10. A More General Reliability Solution - Motivation Failures will happen in the field
Infant mortality, design bugs, software failures
Aggressive design for power/performance/reliability
Common, low-cost method to detect/recover from all failure sources?
Handle inadvertent failures
Allow more aggressive design for power/performance/reliability
11. A Low-Cost, Unified Reliability Solution Traditional solutions (e.g. DMR) too expensive
Must incur low performance, power overhead
One-size-fits-all near-100% coverage often unnecessary
Need to handle only faults that propagate to software
Hardware errors appear as software bugs
Leverage software reliability solutions for hardware?
12. Unified Framework for H/W + S/W Reliability Unified hardware/software co-designed framework
Tackles hardware and software faults
Software-centric solutions with near-zero h/w overhead
Customizable to app needs, flexible for new error sources
13. Framework Components Detection: Software symptoms, online testing
Recovery: Software/hardware checkpoint and rollback
Diagnosis: Firmware layer for rollback/replay, online testing
Repair/reconfiguration: Redundant, reconfigurable hardware (see the flow sketch after this list)
Need to understand how hardware faults propagate to s/w
How do hardware faults become visible to software?
What is the latency?
Do h/w faults affect application and/or system state?
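A schematic of how these components could fit together at runtime, in Python (function and core names are hypothetical, not the actual firmware interface): detect a software symptom, roll back to a checkpoint, replay to distinguish transient faults, permanent faults, and software bugs, and repair by deconfiguring faulty hardware.

```python
# Schematic detect -> rollback -> replay/diagnose -> repair flow.
# Everything here is a simulation stub with hypothetical names.

from typing import Optional

def rollback(checkpoint: str) -> None:
    print(f"rolled back to {checkpoint}")

def replay(core: str, permanently_faulty_core: Optional[str]) -> bool:
    """Returns True if the symptom recurs when the work is replayed on 'core'
    (simulated: a permanent fault always reproduces on the faulty core)."""
    return permanently_faulty_core == core

def handle_symptom(checkpoint: str, permanently_faulty_core: Optional[str]) -> str:
    rollback(checkpoint)                                  # recovery
    if not replay("core0", permanently_faulty_core):      # diagnosis, step 1
        return "transient hardware fault: recovered by replay"
    if replay("core1", permanently_faulty_core):          # diagnosis, step 2
        return "symptom recurs on different hardware: likely software bug"
    return "permanent hardware fault: deconfigure core0"  # repair/reconfiguration

print(handle_symptom("chkpt-42", permanently_faulty_core="core0"))
```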
14. Methodology Microarchitecture-level fault injection
Trade-off between accuracy and simulation time
GEMS timing models for out-of-order processor, memory
Simics full-system simulation of Solaris + UltraSPARC III
SPEC workloads for ten million instructions
Fault model
Stuck-at, bridging faults in many micro-arch structures
Fault detection
Crashes detected through system-level fatal traps
Misaligned memory access, RED state, watchdog reset, etc.
Hangs detected using simple hardware hang detector
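A minimal sketch of the kind of simple hang detector this refers to (threshold and interface are example values, not from the talk): flag a hang if no instruction has retired for a large number of consecutive cycles.

```python
# Illustrative hang detector: raise an alarm after HANG_THRESHOLD consecutive
# cycles with no retired instructions. The threshold is an arbitrary example.

HANG_THRESHOLD = 100_000   # cycles without retirement before declaring a hang

class HangDetector:
    def __init__(self) -> None:
        self.cycles_since_retirement = 0

    def tick(self, retired_this_cycle: int) -> bool:
        """Call once per cycle with the number of instructions retired; returns
        True once the no-progress window exceeds the threshold."""
        if retired_this_cycle > 0:
            self.cycles_since_retirement = 0
        else:
            self.cycles_since_retirement += 1
        return self.cycles_since_retirement >= HANG_THRESHOLD

# A core that stops retiring instructions eventually trips the detector.
detector = HangDetector()
print(any(detector.tick(retired_this_cycle=0) for _ in range(HANG_THRESHOLD)))  # -> True
```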
15. How do Hardware Faults Propagate to Software? 98% of faults (excluding FPU faults) detectable with simple H/W & S/W
Need H/W support or S/W monitoring for the FPU
> 50% of crashes/hangs occur in the OS
16. Software Components Corrupted 80% of faults (excluding decoder and ROB faults) corrupt system state
Need to recover system state
For each faulty microarchitectural structure, the graph shows the percentage of crashes and hangs (left and right bars) that corrupt only application state (architectural state corruption), corrupt system state, or corrupt neither. The "neither" cases arise when the instruction writes to a wrong physical destination register that is nonetheless part of the architectural state (because the mapping in the instruction/RAT also changed); they still lead to crashes because a dependent instruction sits at the head of the ROB waiting for its operand register to become ready, which never happens.
17. Latency to Detection from Application Corruption Many cases < 100K instructions, amenable to hardware recovery
Buffering for 50µs on 2 GHz processor
May need to use software checkpoint/recovery for others
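A quick sanity check on the numbers above (assuming roughly one instruction retired per cycle): a 100K-instruction detection latency on a 2 GHz core corresponds to about

\[ \frac{10^{5}\ \text{instructions}}{2 \times 10^{9}\ \text{instructions/s}} = 5 \times 10^{-5}\ \text{s} = 50\ \mu\text{s} \]

of state that the hardware checkpoint/buffering mechanism must hold.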
18. Latency to Detection from OS Corruption Mostly < 100K OS instructions
Amenable to hardware recovery
19. Unified Reliability Framework: Summary Hardware faults highly software visible
Over 98% of faults in 6 structures result in crashes/hangs
Future: confirm w/ low-level simulations, better workloads, more fault models
Simple H/W and S/W sufficient to detect most faults
Future: More support for other faults (e.g., invariant detection)
Recovery through checkpointing
S/W and/or H/W checkpoints for application recovery
H/W checkpoints and buffering for OS recovery
Future: Application-customizability, validation
Other future work: Diagnosis and repair
20. Summary Power and reliability intricately connected
DRM: Adaptive method for cost/perf/wear-out tradeoffs
Mechanisms shared with dynamic energy/thermal mgmt
But need a controller that treats them all together for the full system
Unified, low-cost, system framework for h/w + s/w reliability
Treat hardware faults as software bugs
Can enable aggressive design for power/perf/reliability