
Reliability: The Next Frontier for Power-Efficient Design

Motivation. High power has implications beyond energy and heat:
- Wear-out: exponentially related to temperature
- Timing: temperature dependent
- Infant mortality: high leakage impacts burn-in
Many reliability-enhancing methods increase power (e.g., redundant execution), and many power-reduction methods decrease reliability (aggressive design such as voltage under-scaling; adaptive methods cause thermal cycling). Power and reliability must be treated together.


Presentation Transcript


    1. Reliability: The Next Frontier for Power-Efficient Design
    Sarita V. Adve, University of Illinois at Urbana-Champaign
    With Vikram Adve, Pradip Bose, Man-Lap Li, Pradeep Ramachandran, Jude Rivers, Jayanth Srinivasan, Yuanyuan Zhou (UIUC / IBM Watson)

    3. Where to Address the Problem?
    - Traditional approach: device-level, circuit-level, and manufacturing solutions; application-oblivious, worst-case solutions
    - Architecture level: exploit application variability; differentiate microarchitecture-level structures
    - Software level: even more application-aware, customizable, and flexible
    - Holy grail: a cross-layer solution for all constraints

    4. Two Reliability Solutions and Implications
    - Architecture-level solution for wear-out: adapts to avoid faults; focused on wear-out
    - Co-designed solution for reliability: allows faults to propagate to software; detects and recovers from errors; handles all faults through common mechanisms
    - Both allow better-than-worst-case design, and both have implications for power management

    5. Lifetime Reliability or Wear-Out
    - Failures from wear-out due to normal operation: gate oxide breakdown, negative bias temperature instability (NBTI), electromigration, stress migration, thermal cycling
    - Aggravated by high temperature, shrinking dimensions, voltage, electric field, and power management
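A representative wear-out model makes the temperature dependence concrete. The formula below is Black's equation for electromigration, a standard model in this area rather than one quoted on the slide:

```latex
% Black's equation for electromigration:
% MTTF shrinks exponentially as temperature T rises.
\mathrm{MTTF}_{\mathrm{EM}} \propto J^{-n} \, \exp\!\left(\frac{E_a}{k T}\right)
```

Here J is the current density, n a material-dependent exponent, E_a the activation energy, k Boltzmann's constant, and T the absolute temperature; a modest rise in T cuts MTTF sharply, which is why high power accelerates wear-out.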

    6. Lifetime Reliability-Aware Architecture
    - Dynamic Reliability Management (DRM): design for expected operation (temperature, voltage, activity)
    - Unexpected case: adapt to maintain target reliability
      - Throttle: temporarily switch off a core, reschedule apps, reduce voltage/frequency, ...
      - Recover performance: switch cores back on, increase voltage/frequency, ...
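A minimal sketch of this adaptation loop, assuming hypothetical sensor/actuator interfaces (read_temperature_k, step_down, step_up) and an illustrative Arrhenius-style FIT model; none of these names come from the talk:

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def projected_fit(temp_k, ea_ev=0.9, fit_ref=100.0, temp_ref_k=345.0):
    """Illustrative Arrhenius-style scaling of the failure rate with temperature."""
    accel = math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / temp_ref_k - 1.0 / temp_k))
    return fit_ref * accel

def drm_step(sensors, actuators, target_fit=100.0):
    """One DRM step: throttle when projected reliability misses the target FIT."""
    fit = projected_fit(sensors.read_temperature_k())
    if fit > target_fit:
        actuators.step_down()  # reduce voltage/frequency, reschedule apps, or park a core
    else:
        actuators.step_up()    # reclaim performance: raise voltage/frequency, re-enable cores
```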

    7. Dynamic Reliability Management (DRM)
    - For a fixed target FIT rate, DRM not only provides cost-performance tradeoffs but also enables us to meet reliability targets

    8. Implications for Power Management
    - Dynamic thermal and energy management are similar to DRM, but not the same: they can leverage common mechanisms, but they optimize different objectives
    - Dynamic reliability vs. thermal management: both control temperature, but DRM and DTM reach different configurations for the same thermal design point / reliability qualification temperature
    - Dynamic reliability vs. energy management: both can be viewed as power-related resource management, but with different objectives
    - Open question: how to optimize everything together?

    9. Some Projects on Unified Adaptation
    - Component performance + energy [ASPLOS'02, '04; ISCA'05]: minimize storage energy with guaranteed performance; minimize processor energy for multimedia at a given performance
    - Performance + energy for multicore [Sigmetrics'07]: account for the impact of adaptations on synchronization delays
    - Cross-component performance + energy [TACO'07]: minimize total processor and memory energy for a given performance
    - Cross-layer performance + energy + network reliability for multimedia: the GRACE project [MMCN'03, Trans. on Mobile Computing'06, ...]: coordinated adaptations in hardware, network, OS, and applications; hierarchical control algorithms; adaptation at multiple time scales and system granularities

    10. A More General Reliability Solution: Motivation
    - Failures will happen in the field: infant mortality, design bugs, software failures, aggressive design for power/performance/reliability
    - Goal: a common, low-cost method to detect and recover from all failure sources
      - Handle inadvertent failures
      - Allow more aggressive design for power/performance/reliability

    11. A Low-Cost, Unified Reliability Solution
    - Traditional solutions (e.g., DMR) are too expensive; must incur low performance and power overhead
    - One-size-fits-all, near-100% coverage is often unnecessary: need to handle only the faults that propagate to software
    - Hardware errors appear as software bugs: can we leverage software reliability solutions for hardware?

    12. Unified Framework for H/W + S/W Reliability
    - A unified hardware/software co-designed framework that tackles both hardware and software faults
    - Software-centric solutions with near-zero hardware overhead
    - Customizable to application needs; flexible for new error sources

    13. Framework Components
    - Detection: software symptoms, online testing
    - Recovery: software/hardware checkpoint and rollback
    - Diagnosis: firmware layer for rollback/replay, online testing
    - Repair/reconfiguration: redundant, reconfigurable hardware
    Need to understand how hardware faults propagate to software:
    - How do hardware faults become visible to software?
    - What is the latency?
    - Do hardware faults affect application and/or system state?
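One way the four components could fit together, sketched with placeholder method names (detect_symptom, rollback_to_checkpoint, run_diagnosis, reconfigure) that stand in for the framework's stages rather than any actual API:

```python
def framework_loop(system):
    """Sketch of the detect -> recover -> diagnose -> repair flow (names are illustrative)."""
    while True:
        symptom = system.detect_symptom()    # fatal trap, hang, or other software symptom
        if symptom is None:
            system.maybe_take_checkpoint()   # periodic S/W or H/W checkpoint
            continue
        system.rollback_to_checkpoint()      # recovery: discard corrupted app/system state
        fault = system.run_diagnosis()       # firmware rollback/replay plus online testing
        if fault is not None and fault.is_hardware:
            system.reconfigure(fault.unit)   # repair: map out the faulty structure
```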

    14. Methodology
    - Microarchitecture-level fault injection: a trade-off between accuracy and simulation time
      - GEMS timing models for an out-of-order processor and memory
      - Simics full-system simulation of Solaris + UltraSPARC III
      - SPEC workloads run for ten million instructions
    - Fault model: stuck-at and bridging faults in many microarchitectural structures
    - Fault detection:
      - Crashes detected through system-level fatal traps (misaligned memory access, RED state, watchdog reset, etc.)
      - Hangs detected using a simple hardware hang detector
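A skeleton of such an injection campaign; the simulator API below (restore_checkpoint, inject_stuck_at, and the outcome classification) is invented for illustration and is not the actual GEMS/Simics tooling:

```python
import random

STRUCTURES = ["decoder", "int_alu", "reg_dbus", "rob", "rat", "agen"]  # illustrative names
RUN_LENGTH = 10_000_000  # instructions per run, as on the slide

def run_campaign(sim, trials=100):
    """Inject one stuck-at fault per trial and tally how each run ends."""
    outcomes = {"crash": 0, "hang": 0, "silent": 0}
    for _ in range(trials):
        sim.restore_checkpoint()                      # start each trial from a clean state
        structure = random.choice(STRUCTURES)
        bit = random.randrange(sim.width_of(structure))
        sim.inject_stuck_at(structure, bit, value=random.randint(0, 1))
        result = sim.run(RUN_LENGTH)                  # crashes via fatal traps, hangs via detector
        outcomes[result.classify()] += 1              # "crash", "hang", or "silent"
    return outcomes
```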

    15. How Do Hardware Faults Propagate to Software?
    - 98% of faults (excluding the FPU) are detectable with simple hardware and software mechanisms
    - FPU faults need hardware support or software monitoring
    - Over 50% of crashes/hangs occur in the OS
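One such simple mechanism is the hardware hang detector from the methodology slide; a minimal sketch, where the commit-starvation threshold is an assumed parameter, not a value reported in the study:

```python
class HangDetector:
    """Raises a hang symptom if no instruction commits for `threshold` consecutive cycles."""

    def __init__(self, threshold=100_000):  # assumed threshold, for illustration only
        self.threshold = threshold
        self.idle_cycles = 0

    def tick(self, committed_this_cycle: bool) -> bool:
        self.idle_cycles = 0 if committed_this_cycle else self.idle_cycles + 1
        return self.idle_cycles >= self.threshold  # True => report a hang
```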

    16. Software Components Corrupted
    - 80% of faults (excluding the decoder and ROB) corrupt system state, so system state must be recovered
    - For each faulty microarchitectural structure, the graph shows the percentage of crashes and hangs (left and right bars) that corrupt only application state (architectural state corruption), system state, or neither
    - The "neither" cases arise when an instruction writes to a wrong physical destination register that is nonetheless part of the architectural state (because the mapping in the instruction/RAT also changed); these still lead to crashes, since a dependent instruction sits at the head of the ROB waiting for an operand register that never becomes ready

    17. Latency to Detection from Application Corruption
    - Many cases are under 100K instructions, amenable to hardware recovery (buffering for 50 µs on a 2 GHz processor)
    - Software checkpoint/recovery may be needed for the others
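The buffering figure follows from simple arithmetic, assuming roughly one instruction per cycle:

```latex
t_{\text{buffer}} \approx \frac{10^{5}\ \text{instructions}}{2 \times 10^{9}\ \text{instructions/s}} = 50\ \mu\text{s}
```

A lower sustained IPC would stretch this window proportionally.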

    18. Latency to Detection from OS Corruption
    - Mostly under 100K OS instructions, amenable to hardware recovery

    19. Unified Reliability Framework: Summary
    - Hardware faults are highly software-visible: over 98% of faults in 6 structures result in crashes/hangs (future: confirm with low-level simulations, better workloads, more fault models)
    - Simple hardware and software are sufficient to detect most faults (future: more support for other faults, e.g., invariant detection)
    - Recovery through checkpointing: software and/or hardware checkpoints for application recovery; hardware checkpoints and buffering for OS recovery (future: application customizability, validation)
    - Other future work: diagnosis and repair

    20. Summary
    - Power and reliability are intricately connected
    - DRM: an adaptive method for cost/performance/wear-out tradeoffs; mechanisms are shared with dynamic energy/thermal management, but control is needed to treat all of them together for the full system
    - A unified, low-cost, system-level framework for hardware + software reliability: treat hardware faults as software bugs; can enable aggressive design for power/performance/reliability
