Motivation. High power has implications beyond energy and heat
Wear-out: exponentially related to temperature
Timing: temperature dependent
Infant mortality: high leakage impacts burn-in
Many reliability-enhancing methods increase power
Redundant execution
Many power-reduction methods decrease reliability
Aggressive design; e.g., voltage under-scaling
Adaptive methods cause thermal cycling
Must treat power and reliability together.
1. Reliability: The Next Frontier for Power-Efficient Design Sarita V. Adve
University of Illinois at Urbana-Champaign
with Vikram Adve, Pradip Bose, Man-Lap Li, Pradeep Ramachandran, Jude Rivers, Jayanth Srinivasan, Yuanyuan Zhou
UIUC/IBM Watson
3. Where to Address the Problem? Traditional approach
Device-level, circuit-level, manufacturing solutions
Application-oblivious, worst-case solutions
Architecture-level
Exploit application variability
Differentiate microarchitecture level structures
Software-level
Even more application-aware, customizable, flexible
Holy grail: Cross-layer solution for all constraints
4. Two Reliability Solutions and Implications Architecture-level solution for wear-out
Adapts to avoid faults
Focused on wear-out
Co-designed solution for reliability
Allows faults to propagate to software
Detects and recovers from errors
Handles all faults through common mechanisms
Both
allow better than worst-case design
have implications for power management
5. Lifetime Reliability or Wear-Out Failures from wear-out due to normal operation
Gate oxide breakdown
Negative bias temperature instability (NBTI)
Electromigration
Stress migration
Thermal cycling
Aggravated by
High temperature, small feature dimensions, high voltage and electric field, power management (thermal cycling)
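For intuition on the exponential temperature dependence, electromigration lifetime is commonly modeled with a Black's-equation / Arrhenius form (this is the standard textbook model, not a result from this talk):

\[ \mathrm{MTTF_{EM}} \;\propto\; J^{-n}\, e^{E_a / (k T)} \]

where \(J\) is current density, \(E_a\) an activation energy, \(k\) Boltzmann's constant, and \(T\) absolute temperature. For typical activation energies, a sustained rise of roughly 10-15 °C can roughly halve the expected lifetime, which is why temperature dominates the wear-out mechanisms listed above.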
6. Lifetime Reliability-Aware Architecture Dynamic Reliability Management (DRM)
Design for expected operation (temp, voltage, activity)
Unexpected case: adapt to maintain target reliability
If operating worse than expected: temporarily switch off a core, reschedule apps, reduce voltage/frequency, …
If operating better than expected: switch on cores, increase voltage/frequency, …
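A minimal sketch of such a DRM control step, in Python (the wear-out model, thresholds, and interface are toy assumptions, not from the talk): estimate how fast the reliability budget is being consumed and decide whether to throttle, boost, or hold the current configuration.

```python
# Illustrative DRM control step: compare an estimated wear-out rate against the
# qualified FIT budget. Model and thresholds are toy values.

FIT_TARGET = 100.0   # reliability budget the part is qualified for (failures per 10^9 hours)

def estimated_fit(temperature_c: float, voltage: float, activity: float) -> float:
    """Toy wear-out model: at the expected design point (85 C, 1.0 V, full activity)
    the budget is exactly consumed; the rate rises steeply with temperature and voltage."""
    return FIT_TARGET * (1.08 ** (temperature_c - 85.0)) * (voltage ** 2) * activity

def drm_decision(temperature_c: float, voltage: float, activity: float) -> str:
    fit_now = estimated_fit(temperature_c, voltage, activity)
    if fit_now > FIT_TARGET:
        # Hotter/busier than the expected case: give back performance,
        # e.g. reduce voltage/frequency, reschedule apps, or switch off a core.
        return "throttle"
    if fit_now < 0.9 * FIT_TARGET:
        # Cooler/idler than expected: spend the reliability slack,
        # e.g. switch on cores or increase voltage/frequency.
        return "boost"
    return "hold"

# A hot, fully active core exceeds the budget; a cool one leaves slack.
print(drm_decision(100.0, 1.0, 1.0))   # -> throttle
print(drm_decision(70.0, 1.0, 1.0))    # -> boost
```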
7. Dynamic Reliability Management (DRM) For a fixed target FIT rate, DRM not only provides cost-performance tradeoffs, but also enables us to meet reliability targets.
8. Implications for Power Management Dynamic thermal and energy management are similar to DRM, but not the same
Leverage common mechanisms
But optimize different objectives
Dynamic reliability vs. thermal management
Both control temperature
But DRM and DTM reach different configurations for the same thermal design point / reliability qualification temperature (see the sketch after this list)
Dynamic reliability vs. energy management
Both can be viewed as power related resource management
But different objectives
Open problem: how to optimize everything together?
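A toy illustration (numbers are assumptions, not from the talk) of why DTM and DRM can disagree even though both watch temperature: DTM checks the instantaneous thermal cap, while DRM checks whether the wear-out rate stays within what the reliability qualification assumed.

```python
# Toy comparison of DTM vs DRM decisions for the same operating point.
# Thresholds and the wear-out model are illustrative assumptions.

T_MAX_C = 85.0        # thermal design point enforced by DTM
T_QUAL_C = 75.0       # temperature assumed when qualifying lifetime reliability

def dtm_ok(temperature_c: float) -> bool:
    # DTM: keep the die below the thermal cap right now.
    return temperature_c <= T_MAX_C

def drm_ok(temperature_c: float) -> bool:
    # DRM: toy Arrhenius-like factor (~8% faster wear-out per degree C), so
    # sustained operation above T_QUAL_C consumes the lifetime budget too fast.
    return 1.08 ** (temperature_c - T_QUAL_C) <= 1.0

# An 80 C operating point satisfies DTM but not DRM, so the two managers would
# settle on different voltage/frequency configurations.
print(dtm_ok(80.0), drm_ok(80.0))   # -> True False
```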
9. Some Projects on Unified Adaptation Component performance + energy [ASPLOS’02,’04, ISCA’05]
Minimize storage energy with guaranteed performance
Minimize processor energy for multimedia for given performance
Performance + energy for multicore [Sigmetrics’07]
Account for the impact of adaptations on synchronization delays
Cross-component performance + energy [TACO’07]
Minimize total processor and memory energy for a given performance target
Cross-layer performance + energy + n/w reliability for multimedia
GRACE project [MMCN’03, Trans on Mobile Computing’06, …]
Coordinated adaptations in hardware, network, OS, applications
Hierarchical control algorithms
Adapt at multiple time scales and system granularities
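A minimal sketch of hierarchical, multi-time-scale adaptation in the spirit of the bullets above (all names and numbers are illustrative, not the GRACE interfaces): a global controller splits an energy budget across layers at a coarse granularity, and each layer tunes its own knobs at a fine granularity within its share.

```python
# Illustrative hierarchical adaptation: coarse-grained global allocation,
# fine-grained per-layer adaptation. Names and numbers are examples only.

def global_allocate(total_budget: float, predicted_demand: dict) -> dict:
    """Coarse time scale (e.g., once per application frame): split the energy
    budget proportionally to each layer's predicted demand."""
    total = sum(predicted_demand.values())
    return {layer: total_budget * d / total for layer, d in predicted_demand.items()}

def local_adapt(layer: str, share: float, measured_use: float) -> str:
    """Fine time scale (e.g., per task): CPU DVFS, network transmit power,
    application quality settings, and so on."""
    return f"{layer}: {'scale back' if measured_use > share else 'within budget'}"

shares = global_allocate(10.0, {"cpu": 5.0, "network": 3.0, "application": 2.0})
print(local_adapt("cpu", shares["cpu"], measured_use=6.0))          # over its share
print(local_adapt("network", shares["network"], measured_use=2.0))  # within budget
```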
10. A More General Reliability Solution - Motivation Failures will happen in the field
Infant mortality, design bugs, software failures
Aggressive design for power/performance/reliability
Common, low-cost method to detect/recover from all failure sources?
Handle inadvertent failures
Allow more aggressive design for power/performance/reliability
11. A Low-Cost, Unified Reliability Solution Traditional solutions (e.g. DMR) too expensive
Must incur low performance, power overhead
One-size-fits-all near-100% coverage often unnecessary
Need to handle only faults that propagate to software
Hardware errors appear as software bugs
Leverage software reliability solutions for hardware?
12. Unified Framework for H/W + S/W Reliability Unified hardware/software co-designed framework
Tackles hardware and software faults
Software-centric solutions with near-zero h/w overhead
Customizable to app needs, flexible for new error sources
13. Framework Components Detection: Software symptoms, online testing
Recovery: Software/hardware checkpoint and rollback
Diagnosis: Firmware layer for rollback/replay, online testing
Repair/reconfiguration: Redundant, reconfigurable hardware (see the flow sketch after this list)
Need to understand how hardware faults propagate to s/w
How do hardware faults become visible to software?
What is the latency?
Do h/w faults affect application and/or system state?
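A schematic of how these components could fit together at runtime, in Python (function and core names are hypothetical, not the actual firmware interface): detect a software symptom, roll back to a checkpoint, replay to distinguish transient faults, permanent faults, and software bugs, and repair by deconfiguring faulty hardware.

```python
# Schematic detect -> rollback -> replay/diagnose -> repair flow.
# Everything here is a simulation stub with hypothetical names.

from typing import Optional

def rollback(checkpoint: str) -> None:
    print(f"rolled back to {checkpoint}")

def replay(core: str, permanently_faulty_core: Optional[str]) -> bool:
    """Returns True if the symptom recurs when the work is replayed on 'core'
    (simulated: a permanent fault always reproduces on the faulty core)."""
    return permanently_faulty_core == core

def handle_symptom(checkpoint: str, permanently_faulty_core: Optional[str]) -> str:
    rollback(checkpoint)                                  # recovery
    if not replay("core0", permanently_faulty_core):      # diagnosis, step 1
        return "transient hardware fault: recovered by replay"
    if replay("core1", permanently_faulty_core):          # diagnosis, step 2
        return "symptom recurs on different hardware: likely software bug"
    return "permanent hardware fault: deconfigure core0"  # repair/reconfiguration

print(handle_symptom("chkpt-42", permanently_faulty_core="core0"))
```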
14. Methodology Microarchitecture-level fault injection
Trade-off between accuracy and simulation time
GEMS timing models for out-of-order processor, memory
Simics full-system simulation of Solaris + UltraSPARC III
SPEC workloads for ten million instructions
Fault model
Stuck-at, bridging faults in many micro-arch structures
Fault detection
Crashes detected through system-level fatal traps
Misaligned memory access, RED state, watchdog reset, etc.
Hangs detected using simple hardware hang detector
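A minimal sketch of the kind of simple hang detector this refers to (threshold and interface are example values, not from the talk): flag a hang if no instruction has retired for a large number of consecutive cycles.

```python
# Illustrative hang detector: raise an alarm after HANG_THRESHOLD consecutive
# cycles with no retired instructions. The threshold is an arbitrary example.

HANG_THRESHOLD = 100_000   # cycles without retirement before declaring a hang

class HangDetector:
    def __init__(self) -> None:
        self.cycles_since_retirement = 0

    def tick(self, retired_this_cycle: int) -> bool:
        """Call once per cycle with the number of instructions retired; returns
        True once the no-progress window exceeds the threshold."""
        if retired_this_cycle > 0:
            self.cycles_since_retirement = 0
        else:
            self.cycles_since_retirement += 1
        return self.cycles_since_retirement >= HANG_THRESHOLD

# A core that stops retiring instructions eventually trips the detector.
detector = HangDetector()
print(any(detector.tick(retired_this_cycle=0) for _ in range(HANG_THRESHOLD)))  # -> True
```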
15. How do Hardware Faults Propagate to Software? 98% of faults (excluding FPU faults) detectable with simple H/W & S/W
Need H/W support or S/W monitoring for the FPU
> 50% of crashes/hangs occur in the OS
16. Software Components Corrupted 80% of faults (excluding decoder and ROB faults) corrupt system state
Need to recover system state
For each faulty microarchitectural structure, the graph shows the percentage of crashes and hangs (left and right bars) that corrupt only application state (architectural state corruption), corrupt system state, or corrupt neither. The "neither" cases arise when the instruction writes to a wrong physical destination register that is nonetheless part of the architectural state (because the mapping in the instruction/RAT also changed); they still lead to crashes because a dependent instruction sits at the head of the ROB waiting for its operand register to become ready, which never happens.
17. Latency to Detection from Application Corruption Many cases < 100K instructions, amenable to hardware recovery
Buffering for 50µs on 2 GHz processor
May need to use software checkpoint/recovery for others
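A quick sanity check on the numbers above (assuming roughly one instruction retired per cycle): a 100K-instruction detection latency on a 2 GHz core corresponds to about

\[ \frac{10^{5}\ \text{instructions}}{2 \times 10^{9}\ \text{instructions/s}} = 5 \times 10^{-5}\ \text{s} = 50\ \mu\text{s} \]

of state that the hardware checkpoint/buffering mechanism must hold.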
18. Latency to Detection from OS Corruption Mostly < 100K OS instructions
Amenable to hardware recovery
19. Unified Reliability Framework: Summary Hardware faults highly software visible
Over 98% of faults in 6 structures result in crashes/hangs
Future: confirm w/ low-level simulations, better workloads, more fault models
Simple H/W and S/W sufficient to detect most faults
Future: More support for other faults (e.g., invariant detection)
Recovery through checkpointing
S/W and/or H/W checkpoints for application recovery
H/W checkpoints and buffering for OS recovery
Future: Application-customizability, validation
Other future work: Diagnosis and repair
20. Summary Power and reliability intricately connected
DRM: Adaptive method for cost/perf/wear-out tradeoffs
Mechanisms shared with dynamic energy/thermal mgmt
But need a controller that treats them all together for the full system
Unified, low-cost, system framework for h/w + s/w reliability
Treat hardware faults as software bugs
Can enable aggressive design for power/perf/reliability