190 likes | 328 Views
Towards a Hardware-Software Co-Designed Resilient System. Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of Illinois at Urbana-Champaign In collaboration with Pradip Bose (IBM) and Subhasish Mitra (Stanford). Motivation.
E N D
Towards a Hardware-Software Co-Designed Resilient System Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of Illinois at Urbana-Champaign In collaboration with Pradip Bose (IBM) and Subhasish Mitra (Stanford)
Motivation • Failures will happen in the field • Design defects • Aging • Soft errors • Inadequate burn-in • Aggressive design for power/performance/reliability • … • Low-cost method to detect/recover from all sources of failure? • Reliability problem pervasive across many markets • Traditional solutions (e.g. nMR) too expensive • Must incur low performance, power overhead
A Low-Cost, Unified Reliability Solution • Need handle only faults that propagate to software • Hardware faults appear as software bugs • Leverage software reliability solutions for hardware? • One-size-fits-all near-100%coverage often unnecessary • Solution must be customizable to application needs
Outline • Motivation of Framework • Unified Framework for H/W + S/W Reliability • Understanding the Impact of H/W Failures on S/W • Future Work
CHECKPOINT CHECKPOINT CHECKPOINT Fault Fault Fault Error Error Error Symptom detected Testing Recovery Diagnosis No error Error detected Repair, recovery Repair CHECKPOINT CHECKPOINT CHECKPOINT Ideal: symptom-based detection Error undetected Detection with more overhead Unified Framework for H/W + S/W Reliability • Unified hardware/software co-designed framework • Tackles hardware and software faults • Software-centric solutions with near-zero H/W overhead • Customizable to app needs, flexible for new error sources
Framework Components • Detection: Software symptoms, online testing • Recovery: Software/hardware checkpoint and rollback • Diagnosis: Firmware layer for rollback/replay, online testing • Repair/reconfiguration: Redundant, reconfigurable hardware • Need to understand how hardware faults propagate to S/W • How do hardware faults become visible to software? • What is the latency? • Do H/W faults affect application and/or system state?
Methodology • Microarchitecture-level fault injection • Trade-off between accuracy and simulation time • GEMS timing models for out-of-order processor, memory • Simics full-system simulation of Solaris + UltraSPARC III • SPEC workloads for ten million instructions • Fault model • Stuck-at, bridging faults in many micro-arch structures • Fault detection • Crashes detected through hardware generated fatal traps • Misaligned memory access, RED state, watchdog reset, etc. • Hangs detected using simple hardware hang detector
How do Hardware Faults Propagate to Software? • 97% faults (w/o FPU) detectable with simple H/W & S/W • Need H/W support or S/W monitoring for FPU
How do Hardware Faults Propagate to Software? • 97% faults (w/o FPU) detectable with simple H/W & S/W • Need H/W support or S/W monitoring for FPU • > 50% crashes/hangs in OS
S/W Components Corrupted • 62% of faults corrupt system state • Need to recover system state
Latency to Detection from Application Corruption Total instructions executed between app state corruption and detection • 80% have latency < 100K instr, amenable to H/W recovery • Buffering for 50µs on 2 GHz processor • May need to use software checkpoint/recovery for others
Latency to Detection from OS Corruption OS-only instructions executed between OS state corruption and detection • 92% of injections result in latency of < 100K OS instructions • Amenable to hardware recovery
Summary so far • Hardware faults highly visible • Over 97% of faults in 6 structures result in crashes/hangs • Simple H/W and S/W sufficient • Recovery through checkpointing • S/W and/or H/W checkpoints for application recovery • H/W checkpoints and buffering for OS recovery
Next Steps (1 of 3) • Improving understanding of fault propagation • Accurate fault models, effect of transients, intermittents • Lower-level simulations • Better workloads • Detection • More software level monitoring • Software signals, invariants, perturbations, … • H/W support to aid detection in some structures (e.g., FPU) • Selective backup testing • Recovery • Enhanced detection may reduce latency • Explore software vs. hardware, application customizability
Next Steps (2 of 3) • Diagnosis • Assume rollback/restart mechanism, multicore system Bug detected Rollback to previous checkpoint, restart on original core Original symptom doesn’t recur Original symptom recurs • Transient h/w bug, or • non-deterministic s/w bug • Continue execution • … Deterministic s/w bug, or Permanent h/w bug Rollback, restart on different core No symptom Symptom Permanent defect in original core Deterministic s/w bug
Next Steps (3 of 3) • Repair/reconfigure • What should be the right field configurable unit? • Core, FU, array entries? • Avoidance • Dynamic reliability management • Implementation architecture • Hardware + firmware + OS • Itanium machine check architecture has hooks
Thank You Questions?
Types of fatal traps • Faults cause different fatal traps thrown before crashes • Junk data access leads to memory misalignment • Repeatedly trapping leads to RED state