SWAT: Designing Resilient Hardware by Treating Software Anomalies
Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu
Motivation
• Hardware will fail in the field for several reasons: transient errors (high-energy particles), wear-out (devices grow weaker), design bugs, and so on
• Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
• Traditional redundancy solutions (e.g., nMR) too expensive
⇒ Need low-cost solutions for multiple failure sources
• Must incur low area, performance, power overhead
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
  • Zero to low overhead "always-on" monitors
⇒ Diagnose cause after symptom detected
  • May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
SWAT Framework Components
(Timeline: checkpoint, checkpoint, fault, error, symptom detected, recovery, diagnosis, repair)
• Detection: symptoms of software misbehavior
• Recovery: checkpoint and rollback
• Diagnosis: rollback/replay on multicore
• Repair/reconfiguration: redundant, reconfigurable hardware
• Flexible control through firmware
Advantages of SWAT • Handles all faults that matter • Oblivious to low-level failure modes and masked faults • Low, amortized overheads • Optimize for common case, exploit SW reliability solutions • Customizable and flexible • Firmware control adapts to specific reliability needs • Holistic systems view enables novel solutions • Synergistic detection, diagnosis, recovery solutions • Beyond hardware reliability • Long term goal: unified system (HW+SW) reliability • Potential application to post-silicon test and debug
SWAT Contributions
• Very low-cost detectors [ASPLOS'08, DSN'08]: low SDC rate, latency
• Application-aware SWAT: even lower SDC, latency
• Accurate fault modeling [HPCA'09]
• In-situ diagnosis [DSN'08]
• Multithreaded workloads [MICRO'09]
This Talk
• Very low-cost detectors [ASPLOS'08, DSN'08]: low SDC rate, latency
• Application-aware SWAT: even lower SDC, latency
• Accurate fault modeling [HPCA'09]
• In-situ diagnosis [DSN'08]
• Multithreaded workloads [MICRO'09]
Outline
• Introduction to SWAT
• SWAT Detection
• SWAT Diagnosis
• Analysis of Recovery in SWAT
• Conclusions and Future Work
Fault Detection w/ HW Detectors [ASPLOS '08]
(Detectors: fatal traps – division by zero, RED state, etc.; app abort – application abort due to fault; hangs – simple HW hang detector; high OS – high contiguous OS activity; kernel panic – OS enters panic state due to fault; coordinated by SWAT firmware)
• Simple HW-only detectors to observe anomalous SW behavior
• Minimal hardware area ⇒ low-cost detectors
• Incur near-zero perf overhead in fault-free operation
• Require no changes to SW
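As a software model of one such detector, the sketch below flags a hang when the set of distinct PCs retired over a sliding window stays very small (a tight spin loop). The slide only says "simple HW hang detector"; this heuristic, the window size, and the threshold are illustrative assumptions, not SWAT's actual mechanism.

```python
# Hypothetical model of a hang detector: suspect a hang when the distinct
# PCs retired over a window collapse to a handful (tight spin loop).
# Window and threshold values are illustrative assumptions.
from collections import deque

class HangDetector:
    def __init__(self, window=1000, max_distinct_pcs=4):
        self.window = window
        self.max_distinct = max_distinct_pcs
        self.recent = deque(maxlen=window)   # last `window` retired PCs

    def retire(self, pc):
        """Observe one retired instruction; return True if a hang is suspected."""
        self.recent.append(pc)
        if len(self.recent) < self.window:
            return False                     # not enough history yet
        return len(set(self.recent)) <= self.max_distinct

detector = HangDetector()
# Normal execution: many distinct PCs, no hang reported.
assert not any(detector.retire(pc) for pc in range(2000))
# Faulty execution: spinning between two PCs eventually trips the detector.
assert any(detector.retire(pc) for _ in range(1000) for pc in (0x40, 0x44))
```

A real implementation would be a small hardware counter structure; the point is that the fault-free path costs almost nothing.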
Fault Detection w/ SW-assisted Detectors • Simple HW detectors effective, require no SW changes • SW-assisted detectors to augment HW detectors • Minimal changes to SW for more effective detectors • Amortize resiliency cost with SW bug detection • Explored two simple SW-assisted schemes • Detecting out-of-bounds addresses • Low HW overhead, near-zero impact on performance • Using likely program invariants • Instrumented binary, no HW changes • <5% performance overhead on x86 processors
Fault Detection w/ SW-assisted Detectors
(Figure: app address space – empty, app code, globals, heap, libraries, stack, reserved)
• Address out-of-bounds detector
  • Monitor boundaries of heap, stack, globals
  • Address beyond these bounds ⇒ HW fault
  • HW-only detectors catch such faults at longer latency
• iSWAT: using likely program invariants to detect HW faults [Sahoo et al., DSN '08]
  • Mine "likely" invariants on data values, e.g., 10 ≤ x ≤ 20
  • Hold on observed inputs, expected to hold on others
  • Violation of likely invariant ⇒ HW fault
  • Useful to detect faults that affect only data
  • iSWAT not explored in this talk
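The likely-invariant idea can be sketched as mining a min/max range for a program value over fault-free training runs and flagging out-of-range values in deployment. The class and method names below are illustrative, not iSWAT's actual interface:

```python
# Sketch of the likely-invariant idea: mine a value range over fault-free
# training runs; an out-of-range value at runtime suggests a HW fault.
class RangeInvariant:
    def __init__(self):
        self.lo = None
        self.hi = None

    def train(self, value):
        # Widen the likely invariant to cover every observed training value.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def check(self, value):
        # Violation of the likely invariant => suspect a hardware fault.
        return self.lo <= value <= self.hi

inv = RangeInvariant()
for v in [10, 12, 17, 20]:        # fault-free training inputs
    inv.train(v)
assert inv.check(15)              # within [10, 20]: passes
assert not inv.check(1 << 30)     # corrupted data value: flagged
```

Because the invariants are only "likely", a violation triggers diagnosis rather than an immediate repair decision.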
Evaluating Fault Detection
(Methodology: inject fault, timing simulation for 10M instr; if no symptom in 10M instr, functional simulation to completion ⇒ masked or Silent Data Corruption (SDC))
• Microarchitecture-level fault injection (latch elements)
• GEMS timing models + Simics full-system simulation
• All SPEC 2k C/C++ workloads in 64-bit OpenSolaris OS
• Stuck-at, transient faults in 8 µarch units (single fault model)
• 10,000 faults of each type ⇒ statistically significant
• Simulate impact of fault in detail for 10M instructions
• Metrics: SDC rate, detection latency
SDC Rate of HW-only Detectors
• Simple detectors give 0.7% SDC rate for permanent faults
• Faults in FPU need better detectors
  • Mostly corrupt only data ⇒ iSWAT may detect
SDC Rate of HW-only Detectors • Transient faults also have low SDC rate of 0.3% • High rate of masking from transients • Consistent with prior work on transients
Application-Aware SDC Analysis
• SDCs ⇒ undetected faults that corrupt only data values
  • SWAT detectors catch other corruptions
  • Most faults do not corrupt only data values
• But some "SDCs" are actually acceptable outputs!
  • Traditionally, SDC ⇒ output differs from fault-free output
  • But different outputs may still be acceptable
  • Diff solutions, diff solutions with degraded quality, etc.
  • E.g., same-cost place & route, acceptable PSNR, etc.
• SWAT detectors cannot (and should not) detect acceptable changes in output
• For each app, define % degradation in output quality
Application-Aware SDC Analysis
• 10/16 SPEC have multiple correct solutions (results for all)
• App-aware analysis ⇒ remarkably low SDC rate for SWAT
  • Only 28 faults show >0% degradation from golden output
  • 10 of >16,000 injected faults are SDC at >1% degradation
• Ongoing work: formalization of why/when SWAT works
Detection Latency
(Timeline: fault → bad arch state → bad SW state → detection; old latency measured from arch-state corruption, new latency from SW-state corruption; recoverable checkpoints precede the corruption)
• Detection latency dictates recoverability
  • Fault recoverable as long as fault-free checkpoint exists
• Traditional detection latency = arch state corruption to detection
  • Checkpoint records bad arch state ⇒ SW assumed affected
  • But not all arch state corruptions affect SW output
• New detection latency = SW state corruption to detection
Detection Latency • >98% of all faults detected within 10M instructions • Recoverable using HW checkpoint schemes
Detection Latency
• >98% of all faults detected within 10M instructions
  • Recoverable using HW checkpoint schemes
• Out-of-bounds detector further reduces detection latency
  • Many longer-latency detections stem from address violations
Detection Latency • Measuring new latency important to study recoverability • Significant differences between old and new latency
Fault Detection - Summary
• Simple detectors effective in detecting HW faults
  • Low SDC rate even with HW-only detectors
  • Short detection latencies for hardware faults
• SW-assisted out-of-bounds detector reduces latency further
• Measuring new detection latency important for recovery
• Next: diagnosis of detected faults
Fault Diagnosis
(Figure: a symptom may stem from a permanent fault, a SW bug, or a transient fault)
• Symptom-based detection is cheap, but
  • May incur long latency from activation to detection
  • Difficult to diagnose root cause of fault
• Goal: diagnose the fault with minimal hardware overhead
  • Rarely invoked ⇒ higher perf overhead acceptable
SWAT Single-threaded Fault Diagnosis [Li et al., DSN '08]
(Figure: traditional DMR runs P1 and P2 in lockstep, always on ⇒ expensive; synthesized DMR compares against a second core only on a fault ⇒ fault-free case stays cheap)
• First, diagnosis for single-threaded workload on one core
  • Multithreaded w/ multicore later – several new challenges
• Key ideas
  • Single-core fault model, multicore ⇒ fault-free core available
  • Chkpt/replay for recovery ⇒ replay on good core, compare
  • Synthesizing DMR, but only for diagnosis
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
(Decision tree: rollback on faulty core – no symptom ⇒ transient or non-deterministic s/w bug, continue execution; symptom ⇒ deterministic s/w or permanent h/w bug, so rollback/replay on good core – symptom ⇒ deterministic s/w bug, send to s/w layer; no symptom ⇒ permanent h/w fault, needs repair!)
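This decision tree can be sketched directly. The two callables below are hypothetical stand-ins for "roll back to a checkpoint and re-execute, watching for the symptom":

```python
# Sketch of the SWAT single-core diagnosis decision tree. Each argument is
# a callable returning True if the symptom reappears on replay; `replay`
# stands in for checkpoint rollback + re-execution (an assumption here).
def diagnose(replay_on_faulty_core, replay_on_good_core):
    if not replay_on_faulty_core():
        return "transient or nondeterministic SW bug"   # continue execution
    if replay_on_good_core():
        return "deterministic SW bug"                   # send to SW layer
    return "permanent HW fault"                         # needs repair

assert diagnose(lambda: False, lambda: False) == "transient or nondeterministic SW bug"
assert diagnose(lambda: True, lambda: True) == "deterministic SW bug"
assert diagnose(lambda: True, lambda: False) == "permanent HW fault"
```

The second replay only runs when the first reproduces the symptom, which is why the expensive good-core comparison stays off the common path.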
µarch-level Fault Diagnosis
(Flow: symptom detected → diagnosis → transient fault / software bug / permanent fault → microarchitecture-level diagnosis → unit X is faulty)
Trace-Based Fault Diagnosis (TBFD)
• µarch-level fault diagnosis using rollback/replay
• Key: the execution that caused the symptom ⇒ its trace activates the fault
• Deterministically replay trace on faulty, fault-free cores
  • Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
  • Check µarch-level invariants in several parts of processor
  • Diagnosis in out-of-order logic (meta-datapath) complex
Trace-Based Fault Diagnosis: Evaluation
• Goal: diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
  • ~8500 detected faults (98% of unmasked)
• Results
  • 98% of the detections successfully diagnosed
  • 91% diagnosed within 1M instr (~0.5ms on 2GHz proc)
SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO '09]
(Figure: a fault on core 1 propagates through a store/load in shared memory, causing symptom detection on fault-free core 2)
• Challenge 1: deterministic replay involves high overhead
• Challenge 2: multithreaded apps share data among threads
  • Symptom-causing core may not be faulty
  • No known fault-free core in system
mSWAT Diagnosis - Key Ideas
• Challenges
  • Multithreaded applications
  • Full-system deterministic replay
  • No known good core
• Key ideas
  • Isolated deterministic replay
  • Emulated TMR
(Figure: thread traces TA-TD captured on cores A-D, then each trace replayed in isolation on two other cores in rotation)
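The emulated-TMR idea can be sketched as follows: replay each captured thread trace on other cores and declare a core faulty when executions that ran on it disagree while replays of the same traces elsewhere agree. The execution model (`run`, the trace format) is a toy assumption, and for clarity this sketch replays every trace on every core rather than mSWAT's rotation:

```python
# Sketch of emulated TMR for mSWAT diagnosis: a core is faulty when its
# replays disagree with the other cores' replays, which all agree among
# themselves. The execution model below is a deliberately toy assumption.
from itertools import combinations

def run(core, trace, faulty_core):
    # Toy model: a faulty core corrupts every output of traces it executes.
    return [x + 1 if core == faulty_core else x for x in trace]

def diagnose_faulty_core(cores, traces, faulty_core):
    for suspect in cores:
        others = [c for c in cores if c != suspect]
        # Suspect's replays differ from another core's replays...
        mismatch = all(
            run(suspect, traces[t], faulty_core) != run(others[0], traces[t], faulty_core)
            for t in traces
        )
        # ...while the remaining cores agree with each other.
        agree = all(
            run(a, traces[t], faulty_core) == run(b, traces[t], faulty_core)
            for t in traces for a, b in combinations(others, 2)
        )
        if mismatch and agree:
            return suspect
    return None

cores = ["A", "B", "C", "D"]
traces = {"TA": [1, 2], "TB": [3], "TC": [4, 5], "TD": [6]}
assert diagnose_faulty_core(cores, traces, faulty_core="B") == "B"
```

The rotation on the actual slide achieves the same cross-check with only two extra replays per trace instead of all-pairs comparison.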
mSWAT Diagnosis: Evaluation
• Diagnose detected permanent faults in multithreaded apps
  • Goal: identify faulty core; TBFD for µarch-level diagnosis
  • Challenges: non-determinism, no fault-free core known
  • ~4% of faults detected on a fault-free core
• Results
  • 95% of detected faults diagnosed
  • All detections from fault-free core diagnosed
  • 96% of diagnosed faults require <200KB buffers
  • Can be stored in lower-level cache ⇒ low HW overhead
• SWAT diagnosis can work with other symptom detectors
SWAT Recovery
• Checkpoint/replay: rollback to pristine state, re-execute
• I/O buffering: prevent irreversible effects
• Recovery masks effect of fault for continuous operation
• Checkpointing "always-on" ⇒ must incur minimal overhead
  • Low area overhead, minimal performance impact
• SWAT symptom detection assumes checkpoint recovery
  • Fault allowed to corrupt the architecture state
Components of Recovery Device I/O ST ST ST old val Registers Snapshot 3 Registers Snapshot 1 Registers Snapshot 2 old val I/O Buffer 1 Memory Log 1 Memory Log 2 Commit I/O • Checkpointing • Periodic snapshot of registers, undo log for memory • Restore register/memory state up on detection • I/O buffering • External outputs buffered until known to be fault-free • HW buffer to record I/O until next checkpoint interval
Analysis of Recovery Overheads [Led by Alex Li]
• Goal: measure overheads from checkpointing, I/O buffering
  • Measured on 2 server applications – apache, sshd
  • ReVive for chkpt, several techniques for I/O buffering
• State-of-the-art incurs high overhead at short chkpt intervals
  • >30% performance overhead at interval of <1M cycles!
• Long chkpt interval ⇒ I/O buffering incurs high HW overhead
  • Checkpoint intervals of 10M ⇒ HW buffer of 100KB
• Push-and-pull effect between recovery components
• Ongoing work: SWAT recovery module with low overheads
Summary: SWAT works!
• Very low-cost detectors [ASPLOS'08, DSN'08]: low SDC rate, latency
• Application-aware SWAT: even lower SDC, latency
• Accurate fault modeling [HPCA'09]
• In-situ diagnosis [DSN'08]
• Multithreaded workloads [MICRO'09]
SWAT Advantages and Limitations
• Advantages
  • Handles all faults that matter, oblivious to failure modes
  • Low, amortized overheads across HW/SW reliability
  • Customizable and flexible due to firmware implementation
  • Concepts applicable beyond hardware reliability
• Limitations
  • SWAT reliability guarantees largely empirical
  • SWAT firmware, recovery module not yet ready
  • Off-core faults, other fault models not evaluated
Future Work
• Formalization of when/why SWAT works
• Near-zero-cost recovery
• More server/distributed applications
• Other core and off-core parts, other fault models
• Prototyping SWAT on FPGA
  • With T. Austin / V. Bertacco at University of Michigan
Address Out-of-Bounds Detector
(Figure: app address space from 0x0 to 0xffff… (2^64 − 1) – empty below 0x100000000, then app code, globals, heap, libraries, stack, reserved)
• Address faults may result in long detection latencies
  • Corrupt address unallocated but in valid page
  • Many data value corruptions before symptom
• Low-cost address out-of-bounds detector
  • Compiler tells hardware the static bounds
  • malloc reports heap limits to hardware
  • Stack limits recorded on function execution
• Can amortize cost across software bug detectors
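The detector's check can be sketched as a set of valid-region bounds that the compiler and allocator keep up to date, with any access outside every region flagged as a likely fault. The region layout constants below are illustrative assumptions, not the actual Solaris/SPARC layout:

```python
# Sketch of the address out-of-bounds detector: hardware holds the current
# bounds of each valid region (updated by compiler/malloc/function entry);
# an access outside every region is treated as a likely hardware fault.
# All layout constants are illustrative assumptions.
REGIONS = {
    "code":    (0x0000_0000_0040_0000, 0x0000_0000_0080_0000),
    "globals": (0x0000_0000_0080_0000, 0x0000_0000_00C0_0000),
    "heap":    (0x0000_0000_0100_0000, 0x0000_0000_0200_0000),  # grows up
    "stack":   (0x0000_7FFF_F000_0000, 0x0000_8000_0000_0000),  # grows down
}

def address_in_bounds(addr):
    """Return True if addr falls inside some currently valid region."""
    return any(lo <= addr < hi for lo, hi in REGIONS.values())

def on_malloc_grow(new_heap_top):
    # malloc reports the new heap boundary to the hardware detector.
    lo, _ = REGIONS["heap"]
    REGIONS["heap"] = (lo, new_heap_top)

assert address_in_bounds(0x0000_0000_0100_0040)        # heap access: fine
assert not address_in_bounds(0x0000_0000_0300_0000)    # unallocated: fault
on_malloc_grow(0x0000_0000_0400_0000)                  # heap grows
assert address_in_bounds(0x0000_0000_0300_0000)        # now valid
```

Keeping the bounds in hardware makes the check a handful of comparators on the address path, which is why the fault-free overhead stays near zero.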
Permanent Faults: HW-only Detectors
• Fatal traps and panics detect most faults
• Large fraction of detections by symptoms from OS
Measuring Detection Latency
(Timeline: fault → bad arch state → bad SW state → detection; rollback & replay from successive checkpoints – if the fault effect is masked, that checkpoint precedes SW state corruption; new latency measured from the first checkpoint where the symptom recurs)
• New detection latency = SW state corruption to detection
• But identifying SW state corruption is hard!
  • Need to know how faulty value is used by application
  • If faulty value affects output, then SW state corrupted
• Measure latency by rolling back to older checkpoints
  • Only for analysis, not required in real system
Extending SWAT Diagnosis to Multithreaded Apps
(Figure: replay on cores C1-C3 plus spare S; symptom reappears ⇒ faulty core is C2; no symptom detected ⇒ fault lies elsewhere)
• Naïve extension – N known-good cores to replay the trace
  • Too expensive – area
  • Requires full-system deterministic replay
• Simple optimization – one spare core
  • Not scalable; requires N full-system deterministic replays
  • Requires a spare core ⇒ single point of failure
mSWAT Fault Diagnosis Algorithm
(Flow: symptom detected → capture fault-activating trace → re-execute captured traces on other cores → look for divergence → faulty core)
(Example: traces TA-TD from cores A-D replayed in rotation; replays involving core B diverge while the others agree ⇒ faulty core is B)
Recording the Fault-Activating Trace
• What info to capture for deterministic isolated replay?
• Capture all inputs to thread as trace
  • Record data values of all loads
• Ensures isolated deterministic replay
  • Isolated replay ⇒ lower overhead
Comparing Deterministic Replays
• How to identify divergence?
• Comparing all instructions ⇒ large buffer needed
• Faults propagate to SW through branch, load, store instructions
  • Other faults manifest in these instructions
• Record, compare only these instructions
  • Lower HW buffer overhead
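The compaction idea can be sketched as filtering each replay's instruction stream down to loads, stores, and branches before comparing. The `(opcode, value)` trace format below is an illustrative assumption, not mSWAT's actual record layout:

```python
# Sketch of the divergence check: record only the outcomes of loads,
# stores, and branches from each replay and compare those compact traces;
# a mismatch marks the diverging (faulty) execution.
def compact_trace(instr_stream):
    """Keep only the instruction classes through which faults reach software."""
    return [(op, val) for op, val in instr_stream
            if op in ("load", "store", "branch")]

good = [("add", 3), ("load", 7), ("branch", True), ("store", 9)]
faulty = [("add", 3), ("load", 7), ("branch", False), ("store", 9)]

assert compact_trace(good) != compact_trace(faulty)      # divergence detected
assert compact_trace(good) == compact_trace(list(good))  # identical replays agree
```

A fault that corrupts only an `add` result is still caught once that value feeds a later load, store, or branch, which is why comparing the compact trace suffices.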
mSWAT Diagnosis: Hardware Cost
• Trace captured in native execution
  • HW support for trace collection
• Deterministic replay is firmware-emulated
  • Requires minimal hardware support
• Replay threads in isolation
  • No need to capture memory orderings
• Long detection latency ⇒ large trace buffers (8MB/core)
  • Iterative diagnosis algorithm reduces buffer overhead: repeatedly execute on short traces, e.g., 100,000 instrs
Results: mSWAT Fault Diagnosis
• Over 95% of detected faults are successfully diagnosed
• All faults detected on a fault-free core are diagnosed