SWAT: Designing Resilient Hardware by Treating Software Anomalies. Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou Department of Computer Science University of Illinois at Urbana-Champaign
SWAT: Designing Resilient Hardware by Treating Software Anomalies Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou Department of Computer Science University of Illinois at Urbana-Champaign swat@cs.uiuc.edu
Motivation • Hardware failures will happen in the field • Aging, soft errors, inadequate burn-in, design defects, … • Need in-field detection, diagnosis, recovery, repair • Reliability problem pervasive across many markets • Traditional redundancy (e.g., nMR) too expensive • Piecemeal solutions for specific fault models too expensive • Must incur low area, performance, power overhead • Today: low-cost solution for multiple failure sources
Observations • Need to handle only hardware faults that propagate to software • Fault-free case remains common, must be optimized • Watch for software anomalies (symptoms) • Hardware fault detection ~ software bug detection • Zero to low overhead “always-on” monitors • Diagnose cause after symptom detected • May incur high overhead, but rarely invoked • SWAT: SoftWare Anomaly Treatment
SWAT Framework Components • Detection: Symptoms of S/W misbehavior, minimal backup H/W • Recovery: Hardware/Software checkpoint and rollback • Diagnosis: Rollback/replay on multicore • Repair/reconfiguration: Redundant, reconfigurable hardware • Flexible control through firmware • [Timeline figure: Checkpoint → Checkpoint → Fault → Error → Symptom detected → Recovery → Diagnosis → Repair]
SWAT • 1. Detectors w/ Hardware support [ASPLOS ‘08] • 2. Detectors w/ Software support [Sahoo et al., DSN ‘08] • 3. Trace Based Fault Diagnosis [Li et al., DSN ‘08] • 4. Accurate Fault Modeling • [Timeline figure: Checkpoint → Checkpoint → Fault → Error → Symptom detected → Recovery → Diagnosis → Repair]
Hardware-Only Symptom-based detection • Observe anomalous symptoms for fault detection • Incur low overheads for “always-on” detectors • Minimal support from hardware • Fatal traps generated by hardware • Division by Zero, RED State, etc. • Hangs detected using simple hardware hang detector • High OS activity detected with performance counter • Typical OS invocations take 10s or 100s of instructions
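The three symptom classes on this slide can be sketched as a single check. This is only an illustration: the `Sample` fields, the `same_pc_streak` proxy for a hang, and both thresholds are assumptions that do not come from the slides, which give no concrete values or hang-detector design.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the slides do not state the values SWAT uses.
HANG_WINDOW = 1000           # instructions without forward progress
OS_ACTIVITY_LIMIT = 10_000   # contiguous OS instructions treated as anomalous


@dataclass
class Sample:
    fatal_trap: bool           # e.g. division by zero, RED state
    same_pc_streak: int        # assumed proxy for "no forward progress"
    contiguous_os_instrs: int  # would be read from a performance counter


def detect_symptom(s: Sample) -> Optional[str]:
    """Return the name of the first symptom observed, or None if none fired."""
    if s.fatal_trap:
        return "fatal_trap"
    if s.same_pc_streak >= HANG_WINDOW:
        return "hang"
    if s.contiguous_os_instrs > OS_ACTIVITY_LIMIT:
        return "high_os_activity"
    return None
```

A typical OS invocation of a few hundred instructions stays well under the assumed limit, so `detect_symptom(Sample(False, 0, 300))` reports no symptom.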
Experimental Methodology • Microarchitecture-level fault injection • GEMS timing models + Simics full-system simulation • SPEC workloads on Solaris-9 OS • Permanent fault models • Stuck-at, bridging faults in latches of 8 arch structures • 12,800 faults, <0.3% error @ 95% confidence • Simulate impact of fault in detail for 10M instructions • [Figure: timing simulation for 10M instr after the fault; if no symptom in 10M instr, functional simulation runs to completion; outcomes: app masked, symptom > 10M, or silent data corruption (SDC)]
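A stuck-at fault in a latch bit can be modeled as forcing that bit to a constant on every read. The sketch below is a minimal illustration of that fault model only; the function names and the 64-bit default width are assumptions, and bridging faults are not modeled.

```python
def apply_stuck_at(value: int, bit: int, stuck_value: int, width: int = 64) -> int:
    """Return `value` as seen through a latch whose `bit` is stuck at 0 or 1."""
    if stuck_value not in (0, 1):
        raise ValueError("stuck_value must be 0 or 1")
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    value &= (1 << width) - 1          # keep only the latch's bits
    mask = 1 << bit
    return (value | mask) if stuck_value else (value & ~mask)


def is_activated(value: int, bit: int, stuck_value: int, width: int = 64) -> bool:
    """A fault is activated only when it changes the value actually latched."""
    clean = value & ((1 << width) - 1)
    return apply_stuck_at(value, bit, stuck_value, width) != clean
```

A stuck-at-1 on a bit that already holds 1 is not activated, which is one reason many injected faults are masked before they can reach software.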
Efficacy of Hardware-only Detectors • Coverage: Percentage of unmasked faults detected • 98% of faults detected, 0.4% give SDC (w/o FPU) • Additional support required for FPU-like units • 66% of detected faults corrupt OS state, need recovery • Despite low OS activity in fault-free execution • Latency: Number of instr between activation and detection • HW recovery for up to 100k instr, SW for longer latencies • App in 87% of detections recoverable using HW • OS recoverable in virtually all detections using HW • OS recovery using SW hard
Improving SWAT Detection Coverage • Can we improve coverage, SDC rate further? • SDC faults primarily corrupt data values • Illegal control/address values caught by other symptoms • Need detectors to capture “semantic” information • Software-level invariants capture program semantics • Use when higher coverage desired • Sound program invariants require expensive static analysis • We use likely program invariants
Likely Program Invariants • Likely program invariants • Hold on all observed inputs, expected to hold on others • But suffer from false positives • Use SWAT diagnosis to detect false positives on-line • iSWAT - Compiler-assisted symptom detectors • Range-based value invariants [Sahoo et al. DSN ‘08] • Check MIN ≤ value ≤ MAX on data values • Disable invariant when diagnosis reports a false positive
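The range-based invariant described above can be sketched in a few lines: a training phase records the minimum and maximum seen for a value, a checking phase flags values outside that range, and a diagnosed false positive disables the invariant. The class and method names here are hypothetical; iSWAT itself inserts such checks with an LLVM compiler pass rather than a Python object.

```python
class RangeInvariant:
    """Likely invariant of the form MIN <= value <= MAX for one program value."""

    def __init__(self) -> None:
        self.lo = None        # smallest value seen in training
        self.hi = None        # largest value seen in training
        self.enabled = True   # cleared once diagnosis reports a false positive

    def train(self, value: int) -> None:
        """Widen the range to include a value observed on a training input."""
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def violated(self, value: int) -> bool:
        """True if the value falls outside the learned range while enabled."""
        if not self.enabled or self.lo is None:
            return False
        return not (self.lo <= value <= self.hi)

    def disable(self) -> None:
        """Called when diagnosis concludes the violation was a false positive."""
        self.enabled = False
```

Because the range only reflects observed inputs, a legitimate value on a new input can trip the check; disabling the invariant after diagnosis keeps that from recurring.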
iSWAT implementation: Training Phase • Application → compiler pass in LLVM → application with invariant monitoring code • Run on test, train, external inputs → ranges for i/p #1 … i/p #n → invariant ranges
iSWAT implementation: Training Phase and Fault Detection Phase • Training: application → compiler pass in LLVM → application with invariant monitoring code • Run on test, train, external inputs → ranges for i/p #1 … i/p #n → invariant ranges • Fault detection: application → compiler pass in LLVM → application with invariant checking code • Run on ref input, inject faults, full system simulation • Invariant violation → SWAT diagnosis → fault detection, or false positive (disable invariant)
iSWAT Results • Explored SWAT with 5 apps on previous methodology • Undetected faults reduce by 30% • Invariants reduce SDCs by 73% (33 to 9) • Overheads: 5% on x86, 14% on UltraSparc IIIi • Reasonably low overheads on some machines • Un-optimized invariants used, can be further reduced • Exploring more sophistication for coverage, overheads
Fault Diagnosis • Symptom-based detection is cheap but • High latency from fault activation to detection • Difficult to diagnose root cause of fault • How to diagnose SW bug vs. transient vs. permanent fault? • For permanent fault within core • Disable entire core? Wasteful! • Disable/reconfigure µarch-level unit? • How to diagnose faults to µarch unit granularity? • Key ideas • Single core fault model, multicore → fault-free core available • Checkpoint/replay for recovery → replay on good core, compare • Synthesizing DMR, but only for diagnosis
SW Bug vs. Transient vs. Permanent • Rollback/replay on same/different core • Watch if symptom reappears • Symptom detected → rollback/replay on faulty core • No symptom → transient or non-deterministic s/w bug, continue execution • Symptom → false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on good core • No symptom on good core → permanent h/w fault, needs repair! • Symptom on good core → false positive (iSWAT) or deterministic s/w bug, send to s/w layer
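The decision flow on this slide reduces to two replays. The sketch below assumes each replay is available as a callable that returns whether the symptom reappeared; those callables and the `Diagnosis` names are hypothetical stand-ins for the firmware-controlled rollback/replay mechanism.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW = "needs repair"
    DETERMINISTIC_SW_OR_FALSE_POSITIVE = "send to s/w layer"


def diagnose(symptom_on_faulty_core: Callable[[], bool],
             symptom_on_good_core: Callable[[], bool]) -> Diagnosis:
    """Classify a detected symptom by replaying from the last checkpoint."""
    # Step 1: roll back and replay on the core where the symptom appeared.
    if not symptom_on_faulty_core():
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    # Step 2: the symptom is repeatable, so replay on a different core.
    if not symptom_on_good_core():
        return Diagnosis.PERMANENT_HW
    return Diagnosis.DETERMINISTIC_SW_OR_FALSE_POSITIVE
```

The second replay runs only when the first one reproduces the symptom, so the cost of using another core is paid only for repeatable symptoms.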
Diagnosis Framework • Symptom detected → Diagnosis → software bug, transient fault, or permanent fault • Permanent fault → Microarchitecture-Level Diagnosis → Unit X is faulty
Trace-Based Fault Diagnosis (TBFD) • Permanent fault detected → invoke TBFD • Faulty core execution =? fault-free core execution → diagnosis algorithm
Trace-Based Fault Diagnosis (TBFD) • Permanent fault detected → invoke TBFD • Rollback faulty-core to checkpoint • Replay execution, collect info • Compare (=?) with fault-free core execution → diagnosis algorithm
Trace-Based Fault Diagnosis (TBFD) • Permanent fault detected → invoke TBFD • Rollback faulty-core to checkpoint • Replay execution, collect info • Load checkpoint on fault-free core • Fault-free instruction exec • Compare (=?) → diagnosis algorithm • What info to collect? • What info to compare? • What to do on divergence?
Can a Divergent Instruction Lead to Diagnosis? • Simpler case: ALU fault • [Figure: table of HW used (dec, alu, dst preg) with fault-free and faulty results for add r1,r3,r5 and sub r6,r1,r2] • Both divergent instructions used same ALU → ALU1 faulty
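For the simple ALU case, the reasoning is an intersection: the unit shared by every divergent instruction is the suspect. The sketch below illustrates that step only; the decoder indices and the dictionary layout are hypothetical, while the two instructions and the ALU1 conclusion follow the slide.

```python
from typing import Dict, List, Set


def common_units(divergent_instrs: List[Dict]) -> Set[str]:
    """Intersect the hardware units used by every divergent instruction."""
    unit_sets = [set(instr["units"]) for instr in divergent_instrs]
    if not unit_sets:
        return set()
    return set.intersection(*unit_sets)


# Hypothetical resource usage for the slide's two divergent instructions.
divergent = [
    {"asm": "add r1,r3,r5", "units": ["decoder0", "alu1"]},
    {"asm": "sub r6,r1,r2", "units": ["decoder2", "alu1"]},
]
suspects = common_units(divergent)
```

Here `suspects` contains only `alu1`, matching the slide's conclusion. A fault in the register alias table does not yield to this direct intersection, since the instruction that diverges may never use the faulty entry.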
Can a Divergent Instruction Lead to Diagnosis? • Complex example: Fault in register alias table (RAT) entry • Divergent instructions do not directly lead to faulty unit • Instead, look backward/forward in instruction stream • Need to collect and analyze instruction trace • [Figure: RAT (logical → physical register) and register file (physical register → value) contents for IA: r3 ← r2 + r2 and IB: r1 ← r5 * r2; fault-free r1 = 12, but IB diverges] • But IB does not use faulty HW…
Diagnosing Permanent Fault to µarch Granularity • Trace-based fault diagnosis (TBFD) • Compare instruction trace of faulty vs. good execution • Divergence → faulty hardware used → diagnosis clues • Diagnose faults to µarch units of processor • Check µarch-level invariants in several parts of processor • Front end, meta-datapath, datapath faults • Diagnosis in out-of-order logic (meta-datapath) complex • Results • 98% of the faults detected by SWAT successfully diagnosed • TBFD flexible for other detectors/granularity of repair
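The comparison at the heart of TBFD can be illustrated by finding the first point where the faulty-core trace and the fault-free trace disagree. This is a sketch under assumptions: the `TraceEntry` fields are hypothetical, and the real diagnosis also checks µarch-level invariants around the divergence rather than stopping at its index.

```python
from typing import NamedTuple, Optional, Sequence


class TraceEntry(NamedTuple):
    pc: int      # program counter of the retired instruction
    dst: str     # destination register name
    value: int   # value written to the destination


def first_divergence(faulty: Sequence[TraceEntry],
                     good: Sequence[TraceEntry]) -> Optional[int]:
    """Return the index of the first differing entry, or None if traces match."""
    for i, (f, g) in enumerate(zip(faulty, good)):
        if f != g:
            return i
    if len(faulty) != len(good):
        # One trace ended early: divergence is at the end of the shorter one.
        return min(len(faulty), len(good))
    return None
```

The returned index marks where the diagnosis algorithm would start looking backward and forward in the instruction stream for the hardware the divergent instructions touched.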
SWAT • 1. Detectors w/ Hardware support [ASPLOS ‘08] • 2. Detectors w/ Software support [Sahoo et al., DSN ‘08] • 3. Trace Based Fault Diagnosis [Li et al., DSN ‘08] • 4. Accurate Fault Modeling • [Timeline figure: Checkpoint → Checkpoint → Fault → Error → Symptom detected → Recovery → Diagnosis → Repair]
SWAT-Sim: Fast and Accurate Fault Models • Need accurate µarch-level fault models • Gate-level injections accurate but too slow • µarch (latch) level injections fast but inaccurate • Can we achieve µarch-level speed at gate-level accuracy? • Mixed-mode (hierarchical) simulation • µarch-level + gate-level simulation • Simulate only faulty component at gate level, on demand • Invoke gate-level sim online for permanent faults • Simulating fault effect with real-world vectors
SWAT-Sim: Gate-level Accuracy at µarch Speeds • µarch-level simulation of r3 ← r1 op r2 • Faulty unit used? • No → continue µarch simulation • Yes → input stimuli → gate-level fault simulation → output response (r3) • Fault propagated to output → continue µarch simulation
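The mixed-mode flow above can be sketched with a toy 8-bit adder: operations run in a fast functional model unless they use the faulty unit, in which case a bit-by-bit gate-style model with an injected stuck-at fault produces the result. Everything here is illustrative; the unit names, the 8-bit width, and the choice of a stuck carry net are assumptions, and the actual SWAT-Sim uses NCVerilog with VPI for the gate-level side.

```python
WIDTH = 8  # toy datapath width (assumption)


def uarch_add(a: int, b: int) -> int:
    """Fast functional (µarch-level) model of the adder."""
    return (a + b) & ((1 << WIDTH) - 1)


def gate_level_add(a: int, b: int, stuck_carry_bit=None, stuck_value: int = 0) -> int:
    """Ripple-carry adder evaluated gate by gate.

    If stuck_carry_bit is given, the carry out of that bit position is
    forced to stuck_value, modeling a stuck-at fault on that net.
    """
    carry, result = 0, 0
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        if i == stuck_carry_bit:
            carry = stuck_value
        result |= s << i
    return result


def simulate(ops, faulty_unit: str, stuck_carry_bit: int, stuck_value: int):
    """Run (unit, a, b) add operations, dropping to gate level on demand."""
    results = []
    for unit, a, b in ops:
        if unit == faulty_unit:   # faulty unit used: gate-level fault simulation
            results.append(gate_level_add(a, b, stuck_carry_bit, stuck_value))
        else:                     # otherwise stay in the fast µarch model
            results.append(uarch_add(a, b))
    return results
```

With the carry out of bit 0 stuck at 0 on `alu1`, `5 + 3` on that unit yields 6 instead of 8, while `4 + 2` still yields 6 because those inputs never raise that carry. Whether the fault shows up depends on the actual input vectors, which a fixed latch-level stuck-at model does not capture.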
Results from SWAT-Sim • SWAT-Sim implemented within full-system simulation • NCVerilog + VPI for gate-level sim of ALU/AGEN modules • SWAT-Sim: High accuracy at low overheads • 100,000x faster than gate-level, same modeling fidelity • 2x slowdown over µarch-level, at higher accuracy • Accuracy of µarch models using SWAT coverage/latency • µarch stuck-at models generally inaccurate • Differences in activation rate, multi-bit flips • Complex manifestations → hard to derive better models • Need SWAT-Sim, at least for now
SWAT Summary • SWAT: SoftWare Anomaly Treatment • Handle all and only faults that matter • Low, amortized overheads • Holistic systems view enables novel solutions • Customizable and flexible • Prior results: • Low-cost h/w detectors gave high coverage, low SDC rate • This talk: • iSWAT: Higher coverage w/ software-assisted detectors • TBFD: µarch level fault diagnosis by synthesizing DMR • SWAT-Sim: Gate-level fault accuracy at µarch level speed
Future Work • Recovery: hybrid, application-specific • Aggressive use of software reliability techniques • Leverage diagnosis mechanism • Multithreaded software • Off-core faults • Post-silicon debug and test • Use faulty trace as fault-model oblivious test vector • Validation on FPGA (w/ Michigan) • Hardware assertions to complement software symptoms
Breakup of Detections by SW symptoms • 98% unmasked faults detected within 10M instr (w/o FPU) • Need HW support or SW monitoring for FPU
SW Components Corrupted • 66% of faults corrupt system state before detection • Need to recover system state
Latency from Application mismatch • 86% of faults detected under 100k • 42% detected under 10k
Latency from OS mismatch • 99% of faults detected under 100k
Trace-Based Fault Diagnosis (TBFD) • Permanent fault detected → invoke diagnosis • Rollback faulty-core to checkpoint • Replay execution, collect µarch info • Load checkpoint on fault-free core • Fault-free instruction exec • Faulty trace =? Test trace → TBFD • TBFD outcomes: faults in front-end, meta-datapath faults, datapath faults
Fault Diagnosability • 98% of detected faults are diagnosed • 89% diagnosed to unique unit/array entry • Meta-datapath faults in out-of-order exec mislead TBFD
Accuracy of existing Fault Models • SWAT-Sim implemented within full-system simulator • NCVerilog + VPI to simulate gate-level ALU and AGEN • Existing µarch-level fault models inaccurate • Differences in activation rate, multi-bit flips • Accurate models hard to derive → need SWAT-Sim!
Summary: SWAT Advantages • Handles all faults that matter • Oblivious to low-level failure modes & masked faults • Low, amortized overheads • Optimize for common case, exploit s/w reliability solutions • Holistic systems view enables novel solutions • Invariant detectors use diagnosis mechanisms • Diagnosis uses recovery mechanisms • Customizable and flexible • Firmware based control affords hybrid, app-specific recovery (TBD) • Beyond hardware reliability • SWAT treats hardware faults as software bugs • Long-term goal: unified system (hw + sw) reliability at lowest cost • Potential applications to post-silicon test and debug
Transient Fault Results • 6400 transient faults injected across 8 structures • 83% of unmasked faults detected within 10M instr • Only 0.4% of injected faults result in SDCs