
SWAT: Designing Resilient Hardware by Treating Software Anomalies

SWAT: Designing Resilient Hardware by Treating Software Anomalies. Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou. Department of Computer Science, University of Illinois at Urbana-Champaign





Presentation Transcript


  1. SWAT: Designing Resilient Hardware by Treating Software Anomalies. Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou. Department of Computer Science, University of Illinois at Urbana-Champaign. swat@cs.uiuc.edu

  2. Motivation
  • Hardware failures will happen in the field
  • Aging, soft errors, inadequate burn-in, design defects, …
  • Need in-field detection, diagnosis, recovery, repair
  • Reliability problem pervasive across many markets
  • Traditional redundancy (e.g., nMR) too expensive
  • Piecemeal solutions for specific fault models too expensive
  • Must incur low area, performance, power overhead
  • Today: a low-cost solution for multiple failure sources

  3. Observations
  • Need to handle only hardware faults that propagate to software
  • Fault-free case remains common, must be optimized
  • Watch for software anomalies (symptoms): hardware fault detection ≈ software bug detection
  • Zero- to low-overhead "always-on" monitors
  • Diagnose cause after symptom detected: may incur high overhead, but rarely invoked
  • SWAT: SoftWare Anomaly Treatment

  4. SWAT Framework Components
  • Detection: symptoms of S/W misbehavior, minimal backup H/W
  • Recovery: hardware/software checkpoint and rollback
  • Diagnosis: rollback/replay on multicore
  • Repair/reconfiguration: redundant, reconfigurable hardware
  • Flexible control through firmware
  [Figure: execution timeline with checkpoints; fault → error → symptom detected, followed by recovery, diagnosis, repair]

  5. SWAT
  1. Detectors w/ Hardware support [ASPLOS '08]
  2. Detectors w/ Software support [Sahoo et al., DSN '08]
  3. Trace-Based Fault Diagnosis [Li et al., DSN '08]
  4. Accurate Fault Modeling
  [Figure: execution timeline with checkpoints; fault → error → symptom detected, followed by recovery, diagnosis, repair]

  6. Hardware-Only Symptom-Based Detection
  • Observe anomalous symptoms for fault detection
  • Incur low overheads for "always-on" detectors
  • Minimal support from hardware
  • Fatal traps generated by hardware: division by zero, RED state, etc.
  • Hangs detected using a simple hardware hang detector
  • High OS activity detected with a performance counter (typical OS invocations take 10s or 100s of instructions)
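The always-on detectors above are simple pieces of hardware; as a rough software analogue, the sketch below models the hang detector and the high-OS-activity detector. All function names, thresholds, and data representations are hypothetical, chosen only to illustrate the symptom logic, and are not from the SWAT papers.

```python
# Hypothetical sketch of two "always-on" symptom detectors.
# A hang: execution stays inside a tiny set of PCs without making progress.
# High OS activity: one OS invocation runs far longer than the tens to
# hundreds of instructions typical in fault-free execution.

def detect_hang(pc_trace, window=16, min_repeats=1000):
    """Flag a hang if the committed-PC stream keeps revisiting a small set."""
    recent = set()
    stuck = 0
    for pc in pc_trace:
        if pc in recent:
            stuck += 1                 # still looping in the same region
            if stuck >= min_repeats:
                return True
        else:
            recent.add(pc)
            if len(recent) > window:   # moved to new code: restart tracking
                recent = {pc}
            stuck = 0
    return False

def detect_high_os(mode_trace, threshold=10_000):
    """Flag a symptom if a contiguous run of OS-mode instructions is too long."""
    run = 0
    for mode in mode_trace:
        run = run + 1 if mode == "os" else 0
        if run > threshold:
            return True
    return False
```

In real SWAT these checks are performance counters and a small hardware monitor, not software loops; the sketch only shows why they are cheap: each is a counter plus a threshold compare.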

  7. Experimental Methodology
  • Microarchitecture-level fault injection
  • GEMS timing models + Simics full-system simulation
  • SPEC workloads on Solaris-9 OS
  • Permanent fault models: stuck-at, bridging faults in latches of 8 arch structures
  • 12,800 faults, <0.3% error @ 95% confidence
  • Simulate impact of each fault in detail (timing simulation) for 10M instructions; if no symptom in 10M instr, run to completion in functional simulation
  • Outcomes: application masks the fault, symptom after >10M instr, or silent data corruption (SDC)
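The permanent-fault models named above (stuck-at and bridging faults in latches) have a very small core operation. The function below is a hypothetical illustration of a stuck-at injection: it forces one bit of a latched value to a constant every time the latch is read, which is how a latch-level stuck-at fault manifests in µarch simulation.

```python
def inject_stuck_at(value, bit, stuck_to, width=64):
    """Model a stuck-at fault: force `bit` of a latched `value` to a constant.

    stuck_to=1 models stuck-at-1, stuck_to=0 models stuck-at-0. `width` bounds
    the result so clearing a bit cannot produce a negative Python int.
    """
    mask = 1 << bit
    if stuck_to:
        return value | mask                      # stuck-at-1: bit always set
    return value & ~mask & ((1 << width) - 1)    # stuck-at-0: bit always clear
```

Because the corruption is applied on every access, a stuck-at fault is persistent, unlike a transient bit flip, which would be applied only once.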

  8. Efficacy of Hardware-Only Detectors
  • Coverage: percentage of unmasked faults detected
  • 98% of faults detected, 0.4% give SDC (w/o FPU)
  • Additional support required for FPU-like units
  • 66% of detected faults corrupt OS state and need recovery, despite low OS activity in fault-free execution
  • Latency: number of instructions between fault activation and detection
  • HW recovery works for up to 100K instructions; SW recovery handles longer latencies
  • Application state recoverable using HW in 87% of detections
  • OS state recoverable using HW in virtually all detections
  • OS recovery using SW is hard

  9. Improving SWAT Detection Coverage
  • Can we improve coverage and the SDC rate further?
  • SDC faults primarily corrupt data values
  • Illegal control/address values are caught by other symptoms
  • Need detectors that capture "semantic" information
  • Software-level invariants capture program semantics; use when higher coverage is desired
  • Sound program invariants → expensive static analysis
  • We use likely program invariants

  10. Likely Program Invariants
  • Likely program invariants hold on all observed inputs and are expected to hold on others, but suffer from false positives
  • Use SWAT diagnosis to detect false positives on-line
  • iSWAT: compiler-assisted symptom detectors
  • Range-based value invariants [Sahoo et al., DSN '08]: check MIN ≤ value ≤ MAX on data values
  • Disable an invariant when it is diagnosed as a false positive
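The range-based invariants above can be sketched in a few lines. This is a hypothetical illustration, not iSWAT's LLVM implementation: one object per monitored program value, trained by widening [MIN, MAX] over observed values, checked at detection time, and disabled if SWAT diagnosis classifies a violation as a false positive.

```python
class RangeInvariant:
    """Likely range invariant on one program value (illustrative sketch)."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")
        self.enabled = True

    def train(self, value):
        """Training phase: widen [lo, hi] to cover every observed value."""
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def check(self, value):
        """Detection phase: True if the value satisfies MIN <= value <= MAX.
        A False result is a symptom handed to SWAT diagnosis."""
        if not self.enabled:
            return True
        return self.lo <= value <= self.hi

    def disable(self):
        """Diagnosis found a false positive: stop checking this invariant."""
        self.enabled = False
```

The key design point is visible here: training can never prove the range is complete, so the `disable` path (backed by the diagnosis framework) is what makes unsound, cheap invariants safe to deploy.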

  11. iSWAT Implementation
  [Figure: training phase. A compiler pass in LLVM adds invariant-monitoring code to the application; running it on test, train, and external inputs produces ranges for input #1 … input #n, which are merged into the invariant ranges.]

  12. iSWAT Implementation
  [Figure: two phases. Training phase: a compiler pass in LLVM adds invariant-monitoring code; running on test, train, and external inputs produces ranges for input #1 … input #n, merged into the invariant ranges. Fault detection phase: a second LLVM pass adds invariant-checking code using those ranges; faults are injected during full-system simulation on the ref input; an invariant violation goes to SWAT diagnosis, which reports either a fault detection or a false positive (disable the invariant).]

  13. iSWAT Results
  • Explored iSWAT with 5 apps using the previous methodology
  • Undetected faults reduced by 30%
  • Invariants reduce SDCs by 73% (33 to 9)
  • Overheads: 5% on x86, 14% on UltraSPARC IIIi; reasonably low on some machines
  • Un-optimized invariants used, so overheads can be further reduced
  • Exploring more sophisticated invariants for coverage and overheads

  14. Fault Diagnosis
  • Symptom-based detection is cheap, but:
  • High latency from fault activation to detection
  • Difficult to diagnose the root cause of a fault
  • How to diagnose SW bug vs. transient vs. permanent fault?
  • For a permanent fault within a core: disable the entire core? Wasteful! Disable/reconfigure a µarch-level unit instead?
  • How to diagnose faults to µarch-unit granularity?
  • Key ideas:
  • Single-core fault model; multicore → a fault-free core is available
  • Checkpoint/replay for recovery → replay on a good core and compare
  • Synthesizing DMR, but only for diagnosis

  15. SW Bug vs. Transient vs. Permanent
  • Rollback/replay on the same/different core; watch if the symptom reappears
  • Rollback on the faulty core:
  • No symptom → transient or non-deterministic s/w bug → continue execution
  • Symptom → false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on a good core:
  • No symptom → permanent h/w fault, needs repair!
  • Symptom → false positive (iSWAT) or deterministic s/w bug, send to s/w layer
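The replay-based classification above is a small decision tree. The sketch below is a hypothetical rendering of that tree; the two arguments stand in for "replay from the checkpoint on this core and report whether the symptom reappears", which in the real system involves hardware checkpointing and firmware control.

```python
def diagnose(symptom_on_faulty_replay, symptom_on_good_replay):
    """Classify a detected symptom by replaying from a checkpoint.

    Each argument is a callable returning True if the symptom reappears
    during that replay. Sketch of SWAT's diagnosis decision tree.
    """
    if not symptom_on_faulty_replay():
        # Did not reproduce even on the same core.
        return "transient or non-deterministic s/w bug: continue execution"
    if symptom_on_good_replay():
        # Reproduces on known-good hardware, so the hardware is not at fault.
        return "false positive (iSWAT) or deterministic s/w bug: send to s/w layer"
    # Reproduces only on the suspect core.
    return "permanent h/w fault: needs repair"
```

Note the ordering: the cheap same-core replay filters out transients first, so the cross-core replay (which needs a second core and checkpoint migration) is only invoked for reproducible symptoms.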

  16. Diagnosis Framework
  [Figure: symptom detected → diagnosis → one of transient fault, software bug, or permanent fault; a permanent fault goes to microarchitecture-level diagnosis, which reports "Unit X is faulty".]

  17. Trace-Based Fault Diagnosis (TBFD)
  [Figure: permanent fault detected → invoke TBFD; the diagnosis algorithm compares (=?) the faulty-core execution against a fault-free-core execution.]

  18. Trace-Based Fault Diagnosis (TBFD)
  [Figure: permanent fault detected → invoke TBFD; roll back the faulty core to a checkpoint, replay execution and collect info; the diagnosis algorithm compares (=?) this against the fault-free-core execution.]

  19. Trace-Based Fault Diagnosis (TBFD)
  [Figure: permanent fault detected → invoke TBFD; roll back the faulty core to a checkpoint and replay execution, collecting info; load the checkpoint on a fault-free core for fault-free instruction execution; the diagnosis algorithm compares the two (=?).]
  • Open questions: what info to collect? What info to compare? What to do on divergence?

  20. Can a Divergent Instruction Lead to Diagnosis?
  • Simpler case: ALU fault
  [Figure: faulty vs. fault-free traces of "add r1,r3,r5" and "sub r6,r1,r2", showing destination preg, HW used (decoder, ALU), and result values; both instructions produce divergent results and both used the same ALU → ALU1 is faulty.]

  21. Can a Divergent Instruction Lead to Diagnosis?
  • Complex example: fault in a register alias table (RAT) entry
  • Divergent instructions do not directly lead to the faulty unit
  • Instead, look backward/forward in the instruction stream
  • Need to collect and analyze the instruction trace
  [Figure: register file and RAT contents (logical → physical mappings and values) for IA: r3 ← r2 + r2 and IB: r1 ← r5 * r2; a corrupted RAT entry makes IB's result diverge (fault-free r1 = 12, faulty execution reads the wrong physical register), yet IB itself does not use the faulty HW.]

  22. Diagnosing Permanent Faults to µarch Granularity
  • Trace-based fault diagnosis (TBFD): compare the instruction trace of the faulty vs. good execution
  • Divergence → faulty hardware used → diagnosis clues
  • Diagnose faults to µarch units of the processor: check µarch-level invariants in several parts of the processor (front-end, meta-datapath, datapath faults)
  • Diagnosis in out-of-order logic (meta-datapath) is complex
  • Results: 98% of the faults detected by SWAT are successfully diagnosed
  • TBFD is flexible for other detectors and granularities of repair
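The core of the trace comparison above can be sketched compactly. This is a hypothetical simplification of TBFD: each trace entry pairs an instruction's result with the µarch unit it used, and divergences vote for the unit to blame. Real TBFD also analyzes backward/forward dependences (as the RAT example shows, the divergent instruction need not be the one using the faulty unit), which this sketch omits.

```python
from collections import Counter

def tbfd_blame(faulty_trace, golden_trace):
    """Attribute a permanent fault to a µarch unit by trace comparison.

    Each trace is a list of (result, unit_used) pairs, one per retired
    instruction, aligned between the faulty core's replay and the
    fault-free core's replay. Returns the most-blamed unit, or None if
    the traces never diverge. Illustrative sketch only.
    """
    blame = Counter()
    for (res_faulty, unit), (res_golden, _) in zip(faulty_trace, golden_trace):
        if res_faulty != res_golden:
            blame[unit] += 1   # divergence: suspect the unit this instr used
    return blame.most_common(1)[0][0] if blame else None
```

Voting across many divergent instructions is what makes the diagnosis robust: a single divergence could implicate several units, but a persistent fault keeps implicating the same one.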

  23. SWAT
  1. Detectors w/ Hardware support [ASPLOS '08]
  2. Detectors w/ Software support [Sahoo et al., DSN '08]
  3. Trace-Based Fault Diagnosis [Li et al., DSN '08]
  4. Accurate Fault Modeling
  [Figure: execution timeline with checkpoints; fault → error → symptom detected, followed by recovery, diagnosis, repair]

  24. SWAT-Sim: Fast and Accurate Fault Models
  • Need accurate µarch-level fault models
  • Gate-level injections are accurate but too slow; µarch (latch) level injections are fast but inaccurate
  • Can we achieve µarch-level speed at gate-level accuracy?
  • Mixed-mode (hierarchical) simulation: µarch-level + gate-level simulation
  • Simulate only the faulty component at gate level, on demand
  • Invoke gate-level simulation online for permanent faults
  • Simulates fault effects with real-world input vectors
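The mixed-mode dispatch above reduces to one check per instruction. The sketch below is a hypothetical rendering: `gate_level_sim` and `uarch_sim` are stand-ins for the two simulators, and only operations that touch a faulty unit pay the gate-level cost.

```python
def execute(op, a, b, faulty_units, gate_level_sim, uarch_sim):
    """Mixed-mode (hierarchical) simulation of one operation (sketch).

    Only instructions that use a faulty unit drop into slow, accurate
    gate-level simulation; everything else runs at fast µarch level.
    """
    if op in faulty_units:
        return gate_level_sim(op, a, b)   # slow but accurate; rarely invoked
    return uarch_sim(op, a, b)            # fast common case, fault-free
```

Because the fault-free case dominates (the faulty unit is used by only a fraction of instructions), overall speed stays close to µarch-level while the fault's behavior keeps gate-level fidelity.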

  25. SWAT-Sim: Gate-Level Accuracy at µarch Speeds
  [Figure: during µarch-level simulation of "r3 ← r1 op r2", check "faulty unit used?"; if no, continue µarch simulation; if yes, feed the input stimuli to gate-level fault simulation, take the output response r3, propagate the fault effect to the output, and continue µarch simulation.]

  26. Results from SWAT-Sim
  • SWAT-Sim implemented within full-system simulation: NCVerilog + VPI for gate-level simulation of ALU/AGEN modules
  • SWAT-Sim: high accuracy at low overheads
  • 100,000x faster than gate-level simulation, with the same modeling fidelity
  • 2x slowdown over µarch-level simulation, at higher accuracy
  • Evaluated accuracy of µarch fault models using SWAT coverage/latency
  • µarch stuck-at models are generally inaccurate: differences in activation rate, multi-bit flips
  • Complex manifestations → hard to derive better models
  • Need SWAT-Sim, at least for now

  27. SWAT Summary
  • SWAT: SoftWare Anomaly Treatment
  • Handles all and only the faults that matter
  • Low, amortized overheads
  • Holistic systems view enables novel solutions
  • Customizable and flexible
  • Prior results: low-cost h/w detectors gave high coverage, low SDC rate
  • This talk:
  • iSWAT: higher coverage w/ software-assisted detectors
  • TBFD: µarch-level fault diagnosis by synthesizing DMR
  • SWAT-Sim: gate-level fault accuracy at µarch-level speed

  28. Future Work
  • Recovery: hybrid, application-specific
  • Aggressive use of software reliability techniques
  • Leverage the diagnosis mechanism
  • Multithreaded software
  • Off-core faults
  • Post-silicon debug and test: use the faulty trace as a fault-model-oblivious test vector
  • Validation on FPGA (w/ Michigan)
  • Hardware assertions to complement software symptoms

  29. BACKUP SLIDES

  30. Breakdown of Detections by SW Symptoms
  • 98% of unmasked faults detected within 10M instructions (w/o FPU)
  • Need HW support or SW monitoring for FPU

  31. SW Components Corrupted
  • 66% of faults corrupt system state before detection
  • Need to recover system state

  32. Latency from Application Mismatch
  • 86% of faults detected under 100K instructions
  • 42% detected under 10K instructions

  33. Latency from OS Mismatch
  • 99% of faults detected under 100K instructions

  34. iSWAT Implementation
  [Figure: two phases. Training phase: a compiler pass in LLVM adds invariant-monitoring code; running on test, train, and external inputs produces ranges for input #1 … input #n, merged into the invariant ranges. Fault detection phase: a second LLVM pass adds invariant-checking code using those ranges; faults are injected during full-system simulation on the ref input; an invariant violation goes to SWAT diagnosis, which reports either a fault detection or a false positive (disable the invariant).]

  35. Trace-Based Fault Diagnosis (TBFD)
  [Figure: permanent fault detected → invoke diagnosis; roll back the faulty core to a checkpoint, replay execution and collect µarch info (faulty trace); load the checkpoint on a fault-free core for fault-free instruction execution (test trace); TBFD compares the traces (=?) and classifies faults as front-end, meta-datapath, or datapath faults.]

  36. Fault Diagnosability
  • 98% of detected faults are diagnosed
  • 89% diagnosed to a unique unit/array entry
  • Meta-datapath faults in out-of-order execution mislead TBFD

  37. Accuracy of Existing Fault Models
  • SWAT-Sim implemented within a full-system simulator: NCVerilog + VPI to simulate gate-level ALU and AGEN
  • Existing µarch-level fault models are inaccurate: differences in activation rate, multi-bit flips
  • Accurate models are hard to derive → need SWAT-Sim!

  38. Summary: SWAT Advantages
  • Handles all faults that matter: oblivious to low-level failure modes & masked faults
  • Low, amortized overheads: optimize for the common case, exploit s/w reliability solutions
  • Holistic systems view enables novel solutions: invariant detectors use diagnosis mechanisms; diagnosis uses recovery mechanisms
  • Customizable and flexible: firmware-based control affords hybrid, app-specific recovery (TBD)
  • Beyond hardware reliability: SWAT treats hardware faults as software bugs
  • Long-term goal: unified system (h/w + s/w) reliability at lowest cost
  • Potential applications to post-silicon test and debug

  39. Transient-Fault Results
  • 6,400 transient faults injected across 8 structures
  • 83% of unmasked faults detected within 10M instructions
  • Only 0.4% of injected faults result in SDCs
