1 / 34

mSWAT : Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems

mSWAT : Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems. Siva Kumar Sastry Hari , Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign swat@cs.uiuc.edu. Transient errors

gwidon
Download Presentation

mSWAT : Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign swat@cs.uiuc.edu

  2. Transient errors (High-energy particles ) Wear-out (Devices are weaker) Design Bugs Motivation • Hardware will fail in-the-field due to several reasons • Need in-field detection, diagnosis, repair, and recovery • Reliability problem pervasive across many markets • Traditional redundancy solutions (e.g., nMR) too expensive Need low-cost solutions for multiple failure sources • Must incur low area, performance, power overhead … and so on

  3. SWAT: Low-Cost Hardware Reliability SWAT Observations • Need handle only hardware faults that propagate to software • Fault-free case remains common, must be optimized SWAT Approach  Watch for software anomalies (symptoms) Zero to low overhead “always-on” monitors Diagnose cause after symptom detected May incur high overhead, but rarely invoked

  4. Checkpoint Checkpoint Fault Error Symptom detected Recovery Diagnosis Repair SWAT Framework Components Detectors with simple hardware [Li et al. ASPLOS’08] µarch-level Fault Diagnosis (TBFD) [Li et al. DSN’09]

  5. Checkpoint Checkpoint Fault Error Symptom detected Recovery Diagnosis Repair Challenge Detectors with simple hardware [Li et.al. ASPLOS’08] Shown to work well for single-threaded apps Does SWAT approach work on multithreaded apps? µarch-level Fault Diagnosis (TBFD) [Li et.al. DSN’09]

  6. Challenge: Data sharing in multithreaded apps • Multithreaded apps share data among threads • Does symptom detection work? • Symptom causing core may not be faulty • How to diagnose faulty core? Core 2 Core 1 Fault Error Store Load Memory Symptom Detection on a fault-free core

  7. Contributions • Evaluate SWAT detectors on multithreaded apps • Low Silent Data Corruption rate for multithreaded apps • Observed symptom from fault-free cores • Novel fault diagnosis for multithreaded apps • Identifies the faulty core despite error propagation • Provides high diagnosability

  8. Outline • Motivation • mSWAT Detection • mSWAT Diagnosis • Results • Summary and Future Work

  9. mSWAT Fault Detection • SWAT Detectors: • Low-cost monitors to detect anomalous sw behavior • Incur near-zero perf overhead in fault-free operation • Symptom detectors provide low Silent Data Corruption rate Fatal Traps App Abort Division by zero, RED state, etc. Hangs High OS Kernel Panic Application abort due to fault Simple HW hang detector SWAT firmware High contiguous OS activity OS enters panic State due to fault

  10. Symptom No symptom SWAT Fault Diagnosis • Rollback/replay on same/different core • Single-threaded application on multicore Faulty Good Symptom detected Rollback on faulty core Deterministic s/w or Permanent h/w bug Transient or Non-deterministic s/w bug Rollback/replay on good core Continue Execution No symptom Symptom Permanent h/w fault, needs repair! Deterministic s/w bug, send to s/w layer

  11. Symptom No symptom Challenges • Rollback/replay on same/different core • Single-threaded application on multicore Faulty Good Symptom detected Faulty core is unknown Rollback on faulty core No known good cores available Deterministic s/w or Permanent h/w bug Transient or Non-deterministic s/w bug How to replay multithreaded apps? Rollback/replay on good core Continue Execution No symptom Symptom Permanent h/w fault, needs repair! Deterministic s/w bug, send to s/w layer

  12. Extending SWAT Diagnosis to Multithreaded Apps • Assumptions: In-core faults, single core fault model • Naïve extension – N known good cores to replay the trace • Too expensive – area • Requires full-system deterministic replay • Simple optimization – One spare core • Not scalable, requires N full-system deterministic replays • High hardware overhead – requiresa spare core • Single point of failure – spare core C1 C2 C3 Spare Spare Spare Symptom Detected C1 C2 C3 Symptom Detected Faulty core is C2 No Symptom Detected C1 C2 C3

  13. mSWAT Diagnosis - Key Ideas Multithreaded applications Full-system deterministic replay No known good core Challenges Isolated deterministic replay Emulated TMR Key Ideas A B C D A B C D TA TB TC TD TA TB TC TD TA TA TA

  14. mSWAT Diagnosis - Key Ideas Multithreaded applications Full-system deterministic replay No known good core Challenges Isolated deterministic replay Emulated TMR Key Ideas A B C D A B C D TA TB TC TD TA TB TC TD TA TA TB TA TB

  15. mSWAT Diagnosis - Key Ideas Multithreaded applications Full-system deterministic replay No known good core Challenges Isolated deterministic replay Emulated TMR Key Ideas A B C D A B C D TA TB TC TD TC TA TA TB TC TA TB TC TD TB TA

  16. mSWAT Diagnosis - Key Ideas Multithreaded applications Full-system deterministic replay No known good core Challenges Isolated deterministic replay Emulated TMR Key Ideas A B C D A B C D TA TB TC TD TD TA TC TA TB TD TA TC TB TC TD TB TA Maximum 3 replays

  17. Multicore Fault Diagnosis Algorithm Overview Diagnosis Replay & capture fault activating trace Deterministically replay captured trace Symptom detected A B C D Example TA TB TC TD

  18. Multicore Fault Diagnosis Algorithm Overview Diagnosis Replay & capture fault activating trace Deterministically replay captured trace Symptom detected A B C D A B C D TA TB TC TD Example

  19. Multicore Fault Diagnosis Algorithm Overview Diagnosis Replay & capture fault activating trace Deterministically replay captured trace Symptom detected Look for divergence A B C D A B C D Example TD TA TB TA TC TB TC TD

  20. Multicore Fault Diagnosis Algorithm Overview Diagnosis Replay & capture fault activating trace Deterministically replay captured trace Symptom detected Faulty core Look for divergence Fault-free Cores A B C D A B C D A B C D Example TA TD TA TB TC TB TD TC TA Divergence Divergence Faulty core is A

  21. Digging Deeper What info to capture to enable isolated deterministic replay? How to identify divergence? Replay & capture fault activating trace Isolated deterministic replay Faulty core Symptom detected Look for divergence Hardware costs?

  22. Enabling Isolated Deterministic Replay Thread Ld Ld Ld Input to thread Ld • Recording thread inputs sufficient – similar to BugNet • Record all retiring loads values

  23. Digging Deeper (Contd.) What info to capture to enable isolated deterministic replay? How to identify divergence? Replay & capture fault activating trace Isolated deterministic replay Faulty core Symptom detected Look for divergence Trace Buffer Hardware costs?

  24. Identifying Divergence • Comparing all instructions  Large buffer requirement • Faults corrupt software through memory and control instrns • Capture memory and control instructions Store Thread Load Branch Store

  25. Digging Deeper (Contd.) What info to capture to enable isolated deterministic replay? How to identify divergence? Replay & capture fault activating trace Isolated deterministic replay Faulty core Symptom detected Look for divergence Trace Buffer Hardware costs?

  26. Hardware Costs Minor support for firmware reliability Memory Backed Log Small hardware support Native Execution Firmware Emulation Replay & capture fault activating trace Isolated deterministic replay Faulty core Symptom detected Look for divergence Trace Buffer How Big? What if the faulty core subverts the process? Key Idea: On a divergence two cores take over

  27. Trace Buffer Size • Long detection latency  large trace buffers (8MB/core) • Need to reduce the size requirement  Iterative Diagnosis Algorithm Repeatedly execute on short traces e.g. 100,000 instrns Trace Buffer Replay & capture fault activating trace Isolated deterministic replay Faulty core Symptom detected Look for divergence

  28. Experimental Methodology • Microarchitecture-level fault injection • GEMS timing models + Simics full-system simulation • Six multithreaded applications on OpenSolaris • 4 Multimedia apps and 1 each from SPLASH and PARSEC • 4 core system running 4-threades apps • Faults in latches of 7 arch units • Permanent (stuck-at) and transients faults

  29. Fault 10M instr If no symptom in 10M instr, run to completion Timing simulation Functional simulation Masked or Silent Data Corruption (SDC) Experimental Methodology • Detection: • Metrics: SDC Rate, detection latency • Diagnosis: • Iterative algorithm with 100,000 instrnsin each iteration • Until divergence or 20M instrns • Deterministic replay is native execution • Not firmware emulated • Metrics: Diagnosability, overheads

  30. Results: mSWAT Detection Summary • SDC Rate: Only 0.2% for permanents & 0.55% for transients • Detection Latency: Over 99% detected within 10M instrns Detected

  31. Results: mSWAT Detection Summary 4.5% detected in a good core • SDC Rate: Only 0.2% for permanents & 0.55% for transients • Detection Latency: Over 99% detected within 10M instrns

  32. Results: mSWATDiagnosability 99 99 99 86 100 80 99 95.9 • Over 95% of detected faults are successfully diagnosed • All faults detected in fault-free core are diagnosed • Undiagnosed faults: 88% did not activate faults

  33. Results: mSWAT Diagnosis Overheads • Diagnosis Latency • 98% diagnosed <10 million cycles (10ms in 1GHz system) • 93% were diagnosed in 1 iteration • Iterative approach is effective • Trace Buffer size • 96% require <400KB/core • Trace buffer can easily fit in L2 or L3 cache

  34. mSWAT Summary • Detection: Low SDC rate, detection latency • Diagnosis – identifying the faulty core • Challenges: no known good core, deterministic replay • High diagnosability with low diagnosis latency • Low Hardware overhead - Firmware based implementation • Scalable – maximum 3 replays for any system • Future Work: • Reducing SDCs, detection latency, recovery overheads • Extending to server apps; off-core faults • Validation on FPGAs (w/ Michigan)

More Related