  1. Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design. M. Li, P. Ramachandran, S.K. Sahoo, S.V. Adve, V.S. Adve, Y. Zhou (UIUC), ASPLOS’08. Shimin Chen, LBA Reading Group Presentation

  2. Introduction • Hardware reliability threats: • Aging/wear-out • Infant mortality (insufficient burn-in) • Soft errors (radiation) • Design defects • Industry is willing to pay ~10% area overhead for reliability (panel discussion at SELSE II) • Conventional dual modular redundancy is too costly • How can we do better?

  3. Two Observations • Only observable device faults need to be handled • i.e., faults that propagate through higher levels of the system and become observable to software • Fault-free operation is the common case and must be optimized • Increased overhead is acceptable after a fault is detected

  4. Proposal: Cooperative HW-SW Approach • Detect high-level anomalous SW behavior (symptoms of faults) • Checkpoint/replay plus diagnosis components • (Mission-critical systems may additionally incorporate previously proposed detection techniques as backup)

  5. Potential Advantages • Generality: oblivious to the many failure mechanisms and microarchitectures • Ignores masked faults • Optimizes for the common case • Customizability: the action to take upon a fault can be chosen per system • Amortizes overhead across other system functions, e.g., reusing online SW bug detection support

  6. Investigation in This Paper • Questions to answer: • Coverage: which HW faults produce detectable anomalous SW behavior with high probability? • Latency: what is the fault detection latency? • Impact on OS: how frequently is OS state corrupted by HW faults, and what are detection coverage and latency for such faults? • Focus on permanent faults (increasingly important) • Methodology: fault-injection study using simulation

  7. Major Results • Detection coverage: most permanent faults that propagate to SW are easily detectable • Detection latency: <= 100K instructions in 86% of cases • Impact on OS: HW faults often corrupt OS state

  8. Outline • SWAT System Assumptions • Methodology • Results • Implications for Resilient System Design

  9. SWAT (SoftWare Anomaly Treatment) The investigation assumes the following context: • Always-on SW symptom-based detection • A multicore system with at least one fault-free core • A checkpoint/replay mechanism: replay when a fault is detected • If the anomalous behavior is deterministic, it is a HW fault; recover using a fault-free core • Otherwise, ignore it (transient fault) • HW has the ability to repair or reconfigure around permanent faults • Firmware-controlled diagnosis and recovery prevent HW errors from becoming externally visible

  10. Outline • SWAT System Assumptions • Methodology • Results • Implications for Resilient System Design

  11. Simulation Environment • Virtutech Simics + Wisconsin GEMS microarchitectural and memory timing simulators • SPARC V9 ISA; 6 SPECint2000 and 4 SPECfp2000 benchmarks • OS activity < 1% in fault-free runs

  12. Fault Injection • Timing-first approach in GEMS: • Cycle-accurate GEMS timing simulator + Simics functional simulator • After each instruction, GEMS architectural state is compared with Simics and, on a mismatch, reset from Simics (so GEMS can skip support for some rare instructions) • Fault injection: • Inject the fault into the GEMS timing simulator • If a state mismatch is due to the injected fault, corrupt the Simics state instead (so the faulty execution is followed) • Activated fault vs. architecturally masked fault: did GEMS state ever mismatch Simics state? • OS or user mode? Check the privilege mode at activation
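
A minimal sketch of this timing-first check, assuming hypothetical gems/simics wrapper objects with step(), arch_state(), and set_arch_state() methods (illustrative stand-ins, not the real simulator APIs):

```python
def step_and_check(gems, simics, fault_injected):
    gems.step()     # detailed timing simulation (the fault is injected here)
    simics.step()   # golden functional simulation
    if gems.arch_state() == simics.arch_state():
        return "match"                      # fault not activated (or masked so far)
    if fault_injected:
        # Mismatch caused by the injected fault: the fault is "activated".
        # Corrupt the Simics state so it follows the faulty execution.
        simics.set_arch_state(gems.arch_state())
        return "activated"
    # Mismatch with no fault present: GEMS lacks support for a rare
    # instruction, so GEMS is reset from Simics (normal timing-first behavior).
    gems.set_arch_state(simics.arch_state())
    return "functional-fixup"
```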

  13. Fault Model: Permanent Faults • Stuck-at-0: a bit always reads 0 • Stuck-at-1: a bit always reads 1 • Dominant-0: acts like a logical AND between two adjacent (bridged) faulty bits • Dominant-1: acts like a logical OR between two adjacent (bridged) faulty bits • Dominant-0/1 are also known as bridging faults
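
A minimal sketch of how the four fault models could be applied to one bit position of a value; apply_fault and its arguments are illustrative, not the paper's injection code:

```python
def apply_fault(value, bit, model):
    mask = 1 << bit
    if model == "stuck-at-0":       # the faulty bit always reads 0
        return value & ~mask
    if model == "stuck-at-1":       # the faulty bit always reads 1
        return value | mask
    lo = (value >> bit) & 1         # the two bridged adjacent bits
    hi = (value >> (bit + 1)) & 1
    if model == "dominant-0":       # both bridged bits read their logical AND
        b = lo & hi
    elif model == "dominant-1":     # both bridged bits read their logical OR
        b = lo | hi
    else:
        raise ValueError(model)
    value = (value & ~mask) | (b << bit)
    value = (value & ~(mask << 1)) | (b << (bit + 1))
    return value
```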

  14. Number of Injected Faults • 10 benchmarks • 40 random points per benchmark after initialization • 4 fault models • 8 micro-architectural structures • Total = 10 x 40 x 4 x 8 = 12800

  15. Fault Detection • Run 10 million instructions with detailed simulation • If no SW symptom is detected, run fast simulation to finish the benchmark and check for corruption
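
A minimal sketch of the per-injection experiment flow, assuming a hypothetical sim object and detector interface; comparing against a golden run to identify silent data corruption is also an assumption of this sketch:

```python
def run_injection(sim, detectors, window=10_000_000):
    for n in range(window):
        event = sim.retire_one()            # detailed (timing) simulation
        for d in detectors:
            if d.check(event):              # fatal trap, abort, hang, High-OS, ...
                return ("detected", d.name, n)
    sim.fast_forward_to_end()               # functional simulation only
    if sim.output_differs_from_golden():
        return ("silent-data-corruption", None, None)
    return ("masked", None, None)
```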

  16. Fatal HW Traps • Traps that are typically not thrown during a correct execution; on SPARC: • Data access exception • Division by zero • Illegal instruction • Memory address not aligned • RED state (Reset, Error, and Debug; entered after too many nested traps) • Watchdog reset (no instruction retires in 65,536 cycles)

  17. Abnormal Application Exit • The application may hit a segmentation fault or an assertion failure • The OS knows the exit status • In simulation, reaching the OS idle loop is used as the indication of such an exit

  18. Hangs • Within the 10 million detailed instructions: • Keep a counter per observed branch PC and increment it each time that branch executes • If any counter exceeds 100,000 (1% of the total instructions), flag a hang • Fault-free executions are profiled to mask out the handful of branches that exceed this threshold even without a fault
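
A minimal sketch of this hang heuristic; HangDetector, its interface, and the whitelist argument are illustrative names:

```python
from collections import Counter

HANG_THRESHOLD = 100_000            # 1% of the 10M-instruction window

class HangDetector:
    def __init__(self, whitelist=frozenset()):
        self.counts = Counter()
        self.whitelist = whitelist  # branches that exceed the threshold even fault-free

    def on_branch(self, pc):
        """Call once per executed branch; returns True if a hang is flagged."""
        self.counts[pc] += 1
        return pc not in self.whitelist and self.counts[pc] > HANG_THRESHOLD
```

In practice, the whitelist would come from the fault-free profiling runs mentioned above.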

  19. High OS Activity • Symptom: execution stays in the OS for an unusually long stretch • Typically control returns to user mode after a few tens of OS instructions, except for: • Timer interrupts after a quantum expires (< 10,000 instructions) • System calls (can run 100K to 1 million instructions) • Detection threshold: over 30,000 contiguous OS instructions that are not part of a system call
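
A minimal sketch of this High-OS symptom detector; HighOSDetector and its interface are illustrative names:

```python
HIGH_OS_THRESHOLD = 30_000

class HighOSDetector:
    def __init__(self):
        self.run = 0                 # length of the current contiguous OS run
        self.in_syscall = False

    def on_retire(self, privileged, syscall_entry=False, syscall_return=False):
        """Call once per retired instruction; returns True if High-OS is flagged."""
        if syscall_entry:
            self.in_syscall = True
        if syscall_return:
            self.in_syscall = False
        if privileged and not self.in_syscall:
            self.run += 1            # count contiguous OS instructions outside syscalls
        else:
            self.run = 0             # any user-mode instruction resets the run
        return self.run > HIGH_OS_THRESHOLD
```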

  20. Metrics • Coverage: which injected faults are detected, and which are masked (either architecturally or by the application) • Detection latency: total number of instructions retired from the first architectural state corruption until the fault is detected, within the 10M-instruction window
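
A minimal sketch of the latency metric; the function and argument names are illustrative:

```python
def detection_latency(first_corruption_instr, detection_instr, window=10_000_000):
    if first_corruption_instr is None:
        return None        # architectural state was never corrupted (masked fault)
    if detection_instr is None or detection_instr > window:
        return None        # not detected within the 10M-instruction window
    # Instructions retired between first architectural corruption and detection.
    return detection_instr - first_corruption_instr
```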

  21. Outline • SWAT System Assumptions • Methodology • Results • Implications for Resilient System Design

  22. How do faults manifest in SW? [Figure: breakdown of injected-fault outcomes; FPU excluded]

  23. What are masked faults? • Stuck-at faults: • Register file: fault in an unused physical register • RAT: fault in an unused logical register • FPU: faults masked in integer benchmarks • Bridging faults: • Faults in the upper 32 bits of 64-bit operations • These bits are often sign extensions (all 1s or all 0s) because SW mostly uses small data values

  24. Large number of detections in the OS • Although OS activity is low, over 65% of detected faults are detected through symptoms from OS execution • Why? • A fault in user mode often produces a memory access to a cold address, invoking a TLB miss • SPARC TLB misses are software-managed, so the OS trap handler runs on the same faulty HW • OS code is more control- and memory-intensive • These faults often result in corrupted OS state

  25. Fatal HW Traps • Illegal instruction traps: opcode bit changes produce an illegal opcode (decoder faults) • Watchdog timer reset: no instruction retires for over 65,536 cycles • ROB or RAT errors corrupt register source/destination dependences, leading to an indefinite wait • Misaligned accesses: memory addresses are wrong • RED state exception: more than 4 nested traps

  26. High-OS • Typical case: the OS trap handler for a TLB miss • A permanent HW fault corrupts the TLB miss handling code, so control never returns to user mode • Significant overlap with fatal traps and hangs: High-OS detects 30% of the faults, yet removing it reduces coverage by 15% • Many of these cases eventually lead to fatal traps or hangs, but detecting High-OS reduces latency

  27. Others • Application aborts: 1% coverage • Hangs: 3% coverage, mostly in the application, because OS hangs are usually detected first as High-OS • E.g., a corrupted loop index variable makes a loop never terminate

  28. Undetected Faults • For all structures except the FPU, 0.8% of injected faults result in silent data corruption • FPU: 10% of faults result in silent data corruption • Why? FPU results rarely affect memory addresses or program control flow

  29. Which SW components are corrupted? • The OS needs to be checkpointed as well • "None" case: a watchdog reset trap fires while the instruction at the head of the ROB is blocked, so no SW state is corrupted before detection

  30. Detection latency • Measured from application state corruption • Measured from OS state corruption

  31. Latency from App State Corruption • Some combination of SW and HW checkpointing schemes is needed

  32. Latency from OS State Corruption • HW checkpointing schemes may be sufficient

  33. Transient Faults Have Different Characteristics • 94% are architecturally masked within the 10M-instruction window • 3.4% are detected in the 10M window • 1.2% are masked by the application • 1.3% eventually result in detectable symptoms • Only 0.1% of the total injections result in silent data corruption

  34. Outline • SWAT System Assumptions • Methodology • Results • Implications for Resilient System Design

  35. Detection • A majority of permanent faults that propagate to SW are detectable through low-cost monitoring of simple symptoms • Preliminary experiments show that value-based invariants can significantly improve latency and coverage • FPU faults may need additional HW detection mechanisms

  36. Recovery • OS recovery is necessary • HW recovery mechanisms (e.g. ReVive, SafetyNet) may be sufficient • Application recovery requires SW checkpoints
