
Missing in Action: Timing Analysis and Soft Error Protection

Detailed analysis of timing considerations and soft error protection in modern systems, with examples from aircraft and space technologies. Discussion on timing analysis challenges and methods for soft error mitigation.



Presentation Transcript


  1. Missing in Action: Timing Analysis and Soft Error Protection
     Frank Mueller
     Center for Embedded Systems Research (CESR)
     Department of Computer Science
     North Carolina State University

  2. Example: A380 Overheat Detection
     • With Hamilton Sundstrand / United Technologies
     • Overall system has 54 sensors
     • When too hot, isolate air channels
       • Close valves over the AFDX network
     • Avoids overheating upon leakage
       • The plane’s hull is hybrid carbon/metal: overheating can burn a hole into it!
     • SW has to adhere to the RTCA DO-178B standard
       • Level A: modified condition/decision coverage (MC/DC), plus branch/decision/statement coverage
       • Level B: branch/decision/statement coverage
       • Level C: statement coverage
     • SW is written as cyclic executives
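
To make the last bullet concrete: a cyclic executive runs a fixed, statically planned table of tasks, frame by frame, and repeats that table forever. The following is only a minimal sketch of the idea, not the A380 software. The task names (`read_sensors`, `check_overheat`, `command_valves`) and the two-frame schedule are hypothetical, and the tasks merely log that they ran.

```python
from typing import Callable, List


def build_log_task(name: str, log: List[str]) -> Callable[[], None]:
    """Return a task that only records that it ran (a stand-in for real work)."""
    def task() -> None:
        log.append(name)
    return task


def run_cyclic_executive(frames: List[List[Callable[[], None]]],
                         major_cycles: int) -> None:
    """Run every minor frame in table order; repeat the table major_cycles times.

    A real executive would wait for a timer tick at each frame boundary and
    flag frame overruns. Timing is omitted here so the sketch is deterministic.
    """
    for _ in range(major_cycles):
        for frame in frames:
            for task in frame:
                task()


log: List[str] = []
read_sensors = build_log_task("read_sensors", log)
check_overheat = build_log_task("check_overheat", log)
command_valves = build_log_task("command_valves", log)

# Hypothetical schedule: sensors are sampled in every minor frame,
# valves are commanded only in the second one.
FRAMES = [
    [read_sensors, check_overheat],
    [read_sensors, check_overheat, command_valves],
]

run_cyclic_executive(FRAMES, major_cycles=2)
```

Because the order of execution is fixed at design time, each frame's total execution time can be checked against the frame length, which is where WCET bounds for the individual tasks come in.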

  3. Requirements
     • SW standard requirements, some examples:
       • All switch statements must have a default case
       • Single-entry, single-exit functions only
       • Strict type checking required
     • SW certification requirements
       • Qualified tools to check for adherence to the standard
       • Simulation environment for testing functionality
       • Explicit tests for every low-level requirement
       • Programmer independence
     • New: timing guarantees (required by Airbus!)
       • Worst-case execution time (WCET) analysis

  4. Missing in Action 1: Timing Analysis
     • WCET: worst-case execution time
       • Needed for schedulability analysis
     • WCET bounds: determined by timing analysis
       • Should be safe and tight
       • Derived by tools: only semi-automated, small programs
       • Restrictions: loop bounds, no heap, no function pointers
       • Predictable architecture
     • Problems:
       • WCET >> actual execution time → under-utilization
       • Complexity wall:
         • Timing analysis tools lag behind architectural innovation
         • Not getting closer (maybe even losing ground)
     • Tools and methods lag behind → what to do?
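
One way to see why WCET bounds must be both safe and tight is to feed them into a schedulability test. The sketch below uses classic response-time analysis for fixed-priority preemptive tasks, which is an assumption made for illustration; the slides do not say which schedulability test is used. The task sets `tight` and `pessimistic` are made-up numbers, with deadlines assumed equal to periods.

```python
from typing import List, Optional, Tuple

# Each task is (wcet, period) in the same integer time unit.
# List order is priority order, highest priority first.
Task = Tuple[int, int]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def response_time(tasks: List[Task], i: int) -> Optional[int]:
    """Worst-case response time of tasks[i], or None if it exceeds the period.

    Iterates R = C_i + sum over higher-priority j of ceil(R / T_j) * C_j
    until R stops changing or passes the deadline (assumed equal to T_i).
    """
    wcet, period = tasks[i]
    r = wcet
    while True:
        interference = sum(ceil_div(r, t_j) * c_j for c_j, t_j in tasks[:i])
        r_next = wcet + interference
        if r_next > period:
            return None
        if r_next == r:
            return r
        r = r_next


# Hypothetical task set with tight WCET bounds: every task meets its deadline.
tight = [(1, 4), (2, 6), (3, 12)]

# Same tasks with WCET bounds overestimated by a factor of two:
# the analysis now rejects a system that would run fine in practice.
pessimistic = [(2, 4), (4, 6), (6, 12)]

tight_results = [response_time(tight, i) for i in range(len(tight))]
pessimistic_results = [response_time(pessimistic, i) for i in range(len(pessimistic))]
```

With these numbers the tight set passes, while the pessimistic set fails from the second task onward. The processor would have to be over-provisioned to pass the test, which is the under-utilization the slide refers to.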

  5. Timing Analysis: Status Quo and Needs
     • Capabilities of static timing analysis
       • In-order scalar pipeline, static branch prediction, split I/D caches
     • Contemporary processors
       • Out-of-order, multiple issue, dynamic branch prediction, multi-level caches, deep speculation, etc.
     • Analyzability is fundamental to the design of safe systems
       • Excludes contemporary microarchitectures
     • Long-term implications
       • Complexity wall → need new methods for timing analysis
     • Promote a hybrid HW/SW solution
       • Timings on the actual processor in a special execution mode
       • Steer execution through SW → realistic! (ARM)
     • Rigorous methodology and tools needed!

  6. Another Failure: Single Event Upset
     • Radiation from space due to solar flares can cause bit flips
       • Heavy ion strikes a flip-flop, RAM, …
       • Issue in the higher atmosphere → planes flying over the poles
     • Typically sufficient to consider a single (bit) event upset (SEU)
       • Multiple-bit upsets are statistically too rare to care about
     • Also caused by smaller fabrication sizes → smaller noise ratios → errors
     • Protect RAM with ECC
       • Caches/processors unprotected
       • Unless radiation hardened → expensive
     • Examples: solar flares
       • Many failed servers in 1999
       • Nozomi Mars probe rendered inoperable
     • IBM has built-in checks for 80% of server-chip circuits
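
The single-bit assumption is what makes ECC practical: a code that can locate one flipped bit per word is enough to repair an SEU. As a small illustration, here is a Hamming(7,4) code that corrects any single bit flip in a 7-bit codeword. It is a textbook sketch only; real memories typically use wider codes, and nothing here reflects the actual ECC hardware mentioned in the slides.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions 1..7 hold p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(code: List[int], position: int) -> List[int]:
    """Return a copy of code with the bit at 0-based position inverted (a simulated SEU)."""
    upset = list(code)
    upset[position] ^= 1
    return upset


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
recovered = hamming74_decode(flip(stored, 4))  # one bit flipped while stored
```

Two flips in the same codeword would be miscorrected by this code, which is why the "multiple bits are too rare" assumption matters.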

  7. SEU on the Airbus 380
     • Uses PowerPC 750CXe
       • Off-the-shelf
       • RAM has ECC
       • L2 has ECC but L1 does not
       • No protection against SEU in the processor core
     • Options:
       • Do not use L1, and make a best effort to “code against” SEU
       • Use EDDI: error detection by duplicating instructions
         • But who wants to pay the overhead?
     • Selective use of fault (SEU) resilient development techniques
       • Pure software or hybrid (minimal HW support + SW)
       • Protection only where needed in code
     • Rigorous methodology and tools needed!
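
The idea behind EDDI can be illustrated at a much coarser grain than real EDDI, which duplicates individual machine instructions on shadow registers inside the compiler. The sketch below duplicates a whole computation and compares the two results before the value is used. The function names and the fault-injection parameter `fault_bit` are hypothetical and exist only to simulate an upset.

```python
from typing import Optional


class SoftErrorDetected(Exception):
    """Raised when the primary and shadow results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Invert one bit of value (a simulated single event upset)."""
    return value ^ (1 << bit)


def scaled_sum(a: int, b: int, fault_bit: Optional[int] = None) -> int:
    """Compute 3 * (a + b); optionally corrupt the intermediate sum."""
    s = a + b
    if fault_bit is not None:
        s = flip_bit(s, fault_bit)
    return 3 * s


def checked_scaled_sum(a: int, b: int, fault_bit: Optional[int] = None) -> int:
    """Run the computation twice and compare before releasing the result.

    The simulated fault hits only the primary copy, mirroring the assumption
    that a single upset corrupts one of the two redundant computations.
    """
    primary = scaled_sum(a, b, fault_bit)
    shadow = scaled_sum(a, b)
    if primary != shadow:
        raise SoftErrorDetected(f"primary={primary}, shadow={shadow}")
    return primary
```

Every protected computation runs twice and adds a comparison, which is the overhead the slide asks about. Applying the check only to critical values is the selective approach.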

  8. Conclusion
     • Off-the-shelf processors everywhere
       • Airbus 380, Boeing 787
       • Automotive industry (waking up!)
     • Lack of predictability and protection
     • New methods for timing analysis
       • Increasing complexity gap
       • Promote a hybrid HW/SW solution
       • Timings on the actual processor in a special execution mode
       • Steer execution through SW → realistic! (ARM)
     • New methods for soft error protection
       • Either pure software or hybrid (min. HW + SW)
       • Fault (SEU) resilient software development, selective
     • Missing in action: methods and tools needed today / yesterday!
