Detailed analysis of timing considerations and soft error protection in modern systems, with examples from aircraft and space technologies. Discussion on timing analysis challenges and methods for soft error mitigation.
Missing in Action: Timing Analysis and Soft Error Protection
Frank Mueller
Center for Embedded Systems Research (CESR)
Department of Computer Science
North Carolina State University
Example: A380 Overheat Detection
• With Hamilton Sundstrand / United Technologies
• Overall system has 54 sensors
• When too hot, isolate air channels
  • Close valves over AFDX network
  • Avoids overheating upon leakage
  • Plane's hull is hybrid carbon/metal → a leak can burn a hole into it!
• SW has to adhere to the RTCA DO-178B standard
  • Level A: modified condition/decision (MC/DC), branch/decision/statement coverage
  • Level B: branch/decision/statement coverage
  • Level C: statement coverage
• SW is written as cyclic executives
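The last bullet can be made concrete with a small sketch of a cyclic executive: a fixed table of minor frames that is walked in the same order on every major cycle, with no dynamic scheduling. This is only an illustration. The threshold `OVERHEAT_LIMIT_C`, the task names, and the `state` dictionary are assumptions made up for the example; only the sensor count of 54 comes from the slide, and closing a valve is modeled as adding an index to a set rather than sending an AFDX message.

```python
from typing import Callable, Dict, List

NUM_SENSORS = 54            # sensor count stated on the slide
OVERHEAT_LIMIT_C = 120.0    # hypothetical threshold, not from the slides

State = Dict[str, object]

def sample(state: State) -> None:
    """Frame 0: take one reading per sensor (here: copy the injected raw values)."""
    raw = state["raw"]
    if len(raw) != NUM_SENSORS:
        raise ValueError("expected one reading per sensor")
    state["temps"] = list(raw)

def detect(state: State) -> None:
    """Frame 1: find the sensors at or above the limit."""
    state["hot"] = [i for i, t in enumerate(state["temps"]) if t >= OVERHEAT_LIMIT_C]

def actuate(state: State) -> None:
    """Frame 2: isolate the affected channels (stand-in for a valve command)."""
    state["closed_valves"].update(state["hot"])

# Static schedule: one list of tasks per minor frame, fixed at design time.
SCHEDULE: List[List[Callable[[State], None]]] = [[sample], [detect], [actuate]]

def run_major_cycle(state: State) -> None:
    """Execute every minor frame once, always in the same order."""
    for frame in SCHEDULE:
        for task in frame:
            task(state)
```

Because the order and number of tasks per frame never change, the worst-case time of each frame is simply the sum of the WCETs of its tasks, which is what makes this structure attractive for certification.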
Requirements
• SW standard requirements, some examples:
  • All switch statements must have a default case
  • Single-entry, single-exit functions only
  • Strict type checking required
• SW certification requirements
  • Qualified tools to check for adherence to standard
  • Simulation environment for testing functionality
  • Explicit tests for every low-level requirement
  • Programmer independence
• New: timing guarantees (required by Airbus!) → worst-case execution time (WCET) analysis
Missing in Action 1: Timing Analysis
• WCET: worst-case execution time
  • needed for schedulability analysis
• WCET bounds: determined by timing analysis
  • should be safe and tight
  • derived by tools: only semi-automated, small programs
  • restrictions: loop bounds, no heap, no function pointers
  • predictable architecture
• Problems:
  • WCET >> actual execution time → under-utilization
  • Complexity wall:
    • timing analysis tools lagging behind architectural innovation
    • not getting closer (maybe even losing ground)
• Tools and methods lag behind → what to do?
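To illustrate why WCET bounds feed schedulability analysis, and why loose bounds cause under-utilization, here is a minimal sketch of one common test: classical response-time analysis for fixed-priority periodic tasks. The slide does not name a particular schedulability test, so this choice, the function name `response_times`, and the task numbers are assumptions for illustration only. Each task is a `(wcet, period)` pair with deadline equal to period, listed from highest to lowest priority.

```python
from typing import List, Optional, Tuple

Task = Tuple[int, int]  # (wcet, period); deadline == period; highest priority first

def response_times(tasks: List[Task]) -> Optional[List[int]]:
    """Return the worst-case response time of each task, or None if any
    task can miss its deadline under fixed-priority preemptive scheduling."""
    results = []
    for i, (c_i, t_i) in enumerate(tasks):
        r = c_i
        while True:
            # Interference from all higher-priority tasks released within r.
            # -(-r // t_j) is an integer ceiling of r / t_j.
            interference = sum(-(-r // t_j) * c_j for c_j, t_j in tasks[:i])
            r_next = c_i + interference
            if r_next > t_i:
                return None      # deadline exceeded: not schedulable
            if r_next == r:
                break            # fixed point reached
            r = r_next
        results.append(r)
    return results

# Tight WCET bounds: the task set passes.
tight = response_times([(1, 4), (2, 6), (3, 12)])
# The same tasks with WCET bounds overestimated by 2x: the analysis rejects them.
loose = response_times([(2, 4), (4, 6), (6, 12)])
```

With the tight bounds the analysis accepts the task set; with every bound doubled it rejects the same software, even though the actual execution times have not changed. That is the under-utilization problem in miniature.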
Timing Analysis: Status Quo and Needs
• Capabilities of static timing analysis
  • In-order scalar pipeline, static branch prediction, split I/D caches
• Contemporary processors
  • Out-of-order, multiple issue, dynamic branch prediction, multi-level caches, deep speculation, etc.
• Analyzability fundamental to design of safe systems
  • excludes contemporary microarchitectures
• Long-term implications
  • Complexity wall → need new methods for timing analysis
• Promote hybrid HW/SW solution
  • Timings on actual processor in special execution mode
  • Steer execution through SW → realistic! (ARM)
• Rigorous methodology and tools needed!
Another Failure: Single Event Upset
• Radiation from space due to solar flares can cause bit flips
  • Heavy ion strikes flip-flop, RAM, …
  • Issue in the upper atmosphere → planes flying over the poles
• Typically sufficient to consider single (bit) event upset (SEU)
  • Multiple-bit upsets statistically too rare to care for
• Also caused by smaller fabs → smaller noise ratios → errors
• Protect RAM w/ ECC
• Caches/processors unprotected
  • Unless radiation hardened → expensive
• Examples: solar flares
  • Many failed servers in 1999
  • Nozomi Mars Probe rendered inoperable
• IBM has built-in checks for 80% of server-chip circuits
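As a rough illustration of how ECC can mask a single bit flip, the following sketch implements a textbook Hamming(7,4) code: 4 data bits are stored with 3 parity bits, and any single flipped bit in the 7-bit word can be located and corrected. This is a toy, not the scheme used in any particular memory; real ECC RAM typically protects much wider words and also detects double-bit errors. The function names are made up for the example.

```python
from typing import Tuple

def encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit codeword (bit k of the result is position k+1)."""
    bits = [0] * 8                                   # positions 1..7 are used
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> k) & 1 for k in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]            # parity over positions 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]            # parity over positions 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]            # parity over positions 4,5,6,7
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))

def syndrome(word: int) -> int:
    """XOR of the positions of all set bits: 0 for a valid codeword,
    otherwise the position (1..7) of the single flipped bit."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s

def decode(word: int) -> Tuple[int, int]:
    """Correct at most one flipped bit; return (data nibble, syndrome)."""
    s = syndrome(word)
    if s:
        word ^= 1 << (s - 1)                         # flip the faulty bit back
    data = [(word >> (pos - 1)) & 1 for pos in (3, 5, 6, 7)]
    return sum(b << k for k, b in enumerate(data)), s

# Simulate an SEU: flip the bit at position 5 of a stored codeword, then read it back.
stored = encode(0b1011) ^ (1 << 4)
recovered, where = decode(stored)
```

The code corrects exactly one flipped bit per word, which matches the slide's assumption that single-bit upsets are the case worth designing for.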
SEU on the Airbus 380
• Uses PowerPC 750CXe
  • Off-the-shelf
  • RAM has ECC
  • L2 has ECC but L1 does not
  • No protection against SEU in processor core
• Options:
  • Do not use L1 and make a best effort to "code against" SEU
  • Use EDDI: error detection by duplicating instructions
  • But who wants to pay the overhead?
• Selective use of fault (SEU) resilient development techniques
  • Pure software or hybrid (minimal HW support + SW)
  • Protection only where needed in code
• Rigorous methodology and tools needed!
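The idea behind EDDI can be sketched at source level: every computation is carried out twice on separate copies, and the two results are compared before the value leaves the protected region. Real EDDI works on machine instructions and registers inside the compiler, so the Python below is only an analogy. The function `checksum`, the `flip` fault-injection parameter, and the exception name are all hypothetical and exist only to show the mechanism and its cost.

```python
from typing import Iterable, Optional, Tuple

class SoftErrorDetected(Exception):
    """Raised when the master and shadow computations disagree."""

def checksum(data: Iterable[int], flip: Optional[Tuple[int, int]] = None) -> int:
    """32-bit additive checksum computed twice (master + shadow copy).

    flip=(i, k) simulates an SEU: after processing element i, bit k of the
    master accumulator is inverted. The shadow copy is left untouched.
    """
    acc = 0
    acc_shadow = 0
    for i, value in enumerate(data):
        acc = (acc + value) & 0xFFFFFFFF                  # original instruction
        acc_shadow = (acc_shadow + value) & 0xFFFFFFFF    # duplicated instruction
        if flip is not None and flip[0] == i:
            acc ^= 1 << flip[1]                           # injected bit flip
    if acc != acc_shadow:                                 # check before the result is used
        raise SoftErrorDetected("master and shadow results differ")
    return acc
```

Every arithmetic step is executed twice and an extra comparison is added, which is the overhead the slide asks about. Applying the duplication only to the code regions that need it is what the selective approach on this slide refers to.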
Conclusion
• Off-the-shelf processors everywhere
  • Airbus 380, Boeing 787
  • Automotive industry (waking up!)
• Lack of predictability and protection
• New methods for timing analysis
  • Increasing complexity gap
  • Promote hybrid HW/SW solution
  • Timings on actual processor in special execution mode
  • Steer execution through SW → realistic! (ARM)
• New methods for soft error protection
  • Either pure software or hybrid (min. HW + SW)
  • Fault (SEU) resilient software development, selective
• Missing in action: methods and tools needed today / yesterday!