Missing in Action: Timing Analysis and Soft Error Protection • Frank Mueller • Center for Embedded Systems Research (CESR), Department of Computer Science, North Carolina State University
Example: A380 Overheat Detection • w/ Hamilton Sundstrand/United Technologies • Overall system has 54 sensors • When too hot, isolate air channels • Close valves over AFDX network • Avoids overheating upon leakage • plane’s hull is hybrid carbon/metal → can burn a hole into it! • SW has to adhere to RTCA DO-178B standard • Level A: modified condition/decision coverage (MC/DC), plus branch/decision/statement coverage • Level B: branch/decision/statement coverage • Level C: statement coverage • SW is written as cyclic executives
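The gap between the coverage levels is easiest to see on a single decision. The following is a minimal sketch, assuming Level A is read as unique-cause MC/DC: every condition must be shown to flip the outcome on its own while the other conditions stay fixed. The decision `close_valves` and its three conditions are hypothetical and not taken from the A380 software.

```python
from itertools import combinations

def close_valves(too_hot, leak_detected, maintenance_mode):
    # Hypothetical decision with three conditions (assumption for illustration).
    return (too_hot or leak_detected) and not maintenance_mode

def mcdc_covered(decision, tests):
    """Return the indices of conditions shown to independently flip the outcome.

    A pair of test vectors demonstrates independence of condition k if the
    vectors differ only in condition k and the decision outcome differs.
    """
    covered = set()
    for a, b in combinations(tests, 2):
        differing = [k for k in range(len(a)) if a[k] != b[k]]
        if len(differing) == 1 and decision(*a) != decision(*b):
            covered.add(differing[0])
    return covered

T, F = True, False
# Two tests are enough for decision coverage: the outcome is True once, False once.
decision_tests = [(T, F, F), (F, F, F)]
# MC/DC needs two more vectors here (n + 1 = 4 tests for n = 3 conditions).
mcdc_tests = decision_tests + [(F, T, F), (T, F, T)]

print(mcdc_covered(close_valves, decision_tests))
print(mcdc_covered(close_valves, mcdc_tests))
```

With only the two decision-coverage tests, just the first condition is shown to matter; the four-test set demonstrates all three.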
Requirements • SW standard requirements – some examples: • All switch statements must have a default case • Single-entry, single-exit functions only • Strict type checking required • SW certification requirements • Qualified tools to check for adherence to standard • Simulation environment for testing functionality • Explicit tests for every low-level requirement • Programmer independence • New: Timing guarantees (required by Airbus!) → worst-case execution time (WCET) analysis
Missing in Action 1: Timing Analysis • WCET: Worst-case execution time • needed for schedulability analysis • WCET bounds: determined by timing analysis • should be safe and tight • derived by tools: only semi-automated, small programs • restrictions: loop bounds, no heap, no function pointers • predictable architecture • Problems: • WCET >> actual execution time → under-utilization • Complexity wall: • timing analysis tools lagging behind architectural innovation • not getting closer (maybe even losing ground) • Tools and methods lag behind → What to do?
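For a cyclic executive, schedulability analysis reduces to checking that the WCET bounds of the tasks placed in each minor frame fit into the frame. The sketch below shows that check and how a loose bound translates into under-utilization. Task names, frame length, and all timing values are made-up assumptions, not measurements from the overheat detection system.

```python
def frame_loads(schedule, time_us):
    """Sum the per-task times (in microseconds) for every minor frame."""
    return [sum(time_us[task] for task in frame) for frame in schedule]

def is_schedulable(schedule, wcet_us, frame_us):
    """A cyclic executive is schedulable if no minor frame can overrun."""
    return all(load <= frame_us for load in frame_loads(schedule, wcet_us))

def utilization(schedule, time_us, frame_us):
    """Fraction of the major cycle that is occupied by task execution."""
    return sum(frame_loads(schedule, time_us)) / (len(schedule) * frame_us)

FRAME_US = 1000  # hypothetical minor frame length
schedule = [     # hypothetical major cycle of four minor frames
    ["read_sensors", "detect_overheat"],
    ["read_sensors", "command_valves"],
    ["read_sensors", "detect_overheat"],
    ["read_sensors", "self_test"],
]
wcet_us = {"read_sensors": 400, "detect_overheat": 500,
           "command_valves": 300, "self_test": 600}      # analysis bounds
observed_us = {"read_sensors": 100, "detect_overheat": 150,
               "command_valves": 80, "self_test": 200}   # typical runs

print(is_schedulable(schedule, wcet_us, FRAME_US))
print(utilization(schedule, wcet_us, FRAME_US))      # reserved share
print(utilization(schedule, observed_us, FRAME_US))  # share actually used
```

In this made-up case the frames are sized for 87.5% utilization while typical runs use about a quarter of the processor. A slightly looser bound, say 700 µs for `detect_overheat`, would make the first frame fail the check even though the observed times fit easily.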
Timing Analysis: Status Quo and Needs • Capabilities of static timing analysis • In-order scalar pipeline, static branch prediction, split I/D caches • Contemporary processors • Out-of-order, multiple issue, dynamic branch prediction, multi-level caches, deep speculation, etc. • Analyzability fundamental to design of safe systems • excludes contemporary microarchitectures • Long-term implications • Complexity wall → need new methods for timing analysis • Promote hybrid HW/SW solution • Timings on actual processor in special execution mode • Steer execution through SW → realistic! (ARM) • Rigorous methodology and tools needed!
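To illustrate why static analysis depends on loop bounds and a simple pipeline, here is a minimal sketch of a tree-based bound computation (a timing schema). It assumes every basic block has a fixed cycle cost, which is only plausible for an in-order pipeline with no cache or speculation effects; real tools model far more. The node classes and the cycle counts are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Block:          # straight-line code with a fixed cycle cost
    cycles: int

@dataclass
class Seq:            # parts executed one after another
    parts: list

@dataclass
class Branch:         # if/else: the condition plus one of two arms
    cond: int
    then: object
    orelse: object

@dataclass
class Loop:           # loop with a statically known iteration bound
    cond: int
    body: object
    bound: int

def wcet(node):
    """Upper bound on execution cycles, computed bottom-up over the tree."""
    if isinstance(node, Block):
        return node.cycles
    if isinstance(node, Seq):
        return sum(wcet(part) for part in node.parts)
    if isinstance(node, Branch):
        # Safe but possibly pessimistic: always charge the longer arm.
        return node.cond + max(wcet(node.then), wcet(node.orelse))
    if isinstance(node, Loop):
        # 'bound' iterations, plus the final failing loop test.
        return node.bound * (node.cond + wcet(node.body)) + node.cond
    raise TypeError(f"unknown node: {node!r}")

# Hypothetical task: poll 54 sensors, taking a slow path when one is too hot.
poll = Seq([
    Block(10),
    Loop(cond=2, bound=54,
         body=Branch(cond=1, then=Block(20), orelse=Block(5))),
    Block(8),
])
print(wcet(poll))  # 10 + (54 * (2 + 1 + 20) + 2) + 8 = 1262
```

Without the `bound` field there is no finite result, which is why loop bounds must be supplied. The bound is also only tight if the slow path really can be taken on every iteration.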
Another Failure: Single Event Upset • Radiation from space, e.g. due to solar flares, can cause bit flips • Heavy ion strikes flip-flop, RAM, … • Issue in the higher atmosphere → planes flying over the poles • Typically sufficient to consider a single (bit) event upset (SEU) • Multiple bits statistically too rare to care for • Also caused by smaller fabs (feature sizes) → smaller noise ratios → errors • Protect RAM w/ ECC • Caches/processors unprotected • Unless radiation hardened → expensive • Examples: solar flares • Many failed servers in 1999 • Nozomi Mars Probe rendered inoperable • IBM has built-in checks for 80% of server-chip circuits
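How ECC handles a single bit flip can be shown with the textbook Hamming(7,4) code: three parity bits protect four data bits, and the syndrome of a received word gives the position of the flipped bit. This is only a sketch of the principle. Memory ECC typically uses wider codes with an extra parity bit for double-error detection, and nothing here reflects a specific memory controller.

```python
def encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(word):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1          # undo the upset
    return [c[2], c[4], c[5], c[6]], position

data = [1, 0, 1, 1]
stored = encode(data)
upset = list(stored)
upset[4] ^= 1                         # simulated SEU in bit position 5
print(decode(stored))                 # no error: position 0
print(decode(upset))                  # same data, error located at position 5
```

A second flip in the same word would be miscorrected by this code, which is acceptable only under the slide's assumption that multi-bit upsets are too rare to matter.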
SEU on the Airbus 380 • Uses PowerPC 750CXe • Off-the-shelf • RAM has ECC • L2 has ECC but L1 does not • No protection against SEU in processor core • Options: • Do not use L1 and make a best effort to “code against” SEU • Use EDDI: error detection by duplicated instructions • But who wants to pay the overhead? • Selective use of fault (SEU) resilient development techniques • Pure software or hybrid (minimal HW support + SW) • Protection only where needed in code • Rigorous methodology and tools needed!
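The duplicate-and-compare idea behind EDDI can be sketched at source level: keep a shadow copy of each computation and compare the two before the result is used. Actual EDDI is applied by a compiler to machine instructions with separate registers and memory, so the following is only an analogy. The function name, the fault-injection parameter, and the sensor values are assumptions for illustration.

```python
class SoftErrorDetected(Exception):
    """Raised when the master and shadow computations disagree."""

def flip_bit(value, bit):
    """Model a single event upset by inverting one bit of an integer."""
    return value ^ (1 << bit)

def checked_average(samples, inject_fault_at=None):
    """Average the samples, running every addition twice.

    inject_fault_at is a test hook (not part of any real scheme) that
    corrupts the master copy after the given iteration.
    """
    total = 0
    shadow_total = 0
    for i, sample in enumerate(samples):
        total += sample            # master computation
        shadow_total += sample     # duplicated computation
        if i == inject_fault_at:
            total = flip_bit(total, 3)   # simulated SEU hits one copy only
    if total != shadow_total:      # compare before the value is used
        raise SoftErrorDetected("master and shadow totals differ")
    return total // len(samples)

print(checked_average([100, 102, 98, 100]))
try:
    checked_average([100, 102, 98, 100], inject_fault_at=1)
except SoftErrorDetected as err:
    print("detected:", err)
```

Every arithmetic step runs twice, which is the overhead the slide objects to. Applying the check only to the few values that drive valve decisions would be one way to read "protection only where needed".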
Conclusion • Off-the-shelf processors everywhere • Airbus 380, Boeing 787 • Automotive industry (waking up!) • Lack of predictability and protection • New methods for timing analysis • Increasing complexity gap • Promote hybrid HW/SW solution • Timings on actual processor in special execution mode • Steer execution through SW → realistic! (ARM) • New methods for soft error protection • Either pure software or hybrid (min. HW + SW) • Fault (SEU) resilient software development, selective • Missing in action: methods and tools needed today / yesterday!