This paper explores the importance of software safety in embedded systems and provides insights into the methodologies and practices involved. It covers topics such as hazard analysis, software requirements, design and analysis, human-machine interaction, verification, and validation. The paper also discusses the challenges and complexities of ensuring safety in hardware and software components and the need for system-level methods and viewpoints.
Software Safety in Embedded Systems & Software Safety: Why, What, and How – Leveson • UC San Diego CSE 294, Spring Quarter 2006 • Barry Demchak
Previous Paper • System Safety in Computer-Controlled Automotive Systems – Leveson (2000) • Types of accidents • Safeware Methodology • Project Management • Software Hazard Analysis • Software Requirements Specification & Analysis • Software Design & Analysis • Design & Analysis of Human-Machine Interaction • Software Verification • Feedback from Operational Experience • Change Control and Analysis
Roadmap • Safety definitions • Industrial safety and risk • Systems Issues – hardware and software • Software Safety • Analysis and Modeling • Verification and Validation • System Safety Engineering
Safety Before Computers • NASA: 10⁻⁹ chance of failure over a 10-hour flight • British nuclear reactors: no single fault can cause a reactor to trip, and 10⁻⁷ chance over 5000 hours of failure to meet a demand to trip • FAA: 10⁻⁹ chance per flight hour (i.e., not within total life span of entire fleet)
Introduction of Computers • Nuclear Power Plants • Space Shuttle • Airbus Aircraft • Space Satellites • NORAD • Purpose: perform functions that are too dangerous, quick, or complex for humans
System Safety (def.) • Subdiscipline of systems engineering • Applies scientific, management, and engineering principles • Ensures adequate safety throughout the system life cycle • Constrained by operational effectiveness, time, and cost • MilSpec: “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property”
More Definitions • Accident • Unwanted and unexpected release of energy • Mishap (or failure) • Unplanned event or series of events • Death, injury, occupational illness, damage, or loss of equipment or property, or environmental harm • Hazard • A condition that can lead to a mishap
More Definitions (cont’d) • Risk • Probability of a hazardous state occurring • Probability of a hazardous state leading to a mishap • Perceived severity of the worst potential mishap that could result from a hazard • Hazard probability • Hazard criticality (severity)
Early Approach • Operational or Industrial Safety • Examining system during operating life • Correcting unacceptable hazards • Ignores crushing effect of single catastrophe • Assumptions • All faults caused by human errors could be avoided completely or located and removed prior to delivery and operation • Relatively low complexity of hardware
Ford Pinto (early 1970s) • Specifications: 2000 pounds, $2000 sale price • Use existing factory tooling • Safety issue with gas tank placement • Analysis • Deaths cost $200,000, burns cost $67,000 • Cost to make change $137M, benefit $49M • Ford engineer: “But you miss the point entirely. You see, safety isn't the issue, trunk space is. You have no idea how stiff the competition is over trunk space.” • Ford president: “Safety doesn’t sell” • Verdict: $100M
Anecdotes • Safety devices themselves have been responsible for losses or for increasing the chances of mishaps • Redundancy sometimes degrades safety • Systems assumed to be unrelated (but actually coupled) cause errors
Later Approach • System Safety • Design acceptable safety level before actual production or operation • Optimize safety by applying scientific and engineering principles to identify and control hazards through analysis, design, and management procedures • Hazard analysis identifies and assesses • Criticality level of hazards • Risks involved in system design
Later approach (cont’d) • Assumptions • Complexity of software and hardware interaction causes non-linear increase in human-error-induced faults • Impossible to demonstrate safety ahead of usage • Complexity and coupling are covariant
Hardware vs Systems • Hardware • Widgets have long history of use and fault analysis … highly responsive to redundant techniques • Infinite number of stable states • Software • No history with software … reuse is rare • Large number of discrete states without repetitive structure • Difficult to test under realistic conditions
More Systems Issues • Difficult to specify completely – what it does, and what it does not do • Cannot identify misunderstandings about requirements • Engineers assume perfect execution environments, don’t consider transient faults • Lack of system-level methods and viewpoints
Even Bigger Systems Issues • Specifying and implementing individual components is not the same as specifying the interactions between components • Between-component interactions grow exponentially and are often underrepresented in analyses • Components include • Software • Hardware • Human operators
Still Bigger Systems Issues • More Components • Development Methodologies • Source code maintenance • Verification/Validation Methodologies • Stakeholder Values • Management • Individual Programmers • Customer • Human Users • Suppliers
Definitions • Reliability • Probability that system will perform intended function • Safety • Probability that hazard will not lead to a mishap • Reliability = failure free • Safety = mishap free • Reliability and Safety often conflict
Safety • Studied separately from security, reliability, or availability • Separation of concerns • Safety requirements are identified and separated from operational requirements • Conflicts resolved in a well-reasoned manner
Definitions • System • Sum total of all component parts • Software is only a part, and its correctness exists only in relation to other system components
Software Safety • Ensures software will execute within a system context without resulting in unacceptable risk • Safety-critical software functions • Directly or indirectly allow a hazardous system state to exist • Safety-critical software • Contains safety-critical functions
System Characteristics • Inputs and outputs over time • Control subsystem • Description of function to be performed • Specification of operating constraints (quality, capacity, process, and safety) • Safety constraints are hazards rewritten as constraints • Safety constraints written, maintained, and audited separately
Analysis and Modeling • Preliminary Hazard Analysis (PHA) • Subsystem Hazard Analysis (SSHA) • System Hazard Analysis (SHA) • Operating and Support Hazard Analysis (O&SHA) • Safeware – Leveson
Hazard Analysis • Start with list of identifiable hazards • Work backward to discover combination of faults that produce the hazard • Categorization • Frequent • Occasional • Reasonably remote • Remote • … physically impossible
Hazard Examples(Nuclear Weapons) • Inadvertent nuclear detonation • Inadvertent prearming, arming, launching, firing, or releasing • Deliberate prearming, arming, launching, firing, or releasing under inappropriate conditions
Software Requirements Analysis • Hard to do • Cubby-hole mentality • Rarely includes what the system should not do • Techniques • Fault Tree Analysis (FTA) • Real Time Logic (RTL) • Petri nets
Real Time Logic • Model the system in terms of events and actions (both data dependency and temporal ordering) • Generate predicates • Determine whether a safety assertion is a theorem derivable from the model • Inherently unsafe means that the assertion cannot be derived from the model
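The RTL flow above can be illustrated with a toy sketch. Real RTL derives safety assertions as theorems over event-occurrence functions; the fragment below merely checks a timing predicate against example traces, and all event names, times, and deadlines are invented.

```python
# Toy sketch of an RTL-style timing check (illustrative only; real RTL
# proves the assertion as a theorem rather than testing sample traces).
# A trace is a list of (event, occurrence_time) pairs.

def deadline_met(trace, cause, effect, deadline):
    """True if every 'cause' event is followed by an 'effect'
    event within 'deadline' time units."""
    causes = [t for e, t in trace if e == cause]
    effects = [t for e, t in trace if e == effect]
    return all(any(t <= u <= t + deadline for u in effects) for t in causes)

# Hypothetical assertion: a shutdown command must close the valve
# within 30 time units.
safe_trace = [("cmd_shutdown", 100), ("valve_closed", 120)]
unsafe_trace = [("cmd_shutdown", 100), ("valve_closed", 150)]

print(deadline_met(safe_trace, "cmd_shutdown", "valve_closed", 30))    # True
print(deadline_met(unsafe_trace, "cmd_shutdown", "valve_closed", 30))  # False
```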
Time Petri Nets • Mathematical modeling of discrete event systems in terms of conditions and events and the relationship between them • Facilitates backward analysis • Points to failures and faults which are potentially most hazardous • Nontrivial to build and maintain
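A minimal, untimed Petri-net sketch of conditions (places holding tokens) and events (transitions that consume and produce tokens); the timing arcs of a full Time Petri net are omitted, and the places and transitions are invented for illustration.

```python
# Minimal untimed Petri-net sketch: places map to token counts, and a
# transition fires only when all of its input places hold a token.

class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)          # place -> token count

    def enabled(self, inputs):
        return all(self.marking.get(p, 0) >= 1 for p in inputs)

    def fire(self, inputs, outputs):
        if not self.enabled(inputs):
            raise RuntimeError("transition not enabled")
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1

# Hypothetical condition/event structure: the machine may only start
# when the door is closed.
net = PetriNet({"door_closed": 1, "idle": 1})
net.fire(inputs=["door_closed", "idle"], outputs=["running", "door_closed"])
print(net.marking["running"])   # 1
```

Backward analysis works over the same structure in reverse: from a hazardous marking, enumerate which transition firings could have produced it.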
Research Question • What is the place of these analysis techniques in an agile development environment?
Safety Verification and Validation • Showing that a fault cannot occur • Showing that if a fault occurs, it is not dangerous • Only as good as the specifications • Specifications are usually incomplete, and hardware specifications are rare
Safety Verification and Validation • Methodologies • Proofs of adequacy • Software Fault Tree (proofs of fault tree analyses) • Determine safety requirements • Detect software logic errors • Identify multiple failure sequences involving different parts of the system • Inform critical runtime checks • Inform testing
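A fault tree's gate structure can be sketched as a small recursive evaluation. This is a hedged illustration, not Leveson's method: the event names and tree are invented, and it evaluates qualitatively (which fault combinations reach the top event) rather than computing probabilities.

```python
# Sketch of fault-tree evaluation: AND/OR gates combine basic events;
# working backward from the top (hazard) event shows which fault
# combinations can produce it. All event names are hypothetical.

def evaluate(node, faults):
    """True if the tree's top event occurs given the set of basic
    faults currently present."""
    kind = node[0]
    if kind == "event":
        return node[1] in faults
    results = [evaluate(child, faults) for child in node[1]]
    return all(results) if kind == "AND" else any(results)

# Hypothetical hazard: "uncommanded valve opening" requires a stuck
# relay AND (a software logic error OR a corrupted command message).
tree = ("AND", [
    ("event", "relay_stuck"),
    ("OR", [("event", "logic_error"), ("event", "corrupt_msg")]),
])

print(evaluate(tree, {"relay_stuck", "corrupt_msg"}))  # True
print(evaluate(tree, {"logic_error"}))                 # False
```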
Safety Verification and Validation • Methodologies • Nuclear Safety Cross Check Analysis (NSCCA) • Demonstrate that software will not contribute to a nuclear mishap • Multiple technical analyses demonstrate adherence to specifications • Demonstrate security and control measures • A lot of qualitative judgment regarding criticality • Software Common Mode Analysis • Sneak Software Analysis
Safety Analysis – Quantitative • Requires statistical histories which may not exist • Applies mostly to physical systems • Single-valued Best Estimate • Information sufficient for determinate models • Probabilistic • Science is understood, but limited parameters available • Bounding • Putting a ceiling on the answer
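The bounding approach can be illustrated with a one-line union bound: without knowing how component failures are correlated, the sum of their individual probabilities still puts a ceiling on the chance that any of them fails. The numbers below are hypothetical.

```python
# Bounding illustration (hypothetical per-hour failure probabilities):
# P(any component fails) <= sum of individual probabilities, regardless
# of the dependence structure between the components.

p_failures = [1e-4, 5e-5, 2e-5]     # invented component failure rates
upper_bound = sum(p_failures)       # union bound: a ceiling on the answer
print(f"{upper_bound:.1e}")         # 1.7e-04
```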
System Safety Engineering • Identify hazards • Assess hazards (likelihood and criticality) • Design to eliminate or control hazards • Assess risks that cannot be eliminated or controlled
Failure Mode Definitions • Fail-safe • Default is safe mode, no attempt to execute operational mission • Fail-operational • Default is to correct fault and continue with operational mission • Fail-soft • Default is to continue with degraded operations
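The three policies differ only in what the system defaults to on a detected fault; a toy dispatch makes the contrast concrete (the response strings are invented placeholders for real recovery logic).

```python
# Toy contrast of failure-mode policies: the policy determines the
# default behavior once a fault is detected. Responses are placeholders.

def on_fault(policy):
    if policy == "fail-safe":
        return "abort mission, enter safe state"
    if policy == "fail-operational":
        return "mask fault, continue full mission"
    if policy == "fail-soft":
        return "continue mission with degraded operations"
    raise ValueError(f"unknown policy: {policy}")

print(on_fault("fail-safe"))   # abort mission, enter safe state
```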
Designing for Safety • Not possible to ensure safety by analysis or verification alone • Analysis and verification may be cost-prohibitive • Different standard hierarchy • Intrinsically safe • Prevents or minimizes occurrence of hazards • Controls the hazard • Warns of presence of hazard
Safety Design Mechanisms • Lockout device • Prevents an event from occurring when a hazard is present • Lockin device • Maintains an event or condition • Interlock device • Assures operations occur in the correct sequence
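An interlock can be sketched as a state machine that refuses out-of-order operations. The operation names are invented, and a real device would also have to handle resets and partial sequences.

```python
# Interlock sketch: operations are only accepted in a fixed sequence;
# anything out of order is refused. Operation names are hypothetical.

class Interlock:
    SEQUENCE = ["close_door", "arm", "fire"]

    def __init__(self):
        self.step = 0

    def perform(self, op):
        if self.step >= len(self.SEQUENCE) or op != self.SEQUENCE[self.step]:
            return False               # out of sequence: refuse
        self.step += 1
        return True

lock = Interlock()
print(lock.perform("fire"))        # False -- door not yet closed
print(lock.perform("close_door"))  # True
print(lock.perform("arm"))         # True
print(lock.perform("fire"))        # True
```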
Safety Design Principles • Provide leverage for certification • Avoid complexity where possible • Reduce risk by reducing hazard likelihood, or severity, or both • Modularize to separate safety-critical functions from non-critical functions • Execute safety-critical functions under separate authority • Fail on a single-point failure
Safety Design Principles (cont’d) • Start out in safe state, and take affirmative actions to reach higher risk states • Check critical flags as close as possible to the actions they protect • Avoid complements: absence of “armed” is not “safe” • Use “true” values to indicate safety … “false” values can result from common hardware failures
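The "avoid complements" principle can be sketched with an explicit state enumeration: a hazardous action is permitted only on a positive reading, so a defaulted or corrupted value is treated as unsafe rather than as safe. The state names are invented.

```python
# Sketch of the "no complements" principle: model arm state explicitly
# instead of inferring "safe" from the absence of "armed". A corrupted
# or defaulted reading then lands in UNKNOWN, which refuses the action.

from enum import Enum

class ArmState(Enum):
    SAFE = "SAFE"
    ARMED = "ARMED"
    UNKNOWN = "UNKNOWN"   # default / corrupted reading

def may_fire(state):
    # Only an affirmative ARMED reading permits the hazardous action.
    return state is ArmState.ARMED

print(may_fire(ArmState.UNKNOWN))   # False -- unknown is treated as unsafe
```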
Safety Design Principles (cont’d) • Detection of unsafe states • Watchdog timer • Independent monitors • Asserts and exception handlers • Use backward recovery (return system to safe state) instead of forward recovery (plow ahead)
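A watchdog check can be sketched as below (single-threaded, for illustration only; a real watchdog runs as an independent monitor and triggers backward recovery to a safe state on expiry).

```python
# Watchdog sketch: the monitored task must "kick" the watchdog within
# the timeout, otherwise the system is assumed hung or unsafe and
# should be returned to a known safe state (backward recovery).

import time

class Watchdog:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = time.monotonic()

    def kick(self):
        self.last_kick = time.monotonic()

    def expired(self):
        return time.monotonic() - self.last_kick > self.timeout_s

wd = Watchdog(timeout_s=0.05)
wd.kick()
print(wd.expired())      # False -- just kicked
time.sleep(0.1)
print(wd.expired())      # True -- task missed its deadline
```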
Human Factors • Define partnership between human and computer • Avoid complacency • Avoid confusion • Avoid passive monitoring
Conclusion • Select suite of techniques and tools spanning entire software development process • Apply them conscientiously, consistently, and thoroughly • Consider implementation tradeoffs • Low catastrophe, high cost alternatives • Moderate catastrophe, moderate cost alternatives • High catastrophe, low cost alternatives
Take Home Messages • Safety is a system issue – in the large sense • Software engineering techniques can contribute to system safety – in both a narrow and broad context • Acceptable risk is king, and determining and executing it is hard