This paper explores the importance of software safety in embedded systems and provides insights into the methodologies and practices involved. It covers topics such as hazard analysis, software requirements, design and analysis, human-machine interaction, verification, and validation. The paper also discusses the challenges and complexities of ensuring safety in hardware and software components and the need for system-level methods and viewpoints.
Software Safety in Embedded Systems & Software Safety: Why, What, and How – Leveson • UC San Diego CSE 294, Spring Quarter 2006 • Barry Demchak
Previous Paper • System Safety in Computer-Controlled Automotive Systems – Leveson (2000) • Types of accidents • Safeware Methodology • Project Management • Software Hazard Analysis • Software Requirements Specification & Analysis • Software Design & Analysis • Design & Analysis of Human-Machine Interaction • Software Verification • Feedback from Operational Experience • Change Control and Analysis
Roadmap • Safety definitions • Industrial safety and risk • Systems Issues – hardware and software • Software Safety • Analysis and Modeling • Verification and Validation • System Safety Engineering
Safety Before Computers • NASA: 10⁻⁹ chance of failure over a 10-hour flight • British nuclear reactors: no single fault can cause a reactor to trip, and 10⁻⁷ chance over 5000 hours of failure to meet a demand to trip • FAA: 10⁻⁹ chance per flight hour (i.e., not within total life span of entire fleet)
Introduction of Computers • Nuclear Power Plants • Space Shuttle • Airbus Aircraft • Space Satellites • NORAD • Purpose: perform functions that are too dangerous, quick, or complex for humans
System Safety (def.) • Subdiscipline of systems engineering • Applies scientific, management, and engineering principles • Ensures adequate safety throughout the system life cycle • Constrained by operational effectiveness, time, and cost • MilSpec: “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property”
More Definitions • Accident • Unwanted and unexpected release of energy • Mishap (or failure) • Unplanned event or series of events • Death, injury, occupational illness, damage, or loss of equipment or property, or environmental harm • Hazard • A condition that can lead to a mishap
More Definitions (cont’d) • Risk • Probability of a hazardous state occurring • Probability of a hazardous state leading to a mishap • Perceived severity of the worst potential mishap that could result from a hazard • Hazard probability • Hazard criticality (severity)
Early Approach • Operational or Industrial Safety • Examining system during operating life • Correcting unacceptable hazards • Ignores crushing effect of single catastrophe • Assumptions • All faults caused by human errors could be avoided completely or located and removed prior to delivery and operation • Relatively low complexity of hardware
Ford Pinto (early 1970s) • Specifications: 2000 pounds, $2000 sale price • Use existing factory tooling • Safety issue with gas tank placement • Analysis • Deaths cost $200,000, burns cost $67,000 • Cost to make change $137M, benefit $49M • Ford engineer: “But you miss the point entirely. You see, safety isn't the issue, trunk space is. You have no idea how stiff the competition is over trunk space.” • Ford president: “Safety doesn’t sell” • Verdict: $100M
Anecdotes • Safety devices themselves have been responsible for losses or for increasing the chances of mishaps • Redundancy sometimes degrades safety • Systems assumed to be unrelated (but actually coupled) cause errors
Later Approach • System Safety • Design acceptable safety level before actual production or operation • Optimize safety by applying scientific and engineering principles to identify and control hazards through analysis, design, and management procedures • Hazard analysis identifies and assesses • Criticality level of hazards • Risks involved in system design
Later approach (cont’d) • Assumptions • Complexity of software and hardware interaction causes non-linear increase in human-error-induced faults • Impossible to demonstrate safety ahead of usage • Complexity and coupling are covariant
Hardware vs Systems • Hardware • Widgets have long history of use and fault analysis … highly responsive to redundant techniques • Infinite number of stable states • Software • No history with software … reuse is rare • Large number of discrete states without repetitive structure • Difficult to test under realistic conditions
More Systems Issues • Difficult to specify completely – what it does, and what it does not do • Cannot identify misunderstandings about requirements • Engineers assume perfect execution environments, don’t consider transient faults • Lack of system-level methods and viewpoints
Even Bigger Systems Issues • Specifying and implementing individual components is not the same as specifying the interactions between components • Between-component interactions grow exponentially and are often underrepresented in analyses • Components include • Software • Hardware • Human operators
Still Bigger Systems Issues • More Components • Development Methodologies • Source code maintenance • Verification/Validation Methodologies • Stakeholder Values • Management • Individual Programmers • Customer • Human Users • Suppliers
Definitions • Reliability • Probability that system will perform intended function • Safety • Probability that hazard will not lead to a mishap • Reliability = failure free • Safety = mishap free • Reliability and Safety often conflict
Safety • Studied separately from security, reliability, or availability • Separation of concerns • Safety requirements are identified and separated from operational requirements • Conflicts resolved in a well-reasoned manner
Definitions • System • Sum total of all component parts • Software is only a part, and its correctness exists only in relation to other system components
Software Safety • Ensures software will execute within a system context without resulting in unacceptable risk • Safety-critical software functions • Directly or indirectly allow a hazardous system state to exist • Safety-critical software • Contains safety-critical functions
System Characteristics • Inputs and outputs over time • Control subsystem • Description of function to be performed • Specification of operating constraints (quality, capacity, process, and safety) • Safety constraints are hazards rewritten as constraints • Safety constraints written, maintained, and audited separately
Analysis and Modeling • Preliminary Hazard Analysis (PHA) • Subsystem Hazard Analysis (SSHA) • System Hazard Analysis (SHA) • Operating and Support Hazard Analysis (O&SHA) • Safeware – Leveson
Hazard Analysis • Start with list of identifiable hazards • Work backward to discover combination of faults that produce the hazard • Categorization • Frequent • Occasional • Reasonably remote • Remote • … physically impossible
Hazard Examples(Nuclear Weapons) • Inadvertent nuclear detonation • Inadvertent prearming, arming, launching, firing, or releasing • Deliberate prearming, arming, launching, firing, or releasing under inappropriate conditions
Software Requirements Analysis • Hard to do • Cubby-hole mentality • Rarely includes what the system should not do • Techniques • Fault Tree Analysis (FTA) • Real Time Logic (RTL) • Petri nets
Real Time Logic • Model the system in terms of events and actions (both data dependency and temporal ordering) • Generate predicates • Determine whether a safety assertion is a theorem derivable from the model • Inherently unsafe means that the assertion cannot be derived from the model
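The RTL flow above can be illustrated with a toy sketch. Real RTL derives safety assertions as theorems over event-occurrence functions; the fragment below merely checks a timing predicate against example traces, and all event names, times, and deadlines are invented.

```python
# Toy sketch of an RTL-style timing check (illustrative only; real RTL
# proves the assertion as a theorem rather than testing sample traces).
# A trace is a list of (event, occurrence_time) pairs.

def deadline_met(trace, cause, effect, deadline):
    """True if every 'cause' event is followed by an 'effect'
    event within 'deadline' time units."""
    causes = [t for e, t in trace if e == cause]
    effects = [t for e, t in trace if e == effect]
    return all(any(t <= u <= t + deadline for u in effects) for t in causes)

# Hypothetical assertion: a shutdown command must close the valve
# within 30 time units.
safe_trace = [("cmd_shutdown", 100), ("valve_closed", 120)]
unsafe_trace = [("cmd_shutdown", 100), ("valve_closed", 150)]

print(deadline_met(safe_trace, "cmd_shutdown", "valve_closed", 30))    # True
print(deadline_met(unsafe_trace, "cmd_shutdown", "valve_closed", 30))  # False
```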
Time Petri Nets • Mathematical modeling of discrete event systems in terms of conditions and events and the relationship between them • Facilitates backward analysis • Points to failures and faults which are potentially most hazardous • Nontrivial to build and maintain
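A minimal, untimed Petri-net sketch of conditions (places holding tokens) and events (transitions that consume and produce tokens); the timing arcs of a full Time Petri net are omitted, and the places and transitions are invented for illustration.

```python
# Minimal untimed Petri-net sketch: places map to token counts, and a
# transition fires only when all of its input places hold a token.

class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)          # place -> token count

    def enabled(self, inputs):
        return all(self.marking.get(p, 0) >= 1 for p in inputs)

    def fire(self, inputs, outputs):
        if not self.enabled(inputs):
            raise RuntimeError("transition not enabled")
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1

# Hypothetical condition/event structure: the machine may only start
# when the door is closed.
net = PetriNet({"door_closed": 1, "idle": 1})
net.fire(inputs=["door_closed", "idle"], outputs=["running", "door_closed"])
print(net.marking["running"])   # 1
```

Backward analysis works over the same structure in reverse: from a hazardous marking, enumerate which transition firings could have produced it.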
Research Question • What is the place of these analysis techniques in an agile development environment?
Safety Verification and Validation • Showing that a fault cannot occur • Showing that if a fault occurs, it is not dangerous • Only as good as the specifications • Specifications are usually incomplete, and hardware specifications are rare
Safety Verification and Validation • Methodologies • Proofs of adequacy • Software Fault Tree (proofs of fault tree analyses) • Determine safety requirements • Detect software logic errors • Identify multiple failure sequences involving different parts of the system • Inform critical runtime checks • Inform testing
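A fault tree's gate structure can be sketched as a small recursive evaluation. This is a hedged illustration, not Leveson's method: the event names and tree are invented, and it evaluates qualitatively (which fault combinations reach the top event) rather than computing probabilities.

```python
# Sketch of fault-tree evaluation: AND/OR gates combine basic events;
# working backward from the top (hazard) event shows which fault
# combinations can produce it. All event names are hypothetical.

def evaluate(node, faults):
    """True if the tree's top event occurs given the set of basic
    faults currently present."""
    kind = node[0]
    if kind == "event":
        return node[1] in faults
    results = [evaluate(child, faults) for child in node[1]]
    return all(results) if kind == "AND" else any(results)

# Hypothetical hazard: "uncommanded valve opening" requires a stuck
# relay AND (a software logic error OR a corrupted command message).
tree = ("AND", [
    ("event", "relay_stuck"),
    ("OR", [("event", "logic_error"), ("event", "corrupt_msg")]),
])

print(evaluate(tree, {"relay_stuck", "corrupt_msg"}))  # True
print(evaluate(tree, {"logic_error"}))                 # False
```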
Safety Verification and Validation • Methodologies • Nuclear Safety Cross Check Analysis (NSCCA) • Demonstrate that software will not contribute to a nuclear mishap • Multiple technical analyses demonstrate adherence to specifications • Demonstrate security and control measures • A lot of qualitative judgment regarding criticality • Software Common Mode Analysis • Sneak Software Analysis
Safety Analysis – Quantitative • Requires statistical histories which may not exist • Applies mostly to physical systems • Single-valued Best Estimate • Information sufficient for determinate models • Probabilistic • Science is understood, but limited parameters available • Bounding • Putting a ceiling on the answer
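The bounding approach can be illustrated with a one-line union bound: without knowing how component failures are correlated, the sum of their individual probabilities still puts a ceiling on the chance that any of them fails. The numbers below are hypothetical.

```python
# Bounding illustration (hypothetical per-hour failure probabilities):
# P(any component fails) <= sum of individual probabilities, regardless
# of the dependence structure between the components.

p_failures = [1e-4, 5e-5, 2e-5]     # invented component failure rates
upper_bound = sum(p_failures)       # union bound: a ceiling on the answer
print(f"{upper_bound:.1e}")         # 1.7e-04
```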
System Safety Engineering • Identify hazards • Assess hazards (likelihood and criticality) • Design to eliminate or control hazards • Assess risks that cannot be eliminated or controlled
Failure Mode Definitions • Fail-safe • Default is safe mode, no attempt to execute operational mission • Fail-operational • Default is to correct fault and continue with operational mission • Fail-soft • Default is to continue with degraded operations
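The three policies differ only in what the system defaults to on a detected fault; a toy dispatch makes the contrast concrete (the response strings are invented placeholders for real recovery logic).

```python
# Toy contrast of failure-mode policies: the policy determines the
# default behavior once a fault is detected. Responses are placeholders.

def on_fault(policy):
    if policy == "fail-safe":
        return "abort mission, enter safe state"
    if policy == "fail-operational":
        return "mask fault, continue full mission"
    if policy == "fail-soft":
        return "continue mission with degraded operations"
    raise ValueError(f"unknown policy: {policy}")

print(on_fault("fail-safe"))   # abort mission, enter safe state
```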
Designing for Safety • Not possible to ensure safety by analysis or verification alone • Analysis and verification may be cost-prohibitive • Different standard hierarchy • Intrinsically safe • Prevents or minimizes occurrence of hazards • Controls the hazard • Warns of presence of hazard
Safety Design Mechanisms • Lockout device • Prevents an event from occurring when a hazard is present • Lockin device • Maintains an event or condition • Interlock device • Assures operations occur in the correct sequence
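An interlock can be sketched as a state machine that refuses out-of-order operations. The operation names are invented, and a real device would also have to handle resets and partial sequences.

```python
# Interlock sketch: operations are only accepted in a fixed sequence;
# anything out of order is refused. Operation names are hypothetical.

class Interlock:
    SEQUENCE = ["close_door", "arm", "fire"]

    def __init__(self):
        self.step = 0

    def perform(self, op):
        if self.step >= len(self.SEQUENCE) or op != self.SEQUENCE[self.step]:
            return False               # out of sequence: refuse
        self.step += 1
        return True

lock = Interlock()
print(lock.perform("fire"))        # False -- door not yet closed
print(lock.perform("close_door"))  # True
print(lock.perform("arm"))         # True
print(lock.perform("fire"))        # True
```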
Safety Design Principles • Provide leverage for certification • Avoid complexity where possible • Reduce risk by reducing hazard likelihood, or severity, or both • Modularize to separate safety-critical functions from non-critical functions • Execute safety-critical functions under separate authority • Fail on a single-point failure
Safety Design Principles (cont’d) • Start out in safe state, and take affirmative actions to reach higher risk states • Check critical flags as close as possible to the actions they protect • Avoid complements: absence of “armed” is not “safe” • Use “true” values to indicate safety … “false” values can result from common hardware failures
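The "avoid complements" principle can be sketched with an explicit state enumeration: a hazardous action is permitted only on a positive reading, so a defaulted or corrupted value is treated as unsafe rather than as safe. The state names are invented.

```python
# Sketch of the "no complements" principle: model arm state explicitly
# instead of inferring "safe" from the absence of "armed". A corrupted
# or defaulted reading then lands in UNKNOWN, which refuses the action.

from enum import Enum

class ArmState(Enum):
    SAFE = "SAFE"
    ARMED = "ARMED"
    UNKNOWN = "UNKNOWN"   # default / corrupted reading

def may_fire(state):
    # Only an affirmative ARMED reading permits the hazardous action.
    return state is ArmState.ARMED

print(may_fire(ArmState.UNKNOWN))   # False -- unknown is treated as unsafe
```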
Safety Design Principles (cont’d) • Detection of unsafe states • Watchdog timer • Independent monitors • Asserts and exception handlers • Use backward recovery (return system to safe state) instead of forward recovery (plow ahead)
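A watchdog check can be sketched as below (single-threaded, for illustration only; a real watchdog runs as an independent monitor and triggers backward recovery to a safe state on expiry).

```python
# Watchdog sketch: the monitored task must "kick" the watchdog within
# the timeout, otherwise the system is assumed hung or unsafe and
# should be returned to a known safe state (backward recovery).

import time

class Watchdog:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = time.monotonic()

    def kick(self):
        self.last_kick = time.monotonic()

    def expired(self):
        return time.monotonic() - self.last_kick > self.timeout_s

wd = Watchdog(timeout_s=0.05)
wd.kick()
print(wd.expired())      # False -- just kicked
time.sleep(0.1)
print(wd.expired())      # True -- task missed its deadline
```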
Human Factors • Define partnership between human and computer • Avoid complacency • Avoid confusion • Avoid passive monitoring
Conclusion • Select suite of techniques and tools spanning entire software development process • Apply them conscientiously, consistently, and thoroughly • Consider implementation tradeoffs • Low catastrophe, high cost alternatives • Moderate catastrophe, moderate cost alternatives • High catastrophe, low cost alternatives
Take Home Messages • Safety is a system issue – in the large sense • Software engineering techniques can contribute to system safety – in both a narrow and broad context • Acceptable risk is king, and determining and executing it is hard