Software Safety: An Oxymoron? March 29, 2007 Ken Wong, Ph.D., Senior Systems Analyst, McKesson Medical Imaging Group
Points to Ponder* • A system can be correct and reliable and yet unsafe • Software safety is not about bugs • Program testing can be used to show the presence of bugs, but never to show their absence * We will return to these statements in the discussion
Outline • Introduction to Software Safety • Software: Meet System Safety • System Safety: Meet Software • Verifying Software Safety
Software In the Real World • Therac-25 accidents • Ariane 5 Flight 501 explosion • Titan 4 Centaur/Milstar failure • TCAS mid-air collision near Überlingen, Germany
Ariane 501 Events • Destruction of Ariane 501 on 4 June 1996 (from final report): • nominal behaviour of the launcher up to H0 + 36 seconds; • failure of the back-up Inertial Reference System (SRI) followed immediately by failure of the active SRI;
Building Dependable Software … [Diagram: safety shown as one piece of the interlocking puzzle of dependable software, alongside Quality, Correctness, Reliability, and Security]
Safety is a Distinct Property • Safety is a distinct part of the interlocking puzzle of how to build dependable software • A system can be “correct” and “reliable” and yet unsafe! • Improved software process alone does not mean a safer system • Note: These can be contentious claims even among safety engineers.
Safety is … avoiding mishaps!
“Is it Safe”? Christian Szell: Is it safe? Babe: Yes, it's safe, it's very safe, it's so safe you wouldn't believe it. - Marathon Man 1976
System Safety • “System Safety” is a systematic approach to safety primarily developed in the US for the aerospace and defense industries • Spreading to other industries, e.g., health care • Focus on managing system hazards • E.g., FDA Quality System Regulation recommends “risk analysis” (A.K.A. hazard analysis)
System Safety [Diagram: system safety activities including Hazard ID, Hazard Analysis, Hazard Mitigation, Risk Assessment, and Safety Verification]
Hazard • A hazard is the system’s potential contribution to a mishap • E.g., brake failure, engine overheating • Key is understanding the system environment
Hazards and Mishaps [Diagram: hazard causes within the system lead to a hazard; the hazard leads to a mishap in the system environment]
Ariane 501: SRI Bug? • Uncaught exception from floating point conversion • From high value of BH (Horizontal Bias) • Programming 101! • Conversion check deliberately removed for performance reasons • SRI reused from Ariane 4 • Check not required for Ariane 4 trajectory
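The failed conversion is easy to reproduce in miniature. Below is a minimal Python sketch of the idea, not the actual SRI code (which was written in Ada); the function names, the saturating alternative, and the value assigned to horizontal_bias are all invented for illustration.

```python
# Illustrative sketch of the Ariane 501 conversion problem (hypothetical code,
# not from the SRI, which was written in Ada).

INT16_MIN, INT16_MAX = -32768, 32767

def unprotected_convert(value: float) -> int:
    """Narrowing conversion with no explicit guard, as for the BH variable.
    In Ada an out-of-range conversion raises an Operand Error exception;
    we simulate that with an exception here."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return int(value)

def protected_convert(value: float) -> int:
    """One possible guarded alternative: saturate at the 16-bit limits
    instead of raising."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))

# Ariane 4 trajectories kept the horizontal bias small enough to fit; the
# steeper Ariane 5 trajectory produced a much larger value (invented number).
horizontal_bias = 50_000.0

print(protected_convert(horizontal_bias))     # 32767: saturated, flight continues
try:
    unprotected_convert(horizontal_bias)
except OverflowError as err:
    # In flight there was no handler: the exception shut down the SRI processor.
    print("unhandled in the SRI:", err)
```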
Safety is a System Property • SRI worked exactly as specified – for Ariane 4! • Ariane 5 trajectory different from Ariane 4 • SRI spec did NOT include Ariane 5 trajectory data • SRI NOT tested with Ariane 5 trajectory data • “Safety” cannot be understood without knowing the operational environment • FDA “use-related” vs “device failure” hazards • E.g., TCAS collision in Germany
When Software Met Safety • … there was a definite risk in assuming that critical equipment such as the SRI had been validated by qualification on its own, or by previous use on Ariane 4. • ARIANE 5 Flight 501 Failure Report
In the beginning (or Europe) …* • Mechanical systems with well understood designs • Hazards caused by component failure from random hardware faults • Mitigation through integrity and redundancy * Myth, but there is underlying truth in all good myths
Fault Tree Analysis [Fault tree: top event “Steering Fails” decomposed through OR gates into intermediate and basic events, including “Steering Assembly Fails”, “Driver Error”, “Steering Wheel Fails”, “Drive Shaft Fails”, and “Steering Control Software Fails”]
Is Software Another Component? • What is the probability that the steering control software fails? • If software is just another component: • Software cannot wear out or break down like a mechanical component • Only “fault” is a programming bug • Assuming programmers do their job, the failure rate should be zero* *Paraphrased from a talk by a system safety engineer
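If software really were just another component, we would assign it a failure probability and fold it into the OR gates like any hardware event. A hedged Python sketch of that arithmetic, using invented probabilities and one plausible reading of the steering tree above, shows how the “failure rate should be zero” assumption plays out:

```python
def or_gate(*probabilities: float) -> float:
    """Probability that at least one independent input event occurs:
    P = 1 - product(1 - p_i)."""
    prob_none = 1.0
    for p in probabilities:
        prob_none *= (1.0 - p)
    return 1.0 - prob_none

# Invented per-demand probabilities for the basic events in the steering tree.
p_wheel, p_shaft, p_driver = 1e-6, 1e-6, 1e-4

# The slide's question: what number goes here for the steering control software?
# Treating a bug-free program as "just another component" suggests zero ...
p_software = 0.0

p_assembly = or_gate(p_wheel, p_shaft, p_software)
p_steering = or_gate(p_assembly, p_driver)
print(f"P(steering fails) = {p_steering:.2e}")   # software contributes nothing
```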
Software Revealed [Fault tree: the steering fault tree repeated, drawing attention to the “Steering Control Software Fails” event]
The Software Werewolf Of all the monsters that fill the nightmares of our folklore, none terrify more than werewolves, because they transform unexpectedly from the familiar into horrors … The familiar software project, at least as seen by the nontechnical manager, has something of this character … • Frederick P. Brooks, Jr. from No Silver Bullet: Essence and Accidents of Software Engineering
Ariane 501: Safety in Numbers? • In response to the “fault”, the primary SRI was deliberately shut down • Attempt made to switch to the backup SRI • Typical strategy in the face of random failures • However, BOTH SRIs shut down! • “Fault” due to the same design in both SRIs • Exception raised in a non-essential component
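Redundancy protects against independent random failures, not against a shared design fault: two copies of identical code fail on the same input. A toy Python sketch of that common-mode behaviour (all names and values are hypothetical):

```python
def sri_software(horizontal_bias: float) -> str:
    """Both SRIs run identical code, so both contain the same design fault."""
    if horizontal_bias > 32767:                  # deterministic, not random
        raise OverflowError("conversion overflow")
    return "attitude data"

def redundant_pair(horizontal_bias: float) -> str:
    """Redundancy strategy: if one unit fails, rely on the other."""
    for unit in ("primary SRI", "backup SRI"):
        try:
            return f"{unit}: {sri_software(horizontal_bias)}"
        except OverflowError:
            print(f"{unit} shut down")           # same input, same fault, same outcome
    raise SystemError("no inertial reference available")

print(redundant_pair(1_000.0))                   # either unit suffices
try:
    redundant_pair(50_000.0)
except SystemError as err:
    print("total loss of guidance:", err)
```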
Safety is an Emergent Property • Software safety is not about “faults” • Many potential “faults” but not all created equal – most have no impact on safety • “Correct” behaviour can contribute to the hazard! • Hazards can emerge from complex interactions between “correct” components
When Safety Met Software • An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. • Board wishes to point out that software is an expression of a highly detailed design and does not fail in the same sense as a mechanical system. • ARIANE 5 Flight 501 Failure Report
Software and Safety Process [Diagram: hazard ID, analysis and mitigation feed identified hazards into the requirements, design, and source code, with verification and safety verification checking the results]
Limits of Testing Program testing can be used to show the presence of bugs, but never to show their absence • E. Dijkstra in Structured Programming
Hazard-Driven Testing • Focus on hazard – force it to occur • Consider: • Hazard risk (“risk-based testing”) • Mishap scenarios • Hazard causes identified during hazard analysis • Problem reports/issues with safety implications • See Jeffrey J. Joyce and Ken Wong, Hazard-driven Testing of Safety-Related Software
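To give the idea some shape, here is a self-contained, hypothetical Python sketch of what a hazard-driven test can look like: the (invented) hazard is displaying a study under the wrong patient, the (invented) mitigation is an identity check, and the test forces the hazard cause identified during analysis rather than sampling ordinary inputs. None of these names come from the referenced paper.

```python
# Hypothetical sketch of a hazard-driven test: force the hazard cause and
# check that the mitigation keeps the hazard from occurring.

class MismatchError(Exception):
    """Raised by the (hypothetical) mitigation when patient identity is inconsistent."""

def display_study(worklist_patient_id: str, study_patient_id: str) -> str:
    """Mitigation under test: refuse to display a study whose patient ID does
    not match the selected worklist entry."""
    if worklist_patient_id != study_patient_id:
        raise MismatchError("study does not belong to the selected patient")
    return f"displaying study for patient {study_patient_id}"

def test_mismatched_patient_is_blocked():
    """Force the hazard cause: a study routed to the wrong worklist entry."""
    try:
        display_study(worklist_patient_id="P-1001", study_patient_id="P-2002")
    except MismatchError:
        return                       # mitigation engaged, hazard did not occur
    assert False, "hazard reached the display: wrong patient's study shown"

def test_matched_patient_is_displayed():
    """Normal case still works once the mitigation is in place."""
    assert "P-1001" in display_study("P-1001", "P-1001")

if __name__ == "__main__":
    test_mismatched_patient_is_blocked()
    test_matched_patient_is_displayed()
    print("hazard-driven checks passed")
```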
Summary and Conclusions • Safety is a distinct property • Safety is a system property • Operational and development environment factors • Safety is an emergent property • Hazards can emerge from complex interactions between “correct” components
References* • ARIANE 5 Flight 501 Failure Report by the Inquiry Board, Paris, July 1996 • Frederick P. Brooks, Jr., No Silver Bullet: Essence and Accidents of Software Engineering, Computer Magazine, April 1987 • Jeffrey J. Joyce and Ken Wong, Hazard-driven Testing of Safety-Related Software, 21st International System Safety Conference, Ottawa, Ontario, August 4-8, 2003 *All available on-line