CSCI 5230: Project Management
Software Reuse Disasters: Therac-25 and Ariane 5 Flight 501
David Sumpter
12/4/2001
Contents
• Introduction
• Therac-25 – Background
• Therac-25 – Software Process
• Therac-25 – Causes
• Flight 501 – Sequence of Events
• Ariane 5 Software Process
• Flight 501 – Cause of Failure
• Conclusion
• References
Introduction
• Two famous software engineering disasters
• Therac-25
  • Medical accelerator to treat tumors
  • 6 known accidents resulting in death or serious injury
  • June 1985 – January 1987
  • Software was adapted from earlier models
• Ariane 5 Flight 501
  • Maiden flight of Ariane 5 launch vehicle
  • Larger, more powerful successor to the Ariane 4
  • Exploded approximately 40 seconds after launch
  • June 1996
  • Loss traced to software carried over virtually unchanged from Ariane 4
Therac-25 – Background
• 25 MeV medical accelerator
  • Designed to destroy tumors
  • Dual mode: electron beam or X-rays
• Successor to Therac-6, Therac-20
• Therac-6, Therac-20
  • Computer control added to earlier machines
  • Still capable of stand-alone (no computer) operation
  • All standard hardware safety mechanisms
• Therac-25 more dependent on software
  • Lacked many hardware safety mechanisms of earlier accelerators
• Therac-25 software “evolved from” Therac-6 code
  • PDP-11 assembly, no standard OS
  • Also contained Therac-20 code
Therac-25 – Software Process
• Little, if any, process
  • Single programmer
  • Minimal unit and software testing
  • Emphasis on integrated system testing
• 1983 safety analysis, in effect, assumed that software had no errors!
  • “Programming errors have been reduced by extensive testing ... Any residual software errors are not included in the analysis.”
  • “Computer execution errors are caused by faulty hardware components and by ‘soft’ (random) errors induced by alpha particles and electromagnetic noise.”
Therac-25 – Causes
• For three of the six known incidents, cause is unknown
  • “there is no way to determine what particular design errors were related to the... accidents. Given the unsafe programming practices in the code, it is possible that unknown race conditions or errors could have been responsible”
• For two fatal accidents in Tyler, Texas
  • Race condition led to inconsistent machine settings, which led to massive radiation overdoses
  • Same bug found in Therac-20
    • Hardware safeguards prevented it from causing injuries, and kept it from even being discovered until after the Tyler accidents
• Fatal accident at Yakima, Washington
  • Overflow of a 1-byte variable led, under rare conditions, to improper machine settings and a massive radiation overdose
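The 1-byte overflow named on this slide can be illustrated with a minimal Python sketch. It is not the Therac-25 code, which was PDP-11 assembly; the function name, the loop, and the pass counts are hypothetical. The sketch assumes, following Leveson and Turner's account, that a one-byte flag was incremented on each pass instead of being set to a fixed nonzero value, so it wrapped to zero on every 256th pass and the check it guarded was skipped.

```python
from typing import List


def skipped_checks(passes: int, increment: bool = True) -> List[int]:
    """Return the pass numbers on which the guarded check would be skipped.

    `flag` simulates a one-byte variable: nonzero means "run the check",
    zero means "skip it". All names here are hypothetical.
    """
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        if increment:
            # One-byte arithmetic: 255 + 1 wraps around to 0.
            flag = (flag + 1) & 0xFF
        else:
            # Alternative: assign a fixed nonzero value, which cannot wrap.
            flag = 1
        if flag == 0:
            skipped.append(n)
    return skipped


print(skipped_checks(600))                   # [256, 512]
print(skipped_checks(600, increment=False))  # []
```

The check is skipped only on passes 256 and 512, which is the kind of rare, timing-dependent condition that routine testing is unlikely to hit.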
Flight 501 – Sequence of Events
• Near-simultaneous failure of the primary and back-up Inertial Reference Systems (SRIs)
  • 36 seconds after main engine ignition
• Nozzles of the two solid boosters and main engine swiveled to extreme positions
  • Nozzles direct rocket thrust, steer launcher
  • Caused launcher to veer abruptly
• Links between the solid boosters and the core stage ruptured
  • Triggered self-destruct
Ariane 5 Software Process
• Stringent processes in place, but…
  • “the culture within the Ariane program…” only addressed “random hardware failures… which can quite rationally be handled by a backup system”
  • “the view had been taken that software should be considered correct until it is shown to be at fault”! (emphasis added)
• SRIs were not included, but simulated by special software, in integrated tests
  • Technically difficult and expensive
  • SRIs considered “fully qualified at equipment level”
  • “The design of the Ariane 5 SRI is practically the same as that of an SRI which is presently used on Ariane 4, particularly as regards the software”
Flight 501 – Cause of Failure
• Software exception “during… data conversion from 64-bit floating point to 16-bit signed integer”
  • Occurred in SRI software
• Overflow caused by unexpectedly high value for Horizontal Bias (BH) variable
  • BH related to horizontal velocity
  • Conversion was left unprotected to save computer processing power
  • Analysis had determined that overflow could not occur
    • Reasoning not documented in code
  • Ariane 5 has higher horizontal velocity, early in trajectory, than Ariane 4!
• The part of the software where the error occurred was not needed after launch
  • Requirement to continue operating after launch traces to earlier versions of Ariane
  • Enabled prompt restart of countdown in event of a hold
  • Did not apply to Ariane 5, but maintained for commonality
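The conversion failure described on this slide can be sketched in Python. This is an illustration only, not the SRI flight software. The function names and the sample values 20000.0 and 40000.0 are hypothetical stand-ins for an "expected" and an "unexpectedly high" input, and the saturating variant is just one possible form of protection.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


def to_int16_checked(x: float) -> int:
    """Convert a float to a 16-bit signed integer, raising if it does not fit."""
    value = int(x)  # truncates toward zero
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a 16-bit signed integer")
    return value


def to_int16_saturating(x: float) -> int:
    """Convert a float to a 16-bit signed integer, clamping to the valid range."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


# Hypothetical value inside the range assumed by the original analysis.
print(to_int16_checked(20000.0))     # 20000

# Hypothetical larger value, standing in for the new vehicle's trajectory.
try:
    to_int16_checked(40000.0)
except OverflowError as err:
    print("exception:", err)

print(to_int16_saturating(40000.0))  # 32767
```

The point of the sketch is that the range assumption lives outside the code: the same conversion is safe or unsafe depending on the trajectory that feeds it, which is why an undocumented "cannot overflow" analysis did not survive reuse.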
Conclusion
• In both cases, software was carried over from earlier projects where it had seemingly worked well
• Therac-25
  • Software defects in earlier machines were hidden by hardware safeguards
  • No real software development process
  • Apparently no serious evaluation of risks involved in using software in lieu of hardware safeguards
• Ariane 5
  • Known “defect” was non-issue on Ariane 4
  • Established software development process in place
  • Issues were considered, but key factor was missed
Conclusion, cont.
• Misunderstanding of software?
  • Both were primarily hardware projects
  • Reuse of existing software in the development of new hardware
  • Not only underestimated complexity of software, but failed to recognize that it was even an issue
  • Both projects made the absolutely astounding assumption that the software didn’t have errors!
  • Assumed “black box” that could be swapped in and out of different applications
  • No evidence that reuse was considered in design of software
References
• Inquiry Board. Ariane 5 Flight 501 Failure. Inquiry Board report (July 1996).
  • Available online at http://www.mssl.ucl.ac.uk/www_plasma/missions/cluster/ariane5rep.html
• Leveson, N., Turner, C.S. An Investigation of the Therac-25 Accidents. IEEE Computer, vol. 26, no. 7 (July 1993), 18-41.
  • Available online at http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html