Famous Software Failures
Why they failed and lessons to be drawn
Samuel Franklin
G53QAT: Quality Assurance and Testing

Overview
• Three Software Failures
  • Patriot Missile
  • Russian Satellite Missile Detection
  • London Ambulance Service
• Summary of Findings
• Questions

The Patriot Missile Failure
[Images: The Patriot failing; The Patriot in action]
• February 1991, during the Gulf War
• Failed to intercept a Scud missile launched from Iraq
• 28 dead
• 100 injured
• Cause: an error from storing a time value in a fixed-point register

Why it went wrong
• The system had been running continuously for 100 hours
• By then its time calculations were out by 0.34 seconds
• As a result it missed the Scud by over 600 meters
• The error was already large enough to cause a miss after only 20 hours of running

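The 0.34 second figure can be reproduced with a little arithmetic. The sketch below is a minimal illustration, not the Patriot's actual code: it assumes the clock counted tenths of a second and that the constant 1/10 was chopped to 23 binary fractional digits, as commonly cited analyses of the failure describe. The names `FRACTION_BITS`, `truncated_tenth` and `drift_seconds` are hypothetical.

```python
from fractions import Fraction

# Assumption: the register kept 23 binary digits after the point.
FRACTION_BITS = 23
# Assumption: the system clock counted in tenths of a second.
TICKS_PER_HOUR = 10 * 60 * 60


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 chopped (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def drift_seconds(hours_running: int) -> float:
    """Accumulated clock error after running continuously for the given hours."""
    # 1/10 has no exact binary representation, so every tick loses a tiny amount.
    error_per_tick = Fraction(1, 10) - truncated_tenth()
    return float(error_per_tick * TICKS_PER_HOUR * hours_running)


if __name__ == "__main__":
    for hours in (8, 20, 100):
        print(f"{hours:>3} h -> {drift_seconds(hours):.4f} s")
```

Under these assumptions the error per tick is tiny, but it grows linearly with uptime and reaches roughly 0.34 seconds after 100 hours, which is why regular reboots masked the fault.
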
What America learnt from this
• The USA already knew of the fault from the Israeli military
• The Americans did not reboot the system regularly enough
• The software update arrived the day after the soldiers died

Russian Satellite Missile Detection System (Oko)
• Put in place to detect threats from America during the Cold War
• Stanislav Petrov was monitoring the system on 26th September 1983
• Oko alerted Petrov that 5 missiles were heading towards Russia
• Petrov had to choose:
  • Declare it a false alarm
  • Start a counterstrike and probably a nuclear war

What Russia learnt from this
• The Russians dissected the Oko system
• Found the software full of bugs
• Launched the SPRN-2 Prognoz to supplement the Oko system
• Cost of this failure could have been: World War III

London Ambulance Fiasco
• London Ambulance Service (LAS) introduced a Computer Aided Dispatch (CAD) system on 26th October 1992
• LAS:
  • Carries over 5000 patients per day
  • Receives approximately 2500 calls per day
  • 65% of calls are emergencies
• The new system needed near 100% accuracy and full cooperation from all LAS staff to succeed

26th October 1992 LAS
• The new CAD system could not handle the call volume of regular use
• Response times grew to several hours
• Communications between ambulances and LAS were lost
• The system had:
  • A poor interface between crews and the system
  • A number of technical problems:
    • Failed to identify duplicate calls
    • Did not prioritise exception messages

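The two technical problems named above can be pictured with a small sketch. This is purely illustrative and does not describe how the LAS CAD system was built: the class `DispatchQueue`, its methods and the idea of matching duplicates by normalised location text are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Message:
    priority: int              # lower number is handled first
    seq: int                   # tie-breaker keeps arrival order
    text: str = field(compare=False)


class DispatchQueue:
    """Hypothetical queue: merges duplicate calls, serves exceptions first."""

    EXCEPTION, ROUTINE = 0, 1

    def __init__(self) -> None:
        self._heap: list[Message] = []
        self._seq = count()
        self._open_incidents: set[str] = set()

    def report_call(self, location: str) -> bool:
        """Register a call; return False if it duplicates an open incident."""
        # Assumption: calls are duplicates when their location text matches
        # after lower-casing and collapsing whitespace.
        key = " ".join(location.lower().split())
        if key in self._open_incidents:
            return False
        self._open_incidents.add(key)
        self.push(self.ROUTINE, f"new incident at {key}")
        return True

    def push(self, priority: int, text: str) -> None:
        heapq.heappush(self._heap, Message(priority, next(self._seq), text))

    def pop(self) -> str:
        """Return the most urgent pending message."""
        return heapq.heappop(self._heap).text
```

With a queue like this, a second call about the same address would not create a second incident, and an exception message such as a crew failing to respond would be shown before routine traffic.
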
What London learnt from this
• Do not use direct conversion
• Implement in a step-by-step fashion
• Full consultation
• Quality assurance and testing
• User training

Conclusion
• Testing is essential
  • For all critical systems
• Rushing to get a system in place is bad
• Training
• Value of humans in the process

Questions and Discussion
Any questions?