Software in Practice: a series of four lectures on why software projects fail, and what you can do about it
Martyn Thomas
Founder: Praxis High Integrity Systems Ltd
Visiting Professor of Software Engineering, Oxford University Computing Laboratory
Lecture 2: Software Failures
• Developing software is very difficult
• it is easy to make mistakes …
• … and they are unlikely to be found by testing
• Errors can be introduced in every phase of software development:
• requirements capture, specification, design, programming, building, error correction, modification, re-use ...
Finding faults by testing?

  type Alert is (Warning, Caution, Advisory);

  function RingBell(Event : Alert) return Boolean
  -- return True for Event = Warning or Event = Caution,
  -- return False for Event = Advisory
  is
    Result : Boolean;
  begin
    if Event = Warning then
      Result := True;
    elsif Event = Advisory then
      Result := False;
    end if;
    return Result;
  end RingBell;

-- C130J code: Caution returns uninitialised (usually TRUE, as required).
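The fault above survives any test run in which the uninitialised Result happens to contain True for the Caution case. A minimal corrected sketch (my illustration, not the actual C130J fix): a case statement over the Alert enumeration forces every literal, including Caution, to be handled explicitly, so the compiler rejects an incomplete version.

  function RingBell(Event : Alert) return Boolean is
  begin
    case Event is
      when Warning | Caution => return True;   -- ring the bell
      when Advisory          => return False;  -- no bell
    end case;
  end RingBell;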
Taurus
• Taurus was a £50m system to provide electronic share trading for the London Stock Exchange in 1991, removing paper share certificates. (This would have revolutionised the job of share registrars.)
• It overran, and a recovery strategy was put in place.
• It was declared 85% complete, and a date for cut-over was announced later the same year. A few weeks later, the project was cancelled.
• City firms had wasted £350m on new systems built to interface to Taurus.
Taurus: a requirements problem
• The system was over-complicated and had failed to reconcile conflicting requirements, especially those from the share registrars.
This lesson has not been learnt ...
• No public-sector civil project has ever been put out to tender with a formal specification.
• For example, eFDP took two years to agree a set of requirements. The remaining difficulties were put into the requirements as six-month “design studies”. Four weeks after the RfP, the project was abandoned.
Nancy Leveson’s Torpedo: gaps in the specification
• How to stop a torpedo blowing up the launch ship?
• If it malfunctions or starts to come back:
• sink it
• blow it up
• On a live test, a torpedo failed whilst still in the torpedo tube …
LAS: The Manual System
• LAS covers 600 square miles, carries >5000 patients each day, and handles 2000-2500 calls daily, including 1300-1600 emergency calls. 750 ambulances.
• Emergency call written on a form. Location looked up on a map. Form and map co-ordinates placed on a conveyor belt to central dispatch, who remove duplicates and route to a zone to contact an ambulance.
• This took ~3 minutes and 200 staff.
• Decision to implement Computer-Aided Dispatch.
LAS: Computer-Aided Dispatch (CAD) version 1
• 1980s. £7.5 million spent. System built but failed its load test and was abandoned. LAS sued the supplier, who had not understood the requirement properly.
• 1990: requirements started for Version 2.
• New CAD to be “fully automated”: automatic lookup of location; automatic selection of the best ambulance.
• No similar system in existence.
LAS: CAD Version 2
• New system much more complex than Version 1: CAD + map display + Automatic Vehicle Location Service (AVLS).
• Andersen Consulting had estimated that a package solution without AVLS, if one existed, would cost £1.5m and take 19 months to implement.
• This seems to have become the project budget for a custom system.
LAS: Version 2 bids
• 35 companies looked, 19 bid; most said it needed more time and money than the budget allowed.
• The only bidder who promised to meet all the requirements on time and within budget was a consortium of Apricot (hardware), Systems Options (SO, a small software house) and Datatrak (AVLS).
• SO bid only £35K to develop the CAD software! Total bid: £937,463.
• The next lowest bid was £700K more!
LAS: Version 2 development
• Phase 1 system: no radio messaging
• client and server lock-ups
• Phase 2 system: with radio messaging
• unstable, overloaded at shift change, radio blackspots, unable to cope with staff taking the “wrong” vehicle.
• Managers decided to go live on 26 October 1992, ignoring an independent review.
LAS: Result
• 26 October: control room reconfigured to use CAD. No manual backup system.
• System progressively lost track of ambulances.
• Screens filled with exception messages that scrolled off and were lost.
• System delayed incidents while waiting for ambulances, so the public called again, increasing the workload.
• Several ambulances, or none, were sent to each incident.
• Staff stress caused operator errors.
• Network congestion, slowdown, system collapse.
• 27 October: semi-manual operation, but the system crashed through a memory leak. System abandoned.
Therac-25
• A system for treatment of tumours.
• Mode 1: low-energy electron beam treatment.
• Mode 2: very high-energy beam (25 MeV) with a thick metal plate in front, for X-rays.
• The Therac-20 had a mechanical switch to change beam, and an interlock to stop a change to high energy without the plate.
• The Therac-25 interlock was in software.
Therac-25 User Interface
• Set up treatment time.
• Electron beam: type e.
• X-ray beam: type x.
• The system puts the plate in place before switching the beam to X-rays.
• System displays “Beam Ready”; the operator types b to start treatment.
• The operator station is in a different room from the patient, to protect staff from radiation.
Therac: Accident
• Ray Cox, an oil worker, was on the table for his regular e-beam treatment for a tumour on his shoulder.
• The operator went to the other room:
• typed x, realised the mistake, typed “edit”, e, “enter” - all within 8 seconds. The system said “Malfunction”.
• cleared the error, got “Beam Ready” and hit b.
• same error message, so tried again. Twice.
• Ray felt a painful jolt - not like previous treatments. He shouted in pain but no-one heard. The third time he got off the table and went to find the nurse.
Therac-25: outcome
• Ray Cox died of the radiation overdose 4 months later.
• Meanwhile another patient experienced the same accident, but this time a technician realised there was a problem and reported it.
• The same problem had occurred in Georgia, Canada and Washington.
Therac: what went wrong?
• The operator’s actions exposed a race condition in the (multi-tasking) code.
• The result was a full-power beam without the plate in place: a 125-fold overdose!
• The particular sequence of actions had never occurred in testing.
• Made worse because the audio intercom and video link were both out of service, and the system error messages were not informative (and usually meant treatment had not occurred).
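The kind of interleaving involved can be sketched in a few lines (a deliberately unsynchronised toy with assumed names, not the Therac-25 code): one task sets up the beam from the requested mode in two separate steps, and an operator edit that lands between them leaves full X-ray energy selected with the plate out.

  procedure Therac_Race_Sketch is
    type Mode is (Electron, Xray);

    Requested_Mode : Mode    := Xray;    -- operator first typed x
    Energy_Is_High : Boolean := False;
    Plate_In_Place : Boolean := False;

    task Set_Up_Beam;
    task Apply_Edit;

    task body Set_Up_Beam is
    begin
      Energy_Is_High := (Requested_Mode = Xray);  -- latches X-ray: full power
      delay 0.1;                                  -- slow magnet set-up
      Plate_In_Place := (Requested_Mode = Xray);  -- re-reads: now Electron, plate stays out
    end Set_Up_Beam;

    task body Apply_Edit is
    begin
      delay 0.05;                    -- the operator’s quick edit, within the set-up window
      Requested_Mode := Electron;
    end Apply_Edit;
  begin
    null;  -- likely outcome: Energy_Is_High = True and Plate_In_Place = False
  end Therac_Race_Sketch;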
Therac: Failings
• The safety case claimed a probability of 10⁻¹¹ for “computer selects wrong energy”. No evidence was offered for the claim.
• No low-complexity protection system (fuse and/or interlock).
• Poor software engineering.
• Poor investigation of reported accidents. The manufacturer did not consider a possible software fault until there had been several accidents.
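For scale (a rough illustration, not a figure from the lecture): a failure probability that small could never be demonstrated by testing, because statistical evidence would need of the order of

  1 / 10⁻¹¹ = 10¹¹ failure-free treatments

so the claim had to rest on evidence about the design itself, and none was provided.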
Ariane V: Explosion
• The initial launch, in 1996, exploded.
• Failure traced to the inertial navigation system (INS).
• Overflow on conversion from a 64-bit floating-point value to a 16-bit integer; the exception was not trapped.
• The primary and back-up INS both failed for the same reason, and stopped.
• Loss of the INS led to auto-destruction.
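A minimal sketch (assumed names and values, not the actual INS code) of the kind of unprotected conversion involved: a 64-bit float holding a flight parameter is converted to a 16-bit integer, the value is out of range, and the resulting Constraint_Error propagates unhandled.

  procedure Conversion_Sketch is
    type Integer_16 is range -2**15 .. 2**15 - 1;
    Horizontal_Bias : Long_Float := 40_000.0;   -- hypothetical value outside the 16-bit range
    BH              : Integer_16;
  begin
    BH := Integer_16 (Horizontal_Bias);  -- raises Constraint_Error
    -- With no handler, the exception propagates and the unit shuts down;
    -- an identical back-up running the same code fails the same way.
  end Conversion_Sketch;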
Ariane V: cause of failure
• The INS software was re-used from Ariane IV.
• The Ariane IV flight profile guaranteed this parameter could not overflow.
• The Ariane V specification was different, in a way that affected the requirements for the INS.
• A formal specification would have caught this fault.
Conclusions (1)
• Software development is hard - all sorts of things go wrong.
• It is an engineering task. You dare not do without discipline and rigour.
• Even the best people make mistakes. That’s why we use reviews, checklists, type-checkers and other static analysis tools, testing, and proof.
Conclusions (2)
A safety-critical software team must have:
• Good domain knowledge
• Excellent systems engineering / software engineering knowledge, skills, processes
• Good knowledge of safety assessment principles, standards, practice and law
• … and finally ...
… a strong safety culture.
Developing safety-critical software is the subject of my next lecture.