Death by Software The Therac-25 Radio-Therapy Device Brian MacKay ESE6361 - Requirements Engineering – Fall 2013
The Atomic Age • World War II ushered in the atomic age • The start of the nuclear arms race • In many countries… • The question was how to harness this power for peaceful purposes
In Canada: AECL • Atomic Energy of Canada Limited is a “Crown Corporation” • Designed and implemented a Heavy Water nuclear reactor • The CANDU system • It also included AECL-Medical • Harnessing the atom for medical reasons
AECL & CGR – Medical Accelerator Technology • AECL-Medical and the French company: la Compagnie Générale de Radiologie (CGR) • Worked together during the 1970s on using linear accelerators for radio-therapy • High energy, low dose, Electron beams, or • A stream of photons in the X-Ray spectrum • The two companies’ partnership produced • The 6 MeV, X-Ray only “Therac-6” • The dual mode, 20 MeV “Therac-20”
Therac-6 & Therac-20 • Stand-alone electro-mechanical units • Operator could • Set all settings manually • Position beam devices manually • Once everything was set, and system was “safe” – deliver the dose • The system had an optional computer that allowed a simpler UI • A Digital Equipment PDP-11 • 32 kilobytes of memory • All assembly code
True Innovation: the Therac-25 • AECL only – CGR partnership had dissolved • Used a Double-Pass accelerator • Halved the space that the Therac-6 & Therac-20 had occupied • Made the computer the primary controller • No stand-alone manual mode • Shipped in 1983 • Still used a DEC PDP-11
It was the best on the market… • Except… • It seriously injured 6 patients between 1985 and 1987 • Killing 3 of those patients • All because of software
Hubris • When an engineer graduates in Canada, he/she attends The Ritual of the Calling of an Engineer • And gets an Iron Ring • Rudyard Kipling wrote the ceremony • Instills a sense of professionalism • And humility
Supreme Faith in Software • It appears that this device had rigorous safety engineering on the hardware side • Complete hazard analysis – fault tree • On the software side, the likelihood of error was described in insanely low terms • Fault probabilities on the order of 10⁻⁹ and 10⁻¹¹ • “Software does not degrade due to wear, fatigue or the reproduction process” • They had no expectation that a bug could cause a problem
Malfunction 54 • When there was a problem, the UI displayed the word “Malfunction” followed by a number 1-64 • There was NO documentation of what these codes were in the user manual • An internal AECL service manual described #54 as “dose input 2” and pointed out that this error code was only there for internal diagnostic reasons • Under normal conditions, an operator might see as many as 40 malfunction codes in a day • But Malfunction 54 was very rare • They were easily dismissed by pressing [P] (for “Proceed”)
Electron Mode vs. X-Ray Mode • In Electron Mode a low power beam is scanned across the patient • In X-Ray mode a high power beam is aimed at a target, producing X-Rays, which then irradiate the patient • The electron scanning mechanism and X-Ray target were mounted on a turntable • The position was controlled by the computer
Usability • User interface was a VT-100 Green Screen • Contained the Prescription • Entered by the operator • Originally – on error, prescription had to be re-entered • Usability studies changed this, near the end of the dev cycle • Introduced a major error

PATIENT NAME   : JOHN DOE
TREATMENT MODE : FIX      BEAM TYPE: X      ENERGY (MeV): 25

                              ACTUAL    PRESCRIBED
UNIT RATE/MINUTE              0         200
MONITOR UNITS                 50 50     200
TIME (MIN)                    0.27      1.00

GANTRY ROTATION (DEG)         0.0       0         VERIFIED
COLLIMATOR ROTATION (DEG)     359.2     359       VERIFIED
COLLIMATOR X (CM)             14.2      14.3      VERIFIED
COLLIMATOR Y (CM)             27.2      27.3      VERIFIED
WEDGE NUMBER                  1         1         VERIFIED
ACCESSORY NUMBER              0         0         VERIFIED

DATE   : 84-OCT-26    SYSTEM : BEAM READY     OP.MODE: TREAT AUTO
TIME   : 12:55. 8     TREAT  : TREAT PAUSE    X-RAY 173777
OPR ID : T25VO2-RO3   REASON : OPERATOR       COMMAND:
A Race Condition – UI & Operations Threads • In the Therac-25, the prescription information was entered • The Electron/X-Ray mode • Then a command to execute • If the operator • Entered an X-Ray command in error • Re-edited the page and changed it to Electron • Then executed the dose, all within 8 seconds • Then the patient was given an X-Ray dose directly through the Electron turntable element • The console then displayed “Malfunction 54”
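The race described on this slide can be illustrated with a small deterministic simulation. This is a simplified sketch, not AECL's code (which was PDP-11 assembly): the names `Machine`, `run_treatment`, `SETUP_TICKS`, and the `recheck` flag are all hypothetical, and the 8 ticks merely stand in for the roughly 8-second window. It assumes the setup task reads the mode once, while the turntable follows the latest edit.

```python
from dataclasses import dataclass

SETUP_TICKS = 8  # hypothetical stand-in for the ~8-second setup window


@dataclass
class Machine:
    mode: str = "X"       # shared prescription field, edited by the operator
    beam: str = ""        # beam power configured by the setup task
    turntable: str = ""   # turntable position chosen from the current mode


def run_treatment(edit_at_tick, recheck):
    """Simulate one setup cycle and return (fired, beam, turntable)."""
    m = Machine()
    snapshot = m.mode                 # setup task reads the shared mode once
    for tick in range(SETUP_TICKS):
        if tick == edit_at_tick:
            m.mode = "E"              # operator corrects X -> E during setup
        # Setup keeps using its stale snapshot and never re-reads m.mode.
        m.beam = "high" if snapshot == "X" else "low"
    # The turntable is positioned from the edited value, so it sees the change.
    m.turntable = "target" if m.mode == "X" else "scanner"
    if recheck and snapshot != m.mode:
        return (False, m.beam, m.turntable)   # mismatch detected: do not fire
    return (True, m.beam, m.turntable)


if __name__ == "__main__":
    print(run_treatment(edit_at_tick=3, recheck=False))
    print(run_treatment(edit_at_tick=3, recheck=True))
```

Without the re-check, the sketch fires a high-power beam with the scanner (Electron) element in place, which is the mismatch the slide describes. Comparing the snapshot against the shared value before firing is one possible guard; it is shown only to make the failure visible, not as a complete fix.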
Why Have One Deadly Bug? • A second deadly bug was eventually found in the Therac-25 • The system periodically tested whether everything was positioned properly, setting a variable with the result of the test • A zero indicated OK • Instead of simply setting the value to 1 or 0, the program incremented the value • And the variable was a single byte, so it rolled over to zero after 255 • The result was that on every 256th test of the positioning, the system would falsely indicate that everything was ready to proceed.
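The rollover arithmetic on this slide can be checked with a short sketch. The function name `run_checks` and its parameters are hypothetical, and Python integers do not overflow, so the 8-bit width is emulated with a mask. The positioning test is assumed to fail on every pass, which makes any "OK" reading a false one.

```python
def run_checks(n_passes, increment_flag):
    """Return the pass numbers on which the flag reads 0 ("OK") even though
    the positioning test failed on every pass."""
    flag = 0
    false_ok = []
    for n in range(1, n_passes + 1):
        position_ok = False                # assume the check fails every time
        if not position_ok:
            if increment_flag:
                flag = (flag + 1) & 0xFF   # emulate an 8-bit byte wrapping
            else:
                flag = 1                   # assign a fixed non-zero value
        if flag == 0:
            false_ok.append(n)             # zero is read as "all OK"
    return false_ok


if __name__ == "__main__":
    print(run_checks(600, increment_flag=True))   # [256, 512]
    print(run_checks(600, increment_flag=False))  # []
```

Assigning a fixed value instead of incrementing removes the rollover, since the flag can then never return to zero while the test is failing.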
Noteworthy: The Users Found the Bugs • It’s worth noting that AECL’s initial reaction to the problems was denial • Eventually, they got to the stage where they did piecemeal fixes • Without the efforts of the staff at the East Texas Cancer Center in Tyler, AECL might never have acknowledged the first bug • After two accidents – with the same operator – they spent time trying to recreate the race condition • After the Therac-25, the FDA changed the way it evaluated software (and software engineering) in medical devices.
The Scorecard • One patient died of cancer, but would have died of radiation poisoning in a few weeks had the cancer not killed him
Not the Bugs – The Software Engineering • All software systems have bugs • Even Knuth hands out the occasional $2.56 check • AECL coalesced their entire operator interface, control system and safety system into one program • They apparently had very little in the way of formal requirements gathering, design or development standards • All of the software was developed by one programmer • Their reaction to the problems was to fix them one at a time
Software Reuse • The Therac-20 reused some of the software from the Therac-6 • The Therac-25 reused software from both of the previous models • But • The earlier models had hardware interlocks to prevent over-dosing • The desire to reuse previous software resulted in a • Home-made real-time operating system • On an expensive, 10-year-old computer system • Running a program written entirely in assembly language • That relied on global variables for inter-task communication – without synchronization
No Requirement to Separate Layers • AECL architected the Therac-25’s software into a single point of failure • This was far from accepted practice in the early 1980s • Safety systems were migrating from hardware to software • But… they were usually separate, simpler systems – e.g. PLCs • By the early 80s, there were usually three distinct layers • Safety and integrity • Control and positioning • Operator interface and supervisory
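The separation of layers listed on this slide can be sketched as follows. This is an illustrative assumption about how such a split might look, not a description of any real device: the function names and the `"high"`/`"low"`, `"target"`/`"scanner"` values are hypothetical. The point is that the safety layer decides from independently sensed state and never reads the control layer's variables.

```python
def control_layer_request(mode):
    """Control layer: translate the prescription into a desired configuration."""
    if mode == "X":
        return {"beam": "high", "turntable": "target"}
    return {"beam": "low", "turntable": "scanner"}


def safety_layer_permits(beam_sensor, turntable_sensor):
    """Safety layer: judge only the sensed hardware state.

    It is deliberately simpler than the control layer and shares no
    variables with it, so a control bug cannot silence it.
    """
    safe_combinations = {("high", "target"), ("low", "scanner")}
    return (beam_sensor, turntable_sensor) in safe_combinations


def fire(mode, beam_sensor, turntable_sensor):
    """Operator/supervisory layer: ask for a dose; the safety layer can veto."""
    request = control_layer_request(mode)
    if not safety_layer_permits(beam_sensor, turntable_sensor):
        return "inhibited"
    return "fired " + request["beam"]


if __name__ == "__main__":
    print(fire("X", "high", "target"))    # consistent state
    print(fire("E", "high", "scanner"))   # mismatch is vetoed
```

In this sketch the veto holds even when the control layer believes the machine is in Electron mode, because the decision depends only on the sensed combination of beam power and turntable position.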
Testability – Auditing • AECL’s task architecture and real time OS made adequate testing nearly impossible • Look at the deadly errors – neither is discoverable through testing • No auditing of operations, or failures was included in the system • After all the issues with the Therac-25, a check was done on the Therac-20 system and the same bugs were found • But, because that system had mechanical interlocks, no injuries resulted
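The missing auditing could have taken many forms. A minimal sketch of one, assuming an append-only in-memory log with hypothetical names (`AuditLog`, `record`, `count`) and caller-supplied timestamps, is:

```python
import json


class AuditLog:
    """Append-only record of operations and failures (illustrative only)."""

    def __init__(self):
        self._entries = []

    def record(self, timestamp, event, **details):
        """Append one entry and return its sequence number."""
        entry = {"seq": len(self._entries) + 1, "time": timestamp,
                 "event": event, **details}
        # Store a serialized copy so later code cannot mutate the entry.
        self._entries.append(json.dumps(entry, sort_keys=True))
        return entry["seq"]

    def count(self, event):
        """Count how many entries of a given event type were recorded."""
        return sum(1 for line in self._entries
                   if json.loads(line)["event"] == event)


if __name__ == "__main__":
    log = AuditLog()
    log.record("t0", "malfunction", code=54)
    log.record("t1", "operator_proceed")
    print(log.count("malfunction"))
```

A record like this would at least let an investigator see how often a given malfunction code appeared and how often the operator pressed Proceed afterwards.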
References • “Medical Devices – The Therac-25”, Leveson, Nancy. http://sunnyday.mit.edu/papers/therac.pdf • “An Investigation of the Therac-25 Accidents”, Leveson, Nancy and Turner, Clark S., IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41. http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html • “Fatal Dose - Radiation Deaths linked to AECL Computer Errors”, Rose, Barbara Wade, Saturday Night (magazine), June 1994. http://www.ccnr.org/fatal_dose.html • “Safety-Critical Computing: Hazards, Practices, Standards, and Regulation”, Jacky, Jonathan. http://staff.washington.edu/jon/pubs/safety-critical.html • “Therac-25”, Wikipedia. http://en.wikipedia.org/wiki/Therac-25 • “PDP-11”, Wikipedia. http://en.wikipedia.org/wiki/PDP-11 • “PDP-11 architecture”, Wikipedia. http://en.wikipedia.org/wiki/PDP-11_architecture