COMS W3156: Software Engineering, Fall 2001 Lecture #2: The Open Class Janak J Parekh janak@cs.columbia.edu
Important terminology (I) • NEW: Different colors from previous version. • ALL NEW: Software is not compatible with previous version. • UNMATCHED: Almost as good as the competition. • ADVANCED DESIGN: Upper management doesn't understand it. • NO MAINTENANCE: Impossible to fix.
Important terminology (II) • BREAKTHROUGH: It finally booted on the first try. • DESIGN SIMPLICITY: Developed on a shoestring budget. • UPGRADED: Did not work the first time. • UPGRADED AND IMPROVED: Did not work the second time.
Some leftover points from last class • Plagiarism: I was being cute last time – you will get into trouble if you are caught. • Books: They’re available from Papyrus, 114th and Broadway • Office hours: Sorry about this week… • Questionnaire: finally done, see http://softe.cs.columbia.edu • C/C++ students, talk to me
Next class – course “begins” • Read chapters 1 and 4 of Schach, if you have the book • The first one should be a breeze (introduction); the fourth isn’t that bad (teams) • We will also start discussing the project in detail in the next class • Recitations will begin next week
Why Software Engineering? • We started discussing this last class • Mythical Man-Month: start reading it when you get a chance; we’ll go over it later • In the meantime, let’s discuss some case studies of how software engineering (or the lack thereof) shaped the fate of real systems
Success/Failure: Mars Rover (I) • http://catless.ncl.ac.uk/Risks/19.49.html#subj1 • In 1997, the public was told that “software glitches” and “too many things trying to be done at once” caused the Pathfinder’s failures • In reality, “priority inversion” was at fault
Success/Failure: Mars Rover (II) • There were three main threads, scheduled preemptively • Information bus data-moving: high priority, frequent • Meteorological data-gathering: low priority, occasional • Communications task: medium priority, occasional • Occasionally, the communications task would be scheduled during a blocked information bus operation, since the bus was waiting on the meteorological data-gathering task
Success/Failure: Mars Rover (III) • The communications task prevented the meteorological work from completing, since it ran at a higher priority • A watchdog timer then fired because the info bus appeared “dead”, resetting the entire system • The low-priority meteorological task upended the system: “priority inversion”
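To make the failure mode concrete, here is a minimal POSIX-threads sketch of the same three-task structure. The task names (meteo_task, comms_task, bus_task) are mine, not JPL's, and this is not the Pathfinder flight code, which ran on VxWorks rather than pthreads:

```c
/* Minimal priority-inversion sketch (POSIX threads, SCHED_FIFO).
 * Task names are hypothetical -- this is not JPL's flight code.
 * Needs privileges to set real-time priorities; pin to one CPU
 * (e.g. `taskset -c 0 ./a.out` on Linux) to observe the hang. */
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static pthread_mutex_t bus_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Low priority: occasionally takes the bus mutex to publish data. */
static void *meteo_task(void *arg) {
    (void)arg;
    pthread_mutex_lock(&bus_mutex);
    usleep(100 * 1000);        /* in the critical section we get preempted
                                  by comms_task and never run again,
                                  so we never reach the unlock */
    pthread_mutex_unlock(&bus_mutex);
    return NULL;
}

/* Medium priority: CPU-bound; takes no mutex, but under SCHED_FIFO
 * it starves the low-priority mutex holder indefinitely. */
static void *comms_task(void *arg) {
    (void)arg;
    for (;;) { /* ... send telemetry ... */ }
}

/* High priority: blocks on the mutex held by the LOW-priority task,
 * so the system effectively runs at medium priority: inversion.
 * On Pathfinder, a watchdog fired when this wait grew too long. */
static void *bus_task(void *arg) {
    (void)arg;
    pthread_mutex_lock(&bus_mutex);
    /* ... move data across the information bus ... */
    pthread_mutex_unlock(&bus_mutex);
    return NULL;
}

static pthread_t spawn(void *(*fn)(void *), int prio) {
    pthread_t t;
    pthread_attr_t a;
    struct sched_param p = { .sched_priority = prio };
    pthread_attr_init(&a);
    pthread_attr_setinheritsched(&a, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&a, SCHED_FIFO);
    pthread_attr_setschedparam(&a, &p);
    pthread_create(&t, &a, fn, NULL);
    return t;
}

int main(void) {
    /* Keep main above all three tasks so it can finish spawning. */
    struct sched_param p = { .sched_priority = 40 };
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &p);

    pthread_t lo  = spawn(meteo_task, 10);
    usleep(10 * 1000);             /* let meteo_task grab the mutex */
    pthread_t mid = spawn(comms_task, 20);
    pthread_t hi  = spawn(bus_task, 30);

    pthread_join(hi, NULL);        /* never returns: that IS the inversion */
    pthread_join(mid, NULL);
    pthread_join(lo, NULL);
    return 0;
}
```

Pinned to a single CPU, the program hangs at the first join; on the spacecraft, the analogous hang is what tripped the watchdog reset.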
Success/Failure: Mars Rover (IV) • Good news • They had left debugging mode on • The Rover was running VxWorks, a real-time OS with tracing capabilities • They managed to trace the source • Lastly, VxWorks supports priority inheritance: a task holding a resource temporarily inherits the priority of the highest-priority task blocked on it • As a consequence, they were able to upload a small change to solve the crash
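In POSIX terms, that fix is the priority-inheritance protocol on the mutex. The sketch below uses the pthreads spelling of the idea, not the actual JPL patch, which flipped the equivalent option on a VxWorks mutex semaphore:

```c
/* Same structure as the previous sketch, but with priority
 * inheritance enabled on the mutex (POSIX spelling of the idea). */
#include <pthread.h>

static pthread_mutex_t bus_mutex;

static void init_bus_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* While bus_task is blocked on this mutex, whoever holds it
     * (meteo_task) runs at bus_task's priority, so comms_task can
     * no longer starve it out of its critical section. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&bus_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}
```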
Lessons: Mars Rover • Black box testing would have been impossible – they had to see interrupts, etc. • Therefore, leaving the debugging facilities on afterwards was a big win • Designing for maintenance • The data bus task being frequent and short did not protect it from starvation
Failure: Therac-25 (I) • http://sunnyday.mit.edu/papers/therac.pdf – don’t read it if you are squeamish • The Therac-25 was a linear accelerator, introduced in 1982, that treated cancer by delivering limited doses of radiation • This model was software-controlled as opposed to hardware-controlled; previous units had used software merely for convenience
Failure: Therac-25 (II) • Controlled by a PDP-11 computer; safety was handled in software • In case of error, the software was designed to prevent harmful effects • However, on a software error, only cryptic codes were reported to the operator: “MALFUNCTION xx”, where xx ranged from 1 to 64
Failure: Therac-25 (III) • Operators became desensitized to the errors; they happened often, and operators had been told it was impossible to overdose a patient • However, between 1985 and 1987, six people received massive overdoses of radiation; several of them died
Failure: Therac-25 (IV) • Main cause: • A race condition arose when the operator entered data quickly, then hit the UP arrow key to correct an entry, and values weren’t reset properly • AECL (the manufacturer) never noticed the quick data-entry path – its own people didn’t use the machine on a daily basis • Apparently the flaw existed in previous units too, but those had a hardware interlock mechanism to prevent it; here, the software was trusted and the hardware interlock removed (see the sketch below)
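As a much-simplified sketch of this class of bug (illustrative C with hypothetical names; the real Therac-25 code was PDP-11 assembly): the setup task samples the operator's entries once, so an edit made a moment later is silently lost:

```c
/* Much-simplified illustration of the Therac-25 data-entry race.
 * All names are hypothetical; this is not AECL's code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MODE_XRAY     1
#define MODE_ELECTRON 2

/* Shared between the two tasks with no locking and no re-check:
 * that is the bug. */
static volatile int  beam_mode;
static volatile bool entry_complete;

/* Stand-in for the hardware: per Leveson's report, setting the bending
 * magnets took roughly 8 seconds on the real machine. */
static void configure_magnets(int mode) { (void)mode; sleep(1); }

/* Operator finishes the entry screen, then quickly UP-arrows back to
 * fix the mode field -- but setup_task has already sampled it. */
static void *keyboard_task(void *arg) {
    (void)arg;
    beam_mode = MODE_XRAY;
    entry_complete = true;        /* cursor leaves the entry screen */
    usleep(1000);                 /* fast edit a moment later...    */
    beam_mode = MODE_ELECTRON;    /* ...is silently lost below      */
    return NULL;
}

static void *setup_task(void *arg) {
    (void)arg;
    while (!entry_complete)
        ;                         /* wait for the operator to finish  */
    int mode = beam_mode;         /* sampled ONCE; later edits ignored */
    configure_magnets(mode);
    printf("firing beam in mode %d (screen now shows mode %d)\n",
           mode, beam_mode);
    return NULL;
}

int main(void) {
    pthread_t kb, setup;
    pthread_create(&setup, NULL, setup_task, NULL);
    pthread_create(&kb, NULL, keyboard_task, NULL);
    pthread_join(kb, NULL);
    pthread_join(setup, NULL);
    return 0;
}
```

In the earlier Therac models the hardware interlock blocked the beam regardless of what the software concluded, which is why the same race never hurt anyone there.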
Lessons from Therac-25 (I) • Overconfidence in software, especially for embedded systems • Reliability != safety • No defensive design, bizarre error messages • They just “bugfixed” instead of looking for root causes • Complacency
Lessons from Therac-25 (II) • Improper software engineering practices • In reality, most testing was done in a simulated environment and on the assembled unit; there was little if any unit-level software testing • They claimed 2700 hours of testing; it was really 2700 hours “of use” • Overly complex, poorly organized design • Blind software reuse
Is there a “successful” way? • Hard to say – software engineering is an imprecise field • There’s always “room to improve” • Nevertheless, there are many examples where initial investments that seemed large were quickly offset by million-dollar cost savings • See the book