CS251 – Software Engineering • Lecture 3: Software Crisis • Mohammad El-Ramly, PhD • Some slides are taken from others • Ariane 5: Ian Sommerville • http://www.acadox.com/join/75UDWT
Lecture 3 Outline • Software Chronic Crisis • Cost of Software Errors • Ariane 5 • Therac 25 • Characteristics of Software Systems
I. Software’s Chronic Crisis • Chronic • of long duration, continuing • marked by frequent recurrence • Crisis • a crucial state of affairs in which a decisive change is impending; especially one with the possibility of an undesirable outcome
Software’s Chronic Crisis • Many software systems fail to serve their purpose or may even cause harm. • Cost overruns • Schedule delays • Reduced functional deliverables • Many projects never deliver and are cancelled.
Software’s Chronic Crisis • Software product size is increasing exponentially • faster, smaller, cheaper hardware • Software is everywhere: from TV sets to cell-phones • Software is in safety-critical systems • cars, airplanes, nuclear-power plants
Software Is Everywhere (1) • System software – a collection of programs written to service other programs • Heavy interaction with computer hardware, multiple users, complex systems. • Examples: operating systems, drivers, telecommunication systems, etc. • Application software – standalone programs that solve a specific business or technical need • Examples: office applications
Software Is Everywhere (2) • Business software – business information processing • Management information system (MIS) that accesses one or more databases containing business information (e.g., payroll, inventory) • Engineering and scientific software (e.g., numerical estimations, simulation, etc.) • Web-based software • e-commerce, social networks, etc.
Software Is Everywhere (3) • Real-time software – monitors, analyzes, and controls real-world events as they occur • Response time typically ranges from 1 ms to 1 sec • Automotive software, autopilot, etc. • Embedded software – controls products and systems for consumer and industrial markets, e.g., digital TV. • Artificial intelligence software – uses non-numerical algorithms to solve complex problems • Robotics, games, pattern recognition, etc. • Mobile Applications
Software’s Chronic Crisis • We are the only industry that states something like this on its product licenses: • DISCLAIMER OF WARRANTY. The software is licensed "as is", "with all faults", and "as available". You therefore bear the risk of using it. Microsoft, its distributors, and any of our respective affiliates and suppliers (referred to below as "Distributors") give no express warranties, guarantees, or conditions under or in relation to the software. You may have additional consumer rights under your local laws which this agreement cannot change. To the extent permitted under your local laws, the Distributors exclude any implied warranties or conditions, including those of merchantability, fitness for a particular purpose, and non-infringement. • LIMITATION ON AND EXCLUSION OF DAMAGES. You can recover from Microsoft and its suppliers only direct damages, up to the amount you paid for the software. You cannot recover any other damages, including consequential, lost profits, special, indirect, or incidental damages.
Software’s Chronic Crisis • Failure of software causes the loss of: • Time • Money • User satisfaction, and • LIVES
Software Crisis Example • Therac-25: 1985 - 1987 • Computerized radiation therapy machines made by Atomic Energy of Canada Limited (AECL) • Massive radiation overdoses by the Therac-25 between 6/85 and 1/87 • 3 deaths and 3 serious injuries • The error was due to a race condition and a UI problem that were never caught in testing.
Software Crisis Examples … • Patriot MIM-104 (Feb 25, 1991) • Failure to intercept a Scud missile, caused by a software error in the system clock (accumulated timing drift) • 28 US soldiers killed • Ariane 5 (June 4, 1996) • Software errors in inertial reference system • Mars Climate Orbiter (Sept 1999) • Crashed because of metric / English unit confusion
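The Patriot clock error can be made concrete with a little arithmetic. The sketch below is an illustration, not the missile system's code: it assumes the commonly reported account in which time was counted in tenths of a second and 0.1 was chopped to a fixed-point value in a 24-bit register. The constant `FRACTION_BITS` and both function names are hypothetical.

```python
# 0.1 has no exact binary representation, so a chopped fixed-point copy of it
# is slightly too small, and the error grows with every clock tick.
FRACTION_BITS = 23        # assumption: fractional bits kept in the 24-bit register
TICKS_PER_SECOND = 10     # assumption: the clock counted tenths of a second


def chopped_tenth(bits: int = FRACTION_BITS) -> float:
    """Return 0.1 truncated (not rounded) to `bits` binary fraction digits."""
    scale = 1 << bits
    return int(0.1 * scale) / scale


def clock_drift_seconds(hours_up: float) -> float:
    """Accumulated timing error after `hours_up` hours of continuous operation."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    return ticks * (0.1 - chopped_tenth())


drift = clock_drift_seconds(100)
print(f"drift after 100 hours: {drift:.4f} s")
```

Under these assumptions the drift after 100 hours of uptime is roughly a third of a second, which at the speed of an incoming missile corresponds to a tracking error of hundreds of metres.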
Northeast Blackout of 2003 • 508 generating units and 256 power plants shut down • Affected 10 million people in Ontario, Canada • Affected 40 million people in 8 US states • Financial losses of $6 Billion USD • The alarm system in the energy management system failed due to a software error, and operators were not informed of the power overload in the system
Software Crisis Examples … • The day the phones stopped ringing • AT&T long distance network crash (15/1/90) • A misplaced break in a C switch statement, in millions (10^6s) of lines of code • In late 1989, AT&T engineers upgraded the software of their 114 US switching centers. • These computers make the connections so your phone links to the one you are calling. • On 15/1/1990, they stopped working. The duplicate (backup) computers ran the same software. • 70 million calls failed. AT&T lost $1 billion as customers fled to their competitors.
Software Crisis Examples … • Bank of America – MasterNet • Spent $23M on an initial 5-year accounting & reporting system • Spent $600M trying to make it work • Project cancelled • Lost customer accounts worth billions of dollars
Software Crisis Examples … • Allstate Insurance – In 1982 • $8M computer system to automate business • EDS providing software • Initial 5 year project continued for 10 years, until 1993 • Cost approached $100M
Software Crisis Examples … • Blue Cross and Blue Shield of Wisconsin - 1983 • EDS hired to build $200M computer system • Delivered on time in 18 months • System didn’t work – issued $60M in overpayments and duplicate checks • Blue Cross lost 35,000 policy holders by 1987
Software Crisis Examples … • Vodafone Flex Plan (my mother) • Short SMS with a summary of your spending for the day • Comes 70 times a day x 20 pt • Numerous calls to customer service with no help • Enter code • We will cancel it for you • We will write a report • It is a general problem, we’re working on it
Software Crisis Stats • Standish Group ‘94 CHAOS Report: • The US spends $250B on IT projects • 31.3% of projects will be cancelled before being completed • 52.7% will cost 189% of original est. • 16% = 100% - 31.3% - 52.7% is success rate. • 78.4% of software projects deployed with at least 74.2% of features • $140B in project waste
The Software Crisis - Improving • March 2004 CHAOS Chronicles Report shows big improvements: • Project success rate increased to 34% (1994: 16% success rate) • Project failure rate declined to 15% (1994: 31% failure rate) • Challenged projects account for the remaining 51%
The Software Crisis - Improving • 51% of challenged projects have a lower overrun ratio than in 2000 • 43% average cost overrun (1994: 180% cost overrun) • $55B spent on project waste (1994: $140B project waste) • $17B cost overruns (1994: $59B) • $38B lost projects (1994: $81B)
The Software Crisis - Improving • Not all good news: • Time overruns increased to 82% (2000: 63% time overruns) • 52% of required features/functions make it in released product (2000: 67% features in final product) (1994: 74% features in final product)
Lecture 3 Outline • Software Chronic Crisis • Cost of Software Errors • Ariane 5 • Therac 25 • London Ambulance System • Characteristics of Software Systems
The Ariane 5 Launcher Failure • June 4th, 1996 • Total failure of the Ariane 5 launcher on its first flight
Ariane 5 • A European rocket designed to launch commercial payloads (communications satellites, etc.) into Earth orbit • Successor to the successful Ariane 4 launchers • Ariane 5 can carry a heavier payload than Ariane 4
Launcher Failure • Approximately 37 seconds after a successful lift-off, the Ariane 5 launcher lost control • Incorrect control signals were sent to the engines and these swivelled so that unsustainable stresses were imposed on the rocket • It started to break up and self-destructed • The system failure was a direct result of a software failure.
The problem • The attitude and trajectory of the rocket are measured by a computer-based system. A number conversion error in this system caused an extreme position to be transmitted to the engines. • The software failed and the system shut down. • A backup system took over, but it ran the same software, made the same mistake, and also shut down.
Software failure • Failure occurred when converting a 64-bit floating point number to a signed 16-bit integer. This caused an overflow. • There was no exception handler associated with the conversion so the system exception management facilities were invoked. These shut down the software. • The backup software was a copy and behaved in exactly the same way.
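The conversion failure described above can be sketched in a few lines. This is an illustrative model only, not the flight software: the function names `to_int16_checked` and `convert_sensor_value` are hypothetical, and Python's `OverflowError` merely stands in for whatever exception the real system raised.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a signed 16-bit integer


def to_int16_checked(value: float) -> int:
    """Convert a float to a signed 16-bit integer, raising if it does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return result


def convert_sensor_value(value: float) -> int:
    # Deliberately no try/except: an out-of-range value propagates upward
    # and stops the caller, much like the unhandled exception on the slide.
    return to_int16_checked(value)


print(convert_sensor_value(1234.9))   # a value inside the 16-bit range
try:
    convert_sensor_value(40000.0)     # a value outside the 16-bit range
except OverflowError as exc:
    print("conversion failed:", exc)
```

A backup that runs the same code on the same input fails at the same line, which is why duplicating the software gave no protection here.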
Avoidable failure? • The software that failed was reused from the Ariane 4 launch vehicle. The computation that resulted in overflow was not used by Ariane 5. • Decisions were made • Not to remove the facility as this could introduce new faults • Not to test for overflow exceptions because the processor was heavily loaded. It was desirable to have some spare processor capacity
Why not Ariane 4? • The physical characteristics of Ariane 4 (a smaller vehicle) are such that it has a lower initial acceleration and build-up of horizontal velocity than Ariane 5 • The value of the variable on Ariane 4 could never reach a level that caused overflow during the launch period.
Validation Failure • As the facility that failed was not required for Ariane 5, there was no requirement associated with it. • As there were no requirements, no tests were done for this part. Hence there was no chance of discovering the problem. • During system testing, simulators of the inertial reference system computers were used. These did not generate the error as there was no requirement!
Review failure • The design and code of all software should be reviewed for problems during the development process • Either • The inertial reference system software was not reviewed because it had been used in a previous version • The review failed to expose the problem.
Lessons Learned • Don’t run software in critical systems unless it is actually needed • As well as testing for what the system should do, you may also have to test for what the system should not do • Do not have a default exception handling response which is system shut-down in systems that have no fail-safe state
Lessons learned • In critical computations, always return best effort values even if the absolutely correct values cannot be computed • Wherever possible, use real equipment and not simulations • Improve the review process to include external participants and review all assumptions made in the code
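One way to read the "best effort values" lesson is a saturating conversion: instead of raising when a value does not fit, return the nearest representable value together with a flag saying whether it was in range. The sketch below is a hypothetical illustration of that idea; the function name and the choice to map NaN to 0 are assumptions, not something stated on the slides.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767  # range of a signed 16-bit integer


def to_int16_saturating(value: float) -> tuple[int, bool]:
    """Return (best-effort 16-bit value, True if the input was in range)."""
    if math.isnan(value):
        return 0, False  # assumption: fall back to 0 when there is no value at all
    clamped = max(INT16_MIN, min(INT16_MAX, value))
    return int(clamped), clamped == value


print(to_int16_saturating(1234.9))
print(to_int16_saturating(40000.0))
```

The caller keeps running with a degraded value and can log or react to the flag, rather than losing the whole computation to a shut-down.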
Lecture 3 Outline • Software Chronic Crisis • Cost of Software Errors • Ariane 5 • Therac 25 • London Ambulance System • Characteristics of Software Systems
Therac 25 • 1985-1987 • Therac-25 medical accelerator. A radiation therapy device malfunctions and delivers lethal radiation doses at several medical facilities. • Therac-25 was an "improved" therapy system that could deliver two different kinds of radiation: either a low-power electron beam (beta particles) or X-rays.
Therac 25 • The Therac-25's X-rays were generated by smashing high-power electrons into a metal target positioned between the electron gun and the patient. • Because of a subtle bug called a "race condition," a quick-fingered typist could accidentally configure the Therac-25 so the electron beam would fire in high-power mode but with the metal X-ray target out of position.
Therac 25 • It took two years to find this bug. • At first, the company denied that the fault could lie in the device or its software. • 6 patients died or were seriously injured because of this. • The company left the medical equipment business.
Lessons Learned • AECL had ignored safety aspects of software • Confused reliability with safety • Lack of defensive design • Inadequate reporting and follow up • No explanation / follow up of Ontario accident • Inadequate software engineering practices: • Specs were an afterthought, complex architecture, dangerous coding, little testing, careless HCI design…
III. Characteristics of Software Systems • How are software systems different from other engineering products? • How is building software different from building a building?
Werewolves • Out of all the scary monsters of the past, werewolves were the scariest because they change shape without notice.
Software Werewolf • Software is like a werewolf—it looks normal until the moon comes out and it turns into a monster • Missed deadlines • Blown budgets • Buggy software
No Silver Bullet • In 1987, in an article titled: “No Silver Bullet: Essence and Accidents of Software Engineering” • Frederick P. Brooks made the argument that there is no silver bullet that can kill the werewolf software projects