FOR0383 Software Quality Assurance
Lecture 2
ESA Ariane 5 Rocket Flight 501
Dr Andy Brooks
4 June 1996
• at ~40 seconds into launch
• at an altitude of ~3700 m
• the launcher veered off path and began to break up
• the self-destruct system was triggered
• a loss of ~$500 million (uninsured, maiden flight)
• the launcher was unmanned
Board of Inquiry
• what was the cause of failure?
• was appropriate testing undertaken?
• what corrective actions should there be?
• the report by the Board of Inquiry was completed in less than 6 weeks
Weather conditions
• the weather was acceptable
• there was no risk of lightning
• but visibility had worsened for a time
• the launch was delayed by about 1 hour
The Challenger Space Shuttle disaster was partly due to the weather. Overnight conditions at the launch pad had been extremely cold, which meant the O-rings on the booster rockets had lost resilience and could not seal properly.
Briefly
• nominal behaviour of the launcher until H0 + 36 seconds
• the backup Inertial Reference System fails
• the active Inertial Reference System fails
  • after the backup
• all the rocket nozzles are swivelled into extreme positions
• the launcher breaks up and the self-destruct system is triggered
Recovery of material
• debris fell back to ground, scattered over a wide area (5 x 2.5 km)
• despite mangrove swamps, the two Inertial Reference Systems were recovered
• telemetry data was received on the ground
• trajectory data was received from radar stations
• optical observations (camera and film)
Unrelated Anomaly
• at H0 + 22 seconds
• variations started in the hydraulic pressure of the actuators of the main engine nozzle with a frequency of 10 Hz
• “This phenomenon is significant and has not yet been fully explained, but after consideration it has not been found relevant to the failure.”
Inertial Reference System (SRI)
• complex piece of equipment
• measures attitude and movements in space
• output transmitted to the On-Board Computer (OBC) executing the flight control program
• to improve reliability, two SRIs operated in parallel with identical hardware and software
First question to ask: how is the system backed up?
Equipment Redundancy
• there are two On-Board Computers
• and a number of other units in the flight control system are also duplicated
So, what really happened?
• the OBC received incorrect data
• the SRI had declared a failure due to a software exception (Operand Error)
• a 64-bit floating-point value was too large to be converted to the target 16-bit signed integer
• this particular data conversion was not protected
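The failing conversion can be illustrated with a minimal Python sketch. This is not the flight code: the names `OperandError` and `to_int16_unprotected` and the sample values are assumptions made purely for illustration. Python integers do not overflow, so the range check imitates what a fixed-width conversion does when the result does not fit.

```python
INT16_MIN = -32768
INT16_MAX = 32767


class OperandError(Exception):
    """Hypothetical stand-in for the exception raised inside the SRI."""


def to_int16_unprotected(value: float) -> int:
    """Convert a float to a 16-bit signed integer with no fallback.

    The caller gets an exception, not a value, when the result
    cannot be represented in 16 bits.
    """
    result = int(value)  # truncate toward zero
    if result < INT16_MIN or result > INT16_MAX:
        raise OperandError(f"{value!r} does not fit in 16 bits")
    return result


print(to_int16_unprotected(20000.0))  # fits in 16 bits
try:
    to_int16_unprotected(40000.0)  # illustrative value that is too large
except OperandError as exc:
    print("conversion failed:", exc)
```

If nothing catches the exception, the whole computation stops, which mirrors the SRI declaring a failure.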
…Different Trajectory
• the operand error occurred because Ariane 5 built up horizontal velocity much more quickly than Ariane 4
  • about five times more quickly
• the failure context was precisely determined from memory readouts from the recovered SRIs
Ariane family
…No useful purpose
• the software module which generated the exception served no useful purpose after launch!
• it was simply re-used from Ariane 4
“Effective reuse requires design by contract. Without a precise specification attached to each reusable component - precondition, postcondition, invariant - no one can trust a supposedly reusable component. Without a specification, it is probably safer to redo than to reuse.” Jean-Marc Jézéquel and Bertrand Meyer, IEEE Computer, January 1997, p130
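As a rough idea of what a contract attached to a reusable component might look like, the Python sketch below wraps a function with an explicit precondition and postcondition. The decorator `contract`, the exception `ContractViolation` and the function `convert_horizontal_value` are all hypothetical names, and the range used is only an assumed example of a limit that held for one launcher but not the next. Invariants, which apply to objects with state, are not shown.

```python
import functools

INT16_MIN, INT16_MAX = -32768, 32767


class ContractViolation(Exception):
    """Raised when a caller or the component breaks the stated contract."""


def contract(pre, post):
    """Attach a precondition and a postcondition to a function."""
    def decorate(fn):
        @functools.wraps(fn)
        def checked(*args):
            if not pre(*args):
                raise ContractViolation(
                    f"precondition of {fn.__name__} violated by {args!r}")
            result = fn(*args)
            if not post(result):
                raise ContractViolation(
                    f"postcondition of {fn.__name__} violated by {result!r}")
            return result
        return checked
    return decorate


@contract(
    pre=lambda value: INT16_MIN <= value <= INT16_MAX,   # assumed input range
    post=lambda result: INT16_MIN <= result <= INT16_MAX,
)
def convert_horizontal_value(value: float) -> int:
    """Hypothetical reusable component: only valid for inputs in range."""
    return int(value)


print(convert_horizontal_value(1234.5))
try:
    convert_horizontal_value(50000.0)  # outside the stated precondition
except ContractViolation as exc:
    print("reuse outside contract:", exc)
```

The point of the sketch is that the assumption travels with the component, so anyone reusing it in a new setting is confronted with the range it was written for.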
Unprotected variables?
• 3 variables were unprotected “because a maximum workload target of 80% had been set for the SRI computer”
• remember, this is a real-time system
• the justification was not given in source code
• the reasoning was that variables were either physically limited or there was a large safety margin
• this was true for Ariane 4
• the decision to protect some but not all of the variables was taken jointly by project partners
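One possible form of protection is a saturating conversion, sketched below in Python. The slides do not say how the protected variables were actually guarded, so the function `to_int16_protected` and its clamp-and-flag behaviour are assumptions. The extra comparisons on every call are the kind of processing cost that a workload target has to be weighed against.

```python
import math
from typing import Tuple

INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_protected(value: float) -> Tuple[int, bool]:
    """Convert to a 16-bit signed integer without ever raising.

    Returns (converted_value, in_range). Out-of-range inputs are clamped
    to the nearest representable value and flagged with in_range=False.
    """
    if math.isnan(value):
        return 0, False              # no meaningful value to report
    if value >= INT16_MAX + 1:
        return INT16_MAX, False      # clamp high
    if value <= INT16_MIN - 1:
        return INT16_MIN, False      # clamp low
    return int(value), True          # truncation keeps the result in range


print(to_int16_protected(20000.0))   # (20000, True)
print(to_int16_protected(40000.0))   # (32767, False)
```

Whether a clamped value is safe to use depends on the consumer, which is why the flag is returned alongside it.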
The specification of exception-handling contributed to the failure.
• the failure should be indicated on the databus
• the OBC interpreted the diagnostic data it was sent as valid data, causing the nozzle deflections
• remember, the backup SRI failed first
• the failure context should be stored in EEPROM memory
• the SRI processor should be shut down
• this approach addressed random hardware failures
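The confusion between diagnostic data and flight data can be illustrated with typed messages. The Python sketch below is not the Ariane databus format: `AttitudeData`, `DiagnosticData` and `select_attitude` are hypothetical names, and falling back to the last good measurement is just one assumed policy.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class AttitudeData:
    """Hypothetical valid measurement sent by an SRI."""
    pitch_deg: float
    yaw_deg: float


@dataclass(frozen=True)
class DiagnosticData:
    """Hypothetical failure report sent by an SRI."""
    error_code: int


BusMessage = Union[AttitudeData, DiagnosticData]


def select_attitude(message: BusMessage,
                    last_good: Optional[AttitudeData]) -> Optional[AttitudeData]:
    """Return attitude data to steer by, never a diagnostic payload.

    A diagnostic message is recognised by its type, so its contents
    cannot be misread as a measurement.
    """
    if isinstance(message, AttitudeData):
        return message
    return last_good  # assumed policy: keep the last trusted measurement


good = AttitudeData(pitch_deg=1.0, yaw_deg=-0.5)
print(select_attitude(good, None))
print(select_attitude(DiagnosticData(error_code=7), good))
```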
Testing
• no test was performed to verify that the SRI would behave correctly when subjected to the count-down and trajectory of Ariane 5
• the SRI specification did not contain Ariane 5 trajectory data as a functional requirement
“It would have been technically feasible to include almost the entire inertial reference system in the overall system simulations which were performed. For a number of reasons it was decided to use the simulated output of the inertial reference system, not the system itself or its detailed simulation. Had the system been included, the failure could have been detected.”
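A test of the kind that was missing could, in spirit, feed a trajectory profile through the conversion and report when it first fails. The Python sketch below does this with made-up numbers: the ramp rates, the units and the 40-second window are assumptions for illustration and are not Ariane data, and `to_int16`, `ramp` and `first_failure` are hypothetical names.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16(value: float) -> int:
    """Conversion under test: raises OverflowError when the value does not fit."""
    result = int(value)
    if result < INT16_MIN or result > INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in 16 bits")
    return result


def ramp(rate_per_second: float, duration_s: int):
    """Made-up profile: a value growing linearly, sampled once per second."""
    return [(t, rate_per_second * t) for t in range(duration_s + 1)]


def first_failure(profile, convert):
    """Return the first time at which convert() fails, or None if it never does."""
    for t, value in profile:
        try:
            convert(value)
        except OverflowError:
            return t
    return None


old_profile = ramp(500.0, 40)    # illustrative "previous launcher" profile
new_profile = ramp(2500.0, 40)   # same shape, five times steeper

print(first_failure(old_profile, to_int16))  # None
print(first_failure(new_profile, to_int16))  # 14
```

The same component passes on the old profile and fails on the new one, which is only visible if the new profile is actually run through it.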
Recommendations
• R1 … no software function should run during flight unless it is needed
• R2 … test facility must include as much real equipment as possible… Complete simulations must take place...
• R3 … do not allow sensors to stop sending best-effort data
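Recommendation R1 can be pictured as each function declaring the phases in which it is needed, with everything else switched off. The Python sketch below is a toy illustration under that assumption: `Phase`, `Task`, `runnable_tasks` and the task names are hypothetical and do not describe the real flight software.

```python
from enum import Enum


class Phase(Enum):
    PRE_LAUNCH = 1
    FLIGHT = 2


class Task:
    """Hypothetical software function tagged with the phases that need it."""

    def __init__(self, name, needed_in):
        self.name = name
        self.needed_in = frozenset(needed_in)


def runnable_tasks(tasks, phase):
    """Return the names of the tasks that should run in the given phase."""
    return [task.name for task in tasks if phase in task.needed_in]


tasks = [
    Task("pre_launch_only_function", {Phase.PRE_LAUNCH}),
    Task("attitude_measurement", {Phase.PRE_LAUNCH, Phase.FLIGHT}),
]

print(runnable_tasks(tasks, Phase.PRE_LAUNCH))
print(runnable_tasks(tasks, Phase.FLIGHT))
```

A function that is not scheduled in flight cannot raise an exception in flight, whatever values the new trajectory produces.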
… more Recommendations
• R5 review all flight software… identify all implicit assumptions
• R9 include external participants when reviewing specifications, code and justification documents (someone with a fresh mind can sometimes easily spot mistakes that the authors miss)
• R14 provide more transparent organisation of co-operation among partners