Failure in the PATHFINDER Mission
Chandan Kumar
EE 585: Fault Tolerant Computing
Outline
• Background
• Simplified view of H/W architecture
• S/W architecture
• Failure
• Cause
• Correction
EE 585: Case Study
Background
• Launched Dec 4, 1996
• Landed July 4, 1997
Mission Objectives:
• To prove that the development of "faster, better and cheaper" spacecraft is possible (with three years for development and a cost under US$ 150 million).
• To show that it is possible to send a load of scientific instruments to another planet with a simple system and at one fifteenth the cost of a Viking mission.
Background Contd.
• To demonstrate NASA's commitment to low-cost planetary exploration by finishing the mission with a total expenditure of US$ 280 million, including the launch vehicle and mission operations.
• To demonstrate the mobility and usefulness of a micro rover on the surface of Mars.
• The mission carried a number of scientific instruments.
Mars Pathfinder Lander:
• Imager for Mars Pathfinder (IMP) (includes magnetometer and anemometer)
• Atmospheric and meteorological sensors (ASI/MET)
Background Contd.
Rover Sojourner:
• Imaging system (three cameras: front B&W stereo pair, one rear color)
• Laser striper hazard detection system
• Alpha Proton X-ray Spectrometer (APXS)
• Wheel Abrasion Experiment
• Material Adherence Experiment
• Accelerometers
• Potentiometers
• Final transmission: Sept 27, 1997
• 16,500 images sent from the lander, 550 from the rover
• 15 analyses of rocks
Simplified view of Hardware Architecture
• Single CPU: controls the spacecraft.
• The CPU resides on a VME bus.
• Interface cards for the radio and the camera.
• Interface to a 1553 bus.
• The 1553 bus connects to the 'cruise' and 'lander' stages.
• H/W on the cruise stage: controls thrusters, etc.
• H/W on the lander: interfaces to instruments such as the accelerometer, radar altimeter and ASI/MET.
The Software Architecture
            |<--------------------- .125 seconds --------------------->|
other tasks |<**************|                    |*********|       |**>|
1553 bus    |<- bus active->|
bc_dist                     |<- bc_dist active ->|
bc_sched                                                   |<----->|
            |---------------|--------------------|---------|-------|---|
            t1              t2                   t3        t4      t5  t1
The *** are periods when tasks other than the ones listed are executing. There is some idle time.
• t1 - bus hardware starts via hardware control on the 8 Hz boundary. The transactions for this cycle had been set up by the previous execution of the bc_sched task.
• t2 - 1553 traffic is complete and the bc_dist task is awakened.
• t3 - bc_dist task has completed all of the data distribution.
• t4 - bc_sched task is awakened to set up transactions for the next cycle.
• t5 - bc_sched activity is complete.
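The timing rule implied by the cycle above (bc_dist must reach t3 before bc_sched wakes at t4) can be sketched as a simple check. This is only an illustration: the `Cycle` type, the function names and all the offsets in seconds are assumptions made for the example, not values from the flight software.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    # All fields are hypothetical offsets in seconds from t1 (start of the 8 Hz cycle).
    t2: float             # 1553 traffic complete, bc_dist awakened
    dist_duration: float  # time bc_dist needs, including any time spent blocked
    t4: float             # bc_sched awakened


def bc_dist_finish(cycle: Cycle) -> float:
    """t3: the moment bc_dist completes its data distribution."""
    return cycle.t2 + cycle.dist_duration


def deadline_met(cycle: Cycle) -> bool:
    """bc_sched expects bc_dist to be finished when it wakes at t4."""
    return bc_dist_finish(cycle) <= cycle.t4


# Illustrative numbers only.
nominal = Cycle(t2=0.030, dist_duration=0.040, t4=0.100)
blocked = Cycle(t2=0.030, dist_duration=0.090, t4=0.100)
print(deadline_met(nominal))  # True
print(deadline_met(blocked))  # False
```

In the second case bc_dist is still running when bc_sched wakes, so the deadline is missed.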
The Failure:
• The spacecraft began experiencing total system resets.
• Each reset reinitializes all of the hardware and software. It also terminates the execution of the current ground-commanded activities.
• The remaining activities for that day were not carried out until the next day.
The Cause
• The failure was a case of priority inversion.
• In scheduling, priority inversion is the scenario where a low priority task holds a shared resource that is required by a high priority task.
• The high priority task is blocked until the low priority task releases the resource, effectively "inverting" the relative priorities of the two tasks.
• If a medium priority task that does not need the resource becomes ready in the interim, it preempts the low priority task and therefore delays the blocked high priority task as well.
The Cause Contd.
• The spacecraft identified the failure as the bc_dist task not completing its execution before the bc_sched task started.
• The ASI/MET task receives its information via an interprocess communication (IPC) mechanism.
• The IPC mechanism is based on pipes.
• The higher priority bc_dist task was blocked by the much lower priority ASI/MET task, which was holding a shared resource.
The Cause Contd.
• The resource that caused this problem was a mutual exclusion semaphore used within the select() mechanism.
• The ASI/MET task had acquired this semaphore and was then preempted by several of the medium priority tasks.
• The bc_dist task attempted to send the newest ASI/MET data via the IPC mechanism, which wrote to the pipe. The pipe write blocked while trying to take the semaphore.
The Cause Contd.
• The medium priority tasks ran, still not allowing the ASI/MET task to run, until the bc_sched task was awakened.
• At that point, the bc_sched task determined that the bc_dist task had not completed its cycle (a hard deadline in the system) and declared the error that initiated the reset.
Correction
• The creation flags for the semaphore were changed to enable priority inheritance.
• Modifying the semaphore associated with the pipe used for bc_dist task to ASI/MET task communication corrected the problem.
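The effect of the fix can be sketched with a small tick-based simulation of fixed-priority preemptive scheduling with one shared mutex. This is not the VxWorks scheduler: the `Task` type, the program strings, the priorities, the task lengths and the deadline of 6 ticks are all assumptions chosen to reproduce the sequence of events described on the previous slides.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int  # larger number = higher priority
    release: int   # tick at which the task becomes ready
    program: str   # 'L' = take mutex, 'U' = give mutex, 'C' = compute for one tick


def simulate(tasks, inherit, horizon=50):
    """Return {task name: finish tick}; 'inherit' turns priority inheritance on."""
    pc = {t.name: 0 for t in tasks}
    finish = {}
    owner = None  # task currently holding the single shared mutex

    def next_op(t):
        return t.program[pc[t.name]] if pc[t.name] < len(t.program) else None

    for tick in range(horizon):
        while True:
            ready = [t for t in tasks if t.release <= tick and next_op(t) is not None]
            blocked = [t for t in ready if next_op(t) == 'L' and owner not in (None, t)]
            runnable = [t for t in ready if t not in blocked]
            if not runnable:
                break  # idle tick
            eff = {t.name: t.priority for t in ready}
            if inherit and owner is not None and blocked:
                # The holder temporarily runs at the priority of its highest waiter.
                eff[owner.name] = max(owner.priority, max(b.priority for b in blocked))
            task = max(runnable, key=lambda t: eff[t.name])
            op = next_op(task)
            if op in ('L', 'U'):  # taking or giving the mutex costs no time here
                owner = task if op == 'L' else None
                pc[task.name] += 1
                continue
            pc[task.name] += 1  # compute for this tick
            while next_op(task) == 'U':  # give the mutex as soon as the work is done
                owner = None
                pc[task.name] += 1
            if next_op(task) is None:
                finish[task.name] = tick + 1
            break
    return finish


# Hypothetical workload mirroring the slides: low priority ASI/MET holds the
# mutex, medium priority work arrives, high priority bc_dist needs the mutex.
tasks = [
    Task("asi_met", priority=1, release=0, program="LCCU"),
    Task("medium", priority=2, release=1, program="CCCCC"),
    Task("bc_dist", priority=3, release=1, program="LCU"),
]
DEADLINE = 6  # hypothetical tick at which bc_sched would check on bc_dist

for inherit in (False, True):
    done = simulate(tasks, inherit)["bc_dist"]
    print(f"inherit={inherit}: bc_dist finishes at tick {done}, "
          f"deadline met: {done <= DEADLINE}")
```

Without inheritance, bc_dist finishes at tick 8 because the medium priority work runs ahead of the mutex holder. With inheritance, the holder is boosted, releases the mutex, and bc_dist finishes at tick 3.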
S/W modification on the spacecraft
• Patching is a specialised process.
• Send only the difference between what is onboard and what you want on the spacecraft.
• S/W on the spacecraft then modifies the onboard copy.
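The "send only the difference" idea can be sketched as a byte-level delta. The slides do not describe the actual patch format used on the spacecraft, so the `make_patch` and `apply_patch` functions, the (offset, bytes) representation and the sample byte strings below are purely illustrative assumptions.

```python
def make_patch(onboard: bytes, target: bytes):
    """Ground side: return (target length, [(offset, replacement bytes), ...])."""
    patch = []
    i, n = 0, len(target)
    while i < n:
        if i < len(onboard) and onboard[i] == target[i]:
            i += 1  # byte already correct onboard, nothing to send
            continue
        start = i
        while i < n and not (i < len(onboard) and onboard[i] == target[i]):
            i += 1  # extend the run of differing bytes
        patch.append((start, target[start:i]))
    return n, patch


def apply_patch(onboard: bytes, patch_info) -> bytes:
    """Spacecraft side: modify the onboard copy using the received differences."""
    length, patch = patch_info
    image = bytearray(onboard[:length].ljust(length, b"\x00"))
    for offset, data in patch:
        image[offset:offset + len(data)] = data
    return bytes(image)


# Made-up images: only one byte differs, so only one byte is uplinked.
onboard = b"flags=0x00;mode=A"
target = b"flags=0x08;mode=A"
length, patch = make_patch(onboard, target)
print(patch)  # [(9, b'8')]
print(apply_patch(onboard, (length, patch)) == target)  # True
```

The payload that has to be sent is the list of changed runs, which is much smaller than the full image when the change is small.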
Questions?
References
• http://mars.jpl.nasa.gov/missions/past/pathfinder.html
• http://research.microsoft.com/%7embj/Mars_Pathfinder/Authoritative_Account.html
• http://en.wikipedia.org/wiki/Mars_Pathfinder