ACIS Ops Future Response to BEP Watchdog Reboots ACIS Ops Team
Future Responses to BEP watchdog reboots
ACIS Ops believes the best course of action after a BEP watchdog reboot is to warm boot the BEP with the standard operating procedure SOP_ACIS_WARMBOOT_DEAHOUSEKEEPING (est 30 min). We would like Flight Director (FD) concurrence to take this action in the future, with rapid FD approval during the initial anomaly telecon.
• Data systems has never processed data from software version 11. The upgrade to version 26 happened on Jul 27, 1999.
• A warm boot will tell us immediately whether further action is needed. If the warm boot is successful, the software in memory (currently version 31) will be reloaded. The daily loads can then continue, and ACIS personnel can analyse the data while the mission continues. If the warm boot is NOT successful, we will have a more urgent action to take. This will allow us to use our limited resources more efficiently.
• This should be our standard response to a watchdog reboot, with rapid FD approval. It avoids long discussions and a possible extension of comm in order to take action.
History of ACIS Flight SW Patches
• Version 11: in PROM on board. ACIS reverts to this after a watchdog reboot.
• Version 26: Standard A (7/27/99): Standard A patches for biastiming (SPR 117), corruptblock (SPR 113), digestbiaserror (SPR 116), histogramvar (SPR 115), rquad (SPR 121), histogrammean (SPR 123), and zap1expo (SPR 122).
• Version 27: Standard A Optional A (7/29/99): Added the following optional patches: event histogram, CC3x3.
• Version 30: Standard B Optional B (1/8/00): Added the following standard B patches: condoclk (SPR 127), fepbiasparity2 (SPR 130), and cornermean (SPR 128).
• Version 31: Standard B Optional C (6/3/04): Added the following Optional C patches: compressall (SPR 134) and smtimedlookup (N/A).
• Version 44: Standard C Optional C (10/1/08, removed 10/6/08): Added the following standard C patches: tlmbusy (SPR 138) and buscrash (SPR 140).
Items underlined are important to have correct science runs and/or for CXCDS to process data correctly.
Conditions needed to perform a Warmboot
• Expect all hardware telemetry to be normal except for the watchdog reboot flag.
• Expect to be seeing software housekeeping telemetry.
• Expect the SW in memory to have loaded and booted properly at the last upgrade, and to have been running for several days with multiple successful science observations. This condition is subject to the exact nature of the last patches and the science run at the time of the watchdog reboot.
• If there is an indication of a problem in one of the above areas, an assessment of the situation is required before action is taken. The SOP_61010_DEA_HKP (est 10 min) should be run at this time to restart the DEA housekeeping while maintaining SW version 11.
• A checklist of expected states is to be verified before requesting a warmboot.
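The conditions above amount to a small decision rule, which can be sketched as follows. This is only an illustration: the class, field, and function names are hypothetical and are not part of any ACIS Ops tool, and the real checklist of expected states is more detailed than three booleans.

```python
from dataclasses import dataclass


@dataclass
class PostRebootState:
    """Hypothetical summary of the checklist items listed on the slide."""
    # All hardware telemetry normal, apart from the watchdog reboot flag.
    hardware_telemetry_nominal: bool
    # Software housekeeping telemetry is being received.
    sw_housekeeping_present: bool
    # SW in memory booted properly at the last upgrade and has run for
    # several days with multiple successful science observations.
    loaded_sw_proven: bool


def recommend_sop(state: PostRebootState) -> str:
    """Return the SOP named on the slide for the given (assumed) state."""
    if (state.hardware_telemetry_nominal
            and state.sw_housekeeping_present
            and state.loaded_sw_proven):
        # All conditions met: request the warm boot.
        return "SOP_ACIS_WARMBOOT_DEAHOUSEKEEPING"
    # Any doubt: assess first, and restart DEA housekeeping while
    # staying on SW version 11.
    return "SOP_61010_DEA_HKP"
```

For example, `recommend_sop(PostRebootState(True, True, False))` falls through to the housekeeping-only SOP, because the software in memory does not yet have a track record of successful observations.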
The Anomaly:
• During ObsId 9209 (day 279), a fast TOO, the BEP performed a watchdog reboot as the result of a BEP Hardware Exception.
Operating Conditions at Time of Reboot
• ACIS software version 44, loaded on day 275.
• First CC mode since loading v44. We had completed 8 successful TE mode observations with v44 software at the time of reboot.
• Reboot happened after 7122 seconds of data collection.
Cause of Reboot
• Believed to be either a hardware error (SEU) or a software error related to v44 patches. Please see Peter Ford’s memo: “Investigation of the OBSID 9209 Anomaly”
ACIS Ops Team actions
• (10/4/08) 23:44 EDT ACIS Ops on-call personnel called OC/CC to report issue when alerted by software.
• (10/5/08) 00:02 EDT Telecon started, alert sent to sot_red_alert.
  • Discussed two possible actions: warm reboot or reload version 31 software.
  • Requested data from dump as soon as possible.
• (10/5/08) 00:49 EDT Planned on discussing results from data analysis and performing one of the above actions at the next pass. Set time for next telecon at 10/5/08 5:30.
• (10/5/08) 2:17 EDT Dump data on colossus for ACIS Team.
• (10/5/08) 2:17-3:00 EDT Data analysis gave time of the BEP Hardware Exception. Determined about 7700 seconds of data were taken. Based on these items, ACIS Ops recommended a warm reboot to return to version 44 software and to continue observations.
• (10/5/08) 7:50 EDT Performed warm reboot and DEA housekeeping restart on day 279.
• (10/5/08) Continued to analyse data. Peter Ford of ACIS MIT team supported a reload of version 31 software in case of an unknown bug in version 44.
• (10/5/08) 22:20 EDT Decision was supported by the team and version 31 was uploaded on day 280 before the replan load was started.
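To put the timeline above in perspective, the elapsed time from the first alert to each recovery step can be worked out directly from the timestamps on the slide. The snippet below is a minimal sketch of that arithmetic; the variable and function names are illustrative only, and all times are taken as EDT exactly as listed.

```python
from datetime import datetime, timedelta

FMT = "%m/%d/%y %H:%M"

# Timestamps copied from the timeline above (all EDT).
alert = datetime.strptime("10/04/08 23:44", FMT)       # on-call reports issue
warm_boot = datetime.strptime("10/05/08 07:50", FMT)   # warm reboot performed
v31_upload = datetime.strptime("10/05/08 22:20", FMT)  # version 31 uploaded


def hours_minutes(delta: timedelta) -> tuple[int, int]:
    """Split a time difference into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


print("Alert to warm boot:  %d h %02d min" % hours_minutes(warm_boot - alert))
print("Alert to v31 upload: %d h %02d min" % hours_minutes(v31_upload - alert))
```

On these figures the warm boot came roughly eight hours after the alert, and the reload of version 31 roughly twenty-two and a half hours after it.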
Estimated timeline with fast warmboot approval.
• (10/4/08) 23:44 EDT ACIS Ops on-call personnel called OC/CC to report issue when alerted by software.
• (10/5/08) 00:02 EDT Telecon started, alert sent to sot_red_alert.
  • Discussed two possible actions: warm reboot or reload version 31 software.
  • Requested data from dump as soon as possible.
  • See if we could extend comm and execute SOP_ACIS_WARMBOOT_DEAHOUSEKEEPING (estimate 10 minutes to execute).
• (10/5/08) 00:49 EDT Planned on discussing results from data analysis and performing one of the above actions at the next pass. Set time for next telecon at 10/5/08 5:30.
• (10/5/08) 2:17 EDT Dump data on colossus for ACIS Team.
• (10/5/08) 2:17-3:00 EDT Data analysis. Could have made the suggestion to reload version 31 at this point OR waited until morning to do data analysis.
• (10/5/08) 7:50 EDT Could have reloaded version 31 at this point.
• (10/5/08) Continued to analyse data. Peter Ford of ACIS MIT team supported a reload of version 31 software in case of an unknown bug in version 44.
• (10/5/08) 22:20 EDT Decision was supported by the team and version 31 was uploaded on day 280 before the replan load was started (could have been done in the morning instead).