Disaster Management at the Tier-1

Andrew Sansum, 2nd April 2009, RAL

Presentation Transcript


  1. Disaster Management at the Tier-1. Andrew Sansum, 2nd April 2009, RAL

  2. Do You Recognise This? Burnt-out UPS battery at ASGC: clearly a disaster. Tier-1 Status

  3. Do You Recognise This?

  4. Challenger Disaster

  5. Cause of Challenger Disaster
     • It was the “O” rings, wasn’t it? “[The Rogers commission] found that the Challenger accident was caused by a failure in the O-rings … The failure of the O-rings was attributed to a design flaw, as their performance could be too easily compromised by factors including the low temperature on the day of launch”
     • Yes, but there were underlying cause(s):
       • Communication problems: “..failures in communication... resulted in a decision to launch 51-L based on incomplete and sometimes misleading information, a conflict between engineering data and management judgments, and a NASA management structure that permitted internal flight safety problems to bypass key Shuttle managers.”
       • Management errors: “The Commission found that as early as 1977, NASA managers had not only known about the flawed O-ring, but that it had the potential for catastrophe.”

  6. Why considered a disaster?
     • People died: “Challenger disintegrated about seventy-three seconds after launch, killing the seven astronauts aboard”
     • NASA’s reputation was badly damaged: “It also represented a serious blow to NASA's reputation, colouring the public perception of piloted spaceflight ..”
     • Financial losses and reduced funding opportunity: “…and affecting the agency's ability to gain continued funding from Congress.”
     • Couldn’t meet operational commitments: “Following the Challenger disaster, NASA grounded the remainder of the shuttle fleet while the risks were assessed more thoroughly, design flaws were identified, and modifications were developed and implemented.”

  7. Identify Potential Disasters
     • We do not (usually) mean the same thing by “disaster” as is meant by the “Challenger Disaster”
     • Nevertheless, there are many outcomes we wish to avoid
     • The Tier-1 Disaster Management plan seeks to identify circumstances that have the potential to significantly impact:
       • Safety
       • Service commitments
       • Reputation
       • Finances

  8. Some Disasters
     • Can construct a list of obvious disasters, e.g.:
       • Fire/flood etc.
       • Loss of network
       • Security incident
     • We did this in the form of a risk analysis: DPv0.8.mht
     • Also have previous experience:
       • CASTOR 2.1.7 upgrade
       • Disk firmware problems made it impossible to run delivered H/W
       • R89 delays (unable to manage deliveries)
       • Backplane burnout (not a disaster but very close)
     • Common themes:
       • The ones we generated tended to be operational and to start suddenly
       • The ones we suffered were slow-moving project management problems
     • Also need to be able to manage un-thought-of disasters

  9. Evolution of a Disaster. Sometimes fast; sometimes slow, but with a similar result

  10. A Strategy
      • Create a Disaster Management System which handles all potential disasters in a similar way
      • Identify common features and trigger levels to allow us to spot events before they blossom into disaster
      • Mess with existing processes as little as possible
      • Build specific contingency plans which add to the general response in specific circumstances
      • Trigger early, trigger often, respond ahead of the curve
      • Make use of the system routinely:
        • Stops the system decaying
        • Gives operational and project management benefits

  11. Don’t Confuse Disaster with Routine Ops. Loss of power is not a disaster … but failure of a routine restart may lead to disaster

  12. Routine Operations
      • We already have:
        • Production Team (Gareth, John Kelly and Tiju)
        • Admin on Duty (daytime)
        • On-call (night-time)
      • Routine operations should be:
        • Looking for problems
        • Fixing things
        • Calling experts
        • Notifying users
        • Setting downtimes
        • Assessing seriousness
        • Reviewing events and improving future response
      • Not part of the Disaster Management System
      • But prevents many things moving into the system

  13. Need Escalating Response
      • Start lightweight (Stage 1: Disaster Potential)
        • Informally assess/triage
        • Monitor/compare against standard contingencies
        • Set deadlines
        • Watch for things leaving the expected script, but avoid interfering
      • Add some internal management (Stage 2: Disaster Possible)
        • Add internal (group) oversight
        • Formally assess
        • Interfere more, divert resources
      • Escalate response to imminent disaster (Stage 3: Disaster Likely)
        • Broaden oversight and expertise (include GridPP + department)
        • Regular meetings with experiments
        • Prepare contingencies
      • Manage actual disaster (Stage 4: Disaster)
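
The escalating response on this slide can be read as a small ordered state machine. The following is a minimal Python sketch of that reading, not the Tier-1's actual tooling: the stage names come from the slide, while the numbering of the two middle stages, the `ROUTINE` placeholder, the `escalate` helper and the one-step-at-a-time rule are all assumptions made for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Escalation stages named on the slide; ROUTINE is an added placeholder."""
    ROUTINE = 0             # assumption: routine operations, outside the system
    DISASTER_POTENTIAL = 1  # lightweight: informal triage, deadlines
    DISASTER_POSSIBLE = 2   # internal (group) oversight, formal assessment
    DISASTER_LIKELY = 3     # broader oversight, contingencies prepared
    DISASTER = 4            # manage the actual disaster


def escalate(current: Stage, criteria_met: bool) -> Stage:
    """Move up one stage when the criteria for the next stage are met.

    Assumption: escalation happens one stage at a time and never passes
    the final DISASTER stage.
    """
    if criteria_met and current < Stage.DISASTER:
        return Stage(current + 1)
    return current
```

Using an ordered enumeration keeps comparisons such as `stage >= Stage.DISASTER_LIKELY` readable, which suits the "trigger early, trigger often" approach of checking criteria frequently.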

  14. At each stage
      • Formal list of pre-defined communications:
        • Notify team of deadline to escalation
        • Notify PMB that the incident is moving onto the disaster track
        • Notify esc senior staff
        • Advise Press & PR (as disaster approaches)
        • ….
      • Formal list of actions that should be carried out, e.g.:
        • Define roles
        • Hold Incident Review Meeting
        • Start process to obtain financial approval
        • Arrange exceptional experiment liaison meeting
        • Review policy documents
        • ….
      • Formal list of criteria that get you to the next stage

  15. Contingency Plans
      • Contingency plans supplement the general disaster management system
      • For each stage in the general system, supplement with:
        • Criteria to reach (or avoid) this stage
        • Actions to take at this stage
        • Communications to make at this stage
      • Example contingency plan: Contingency_Plan_Major_Security_Incident.mht
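
One way to picture how a contingency plan supplements the general system is as a per-stage record of three lists (communications, actions, criteria), with the contingency plan's entries appended to the general ones. The sketch below assumes that reading. `StagePlan` and `supplement` are hypothetical names, the general entries are copied from the slides, and the security-incident entry is invented for illustration because the contents of the example plan are not part of this transcript.

```python
from dataclasses import dataclass, field


@dataclass
class StagePlan:
    """The three formal lists the slides attach to each stage."""
    communications: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    criteria_for_next_stage: list[str] = field(default_factory=list)


def supplement(general: StagePlan, contingency: StagePlan) -> StagePlan:
    """Return a new plan: the general items followed by the contingency items."""
    return StagePlan(
        communications=general.communications + contingency.communications,
        actions=general.actions + contingency.actions,
        criteria_for_next_stage=(
            general.criteria_for_next_stage + contingency.criteria_for_next_stage
        ),
    )


# General entries are taken from the slides.
general_stage = StagePlan(
    communications=["Notify team of deadline to escalation"],
    actions=["Define roles", "Hold Incident Review Meeting"],
)
# Hypothetical contingency entry, invented for this example only.
security_incident = StagePlan(actions=["Isolate affected hosts"])

combined = supplement(general_stage, security_incident)
```

Building a new `StagePlan` rather than mutating the general one keeps the general system unchanged, so the same general stage can be combined with several different contingency plans.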

  16. Conclusions
      • Disaster Management System is working. Already managed:
        • Site DNS failure (reached Stage 1)
        • Power failure (reached Stage 2)
      • Doesn’t replace our existing processes
        • But does make sure they are responding correctly
      • Expect it to manage equally well:
        • Operations failures (network down and out)
        • Project management failures (building delivered late)
        • Unexpected problems (e.g. man from Mars at door)
      • Working well and giving immediate benefit
      • Doesn’t avoid planning for aftermath of building fire (but will help manage situation)
      • Still working on contingency planning and experiment requirements
