Exercising Disaster Recovery
The plan is no better than the exercise program.
Miami University Information Technology Services has committed to exercising and testing its disaster-recovery plan at least twice a year. Techniques for developing and evaluating tabletop and drill exercises will be presented.
Ohio Higher Education Computing Council 2006
Who is JdK?
I am an IT Guy!
• Decades of experience in Systems Programming.
• Information Technology Infrastructure Library (ITIL) Certificate of Competency
• Certified Business Continuity Professional – Disaster Recovery Institute International.
Systems Integration
• Now pursuing PMP…
With good relations with the business side of the university.
KinneJd@MUOhio.Edu
OHECC 4/20/2006
Overview
• What we want from Exercising
• What we did
• What we got
What does Miami want from a Disaster Recovery Exercise?
• Preparation
• Training
• Relationship building
• Publicity
• Evaluation
• Improvement
(maybe have a little bit of fun)
What does Miami want from a Disaster Recovery Exercise?
For a Financial Services exercise:
• Pay Payroll
• Pay Vendors
• Manage Cash
• Maintain information
What We Did
• Types of Exercises
• Project methodology
• An Example
Types of Disaster Recovery Exercises
• Walk Through
• Tabletop
• Drill (Operational)
Walk Through
Format: 1-hour meeting with a few staff; walk through a specific DR procedure
Participants: facilitator and trainees
Purpose: train staff to use DR procedures; evaluate procedures
Preparation:
• Distribute procedure before meeting
• Facilitator should have and understand the DR procedure
Tabletop
Format: 4-hour meeting: exercise & debriefing. Talk through a specific disaster scenario
Participants: Players, Evaluators, Observers, Controllers, from multiple departments
Purpose: Preparation, Training, … Evaluation, Improvement
Preparation:
• Objectives
• Scenario
• Evaluation Criteria
• People
Drill
Format: All day: exercise & debriefing. Work through a specific disaster scenario
Participants: Players, Evaluators, Observers, Controllers
Purpose: Preparation, Training, … Evaluation, Improvement
Preparation:
• Objectives
• Scenario
• Evaluation Criteria
• People
Project methodology
Seven Step Process
• Concept
• Initiation
• Requirements
• Development
• Validation
• Deployment
• Close
An Example Drill
• MU Disaster Tolerance Architecture
• Exercise Philosophy
• Scenario
• Anticipated Schedule of Events
• Exercise Documentation
MU Disaster Tolerance Architecture
Exercise Philosophy
• No harm to production environments
• Partnership between IT & Client
• Client chaired Evaluation Team
• 80 / 20 Rule
  • 80% of the results from 20% of everything that could be tested.
• Start with 1 pound weights
Sample Scenario
Today is Friday, December 14th, 2005. The skies are overcast and it is snowing lightly. The current temperature is 12°F.
At 8:30 a.m., the lights in Hoyt Hall flickered and went out. Within a few seconds, they came back on and went out again. All lights, workstations (with the exception of a few laptops), and other electrical devices are without power. The Machine Room’s emergency lighting and indicator lights are lit indicating servers are still powered up. The Physical Facilities Department Operations Center was notified by telephone of an apparent failure of the Hoyt emergency generator.
Within minutes the fire alarm is activated. Bright strobe lights and the high-pitched shrill of the fire alarm filled the building. Occupants grabbed jackets, purses and laptops and began evacuating the building. Before leaving, someone called 911 to report the fire alarm. Police Dispatch received the call from Hoyt Hall at 8:37 a.m. By 8:42 a.m. occupants of Hoyt Hall have left the building. Police and PFD staffs arrive on the scene by 8:45. A metal rod is found sticking out of the generator at 8:55. Domestic terrorism is highly suspected and the Miami University Emergency Operations Center is activated.
This is a critical day for Payroll Services. Student payroll is scheduled to be paid. In addition, Accounts Payable needs to process their regular check runs to pay vendors and refund students. Treasury Services needs to process the daily cash and investment transactions.
PFD informs the Information Technology Services’ Computing and Network Operations Center (CNOC) staff that the Hoyt machine room UPS has approximately 90 minutes of capacity. After 90 minutes the machine room will be without electrical power.
Anticipated Schedule of Events
9:30 Controllers review Player’s Handbook with Players & other participants.
  Assistant Drill Controller reads scenario to IT players
  Failover based services continue to be available
Approx 9:40 Deputy CIO appoints a Disaster Recovery Coordinator (DRC)
Thereafter: DRC pursues recovery of services on “failover” equipment.
Approx 10:30 Primary site is completely powered down. No electricity & no one allowed in.
  Lead Drill Controller informs Finance players that they no longer have IT services.
Approx 12:00 Recovered services made available to Finance for test transactions.
Thereafter: Finance staff pursues sample transactions
  • Accounts Payable
  • Payroll
  • Treasury Services
3:30 Debriefing
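The schedule above implies a time budget for each phase of the drill. The following is a minimal sketch, not part of the original exercise materials, that treats the "Approx" times as exact and assumes the 3:30 debriefing is in the afternoon (15:30). The names `SCHEDULE`, `minutes_between` and `phase_durations` are hypothetical.

```python
from datetime import datetime

# Milestones from the anticipated schedule. "Approx" times are treated as
# exact, and 3:30 is assumed to mean 15:30 (both are assumptions).
SCHEDULE = [
    ("09:30", "Handbook review; scenario read to IT players"),
    ("09:40", "Deputy CIO appoints Disaster Recovery Coordinator"),
    ("10:30", "Primary site completely powered down"),
    ("12:00", "Recovered services available to Finance"),
    ("15:30", "Debriefing"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(schedule):
    """Pair each milestone with the minutes until the next milestone."""
    return [
        (current[1], minutes_between(current[0], following[0]))
        for current, following in zip(schedule, schedule[1:])
    ]


if __name__ == "__main__":
    for label, minutes in phase_durations(SCHEDULE):
        print(f"{minutes:4d} min  {label}")
```

Under these assumptions, Finance is without IT services for about 90 minutes (10:30 to 12:00) and then has about 210 minutes for sample transactions before the debriefing.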
Exercise Documentation
• Exercise Plan
• Evaluation Plan
• Participants’ Handbook
• Memo to Participants
Exercise Plan
Simulated Drill Infrastructure
Evaluation Plan
• Exercise Objectives
  • Effective communication
  • Identify appropriate measures to restore financial services
  • Resolve technical engineering questions
  • Demonstrate the level of knowledge
  • Demonstrate the adequacy of current procedures, practices and knowledge
Participants’ Handbook
Assumptions
• Finance & Business Services and Information Technology Services have established emergency plans and procedures. Those documents include mitigation, response and recovery elements. They may be brought to and used at the exercise.
• Players will respond in accordance with the existing plans, procedures and policies. In the absence of applicable plans, procedures or policies, players will be expected to apply individual and/or team initiative to satisfy response requirements.
• + others…
Artificialities
• The university’s banks are not participating in the exercise; procedures will prepare files for transmission but they will not be transferred.
• The disaster recovery environment is a copy of the production environment, reflecting the state of the production environment approximately two days before the exercise.
• Outage notices will not be emailed nor posted to web sites. Voice communications are tagged with “This is a disaster recovery exercise communication.”
• The secondary computing center currently hosts neither the quick recovery database server nor the Citrix server. These machines will be moved to the secondary site when it can support them.
• The Controller and Assistant Controller may add other artificialities during the drill; these should be documented for the After Action Report.
• + others…
Exercise Rules
• Players may talk to other players during the exercise. Players should work with other players to understand procedures and strategize solutions.
• In the event a player needs to talk with a non-player, the player must first consult with the controller. The controller will log the request and will approve it, disallow it, or provide the requested information.
• Evaluators, observers and other non-players should not offer advice or comments to the players unless directed to do so by the controller, who is responsible for logging the communication.
• Players should talk to the controller when they need to talk to someone for whom there is no player. For instance, if bank personnel need to be called, the player would talk to their controller, since bank personnel are not participating in the exercise.
• Follow department/university procedures when they are available. One of the goals of the exercises is to evaluate existing procedures. Another is to determine if additional procedures are needed.
• Exercise voice communications and exercise emails sent out during the event should be prefixed with THIS IS A DISASTER RECOVERY EXERCISE COMMUNICATION
• Exercise voice communications and exercise emails sent out during the event should be suffixed with THIS WAS A DISASTER RECOVERY EXERCISE COMMUNICATION, THIS IS ONLY AN EXERCISE
• Production services should not be affected by exercise activities. Since banks are not participating, care must be taken to make sure files are not transmitted to banks.
• Exercise may be rescheduled in the event of a critical incident which requires the attention of exercise participants.
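The prefix and suffix rule for exercise emails could be applied mechanically rather than by hand. The following is a minimal sketch, not a tool used in the Miami exercise: the function names `tag_exercise_message` and `is_tagged` are hypothetical, and only the two tag strings come from the rules above.

```python
# Tag strings taken from the exercise rules; everything else is illustrative.
PREFIX = "THIS IS A DISASTER RECOVERY EXERCISE COMMUNICATION"
SUFFIX = (
    "THIS WAS A DISASTER RECOVERY EXERCISE COMMUNICATION, "
    "THIS IS ONLY AN EXERCISE"
)


def tag_exercise_message(body: str) -> str:
    """Wrap a message body with the required exercise prefix and suffix."""
    return f"{PREFIX}\n\n{body.strip()}\n\n{SUFFIX}"


def is_tagged(message: str) -> bool:
    """Check that a message starts and ends with the exercise tags."""
    text = message.strip()
    return text.startswith(PREFIX) and text.endswith(SUFFIX)


if __name__ == "__main__":
    draft = "Payroll services are unavailable until further notice."
    print(tag_exercise_message(draft))
```

A controller could run a check like `is_tagged` over outgoing exercise email before it is sent, which supports the rule that production services and outside parties should not be affected by exercise activities.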
What we got – After Action Report
What we got
• Preparation
• Training
• Relationship building
• Publicity
• Evaluation
• Improvement
(maybe had a little bit of fun)
What we got
• Pay Payroll
• Pay Vendors
• Manage Cash
• Maintain information
What we got
• Project to improve 2nd Site
• Project to improve Remote Site
• Improved procedures
• Crisis Leadership Training
• Positive Auditor Review
Lessons Learned
• Start with one pound weights
• Expect creativity!
• Expect surprises as well.
• Project Manager wrote all documentation
Comments / Questions