
National Cancer Institute Center for Biomedical Informatics and Information Technology (CBIIT)

National Cancer Institute Center for Biomedical Informatics and Information Technology (CBIIT). 2009 Exercise for the NCI CBIIT LAN General Support System IT Contingency Plan Tabletop Exercise. March 31, 2009.


Presentation Transcript


  1. National Cancer Institute Center for Biomedical Informatics and Information Technology (CBIIT) 2009 Exercise for the NCI CBIIT LAN General Support System IT Contingency Plan Tabletop Exercise March 31, 2009 This document is confidential and is intended solely for the use and information of the client to whom it is addressed.

  2. Orientation, Training, and Exercise – Program Schedule • Training and Exercise Activities (50 min.) • Introduction • Scenario 1 Discussion • Scenario 2 Discussion • Summary/Q&A • Supporting Information and Wrap Up (5 min.) • Next steps/Comments and Corrections • Collect signed Key Personnel Acceptance forms

  3. Training and Exercise – Goals and Objectives • This training and exercise session seeks to accomplish the following: • Build on the orientation session • Familiarize you with the processes involved in implementing the plan • Raise your awareness of the contingency plan • Raise awareness of your role and responsibilities under the plan • Identify the three primary phases of the plan: • Identification/Notification • Recovery • Resumption/Restoration • Validate current contingency plan processes and procedures, and capture enhancements to the plan

  4. Process Flow - Putting It All Together

  5. Definitions: Minor vs. Major System Failure • Minor System Failure • A localized threat that causes the target system or environment (e.g., NCI LAN or major components therein) to be inoperable for less than the defined Maximum Allowable Outage (MAO) threshold (e.g., 2 hours for e-Dir, 1 day for I2E). • Parts of the system are still operational and the facility that houses the failed component or application has not been physically damaged to an extent that requires evacuation. • The CP will not be activated for a minor application system failure. • Major System Failure • A localized threat that causes the target system or environment to be inoperable for more than the Maximum Allowable Outage (MAO), or one that affects the facility or hardware, and requires temporary restoration of services at an alternate location or on alternate resources. • Parts of the system may still be operational but the facility that houses the failed component(s) may have been physically damaged to an extent that requires personnel evacuation. • The CP will be activated for a major system failure.
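
The minor/major distinction on this slide can be read as a simple decision rule. The Python sketch below is only an illustration of that rule, not part of the NCI plan: the names `Assessment`, `classify`, and `cp_activated` are hypothetical, the two MAO values are the examples quoted on the slide, "1 day" is assumed to mean 24 hours, and an outage exactly equal to the MAO is assumed (conservatively) to count as major, since the slide does not say.

```python
from dataclasses import dataclass

# Example MAO thresholds in hours, taken from the slide's "e.g." values.
# Assumption: "1 day" for I2E is interpreted as 24 hours.
EXAMPLE_MAO_HOURS = {"e-Dir": 2.0, "I2E": 24.0}


@dataclass
class Assessment:
    """Hypothetical summary of a damage assessment for one system."""
    system: str
    estimated_outage_hours: float
    # True when the facility or hardware is affected and services must be
    # restored at an alternate location or on alternate resources.
    needs_alternate_resources: bool = False


def classify(a: Assessment, mao_hours=EXAMPLE_MAO_HOURS) -> str:
    """Return "major" or "minor" following the slide's definitions."""
    mao = mao_hours[a.system]
    # Assumption: an outage equal to the MAO is treated as major.
    if a.needs_alternate_resources or a.estimated_outage_hours >= mao:
        return "major"
    return "minor"


def cp_activated(a: Assessment, mao_hours=EXAMPLE_MAO_HOURS) -> bool:
    """The CP is activated only for a major system failure."""
    return classify(a, mao_hours) == "major"


if __name__ == "__main__":
    print(classify(Assessment("I2E", 4)))      # below the 24-hour example MAO
    print(classify(Assessment("e-Dir", 4)))    # above the 2-hour example MAO
```

Under these assumptions, the same 4-hour outage is minor for a system with a 1-day MAO and major for one with a 2-hour MAO, which is why the MAO must be known per system before the CPC makes the call.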

  6. Tabletop Exercise – Minor Failure Scenario Background • Following a scheduled code release/upgrade on Wednesday at 8:00 am, the Database Team discovers that an application’s data set has been erased. • The Database Team immediately notifies the Development Team Lead and the Product Line Manager of the issue. • During Damage Assessment, it is discovered that a script that was inadvertently left on the system is to blame. The Development Team Lead determines that the data will need to be restored from tape, and notifies the CP Coordinator (CPC) of the issue along with an estimated recovery time of 4 hours. • The CPC quickly makes the determination that, in this case, a contingency WILL NOT be declared because the system would be fixed before exceeding the application’s MAO. • The Database Team requests that the last known good set of data be restored from backup media. • As the recovery effort begins, the Development Team Lead updates the Deployment Ticket in NCS3 with relevant information regarding the data issue and recovery plan. • The Storage Team restores the data, and notifies the Web and DB Teams when data is restored. • The Web Team conducts initial ‘smoke testing’ and notifies the DB Team that they can conduct necessary data configuration and final testing. • The DB Team reconfigures the data as needed, conducts final acceptance testing, and declares the system ready for use.
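
The CPC's timing judgment in this scenario is plain clock arithmetic: the expected completion time is compared against the point at which the MAO would be exceeded. The sketch below assumes values the slide does not give: the calendar date (April 1, 2009, chosen only because it is a Wednesday) and a 1-day MAO for the unnamed application. The function name `recovery_within_mao` is hypothetical.

```python
from datetime import datetime, timedelta


def recovery_within_mao(detected_at, estimated_recovery, mao):
    """Return (expected completion time, True if it falls before the MAO deadline)."""
    expected_done = detected_at + estimated_recovery
    deadline = detected_at + mao
    return expected_done, expected_done < deadline


if __name__ == "__main__":
    # Assumptions: the date is illustrative and the 1-day MAO is hypothetical.
    detected = datetime(2009, 4, 1, 8, 0)  # Wednesday, 8:00 am
    done, ok = recovery_within_mao(detected, timedelta(hours=4), timedelta(days=1))
    print(done.strftime("%A %H:%M"), "within MAO" if ok else "exceeds MAO")
```

With the scenario's 8:00 am start and 4-hour estimate, the expected completion is noon the same day.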

  7. Tabletop Exercise – Minor Failure Scenario (cont.) Group Discussion • What roles does the CPC have in this scenario? • The CPC may only be peripherally involved for common events like this one, with prescribed and tested remedial activities. • The CPC may direct that notice be sent to the system owner and/or all affected users, alerting them to the issue and expected recovery time. • The CPC should ensure that updates are being noted in the central ticketing system(s). • What CP roles do the Database and Web teams have in this scenario? • The Database and Web teams are the first responders who notice and report the issue. • They act as the damage assessment team and provide recovery estimates. • They act as the recovery teams responsible for restoring the lost data and testing the system. • The teams should keep their team leads and the CPC informed throughout recovery with periodic progress updates. • They should notify the CPC once the system has been restored and tested. • Based on the Damage Assessment that was carried out, what key parts of the application infrastructure were affected by this incident and required recovery operations? • Building Facilities? No. • IT Infrastructure Failure (i.e., Server, Database)? Yes. • Application Failure? Yes (as a result of the temporary data loss).

  8. Tabletop Exercise – Minor Failure Scenario (cont.) – 4 hours later Status Update • By noon on Wednesday the data recovery is complete, and the system has been successfully tested. • The Database and Web Team representatives working the issue notify the Development Team Lead that the application is running normally. • The Development Team Lead updates the trouble ticket in NCS3 and notifies the Product Line Manager that the recovery is complete, and the system is ‘ready to go live.’

  9. Tabletop Exercise – Minor Failure Scenario (cont.) What activities should take place next? • The Development Team Lead and Product Line Manager should monitor the system for an adequate period to ensure it is functioning normally. • The CPC should gather input from the Web and Database teams and document any lessons learned. • Although no formal After Action Report is required for a minor failure, a quick incident review (postmortem) may be instrumental in helping to identify and fix process flaws. • Implement and document any changes that result from the incident review.

  10. Process Flow - Putting It All Together (Minor Failure)

  11. Tabletop Exercise – Major Failure Background • At 2 pm on Thursday, a widespread power failure in the Rockville area occurred, knocking out primary power to NCI Executive Plaza as well as the surrounding metro area. • Because some systems were not automatically and promptly shut down, and because the 6116 HVAC system is not on auxiliary building power, excessive heat buildup in the 6116 Exec. Rm. 175 server room damaged a critical drive array that contained a large volume of original research data. The affected data had not been recently backed up. • During damage assessment it was discovered that the data on the failed array would require a lengthy and expensive data recovery by an outside vendor. The vendor estimated that this process would take approximately 1 week. • The CPC determined that a Contingency Event should be declared for the affected application (LPG) and that: • A) Backup data set should be restored to an alternate array; or • B) Alternate interim processing should be implemented. Because there was no backup copy of the affected data set, and the data would take 1 week to recover from the damaged array, NCI had no choice but to use Option B.
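
The choice between Option A and Option B on this slide hinges on a single fact: whether a usable backup of the affected data set exists. A minimal sketch of that branch follows; the function name `choose_recovery_option` and the returned strings are hypothetical and simply restate the slide's two options.

```python
def choose_recovery_option(backup_available: bool) -> str:
    """Pick between the slide's two options once a Contingency Event is declared."""
    if backup_available:
        # Option A: restore the backup data set to an alternate array.
        return "A: restore backup data set to an alternate array"
    # Option B: no backup exists, so run alternate interim processing
    # while the damaged array is recovered.
    return "B: implement alternate interim processing"


if __name__ == "__main__":
    # In the scenario there was no backup copy of the affected data set.
    print(choose_recovery_option(backup_available=False))
```

In the exercise scenario the input is `False`, so the sketch returns Option B, matching the outcome described on the slide.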

  12. Tabletop Exercise – Major Failure (cont.) Group Discussion • What roles does the CPC have in this scenario? • The CPC requests the damage assessment from the infrastructure teams (e.g., data, application, systems). • The CPC should direct that notice be sent to appropriate stakeholders (e.g., system owner and/or all affected users), alerting them of the issue and the expected recovery time if known. • The CPC should ensure that a service/trouble ticket is created in the central ticketing system (NCS3) and that updates are provided as warranted. • The CPC should monitor the data recovery vendor’s progress and escalate if needed to the NCI CIO (CIO acts as the CP Director for NCI). • The CPC should implement interim processing capabilities if possible to allow limited functionality of the affected system(s)/application(s).

  13. Tabletop Exercise – Major Failure (cont.) Group Discussion (cont.) • What role does the Infrastructure team have in this scenario? • The Storage and Security Team conducts the damage assessment and reports its findings, via its team leads, to the CPC. • The Storage and Security Team packages and ships the failed array, using certified express courier, to the data recovery vendor. • The Web Team should work to temporarily restore partial functionality of the system if possible, for example, by creating a temporary “scratch disk space” for the system’s users to store new data. • The Storage and Security Team should update the service ticket as warranted and keep its team leads informed throughout the recovery phase with periodic progress updates. • The Storage and Security Team will notify the CPC when the data has been successfully returned and re-configured to work with the application (and temporary data are synchronized with the restored data). • Based on the Damage Assessment that was carried out, what key parts of the application infrastructure were affected by this incident and required recovery operations? • Building Facilities? No. • IT Infrastructure Failure (i.e., Server, Database)? Yes. • Application Failure? Yes (as a result of the data loss).

  14. Tabletop Exercise – Major Failure (cont.) Status Update • One week later, on Thursday morning, the recovered data arrives back at NCI/CBIIT and is ready to be re-loaded onto a new disk array. • By Thursday afternoon, the data has all been restored to a local storage array. • The temporary data has also been synchronized with the recovered data. • The system has been “smoke tested” and declared ready for use. • The CPC is notified by the Recovery Team Lead that the system is again fully operational. • The CPC concurs, declares the systems ready for full use, and then notifies the system owner.

  15. Tabletop Exercise – Major Failure (cont.) General Discussion • Upon recovery of the failed infrastructure component(s), what resumption activities should take place? • While operating in a limited capacity or with alternate hardware, recovery personnel should prepare to return to the primary environment once the data has been recovered and returned to NCI. • Ensure test plans are ready to test all data and application interfaces as needed before returning to normal operations. • Recovery Teams should return all materials, plans, and equipment to storage or their proper location. • All sensitive materials should be destroyed or returned to secure storage. • What should the CPC do during the Resumption Phase? • The CPC notifies the business owner to terminate any manual/alternate processing operations and processes if they were initiated. • The CPC works with functional and recovery team leads to develop an After-Action Report and files it with the NCI ISSO. • Conduct a postmortem review to address weaknesses and develop a corrective action plan. • Implement the corrective action plan to address weaknesses/issues identified.

  16. Process Flow – Putting It All Together (Major Failure)

  17. Summary/Q&A • In Review • Minor system failures do not require CP activation. • Major system failures may require CP activation, especially when effective downtime exceeds Maximum Allowable Outage (MAO), or when major components/infrastructure have been affected. • Everyone is responsible for reporting incidents. • Enforcement of CP roles and responsibilities ensures that disruptions are handled systematically. • The Contingency Plan is a living document that must be periodically updated and modified to remain a viable tool. • Q&A • What are the primary ‘lessons learned’ from this exercise? • Are there any areas for improving the IT Contingency Plan? • How can you improve your contingency planning and readiness?

  18. Wrap Up • Next Steps • Comments will be reviewed and incorporated into final Contingency Plan. • Review and verify the information in the CP (e.g., the call trees, R&R). • Continue to document significant disruptions and provide reports to the ISSO for record keeping purposes. • Continue developing supporting Disaster Recovery SOPs for each critical service and application environment supported by the LAN GSS (Gavin Brennan, Zoher Anis, and Kim Dierckson are spearheading this effort). Thank you!

  19. Quiz • Who is the NCI LAN Contingency Plan Coordinator (CPC)? • What does MAO stand for? • How many ‘phases’ make up the CP? • Can you name them? • Where are the NCI LAN Emergency Operations Centers located?
