Crisis Management and DR
Larry K. Peck, Disaster Recovery Consultant
Office of Information Resources, State of Tennessee
Downtime and Availability - Factors Contributing to Downtime
• The following seven categories were identified as the factors contributing to downtime (Gartner, April 2004 survey of 145 IT organizations):
  • Software System Failure: 26%
  • Hardware System Failure: 21%
  • Network or Telecommunications Carrier Failure: 15%
  • Human Error: 13%
  • Cause Uncertain or Unknown: 12%
  • Environmental Factors (Such As Power Outages): 10%
  • Security Breach or Virus: 3%
Emergency Management Team
• The EMT is the first response to a crisis event
• Identified first responders from various functional and business units
• Disaster Assessment Teams (DAT): inspect equipment and facilities, report to the EMT
• Interfaces together:
  • Executive Management
  • Financial Management
  • Technical Management
  • Functional Response Teams
  • Press Relations Team
• Conduct two exercises per year: the first planned, the second a surprise
Planning-Preparation
• Business Impact Analysis (BIA)
  • Conducted high-level BIA as part of a recent study
  • Annual detailed BIA with every agency now in progress
  • Established annual BIA review process
Business impact analysis (BIA) and risk assessment approach
• The analysis and report are structured around the following systems and critical, dependent business processes
[Diagram: four views (Technology View, Enterprise View, Agency/Operations View, Media View). Items shown include Billing, LAN, WAN, MAN, Internal/External Applications, Financials, Call Centers, Applications, Government/Agency Communications, 3rd Party Technologies, Customer Service, Data Center/NOCs, High Speed, Telephony Services, and Telephony.]
Planning-Preparation
• New approach to system criticality identification
  • Level 1: < 5 minute RTO/RPO (0 downtime)
  • Level 2: 8 hour or less RTO/RPO
  • Level 3: 48 hour RTO/RPO
  • Level 4: 72 hour RTO/RPO
  • Level 5: NR (no specific disaster recovery requirements)
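The criticality levels above can be sketched as a small lookup. This is only an illustration, not the State's actual application: the function name `criticality_level` and the rule for requirements that fall between two levels (pick the least stringent level whose target still meets the requirement) are assumptions, since the slide lists only the level targets.

```python
from datetime import timedelta
from typing import Optional

# (level, RTO/RPO target), tightest first. Targets come from the slide.
LEVELS = [
    (1, timedelta(minutes=5)),
    (2, timedelta(hours=8)),
    (3, timedelta(hours=48)),
    (4, timedelta(hours=72)),
]


def criticality_level(required: Optional[timedelta]) -> int:
    """Map a required RTO/RPO to a criticality level (hypothetical helper).

    Assumption: a system gets the least stringent level whose target is
    still within its requirement. None means no recovery requirement.
    """
    if required is None:
        return 5  # Level 5: NR, no specific disaster recovery requirements
    chosen = 1  # anything tighter than 5 minutes falls into Level 1
    for level, target in LEVELS:
        if target <= required:
            chosen = level
    return chosen
```

Under this assumed rule, a system that must be back within 24 hours lands in Level 2, because the 48-hour target of Level 3 would not satisfy it.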
Planning-Preparation
• Implemented a new web-based Disaster Recovery Application and Inventory Planning Application
Strategies
• Outside Analysis and Review
  • Confirmed what we thought we knew: our strengths and weaknesses
  • DR for Mainframe is mature, stable, and very supportable utilizing 3rd party services
  • DR for Distributed Systems is very complex and poorly suited for 3rd party services
  • Some existing technologies are still viable
  • New approaches are necessary for others
  • Migration to a self-supporting recovery model is necessary, especially for Distributed Systems
Technologies
• Construction of Second Data Center
  • Full Tier III facility*
• Self-Recovery Model (just one example)
  • Each data center runs 50% of production
  • Each data center runs 50% of total dev/test/training
  • DR event: utilize dev/test/training hardware to recover the most critical systems
• Various data replication schemes and technologies
• Server Virtualization/Clustering over WAN/HA technologies
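The self-recovery model above can be sketched as a simple planning step: when one data center is lost, the surviving site's dev/test/training hardware is repurposed for the most critical production systems. The sketch below is a hypothetical illustration, not the State's tooling. The `System` record, the `servers` capacity unit, and the greedy ordering by criticality level are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class System:
    name: str
    level: int    # criticality level, 1 (most critical) to 5 (no DR requirement)
    servers: int  # hypothetical capacity unit needed to run the system


def plan_recovery(failed_site: List[System], spare_servers: int) -> Tuple[List[str], List[str]]:
    """Greedily place the most critical systems on repurposed dev/test hardware.

    Assumption: Level 5 systems are never recovered, and a system is
    deferred when the remaining spare capacity cannot hold it.
    """
    recovered: List[str] = []
    deferred: List[str] = []
    remaining = spare_servers
    # Most critical first; within a level, smaller systems first.
    for system in sorted(failed_site, key=lambda s: (s.level, s.servers)):
        if system.level < 5 and system.servers <= remaining:
            recovered.append(system.name)
            remaining -= system.servers
        else:
            deferred.append(system.name)
    return recovered, deferred
```

For example, with 8 spare servers and systems named billing (Level 1, 4 servers), payroll (Level 2, 3 servers), reports (Level 3, 5 servers), and wiki (Level 5, 1 server), this sketch recovers billing and payroll and defers the other two.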
Thoughts
• Plan, Plan, Plan
• Review, Review, Review
• Test, Test, Test
• Revise, Revise, Revise