Achieving Continuity of Operations (COOP). Disaster Recovery of Technology Services: Issues, Strategies, Directions. Presented by Dave Purdy, 6-23-2005. Ever-increasing need for COOP: People, Data, and Services Availability.
ISSUES
Ever-Increasing Need for COOP: People, Data, and Services Availability
• Drivers/trends for improved recoverability and/or availability of services:
  - Current measures increasingly deemed inadequate
  - Physical vs. electronic transport of data
  - Melding of “DR” and “Operational Availability”
  - Self-insurance for DR
  - Public safety/service availability vs. cost
• Maturity in understanding COOP issues:
  - Recovery vs. restart
  - Identification of App/DB inter-dependencies
  - DR vs. Operational Availability (HA)
• Breaking down the problem:
  - Information availability
  - Application availability
Production Availability and Disaster Recovery: Converging?
• Disaster: natural or man-made (<1% of occurrences) [“DR”, Insurance]
  - Flood, fire, earthquake
  - Contaminated building
• Unplanned occurrences: failure (13% of occurrences) [“CA”]
  - Database corruption
  - Component failure
  - Human error
• Planned occurrences: competing workloads (87% of occurrences) [“HA”, ROI]
  - Backup, reporting
  - Data warehouse extracts
  - Application and data restore
Creating a Context: Government Moving Up the Continuum
Differing levels of IT architectural dependency with regard to availability strategies.
[Figure: continuum running from 100% Procedural (0% IT architectural redundancy) to 100% Automatic (100% IT architectural redundancy). Labels along it include Manual, Low Security, High Security, Transparent Failsafe, High Volume / Low Failsafe, Low Volume / High Failsafe, 24 hrs x 7 days, and Resources. Sectors plotted: Non-Critical Business, Small Industries, Retail, Manufacturing, Consumer Goods, Food Manufacturer, Transportation, Logistics, Banks, Financial Services, Telecommunications, and Essential Services (Government, Airlines, Hospital).]
Availability Drivers
• Increased realization that critical services depend on IT availability
• Pervasive requirements to protect people and data
• Increasingly real-time nature of “transactions”
  - “Lost” transactions cannot be re-created
• Increased recognition that traditional recovery from tape is no longer viable
• New vision: merger of production and DR disciplines to focus on continuous availability
• Public service, safety, and inter-agency dependencies driving criticality of COOP
Traditional Disaster Recovery: Tape
[Figure: timeline from Tape Backup and Offsite Storage through Retrieve Tape, Set Up Systems, and Restore from Tape, with RPO and RTO each measured on a scale of seconds, minutes, hours, days, and weeks.]
• Tape backup with offsite tape storage
• RPO = 24+ hours, or time of last backup stored offsite
• RTO = 24-96 hours, or time required to restart operations
  - Transport tapes to recovery site
  - Set up systems to receive data
  - Restore from tape
  - Synchronize systems and DB for resumption
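The tape figures can be worked through with a small sketch. It treats RPO exposure as the gap between the last backup stored offsite and the disaster, and RTO as the sum of the sequential restart steps named on the slide. All timestamps and step durations below are hypothetical values chosen for illustration, not figures from the presentation.

```python
from datetime import datetime, timedelta


def recovery_point_exposure(last_offsite_backup: datetime,
                            disaster_time: datetime) -> timedelta:
    """Data written after the last offsite backup is lost, so the
    exposure is simply the time elapsed since that backup."""
    return disaster_time - last_offsite_backup


def recovery_time(step_durations: dict) -> timedelta:
    """The tape recovery steps run one after another, so the total
    restart time is the sum of the individual step durations."""
    return sum(step_durations.values(), timedelta())


# Hypothetical durations for the four steps listed on the slide.
TAPE_STEPS = {
    "transport tapes to recovery site": timedelta(hours=8),
    "set up systems to receive data": timedelta(hours=12),
    "restore from tape": timedelta(hours=24),
    "synchronize systems and DB": timedelta(hours=6),
}

# Hypothetical incident: backup went offsite Monday 02:00,
# disaster struck Tuesday 14:00.
exposure = recovery_point_exposure(
    datetime(2005, 6, 20, 2, 0), datetime(2005, 6, 21, 14, 0)
)
total = recovery_time(TAPE_STEPS)

print(f"Data exposure: {exposure.total_seconds() / 3600:.0f} hours")
print(f"Restart time:  {total.total_seconds() / 3600:.0f} hours")
```

With these assumed inputs the exposure is 36 hours and the restart time is 50 hours, both inside the ranges quoted for tape.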
Consistency = Usability
This is not a platform or application issue….
Getting all the data at the same time across databases, applications, and platforms….
[Figure: consistency groups spanning Mainframe, UNIX, and Windows systems.]
ISSUES
Patterns of DR Program Evolution
[Figure: progression of DR programs along a Local-to-Remote axis.]
• Offsite Vital Records
• Commercial Hotsite QuickShip
• Commercial Hotsite with Electronic Vaulting or Replication
• Insourced DR & HA to 2nd site
  - Passive
  - Active
• Insourced “CA” to 2/3 sites
  - Active
  - Triangulate
Key Learnings:
• Restore is very different from restart
• Testing effectiveness and control: subset vs. full / hotsite vs. internal
• Application/agency inter-dependencies
• Traditional recovery and restore techniques being deemed inadequate
• Increased complexity (and benefit) in justifying “DR” versus “DR + HA” as 2nd site becomes more integrated with primary site
A Practical Approach to Unifying Requirements and IT Capabilities for Mutual Agreement…
[Figure: chart with two axes, Maximum Acceptable Data Loss (RPO) and Maximum Acceptable Downtime (RTO), each scaled Zero, Sec., Mins, Hours, > 24 hrs. The Customer Problem Area and Market Requirements are plotted against three capability bands, each with Local and Remote variants:]
• Tape backup & recovery
• Disk data replication
• Server clustering & virtualization
Availability Strategies: Disaster Recovery (DR), High Availability (HA), “Continuous Availability” (CA)
[Figure: network diagram of sites and replication links.]
• In-Region: Primary linked to Secondary (Synch)
• Out-Region: Secondary -or- Tertiary site (Asynch)
• Commercial Hotsite (Asynch)
  - SunGard
  - IBM BRCS
Remote Replication Capability Continuum: Summary
• Synchronous (source to target, limited distance)
  - No data exposure
  - Limited distance
• Asynchronous (source to target, unlimited distance)
  - Seconds of data exposure
  - No performance impact
  - Unlimited distance
• Asynchronous Point-in-Time (source to target, unlimited distance)
  - Hours of data exposure
  - No performance impact
  - Unlimited distance
• Triangulated Synch & Asynch (primary site, 2nd site at limited distance, long-distance site at unlimited distance)
  - Simultaneous synchronous and asynchronous
  - Three-site awareness
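One way to read the continuum is as a filter: given a maximum acceptable data loss and whether the target site is close enough for synchronous writes, which modes remain candidates? The sketch below is illustrative only. The slide says "seconds" and "hours" of exposure without giving numbers, so the 30-second and 4-hour figures are assumptions, as are the names `ReplicationMode` and `candidate_modes`.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ReplicationMode:
    name: str
    typical_exposure: timedelta  # assumed worst-case data loss
    distance_limited: bool       # True if the mode only works over limited distance


# Exposure values for the two asynchronous modes are assumptions;
# the slide only characterizes them as "seconds" and "hours".
MODES = [
    ReplicationMode("synchronous", timedelta(0), True),
    ReplicationMode("asynchronous", timedelta(seconds=30), False),
    ReplicationMode("asynchronous point-in-time", timedelta(hours=4), False),
]


def candidate_modes(rpo: timedelta, within_sync_distance: bool) -> list:
    """Return the names of modes whose assumed exposure fits the RPO
    and whose distance limit (if any) is satisfied by the target site."""
    return [
        mode.name
        for mode in MODES
        if mode.typical_exposure <= rpo
        and (within_sync_distance or not mode.distance_limited)
    ]


# A zero-data-loss requirement to a distant site has no single-mode answer.
print(candidate_modes(timedelta(0), within_sync_distance=False))
# A five-minute RPO to a distant site leaves asynchronous replication.
print(candidate_modes(timedelta(minutes=5), within_sync_distance=False))
```

The empty result for zero data loss at long distance is the gap the triangulated configuration addresses: a synchronous copy at a nearby second site combined with an asynchronous copy at the long-distance site.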
BEST PRACTICES
Best Practices for Achieving Business Continuity
• Determine requirements / service levels
  - System / application mapping
  - Validate ability to achieve service-level agreements
  - Evaluate costs / tradeoffs of technologies to meet service levels
  - Create the right level of protection for your agency's (or inter-agency?) specific business and application requirements
• Integrate it
  - Across information storage platforms
  - Across processing infrastructure (servers, networks, applications)
  - Across data centers and geographic locations
  - Integrate with Change Management
Business Continuity Planning: Lessons from the Nation’s Capital in the Post-9/11 World
Mary Kaye Vavasour, eGov Services, Office of the Chief Technology Officer, District of Columbia
Recent History as Context
• 1996-1999
  - Y2K made continuity a priority
  - Internet made networks a focus and eGovernment a reality
• 2001
  - 9/11: the unthinkable happened; security of data, network, and infrastructure became key to recovery
• 2002
  - Federal Patriot Act made continuity planning a legal mandate
• 2003
  - Sarbanes-Oxley Act added more regulatory requirements
  - Hurricane Isabel caused regional power outages that lasted 4-7 days
Key Elements of Business Continuity Strategy
• High availability platform and procedures
• Proven Emergency Operations Process
  - Detailed, service-based procedures
  - Dedicated staff
  - Regional coordination
  - Frequent practice with planned events
• Focus on Continuity of Communications
  - Public safety wireless network
  - Public portal resiliency with specialized content
  - High availability messaging platform
High Availability Platform; Centralized Process
• In-sourced, high availability disaster recovery
• Active-Active for availability
  - Multiple servers behind hardware load balancers (millisecond fail-over)
  - Separate web application and database tiers
  - 95% of public web services covered (104 sites + main portal)
• Active-Passive for disaster recovery
  - Two data centers
  - Multiple types of replication
  - Cluster synch for dynamic portal content
  - MS/CRS for legacy applications and static pages
  - Database tier uses SQL and Oracle replication
• Future: tertiary site for continuous availability of portal
• Centralized failure recovery process run by senior staff
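The Active-Active arrangement can be illustrated with a toy model of heartbeat-driven load balancing. This is a sketch of the general technique, not the District's hardware load balancer configuration: the class name `HeartbeatPool`, the server names, and the half-second timeout are all hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
class HeartbeatPool:
    """Round-robin over servers whose last heartbeat is recent enough."""

    def __init__(self, servers, timeout):
        self.timeout = timeout                      # seconds a heartbeat stays valid
        self.last_seen = {s: None for s in servers}  # server -> last heartbeat time
        self._counter = 0                           # round-robin position

    def heartbeat(self, server, now):
        """Record that `server` reported healthy at time `now`."""
        self.last_seen[server] = now

    def healthy(self, now):
        """Servers with a heartbeat no older than the timeout."""
        return [
            s for s, seen in self.last_seen.items()
            if seen is not None and now - seen <= self.timeout
        ]

    def route(self, now):
        """Pick the next healthy server; stale servers are skipped."""
        live = self.healthy(now)
        if not live:
            raise RuntimeError("no healthy servers")
        server = live[self._counter % len(live)]
        self._counter += 1
        return server


pool = HeartbeatPool(["web-a", "web-b"], timeout=0.5)
pool.heartbeat("web-a", now=0.0)
pool.heartbeat("web-b", now=0.0)
first = pool.route(now=0.1)    # both servers share the load
second = pool.route(now=0.2)

pool.heartbeat("web-a", now=1.0)   # web-b has gone silent
after_failure = pool.route(now=1.2)  # traffic shifts to web-a only
```

Because every server is already taking traffic, failover is just a matter of dropping the silent server from rotation, which is why Active-Active designs can fail over far faster than a passive site can be brought up.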
Comprehensive Emergency Operations Process
• Started with Y2K; focus on manual processes to back up automated systems
• Post 9/11: focus on continuity of services
• Dedicated staff: DC EMA, agency representatives, and key service providers (utilities, suppliers, Federal public safety, regional emergency agency staff, etc.)
• Hardened site
• 14 Emergency Liaison Officers for key services
• Two-tiered operational structure (EOC and JIC)
• Clearly defined decision-making process and lines of authority
• Redundant communication channels with all levels of responders and the public at large
• Frequent practice, using planned events
Specialized Content for Public Communication
• Public portal’s Emergency Center provides detailed content for emergency response plans
• Extensive use of GIS-based content
  - Content tailored to the individual’s location
  - Facilitates location of shelters, evacuation routes, and major transportation services
• Specialized “Emergency Mode” will take over the entire portal during catastrophic events
Focus on Continuity of Communications
• Public safety wireless network for voice and data
• Federal and regional voice interoperability
• 99% of District geography is covered
• Dedicated transmission towers and mobile repeater systems
• Signals can penetrate thick building walls, metro system tunnels, and underground locations
[Figure: Coverage Improvement With New MPD Network]
Public Portal Resiliency
• Active-Active failover, with load balancing on heartbeat, for high availability
  - Actual = 99.99999%
• Active-Passive disaster recovery between two local data centers
• Future: tertiary site for continuous availability
• GOAL = Never Go Dark
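To put the 99.99999% figure in perspective, an availability percentage can be converted into permitted downtime per year. The sketch below assumes a 365-day year and uses `Decimal` to avoid floating-point rounding on a number with this many nines; the function name is illustrative.

```python
from decimal import Decimal

SECONDS_PER_YEAR = Decimal(365 * 24 * 60 * 60)  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent: str) -> Decimal:
    """Convert an availability percentage (given as a string so it is
    parsed exactly) into the seconds of downtime it allows per year."""
    unavailable_fraction = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable_fraction * SECONDS_PER_YEAR


portal = downtime_seconds_per_year("99.99999")
three_nines = downtime_seconds_per_year("99.9")

print(f"99.99999% allows about {portal:.2f} seconds of downtime per year")
print(f"99.9% allows about {three_nines / 3600:.2f} hours of downtime per year")
```

Under that assumption, 99.99999% corresponds to roughly 3.15 seconds of downtime per year, compared with about 8.76 hours per year at 99.9%.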
High Availability Messaging Platform
• Completely fault-tolerant email system enables government officials to communicate and share data during significant outages
• High-volume synchronous data replication between primary and secondary data centers, using EMC’s CLARiiON MirrorView
• Homeland Security funding ($900k) made public safety agencies the priority focus during implementation:
  - MPD
  - FEMS
  - DMH
  - CFSA
  - DOC
• Can fail over email accounts and the most recent data from 4 hours prior to the outage
Key Success Factors
• People
• Process
• Practice
Mary Kaye Vavasour, Program Manager, eGovernment Services, Office of the Chief Technology Officer, District of Columbia
marykaye.vavasour@dc.gov