
Going Beyond Recovery to Continuity: Lessons Learned

This article discusses the lessons learned at The George Washington University in Washington, DC regarding business continuity. It covers various incidents and drivers for continuity, the role of IT, and the planning process.



Presentation Transcript


  1. Going Beyond Recovery to Continuity: Lessons Learned Dave Swartz Vice President & CIO The George Washington University

  2. Brief Background on GW • Main campus • Washington, DC • ~100 buildings • Blocks from the White House, IMF/World Bank, State Dept. • 27,000 people • 20K students (50% undergraduate, 50% graduate and professional) • 7K faculty and staff • Of the 20K students, 8K are resident • Major medical center – the ER for the leadership of our government • Two other smaller campuses in the region • 2.5 Gb into Internet and Internet-2 • 15K voice connections and 17K data connections • Two major data centers – 34 miles apart • Map labels: White House; IMF/WB and State Dept.; Pentagon

  3. Some Drivers for Business Continuity at GW • Explosions in Manholes in the Street • Recurring, unexplained accumulations of flammable liquids in the storm drains explode, shutting off power to a few buildings for days. • Flood hits Academic Center with Data Center • A backed-up city sewer system caused a flood in a building not designed to house a data center. • Change Management Issues • Our Facilities group is prone to taking significant actions without much notice, including cutting off power or cooling to a building. • Email Systems Failure • We lost the SAN; basic email was down for 24 hours, and it was 3 days before the archive could be restored. • Cybersecurity Incidents • After a major worm infestation and a hack on a trusted host in 2000, GW created its Information Security Program. • 9/11 • “The tragic events of Sept. 11 and their aftermath have resulted in changes in the way all of us conduct our lives,” said President Stephen Joel Trachtenberg. “Just as GW strives for academic excellence, we also want to take all appropriate steps to ensure the safety and well being of our community and the continued operation of the university”. • GW was close to ground zero that day, and all land-based phones and cell phones were congested for much of the day. • Sarbanes-Oxley • A risk-conscious Board of Trustees has led to a number of initiatives to address BC at GW.

  4. Who Owns BC at GW? • John Petrie, AVP for Public Safety & Emergency Mgmt., holds the AB degree from Villanova University and a master’s and doctorate from The Fletcher School of Law and Diplomacy. • A career Naval officer, he was the head of the Naval Station at Norfolk, the world’s largest Naval complex, and also professor and head of research at the War College. • The AVP position was created after 9/11 and was designed to broaden, coordinate, and execute the University’s crisis management, business continuity, emergency preparedness and public safety plans and activities. • “We need to have people at the local level comfortable with what’s expected of them and what they have the authority to do,” Petrie says. “If they are confident and comfortable, then the chances of their being able to prepare, respond, or recover are easier.” • John’s number one priority is the safety and welfare of people. • He sits on regional and national emergency management response groups and represents the regional universities in exercises. • John has helped to lead the development and administration of BC plans and testing, and an integrated system of advisories, alerts and real-time communications. • References: • BC Plan - http://www.gwu.edu/~response/contents.cfm • Advisories and Alerts - http://www.gwu.edu/~gwalert/

  5. Role of IT in Campus BC • Address the risks of IT failures • IT has helped to coordinate and fund the development of the 19 core office departmental plans • Many core departments needed help to complete their BC plans; they felt IT had things under control, so why should they have to plan? • They also had difficulty freeing themselves from other priorities – they needed their VP to make BC a priority! • IT has also helped to deliver: • Campus Alerts (web page, portal, email, 3rd party call service) • Backup web site • Redundant email system and broadcast server (reflector and Listserv) • Alternate routing to a different area code for our main incoming and outgoing phone lines • Emergency intercom broadcasts over speaker phones • A network of Blackberries and support for management • Online directories and BC response plans • A fully configured and supported command center.

  6. The Planning Process • Identify sources of risks and plan accordingly • Provide assistance • Standard templates and questions to facilitate preparation of plans (available on request) • Expert assistance to develop plan • Review of plans • Enlist support • Of senior management, the Board and all core offices • Prioritize efforts • Not every department needs a comprehensive plan. At GW we identified 19 core offices that needed detailed plans. • Make the plan easily available • Regularly test the plan and the ability to think on your feet • Keep plans current • All plans require periodic review, validation and update. • The online plan for GW is called the Incident Planning, Response, and Recovery Manual; it includes the individual BC plans.

  7. The GW IT Recovery Profile • Rebuild & Replace Disaster Recovery • Tape backup and priority shipment of equipment • Weeks to recovery • Hot-Site Disaster Recovery • Off-site arrangements with a hot-site provider • Several days to recovery • High Availability Operations • Redundant data centers, networks and telecom • Less than one day and ideally less than a couple of hours to recovery • Chart, Hours to Recovery: Rebuild & Replace 420 (projected); Hot-Site 84; High-Availability 12, with a target of < 2
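
The hours-to-recovery figures on this slide are easier to compare with the tier descriptions once they are converted to days. The sketch below does that arithmetic. It assumes the chart values map to the tiers in the order shown (420 projected for rebuild and replace, 84 for hot-site, 12 for high-availability); the variable names are illustrative only.

```python
# Hours to recovery as read from the slide's chart.
# Assumption: values pair with tiers in the order they appear on the slide.
HOURS_PER_DAY = 24

profile_hours = {
    "Rebuild & Replace (projected)": 420,
    "Hot-Site": 84,
    "High-Availability": 12,
}

# Convert each figure to days so it can be compared with the
# "weeks", "several days" and "less than one day" descriptions.
profile_days = {tier: hours / HOURS_PER_DAY for tier, hours in profile_hours.items()}

for tier, days in profile_days.items():
    print(f"{tier}: {profile_hours[tier]} h = {days:.1f} days")
```

Under that reading, 420 hours is about two and a half weeks, 84 hours is three and a half days, and 12 hours is half a day, which lines up with the three tier descriptions.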

  8. Dealing with Risk: Continuity rather than Recovery • Common areas of IT risk were addressed with a focus on major risks and points of failure: • Data Center • Telecommunications • Network and ISP • Data • Security • Power and Cooling • Change and Service Management • Classrooms • Continuity of operations needs to be built into the architecture and culture from the bottom up. • If you live with it and use it day to day, then it is less of a big deal when a disaster hits. • BC at a comprehensive local level is essential to enable IT to deliver the sustainability of data and information services.

  9. Data Center Redundancy • We have created dual data centers • separated by 34 miles • connected by a DWDM link over a redundant dark fiber ring • We split Test/Dev from the Prod instances. • We also deploy VMware and virtualize servers across centers. • Not all of production is at one site, but split on a 35-65% basis. • We mirror data between data centers. • We have staff split between centers. • We routinely test failover during maintenance and upgrades. • This design enables continuity of operations without the need to recover from most disasters.

  10. Telecommunications Redundancy • We have several PBX switches (Avaya S8700s) interconnected, load balanced, and spatially distributed. • Two are on the main campus and separated. The third is on a remote campus 34 miles away in a different area code. • We have the ability to re-route incoming and outgoing calls through different campuses and area codes. • There are redundant emergency 911 and analog lines as a backup to our main trunks. • Some specific phone numbers are protected and given regional priority for accessibility and sustainability during a major incident. • We maintain copper connections for voice so that 15,000 phones can draw inline power from diesel generators.

  11. Data Redundancy • All enterprise data is mirrored between data centers, including ERP, data marts, email, one-card, portal, and web systems. • The main campus file servers are automatically backed up. Legacy departmental systems are slowly transitioning to central support and sustainability – a difficult political process. • Desktops in many core offices have a standard image and automatically store to a central suite of file servers. • Critical documents are being stored online in an enterprise document management system and archived to tape. • We regularly test data backups to make sure we can restore from them. • One of the most critical aspects of continuity is rapid access to the data! • On-site fire-rated vault in addition to off-site storage.

  12. Information Security • Protecting the university from security risks that can interrupt operations and cost millions of dollars in lost productivity and liability is an important priority in BC. • Like an onion, the best approach is defense in depth. • One of our newest efforts after securing campus file servers is our desktop initiative. • We now use Novell Patchlinks, Cisco Clean Access and IPS to automate updates and to verify conformance to standards and non-infection. • As a result, desktop infection problems have declined to a trickle. • Creating a focused Information Security program, setting standards, and centralizing services are critical to success. “Rounding Up Rogue Servers”, article in July 2005 Chronicle.

  13. Power and Cooling • Power Redundancy • Conditioned Commercial Power • 450 kW Diesel Generator w/Maintenance Tap • Automatic Transfer Switch • Uninterruptible Power Supplies (UPS) • Multiple power supplies in each computer system • 48-hour supply of diesel (going to 96 hours), with priority shipments possible from three regional vendors • Redundant Air Conditioning Systems • Chilled Water Plant & Two 60 Ton Dry Coolers • Glycol & Chilled Water Air Handlers

  14. Change & Service Management • Diagram: Change Control via Integration. Functions shown: App. Change Control; Prob Tickets & Service Orders; Work Requests; Asset Management; S/W License Mgmt. Tools shown: Remedy, Kintana, C3, Remedy, TBD, Upside, Aperture. • Adoption of integrated change control is one of the major factors in the improvement and reliability of operations.

  15. Classrooms • What happens if we lose some classroom space? How could we continue to conduct classes? • Using R25i (Resource25 3.3) to complement Schedule25 we can identify and reallocate any available university space to classrooms • Using Bb and Elluminate we can conduct classes virtually from home. • We are piloting this approach now for snow days and other unscheduled ad hoc gatherings such as study sessions. • We are also suggesting that faculty teach one virtual class every month so they have practice. • Podcasting = Apreso + iPods • GW is supporting Podcasting of its non-credit lecture series to provide access to recorded presentations. • Could this be expanded for credit classes? Depends on support from faculty.

  16. Selling BC: not the WHAT, but the HOW • Rational Approach – The risk or probability of the event multiplied by the potential loss suggests the magnitude of the investment warranted to protect a university from disaster. Not many use this approach. • Peer Group Benchmarks – A very common and accepted approach is to compare the university against the market basket of peer institutions to see what they are doing. • Leverage the Crisis – The emotional side of living through a crisis tends to ease the flow of funds, so capture the opportunity when it arises. • Partnering with the Board and Audit Team – The Board has the ability to drive improvements. The External and Internal Audit Teams are agents of the Board and should be viewed as partners, not as the threat they are often taken to be.
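
The "Rational Approach" on this slide is a probability-times-loss calculation, which can be sketched in a few lines. The example below is only an illustration: the risk names echo incidents mentioned earlier in the talk, but every probability and dollar figure is a made-up placeholder, and the `Risk` class is a hypothetical helper rather than anything GW used.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One risk scenario. All numbers used below are hypothetical."""
    name: str
    annual_probability: float  # assumed chance of the event in a given year
    potential_loss: float      # assumed dollar loss if the event occurs

    @property
    def expected_annual_loss(self) -> float:
        # Probability of the event multiplied by the potential loss.
        return self.annual_probability * self.potential_loss


# Placeholder figures for illustration only.
risks = [
    Risk("Data center flood", 0.05, 2_000_000),
    Risk("Email SAN failure", 0.20, 250_000),
    Risk("Extended power outage", 0.10, 500_000),
]

# Rank scenarios by expected annual loss, largest first.
ranked = sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)

for risk in ranked:
    print(f"{risk.name}: ${risk.expected_annual_loss:,.0f} expected loss per year")
```

The expected annual loss gives a rough ceiling on what it is rational to spend each year protecting against that scenario, which is the "suggested magnitude" the slide refers to.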

  17. Risks of Complexity Virtualization, distant centers, and split operations add complexity, which has its own attendant risks. Standardization, documentation, and tight change control help to reduce risks from complexity.

  18. Factors Related to Distance • How far away is far enough for a second center? • GW has selected 34 miles • USC has designated a “bunker” just a few miles away • Others are saying 70+ miles. • It really depends • You need to consider the types of risks in your region. • The greater the distance • The greater the cost, or the lesser the functionality and immediacy of response. • You may want to • Have a secondary high-availability or hot-site nearby and a tertiary cold-site much farther away. • You need to consider • The impacts on your staff and their ability to make it to the different sites, both for routine maintenance and during a disaster • Some types of clustering do not work at a distance • Real-time mirroring is also adversely affected by distance.

  19. Support those Blackberries • A critical element of the GW BC program is a network of Blackberries. All senior management at GW have them and use them every day. • Blackberries are more like a laptop than a phone and require expert assistance • They have cell phone and radio capability • They can send and receive email and instant text messages • They have the ability to surf the web and access calendars, directories and online documents that can be used to support BC • We have a dedicated expert with backup to provide support to the Blackberries and the command centers.

  20. Doesn’t it cost a great deal? • GW had a hot-site • Costing several hundred thousand dollars per year. • Went to a high-availability 2nd site. • One-time cost about $1 million • The ongoing costs were not more than the previous base budget due to the reallocation of the funds from the hot-site contract. • Increase in base needed was: • $136K/yr: $1 million loaned at 6% over 10 years • To offset costs we are leasing excess space: • We are recovering the incremental operating costs of the 2nd site. • More reliable service without large additional costs - A NO-BRAINER! • A myth propagated by hot-site vendors is that the cost of customer-owned high-availability is prohibitive. • Chart: Cost versus Time to Restoration of Operations, comparing the Expected Cost Curve with the GW Cost Curve.
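
The $136K/yr figure on this slide can be checked with the standard level-payment loan formula, payment = P × r / (1 − (1 + r)^−n). The sketch below assumes one payment per year with annual compounding; the slide does not state the repayment schedule, so that is an assumption, and `annual_payment` is an illustrative helper name.

```python
def annual_payment(principal: float, rate: float, years: int) -> float:
    """Level annual payment that repays `principal` over `years` at `rate` per year.

    Assumes one payment at the end of each year and annual compounding.
    """
    return principal * rate / (1 - (1 + rate) ** -years)


# Figures from the slide: $1 million loaned at 6% over 10 years.
payment = annual_payment(1_000_000, 0.06, 10)
print(f"${payment:,.0f} per year")
```

Under those assumptions the payment comes to roughly $135,900 per year, which rounds to the $136K quoted on the slide.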

  21. Partnerships • National Capital Regional Emergency Response Partnership • Emergency Response groups across the region coordinate efforts and share experiences • First Responder Access Card (FRAC) • The FRAC helps to get approved personnel across roadblocks and barriers. • Regional exercises • Information sharing with key groups • University Partnerships: • Cost and resource sharing or exchange programs • Georgetown University & GW back one another up • MAX (Mid-Atlantic Crossroads gigapop) • Vendor Partnerships: • Have helped GW identify best practices and utilize new technology useful to BC. • Their support in a disaster can be critical.

  22. Questions? Dave Swartz
