UCSD/UCOP Disaster Recovery Project: Review of objectives, rationale and accomplishments (UCOP/UCSD team)
Situation as of 2Q2006 • UCSD had virtually no DR plan in place • UCOP used IBM contract in Colorado • Cost $200k / yr + $600k/month if ever used • Had insufficient gear and network reserved; conservative estimate is > 50% more cost if updated appropriately • Limit of 40 hrs of testing / year, difficult to schedule • RPO (Recovery Point Objective) <= 7 days • RTO (Recovery Time Objective) <= 3 days • Required UCOP personnel to activate and operate • Past testing indicated a decent mainframe recovery plan in place, but limited distributed system capability
DR Concept • A window of opportunity was seized to implement real time DR capability between UCSD & UCOP due to: • Having trusted partner & high capacity WAN • UCOP required much shorter RPO and RTO and must expand scope of our DR capability (AYSO, ERS, EIAS, etc…) • Funding availability – the opportunity! • UCOP had budget by redirecting DR contract and leveraging storage purchase • UCSD leveraged storage purchase & server replacement timing • UCOP had experience from Colorado contract • Keys to approach: • Buy enough storage in order to synchronize data in real or near real time between sites and avoid any need to “load” data during an event • Leverage CBU mainframe option and add mainframe memory. • For all other servers – buy sufficient gear to have capacity available to run mission critical services at either location without having to repurpose servers during an event
Advantages of this DR approach • Costs for UCOP are comparable to old DR plan • Capability is dramatically improved • RTO and RPO < 1 day (actually far less) • Can test as often as needed (and we need it!) • Equipment is there and operational • No incremental cost during first 90 days of a disaster • More services can be easily added after the initial investment (labor and infrastructure), and the setup is easy to optimize over time • UC personnel “on other side” will assist in case of disaster; the long-term goal is to recover without any personnel from the down location immediately available
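To make the cost argument concrete, the old contract's pricing from the "Situation as of 2Q2006" slide ($200k / yr plus $600k/month if ever used) can be turned into simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes activation is billed per whole month on top of the annual fee, which the slides do not spell out.

```python
# Figures taken from the slides; the billing model itself is an assumption.
RETAINER_PER_YEAR = 200_000      # $200k / yr
ACTIVATION_PER_MONTH = 600_000   # $600k / month if the contract is ever used


def old_contract_cost(months_activated: int, years: int = 1) -> int:
    """Cost of the old DR contract over `years`, assuming whole-month
    activation charges are added on top of the annual fee."""
    if months_activated < 0 or years < 1:
        raise ValueError("months_activated must be >= 0 and years >= 1")
    return years * RETAINER_PER_YEAR + months_activated * ACTIVATION_PER_MONTH


if __name__ == "__main__":
    for months in (0, 1, 3):
        print(f"{months} month(s) activated: ${old_contract_cost(months):,}")
```

Under these assumptions, a disaster lasting three months (roughly the 90-day window the new approach covers at no incremental cost) would have cost about $2.0M in that year under the old contract.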
The UCOP DR Portfolio • Examined past DR portfolio • IR&C inventoried and classified existing applications • Developed phased implementation plan • Current DR Apps • All Mainframe services (including 9 PPS instances & UCRS) • AYSO and all Benefits services • Endowment and Investment Accounting System • Infrastructure including TSM, Active Directory, VPN • Email • UCOP Web Servers • Banking/Treasury Systems • Loan Programs • Risk Services • Irvine Secondary DNS and Web Server • File Sharing
Anticipated Additions to UCOP DR Portfolio • New DR projects - committed • UC Effort Reporting System (3Q2008) • UCOP Office of Technology Transfer Informix DB • SD Coastal Data Information Program • UCSB - UCOP PPRC • UCOP iDP Shibboleth Server • UCOP TSM Server • UC Pathways (3Q2009) • New DR projects – under consideration • UCSD Med PPRC • UCSB Distributed DNS Server • UCOP California Institute for Energy and Environment • UCLA Med PPRC
Apps and Operational Testing Status • Completed: • Application Accessibility • Data Validation • Check printing at UCSB • Moved Check Stock • Operation Procedure • Email • File Sharing • Risk Services Apps • Outstanding: • Tape Drive for remote archiving • LPR / VPS remote printing • SD Enterprise Extender • Secure remote printing • Firewall addressing • Mainframe outgoing mail via SMTP • SSL Cert • UCSF FTP • ATL TMS library data integrity • Enhance SSE DR environment • Remote VPN Access • Batch Automation via Zeke
Process, Procedure and Documentation • Weekly Con Call w/ SD • Discuss problems and changes • Discuss upcoming technology changes • Coordinate scheduled outages and testing • Shared Folder • IPL / Shutdown Procedures • Remote Hands Tasks and Authorization List • DR Declaration Procedure • Bank Transmission Procedure • Website: http://www.ucop.edu/sysdev1/dr/drhome.html • Online access (network, server, cabinet diagrams, etc.)
Technical Decisions • Storage • Purchased SAN (2107) from IBM for majority of solution; used global mirroring • UCOP mainframe to UCSD in real time • All UCSD to UCOP in real time • UCOP Unix/Linux using rsync over SSH, server to server, daily • UCOP Windows using server-to-server synch in real time • All data encrypted except Windows at this time; Windows will be encrypted soon • Bandwidth Considerations • During initial synch, 100 GB / hr (approx 300 Mb/sec) • During normal ongoing synch, anywhere from 0-100 Mb/sec • During a test, the sites will get out of synch; after the test is complete and during catch-up, about 300 Mb/sec (this can be refined over time) • Simplicity where possible to speed deployment • For clustered production environments, only fail over to a single-server DR environment: sufficient capacity to deliver full service, just no redundancy • Will not initially run production from both locations; DR site is just for failover • Where possible, have duplicate equipment to avoid finger pointing and the need to worry about incompatibility • Used our current technical staff, no consultants
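The bandwidth figures above mix GB per hour and Mb per second, so a small conversion helps when sizing the link. The following is a rough sketch, not the team's sizing method: the function names are hypothetical, decimal units (1 GB = 10^9 bytes) are assumed, and the 50 GB backlog in the example is an invented illustration.

```python
def gb_per_hour_to_mbps(gb_per_hour: float) -> float:
    """Convert a payload rate in GB/hour to megabits/second
    (assumes decimal units: 1 GB = 10**9 bytes, 1 Mb = 10**6 bits)."""
    bits_per_hour = gb_per_hour * 10**9 * 8
    return bits_per_hour / 3600 / 10**6


def catch_up_hours(backlog_gb: float, link_mbps: float, ongoing_mbps: float) -> float:
    """Hours needed to drain a replication backlog while normal change
    traffic (ongoing_mbps) keeps arriving on the same link."""
    spare_mbps = link_mbps - ongoing_mbps
    if spare_mbps <= 0:
        raise ValueError("link has no spare capacity; backlog never drains")
    backlog_megabits = backlog_gb * 10**9 * 8 / 10**6
    return backlog_megabits / spare_mbps / 3600


if __name__ == "__main__":
    # 100 GB/hr of payload is about 222 Mb/s before any protocol overhead.
    print(f"{gb_per_hour_to_mbps(100):.1f} Mb/s")
    # Hypothetical: 50 GB backlog after a test, 300 Mb/s link, 50 Mb/s ongoing load.
    print(f"{catch_up_hours(50, 300, 50):.2f} hours")
```

With decimal units, 100 GB / hr works out to roughly 222 Mb/sec of raw payload, so the slide's "approx 300 Mb/sec" presumably includes overhead or rounding; the slides do not say how that figure was derived.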
Lessons Learned • Infrastructure / Operational • Physical security access personnel list • Coordination of scheduled PPRC-impacting changes • Consistent method of accessing supported systems (e.g. MVS consoles) • Floor space availability for growth • Establish and coordinate HW/SW upgrades/purchases to alleviate compatibility issues and promote operational simplicity • Address current project issues before adding new services to avoid delays in completing existing projects • Staffing to support additional services • Establish documentation policy (i.e. format, depository and update cycles) • Network • Coordinate SAN zoning and VSAN numbering for SAN switches (allows shared management). • Coordinate IP addressing (not really a problem within UC campuses, but allows shared management). • Phased implementation has different needs at different phases (e.g. all-or-nothing failover needs operations support at remote end for DNS, etc.) • Document the failover method early and get everyone’s buy-in (assuming that IP addresses can easily be moved between sites may cause trouble later in the project).
Lessons Learned (cont) • Network (continued) • Global Copy can adversely affect seemingly unrelated applications. • Establish process and procedure for synchronizing firewall address space with the DR site • Understand network realities • Latency vs bandwidth • Ongoing load vs initial synch • Systems • Z9 CBU activation will "Perform a Model Conversion" at the host site, which will require the host site to obtain temporary licenses for certain software products. In addition, the host site will need to monitor its CPU to make sure it doesn't go over its original MSU, to avoid additional charges. • Establish process for exporting SSL certificates • Sybase database loading from the server's internal (onboard) disk caused loading delays. We're in the process of moving its disk to our SAN disk environment to address the issue
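The "latency vs bandwidth" point can be illustrated with the standard bandwidth-delay relationship: a single windowed stream (such as one TCP connection) cannot move more than one window of data per round trip, regardless of link capacity. The sketch below is a generic illustration, not a measurement of the UCOP/UCSD link; the 20 ms round-trip time and 64 KiB window are hypothetical values and the function names are made up for this example.

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one windowed stream: one window per round trip."""
    bits_per_second = window_bytes * 8 * 1000 / rtt_ms
    return bits_per_second / 10**6


def window_needed_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    bits_in_flight = link_mbps * 1000 * rtt_ms  # Mb/s * ms = kilobits
    return bits_in_flight / 8


if __name__ == "__main__":
    rtt_ms = 20              # hypothetical round-trip time between sites
    window = 64 * 1024       # hypothetical 64 KiB window
    print(f"64 KiB window at {rtt_ms} ms RTT: "
          f"{max_throughput_mbps(window, rtt_ms):.1f} Mb/s at most")
    print(f"Window needed to fill 300 Mb/s: "
          f"{window_needed_bytes(300, rtt_ms):,.0f} bytes")
```

Under these assumed numbers, a single stream with a 64 KiB window tops out near 26 Mb/sec even on a much faster link, which is one way a high-capacity WAN can still deliver disappointing synch rates until windows or parallelism are tuned.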
Critical Success Factors • UCOP assigned dedicated staff to drive effort • One Team – UCOP and UCSD • Fight scope creep and go for simplicity • Clear mandate, objectives, and timeline • Communicate, communicate, communicate • Test, Test, Test • Engage procurement personnel