Disaster Recovery
Broad Team – UCSD, UCOP, and others! (special credit to Kris Hafner & Elazar Harel)
Presenter - Paul Weiss – Executive Director, UCOP/IR&C
Paul.weiss@ucop.edu
March 9-11, 2009 • Long Beach, CA • cenic09.cenic.org

Agenda
• Business view and background as to how and why
• The services portfolio
• Technical details
• Network implications
• Lessons learned, going forward
RIDING THE WAVES OF INNOVATION • cenic09.cenic.org

Situation as of 2Q2006
• UCSD had almost no DR plan in place
• UCOP used an IBM contract in Colorado
• Cost $200k / yr + $600k / month if ever used
• Insufficient gear and network capacity reserved; a cautious estimate is > 50% more cost if updated appropriately
• Testing limited to 40 hrs / year, difficult to schedule
• RPO (Recovery Point Objective) <= 7 days
• RTO (Recovery Time Objective) <= 3 days
• Required UCOP personnel to activate and operate
• Past testing indicated a decent mainframe recovery plan in place, but limited distributed-system capability

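The cost figures on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function name `vendor_cost` is hypothetical, and it assumes the "> 50% more" uplift applies to both the annual retainer and the monthly activation fee, which the slide does not specify.

```python
# Figures taken from the slide; everything else is an assumption.
ANNUAL_RETAINER = 200_000      # USD per year, paid whether or not DR is declared
MONTHLY_ACTIVATION = 600_000   # USD per month, paid only while the DR site is in use
UPLIFT = 0.50                  # lower bound of the "> 50% more" estimate


def vendor_cost(months_activated: int, uplift: float = 0.0) -> float:
    """Cost of one contract year with `months_activated` months of actual use.

    ASSUMPTION: the uplift scales the whole bill (retainer and activation).
    """
    base = ANNUAL_RETAINER + MONTHLY_ACTIVATION * months_activated
    return base * (1 + uplift)


if __name__ == "__main__":
    for months in (0, 1, 3):
        print(months, vendor_cost(months), vendor_cost(months, UPLIFT))
```

Even a single month of activation dominates the annual retainer, which is part of why owning equipment at a partner site could be cost-competitive.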
DR Concept
• UCOP required shorter RPO & RTO
• Found trusted partner (UCSD)
• Willingness to be “married”
• Technical choices
• Change management – ongoing
• One “team”
• Common principles
• Use the WAN, “stupid”

Keys to Approach
• Buy enough storage, synchronize data in real or near real time, avoid loading data during an actual DR event
• Mainframe – CBU (Capacity Backup) option and buy memory
• Other servers – buy sufficient gear to have capacity available to run at either location without having to repurpose servers during an event
• Must be able to test and retest – DR is not STATIC!
The decision to do it!

Advantages of this Approach
• Costs for UCOP are comparable to the old DR plan
• Costs for UCSD are < 50% of a vendor solution
• Capability is dramatically improved
• RTO and RPO < 1 day (and will be far less)
• Can test as often as needed (we need it!)
• Equipment is there and operational
• More services can be “easily” added (and have been!) after the initial investment, and the setup can be optimized over time
• UC personnel “on the other side” will assist in case of disaster; the long-term goal is to recover without any personnel from the down location being immediately available

Initial Critical Success Factors
• UCOP assigned 0.5 FTE staff dedicated to driving the effort
• One Team – UCOP and UCSD
• Agree to basic principles, including $$$
• Fight scope creep
• Engage procurement personnel
• Communicate, communicate, communicate
• Test, Test, Test
• The WAN!

Current UCOP to UCSD DR Portfolio
• All Mainframe services (including 9 (and soon to be 10) PPS instances & UCRS)
• AYSO and all Benefits services
• Endowment and Investment Accounting System
• Active Directory
• VPN
• Email & File sharing
• Web Servers
• Banking/Treasury Systems
• Loan Programs
• Risk Services

The Picture - Part I
[Diagram: UCOP and UCSD]

Current UCSD to UCOP DR Portfolio
• All Mainframe services (including HR, financial and student transactional backend systems)
• All Web-based systems for HR/PPS, Financial, Student, Telecommunications billing, etc.
• Google search appliances
• Multi-terabyte data warehouse
• Multi-terabyte production data for all mainframe and open systems
• Dev and QA testing data and LPARs for mainframe applications
• Standalone systems for Intl. Student tracking, Audit, Coeus, and DARS

Future UCSD to UCOP DR Portfolio
• Portal/CMS backup for campus, business and student portals
• Single Sign-on, roles, affiliates authentication/authorization failover
• VPN
• Active Directory Domain controllers
• Core MTA (Ironport for now)
• Blackberry
• Mailing lists
• Mailbox machines

The Picture - Part II
[Diagram: UCOP and UCSD]

Then it got interesting
As positive word got out, more locations and functional areas realized that DR was achievable
So…

Other DR services in place or committed to
• UC Effort Reporting System (3Q2009)
• UCOP Office of Technology Transfer Informix DB
• UCOP IDP Shibboleth Server
• UC Replacement Security Number (RSN)
• UCOP TSM Server
• UC Pathways (3Q2009)
• UCSD Med Mainframe, PPRC
• UCSB Distributed DNS Server
• UCLA Continuing Education of the Bar
• UCSD External Relations
• UCDC File Server
• Irvine Secondary DNS and Web Server
• SD Coastal Data Information Program

And a Special Case!
UCSB mainframe load
Four Steps:
1. DR from UCSB to UCOP utilizing PPRC
2. Do failover test to UCOP; if fully successful, keep production at UCOP
3. DR from UCOP to UCSD - trivial
4. Turn off UCSB mainframe

The Picture - Part III
[Diagram: UCOP and UCSD, with UCI, San Diego Coastal, UCSD External Relations, UCSDMC, UCDC, UCSB, and UCLA CEB]

Services being Considered
• UCOP California Institute for Energy and Environment
• UCLA Med PPRC
And what’s next?
Broader discussions are now occurring, not just with UCOP, but between more and more UC players – a nice “halo” effect, with many leveraging the WAN!

Technical Details
• SD & OP (and SB & SDMC) purchased comparable HW
• IBM SAN & Cisco SAN switches, supporting global mirroring (PPRC – Peer-to-Peer Remote Copy)
• Mainframe – memory upgrade and CBU option – must have sufficient capacity on both sides to support total load
• Worked through CENIC and local network teams to set up appropriate links for PPRC to ensure throughput
• Wrote (and are writing) special monitoring tools
• Set up remote tape capabilities so we don’t have to use an outside vendor for offsite storage of tape copies
• Remember that this hardware needs to be in the normal refresh cycle, just like the hardware on your primary floor

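The slide mentions custom monitoring tools but gives no detail about them. As a rough illustration of what such a check might look like, the sketch below compares replication lag against an RPO target. It is not the team's actual tooling: the function `replication_status`, the status labels, and the warning threshold are all hypothetical, and it assumes the timestamp of the newest consistent remote copy is already available from some other source.

```python
from datetime import datetime, timedelta


def replication_status(last_consistent: datetime,
                       now: datetime,
                       rpo: timedelta,
                       warn_fraction: float = 0.5) -> str:
    """Classify replication lag against an RPO target.

    ASSUMPTION: `last_consistent` is the timestamp of the newest consistent
    copy at the remote site. How it is read from the storage system is not
    covered by the slides and is not modeled here.
    """
    lag = now - last_consistent
    if lag > rpo:
        return "BREACH"   # remote copy is older than the RPO allows
    if lag > rpo * warn_fraction:
        return "WARN"     # still within RPO, but drifting toward it
    return "OK"


if __name__ == "__main__":
    now = datetime(2009, 3, 9, 12, 0)
    rpo = timedelta(days=1)  # the "< 1 day" target quoted earlier in the deck
    for hours in (0.1, 13, 25):
        print(hours, replication_status(now - timedelta(hours=hours), now, rpo))
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check easy to test, which fits the deck's emphasis on repeated testing.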
Network concerns
• Frame size
  – For low traffic, the default end-to-end frame size of 1500 bytes works fine
  – OP/SD (more traffic) had to move to “jumbo frames” – 2300 bytes seems to work
• On HPR today, need to move to DC
• At OP – currently at 1 Gb, likely upgrade to 10 Gb
• Must refine SLAs & due diligence
  – Acceptable catch-up (RPO issue)
  – Better understanding of traffic

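Two of the concerns above, frame size and acceptable catch-up time, lend themselves to simple arithmetic. The sketch below is a rough model, not a measurement: it assumes the replication traffic rides on TCP/IPv4 over standard Ethernet (the slides do not say), treats the frame sizes quoted above as the IP MTU, and uses hypothetical function names and a made-up 450 GB backlog purely for illustration.

```python
ETH_OVERHEAD = 38     # bytes per frame: preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40   # bytes per packet: IPv4 20 + TCP 20, no options (ASSUMPTION: TCP/IPv4)


def wire_efficiency(mtu: int) -> float:
    """Fraction of on-the-wire bytes that carry application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def catch_up_hours(backlog_gb: float, link_gbps: float,
                   change_gbps: float = 0.0, efficiency: float = 1.0) -> float:
    """Hours needed to drain a replication backlog.

    backlog_gb  : data waiting to be copied, in gigabytes
    link_gbps   : raw link rate, in gigabits per second
    change_gbps : rate at which new changes keep arriving, in gigabits per second
    """
    usable = link_gbps * efficiency - change_gbps
    if usable <= 0:
        raise ValueError("link cannot keep up with the change rate")
    return backlog_gb * 8 / usable / 3600


if __name__ == "__main__":
    for mtu in (1500, 2300):
        print(mtu, round(wire_efficiency(mtu), 4))
    # Hypothetical 450 GB backlog on the current and the proposed link speed.
    for gbps in (1.0, 10.0):
        print(gbps, catch_up_hours(450, gbps, efficiency=wire_efficiency(2300)))
```

The header savings from a larger frame are small under this model, so the benefit the team saw from 2300-byte frames likely had other causes as well (for example fewer packets to process or less fragmentation); the link upgrade has a much larger effect on catch-up time.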
Implications due to “Success”
• OP WAN connection capacity upgrade
• Change management is a lot more complicated
• Some technical “lock in”
• Insufficient documentation and test plans – even now
• Better monitoring tools required
• Org processes can be stressed

Lessons Learned
• WAN is an underutilized/unrecognized asset
• Geography is less of an inhibitor than many believe
• This project will never be completed
• Can/should continuously optimize this over time (examples – virtualization, better sharing)
• Adding DR capability is easier after initial heavy lifting - e.g. Mainframe