Operational Recovery and Disaster Recovery Alternatives for VMware Infrastructures
Rob Zylowski
Services Director – Virtualization and Director IP
April 2009

Agenda
• Operational Recovery
  • Introduction
  • Technologies
  • Strategies
• Disaster Recovery
  • Introduction
  • Major Costs of DR
  • LUN Replication Alternative
  • Backup/Dedupe Alternative
  • VMware Site Recovery Manager
  • Alternative Strategies
• Managing Multi-vendor Storage

Introduction - Operational Recovery
• Definition of Operational Recovery
  • Mine – Something very important whose importance is often overlooked
  • Recovery, within a datacenter, of one or more applications and associated data to correct a failure such as a corrupt database, user error or hardware failure
• Characteristics of Operational Recovery
  • Few organizations do it well
  • Can be complex, requiring many manual steps that take significant time and resources
  • Often not well tested, which leaves staff uncertain of the expected results
  • Should be built into products by the application developer, but often is not

Introduction - Operational Recovery
• Benefits Virtualization Provides for Operational Recovery
  • With VMware HA and ESX redundancy, all systems get quick local recovery from server and network hardware failures
  • Servers are encapsulated into a small number of files that can be backed up and restored more easily than physical servers
  • Entire servers can be backed up to disk for quick recovery
  • VMware Snapshots can be taken before upgrades or significant system changes, and the system can easily be rolled back to the point of the snapshot
  • Recovery and recovery testing are both simplified because a VM can be restored and mounted with no real network access
  • Can significantly lower RTO
  • Provides some challenges for RPO that can be ameliorated with technology
• Requirements to achieve benefits
  • Nearline Storage or SAN/NAS Snapshot Space
  • Significant amount of storage required for any online backup technology

Technologies for Operational Recovery
• VMware Consolidated Backup (VCB)
  • Excellent Architecture
  • Tactically immature
    • Does not yet scale vertically
    • Script based, with integration issues
    • Does not work as well for Linux as for Windows
  • Valuable when used to its strengths
    • Large numbers of files
    • Large file systems
  • Use it for what it’s good at and it will get better
• Data Dedupe Targets and VTLs
  • Integrated with backup software; for example, many vendors, such as Data Domain, now have Symantec OST support
  • Provides benefits for Virtual and Physical Systems; real-life examples seem to show reduction ratios of up to 20 to 1
  • Data produced by people’s work reduces better than natural data, for example seismic data
  • Data that changes infrequently also reduces more than frequently changing data
  • Significantly simplifies recovery by eliminating tape and the delays that tape switching adds to RTO

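To make the reduction-ratio point concrete, here is a minimal sizing sketch. Only the 20 to 1 figure comes from the slide; the function names and the example volumes are illustrative assumptions, and a single flat ratio is a simplification, since real ratios vary with data type and change rate.

```python
import math


def physical_tb_needed(logical_tb: float, reduction_ratio: float) -> float:
    """Physical capacity (TB) needed to hold `logical_tb` of backups
    at a given dedupe reduction ratio (e.g. 20 for "20 to 1")."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return logical_tb / reduction_ratio


def online_backup_days(physical_tb: float, daily_logical_tb: float,
                       reduction_ratio: float) -> int:
    """Whole days of backups that fit on a dedupe target of `physical_tb`,
    assuming every day adds `daily_logical_tb` before reduction."""
    if daily_logical_tb <= 0 or reduction_ratio <= 0:
        raise ValueError("daily_logical_tb and reduction_ratio must be positive")
    return math.floor(physical_tb * reduction_ratio / daily_logical_tb)
```

For example, 30 daily backups of 10 TB each (300 TB logical) would need about `physical_tb_needed(300, 20)` = 15 TB at 20 to 1, and the same 15 TB target with no reduction (`reduction_ratio=1`) would hold a single day.
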
Technologies for Operational Recovery
• Traditional Backup Vendors vs. Virtual Backup Vendors
  • Choice depends on the “Best of Breed” versus “Framework Standards” debate
  • Most traditional products are becoming much more mature with VMs
  • Some vendor solutions are becoming very feature rich, especially when integrated with hardware- or software-based dedupe
• SAN/NAS Technologies
  • SAN/NAS Snapshots may be used for operational recovery, but without an integrated backup application this can be difficult to manage
  • LUNs normally hold many virtual machines, making recovery of a single VM from a snapshot challenging
  • Best used in conjunction with an integrated backup system; for example, many SAN vendors now have Symantec OST support
• FastScale
  • Shrinks VMs by managing the OS configuration down to only what is required
  • Makes backup requirements much smaller for Linux OSs

Alternative Strategies
• Most Often Architected to Date
  • Backup Agent in VM for Most Backups
    • Matches Physical Server recovery standards
  • Use of VCB for large File Systems or File Systems with Millions of files
  • Service offering for Point in Time Image Backups kept for a period of time
    • Used for Upgrades or Major Change Rollback
    • Can be kept longer than Snapshots, which affect performance over the long run
• Change in Architecture Driven by Dedupe and Image Technologies
  • VCB & Point in Time Image Backups as above
  • Use Image Backup for applications that require a very short RTO, i.e. < 4 hours
  • Image backup technologies are becoming mature
  • At many organizations the percentage of systems virtualized is becoming very high, allowing for economies of scale and a change of standards
  • Dedupe allows for an increased number of online backup days
  • Replication of deduped backups is efficient over the WAN, fulfilling the offsite storage requirement
    • Integrated into DR process
  • Moderate-size implementations can move fully to image backups for VMs, but this is still a challenge for very large organizations

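The selection rules on this slide can be read as a small decision function. The sketch below is one hypothetical encoding: the 4-hour RTO cutoff and the "millions of files" criterion come from the slide, while the class, the function name, and the 1 TB threshold for a "large" file system are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    name: str
    rto_hours: float      # required recovery time objective
    file_count: int       # number of files in the guest file systems
    filesystem_tb: float  # total guest file system size in TB


def choose_backup_method(vm: VirtualMachine,
                         many_files: int = 1_000_000,
                         large_fs_tb: float = 1.0) -> str:
    """Return "image", "vcb" or "agent" for a VM.

    The thresholds for "millions of files" and "large" are assumed defaults,
    not values given in the presentation.
    """
    if vm.rto_hours < 4:
        return "image"   # very short RTO: restore the whole VM image
    if vm.file_count >= many_files or vm.filesystem_tb >= large_fs_tb:
        return "vcb"     # large or file-heavy file systems
    return "agent"       # default: backup agent inside the VM
```

With these defaults, a VM with a 2-hour RTO maps to `"image"`, a file server holding 5 million files and a 24-hour RTO maps to `"vcb"`, and a small application server with a 24-hour RTO falls through to `"agent"`.
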
Future Advancements
• Changes due in vSphere
  • vStorage Data Protection APIs enabling Backup vendors to bypass VCB
  • No longer require VCB proxy
  • Scalable High Performance solution
  • Based on Virtual Appliance
  • Better support for Windows (VSS & File Level Restores) than Linux
  • Preprocessing SW based Dedupe
  • Can Integrate with HW Dedupe
  • Should make recovery of VMs simple and straightforward
  • Should perform much better than VCB
  • Significant IOPS increase, 3-4 times. WOW!
• 10 Gb Ethernet moving into architectures as prices fall
• Enhanced Network Performance coming from Cisco and HP
  • Near wire speed with 1000V and Nexus Switches

Introduction – Disaster Recovery
• Definition for Disaster Recovery
  • Mine – Something everyone plans for and few actually do
  • Wiki – (A good one) planning for resumption of applications, data, hardware, communications (such as networking) and other IT infrastructure
  • The IT part of the greater “Continuity of Operations”, which includes much more than IT
• Characteristics of DR Implementations
  • Some organizations do it well
    • Usually when the cost of a failure is very high
  • Most don’t
  • It’s not just storing backups on tapes offsite
  • Can be difficult to afford
    • Seen as insurance

Introduction – Disaster Recovery
• Benefits of Virtualization for DR:
  • Virtualization can significantly simplify DR from a technology and process perspective
  • Entire servers can be copied/replicated between sites and easily recovered
  • Can provide ubiquitous DR for all tiers
  • Can significantly lower RTO
  • Provides some challenges for RPO
• Requirements to achieve benefits:
  • Significant bandwidth for replication
  • Significant investment in DR site infrastructure, especially SAN and replication software from SAN or software vendors

Major Costs of DR
• Server HW
  • Made affordable with virtualization @ 25-30 to 1 consolidation
• Storage
  • Tier 1 (DMX, HDS, etc.) replicated - very expensive
  • Tier 2 replicated - still very expensive
  • Tier 3 SATA/FATA based - more economical
• Bandwidth
  • Gb Speeds Regional - very expensive
  • Gb Speeds Metro - moderately expensive
  • OC12 (about 622 Mb/s) Regional - very expensive
  • OC3 (about 155 Mb/s) Regional - expensive
• Software
  • OS - Depends
    • Active / Active versus copies
  • VMware - Depends
    • If all failover, it’s expensive
    • If one live DC backs up another and Dev/Test is mixed with production, then it’s not
  • Applications - Depends
• Staff and Development - Expensive
• Datacenter Space, power, cooling - Expensive

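Bandwidth cost is easier to weigh against a replication window. The following is a rough sketch, not a sizing tool: the link table uses the nominal OC-3 and OC-12 line rates (which the slide rounds), and the function name, the decimal-gigabyte convention, and the default 80% usable-throughput factor are assumptions.

```python
# Nominal line rates in megabits per second.
LINK_MBPS = {"OC3": 155.52, "OC12": 622.08, "GbE": 1000.0}


def transfer_hours(data_gb: float, link_mbps: float,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move `data_gb` (decimal gigabytes) over a link.

    `efficiency` is an assumed fraction of the line rate actually usable
    for replication traffic after protocol overhead and sharing.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000          # 1 GB = 8000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600
```

For instance, `transfer_hours(450, LINK_MBPS["GbE"], efficiency=1.0)` is exactly one hour, while the same 450 GB over OC3 at the default efficiency takes a little over eight hours, which is the kind of gap that decides whether a daily replication window is realistic.
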
Infrastructure Comparison - SAN/NAS LUN Replication
• Almost immediate RTO
• Supports tiers of RPO
• Can be automated, e.g. with VMware SRM
• May or may not include quiescing applications
• Relatively Expensive
• Must fail over all VMs on a LUN
• If application failover is required, VMs must be segregated onto LUNs by application, which can impact performance

Infrastructure Comparison - Backup/Dedupe
• RTO based on method of recovery
  • Multiple VCB Proxies
  • Non VCB Proxies
  • Media Server to Agent in VM
• RTO higher than LUN replication in general but much shorter than tape
• Usually supports a single tier of RPO: 1 day
• Recovery is simple
• Relatively Low Cost
• Enhances Operational Recovery

DR Benefits of Backup Integrated De-duplication
• Lower storage requirements for online backup, providing lower cost or space for more backups
• Lower bandwidth requirements for DR replication
• Faster operational RTO from having online backups rather than going to tape
• Simple operational recovery of an entire VM based on image backups
• Reference Architecture that does not require the same level of storage in the DR site
• Recovers single VMs rather than entire LUNs as in the SAN replication model, so single applications can be failed over to DR without segregating applications by LUN, which can affect performance

VMware Site Recovery Manager
• Manages the SAN Replication DR option for VMware ESX
• Holds DR Recovery Plan Documentation
• Automates Configuration and Setup of the DR process
• Create and Test Recovery Plans
• Report Results of Tests
• Integration between VMware ESX, vCenter and SAN Vendors
• Initiates failover when necessary, automating important changes like IP address assignments and performance allotments
• Uses LUN replication
  • Failing over a single VM is possible, but it breaks replication and puts the other VMs on the LUN at risk, so SRM is intended for entire site failover
  • It is possible to have a single LUN per VM or to segregate VMs on LUNs by application, but this is hard to manage

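Because replication works at LUN granularity, the unit of failover is every VM that shares a LUN. The sketch below illustrates only that grouping logic over a plain dictionary; it does not call SRM or any VMware API, and the function name and the VM and LUN names are made up.

```python
from __future__ import annotations


def vms_failed_over_with(vm_to_lun: dict[str, str], target_vm: str) -> list[str]:
    """Return every VM (including the target) that lives on the same LUN
    as `target_vm`, i.e. the set that moves together under LUN replication."""
    lun = vm_to_lun[target_vm]
    return sorted(vm for vm, vm_lun in vm_to_lun.items() if vm_lun == lun)


# Hypothetical placement of four VMs on two replicated LUNs.
layout = {
    "erp-db": "lun01",
    "erp-app": "lun01",
    "mail": "lun01",
    "web": "lun02",
}
```

Here `vms_failed_over_with(layout, "erp-db")` returns `["erp-app", "erp-db", "mail"]`: failing over the ERP database also drags the unrelated mail server along unless VMs are segregated onto LUNs by application.
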
Alternative DR Strategies
• Active/Active DR
  • Mix Dev/Test/Prod in at least 2 DCs
  • Sync DR in both directions
  • On Clusters favor Prod VMs for Performance
  • During an event Dev/Test can be shut down in favor of Prod
  • Significantly lower cost than Active/Passive
• Tiered Solution with VMware SRM
  • Only designated systems are included
    • Normally based on low RTO
  • LUNs designated for replication or not based on SLA
  • Applications can be segregated onto LUNs if application failover and consistency are required
    • Must be careful of performance issues
    • Requires diligent monitoring for hot spots
    • May require adding LUNs for applications
  • Lower Tiered systems can be restored from normal backups or dedupe/VTL

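The Active/Active idea of shutting down Dev/Test in favor of Prod can be checked with simple capacity arithmetic. This is a hedged sketch under the assumption that load and capacity are expressed in one consistent unit (for example GB of RAM); the class, the function, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteLoad:
    capacity: float  # total usable capacity at the surviving site
    prod: float      # production load normally running there
    dev_test: float  # dev/test load normally running there


def dev_test_to_shut_down(survivor: SiteLoad, failed_prod: float) -> float:
    """Amount of dev/test load the surviving site must shut down so it can
    also run the failed site's production load.

    Raises ValueError if production from both sites does not fit even with
    all dev/test stopped.
    """
    required_prod = survivor.prod + failed_prod
    if required_prod > survivor.capacity:
        raise ValueError("surviving site cannot host all production load")
    free_after_prod = survivor.capacity - required_prod
    return max(0.0, survivor.dev_test - free_after_prod)
```

For a site with capacity 100 running 40 of production and 40 of dev/test, taking over 45 of failed production means shutting down 25 of dev/test; taking over only 10 requires shutting down nothing.
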
Alternative DR Strategies
• Solution based on Backup to Dedupe NAS or VTL
  • Can support image backups for low RTO and normal agent-based file backups to the same devices for longer RTO
  • Can be integrated with VCB
  • Can integrate with existing backup software and strategy
  • Has a longer RTO due to restore time
• Can use software-based replication products for a small number of VMs
  • Good if there is no SAN in the target site
  • Good option for smaller remote offices with VM Infrastructure

Multi-vendor Storage Resource Management Discussion
• Jeff Phipps from Zot thought this would be interesting to discuss and I agreed
• SRM (here Storage Resource Management, not Site Recovery Manager) originally caused considerable excitement
• Multi-vendor SRM is certainly something all large organizations could use
• Industry-standard monitoring and alerting frameworks/applications based on SMI-S have been very disappointing, with only sketchy support from the vendor community
• These tools do not provide the power and performance required to manage a multi-vendor storage environment well
• At GlassHouse we have switched to using vendor management products integrated into our monitoring platform via SNMP and some email
• Still trying to use various tools for reporting, but the vendor tools work best
• The prevalent strategy we see is to limit the number of platforms within your organization to ease the management burden associated with different platforms