Disaster Recovery 2.0 A paradigm shift in DR Architecture. Singapore, Q2 2013
Co-authors & Reviewers • Reviewers • Michael White • Staff Product Integration Architect, VMware, ca.linkedin.com/pub/michael-white/3/bb0/619 • Michael Webster • Strategic Architect, VMware, longwhiteclouds.com VCAP-DCD, TOGAF Certified • Iwan ‘e1’ Rahabok • Staff SE, Strategic Accounts, VMware • e1@vmware.com| sg.linkedin.com/in/e1ang VCP, CCDP, CCNP • Lim Wei Chiang • SE, Strategic Accounts, VMware • wclim@vmware.com| sg.linkedin.com/in/weichiang VCDX, vExpert 2012
Business Requirements • It is similar to Insurance. • It’s no longer acceptable to run business without DR protection. • The question is now about… • How do we cut the DR cost & complexity? People cost, technology cost, etc. Protect the Business in the event of Disaster
Disaster did strike in Singapore • 29 June 2004 • Electricity Supply Interruption • More than 300,000 homes were left in the dark • About 30% of Singapore was affected. • If both your Prod and DR datacenters were on this 30%.... • Caused by the disruption of natural gas supply from West Natuna, Indonesia. A valve at the gas receiving station operated by ConocoPhillips tripped. Natural gas supply was disrupted causing 5 units of the combined-cycle gas turbines (CCGT) at Tuas Power Station, Power Seraya Power Station and SembCorp Cogen to trip. • Some of the CCGTs could not switch to diesel successfully. Investigation into the incident is in progress. • Other Similar Incidents • The first disruption in natural gas supply occurred on 5 Aug 2002 due to a tripping of a valve in the gas receiving station which led to a power blackout.
Disaster Recovery (DR) >< Disaster Avoidance (DA) • DA requires that the Disaster must be avoidable. • DA implies that there is time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems. • Once avoided, for all practical purposes, there is no more disaster. • There is no recovery required. • There is no panic & chaos. • DA is about Preventing (no downtime). DR is about Recovering (already down). • These are 2 opposite contexts. It is insufficient to have DA only. DA does not protect the business when Disaster strikes. Get DR in place first, then DA.
DR Context: It’s a Disaster, so… • It might strike when we’re not ready • E.g. the IT team is having an offsite meeting, and the next flight is 8 hours away. • Key technical personnel are not around (e.g. sick or on holiday) • We can’t assume Production is up. • There might be nothing for us to evacuate or migrate to the DR site. • Even if the servers are up, we might not even be able to access them (e.g. network is down). • Even if it’s up, we can’t assume we have time to gracefully shut down or migrate. • Shutting down multi-tier apps is complex and takes time when you have 100s… • We can't assume certain systems will not be affected • DR Exercise should involve the entire datacenter. Assume the worst, and start from that point.
Singapore MAS Guidelines • MAS is very clear that DR means a Disaster has happened, as there is an outage. • Clause 8.3.3 states that Total Site should be tested. So if you are not doing an entire DC test, you’re not in compliance.
DR: Assumptions • A company-wide DR Solution shall assume: • Production is down or not accessible. • Entire datacenter, not just some systems. • Key personnel are not available • Storage admin, Network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc. • Intelligence should be built into the system to eliminate reliance on human experts. • Manual Run Books are not 100% up to date • Manual documents (Word, Excel, etc) covering every step to recover an entire datacenter are prone to human error. They contain thousands of steps, written by multiple authors. • Automation & virtualisation reduce this risk.
DR Principles • To Business Users, the actual DR experience must be identical to the Dry Run they experience • In a panic or chaotic situation, users should deal with something they are trained on. • This means the Dry Run has to simulate Production (without shutting down Production) • Dry Run must be done regularly. • This ensures: • New employees are covered. • Existing employees do not forget. • The procedures are not outdated (hence incorrect or damaging) • Annual is too long a gap, especially if many users or departments are involved. • DR System must be a replica of Production System • Testing with a system that is not identical to production renders the Dry Run invalid. • Manually maintaining 2 copies of 100s of servers, network, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production system. • System >< Datacenter. Normally, the DR DC is smaller. System here means a collection of servers, storage, network, security that make up “an application from business point of view”.
Datacenter-wide DR Solution: Technical Requirements • Fully Automated • Eliminate reliance on many key personnel. • Eliminate outdated (hence misleading) manual runbooks. • Enable frequent Dry Run, with 0 impact to Production. • Production must not be shut down, as this impacts the business. • Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure. • No clashing with Production Hostnames and IP addresses. • If Production is not impacted, then users can take time to test DR. No need to finish within a certain time window anymore. • Scalable to entire datacenter • 1000s of servers • Cover all aspects of infrastructure, not just server + storage. Network, Security, Backup have to be included so the entire datacenter can be failed over automatically.
DR 1.0 architecture (current thinking) • Typical DR 1.0 solution (at infrastructure layer) has the following properties:
DR 1.0 architecture: Limitations • Technically, it is not even a DR solution • We do not recover the Production System. We merely mount production Data on a different System • The only way for the System to be recovered is to do SAN boot on the DR Site. • Can’t prove to auditors that DR = Production. • Registry changes, config changes, etc are hard to track at OS and Application level. • Manual mapping of data drive to associated server on DR site. • Not a scalable solution as manual updates don’t scale well to 1000s of servers. • Heavy on scripts, which are not tested regularly. • DR Testing relies heavily on IT expertise.
R01: DR Copy = Production Copy • Solution: replicate System + Data, not just the data drive (LUN). This covers OS, Apps, settings, firewall, load balancer, etc. • Implication of the solution: • If the Production network is not stretched, the server will be unreachable. Changing the IP will break the Application. • If the Production network is stretched, IP Address and Hostname will conflict with Production. Changing the Hostname will definitely break the Application. • A stretched L2 network is not a full solution. Entire LAN isolation is the solution. • Solution: Entire Dry Run network must be isolated (bubble network) • No conflict with Production, even though it’s actually identical. It’s a shadow of the Production LAN. • All network services (AD, DNS, DHCP, Proxy) must exist in the Shadow Prod LAN. • Implication of the solution: • For VMs, this is easily done via vSphere and SRM • For Physical Servers, they need to be connected to the Dry Run LAN. A permanent connection simplifies this and eliminates the risk of accidental updates to production.
R02: Identical User Experience • (Diagram: Production desktop pools at desktop.ABC Corp.com; on-demand DR Test desktop pools at Desktop-DRTest.ABC Corp.com) • VDI is a natural companion to DR as it makes the “front-end” experience seamless. • Users use a Virtual Desktop as their day-to-day desktop. • VDI enables us to DR the desktop too. • During Dry Run • Users connect to desktop.vmware.com for production and desktop-DR.vmware.com for Dry Run. Having 2 desktops means the environments are completely isolated. • During actual Disaster • Desktop-DR.vmware.com is renamed to desktop.vmware.com as the original desktop.vmware.com is down (affected by the same disaster). Users connect to desktop.vmware.com, just like they do day to day, hence creating an identical experience.
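The name switch described on this slide can be sketched as follows. This is a minimal illustration, assuming a plain in-memory name table; the pool identifiers `prod-desktop-pool` and `dr-desktop-pool` and both function names are hypothetical, and a real deployment would make the change in DNS or in the desktop broker rather than in a Python dict.

```python
def resolve(records, name):
    """Return the desktop pool that a user-facing name points to, or None."""
    return records.get(name)


def fail_over(records, prod_name, dr_name):
    """Return a new record set in which the DR pool answers to the production name."""
    updated = dict(records)
    # The DR entry is renamed, so the old DR name disappears.
    updated[prod_name] = updated.pop(dr_name)
    return updated


# During Dry Run: two names, two fully separate desktop pools.
dry_run = {
    "desktop.vmware.com": "prod-desktop-pool",   # hypothetical pool id
    "desktop-DR.vmware.com": "dr-desktop-pool",  # hypothetical pool id
}

# During actual Disaster: the DR pool takes over the production name.
disaster = fail_over(dry_run, "desktop.vmware.com", "desktop-DR.vmware.com")
```

Users keep typing the same name in both situations; only the pool behind it changes.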
R03: No impact on Production during Dry Run • To achieve the above, the DR Solution: • Cannot require Production to be shut down or stopped. It must be Business as Usual. • Must be an independent, full copy altogether, with no reliance on any Production component. • Network, security, AD, DNS, Load Balancer, etc.
R04: Frequent Dry Run • To achieve the above, the DR Solution cannot: • Be laborious or prone to human error. A fully automated solution addresses this. • Touch the production system or network. So it has to be an isolated environment. A Shadow Production LAN solves this. • VMware SRM enables the automation component for VMs. • Physical Machines are harder to isolate. They need physical isolation. • You should have full confidence that the Actual Fail Over will work. This can only be achieved if you can do frequent dry runs.
Solution: Isolating ESXi Host (1 physical box) • DR Test LAN Portgroup. VLAN 30. Type: VM Network • Non-Prod LAN Portgroup. VLAN 10. Type: VM Network • Production LAN Portgroup. VLAN 20. Type: VM Network • ESXi Mgmt Portgroup. VLAN 40. Type: vmkernel Network • Uplinks: to physical switches (main network on Site 2) and to physical switches (isolated DR Test network) • Diagram labels: Isolated DR Test network, Connected Prod Network
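The port group layout above can be written down as data and checked for isolation. This is only a sketch: the VLAN IDs and types are taken from the slide, but the uplink labels (`isolated-dr-test`, `site2-main`) and the assignment of port groups to uplinks are assumptions, since the diagram itself did not survive. It does not call any vSphere API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortGroup:
    name: str
    vlan: int
    kind: str    # "VM Network" or "vmkernel"
    uplink: str  # hypothetical label for the physical switches behind it


# VLAN IDs and types are from the slide; the uplink labels are assumptions.
PORT_GROUPS = [
    PortGroup("DR Test LAN", 30, "VM Network", "isolated-dr-test"),
    PortGroup("Non-Prod LAN", 10, "VM Network", "site2-main"),
    PortGroup("Production LAN", 20, "VM Network", "site2-main"),
    PortGroup("ESXi Mgmt", 40, "vmkernel", "site2-main"),
]


def isolation_problems(port_groups, isolated_name="DR Test LAN"):
    """List every way the isolated port group shares a path with another one."""
    isolated = next(pg for pg in port_groups if pg.name == isolated_name)
    problems = []
    for pg in port_groups:
        if pg.name == isolated_name:
            continue
        if pg.vlan == isolated.vlan:
            problems.append(f"{pg.name} shares VLAN {pg.vlan}")
        if pg.uplink == isolated.uplink:
            problems.append(f"{pg.name} shares uplink {pg.uplink}")
    return problems
```

An empty result means the DR Test LAN has its own VLAN and its own uplink, which is the property the Dry Run depends on.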
Solution: Dealing with Physical Servers • Singapore (Prod Site) • CRM-Web-Server.vmware.com 10.10.10.10 • CRM-App-Server.vmware.com 10.10.10.20 • CRM-DB-Server.vmware.com 10.10.10.30 • Singapore (DR Site), Shadow Production LAN • CRM-Web-Server.vmware.com 10.10.10.10 • CRM-App-Server.vmware.com 10.10.10.20 • CRM-DB-Server.vmware.com 10.10.10.30 • CRM-DB-Server-Test.vmware.com 20.20.20.30
Physical Servers: Dual boot option • This VM is a Jump Box. Without a Jump Box, we cannot access the Shadow Production LAN during Dry Run. It runs on ESXi which is connected to both LANs. • Shadow Production LAN (10.10.10.x) • LAN on Datacenter 2 (20.20.20.x) • Need to add 2 NICs to each server • Physical Server must be dual-boot (OS): • Normal Operation: Test/Dev environment (default boot) • Dry Run or DR: Shadow Production network, OS drive + Data
Physical Servers: Dual partition option This VM is a Jump Box. Without a Jump Box, we cannot access Shadow Production LAN during Dry Run. It runs on ESXi which is connected to both LANs. Shadow Production LAN (10.10.10.x) LAN on Datacenter 2 (20.20.20.x) 1 physical box DR Partition Test/Dev Partition
Typical Physical Network: it’s 1 network • (Diagram: Singapore (Prod Site), Country X (any site) and Singapore (DR Site) each host Production PMs, Production VMs, AD/DNS and Non-AD DNS on the Production Networks, reachable from the Users Site) • ABC Corp operates in many countries in Asia, with Singapore being the HQ. A system may consist of multiple servers from more than 1 country. • DNS service for Windows is provided by MS AD. DNS service for non-Windows is provided by non-MS-AD DNS. • Users (from any country) can access any server (physical or virtual) in any country as basically there is only 1 “network”. There is routing to connect the various LANs. • In 1 “network”, we can’t have 2 machines with the same host name or the same IP. • Each LAN has its own network address. Hence changing of IP address is required when moving from Prod Site to DR Site.
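The rule that one routed "network" cannot hold two machines with the same host name or the same IP can be checked mechanically. The sketch below assumes a flat inventory of `(site, hostname, ip)` tuples; the function name and the inventory format are hypothetical, and the sample hosts reuse the CRM names from the physical-server slide.

```python
from collections import Counter


def find_clashes(machines):
    """Return hostnames and IPs that occur more than once in one routed network.

    machines is a list of (site, hostname, ip) tuples.
    """
    host_counts = Counter(hostname.lower() for _, hostname, _ in machines)
    ip_counts = Counter(ip for _, _, ip in machines)
    return {
        "hostnames": sorted(h for h, n in host_counts.items() if n > 1),
        "ips": sorted(i for i, n in ip_counts.items() if n > 1),
    }


# A replica of the production DB server placed on the same routed network
# clashes with the original; the Test server on 20.20.20.x does not.
inventory = [
    ("Prod Site", "CRM-DB-Server.vmware.com", "10.10.10.30"),
    ("DR Site", "CRM-DB-Server.vmware.com", "10.10.10.30"),
    ("DR Site", "CRM-DB-Server-Test.vmware.com", "20.20.20.30"),
]
clashes = find_clashes(inventory)
```

This is why an identical DR copy has to live on an isolated LAN rather than on the single routed network.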
Site 2 needs to have 2 distinct Networks • This VM is a Jump Box. Without a Jump Box, we cannot access the Shadow Production LAN during Dry Run. It runs on ESXi which is connected to both LANs. • Shadow Production LAN (10.10.10.x): DR Server • LAN on Datacenter 2 (20.20.20.x): Test/Dev Server
Mode 1: Normal Operation or During Dry Run • Datacenter: Site 1 • Production LAN (10.10.10.x) • Datacenter: Site 2 • Shadow Production LAN (10.10.10.x), marked “x” (not connected) towards the Production LAN • Jump box • Non Prod LAN (20.20.20.x) • Users Site • Desktop LAN (30.30.30.x)
Mode 2: Partial DR Datacenter: Site 2 Datacenter: Site 1 Production LAN (10.10.10.x) Non Prod LAN (20.20.20.x) Users Site Desktop LAN (30.30.30.x)
DA • From the view of DR
DA & DR in virtual environment • DR and DA solutions do not fit well together in vCloud Suite 5.1 • DA requires 1 vCenter • DA needs long distance migration, which does not work across 2 vCenters. • DR requires 2 vCenters. • vCenter prevents the same VM from appearing twice in the same vCenter. • There is confusion about DR + DA • You cannot have DA + DR on the same “system”. You need 3 instances. • 1 primary • 1 secondary for DR purposes • 1 secondary for DA purposes. • The next slide explains the limitations of some DA solutions for the DR use case. • This is not to criticise the DA solution, as it is a good solution for the DA use case.
DA Solution: Stretched Cluster (+ Long Distance vMotion) • When actual DR strikes… • We can’t assume Production is up. Hence vMotion is not a solution. • HA will kick in and boot all VMs. Boot order will not be honoured. • Challenge of the above solution: How do we Test? • DR Solution must be tested regularly as per Requirement R04. • The test must be identical from the user point of view, as per Requirement R02. • So the test will have to be like this: • Cut replication, then mount the LUNs, then add VMs into VC, then boot the VMs. • But… we cannot mount the LUNs in the same vCenter as they have the same signature! Even if we could, we must know the exact placement of each VM (which is complex). Even if we could, we cannot boot 2 copies of the same VM in the same vCenter! This means Production VMs must be down. This fails Requirement R03. • Conclusion: Stretched Cluster does not even qualify as a DR Solution as it can’t be tested & it’s 100% manual.
DA Solution: 2 Clusters in 1 VC (+ Long Distance vMotion) • This is a variant of Stretched Cluster. • It fixes the risk & complexity of Stretched Cluster, and avoids the performance impact of uncontrolled long distance vMotion. • When actual DR strikes… • We can’t assume Production is up. Hence vMotion is not a solution. • HA will not even kick in as it’s a separate cluster. In fact, VMs will be in an error state, appearing italicized in vCenter. Can they be removed from vCenter since the host is not responding? • We can assume vCenter will be up on the DR Site, if it’s separately protected by Heartbeat. • Challenge of the above solution: How do we Test? • All issues facing Stretched Cluster apply. • Conclusion: 2-Cluster is inferior to Stretched Cluster from a DR point of view
Active/Active or Active/Passive • Which one makes sense?
Just what is a Software-Defined Datacenter anyway? • (Diagram: one Virtual Datacenter on top of Physical Datacenter 1 and Physical Datacenter 2; each physical DC has Compute, Network and Storage functions from Vendor 1 and Vendor 2) • Physical Compute Function: Shared Nothing Architecture. No stretched cluster between 2 physical DC. Each site has its own vCenter. • Physical Network Function: Shared Nothing Architecture. Not stretched between 2 physical DC. Production might be 10.10.x.x. DR might be 20.20.x.x • Physical Storage Function: Shared Nothing Architecture. No replication between 2 physical DC. Production might be FC. DR might be iSCSI.
Background • Active/Active Datacenter has many levels of definition: • Both DC are actively running workload, so neither is idle. • This means Site 2 can be running non-Production workload, like Test/Dev and DR. • Both DC are actively running Production workload • Building on the previous level, this means Site 2 must run Production workload. • Both DC are actively running Production workload, with cluster failover. • Building on the previous level, the same App runs on both sides. But the instance on Site 2 is not serving users. It’s waiting for an application-level failover. • This is typically done via a geo-cluster solution. • Both DC are actively running Production workload, with A/A at application level • Both Apps are running. Normally done via a global Load Balancer. • No need to fail over as each App is “complete”. It has the full data, and it does not need to tell the other App when its data is updated. No transaction level integrity required. • This is the ideal. But most apps cannot do this as the data cannot be split. You can only have 1 copy of the data. • In vSphere context, this is what is meant by Active/Active vSphere: both vSphere sites are actively running Production VMs
A closer look at Active/Active vCenter • (Diagram, first layout: each site runs 250 Prod VMs on Prod Clusters and 500 Test/Dev VMs on T/D Clusters. Lots of traffic between: Prod to Prod, T/D to T/D) • (Diagram, second layout: a vCenter per site; one site runs 500 Prod VMs on Prod Clusters, the other runs 1000 Test/Dev VMs on T/D Clusters + DR Cluster)
MAS TRM Guideline It states “near” 0, not 0. It states “should”, not “must”. It states “critical”, not all systems. So A/A is only for a subset. This points to an Application-level solution, not Infrastructure-level. We can add this capability without changing the architecture, as shown on next slide.
Adding Active/Active to a mostly Active/Passive vSphere • (Diagram, before: a vCenter per site; 500 Prod VMs on Prod Clusters and 1000 Test/Dev VMs on T/D Clusters) • (Diagram, after: a vCenter and a Global LB per site; 450 Prod VMs on Prod Clusters, 1000 Test/Dev VMs on T/D Clusters, and 50 VMs in 1 Cluster)
Thank You • The next few slides give a perspective from a different solution (NAT based). This solution does not leverage Network Virtualisation. It is based on classical networking.
Application Analysis • SRM should be done after Business Impact Analysis • BIA will list all the apps, owners, RTO, RPO, regulatory requirements, dependencies, etc. • Some applications are important to the business, but do not have high DR priority. • These are normally scheduled/batched apps, like Payroll, Employee Appraisal • Group applications by Service • 1 Service has many VMs. • Put them in the same Datastore Group, as they will need to fail over together. • For each app to be protected, document the dependencies • Upstream and downstream. • A large multi-tier app can easily span > 10 VMs. • Some apps require basic services like AD and DNS to be up. • Type of “DR” • Sometimes there is not a big disaster but a small one. Examples: • The core switch is going to be out for 12 hours. • Annual power cycle on the entire building. This happens at Suntec City, which is considered the vertical Silicon Valley of Singapore. • Define all the Recovery Plans • Consider the time it takes the CIO to decide to trigger DR as part of RTO/RPO. • Do you have enough CPU/RAM to boot the Production VMs during a Test Run? • Identify which VMs on the DR site can be suspended. Add up their total Reservation
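Once upstream and downstream dependencies are documented, a recovery order can be derived from them. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); the VM names and the dependency map are hypothetical examples, not taken from any SRM recovery plan, and SRM itself expresses ordering through its own priority groups and dependencies.

```python
from graphlib import TopologicalSorter


def boot_order(dependencies):
    """Return VM names ordered so that every VM comes after what it depends on.

    dependencies maps each VM to the set of VMs that must be up before it.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-tier service that also needs AD/DNS to be up first.
crm_service = {
    "crm-db": {"ad-dns"},
    "crm-app": {"crm-db"},
    "crm-web": {"crm-app"},
}
order = boot_order(crm_service)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is a useful early warning that the documented dependencies cannot be recovered in any order.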
Pre-Failover • User (10.30.30.30) sends DNS Query: www.abc.com to the Global DNS Load Balancer • DNS Response: Virtual IP 1, 10.10.10.10 • HTTP GET: 10.10.10.10 • Prod Site has VIP 1, DR Site has VIP 2, with a Load Balancer and SNAT at each site • SOURCE NAT • Source IP Changed: 10.30.30.30 => 10.20.20.20 • LOAD BALANCE • VIP Mapped to server IP: 10.10.10.10 => 10.20.20.31 • Servers: Production PMs, Production VMs
Post-Failover • User (10.30.30.30) sends DNS Query: www.abc.com to the Global DNS Load Balancer • DNS Response: Virtual IP 2, 192.168.10.10 • HTTP GET: 192.168.10.10 • Prod Site has VIP 1, DR Site has VIP 2, with a Load Balancer and SNAT at each site • SOURCE NAT • Source IP Changed: 10.30.30.30 => 10.20.20.20 • LOAD BALANCE • VIP Mapped to server IP: 192.168.10.10 => 10.20.20.31 • Servers: Production PMs, Production VMs
DR Dry Run • User (10.30.30.30) sends DNS Query: www-dr-test.abc.com to the Global DNS Load Balancer • DNS Response: Virtual IP 2, 192.168.10.10 • HTTP GET: 192.168.10.10 • Prod Site has VIP 1, DR Site has VIP 2, with a Load Balancer and SNAT at each site • SOURCE NAT • Source IP Changed: 10.30.30.30 => 10.20.20.20 • LOAD BALANCE • VIP Mapped to server IP: 192.168.10.10 => 10.20.20.31 • Servers: DR Test PMs, DR Test VMs, Production PMs, Production VMs
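The three NAT-based slides (Pre-Failover, Post-Failover, DR Dry Run) differ only in which Virtual IP the Global DNS Load Balancer hands out. A minimal sketch of that decision, assuming the names and addresses shown on the slides; the function `resolve` and the `prod_site_up` flag are hypothetical and stand in for whatever health check a real global load balancer uses.

```python
VIP_PROD = "10.10.10.10"   # Virtual IP 1, Prod Site
VIP_DR = "192.168.10.10"   # Virtual IP 2, DR Site


def resolve(name, prod_site_up=True):
    """Return the Virtual IP that the global DNS load balancer answers with."""
    if name == "www-dr-test.abc.com":
        # Dry Run traffic always lands on the DR Site, whatever Production does.
        return VIP_DR
    if name == "www.abc.com":
        return VIP_PROD if prod_site_up else VIP_DR
    raise KeyError(name)
```

Because the Dry Run uses its own name, testers reach the DR Site while production users continue to receive Virtual IP 1.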
Design Challenges… • (Diagram: SHARED INFRASTRUCTURE with layers L3 & Firewall, L2 Switching, Hypervisor, Compute & Storage, hosting vApps for three customers connected via IPSEC VPN (Internet) and MPLS) • Customer A: 192.168.10.0/23, Organizational VDC A • Customer B: 10.1.1.0/24, Organizational VDC B • Customer C: 10.1.1.0/24, Organizational VDC C • Diagram labels: VLAN B, VLAN C, 10.1, 11.1 • Multiple VLAN segments & VLAN routing per customer without use of NAT • Provide hardware level isolation to avoid overlapping IP subnets in a shared switching infrastructure • Eliminate need for customers to re-IP their virtual servers when failing over to the DR site
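The overlapping-subnet problem on this slide is easy to demonstrate with Python's standard `ipaddress` module. The customer subnets below are the ones shown on the slide; the function name `overlapping_pairs` is a hypothetical helper for illustration.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(subnets):
    """Return pairs of customer names whose subnets overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


customers = {
    "Customer A": "192.168.10.0/23",
    "Customer B": "10.1.1.0/24",
    "Customer C": "10.1.1.0/24",
}
conflicts = overlapping_pairs(customers)
```

Customers B and C use the same subnet, so they cannot share one switching domain without isolation or re-addressing, which is the challenge the slide describes.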
vCloud 5.1 / vShield 5.1 Solution… • (Diagram: layers L2 Switching, vCloud / vShield, Hypervisor, Storage, hosting vApps for three customers connected via IPSEC VPN (Internet) and MPLS) • Customer A: 192.168.10.0/24, Organizational VDC A • Customer B: 10.1.1.0/24, Organizational VDC B • Customer C: 10.1.1.0/24, Organizational VDC C • Diagram labels: VXLAN segments, Pass-Thru VLAN, Internet VLAN • Use vShield Edge Gateway and VXLAN to provide multiple routable segments & isolation within an organizational VDC • Simplify L2 edge configuration by using simple pass-thru VLANs for customer WAN termination and segmentation • Consolidate compute and network services into a single common hardware platform