Multi-Site Clustering for Hyper-V Disaster Recovery

Multi-Site Clustering for Hyper-V Disaster Recovery. Greg Shields, MVP, vExpert, Senior Partner, Concentrated Technology. www.ConcentratedTech.com @ConcentratdGreg

About the speaker Over 15 years of Windows experience • Administrator – Managed environments ranging from a few dozen to many thousands of users… • Consultant – Hands-on and Strategic… • Speaker – TechMentor, Tech Ed, Windows Connections, MMS, VMworld, ISACA, others… • Analyst/Author – Fourteen books and counting… • Columnist – TechNet Magazine, Redmond Magazine, Windows IT Pro Magazine, TechTarget Online, others… • All-around good guy…

What Makes a Disaster? Which of the following would you consider a disaster? • An event impacts your datacenter and causes damage. That damage causes all processing in that datacenter to cease • An incident interrupts the functionality of your datacenter for an extended period of time • A failure immediately ceases all processing on a server • A fault causes problems with a service, shutting down that service and preventing some action from occurring on the server • An issue causes a server or an entire rack of servers to inadvertently and rapidly power down

What Makes a Disaster? Which of the following would you consider a disaster? These three are just a bad day… • A failure immediately ceases all processing on a server • A fault causes problems with a service, shutting down that service and preventing some action from occurring on the server • An issue causes a server or an entire rack of servers to inadvertently and rapidly power down

What Makes a Disaster? • Your decision to “declare a disaster” and move to “disaster ops” is a major one • The technologies used for disaster protection are different than those used for high-availability • More complex • More expensive • Failover and failback processes involve more thought • You might not be able to just “fail back” with a click of a button

Multi-Site Hyper-V == Single-Site Hyper-V Multi-site Hyper-V looks very much the same as single-site Hyper-V • Microsoft has not done a good job of explaining this fact! • Some Hyper-V hosts • Some networking and storage • Virtual machines that Live Migrate around But there are some major differences too… • VMs can Live Migrate across sites • Sites typically have different subnet arrangements • Data in the primary site must be replicated to the DR site • Clients need to know where your servers go!

Constructing Site-Proof Hyper-V: Three Things At a very high level, Hyper-V disaster recovery is three things • Storage mechanism • Replication mechanism • Target servers & cluster Once you have these three things, layering Hyper-V atop is easy.

Constructing Site-Proof Hyper-V: Three Things (Diagram: storage device(s), replication mechanism, target servers)

Thing 1: A Storage Mechanism Typically, two SANs in two different locations • FibreChannel, iSCSI, FCoE, heck, JBOD • Similar model or manufacturer: similarity supports proper replication • Backup SAN doesn’t necessarily need to be of the same size or speed as the primary SAN • Replicated ≠ full data (not always): DR is not for everything! • DR environments: where old SANs go to die!

Thing 2: A Replication Mechanism Replication between SANs must occur in one of two ways 1. Synchronously • Changes are made on one node at a time • Subsequent changes on primary SAN must wait for ACK from backup SAN 2. Asynchronously • Changes on backup SAN will eventually be written • Changes are queued at primary SAN to be transferred at intervals

Thing 2: A Replication Mechanism 1. Synchronously • Changes are made on one node at a time. Subsequent changes on primary SAN must wait for ACK from backup SAN.

Thing 2: A Replication Mechanism 2. Asynchronously • Changes on backup SAN will eventually be written. They are queued at primary SAN to be transferred at intervals.
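
The difference between the two modes can be sketched in a few lines of Python. This is a toy model only, not how any real SAN implements replication; the class names `PrimarySan` and `BackupSan` and the `flush()` method are hypothetical and exist just for illustration.

```python
from collections import deque


class BackupSan:
    """Toy backup SAN: stores blocks and acknowledges each write."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data
        return True  # the ACK sent back to the primary SAN


class PrimarySan:
    """Toy primary SAN that replicates synchronously or asynchronously."""

    def __init__(self, backup, synchronous):
        self.backup = backup
        self.synchronous = synchronous
        self.blocks = {}
        self.pending = deque()  # changes queued for the next transfer interval

    def write(self, block_id, data):
        self.blocks[block_id] = data
        if self.synchronous:
            # The write is not complete until the backup SAN acknowledges it.
            if not self.backup.write(block_id, data):
                raise IOError("backup SAN did not acknowledge the write")
        else:
            # The write completes immediately; the change waits in the queue.
            self.pending.append((block_id, data))

    def flush(self):
        """Transfer queued changes; stands in for one replication interval."""
        while self.pending:
            block_id, data = self.pending.popleft()
            self.backup.write(block_id, data)

    def unreplicated_changes(self):
        """Number of changes that would be lost if the primary failed now."""
        return len(self.pending)
```

In this sketch a synchronous primary never has unreplicated changes, while an asynchronous primary accumulates them until `flush()` runs, which is the window in which a site failure loses data.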

Food for Thought Which would you choose? Why? Synchronous: • Assures no loss of data • Requires a high-bandwidth and low-latency connection • Write and acknowledgement latencies impact performance • Requires shorter distances between storage devices Asynchronous: • Potential for loss of data during a failure • Leverages smaller-bandwidth connections, more tolerant of latency • No performance impact • Potential to stretch across longer distances Your Recovery Point Objective makes this decision…
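
As a rough illustration of how the Recovery Point Objective drives the choice, the sketch below compares an RPO against the worst-case data loss of an asynchronous schedule. The function names and the simple "interval plus transfer time" model are assumptions made for this example; real replication products expose their own settings and behave in more nuanced ways.

```python
def worst_case_loss_seconds(interval_seconds, transfer_seconds):
    """Rough upper bound on how much recent data asynchronous replication can lose.

    Assumed model: a change made just after a transfer starts waits a full
    interval for the next transfer, then needs the transfer time to arrive.
    """
    return interval_seconds + transfer_seconds


def choose_replication_mode(rpo_seconds, interval_seconds, transfer_seconds):
    """Pick a mode from the RPO (the amount of data loss you can tolerate)."""
    if rpo_seconds <= 0:
        # No tolerated loss: every write must be acknowledged by the backup SAN.
        return "synchronous"
    if worst_case_loss_seconds(interval_seconds, transfer_seconds) <= rpo_seconds:
        return "asynchronous"
    # This asynchronous schedule cannot meet the RPO. Shortening the interval
    # is another option; this sketch simply falls back to synchronous.
    return "synchronous"
```

For example, with a 5-minute interval and a 1-minute transfer, the worst case under this model is 360 seconds, so a 15-minute RPO is satisfied by asynchronous replication while a 2-minute RPO is not.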

Thing 2½: Replication Processing Location There are also two locations for replication processing… 1. Storage Layer • Replication processing is handled by the SAN itself • Agents are often installed on virtual hosts or machines to ensure crash consistency • Easier to set up, fewer moving parts. More scalable • Concerns about crash consistency 2. OS / Application Layer • Replication processing is handled by software in the VM OS • This software also operates as the agent • More challenging to set up, more moving parts. More installations to manage/monitor. Scalability and cost are linear • Fewer concerns about crash consistency

Thing 3: Target Servers and a Cluster • Finally, you need target servers and a cluster in the backup site.

Clustering’s Sordid History Windows NT 4.0 • Microsoft Cluster Service “Wolfpack” • “As the corporate expert in Windows clustering, I recommend you don’t use Windows clustering” Windows 2000 • Greater availability, scalability. Still painful Windows 2003 • Added iSCSI storage to traditional FibreChannel • SCSI Resets still used as method of last resort (painful) Windows 2008 • Eliminated use of SCSI Resets • Eliminated full-solution HCL requirement • Added Cluster Validation Wizard and pre-cluster tests • Clusters can now span subnets (ta-da!) Windows 2008 R2 • Improvements to Cluster Validation Wizard and Migration Wizard • Additional cluster services • Cluster Shared Volumes (!) and Live Migration (!)

So, What IS a Cluster? Quorum Drive & Storage for Hyper-V VMs

So, What IS a Multi-Site Cluster?

Quorum: Clustering’s Most Confusing Configuration • Ever been to a Kiwanis meeting…? Different clubs, different rules. Different clusters, different rules • A cluster “exists” because it has quorum between its members. Quorum is achieved via a voting process • If a cluster “loses quorum”, the entire cluster shuts down and ceases to exist until quorum is regained • Quorum is different from resource failover • Multiple quorum models exist
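
The voting idea can be sketched as a simple majority check. This is a minimal illustration rather than the cluster service's actual logic; the function name `has_quorum` is hypothetical, and it assumes every voter (node or witness) carries exactly one vote.

```python
def has_quorum(votes_online, total_votes):
    """A cluster keeps quorum only while more than half of all votes are present."""
    return 2 * votes_online > total_votes


# A five-vote cluster survives the loss of two voters, but not three.
five_vote_cluster = [has_quorum(online, 5) for online in (5, 4, 3, 2)]

# A four-vote cluster split evenly in two: neither half has a majority.
even_split = has_quorum(2, 4)
```

The even-split case is why an odd number of votes matters, and why the quorum models add a disk or file share as an extra voter.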

Four Options for Quorum • Node and Disk Majority • Node Majority • Node and File Share Majority • No Majority: Disk Only

Quorum in Multi-Site Clusters • Node and Disk Majority • Node Majority • Node and File Share Majority • No Majority: Disk Only Microsoft recommends using the Node and File Share Majority model for multi-site clusters • This model provides the best protection for a full-site outage • Surviving a full-site outage requires a file share witness in a third geographic location

Quorum in Multi-Site Clusters • Use the Node and File Share Majority quorum model • Prevents an entire-site outage from impacting quorum • Enables creation of multiple clusters if necessary Third Site for Witness Server

I Need a Third Site? Seriously? Here’s where Microsoft’s ridiculous quorum notion gets unnecessarily complicated… • What happens if you put the quorum’s file share in the primary site? • The secondary site might not automatically come online after a primary site failure • Votes in secondary site < Votes in primary site

I Need a Third Site? Seriously? Here’s where Microsoft’s ridiculous quorum notion gets unnecessarily complicated… • What happens if you put the quorum’s file share in the secondary site? • A failure in the secondary site could cause the primary site to go down. • Votes in secondary site > votes in primary site. This problem gets even weirder as time passes and the number of servers changes in each site
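
The effect of witness placement can be worked through with a small vote count. The sketch below is an illustration under simplifying assumptions: one vote per node, one vote for the file share witness, a whole site failing at once, and a witness in a third site that stays reachable from the surviving site. The function name and the site labels are hypothetical.

```python
def surviving_site_has_quorum(primary_nodes, secondary_nodes, witness_site, failed_site):
    """Return True if the site that is still up keeps quorum after a full-site failure.

    witness_site is "primary", "secondary", or "third".
    failed_site is "primary" or "secondary".
    """
    if failed_site not in ("primary", "secondary"):
        raise ValueError("failed_site must be 'primary' or 'secondary'")
    total_votes = primary_nodes + secondary_nodes + 1  # every node plus the witness
    votes = secondary_nodes if failed_site == "primary" else primary_nodes
    if witness_site != failed_site:
        votes += 1  # the witness is still reachable and adds its vote
    return 2 * votes > total_votes


# Two nodes per site, witness placed in each possible location.
for witness in ("primary", "secondary", "third"):
    for failed in ("primary", "secondary"):
        ok = surviving_site_has_quorum(2, 2, witness, failed)
        print(f"witness={witness:<9} failed={failed:<9} survivor keeps quorum: {ok}")
```

With two nodes in each site, the survivor loses quorum whenever the witness sits in the failed site, and keeps it in both failure cases only when the witness is in the third site.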

I Need a Third Site? Seriously? Third Site for Witness Server

Multi-Site Cluster Tips/Tricks Manage Preferred Owners & Persistent Mode options • Make sure your servers fail over to servers in the same site first • But also make sure they have options for failing over elsewhere
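
The "same site first, elsewhere second" ordering can be expressed as a short sketch. This is not the Failover Clustering API; the function `failover_order`, the host names, and the site names are all hypothetical and only illustrate the ordering you would aim for when setting preferred owners.

```python
def failover_order(current_host, hosts_by_site):
    """Order candidate hosts: same-site hosts first, then hosts in other sites."""
    current_site = next(
        site for site, hosts in hosts_by_site.items() if current_host in hosts
    )
    same_site = [h for h in hosts_by_site[current_site] if h != current_host]
    other_sites = [
        host
        for site, hosts in hosts_by_site.items()
        if site != current_site
        for host in hosts
    ]
    return same_site + other_sites


# Hypothetical layout: two hosts in the primary site, two in the DR site.
hosts = {"primary": ["hv1", "hv2"], "dr": ["hv3", "hv4"]}
order_for_hv1 = failover_order("hv1", hosts)
```

A VM running on `hv1` would try `hv2` before crossing to the DR site, yet still has `hv3` and `hv4` as options if the whole primary site is lost.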

Multi-Site Cluster Tips/Tricks Consider carefully the effects of Failback • Failback is a great solution for resetting after a failure • But Failback can be a massive problem-causer as well • Its effects are particularly pronounced in Multi-Site Clusters • Recommendation: Turn it off (until you’re ready)

More Multi-Site Cluster Tips/Tricks Resist creating clusters that support other services • A Hyper-V cluster is a Hyper-V cluster is a Hyper-V cluster Use disk “dependencies” as Affinity/Anti-Affinity rules • Hyper-V all by itself doesn’t have an elegant way to affinitize • Setting disk dependencies against each other is a work-around Add Servers in Pairs • Ensures that a server loss won’t cause site split brain • This is less of a problem with the File Share Witness configuration

Multi-Site Cluster Tips/Tricks • Segregate traffic!!!

Most Important! Ensure that networking remains available when VMs migrate from primary to backup site • Clustering can span subnets! This is good, but only if you plan for it… • Crossing subnets also means changing IP address, subnet mask, gateway, etc., at the new site • This is done automatically by using DHCP and dynamic DNS, or must be updated manually • DNS replication is also a problem. Clients will require time to update their local cache • Consider reducing DNS TTL or clearing client cache
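
Two of these planning questions lend themselves to a quick check: does a VM's address still work on the destination subnet, and how long might clients keep using the old address? The sketch below uses Python's standard `ipaddress` module. The function names, the example addresses, and the "replication delay plus TTL" estimate are assumptions for illustration, not measured behavior.

```python
import ipaddress


def needs_new_address(vm_ip, destination_subnet):
    """True if the VM's current IP is not valid on the destination site's subnet."""
    return ipaddress.ip_address(vm_ip) not in ipaddress.ip_network(destination_subnet)


def worst_case_client_delay(dns_replication_seconds, record_ttl_seconds):
    """Rough upper bound on how long a client may keep resolving the old address.

    Assumed model: a client caches the old record just before its DNS server
    receives the update, then holds it for the record's full TTL.
    """
    return dns_replication_seconds + record_ttl_seconds


# Hypothetical sites: primary on 10.1.0.0/24, backup on 10.2.0.0/24.
stays_in_primary = needs_new_address("10.1.0.25", "10.1.0.0/24")
moves_to_backup = needs_new_address("10.1.0.25", "10.2.0.0/24")

# Effect of reducing the TTL from one hour to five minutes,
# assuming a 15-minute DNS replication delay.
delay_default_ttl = worst_case_client_delay(900, 3600)
delay_reduced_ttl = worst_case_client_delay(900, 300)
```

Under these assumed numbers, lowering the TTL cuts the worst-case wait from 4500 seconds to 1200 seconds, which is the reasoning behind reducing DNS TTL before you need to fail over.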
