SecondSite: Disaster Tolerance as a Service
Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield
Tolerating Failures in a Datacenter – REMUS
• The initial idea behind Remus was to tolerate datacenter-level failures.
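For background, Remus replicates whole-VM state through frequent checkpoints while buffering outbound network traffic. A minimal sketch of that loop follows – simplified from the NSDI ’08 description, with hypothetical helper names (run_for, copy_dirty_state, etc.), not the actual Xen/Remus API:

```python
# Minimal sketch of the Remus checkpoint loop (simplified; helper names
# are hypothetical, not the actual Xen/Remus API).

def remus_epoch(vm, backup, net_buffer, epoch_ms=25):
    vm.run_for(epoch_ms)            # speculative execution for one epoch
    vm.pause()
    state = vm.copy_dirty_state()   # CPU state + memory pages dirtied this epoch
    vm.resume()                     # keep running while the state is in flight
    backup.send_checkpoint(state)   # asynchronous replication to the backup
    backup.wait_for_ack()
    net_buffer.release()            # only now externalize buffered network output
```

Because output is buffered until the checkpoint it depends on is acknowledged, a failover never exposes state the backup does not have.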
Can a Whole Datacenter Fail? Yes! It’s a “Disaster”!
Disasters
“Truck driver in Texas kills all the websites you really use” …Southlake FD found that he had low blood sugar. - valleywag.com
“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.” - Om Malik, GigaOM
(Illustrative image courtesy of TangoPango, Flickr)
Disasters..
“Water-main break cripples Dallas County computers, operations. The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays – keeping some prisoners in jail longer than normal.” - Dallas Morning News, Jun 2010
More Fodder, Back Home
“An explosion… near our server bank … electrical box containing 580 fiber cables. The electrical box … was covered in asbestos… mandated the wearing of hazmat suits. Worse yet, the dynamic rerouting – which is the hallmark of the internet – did not function. In other words, the perfect storm. Oh well. S*it happens.”
- Dan Empfield, Slowswitch.com, a Gossamer Threads customer
Disaster Recovery – The Old-Fashioned Way
• Storage replication between a primary and backup site.
• Manually restore physical servers from backup images.
• Data loss and long outage periods.
• Expensive hardware – storage arrays, replicators, etc.
State of the Art Disaster Recovery
[Diagram: VMware Site Recovery Manager array replication between a Protected Site and a Recovery Site, each running Site Recovery Manager and VirtualCenter. Datastore groups are replicated across sites; when the VMs become unavailable at the protected site, the offline VMs at the recovery site are powered on.]
Source: VMware Site Recovery Manager – Technical Overview
Problems with Existing Solutions
• Data loss & service disruption (Recovery Point Objective ~15 min, Recovery Time Objective ~a few hours)
• Complicated recovery planning (e.g. service A needs to be up before B, etc.)
• Application-level recovery
• Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.
Our Vision: Disaster Tolerance as a Service?
Overview
• A Case for Commoditizing Disaster Tolerance
• SecondSite – System Design
• Evaluation & Experiences
Primary & Backup Sites (5 ms RTT between sites)
Failover & Failback without Outage
[Diagram: Vancouver starts as the primary site with Kamloops as backup; on failover, Kamloops becomes the primary; after resynchronization, Vancouver rejoins as the backup.]
• Complete state recovery (CPU, disk, memory, network)
• No application-level recovery
Main Contributions
• Remus (NSDI ’08)
  • Checkpoint-based state replication
  • Fully transparent HA
  • Recovery consistency – no application-level recovery
• RemusDB (VLDB ’11)
  • Optimized server latency
  • Reduced replication bandwidth by up to 80% using page delta compression and disk read tracking (delta compression is sketched below)
• SecondSite (VEE ’12)
  • Failover arbitration over the wide area
  • Stateful network failover over the wide area
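As an illustration of page delta compression, here is a minimal sketch (not the actual Remus/RemusDB code; the record format is assumed): rather than resending a whole dirty page at each checkpoint, only the byte runs that changed since the previously transmitted copy are sent, falling back to the full page when the delta would be larger.

```python
# Minimal sketch of page delta compression (illustration only).

PAGE_SIZE = 4096

def page_delta(prev: bytes, curr: bytes):
    """Yield (offset, changed_bytes) runs where curr differs from prev."""
    i = 0
    while i < PAGE_SIZE:
        if prev[i] != curr[i]:
            start = i
            while i < PAGE_SIZE and prev[i] != curr[i]:
                i += 1
            yield start, curr[start:i]
        else:
            i += 1

def encode_dirty_page(pfn: int, prev: bytes, curr: bytes):
    """Return a compact delta record; fall back to a full page if the delta is large."""
    runs = list(page_delta(prev, curr))
    delta_size = sum(4 + 2 + len(data) for _, data in runs)  # offset + length + payload
    if delta_size >= PAGE_SIZE:        # delta not worth it: send the full page
        return ("FULL", pfn, curr)
    return ("DELTA", pfn, runs)
```

Since most dirty pages change only a few words between checkpoints, sending deltas instead of full pages is where the large bandwidth savings come from.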
Failure Detection in Remus
[Diagram: primary and backup hosts on a LAN; NIC1 on each host connects to the external network, NIC2 carries the dedicated replication channel (checkpoints).]
• A pair of independent, dedicated NICs carries replication traffic.
• The backup declares primary failure only if (decision rule sketched below):
  • it cannot reach the primary via NIC1 and NIC2, and
  • it can reach the external network via NIC1.
• Failure of the replication link alone results in backup shutdown.
• Split brain occurs only when both NICs/links fail.
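As a minimal sketch (illustration only; how the reachability probes are implemented – heartbeats, pings – is assumed), the backup’s decision rule above can be written as:

```python
# Sketch of the backup's failure-detection rule (illustration only).

def backup_action(primary_via_nic1: bool, primary_via_nic2: bool,
                  external_via_nic1: bool) -> str:
    links_up = [primary_via_nic1, primary_via_nic2]
    if not any(links_up):
        # Primary unreachable on both links: fail over only if we can still
        # reach the external network; otherwise assume we are the isolated side.
        return "DECLARE_FAILURE" if external_via_nic1 else "SHUT_DOWN"
    if not all(links_up):
        return "SHUT_DOWN"       # a replication link alone failed
    return "CONTINUE"            # keep replicating
```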
Failure Detection in Wide-Area Deployments
[Diagram: primary and backup datacenters connected over the Internet; the replication channel between the NIC2s now crosses the WAN, while each site’s NIC1 connects to its local LAN and the external network.]
• Cannot distinguish between link failure and node failure.
• Higher chance of split brain, since the network is no longer reliable.
Failover Arbitration
• Local quorum of simple reachability detectors (“stewards”).
• Stewards can be placed on third-party clouds.
• Google App Engine implementation with ~100 LoC.
• The provider/user could substitute other, more sophisticated implementations.
Failover Arbitration..
[Diagram: primary and backup each poll an a-priori agreed set of five stewards while the replication stream runs between the sites; failed polls are marked X. Primary’s quorum logic: “I need a majority to stay alive.” Backup’s quorum logic: “I need an exclusive majority to fail over.”]
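A minimal sketch of these quorum rules (hypothetical names; it assumes each steward, when polled, also reports whether it still hears from the peer site):

```python
# Minimal sketch of the arbitration logic (hypothetical names).

from dataclasses import dataclass

@dataclass
class PollResult:
    reachable: bool    # did our own poll to this steward succeed?
    peer_alive: bool   # does the steward still hear from the peer site?

def primary_may_stay_alive(polls: list[PollResult]) -> bool:
    # The primary keeps running as long as it reaches a majority of stewards.
    return sum(p.reachable for p in polls) > len(polls) // 2

def backup_may_fail_over(polls: list[PollResult]) -> bool:
    # The backup needs an *exclusive* majority: stewards it reaches that no
    # longer hear from the primary. At most one side can hold such a majority,
    # which rules out split brain.
    exclusive = sum(p.reachable and not p.peer_alive for p in polls)
    return exclusive > len(polls) // 2
```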
Network Failover without Service Interruption
• Remus – LAN: gratuitous ARP from the backup host.
• SecondSite – WAN/Internet: BGP route update from the backup datacenter.
  • Needs support from upstream ISP(s) at both datacenters.
  • IP migration is achieved through BGP multi-homing.
Network Failover without Service Interruption..
[Diagram: BGP multi-homing. Both sites advertise the service prefix 134.87.3.0/24 from stub AS 64678 through BCNet (AS 271); the backup site prepends its AS path (64678 64678 64678) so traffic normally routes to the primary site (Vancouver, 134.87.2.173/.174) and is re-routed to the backup site (Kamloops, 207.23.255.237/.238) when the advertisements change on failover. The replication stream runs between the two sites.]
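As a hypothetical sketch of the failover route update (the paper’s deployment used BGP multi-homing via the upstream ISP, BCNet; ExaBGP and every helper name below are assumptions, with only the prefix and ASN taken from the slide), a helper process at the backup site could print BGP commands for ExaBGP to apply:

```python
#!/usr/bin/env python3
# Hypothetical sketch only: ExaBGP executes this script and applies the BGP
# commands it prints to stdout. Prefix and ASN are from the slide; the rest
# is assumed.

import sys
import time

PREFIX = "134.87.3.0/24"

def announce(cmd: str) -> None:
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()              # ExaBGP reads one command per line

def primary_declared_failed() -> bool:
    # Placeholder: in practice, the steward quorum check shown earlier.
    return False

# Normal operation: the backup advertises the prefix with a prepended
# (longer) AS path, so the Internet prefers the primary site's route.
announce(f"announce route {PREFIX} next-hop self as-path [ 64678 64678 64678 ]")

while True:
    time.sleep(1)
    if primary_declared_failed():
        # Failover: re-announce without prepending; traffic shifts here.
        announce(f"announce route {PREFIX} next-hop self")
        break
```

Because both sites advertise the same prefix all along, failover is only a change in route preference, so established connections migrate without renumbering.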
Overview
• A Case for Commoditizing Disaster Tolerance
• SecondSite – System Design
• Evaluation & Experiences
Evaluation
• “Failover works!!”
• “I want periodic failovers with no downtime!”
• “More than one failure? I will have to restart HA!”
• “Did you run regression tests?”
Restarting HA
• Need to resynchronize storage.
• Avoiding service downtime requires online resynchronization.
• Leverage DRBD – it only resynchronizes blocks that have changed.
• Integrate DRBD with Remus:
  • add a checkpoint-based asynchronous disk replication protocol (sketched below).
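A minimal sketch of the checkpoint-based asynchronous disk replication idea (illustration only, not the actual DRBD/Remus code): the backup buffers each epoch’s writes and applies them only once the matching memory checkpoint commits, so the disk image never runs ahead of the memory image it must stay consistent with.

```python
# Minimal sketch of checkpoint-based asynchronous disk replication
# (illustration only, not the actual DRBD/Remus code).

class BackupDisk:
    def __init__(self, device):
        self.device = device         # file-like handle to the backing device
        self.pending = []            # buffered writes of the in-flight epoch

    def on_write(self, offset: int, data: bytes) -> None:
        self.pending.append((offset, data))   # buffer; do not apply yet

    def on_checkpoint_commit(self) -> None:
        # The matching memory checkpoint is durable: apply writes in order.
        for offset, data in self.pending:
            self.device.seek(offset)
            self.device.write(data)
        self.pending.clear()

    def on_primary_failure(self) -> None:
        self.pending.clear()         # discard the uncommitted epoch
```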
Regression Tests
• Synthetic workloads to stress-test the replication pipeline.
• Failovers every 90 minutes.
• Discovered some interesting corner cases:
  • page-table corruptions in memory checkpoints
  • write-after-write I/O ordering in disk replication
SecondSite – The Complete Picture
[Chart: 4 VMs × 100 clients/VM]
• Service downtime includes the timeout for failure detection (10 s).
• The failure-detection timeout is configurable.
Replication Bandwidth Consumption
[Chart: 4 VMs × 100 clients/VM]
Demo
• Expect a real disaster (conference demos are not a good idea!)
Application Throughput vs. Replication Latency
[Chart: SPECweb with 100 clients, Kamloops]
Resource Utilization vs. Application Load
[Charts: Domain-0 CPU utilization and bandwidth usage on the replication channel – the cost of HA as a function of application load (OLTP w/ 100 clients)]
Resynchronization Delays vs. Outage Period
[Chart: OLTP workload]
Setup Workflow – Recovery Site
• The user creates a recovery plan, which is associated with one or more protection groups.
Source: VMware Site Recovery Manager – Technical Overview
Recovery Plan
[Flowchart of recovery steps: VM shutdown, high-priority VM shutdown, prepare storage, high-priority VM recovery, normal-priority VM recovery, low-priority VM recovery.]
Source: VMware Site Recovery Manager – Technical Overview