SecondSite: Disaster Tolerance as a Service
Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield
Tolerating Failures in a Datacenter – REMUS
• The initial idea behind Remus was to tolerate datacenter-level failures.
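For background, Remus replicates whole-VM state through frequent checkpoints while buffering outbound network traffic. A minimal sketch of that loop follows – simplified from the NSDI ’08 description, with hypothetical helper names (run_for, copy_dirty_state, etc.), not the actual Xen/Remus API:

```python
# Minimal sketch of the Remus checkpoint loop (simplified; helper names
# are hypothetical, not the actual Xen/Remus API).

def remus_epoch(vm, backup, net_buffer, epoch_ms=25):
    vm.run_for(epoch_ms)            # speculative execution for one epoch
    vm.pause()
    state = vm.copy_dirty_state()   # CPU state + memory pages dirtied this epoch
    vm.resume()                     # keep running while the state is in flight
    backup.send_checkpoint(state)   # asynchronous replication to the backup
    backup.wait_for_ack()
    net_buffer.release()            # only now externalize buffered network output
```

Because output is buffered until the checkpoint it depends on is acknowledged, a failover never exposes state the backup does not have.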
Can a Whole Datacenter Fail? Yes! It’s a “Disaster”!
Disasters
“Truck driver in Texas kills all the websites you really use” …Southlake FD found that he had low blood sugar. - valleywag.com
“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.” - Om Malik, GigaOM
(Illustrative image courtesy of TangoPango, Flickr)
Disasters..
“Water-main break cripples Dallas County computers, operations. The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays – keeping some prisoners in jail longer than normal.” - Dallas Morning News, Jun 2010
More Fodder, Back Home
“An explosion… near our server bank … electrical box containing 580 fiber cables. The electrical box … was covered in asbestos… mandated the wearing of hazmat suits. Worse yet, the dynamic rerouting – which is the hallmark of the internet – did not function. In other words, the perfect storm. Oh well. S*it happens.”
- Dan Empfield, Slowswitch.com, a Gossamer Threads customer
Disaster Recovery – The Old-Fashioned Way
• Storage replication between a primary and backup site.
• Manually restore physical servers from backup images.
• Data loss and long outage periods.
• Expensive hardware – storage arrays, replicators, etc.
State of the Art Disaster Recovery
[Diagram: VMware Site Recovery Manager array replication between a Protected Site and a Recovery Site, each running Site Recovery Manager and VirtualCenter. Datastore groups are replicated across sites; when the VMs become unavailable at the protected site, the offline VMs at the recovery site are powered on.]
Source: VMware Site Recovery Manager – Technical Overview
Problems with Existing Solutions
• Data loss & service disruption (Recovery Point Objective ~15 min, Recovery Time Objective ~a few hours)
• Complicated recovery planning (e.g. service A needs to be up before B, etc.)
• Application-level recovery
• Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.
Our Vision: Disaster Tolerance as a Service?
Overview
• A Case for Commoditizing Disaster Tolerance
• SecondSite – System Design
• Evaluation & Experiences
Primary & Backup Sites (5 ms RTT between sites)
Failover & Failback without Outage
[Diagram: Vancouver starts as the primary site with Kamloops as backup; on failover, Kamloops becomes the primary; after resynchronization, Vancouver rejoins as the backup.]
• Complete state recovery (CPU, disk, memory, network)
• No application-level recovery
Main Contributions
• Remus (NSDI ’08)
  • Checkpoint-based state replication
  • Fully transparent HA
  • Recovery consistency – no application-level recovery
• RemusDB (VLDB ’11)
  • Optimized server latency
  • Reduced replication bandwidth by up to 80% using page delta compression and disk read tracking (delta compression is sketched below)
• SecondSite (VEE ’12)
  • Failover arbitration over the wide area
  • Stateful network failover over the wide area
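As an illustration of page delta compression, here is a minimal sketch (not the actual Remus/RemusDB code; the record format is assumed): rather than resending a whole dirty page at each checkpoint, only the byte runs that changed since the previously transmitted copy are sent, falling back to the full page when the delta would be larger.

```python
# Minimal sketch of page delta compression (illustration only).

PAGE_SIZE = 4096

def page_delta(prev: bytes, curr: bytes):
    """Yield (offset, changed_bytes) runs where curr differs from prev."""
    i = 0
    while i < PAGE_SIZE:
        if prev[i] != curr[i]:
            start = i
            while i < PAGE_SIZE and prev[i] != curr[i]:
                i += 1
            yield start, curr[start:i]
        else:
            i += 1

def encode_dirty_page(pfn: int, prev: bytes, curr: bytes):
    """Return a compact delta record; fall back to a full page if the delta is large."""
    runs = list(page_delta(prev, curr))
    delta_size = sum(4 + 2 + len(data) for _, data in runs)  # offset + length + payload
    if delta_size >= PAGE_SIZE:        # delta not worth it: send the full page
        return ("FULL", pfn, curr)
    return ("DELTA", pfn, runs)
```

Since most dirty pages change only a few words between checkpoints, sending deltas instead of full pages is where the large bandwidth savings come from.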
Failure Detection in Remus
[Diagram: primary and backup hosts on a LAN; NIC1 on each host connects to the external network, NIC2 carries the dedicated replication channel (checkpoints).]
• A pair of independent, dedicated NICs carries replication traffic.
• The backup declares primary failure only if (decision rule sketched below):
  • it cannot reach the primary via NIC1 and NIC2, and
  • it can reach the external network via NIC1.
• Failure of the replication link alone results in backup shutdown.
• Split brain occurs only when both NICs/links fail.
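As a minimal sketch (illustration only; how the reachability probes are implemented – heartbeats, pings – is assumed), the backup’s decision rule above can be written as:

```python
# Sketch of the backup's failure-detection rule (illustration only).

def backup_action(primary_via_nic1: bool, primary_via_nic2: bool,
                  external_via_nic1: bool) -> str:
    links_up = [primary_via_nic1, primary_via_nic2]
    if not any(links_up):
        # Primary unreachable on both links: fail over only if we can still
        # reach the external network; otherwise assume we are the isolated side.
        return "DECLARE_FAILURE" if external_via_nic1 else "SHUT_DOWN"
    if not all(links_up):
        return "SHUT_DOWN"       # a replication link alone failed
    return "CONTINUE"            # keep replicating
```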
Failure Detection in Wide-Area Deployments
[Diagram: primary and backup datacenters connected over the Internet; the replication channel between the NIC2s now crosses the WAN, while each site’s NIC1 connects to its local LAN and the external network.]
• Cannot distinguish between link failure and node failure.
• Higher chance of split brain, since the network is no longer reliable.
Failover Arbitration
• Local quorum of simple reachability detectors (“stewards”).
• Stewards can be placed on third-party clouds.
• Google App Engine implementation with ~100 LoC.
• The provider/user could substitute other, more sophisticated implementations.
Failover Arbitration..
[Diagram: primary and backup each poll an a-priori agreed set of five stewards while the replication stream runs between the sites; failed polls are marked X. Primary’s quorum logic: “I need a majority to stay alive.” Backup’s quorum logic: “I need an exclusive majority to fail over.”]
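A minimal sketch of these quorum rules (hypothetical names; it assumes each steward, when polled, also reports whether it still hears from the peer site):

```python
# Minimal sketch of the arbitration logic (hypothetical names).

from dataclasses import dataclass

@dataclass
class PollResult:
    reachable: bool    # did our own poll to this steward succeed?
    peer_alive: bool   # does the steward still hear from the peer site?

def primary_may_stay_alive(polls: list[PollResult]) -> bool:
    # The primary keeps running as long as it reaches a majority of stewards.
    return sum(p.reachable for p in polls) > len(polls) // 2

def backup_may_fail_over(polls: list[PollResult]) -> bool:
    # The backup needs an *exclusive* majority: stewards it reaches that no
    # longer hear from the primary. At most one side can hold such a majority,
    # which rules out split brain.
    exclusive = sum(p.reachable and not p.peer_alive for p in polls)
    return exclusive > len(polls) // 2
```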
Network Failover without Service Interruption
• Remus – LAN: gratuitous ARP from the backup host.
• SecondSite – WAN/Internet: BGP route update from the backup datacenter.
  • Needs support from upstream ISP(s) at both datacenters.
  • IP migration is achieved through BGP multi-homing.
Network Failover without Service Interruption..
[Diagram: BGP multi-homing. Both sites advertise the service prefix 134.87.3.0/24 from stub AS 64678 through BCNet (AS 271); the backup site prepends its AS path (64678 64678 64678) so traffic normally routes to the primary site (Vancouver, 134.87.2.173/.174) and is re-routed to the backup site (Kamloops, 207.23.255.237/.238) when the advertisements change on failover. The replication stream runs between the two sites.]
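As a hypothetical sketch of the failover route update (the paper’s deployment used BGP multi-homing via the upstream ISP, BCNet; ExaBGP and every helper name below are assumptions, with only the prefix and ASN taken from the slide), a helper process at the backup site could print BGP commands for ExaBGP to apply:

```python
#!/usr/bin/env python3
# Hypothetical sketch only: ExaBGP executes this script and applies the BGP
# commands it prints to stdout. Prefix and ASN are from the slide; the rest
# is assumed.

import sys
import time

PREFIX = "134.87.3.0/24"

def announce(cmd: str) -> None:
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()              # ExaBGP reads one command per line

def primary_declared_failed() -> bool:
    # Placeholder: in practice, the steward quorum check shown earlier.
    return False

# Normal operation: the backup advertises the prefix with a prepended
# (longer) AS path, so the Internet prefers the primary site's route.
announce(f"announce route {PREFIX} next-hop self as-path [ 64678 64678 64678 ]")

while True:
    time.sleep(1)
    if primary_declared_failed():
        # Failover: re-announce without prepending; traffic shifts here.
        announce(f"announce route {PREFIX} next-hop self")
        break
```

Because both sites advertise the same prefix all along, failover is only a change in route preference, so established connections migrate without renumbering.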
Overview
• A Case for Commoditizing Disaster Tolerance
• SecondSite – System Design
• Evaluation & Experiences
Evaluation
• “Failover works!!”
• “I want periodic failovers with no downtime!”
• “More than one failure? I will have to restart HA!”
• “Did you run regression tests?”
Restarting HA
• Need to resynchronize storage.
• Avoiding service downtime requires online resynchronization.
• Leverage DRBD – it only resynchronizes blocks that have changed.
• Integrate DRBD with Remus:
  • add a checkpoint-based asynchronous disk replication protocol (sketched below).
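A minimal sketch of the checkpoint-based asynchronous disk replication idea (illustration only, not the actual DRBD/Remus code): the backup buffers each epoch’s writes and applies them only once the matching memory checkpoint commits, so the disk image never runs ahead of the memory image it must stay consistent with.

```python
# Minimal sketch of checkpoint-based asynchronous disk replication
# (illustration only, not the actual DRBD/Remus code).

class BackupDisk:
    def __init__(self, device):
        self.device = device         # file-like handle to the backing device
        self.pending = []            # buffered writes of the in-flight epoch

    def on_write(self, offset: int, data: bytes) -> None:
        self.pending.append((offset, data))   # buffer; do not apply yet

    def on_checkpoint_commit(self) -> None:
        # The matching memory checkpoint is durable: apply writes in order.
        for offset, data in self.pending:
            self.device.seek(offset)
            self.device.write(data)
        self.pending.clear()

    def on_primary_failure(self) -> None:
        self.pending.clear()         # discard the uncommitted epoch
```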
Regression Tests
• Synthetic workloads to stress-test the replication pipeline.
• Failovers every 90 minutes.
• Discovered some interesting corner cases:
  • page-table corruptions in memory checkpoints
  • write-after-write I/O ordering in disk replication
SecondSite – The Complete Picture
[Chart: 4 VMs × 100 clients/VM]
• Service downtime includes the timeout for failure detection (10 s).
• The failure-detection timeout is configurable.
Replication Bandwidth Consumption
[Chart: 4 VMs × 100 clients/VM]
Demo
• Expect a real disaster (conference demos are not a good idea!)
Application Throughput vs. Replication Latency
[Chart: SPECweb with 100 clients, Kamloops]
Resource Utilization vs. Application Load
[Charts: Domain-0 CPU utilization and bandwidth usage on the replication channel – the cost of HA as a function of application load (OLTP w/ 100 clients)]
Resynchronization Delays vs. Outage Period
[Chart: OLTP workload]
Setup Workflow – Recovery Site
• The user creates a recovery plan, which is associated with one or more protection groups.
Source: VMware Site Recovery Manager – Technical Overview
Recovery Plan
[Flowchart of recovery steps: VM shutdown, high-priority VM shutdown, prepare storage, high-priority VM recovery, normal-priority VM recovery, low-priority VM recovery.]
Source: VMware Site Recovery Manager – Technical Overview