Fault Tolerance in Cloud
Srihari Murali (sriharim@buffalo.edu), University at Buffalo
November 11, 2011
Overview • Context • Fault Tolerance Approaches • FREERIDE-G FTS • Experimental Evaluation of FREERIDE-G FTS against Hadoop • Case Study: Amazon EC2 Service Disruption
Setting the Context • Scale-out, not scale-up • Commodity hardware: MTBF on the order of 1,000 days • Large clusters • Long execution times → high probability that some servers fail at any given time! • Need: graceful handling of machine failure with low overhead and efficient recovery
MapReduce • Worker failure • Detected via periodic heartbeat pings • Master resets the failed worker's tasks to the initial idle state • Tasks are rescheduled on other workers • Master failure • Periodic checkpoints • A new copy is started from the last checkpointed state • HDFS: tolerance via data replication
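A minimal sketch of the heartbeat idea (not Hadoop's or Google's actual code; class and constant names are illustrative): the master records each worker's last ping and, when a worker stays silent past a timeout, returns that worker's tasks to the idle pool for rescheduling.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead

class Master:
    def __init__(self, tasks, workers):
        self.idle_tasks = list(tasks)                     # tasks not yet assigned
        self.running = {w: [] for w in workers}           # worker -> assigned tasks
        self.last_ping = {w: time.time() for w in workers}

    def on_heartbeat(self, worker):
        # Workers ping periodically; remember when each last checked in.
        self.last_ping[worker] = time.time()

    def check_workers(self):
        # Workers silent for too long are marked failed; their tasks go
        # back to the idle state so they can be rescheduled elsewhere.
        now = time.time()
        for worker, last in list(self.last_ping.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                self.idle_tasks.extend(self.running.pop(worker, []))
                del self.last_ping[worker]
```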
Fault Tolerance Approaches • Replicate jobs • Multiple replicas perform the same work on the same data • Expensive (heavy resource utilization) • Replicate input data • Restart on another node that holds a replica of the data • Used in Hadoop • Significant slowdown when the failure occurs at a later stage • Checkpointing • System- or application-level snapshot • State information can be very large
Alternate Approach: Introduction • Motivation • Low overhead • Efficient recovery • An alternative API based on MapReduce, from Ohio State University (Bicer, Jiang and Agrawal, IPDPS 2010 [2]) • Programmer-declared reduction object • The object stores the state of the computation (serves as the checkpoint) • Implemented on the FREERIDE-G data-intensive computing (DIC) middleware
Concept behind FREERIDE API • Reduction Object represents the intermediate state of execution • Application developer writes functions for • Local Reduction • Global Reduction
Example – Word Count • [Figure: words W1, W2, W3 are distributed across nodes; each node accumulates per-word counts in its local RObj (e.g. one node's RObj(1) grows from 2 to 3, two others hold RObj(1) = 1 each), and Global Reduction(+) merges them into RObj(1) = 5] • Iterate; no sorting or grouping!
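A toy sketch of the same idea in Python (illustrative only; the real middleware is MPI-based and its API differs): each node folds its chunk into a small local reduction object, and the global reduction merges the objects with '+'.

```python
from collections import Counter

def local_reduction(chunk):
    # Each node folds its own chunk straight into a reduction object
    # (word -> count); no sorting or grouping of intermediate pairs.
    robj = Counter()
    for word in chunk:
        robj[word] += 1
    return robj

def global_reduction(robjs):
    # Merge per-node reduction objects with '+' (associative, commutative).
    total = Counter()
    for robj in robjs:
        total += robj
    return total

chunks = [["w1", "w2", "w2"], ["w3", "w1"], ["w2", "w3", "w3"]]
print(global_reduction(local_reduction(c) for c in chunks))
# Counter({'w2': 3, 'w3': 3, 'w1': 2})
```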
Remote Data Analysis • Co-locating resources gives the best performance • But this may not always be possible • Data-storage capability, cost, distributed data storage, separate services, etc. • Data hosts and compute hosts are separated • Remote data analysis: data hosts, compute hosts, and users may all be at distinct locations, which is typical in a cloud environment • Separation helps support fault tolerance => failure of a processing node does not imply unavailability of data • FREERIDE-G = FREERIDE middleware + remote data analysis capability
Architecture (contd.) FREERIDE-G middleware is modeled as a client-server system, where the compute node clients interact with both data host servers and a code repository server. Key components: • SRB (Storage Resource Broker) data host server • Middleware to automate data retrieval, with an SRB client on each compute node • SRB server v3.4.2, with a PostgreSQL database and ODBC driver • Code repository web server • Deployed on the Apache Tomcat web server (v5.5) • Compute node client • Data and metadata retrieval • API code retrieval • Data analysis • Inter-node communication is implemented by the middleware using MPI
FTS – Based on RObj • The global reduction function is associative and commutative • Consider any disjoint partition {E1, E2, …, En} of the processing elements across n nodes: globally reducing {RObj(E1), RObj(E2), …, RObj(En)} yields the same final result irrespective of the split (see the property stated below) • The reduction object… • is small in size • is independent of machine architecture • represents the intermediate state of the computation, and can be checkpointed with low overhead • No restart required
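Stated as a formula (our notation, not the slides'; g denotes the global reduction function):

```latex
g\bigl(\mathrm{RObj}(E_1), \mathrm{RObj}(E_2), \dots, \mathrm{RObj}(E_n)\bigr)
  = \mathrm{RObj}(E_1 \cup E_2 \cup \dots \cup E_n)
  \quad \text{for any disjoint partition } \{E_1, \dots, E_n\}
```

Because g is associative and commutative, a failed node's checkpointed RObj can simply be merged into the survivors' results.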
Implementation • The RObj and the id of the last processed chunk form the system snapshot, stored at another location (the checkpoint) • Restart the failed process at another node: process all chunks assigned to the failed node with id > the stored chunk id • Better yet, instead of restarting the failed process on a single node, redistribute the remaining chunks evenly across multiple processes for better load balancing • Combine the resulting RObj's with the cached RObj to obtain the final result (see the sketch below) • RObj caching is configured in the system manager (group size)
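A minimal sketch of the snapshot/recovery idea under stated assumptions (pickle files as the checkpoint store, integer chunk ids; function names are ours, not the middleware's):

```python
import pickle

def checkpoint(robj, last_chunk_id, path):
    # The snapshot is just the small reduction object plus the id of the
    # last chunk it covers; persist it at another location.
    with open(path, "wb") as f:
        pickle.dump((robj, last_chunk_id), f)

def recover(path, failed_node_chunks, survivors):
    # Load the failed node's cached snapshot and spread its *remaining*
    # chunks evenly over the surviving nodes for better load balancing.
    with open(path, "rb") as f:
        cached_robj, last_chunk_id = pickle.load(f)
    remaining = [c for c in failed_node_chunks if c > last_chunk_id]
    plan = {node: remaining[i::len(survivors)] for i, node in enumerate(survivors)}
    # The survivors' new RObj's are later merged with cached_robj.
    return cached_robj, plan
```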
Implementation (contd.) The fault tolerance implementation involves three key components: • Configuration • Group size • Exchange frequency of the RObj • Data exchange method (sync vs. async) for ReplicateSnapshot • Number of connection trials (ntry) before declaring failure • Fault detection • P2P connections are set up during FREERIDE-G initialization • If a node is unreachable after ntry attempts, it is declared a failed node • Fault recovery • Invoke CheckFailure before the global reduction • If failures are present, invoke RecoverFailure
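A rough sketch of that detection/recovery flow (illustrative Python; `ping`, `recover_failure`, and `reduce_all` are stand-ins for the middleware's actual connectivity check and recovery/reduction routines):

```python
def check_failure(nodes, ping, ntry=3):
    # Try each node up to ntry times; a node that never answers is
    # declared failed (the CheckFailure step before global reduction).
    failed = []
    for node in nodes:
        if not any(ping(node) for _ in range(ntry)):
            failed.append(node)
    return failed

def global_reduce_with_recovery(nodes, ping, recover_failure, reduce_all):
    failed = check_failure(nodes, ping)
    if failed:
        recover_failure(failed)   # redistribute remaining chunks, reuse cached RObj's
    return reduce_all([n for n in nodes if n not in failed])
```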
Experimental Evaluation Goals • Observe the RObj size • Evaluate the overhead of FTS when there is no failure • Study the slowdown when a node fails (with varying failure points) • Comparison with the Hadoop implementation of MapReduce
Experimental Setup • FREERIDE-G • Data hosts and compute nodes are separated; only compute node failure is studied (data hosts can easily be made redundant!) • Dual-processor Opteron 254 (single core) nodes with 4 GB of RAM, connected through Mellanox InfiniBand (1 Gb) • Applications (data-intensive) • K-means clustering: 50 cluster centers; dataset sizes 6.4 GB and 25.6 GB, divided into 4K data blocks over 4 data-hosting nodes • Principal Components Analysis (PCA): dataset sizes 4 GB and 17 GB, again divided into 4K data blocks
Results (K-means Clustering) • Configurations without failure • Without FTS • With FTS (async obj exchange) • Configuration with failure • Failure after processing 50% of the data (on one node) • Relative slowdowns: 18% for 4, 5.4% for 8, and 6.7% for 16 compute nodes • [Chart: Execution Times with K-means, 25.6 GB Dataset]
Results (PCA) • Reduction object size: 128 KB for 4 compute nodes, 4 GB dataset • With FTS, absolute overheads: 3.9% for 4 and 1.4% for 8 nodes • Relative overheads: 21.2% for 4, 8.6% for 8, and 7.8% for 16 nodes • [Chart: Execution Times with PCA, 17 GB Dataset]
Overheads of Recovery: Different Failure Points • Failure points near the beginning (25%), middle (50%), and close to the end (75%) • Relative slowdowns: a) K-means: 16.6%, 9%, 7.2% respectively; b) PCA: 12.2%, 8.6%, 5.6% respectively • Absolute slowdowns: less than 5%
Comparison with Hadoop • w/f = with failure • Failure happens after processing 50% of the data on one node • Overheads (4 | 8 | 16 nodes) • Hadoop: 23.06 | 71.78 | 78.11 • FREERIDE-G: 20.37 | 8.18 | 9.18 • [Chart: K-means Clustering, 6.4 GB Dataset]
Comparison with Hadoop (contd.) • One of the compute nodes failed after processing 25, 50, and 75% of its data • Overheads (failure at 25% | 50% | 75%) • Hadoop: 32.85 | 71.21 | 109.45 • FREERIDE-G: 9.52 | 8.18 | 8.14 • [Chart: K-means Clustering, 6.4 GB Dataset, 8 Compute Nodes]
Conclusions (FREERIDE-G approach) • Growing need for supporting fault tolerance in scale-out, commodity-hardware clusters • FREERIDE-G FTS has low overhead • The failure recovery process is efficient: only a marginal slowdown occurs as the remaining work is redistributed • The system can outperform Hadoop both in the absence and in the presence of failures • Sorting and shuffling overhead is eliminated • No need to restart the failed task, as in Hadoop • State caching • The RObj approach gives the programmer flexibility to implement different designs using a simple high-level API
Case Study: Amazon EC2 Service Disruption (April 21st, 2011)
Amazon EC2 Key Concepts • Availability Zones • Distinct locations engineered for immunity against failures in other Availability Zones • Analogous to physical data centers (but may not actually be!) • Cheap, low-latency network connectivity across zones in a region • Users can protect applications from failure of a single location • Availability Regions • Geographically dispersed, consisting of one or more Availability Zones • US East (Virginia), US West (N. California), EU West (Ireland), Asia Pacific (Singapore), etc.
Amazon EC2 (contd.) • Amazon Elastic Block Store (EBS) • Block-level storage volumes from 1 GB to 1 TB that behave like raw, unformatted drives • Persistent storage for Amazon EC2 instances (automatically replicated within the same Availability Zone) • Elastic IP Address • A static IP address associated with the account instead of an instance • Can mask instance or Availability Zone failures by programmatically remapping the public IP address to any instance (see the sketch below) • Elastic Load Balancing • Automatically distributes incoming application traffic across multiple Amazon EC2 instances • Detects unhealthy instances within a pool and automatically reroutes traffic to healthy instances
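A minimal remapping sketch using boto3 (assumes AWS credentials are configured and that an Elastic IP and a healthy standby instance already exist; the allocation id and instance id below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ALLOCATION_ID = "eipalloc-XXXXXXXX"          # placeholder Elastic IP allocation
STANDBY_INSTANCE = "i-XXXXXXXXXXXXXXXXX"     # placeholder healthy standby instance

# Re-point the public address at the standby: clients keep using the same
# IP, which masks the failure of the original instance or its zone.
ec2.associate_address(
    AllocationId=ALLOCATION_ID,
    InstanceId=STANDBY_INSTANCE,
    AllowReassociation=True,
)
```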
A Simple Redundant Website • Elastic IP addresses assigned to 2 web servers running the web application, mapped to the same domain name • One master DB, and a slave database in a second Availability Zone; data is replicated in real time • Bandwidth across zone boundaries is not free • $0.01/GB for "regional" traffic, for both data in and data out
Redundant Site (contd.) • Suppose the zone hosting the web servers and the master DB (in US-EAST-1) fails, say due to a fire! • Promote the slave in the second zone to master, and launch new web/app server instances • Create redundancy in a third Availability Zone
EBS System – Fault Tolerance • Two main components of the EBS service: • A set of EBS clusters (each running in an Availability Zone) to store user data and serve requests to EC2 instances • A set of control plane services to coordinate user requests and propagate them to the EBS clusters, one set per region • An EBS cluster is made up of EBS nodes • EBS data is replicated to multiple EBS nodes in a cluster
EBS System (contd.) • EBS nodes are connected via: • Primary network: high bandwidth, used in normal operation • Secondary network: lower capacity, used for replication • Re-mirroring • When a node loses connectivity to its replica, it assumes the replica node has failed • It searches for another node with enough space, establishes connectivity, and starts replicating (re-mirroring) • Finding a location for the new replica normally takes milliseconds • Access to the data is blocked while re-mirroring, to preserve consistency • To an EC2 instance doing I/O, the volume appears "stuck"
Amazon EC2 Outage (April 21st, 2011) • Amazon experienced a large outage in the US-East region that took down popular sites such as Foursquare, Quora, Reddit, Hootsuite, and Flightcaster • Amazon: "We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS (Elastic Block Storage) volumes in multiple availability zones in the US-EAST-1 region"
Primary Outage (12:47 AM PDT on April 21st) • A network configuration change was performed to upgrade the capacity of the primary network in a single Availability Zone • Operation: shift traffic off of the primary EBS network. Executed incorrectly: traffic was routed onto the lower-capacity secondary EBS network • A portion of the EBS cluster in the affected Availability Zone went down, as both primary and secondary networks were affected at once • EBS nodes lost connection to their replicas • The incorrect traffic shift was rolled back • A large number of nodes concurrently tried to find available server space for re-mirroring; a "re-mirroring storm" occurred and 13% of the volumes in the Zone became "stuck" • The EBS control plane suffered thread starvation and began failing requests across the Region • Consequence: more and more nodes became "stuck"
Recovery Process • 8:20 AM PDT • Disable all communication between the EBS cluster in the affected Availability Zone and the control plane • Error rates and latencies returned to normal for the rest of the region • 11:30 AM PDT • Prevent EBS servers in the degraded cluster from contacting other servers for replication (make the affected volumes less "hungry") • Only the 13% of volumes in the Availability Zone that were already stuck remained affected; the outage was contained • April 22nd, 2:00 AM PDT • Need to provide additional capacity for the huge number (13%) of stuck volumes to mirror to • Physically relocate storage capacity from elsewhere in the US East region to the degraded cluster • Re-establish complete EBS control plane API access • Large backlog of requests • First tried throttling state propagation to avoid crashing the control plane again • Final strategy: build a separate control plane for the affected Zone, to avoid impacting other zones while processing the backlog
Recovery (contd.) • April 23rd, 6:15 PM PDT • Traffic tests and steady processing of the backlog • Full functionality restored in the affected Availability Zone • Amazon: "0.07% of the volumes in the affected Availability Zone [that] could not be restored for customers in a consistent state" (not disclosed in GBs!)
Lessons Learnt • Better capacity planning and alarming • Audit the change process, increase automation • Additional safety capacity for large-scale failures • Modify retry logic in nodes • Prevent re-mirroring storms in clusters • Back off more aggressively (see the backoff sketch below) • Focus on re-establishing connectivity with previous replicas rather than futilely searching for new nodes to re-mirror • Lessons for users • Users taking advantage of a multi-Availability-Zone architecture were least impacted (e.g., Netflix)
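An illustration of the "back off more aggressively" idea (a generic retry-with-backoff pattern, not Amazon's actual EBS node code):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=6, base=0.5, cap=60.0):
    # Exponential backoff with jitter: each failed attempt waits roughly
    # twice as long as the previous one, so thousands of failed nodes do
    # not keep hammering the cluster at full rate (the re-mirroring storm).
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```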
Lessons Learnt (contd.) • Netflix survival strategies • Using different regions simultaneously helped give uninterrupted service • Netflix keeps multiple redundant hot copies of data across zones; in case of failure, switch to a hot standby (N+1 redundancy with all zones active) • Chaos Monkey randomly kills instances and services; in the future, Chaos Gorilla (takes an entire AZ down) • Graceful degradation • Fail fast: set aggressive timeouts so failing components don't make the entire system crawl • Fallbacks: each feature is designed to degrade or fall back to a lower-quality representation • Feature removal: if a non-critical feature is slow, remove it from the given page to prevent impact on the overall experience
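A toy illustration of fail-fast with a fallback (not Netflix's actual code; `fetch_personalized_row` and `POPULAR_TITLES` in the usage comment are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=8)

def call_with_fallback(fetch, fallback, timeout=0.2):
    # Fail fast: give the dependency a short time budget; if it is slow or
    # throws, degrade to the cheaper fallback representation instead of
    # letting one failing component make the whole page crawl.
    future = _executor.submit(fetch)
    try:
        return future.result(timeout=timeout)
    except Exception:
        return fallback()

# Usage (hypothetical): call_with_fallback(fetch_personalized_row, lambda: POPULAR_TITLES)
```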
References
[1] Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce.
[2] Tekin Bicer, Wei Jiang and Gagan Agrawal. Supporting Fault Tolerance in a Data-Intensive Computing Middleware.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
[4] Web links:
Amazon EC2 – http://aws.amazon.com/message/65648/ and http://aws.amazon.com/ec2/faqs/
Netflix – http://tinyurl.com/3bvyvef
Fault-tolerant site – http://tinyurl.com/2ybok5