Fault Tolerance in Cloud
Srihari Murali (sriharim@buffalo.edu), University at Buffalo
November 11, 2011
Overview • Context • Fault Tolerance Approaches • FREERIDE-G FTS • Experimental Evaluation of FREERIDE-G FTS against Hadoop • Case Study: Amazon EC2 Service Disruption
Setting the Context • Scale-out, not scale-up • Commodity hardware: MTBF on the order of 1,000 days • Large clusters • Long execution times → high probability that some servers fail at any given time! • Need: graceful handling of machine failure with low overhead and efficient recovery
MapReduce • Worker failure • Detected via periodic heartbeat pings • Master resets the failed worker's tasks to the initial idle state • Tasks are rescheduled on other workers • Master failure • Periodic checkpoints • A new copy is started from the last checkpointed state • HDFS: tolerance via data replication
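A minimal sketch of the heartbeat idea (not Hadoop's or Google's actual code; class and constant names are illustrative): the master records each worker's last ping and, when a worker stays silent past a timeout, returns that worker's tasks to the idle pool for rescheduling.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead

class Master:
    def __init__(self, tasks, workers):
        self.idle_tasks = list(tasks)                     # tasks not yet assigned
        self.running = {w: [] for w in workers}           # worker -> assigned tasks
        self.last_ping = {w: time.time() for w in workers}

    def on_heartbeat(self, worker):
        # Workers ping periodically; remember when each last checked in.
        self.last_ping[worker] = time.time()

    def check_workers(self):
        # Workers silent for too long are marked failed; their tasks go
        # back to the idle state so they can be rescheduled elsewhere.
        now = time.time()
        for worker, last in list(self.last_ping.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                self.idle_tasks.extend(self.running.pop(worker, []))
                del self.last_ping[worker]
```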
Fault Tolerance Approaches • Replicate jobs • Multiple replicas perform the same work on the same data • Expensive (heavy resource utilization) • Replicate input data • Restart on another node that holds a replica of the data • Used in Hadoop • Significant slowdown when the failure occurs at a later stage • Checkpointing • System- or application-level snapshot • State information can be very large
Alternate Approach: Introduction • Motivation • Low overhead • Efficient recovery • An alternative API based on MapReduce, from Ohio State University (Bicer, Jiang and Agrawal, IPDPS 2010 [2]) • Programmer-declared reduction object • The object stores the state of the computation (serves as the checkpoint) • Implemented on the FREERIDE-G data-intensive computing (DIC) middleware
Concept behind FREERIDE API • Reduction Object represents the intermediate state of execution • Application developer writes functions for • Local Reduction • Global Reduction
Example – Word Count • [Figure: words W1, W2, W3 are distributed across nodes; each node accumulates per-word counts in its local RObj (e.g. one node's RObj(1) grows from 2 to 3, two others hold RObj(1) = 1 each), and Global Reduction(+) merges them into RObj(1) = 5] • Iterate; no sorting or grouping!
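A toy sketch of the same idea in Python (illustrative only; the real middleware is MPI-based and its API differs): each node folds its chunk into a small local reduction object, and the global reduction merges the objects with '+'.

```python
from collections import Counter

def local_reduction(chunk):
    # Each node folds its own chunk straight into a reduction object
    # (word -> count); no sorting or grouping of intermediate pairs.
    robj = Counter()
    for word in chunk:
        robj[word] += 1
    return robj

def global_reduction(robjs):
    # Merge per-node reduction objects with '+' (associative, commutative).
    total = Counter()
    for robj in robjs:
        total += robj
    return total

chunks = [["w1", "w2", "w2"], ["w3", "w1"], ["w2", "w3", "w3"]]
print(global_reduction(local_reduction(c) for c in chunks))
# Counter({'w2': 3, 'w3': 3, 'w1': 2})
```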
Remote Data Analysis • Co-locating resources gives the best performance • But this may not always be possible • Data-storage capability, cost, distributed data storage, separate services, etc. • Data hosts and compute hosts are separated • Remote data analysis: data hosts, compute hosts, and users may all be at distinct locations, which is typical in a cloud environment • Separation helps support fault tolerance => failure of a processing node does not imply unavailability of data • FREERIDE-G = FREERIDE middleware + remote data analysis capability
Architecture (contd.) FREERIDE-G middleware is modeled as a client-server system, where the compute node clients interact with both data host servers and a code repository server. Key components: • SRB (Storage Resource Broker) data host server • Middleware to automate data retrieval, with an SRB client on each compute node • SRB server v3.4.2, with a PostgreSQL database and ODBC driver • Code repository web server • Deployed on the Apache Tomcat web server (v5.5) • Compute node client • Data and metadata retrieval • API code retrieval • Data analysis • Inter-node communication is implemented by the middleware using MPI
FTS – Based on RObj • The global reduction function is associative and commutative • Consider any disjoint partition {E1, E2, …, En} of the processing elements across n nodes: globally reducing {RObj(E1), RObj(E2), …, RObj(En)} yields the same final result irrespective of the split (see the property stated below) • The reduction object… • is small in size • is independent of machine architecture • represents the intermediate state of the computation, and can be checkpointed with low overhead • No restart required
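Stated as a formula (our notation, not the slides'; g denotes the global reduction function):

```latex
g\bigl(\mathrm{RObj}(E_1), \mathrm{RObj}(E_2), \dots, \mathrm{RObj}(E_n)\bigr)
  = \mathrm{RObj}(E_1 \cup E_2 \cup \dots \cup E_n)
  \quad \text{for any disjoint partition } \{E_1, \dots, E_n\}
```

Because g is associative and commutative, a failed node's checkpointed RObj can simply be merged into the survivors' results.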
Implementation • The RObj and the id of the last processed chunk form the system snapshot, stored at another location (the checkpoint) • Restart the failed process at another node: process all chunks assigned to the failed node with id > the stored chunk id • Better yet, instead of restarting the failed process on a single node, redistribute the remaining chunks evenly across multiple processes for better load balancing • Combine the resulting RObj's with the cached RObj to obtain the final result (see the sketch below) • RObj caching is configured in the system manager (group size)
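A minimal sketch of the snapshot/recovery idea under stated assumptions (pickle files as the checkpoint store, integer chunk ids; function names are ours, not the middleware's):

```python
import pickle

def checkpoint(robj, last_chunk_id, path):
    # The snapshot is just the small reduction object plus the id of the
    # last chunk it covers; persist it at another location.
    with open(path, "wb") as f:
        pickle.dump((robj, last_chunk_id), f)

def recover(path, failed_node_chunks, survivors):
    # Load the failed node's cached snapshot and spread its *remaining*
    # chunks evenly over the surviving nodes for better load balancing.
    with open(path, "rb") as f:
        cached_robj, last_chunk_id = pickle.load(f)
    remaining = [c for c in failed_node_chunks if c > last_chunk_id]
    plan = {node: remaining[i::len(survivors)] for i, node in enumerate(survivors)}
    # The survivors' new RObj's are later merged with cached_robj.
    return cached_robj, plan
```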
Implementation (contd.) The fault tolerance implementation involves three key components: • Configuration • Group size • Exchange frequency of the RObj • Data exchange method (sync vs. async) for ReplicateSnapshot • Number of connection trials (ntry) before declaring failure • Fault detection • P2P connections are set up during FREERIDE-G initialization • If a node is unreachable after ntry attempts, it is declared a failed node • Fault recovery • Invoke CheckFailure before the global reduction • If failures are present, invoke RecoverFailure
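A rough sketch of that detection/recovery flow (illustrative Python; `ping`, `recover_failure`, and `reduce_all` are stand-ins for the middleware's actual connectivity check and recovery/reduction routines):

```python
def check_failure(nodes, ping, ntry=3):
    # Try each node up to ntry times; a node that never answers is
    # declared failed (the CheckFailure step before global reduction).
    failed = []
    for node in nodes:
        if not any(ping(node) for _ in range(ntry)):
            failed.append(node)
    return failed

def global_reduce_with_recovery(nodes, ping, recover_failure, reduce_all):
    failed = check_failure(nodes, ping)
    if failed:
        recover_failure(failed)   # redistribute remaining chunks, reuse cached RObj's
    return reduce_all([n for n in nodes if n not in failed])
```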
Experimental Evaluation Goals • Observe the RObj size • Evaluate the overhead of FTS when there is no failure • Study the slowdown when a node fails (with varying failure points) • Comparison with the Hadoop implementation of MapReduce
Experimental Setup • FREERIDE-G • Data hosts and compute nodes are separated; only compute node failure is studied (data hosts can easily be made redundant!) • Dual-processor Opteron 254 (single core) nodes with 4 GB of RAM, connected through Mellanox InfiniBand (1 Gb) • Applications (data-intensive) • K-means clustering: 50 cluster centers; dataset sizes 6.4 GB and 25.6 GB, divided into 4K data blocks over 4 data-hosting nodes • Principal Components Analysis (PCA): dataset sizes 4 GB and 17 GB, again divided into 4K data blocks
Results (K-means Clustering) • Configurations without failure • Without FTS • With FTS (async obj exchange) • Configuration with failure • Failure after processing 50% of the data (on one node) • Relative slowdowns: 18% for 4, 5.4% for 8, and 6.7% for 16 compute nodes • [Chart: Execution Times with K-means, 25.6 GB Dataset]
Results (PCA) • Reduction object size: 128 KB for 4 compute nodes, 4 GB dataset • With FTS, absolute overheads: 3.9% for 4 and 1.4% for 8 nodes • Relative overheads: 21.2% for 4, 8.6% for 8, and 7.8% for 16 nodes • [Chart: Execution Times with PCA, 17 GB Dataset]
Overheads of Recovery: Different Failure Points • Failure points near the beginning (25%), middle (50%), and close to the end (75%) • Relative slowdowns: a) K-means: 16.6%, 9%, 7.2% respectively; b) PCA: 12.2%, 8.6%, 5.6% respectively • Absolute slowdowns: less than 5%
Comparison with Hadoop • w/f = with failure • Failure happens after processing 50% of the data on one node • Overheads (4 | 8 | 16 nodes) • Hadoop: 23.06 | 71.78 | 78.11 • FREERIDE-G: 20.37 | 8.18 | 9.18 • [Chart: K-means Clustering, 6.4 GB Dataset]
Comparison with Hadoop (contd.) • One of the compute nodes failed after processing 25, 50, and 75% of its data • Overheads (failure at 25% | 50% | 75%) • Hadoop: 32.85 | 71.21 | 109.45 • FREERIDE-G: 9.52 | 8.18 | 8.14 • [Chart: K-means Clustering, 6.4 GB Dataset, 8 Compute Nodes]
Conclusions (FREERIDE-G approach) • Growing need for supporting fault tolerance in scale-out, commodity-hardware clusters • FREERIDE-G FTS has low overhead • The failure recovery process is efficient: only a marginal slowdown occurs as the remaining work is redistributed • The system can outperform Hadoop both in the absence and in the presence of failures • Sorting and shuffling overhead is eliminated • No need to restart the failed task, as in Hadoop • State caching • The RObj approach gives the programmer flexibility to implement different designs using a simple high-level API
Case Study: Amazon EC2 Service Disruption (April 21st, 2011)
Amazon EC2 Key Concepts • Availability Zones • Distinct locations engineered for immunity against failures in other Availability Zones • Analogous to physical data centers (but may not actually be!) • Cheap, low-latency network connectivity across zones in a region • Users can protect applications from failure of a single location • Availability Regions • Geographically dispersed, consisting of one or more Availability Zones • US East (Virginia), US West (N. California), EU West (Ireland), Asia Pacific (Singapore), etc.
Amazon EC2 (contd.) • Amazon Elastic Block Store (EBS) • Block-level storage volumes from 1 GB to 1 TB that behave like raw, unformatted drives • Persistent storage for Amazon EC2 instances (automatically replicated within the same Availability Zone) • Elastic IP Address • A static IP address associated with the account instead of an instance • Can mask instance or Availability Zone failures by programmatically remapping the public IP address to any instance (see the sketch below) • Elastic Load Balancing • Automatically distributes incoming application traffic across multiple Amazon EC2 instances • Detects unhealthy instances within a pool and automatically reroutes traffic to healthy instances
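A minimal remapping sketch using boto3 (assumes AWS credentials are configured and that an Elastic IP and a healthy standby instance already exist; the allocation id and instance id below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ALLOCATION_ID = "eipalloc-XXXXXXXX"          # placeholder Elastic IP allocation
STANDBY_INSTANCE = "i-XXXXXXXXXXXXXXXXX"     # placeholder healthy standby instance

# Re-point the public address at the standby: clients keep using the same
# IP, which masks the failure of the original instance or its zone.
ec2.associate_address(
    AllocationId=ALLOCATION_ID,
    InstanceId=STANDBY_INSTANCE,
    AllowReassociation=True,
)
```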
A Simple Redundant Website • Elastic IP addresses assigned to 2 web servers running the web application, mapped to the same domain name • One master DB, and a slave database in a second Availability Zone; data is replicated in real time • Bandwidth across zone boundaries is not free • $0.01/GB for "regional" traffic, for both data in and data out
Redundant Site (contd.) • Suppose the zone hosting the web servers and the master DB (in US-EAST-1) fails, say due to a fire! • Promote the slave in the second zone to master, and launch new web/app server instances • Create redundancy in a third Availability Zone
EBS System – Fault Tolerance • Two main components of the EBS service: • A set of EBS clusters (each running in an Availability Zone) to store user data and serve requests to EC2 instances • A set of control plane services to coordinate user requests and propagate them to the EBS clusters, one set per region • An EBS cluster is made up of EBS nodes • EBS data is replicated to multiple EBS nodes in a cluster
EBS System (contd.) • EBS nodes are connected via: • Primary network: high bandwidth, used in normal operation • Secondary network: lower capacity, used for replication • Re-mirroring • When a node loses connectivity to its replica, it assumes the replica node has failed • It searches for another node with enough space, establishes connectivity, and starts replicating (re-mirroring) • Finding a location for the new replica normally takes milliseconds • Access to the data is blocked while re-mirroring, to preserve consistency • To an EC2 instance doing I/O, the volume appears "stuck"
Amazon EC2 Outage (April 21st, 2011) • Amazon experienced a large outage in the US-East region that took down popular sites such as Foursquare, Quora, Reddit, Hootsuite, and Flightcaster • Amazon: "We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS (Elastic Block Storage) volumes in multiple availability zones in the US-EAST-1 region"
Primary Outage (12:47 AM PDT on April 21st) • A network configuration change was performed to upgrade the capacity of the primary network in a single Availability Zone • Operation: shift traffic off of the primary EBS network. Executed incorrectly: traffic was routed onto the lower-capacity secondary EBS network • A portion of the EBS cluster in the affected Availability Zone went down, as both primary and secondary networks were affected at once • EBS nodes lost connection to their replicas • The incorrect traffic shift was rolled back • A large number of nodes concurrently tried to find available server space for re-mirroring; a "re-mirroring storm" occurred and 13% of the volumes in the Zone became "stuck" • The EBS control plane suffered thread starvation and began failing requests across the Region • Consequence: more and more nodes became "stuck"
Recovery Process • 8:20 AM PDT • Disable all communication between the EBS cluster in the affected Availability Zone and the control plane • Error rates and latencies returned to normal for the rest of the region • 11:30 AM PDT • Prevent EBS servers in the degraded cluster from contacting other servers for replication (make the affected volumes less "hungry") • Only the 13% of volumes in the Availability Zone that were already stuck remained affected; the outage was contained • April 22nd, 2:00 AM PDT • Need to provide additional capacity for the huge number (13%) of stuck volumes to mirror to • Physically relocate storage capacity from elsewhere in the US East region to the degraded cluster • Re-establish complete EBS control plane API access • Large backlog of requests • First tried throttling state propagation to avoid crashing the control plane again • Final strategy: build a separate control plane for the affected Zone, to avoid impacting other zones while processing the backlog
Recovery (contd.) • April 23rd, 6:15 PM PDT • Traffic tests and steady processing of the backlog • Full functionality restored in the affected Availability Zone • Amazon: "0.07% of the volumes in the affected Availability Zone [that] could not be restored for customers in a consistent state" (not disclosed in GBs!)
Lessons Learnt • Better capacity planning and alarming • Audit the change process, increase automation • Additional safety capacity for large-scale failures • Modify retry logic in nodes • Prevent re-mirroring storms in clusters • Back off more aggressively (see the backoff sketch below) • Focus on re-establishing connectivity with previous replicas rather than futilely searching for new nodes to re-mirror • Lessons for users • Users taking advantage of a multi-Availability-Zone architecture were least impacted (e.g., Netflix)
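An illustration of the "back off more aggressively" idea (a generic retry-with-backoff pattern, not Amazon's actual EBS node code):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=6, base=0.5, cap=60.0):
    # Exponential backoff with jitter: each failed attempt waits roughly
    # twice as long as the previous one, so thousands of failed nodes do
    # not keep hammering the cluster at full rate (the re-mirroring storm).
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```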
Lessons Learnt (contd.) • Netflix survival strategies • Using different regions simultaneously helped give uninterrupted service • Netflix keeps multiple redundant hot copies of data across zones; in case of failure, switch to a hot standby (N+1 redundancy with all zones active) • Chaos Monkey randomly kills instances and services; in the future, Chaos Gorilla (takes an entire AZ down) • Graceful degradation • Fail fast: set aggressive timeouts so failing components don't make the entire system crawl • Fallbacks: each feature is designed to degrade or fall back to a lower-quality representation • Feature removal: if a non-critical feature is slow, remove it from the given page to prevent impact on the overall experience
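A toy illustration of fail-fast with a fallback (not Netflix's actual code; `fetch_personalized_row` and `POPULAR_TITLES` in the usage comment are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=8)

def call_with_fallback(fetch, fallback, timeout=0.2):
    # Fail fast: give the dependency a short time budget; if it is slow or
    # throws, degrade to the cheaper fallback representation instead of
    # letting one failing component make the whole page crawl.
    future = _executor.submit(fetch)
    try:
        return future.result(timeout=timeout)
    except Exception:
        return fallback()

# Usage (hypothetical): call_with_fallback(fetch_personalized_row, lambda: POPULAR_TITLES)
```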
References
[1] Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce.
[2] Tekin Bicer, Wei Jiang and Gagan Agrawal. Supporting Fault Tolerance in a Data-Intensive Computing Middleware.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
[4] Web links:
Amazon EC2 – http://aws.amazon.com/message/65648/ and http://aws.amazon.com/ec2/faqs/
Netflix – http://tinyurl.com/3bvyvef
Fault-tolerant site – http://tinyurl.com/2ybok5