CARDIO: Cost-Aware Replication for Data-Intensive workflOws Presented by Chen He
Motivation • Are large-scale clusters reliable? • An average of 5 worker deaths per MapReduce job • At least 1 disk failure in every run of a 6-hour MapReduce job on a 4000-node cluster
Motivation • How do we prevent node failures from affecting performance? • Replication • Capacity constraint • Replication time, etc. • Regeneration through re-execution • Delays program progress • Cascaded re-execution
Motivation • The trade-off: COST vs. AVAILABILITY (pictures adapted from the Internet)
Outline • Problem Exploration • CARDIO Model • Hadoop CARDIO System • Evaluation • Discussion
Problem Exploration • Performance Costs • Replication cost (R) • Regeneration cost (G) • Reliability cost (Z) • Execution cost (A) • Total cost (T) • Disk cost (Y) • T = A + Z • Z = R + G
Problem Exploration • Experiment Environment • Hadoop 0.20.2 • 25 VMs • Workloads: Tagger → Join → Grep → RecordCounter
Problem Exploration Summary • Replication Factor for MR Stages
Problem Exploration Summary • Detailed Execution Time of 3 Cases
CARDIO Model • Block Failure Model • Output of stage i is $d_i$ • Replication factor is $x_i$ • Total block number is $n_i$ ($d_i$ divided by the block size) • Single block failure probability is $p$ • Failure probability in stage i: $f_i = 1 - (1 - p^{x_i})^{n_i}$
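The reconstructed failure model is easy to check numerically. A minimal sketch, assuming independent replica failures (the class and variable names are mine, not from the paper):

public final class FailureModel {

    // Probability that a single block is lost, i.e. all x replicas fail.
    static double blockLoss(double p, int x) {
        return Math.pow(p, x);
    }

    // Probability that stage i loses at least one of its n blocks.
    static double stageFailure(double p, int x, int n) {
        return 1.0 - Math.pow(1.0 - blockLoss(p, x), n);
    }

    public static void main(String[] args) {
        // With p = 0.2, x = 2 replicas and n = 10 blocks:
        // f = 1 - (1 - 0.04)^10 ≈ 0.335
        System.out.println(stageFailure(0.2, 2, 10));
    }
}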
CARDIO Model • Cost Computation Model • Total time of stage i: $t_i = a_i + r_i$, where $a_i$ is its execution time • Replication cost of stage i: $r_i = \delta\, d_i (x_i - 1)$ • Expected regeneration time of stage i: $g_i = f_i\, a_i$ (assuming the stage's inputs survive; cascaded re-execution adds earlier stages whose output was also lost) • Reliability cost for all stages: $Z = \sum_i (r_i + g_i)$ • Storage constraint C over all stages: $\sum_i d_i x_i \le C$ • Choose $\{x_i\}$ to minimize Z
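A worked instance of the reconstructed cost terms, with illustrative numbers rather than values from the paper ($\delta = 0.2$, $d_i = 10$, $x_i = 2$, $n_i = 10$, $p = 0.2$, $a_i = 30$):

$r_i = \delta\, d_i (x_i - 1) = 0.2 \times 10 \times 1 = 2$
$f_i = 1 - (1 - 0.2^2)^{10} \approx 0.335$
$g_i = f_i\, a_i \approx 0.335 \times 30 \approx 10.1$
$r_i + g_i \approx 12.1$

Raising the replication factor to $x_i = 4$ raises $r_i$ to 6 but drives $f_i$ (and hence $g_i$) to nearly zero, which is exactly the trade-off the optimization below balances.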
CARDIO Model • Dynamic Replication • The replication factor x may vary as the job progresses • When the job is at step k, the replication factor of stage i's output at this step is $x_{i,k}$
CARDIO Model • Model for Reliability • Minimize $Z = \sum_i (r_i + g_i)$ • Based on the failure model $f_i = 1 - (1 - p^{x_i})^{n_i}$ • Subject to $\sum_i d_i x_i \le C$ and $x_i \ge 1$
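For small workflows the search space is tiny, so a brute-force solver illustrates the problem well. A sketch under the reconstructed model above (the simplified regeneration term $g_i = f_i a_i$, the bound xMax, and all names are my assumptions):

public final class ReliabilityOptimizer {

    static double stageFailure(double p, int x, int n) {
        return 1.0 - Math.pow(1.0 - Math.pow(p, x), n);
    }

    // Reliability cost Z = sum of replication + expected regeneration.
    static double cost(int[] x, double[] d, double[] a, int[] n,
                       double p, double delta) {
        double z = 0.0;
        for (int i = 0; i < x.length; i++) {
            z += delta * d[i] * (x[i] - 1)              // replication cost
               + stageFailure(p, x[i], n[i]) * a[i];    // expected regeneration
        }
        return z;
    }

    static double storage(int[] x, double[] d) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += d[i] * x[i];
        return s;
    }

    public static void main(String[] args) {
        double[] d = {10, 40, 5};    // stage output sizes
        double[] a = {30, 60, 20};   // stage execution times
        int[] n = {10, 40, 5};       // block counts per stage
        double p = 0.2, delta = 0.2, C = 120;
        int xMax = 4;                // search bound on replication factors

        int[] x = new int[d.length];
        int[] best = null;
        double bestZ = Double.MAX_VALUE;
        int total = (int) Math.pow(xMax, d.length);
        for (int code = 0; code < total; code++) {
            int c = code;
            for (int i = 0; i < d.length; i++) { x[i] = 1 + c % xMax; c /= xMax; }
            if (storage(x, d) > C) continue;   // enforce storage constraint
            double z = cost(x, d, a, n, p, delta);
            if (z < bestZ) { bestZ = z; best = x.clone(); }
        }
        System.out.println(java.util.Arrays.toString(best) + "  Z = " + bestZ);
    }
}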
CARDIO Model • Resource Utilization Model • Model: cost = resources utilized • Resource types Q • CPU, network, disk I/O, and storage resources, etc. • Utilization of resource q in stage i: $u_i^q$ • Normalize usage by the capacity of resource q • Relative cost weights: $w^q$
CARDIO Model • Resource Utilization Model • The cost for A is $A = \sum_i \sum_{q \in Q} w^q\, u_i^q$ • Total cost: $T = A + Z$ • Optimization target: minimize T • Choose the replication factors $\{x_{i,k}\}$ to minimize T
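A small sketch of the execution-cost term under this reconstruction; the normalization by a fixed capacity and the weight values are assumptions for illustration:

public final class ResourceCost {
    public static void main(String[] args) {
        // Rows: stages. Columns: CPU, NET, DSKIO, STG (raw usage).
        double[][] usage = {{80, 10, 20, 10}, {20, 60, 55, 40}};
        double[] capacity = {100, 100, 100, 100}; // per-resource capacity
        double[] w = {1.0, 0.5, 0.5, 0.2};        // relative cost weights

        double a = 0.0;
        for (double[] stage : usage)
            for (int q = 0; q < w.length; q++)
                a += w[q] * (stage[q] / capacity[q]); // weighted normalized usage
        System.out.println("A = " + a);
    }
}

An over-utilized resource is regarded as expensive, so it gets a larger weight and replication decisions that stress it are penalized more heavily in T.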
CARDIO Model • Optimization Problem • Job optimality (JO): choose the replication factors for all stages at once • Stage optimality (SO): choose the replication factors stage by stage as the job runs
Hadoop CARDIO System • CardioSense • Obtains job progress from the JobTracker (JT) periodically • Triggered by a pre-configured threshold value • Collects resource-usage statistics for running stages • Relies on HMon on each worker node • HMon, built on Atop, has low overhead
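A minimal sketch of the periodic progress polling, using the standard Hadoop 0.20 JobClient API; the class name, threshold value, and polling interval are hypothetical:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class ProgressPoller {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        float threshold = 0.9f; // hypothetical pre-configured trigger
        while (true) {
            for (JobStatus status : client.jobsToComplete()) {
                RunningJob job = client.getJob(status.getJobID());
                float progress = (job.mapProgress() + job.reduceProgress()) / 2;
                if (progress >= threshold) {
                    // A stage is close to finishing: hand off to CardioSolve.
                    System.out.println(job.getID() + " progress " + progress);
                }
            }
            Thread.sleep(10000); // poll the JobTracker every 10 s
        }
    }
}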
Hadoop CARDIO System • CardioSolve • Receives data from CardioSense • Solves the SO problem • Decides the replication factors for the current and previous stages
Hadoop CARDIO System • CardioAct • Implements the commands from CardioSolve • Uses the HDFS API setReplication(file, replicaNumber)
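For reference, the underlying HDFS call is FileSystem.setReplication(Path, short). A minimal sketch; the path and target factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path stageOutput = new Path("/output/stage2"); // hypothetical path
        short factor = 3;                              // new replication factor
        boolean ok = fs.setReplication(stageOutput, factor);
        System.out.println("setReplication returned " + ok);
    }
}

Note that setReplication() only updates the target factor in the NameNode's metadata; blocks are re-replicated or deleted asynchronously, which affects how quickly CardioAct's decisions take effect.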
Evaluation • Several Important Parameters • p is the failure rate, 0.2 if not specified • δ is the time to replicate a data unit, 0.2 as well • $c_i$ is the computation resource of stage i; it follows a uniform distribution U(1, Cmax), with Cmax = 100 in general • $d_i$ is the output of stage i, obtained from a uniform distribution U(1, Dmax); Dmax varies within [1, Cmax] • C is the storage constraint for the whole process; its default value is
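A hypothetical generator for these synthetic parameters, mirroring the distributions on the slide (the seed and stage count are arbitrary):

import java.util.Random;

public class ParamGenerator {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int stages = 4;
        double cMax = 100, dMax = 50;  // Dmax varies within [1, Cmax]
        for (int i = 0; i < stages; i++) {
            double c = 1 + rnd.nextDouble() * (cMax - 1); // c_i ~ U(1, Cmax)
            double d = 1 + rnd.nextDouble() * (dMax - 1); // d_i ~ U(1, Dmax)
            System.out.printf("stage %d: c = %.1f, d = %.1f%n", i, c, d);
        }
    }
}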
Evaluation • Effect of Dmax
Evaluation • Effect of failure rate p
Evaluation • Effect of block size
Evaluation • Effect of different resource constraints • ++ means over-utilized; such a resource is regarded as expensive • P = 0.08, C = 204 GB, δ = 0.6 • S3 is CPU-intensive • DSK shows a performance pattern similar to NET • Configurations: CPU 0010, NET 0011, DSKIO 0011, STG 0011
Evaluation • S2 re-executes more frequently under failure injection because it has a large data output • P = 0.02, 0.08, and 0.1 • Re-execution counts: 1, 3, and 21 • The HDFS setReplication() API is one reason (see Discussion)
Discussion • Problems • Typos and misleading symbols • HDFS API setReplication() • Any other ideas?