Combating Outliers in map-reduce
Srikanth Kandula, Ganesh Ananthanarayanan, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Ed Harris
[Figure: log(size of cluster) vs. log(size of dataset); map-reduce clusters (~10^3 to 10^5 machines) operate on TB-to-PB-scale datasets, alongside HPC and parallel databases; example data: the Internet, click logs, bio/genomic data]
• map-reduce
  • decouples operations on data (user-code) from mechanisms to scale
  • is widely used
    • Cosmos (based on SVC’s Dryad) + Scope @ Bing
    • MapReduce @ Google
    • Hadoop inside Yahoo! and on Amazon’s Cloud (AWS)
An Example
Goal: Find frequent search queries to Bing
What the user says:
SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
How it works: the job manager assigns work and collects progress; map tasks read file blocks and write their output locally; reduce tasks read that intermediate output over the network and write the final output blocks.
[Diagram: file blocks → Map tasks (local write) → Reduce tasks (remote read) → output blocks]
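For concreteness, here is a minimal Python sketch of how such a query maps onto the map and reduce stages; the function names and the in-memory shuffle are illustrative, not the Scope/Cosmos API:

```python
from collections import defaultdict

# Hypothetical mapper/reducer pair for the frequent-query example.
def map_fn(log_line):
    """Emit (query, 1) for every record in the search log."""
    yield (log_line.strip(), 1)

def reduce_fn(query, counts):
    """Sum the partial counts for one query."""
    return (query, sum(counts))

def run_job(log_lines, freq_threshold):
    # Shuffle stage: group intermediate pairs by key (in memory, for illustration).
    groups = defaultdict(list)
    for line in log_lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce stage, followed by the HAVING Freq > X filter.
    results = (reduce_fn(q, vs) for q, vs in groups.items())
    return [(q, freq) for q, freq in results if freq > freq_threshold]

if __name__ == "__main__":
    log = ["weather", "maps", "weather", "weather", "news", "maps"]
    print(run_job(log, freq_threshold=1))   # [('weather', 3), ('maps', 2)]
```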
We find that: Outliers slow down map-reduce jobs
[Diagram: an example workflow; File System → Map.Read (22K) → Map.Move (15K) → Map (13K) → Barrier → Reduce (51K)]
• Goals
  • speeding up jobs improves productivity
  • predictability supports SLAs
  • … while using resources efficiently
This talk…
Identify fundamental causes of outliers
• concurrency leads to contention for resources
• heterogeneity (e.g., disk loss rate)
• map-reduce artifacts
Current schemes duplicate long-running tasks
Mantri: a cause-, resource-aware mitigation scheme
• takes distinct actions based on cause
• considers resource cost of actions
Results from a production deployment
Why bother? Frequency of outliers
stragglers = tasks that take at least 1.5 times the median task in that phase
recomputes = tasks that are re-run because their output was lost
• The median phase has 10% stragglers and no recomputes
• 10% of the stragglers take >10X longer
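As a rough illustration of the straggler definition, the sketch below flags tasks that take at least 1.5 times the phase median; the function and the sample durations are made up:

```python
import statistics

def find_stragglers(task_durations, factor=1.5):
    """Return the indices of tasks running at least `factor` times the phase median."""
    median = statistics.median(task_durations)
    return [i for i, d in enumerate(task_durations) if d >= factor * median]

durations = [10, 12, 11, 13, 48, 11, 130]     # seconds, tasks of one phase
print(find_stragglers(durations))             # [4, 6] -> the 48 s and 130 s tasks
```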
Why bother? Cost of outliers (what-if analysis that replays logs in a trace-driven simulator)
At the median, jobs slowed down by 35% due to outliers
Why outliers? Problem: Due to unavailable input, tasks have to be recomputed
A delay due to a recompute readily cascades downstream through the map → sort → reduce pipeline
Why outliers? Problem: Due to unavailable input, tasks have to be recomputed
(simple) Idea: Replicate intermediate data; use the copy if the original is unavailable
Challenge(s): What data to replicate? Where? What if we still miss data?
• Insights:
  • 50% of the recomputes are on 5% of machines
  • weigh the cost to recompute against the cost to replicate
    t = predicted runtime of the task
    r = predicted probability of a recompute at the machine
    t_rep = cost to copy the data over within the rack
    For a task on machine M2 whose input was produced on machine M1: t_redo = r2 (t2 + t1_redo)
Mantri preferentially acts on the more costly recomputes, replicating output only when t_rep < t_redo
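A hedged sketch of that comparison: the expected recompute cost follows the recursive t_redo expression above, and output is replicated only when copying it is cheaper than the expected rework. The function names and numbers are illustrative, not Mantri's implementation:

```python
def expected_redo_cost(r, t, upstream_redo=0.0):
    """Expected time lost if this task's output must be recomputed:
    probability r of a recompute on this machine, times the task runtime t
    plus the (recursive) cost of regenerating the task's own input."""
    return r * (t + upstream_redo)

def should_replicate(t_rep, r, t, upstream_redo=0.0):
    """Replicate the output only if the in-rack copy (t_rep) is cheaper
    than the expected cost of redoing the work."""
    return t_rep < expected_redo_cost(r, t, upstream_redo)

# Example: machine M1 feeds a task on machine M2 (numbers are made up).
t1_redo = expected_redo_cost(r=0.05, t=40)                         # upstream task
t2_redo = expected_redo_cost(r=0.10, t=60, upstream_redo=t1_redo)  # 6.2 s expected rework
print(should_replicate(t_rep=3.0, r=0.10, t=60, upstream_redo=t1_redo))  # True
```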
Why outliers? Problem: Tasks reading input over the network experience variable congestion
• reduce tasks are placed at the first available slot
• uneven placement of map output across racks is typical in production
Why outliers? Problem: Tasks reading input over the network experience variable congestion
Idea: Avoid hot-spots; keep the traffic on a link proportional to its bandwidth
Challenge(s): Global co-ordination across jobs? Where is the congestion?
• Insights:
  • local control is a good approximation (each job balances its own traffic)
  • link utilizations average out over the long term and are steady over the short term
If rack i has d_i map output and u_i, v_i bandwidths available on its uplink and downlink, place a fraction a_i of the reduces in rack i so that the slowest transfer finishes as early as possible: minimize over {a_i} the maximum over racks i of max(d_i (1 - a_i) / u_i, (D - d_i) a_i / v_i), where D = Σ_j d_j
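One way to compute such a placement, sketched below: bisection on the finishing time T under the transfer-time model stated above (rack i spends d_i(1-a_i)/u_i uploading and (D-d_i)a_i/v_i downloading). The solver and its details are an illustration, not Mantri's actual code:

```python
def place_reduces(d, u, v, iters=60):
    """Pick the fraction a[i] of reduce tasks per rack so that the slowest
    cross-rack transfer finishes as early as possible.
    d[i]: map output already in rack i; u[i], v[i]: available up/down bandwidth."""
    D, n = sum(d), len(d)

    def bounds(T):
        # Per-rack feasible range for a[i] if every transfer must finish by time T.
        lo = [max(0.0, 1.0 - T * u[i] / d[i]) if d[i] > 0 else 0.0 for i in range(n)]
        hi = [min(1.0, T * v[i] / (D - d[i])) if D > d[i] else 1.0 for i in range(n)]
        return lo, hi

    t_lo, t_hi = 0.0, max(D / min(v), max(d) / min(u))  # t_hi is always achievable
    for _ in range(iters):
        T = (t_lo + t_hi) / 2
        lo, hi = bounds(T)
        if all(l <= h for l, h in zip(lo, hi)) and sum(lo) <= 1.0 <= sum(hi):
            t_hi = T            # feasible: try to finish sooner
        else:
            t_lo = T            # infeasible: allow more time
    lo, hi = bounds(t_hi)
    a, rest = lo[:], 1.0 - sum(lo)
    for i in range(n):          # hand out the remaining fraction within the bounds
        give = min(hi[i] - a[i], rest)
        a[i] += give
        rest -= give
    return a

# Three racks: rack 0 holds most of the map output but has a slow downlink.
print(place_reduces(d=[60, 30, 10], u=[1.0, 1.0, 1.0], v=[0.5, 1.0, 1.0]))
```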
Why outliers? Persistently slow machines rarely cause outliers: the cluster software (Autopilot) quarantines persistently faulty machines
Why outliers? Problem: About 25% of outliers occur because a task simply has more dataToProcess
Solution: Leaving these tasks alone is better than the state-of-the-art (duplicating them)!
In an ideal world, we could divide work evenly; instead, we schedule tasks in descending order of dataToProcess
Theorem [due to Graham, 1969]: Doing so is no more than 33% worse than the optimal
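A minimal sketch of that schedule, i.e., assigning tasks in descending order of dataToProcess to the least-loaded slot (the slot model and numbers are illustrative):

```python
import heapq

def schedule_lpt(task_sizes, num_slots):
    """Greedily assign tasks, largest dataToProcess first, to the least-loaded slot.
    Graham's bound: the resulting makespan is within 4/3 of optimal (<= ~33% worse)."""
    slots = [(0.0, s) for s in range(num_slots)]        # (load, slot id) min-heap
    heapq.heapify(slots)
    assignment = {s: [] for s in range(num_slots)}
    for size in sorted(task_sizes, reverse=True):
        load, s = heapq.heappop(slots)
        assignment[s].append(size)
        heapq.heappush(slots, (load + size, s))
    return assignment, max(load for load, _ in slots)

tasks = [7, 5, 4, 4, 3, 3, 2]                 # dataToProcess per task
print(schedule_lpt(tasks, num_slots=3))       # makespan 10 (lower bound 28/3 ≈ 9.3)
```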
Why outliers? Problem: 25% of outliers remain, likely due to contention at the machine
Idea: Restart tasks elsewhere in the cluster
Challenge(s): The earlier the better, but should we restart the outlier or start a pending task instead?
[Diagram: a running task with remaining time t_rem vs. a potential restart taking t_new; act only if doing so saves time and resources]
• If the predicted time is much better, kill the original and restart elsewhere
• Else, if other tasks are pending, duplicate iff it saves both time and resources
• Else (no pending work), duplicate iff the expected savings are high
• Continuously observe and kill wasteful copies
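A sketch of those decision rules in Python; the threshold gamma and the exact savings tests are assumed values for illustration, not Mantri's precise predicates:

```python
def mitigate(t_rem, t_new, pending_tasks, gamma=3.0):
    """Decide what to do with a running outlier.
    t_rem : predicted time left for the current copy
    t_new : predicted time for a fresh copy started elsewhere
    gamma : how much better the restart must look (assumed threshold)."""
    if gamma * t_new < t_rem:
        # Restart is far better: kill the laggard and run a new copy elsewhere.
        return "kill_and_restart"
    if pending_tasks > 0:
        # Other work is waiting: duplicate only if the extra copy saves both
        # time and resources, i.e. it more than pays for the slot it occupies.
        return "duplicate" if 2 * t_new < t_rem else "leave_alone"
    # No pending work (slots would otherwise idle): duplicate if the
    # expected time savings are appreciable.
    return "duplicate" if t_new < 0.75 * t_rem else "leave_alone"

print(mitigate(t_rem=120, t_new=30, pending_tasks=5))   # kill_and_restart
print(mitigate(t_rem=50,  t_new=30, pending_tasks=5))   # leave_alone
print(mitigate(t_rem=80,  t_new=30, pending_tasks=0))   # duplicate
```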
Summary
• preferentially replicate costly-to-recompute tasks
• each job locally avoids network hot-spots
• quarantine persistently faulty machines
• schedule in descending order of data size
• restart or duplicate tasks, cognizant of resource cost; prune wasteful copies
Theme: cause-, resource-aware action
Explicit attempt to decouple the solutions, with partial success
Results
Deployed in production Cosmos clusters
• prototype Jan ’10, baking on pre-production clusters; release May ’10
Trace-driven simulations
• thousands of jobs
• mimic workflow, task runtime, data skew, failure probability
• compare with existing schemes and idealized oracles
In production, restarts… improve on native Cosmos by 25% while using fewer resources
Comparing jobs in the wild: 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)
[Figures: CDFs of % cluster resources]
In trace-replay simulations, restarts… are much better dealt with in a cause-, resource-aware manner
[Figures: CDFs of % cluster resources]
Protecting against recomputes
[Figure: CDF of % cluster resources]
Outliers in map-reduce clusters • are a significant problem • happen due to many causes • interplay between storage, network and map-reduce • cause-, resource- aware mitigation improves on prior art