Lecture 14: Combating Outliers in MapReduce Clusters Xiaowei Yang
References: • Reining in the Outliers in Map-Reduce Clusters using Mantri by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris • http://research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
[Figure: log-log plot of cluster size vs. dataset size (GB 10^9, TB 10^12, PB 10^15, EB 10^18): MapReduce clusters sit at roughly 10^3-10^5 machines, above HPC and parallel databases, handling e.g. the Internet, click logs, bio/genomic data]
MapReduce
• Decouples customized data operations from mechanisms to scale
• Is widely used
• Cosmos (based on SVC's Dryad) + Scope @ Bing
• MapReduce @ Google
• Hadoop inside Yahoo! and on Amazon's Cloud (AWS)
An Example
Goal: find frequent search queries to Bing. What the user says:
SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
How it works: a job manager assigns work and collects progress; map tasks read input file blocks and write their output locally; reduce tasks read from every map's output and write the final output blocks.
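The query above can be sketched as a toy MapReduce job in Python. This is an illustrative sketch with made-up helper names, not the Scope/Cosmos implementation:

```python
from collections import Counter

def map_phase(block):
    """Map: emit a (query, 1) pair for every query in an input block."""
    return [(query, 1) for query in block]

def reduce_phase(pairs, x):
    """Reduce: sum the counts per query and keep those with Freq > x."""
    counts = Counter()
    for query, one in pairs:
        counts[query] += one
    return {q: c for q, c in counts.items() if c > x}

# Two input file blocks, each handled by its own map task.
blocks = [["maps", "news", "maps"], ["news", "maps"]]
pairs = [kv for b in blocks for kv in map_phase(b)]
print(reduce_phase(pairs, x=2))  # {'maps': 3}
```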
We find that: outliers slow down map-reduce jobs.
[Figure: slowdown by phase type in the traces: Map.Read (22K phases), Map.Move (15K), Map (13K), Barrier, Reduce (51K)]
• Goals
• Speeding up jobs improves productivity
• Predictability supports SLAs
• … while using resources efficiently
What is an outlier
• A phase (map or reduce) has n tasks and s slots (available compute resources)
• Task i's runtime ti = f(datasize, code, machine, network)
• If every task took the same T seconds, a naïve scheduler would finish the phase in ceiling(n/s) * T
• Goal: keep the actual completion time close to this ideal
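A minimal sketch of the ideal-runtime calculation (function and argument names are illustrative):

```python
import math

def ideal_runtime(n_tasks, n_slots, t_task):
    """Ideal phase runtime with uniform tasks:
    ceil(n/s) waves, each taking T seconds."""
    return math.ceil(n_tasks / n_slots) * t_task

# 100 tasks on 32 slots at 10 s each -> 4 waves -> 40 s
print(ideal_runtime(100, 32, 10))
```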
From a phase to a job • A job may have many phases • An outlier in an early phase has a cumulative effect • Data loss may cause multi-phase recompute outliers
Why outliers?
Problem: due to unavailable input, tasks have to be recomputed. A delay due to a recompute readily cascades through dependent phases (map, sort, reduce).
Previous work
• The original MapReduce paper observed the problem
• But didn't deal with it in depth
• Solution was to duplicate the slow tasks
• Drawbacks
• Some duplicates may be unnecessary
• Duplicates use extra resources
• Placement, not the task itself, may be the problem
Quantifying the Outlier Problem • Approach: • Understanding the problem first before proposing solutions • Understanding often leads to solutions • Prevalence of outliers • Causes of outliers • Impact of outliers
Why bother? Frequency of Outliers
• stragglers = tasks that take more than 1.5x the median task in that phase
• recomputes = tasks that are re-run because their output was lost
• 50% of phases have >10% stragglers and no recomputes
• 10% of the stragglers take >10x longer
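The straggler definition above (>1.5x the phase median) is straightforward to state in code; this is an illustrative sketch, not Mantri's implementation:

```python
from statistics import median

def stragglers(durations, threshold=1.5):
    """Return indices of tasks whose duration exceeds
    `threshold` times the median task in the phase."""
    m = median(durations)
    return [i for i, t in enumerate(durations) if t > threshold * m]

# Median is 11 s, so only the 40 s task (index 4) is a straggler.
print(stragglers([10, 11, 9, 12, 40]))
```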
Causes of outliers: data skew
• In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to large amounts of data to process
• Duplicating such tasks will not help!
Non-outliers can be improved as well
• 20% of them run 55% longer than the median task
Problem: tasks reading input over the network experience variable congestion
• Uneven placement of map output is typical in production
• Reduce tasks are placed at the first available slot
Causes of outliers: cross-rack traffic
• 70% of cross-rack traffic is reduce traffic
• A reduce reads from every map, but is put into any spare slot
• Tasks in a spot with a slow network run slower, and tasks compete with each other for network bandwidth
• As a result, 50% of phases take 62% longer to finish than under ideal placement
Causes of outliers: bad and busy machines
• 50% of recomputes happen on 5% of the machines
• Recomputes increase resource usage
Outliers cluster by time • Resource contention might be the cause • Recomputes cluster by machines • Data loss may cause multiple recomputes
Why bother? Cost of outliers (what-if analysis: replays logs in a trace-driven simulator). At the median, jobs are slowed down by 35% due to outliers.
High-level idea
• Be cause-aware and resource-aware
• Runtime = f (input, network, machine, dataToProcess, …)
• Fix each cause with a different strategy
Resource-aware restarts • Duplicate or kill long outliers
When to restart
• Every ∆ seconds, tasks report progress
• Estimate trem (remaining time of the running copy) and tnew (runtime of a fresh copy)
Restart policy (at most γ = 3 copies per task):
• Schedule a duplicate only if the total running time is expected to be smaller: with c copies running, duplicate when P(c · trem > (c+1) · tnew) > δ
• When there are available slots, restart if the expected time saved exceeds the restart cost: E(trem − tnew) > ρ · ∆
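A sketch of the duplicate decision under the rule P(c · trem > (c+1) · tnew) > δ, estimating the probability from empirical samples of the two runtimes. The sampling approach and names are assumptions for illustration, not Mantri's code:

```python
def should_duplicate(trem_samples, tnew_samples, copies, delta=0.25):
    """Duplicate only if total work likely shrinks:
    estimate P(c * trem > (c+1) * tnew) over all sample pairs
    and compare against the threshold delta."""
    hits = sum(1 for tr in trem_samples for tn in tnew_samples
               if copies * tr > (copies + 1) * tn)
    total = len(trem_samples) * len(tnew_samples)
    return hits / total > delta

# Long remaining time vs. short fresh copies -> duplicate.
print(should_duplicate([100, 120], [30, 40], copies=1))
# Task is nearly done -> a duplicate would only waste resources.
print(should_duplicate([30], [100], copies=1))
```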
Network-Aware Placement
• Compute the rack location for each task
• Find the placement that minimizes the maximum data transfer time
• If rack i has di map output and ui, vi bandwidths available on its uplink and downlink, place an ai fraction of reduces in rack i so as to minimize the slowest rack's transfer time: min over {ai} of max_i max( di·(1−ai)/ui , (D−di)·ai/vi ), where D is the total map output
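For two racks, this placement objective can be searched by brute force. The code follows the slide's variables (di map output, ui/vi uplink/downlink bandwidths, ai fraction of reduces); the exact objective is reconstructed from the paper's description, so treat it as a sketch:

```python
def transfer_time(a, d, u, v):
    """Max data-transfer time over racks: rack i uploads d[i]*(1-a[i])
    over u[i] and downloads (D - d[i])*a[i] over v[i]."""
    D = sum(d)
    return max(max(d[i] * (1 - a[i]) / u[i],
                   (D - d[i]) * a[i] / v[i])
               for i in range(len(d)))

def best_two_rack_split(d, u, v, steps=1000):
    """Sweep a1 over [0, 1] (a2 = 1 - a1), keep the minimizing split."""
    return min(((k / steps, 1 - k / steps) for k in range(steps + 1)),
               key=lambda a: transfer_time(a, d, u, v))

# Symmetric racks -> the even split minimizes the bottleneck.
print(best_two_rack_split([10, 10], [1, 1], [1, 1]))
```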
Avoid recomputation
• Replicate task output; restart a task if its data is lost
• Preferentially replicate the output that is most costly to recompute
Data-aware task ordering • Outliers due to large input • Schedule tasks in descending order of dataToProcess • At most 33% worse than optimal scheduling
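A one-line sketch of the ordering rule (the field name `dataToProcess` is taken from the slides; the task representation is an assumption):

```python
def schedule_order(tasks):
    """Schedule tasks in descending order of dataToProcess, so the
    largest tasks start first and cannot become late-phase outliers."""
    return sorted(tasks, key=lambda t: t["dataToProcess"], reverse=True)

tasks = [{"dataToProcess": 3}, {"dataToProcess": 9}, {"dataToProcess": 1}]
print([t["dataToProcess"] for t in schedule_order(tasks)])  # [9, 3, 1]
```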
Estimation of trem and tnew
• trem (remaining time of a running task): with d the task's input size and dread the amount read so far, trem ≈ telapsed · (d − dread)/dread + twrapup
• tnew (runtime of a restarted copy): tnew ≈ processRate · locationFactor · d + schedLag, where processRate is estimated from all tasks in the phase, locationFactor captures machine performance, and d is the input size
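The two estimators can be sketched directly from the variable definitions above. The exact formulas are reconstructions consistent with the slides, not copied from Mantri's code:

```python
def estimate_trem(t_elapsed, d, d_read, t_wrapup=0.0):
    """Remaining time of a running task: scale elapsed time by the
    data still to be read, plus a wrap-up term."""
    return t_elapsed * (d - d_read) / d_read + t_wrapup

def estimate_tnew(process_rate, location_factor, d, sched_lag=0.0):
    """Runtime of a fresh copy: phase-wide per-byte rate, scaled by a
    machine-performance factor, times the input size, plus lag."""
    return process_rate * location_factor * d + sched_lag

# Halfway through the data after 50 s -> roughly 50 s remain.
print(estimate_trem(50, 100, 50))
# 2 s/unit on an average machine, 5 units of input -> 10 s.
print(estimate_tnew(2.0, 1.0, 5))
```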
Results
• Deployed in production Cosmos clusters: prototype Jan '10, baking on pre-production clusters, released May '10
• Trace-driven simulations: thousands of jobs; mimic workflow, task runtime, data skew, failure probability; compare with existing schemes and idealized oracles
Evaluation Methodology • Mantri run on production clusters • Baseline is results from Dryad • Use trace-driven simulations to compare with other systems
Comparing jobs in the wild
• With and without Mantri for one month of jobs in a Bing production cluster
• 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)
In production, restarts improve on native Cosmos by 25% while using fewer resources
In trace-replay simulations (each job repeated thrice), restarts are much better dealt with in a cause- and resource-aware manner
Network-aware Placement (simulation baselines)
• Equal: assume all links have the same bandwidth
• Start: use the bandwidths available at the start of the phase
• Ideal: use the actual available bandwidth at run time
Protecting against recomputes
Summary
• Recomputation: preferentially replicate the output of costly-to-recompute tasks
• Poor network: each job locally avoids network hot-spots
• Bad machines: quarantine persistently faulty machines
• DataToProcess: schedule tasks in descending order of data size
• Others: restart or duplicate tasks, cognizant of resource cost
Conclusion
• Outliers in map-reduce clusters are a significant problem
• They happen due to many causes: an interplay between storage, network, and map-reduce
• Cause- and resource-aware mitigation improves on prior art