
Lecture 14: Combating Outliers in MapReduce Clusters


Presentation Transcript


  1. Lecture 14: Combating Outliers in MapReduce Clusters. Xiaowei Yang

  2. References: • Reining in the Outliers in Map-Reduce Clusters using Mantri by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris • http://research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx

  3. [chart: log(size of cluster) vs. log(size of dataset), GB through EB; MapReduce occupies larger clusters and datasets than HPC and parallel databases, driven by data such as the Internet, click logs, and bio/genomic data] • MapReduce • Decouples customized data operations from mechanisms to scale • Is widely used • Cosmos (based on SVC’s Dryad) + Scope @ Bing • MapReduce @ Google • Hadoop inside Yahoo! and on Amazon’s Cloud (AWS)

  4. An Example. What the user says: SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X. Goal: find frequent search queries to Bing. How it works: [diagram: a job manager assigns work and collects progress; map tasks read file blocks and write their output locally; reduce tasks read the map outputs and write the final output blocks]
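
To make the map/reduce split concrete, here is a minimal single-process sketch in Python of the query above; the names (map_task, reduce_task, query_log) are illustrative only and are not part of Scope or Cosmos.

    from collections import defaultdict

    # Minimal, single-process sketch of:
    #   SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
    # Names (map_task, reduce_task, query_log) are illustrative only.

    def map_task(file_block):
        """Map: emit (query, 1) for every record in one input block."""
        for query in file_block:
            yield query, 1

    def reduce_task(key, values):
        """Reduce: sum the counts for one query."""
        return key, sum(values)

    def run_job(file_blocks, threshold):
        # Shuffle: group intermediate pairs by key (done by the framework).
        groups = defaultdict(list)
        for block in file_blocks:                  # one map task per block
            for query, count in map_task(block):
                groups[query].append(count)
        # One reduce call per key; keep only frequent queries (HAVING Freq > X).
        results = dict(reduce_task(q, counts) for q, counts in groups.items())
        return {q: freq for q, freq in results.items() if freq > threshold}

    if __name__ == "__main__":
        query_log = [["weather", "news", "weather"], ["news", "weather"]]
        print(run_job(query_log, threshold=1))     # {'weather': 3, 'news': 2}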

  5. We find that outliers slow down map-reduce jobs. [workflow diagram with phase counts from the traces: Map.Read 22K, Map.Move 15K, Map 13K, Barrier, Reduce 51K, over the file system] • Goals • Speeding up jobs improves productivity • Predictability supports SLAs • … while using resources efficiently

  6. What is an outlier? • A phase (map or reduce) has n tasks and s slots (available compute resources) • If every task took T seconds to run, the ideal run time would be ceiling(n/s) * T, which even a naïve scheduler achieves • In practice, task i's time ti = f(datasize, code, machine, network) varies • Goal: keep the actual completion time close to the ideal ceiling(n/s) * T
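
As a quick illustration of the ideal bound, here is a small Python helper (the names ideal_time and greedy_completion are mine, not Mantri's) comparing the ideal ceiling(n/s) * T completion time with what a greedy scheduler achieves when one task straggles.

    import math
    import heapq

    def ideal_time(n_tasks, n_slots, t):
        """Ideal phase completion time when every task takes t seconds."""
        return math.ceil(n_tasks / n_slots) * t

    def greedy_completion(task_times, n_slots):
        """Completion time when a greedy scheduler assigns each task, in the
        given order, to the earliest-free slot."""
        slots = [0.0] * n_slots            # time at which each slot frees up
        heapq.heapify(slots)
        for t in task_times:
            start = heapq.heappop(slots)
            heapq.heappush(slots, start + t)
        return max(slots)

    if __name__ == "__main__":
        times = [10.0] * 20                 # 20 uniform tasks...
        times[0] = 80.0                     # ...except one outlier
        print(ideal_time(20, 5, 10.0))      # 40.0: ceil(20/5) * 10
        print(greedy_completion(times, 5))  # 80.0: the single outlier dominates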

  7. From a phase to a job • A job may have many phases • An outlier in an early phase has a cumulative effect • Data loss may cause multi-phase recompute → outliers

  8. Why outliers? Problem: due to unavailable input, tasks have to be recomputed. [diagram: map → sort → reduce pipeline; a delay due to a recompute readily cascades into later phases]

  9. Previous work • The original MapReduce paper observed the problem • But didn't deal with it in depth • Its solution was to duplicate slow tasks • Drawbacks • Some duplicates are unnecessary • Duplicates use extra resources • Duplication does not help when placement is the problem

  10. Quantifying the Outlier Problem • Approach: • Understand the problem before proposing solutions • Understanding often leads to solutions • Prevalence of outliers • Causes of outliers • Impact of outliers

  11. Why bother? Frequency of outliers: stragglers = tasks that take ≥ 1.5 times the median task in that phase; recomputes = tasks that are re-run because their output was lost • 50% of phases have 10% stragglers and no recomputes • 10% of the stragglers take >10X longer
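
A minimal sketch of the straggler definition above, assuming we have per-task runtimes for one phase (the helper name and the sample data are mine):

    from statistics import median

    def find_stragglers(task_runtimes, factor=1.5):
        """Return indices of tasks whose runtime is at least `factor` times
        the median runtime of the phase (the definition on the slide)."""
        med = median(task_runtimes)
        return [i for i, t in enumerate(task_runtimes) if t >= factor * med]

    runtimes = [12, 11, 13, 12, 40, 12, 95]    # seconds, illustrative
    print(find_stragglers(runtimes))           # [4, 6]: the 40s and 95s tasks vs. median 12s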

  12. Causes of outliers: data skew. Duplicating will not help! • In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to a large amount of data moved over the network

  13. Non-outliers can be improved as well • 20% of them run 55% longer than the median

  14. Problem: Tasks reading input over the network experience variable congestion. [diagram: reduce tasks reading map output across racks] Uneven placement is typical in production • reduce tasks are placed at the first available slot

  15. Causes of outliers: cross-rack traffic • 70% of cross-rack traffic is reduce traffic • Tasks in a spot with a slow network run slower • Tasks compete for the network among themselves • Each reduce reads from every map • A reduce is put into any spare slot • 50% of phases take 62% longer to finish than with ideal placement

  16. Causes of outliers: bad and busy machines • 50% of recomputes happen on 5% of the machines • Recomputes increase resource usage

  17. Outliers cluster by time • Resource contention might be the cause • Recomputes cluster by machine • Data loss may cause multiple recomputes

  18. Why bother? Cost of outliers (a what-if analysis that replays logs in a trace-driven simulator): at the median, jobs are slowed down by 35% due to outliers

  19. Mantri Design

  20. High-level idea • Cause-aware and resource-aware • Runtime = f(input, network, machine, dataToProcess, …) • Fix each problem with a different strategy

  21. Resource-aware restarts • Duplicate or kill long outliers

  22. When to restart • Every ∆ seconds, tasks report progress • Estimate t_rem (remaining time of the running copy) and t_new (predicted completion time of a restarted copy)

  23. Restart logic (γ = 3 bounds the number of restarts per task) • Schedule a duplicate if the total running time is expected to be smaller even with one more copy: P(c · t_rem > (c+1) · t_new) > δ, where c is the number of copies currently running • When there are available slots, restart if the expected time saved exceeds the restart overhead: E(t_rem - t_new) > ρ · ∆
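
A minimal Python sketch of this decision logic, assuming we already have samples (or point estimates) of t_rem and t_new for a task; the function names and the way the probability is estimated from independent samples are my own simplifications, not Mantri's actual code.

    def should_duplicate(t_rem_samples, t_new_samples, c, delta):
        """Duplicate if, with probability > delta, the c running copies will
        cost more in aggregate than c+1 copies that include a fresh start:
        P(c * t_rem > (c + 1) * t_new) > delta (probability estimated
        naively over all sample pairs)."""
        hits = sum(1 for tr in t_rem_samples for tn in t_new_samples
                   if c * tr > (c + 1) * tn)
        return hits / (len(t_rem_samples) * len(t_new_samples)) > delta

    def should_restart_on_spare_slot(t_rem_est, t_new_est, rho, report_interval):
        """With spare slots, act only if the expected saving beats the
        scheduling overhead: E(t_rem - t_new) > rho * report_interval."""
        return (t_rem_est - t_new_est) > rho * report_interval

    # Illustrative numbers: a task predicted to need ~120s more, while a
    # fresh copy would likely finish in ~40s.
    print(should_duplicate([110, 120, 130], [35, 40, 45], c=1, delta=0.25))   # True
    print(should_restart_on_spare_slot(120, 40, rho=2, report_interval=10))   # True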

  24. Network-aware placement • Compute the rack location for each task • Find the placement that minimizes the maximum data transfer time • If rack i has d_i map output and u_i, v_i bandwidths available on its uplink and downlink, place an a_i fraction of the reduces on rack i so as to minimize that maximum transfer time (objective sketched below)
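
A reconstruction of the placement objective from the Mantri paper, written in LaTeX with the slide's notation (the exact presentation in the paper may differ slightly): the fractions a_i are chosen to minimize the worst rack's transfer time over its uplink and downlink,

    \min_{a_1,\dots,a_n} \; \max_i \; \max\!\left( \frac{d_i\,(1 - a_i)}{u_i},\;
        \frac{\bigl(\sum_{j \neq i} d_j\bigr)\, a_i}{v_i} \right)
    \quad \text{s.t.} \quad \sum_i a_i = 1,\; a_i \ge 0 .

Here the first term is the time for rack i to push the (1 - a_i) share of its own map output over its uplink, and the second is the time for rack i to pull its a_i share of the other racks' output over its downlink.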

  25. Avoid recomputation • Replicate task output so a lost result need not be recomputed • Restart a task early if its input data are lost • Replicate the output that is most costly to recompute
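
A minimal sketch of the underlying cost comparison, assuming per-task estimates of the loss probability and of the recompute and replication times; this is my paraphrase of the rule (replicate when the expected recompute cost exceeds the replication cost), not the paper's exact formula.

    def should_replicate(p_recompute, t_redo, t_replicate):
        """Replicate a task's output if the expected cost of redoing it
        (probability of loss times time to recompute) exceeds the cost of
        copying the output to another machine."""
        return p_recompute * t_redo > t_replicate

    # Illustrative numbers: output on a flaky machine (10% loss chance),
    # 300s to recompute, 20s to copy elsewhere.
    print(should_replicate(0.10, t_redo=300, t_replicate=20))   # True: copy it
    print(should_replicate(0.01, t_redo=300, t_replicate=20))   # False: not worth it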

  26. Data-aware task ordering • Outliers can be due to large inputs • Schedule tasks in descending order of dataToProcess • This is at most 33% worse than optimal scheduling (the classic bound for longest-first greedy scheduling)
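
A small Python sketch of this ordering rule, reusing a greedy earliest-free-slot assignment; the function name and the numbers are illustrative, not from the paper.

    import heapq

    def phase_time(ordered_sizes, n_slots, rate=1.0):
        """Greedy earliest-free-slot schedule; task time = data size / rate."""
        slots = [0.0] * n_slots
        heapq.heapify(slots)
        for d in ordered_sizes:
            heapq.heappush(slots, heapq.heappop(slots) + d / rate)
        return max(slots)

    sizes = [10, 10, 10, 10, 90, 10, 10, 10, 10]     # one big task among small ones
    print(phase_time(sizes, n_slots=2))                        # 110.0: arrival order
    print(phase_time(sorted(sizes, reverse=True), n_slots=2))  # 90.0: largest-first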

  27. Estimation of t_rem • d: input data size • d_read: the amount of data read so far • t_rem is estimated by extrapolating the elapsed time over the data still to be read (both estimates are sketched after the next slide)

  28. Estimation of t_new • processRate: estimated from the process rates of all tasks in the phase • locationFactor: relative performance of the candidate machine • d: input size
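
A minimal sketch of both estimators, following the description on slides 27 and 28 (t_rem extrapolates the observed progress over the data left to read; t_new scales a phase-wide process rate by the candidate machine's location factor); the wrap-up and scheduling-lag terms and all names here are my own paraphrase of the paper, not its exact formulas.

    def estimate_t_rem(t_elapsed, d, d_read, t_wrapup=0.0):
        """Remaining time of a running copy: project the total time from the
        fraction of input read so far, subtract what has already elapsed,
        and add a wrap-up term for the work done after all input is read."""
        projected_total = t_elapsed * d / max(d_read, 1e-9)
        return max(projected_total - t_elapsed, 0.0) + t_wrapup

    def estimate_t_new(process_rate, location_factor, d, sched_lag=0.0):
        """Predicted completion time of a fresh copy: a phase-wide process
        rate (seconds per byte), scaled by how good or bad the candidate
        location is, times the input size, plus scheduling lag."""
        return process_rate * location_factor * d + sched_lag

    # Illustrative: 100s elapsed, 40% of a 1 GB input read so far.
    print(estimate_t_rem(t_elapsed=100.0, d=1e9, d_read=0.4e9))           # 150.0
    print(estimate_t_new(process_rate=2e-7, location_factor=1.2, d=1e9))  # 240.0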

  29. Results • Deployed in production Cosmos clusters: prototype Jan '10 → baking on pre-production clusters → release May '10 • Trace-driven simulations: thousands of jobs; mimic workflow, task runtime, data skew, failure probability; compare with existing schemes and idealized oracles

  30. Evaluation Methodology • Mantri was run on production clusters • The baseline is results from Dryad • Trace-driven simulations are used to compare with other systems

  31. Comparing jobs in the wild • 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release) • i.e., with and without Mantri for one month of jobs in the Bing production cluster

  32. In production, restarts… improve on native cosmos by 25% while using fewer resources

  33. In trace-replay simulations, restarts… are much better dealt with in a cause- and resource-aware manner. [CDF plots over % cluster resources; each job repeated thrice]

  34. Network-aware placement (comparison baselines) • Equal: assumes all links have the same bandwidth • Start: uses the bandwidths available at the start • Ideal: uses the available bandwidth at run time

  35. Protecting against recomputes [CDF plot over % cluster resources]

  36. Summary • Reduce recomputation: preferentially replicate the output of costly-to-recompute tasks • Poor network: each job locally avoids network hot-spots • Bad machines: quarantine persistently faulty machines • DataToProcess: schedule in descending order of data size • Others: restart or duplicate tasks, cognizant of resource cost; prune copies that no longer help

  37. Conclusion • Outliers in map-reduce clusters are a significant problem • They happen due to many causes: an interplay between storage, network, and map-reduce • Cause- and resource-aware mitigation improves on prior art
