Predicting Execution Bottlenecks in Map-Reduce Clusters
Edward Bortnikov, Ari Frank, Eshcar Hillel, Sriram Rao
Presenting: Alex Shraer
Yahoo! Labs
The Map Reduce (MR) Paradigm • Architecture for scalable information processing • Simple API • Computation scales to Web-scale data collections • Google MR • Pioneered the technology in the early 2000s • Hadoop: open-source implementation • In use at Amazon, eBay, Facebook, Yahoo!, … • Scales to 10K's of nodes (Hadoop 2.0) • Many proprietary implementations • MR technologies at Microsoft, Yandex, …
Computational Model [diagram: mappers M1–M4 read input from DFS; reducers R1–R2 write output to DFS] • Synchronous execution: every R starts computing after all M's have completed • The slowest task (straggler) determines the job latency
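To make the barrier concrete, here is a minimal Python sketch of the execution model above; the task durations are made up for illustration.

```python
# Minimal sketch of the synchronous Map-Reduce execution model.
# Durations are illustrative; in practice they come from the cluster.
map_durations = [40, 42, 38, 190]   # the 190s mapper is a straggler
reduce_durations = [55, 60]

# Barrier: every reducer starts only after ALL mappers have completed,
# so each phase takes as long as its slowest task.
map_phase = max(map_durations)
reduce_phase = max(reduce_durations)

job_latency = map_phase + reduce_phase
print(f"map phase: {map_phase}s, job latency: {job_latency}s")  # 190s + 60s = 250s
```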
Predicting Straggler Tasks • Straggler tasks are an inherent bottleneck • Affect job latency, and to some extent throughput • Two approaches to tackle stragglers • Avoidance – reduce the probability of straggler emergence • Detection – once a task goes astray, speculatively fire a duplicate task somewhere else • This work – straggler prediction • Fits with both avoidance and detection scenarios
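As a rough illustration of how a single slowdown oracle could serve both scenarios, here is a hedged Python sketch; `predict_slowdown` and the threshold are hypothetical, not part of the talk.

```python
# Hedged sketch: one slowdown oracle serving both scenarios.
# predict_slowdown() is a hypothetical callable (task, node) -> slowdown factor.

SPECULATE_THRESHOLD = 5.0  # illustrative cutoff

def place_task(task, candidate_nodes, predict_slowdown):
    """Avoidance: prefer the node with the lowest predicted slowdown."""
    return min(candidate_nodes, key=lambda node: predict_slowdown(task, node))

def should_speculate(running_task, node, predict_slowdown):
    """Detection: fire a duplicate if the task is predicted to go astray."""
    return predict_slowdown(running_task, node) > SPECULATE_THRESHOLD
```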
Background • Detection, Speculative Execution • First implemented in Google MR (OSDI ’04) • Hadoop employs a crude detection heuristic • LATE scheduler (OSDI ‘08) addresses heterogeneous hardware; evaluated only at small scale • Microsoft MR (Mantri project, OSDI ‘10) • Avoidance • Local/rack-local data access is preferred for mappers • … so the network is less likely to become the bottleneck • All optimizations are heuristic
Machine-Learned vs Heuristic Prediction • Heuristics are hard to tune for real workloads, and fail to catch transient bottlenecks • Some evidence from Hadoop grids at Yahoo! • Speculative scheduling is untimely and wasteful • 90% of the fired duplicates are eventually killed • Data-local computation amplifies contention • Can we use the wealth of historical grid performance data to train a machine-learned bottleneck classifier?
Why Should Machine Learning Work? • Huge recurrence of large jobs in production grids • In the target workload, 95% of mappers and reducers belong to jobs that ran 50+ times in a 5-month sample
The Slowdown Metric • Task slowdown factor • Ratio between the task’s running time and the median running time among the sibling tasks in the same job • Root causes • Data skew – input or output significantly exceeds the median for the job • Tasks with skew > 4x are rare • Hotspots – all the other reasons • Congested/misconfigured/degraded nodes, disks, or network • Typically transient; the resulting slowdown can be very high
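A minimal sketch of the metric as defined above, with illustrative task timings:

```python
from statistics import median

def slowdown_factor(task_time, sibling_times):
    """Slowdown = task running time / median running time of its siblings in the job."""
    return task_time / median(sibling_times)

# Illustrative job: most mappers take ~40s, one straggler takes 210s.
times = [40, 38, 42, 41, 210]
print(slowdown_factor(210, times))  # ~5.1x -> straggler
```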
Jobs with Mapper Slowdown > 5x Sample of ~50K jobs • 1% among all jobs • 5% among jobs with 1000 mappers or more • 40% due to data skew (2x or above), 60% due to hotspots
Jobs with Reducer Slowdown > 5x Sample of ~60K jobs • 5% among all jobs • 50% among jobs with 1000 reducers or more • 10% due to data skew (2x or above), 90% due to hotspots
Locality is No Silver Bullet [chart: top contributors of straggler tasks over a 6-hour window] • The same nodes are constantly lagging behind • Weaker CPUs (grid HW is heterogeneous), data hotspots, etc. • Pushing for locality too hard amplifies the problem!
Slowdown Predictor • An oracle plugged into the Map Reduce system • Input: node features + task features • Output: slowdown estimate • Features • M/R metrics (job- and task-level) • DFS metrics (datanode-level) • System metrics (host-level: CPU, RAM, disk I/O, JVM, …) • Network traffic (host-, rack- and cross-rack-level)
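To make the interface concrete, here is a hedged sketch of what the oracle's input might look like; every field name is an assumption drawn from the four metric groups above, not the paper's actual schema.

```python
from dataclasses import dataclass

# Illustrative feature vector for the slowdown oracle; field names are
# assumptions based on the metric groups listed on the slide.
@dataclass
class SlowdownFeatures:
    # M/R metrics (job- and task-level)
    input_bytes: int
    job_median_input_bytes: int
    # System metrics (host-level)
    node_cpu_util: float
    node_disk_io_util: float
    node_hw_generation: int
    # Network traffic (host-, rack-, cross-rack-level)
    rack_net_bytes_per_sec: float
    cross_rack_net_bytes_per_sec: float

def predict_slowdown(features: SlowdownFeatures) -> float:
    """Returns an estimated slowdown factor (model body omitted in this sketch)."""
    ...
```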
Slowdown Prediction – Mappers [plot omitted; annotation: some points are mis-predicted and need improvement]
Slowdown Prediction – Reducers [plot omitted; annotation: predictions are more dispersed than for the mappers]
Some Conclusions • Data skew is the most important signal, but many other signals matter too • Node HW generation is a very significant signal for both mappers and reducers • Large grids undergo continuous HW upgrades • Network traffic features (intra-rack and cross-rack) are much more important for reducers than for mappers • How to collect them efficiently in a real-time setting? • Need to enhance data sampling/weighting to capture outliers better
Takeaways • Slowdown prediction • ML approach to straggler avoidance and detection • Initial evaluation showed viability • Need to enhance training to capture outliers better • Challenge: runtime implementation • A good blend with the modern MR system architecture?
Machine Learning Technique • Gradient Boosted Decision Trees (GBDT) • Additive regression model • Based on an ensemble of binary decision trees • 100 trees, 10 leaf nodes each …
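A minimal sketch of this configuration using scikit-learn's gradient boosting; the talk does not name a toolkit, so the library choice and the synthetic training data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for (feature vector, observed slowdown) training pairs.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))          # 8 illustrative node/task features
y = 1.0 + 4.0 * X[:, 0] * X[:, 3]  # fake slowdown target

# Matches the slide: additive regression over 100 trees, 10 leaves each.
model = GradientBoostingRegressor(n_estimators=100, max_leaf_nodes=10)
model.fit(X, y)

print(model.predict(X[:3]))  # predicted slowdown factors
```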
Challenges – Hadoop Use Case • Hadoop 1.0 – centralized architecture • The single JobTracker process manages all task assignment and scheduling • Has a full picture of Map and Reduce slots across the cluster • Hadoop 2.0 – distributed architecture • Resource management and scheduling functions are split • Thin centralized Resource Manager (RM) creates application containers (e.g., for running a Map Reduce job) • Per-job App Master (AM) does scheduling within a container • May negotiate resource allocation with the RM • Challenge: working with a limited set of local signals
Possible Design – Hadoop 2.0 • Centralized prediction will not scale; will distributed prediction be accurate enough? [architecture diagram: the per-job Application Master hosts the model as a new component or API; the Resource Manager handles app container creation and resource requests; Node Managers in the job execution environment feed metrics back via an extension of the existing heartbeat (HB) protocol] • Some metrics are already collected (CPU ticks, bytes R/W) • Others might be collected either by the NM, or externally
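A hedged sketch of how this flow might look; the classes and the model.predict call are illustrative stand-ins, not actual Hadoop 2.0 APIs.

```python
# Hedged sketch of the proposed Hadoop 2.0 flow; illustrative only.

class NodeManager:
    """Piggybacks locally collected metrics on the existing heartbeat."""
    def __init__(self, node_id):
        self.node_id = node_id

    def heartbeat(self):
        # CPU ticks and bytes R/W are already collected; network counters
        # might be added by the NM or gathered externally.
        return {"cpu_ticks": 123456, "bytes_rw": 10_000_000, "net_bytes": 5_000_000}

class ApplicationMaster:
    """Runs the slowdown model locally, over signals from its own containers."""
    def __init__(self, model):
        self.model = model        # per-AM copy: no central bottleneck
        self.node_metrics = {}

    def on_heartbeat(self, node_id, metrics):
        self.node_metrics[node_id] = metrics

    def pick_node(self, task, candidate_nodes):
        # Distributed prediction sees only local signals, which is exactly
        # the accuracy question raised on the slide.
        return min(candidate_nodes,
                   key=lambda n: self.model.predict(task, self.node_metrics[n]))
```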