
Matei Zaharia , Dhruba Borthakur * , Joydeep Sen Sarma * ,

Fair Scheduling for MapReduce Clusters. Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica. UC Berkeley, *Facebook Inc, +Yahoo! Research.


Presentation Transcript


  1. Fair Scheduling for MapReduce Clusters Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica UC Berkeley, *Facebook Inc, +Yahoo! Research

  2. Motivation • MapReduce / Hadoop originally designed for high throughput batch processing • Today’s workload is far more diverse: • Many users want to share a cluster • Engineering, marketing, business intelligence, etc • Vast majority of jobs are short • Ad-hoc queries, sampling, periodic reports • Response time is critical • Interactive queries, deadline-driven reports • How can we efficiently share MapReduce clusters between users?

  3. Example: Hadoop at Facebook • 2000-node, 20 PB data warehouse, growing at 40 TB/day • Applications: data mining, spam detection, ads • 200 users (half non-engineers) • 7500 MapReduce jobs / day Median: 84s

  4. Approaches to Sharing • Hadoop default scheduler (FIFO) • Problem: short jobs get stuck behind long ones • Separate clusters • Problem 1: poor utilization • Problem 2: costly data replication • Full replication nearly infeasible at FB / Yahoo! scale • Partial replication prevents cross-dataset queries

  5. Our Work: Hadoop Fair Scheduler • Fine-grained sharing at level of map & reduce tasks • Predictable response times and user isolation • Example with 3 job pools:

  6. What This Talk is About • Two challenges for fair sharing in MapReduce • Data locality (conflicts with fairness) • Task interdependence (reduces can’t finish until maps are done, leading to slot hoarding) • Some solutions • Current work: Mesos project

  7. Outline • Challenge 1: data locality • Solution: delay scheduling • Challenge 2: reduce slot hoarding • Solution: copy-compute splitting • Current work: Mesos

  8. The Problem [Figure: a master schedules tasks from Job 1 and Job 2, in scheduling order, onto six slaves that hold replicated blocks of File 1 and File 2]

  9. The Problem [Figure: the same cluster with the scheduling order switched to Job 2, then Job 1] • Problem: Fair decision hurts locality • Especially bad for jobs with small input files

  10. Locality vs. Job Size at Facebook [Figure: data locality versus job size at Facebook; annotation: 58% of jobs]

  11. Special Instance: Sticky Slots • Under fair sharing, locality can be poor even when all jobs have large input files • Problem: jobs get “stuck” in the same set of task slots • When a task in job j finishes, the slot it was running in is given back to j, because j is below its share • Bad because data files are spread out across all nodes [Figure: a master and four slaves, each running one task]

  12. Special Instance: Sticky Slots • Under fair sharing, locality can be poor even when all jobs have large input files • Problem: jobs get “stuck” in the same set of task slots • When a task in job j finishes, the slot it was running in is given back to j, because j is below its share • Bad because data files are spread out across all nodes
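
The sticky-slot effect described above can be reproduced in a few lines. This is a minimal sketch, assuming the fair scheduler hands each freed slot to the job furthest below its share; `pick_job`, the job names, and the slot layout are illustrative and not taken from Hadoop.

```python
def pick_job(running, share):
    """Return the job with the largest deficit (fair share minus running tasks)."""
    return max(running, key=lambda job: share[job] - running[job])


# Two jobs, each entitled to 2 of the 4 slots and each currently using 2.
share = {"A": 2, "B": 2}
running = {"A": 2, "B": 2}
slot_owner = ["A", "A", "B", "B"]  # which job occupies each slot

history = []
for finished_slot in [0, 2, 1, 3, 0]:
    job = slot_owner[finished_slot]
    running[job] -= 1                  # a task finishes and frees its slot
    winner = pick_job(running, share)  # the job that just lost a task is now below its share
    running[winner] += 1
    slot_owner[finished_slot] = winner
    history.append(winner)
```

In this sketch every freed slot goes straight back to the job that released it, so `slot_owner` never changes and neither job ever sees the nodes held by the other.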

  13. Solution: Delay Scheduling • Relax queuing policy to make jobs wait for a limited time if they cannot launch local tasks • Result: Very short wait time (1-5s) is enough to get nearly 100% locality

  14. Delay Scheduling Example [Figure: the master tells a job to "Wait!" instead of launching a non-local task; tasks from Job 1 and Job 2 are then placed on slaves that hold their input blocks of File 1 and File 2] • Idea: Wait a short time to get data-local scheduling opportunities

  15. Delay Scheduling Details • Scan jobs in order given by queuing policy, picking first that is permitted to launch a task • Jobs must wait before being permitted to launch non-local tasks • If wait < T1, only allow node-local tasks • If T1 < wait < T2, also allow rack-local • If wait > T2, also allow off-rack • Increase a job’s time waited when it is skipped
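
The rules on this slide can be sketched as a single assignment function. This is an illustrative sketch, not the Hadoop Fair Scheduler code: the names `Job`, `assign_task`, `rack_of`, and `elapsed` are hypothetical, and resetting a job's wait after a node-local launch is an assumption of this sketch that the slide does not state.

```python
from dataclasses import dataclass

NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2


@dataclass
class Job:
    name: str
    pending: dict       # task id -> set of nodes holding a replica of its input
    wait: float = 0.0   # time waited so far; grows each time the job is skipped


def allowed_level(wait, t1, t2):
    """Worst locality level a job may use after waiting `wait` seconds."""
    if wait < t1:
        return NODE_LOCAL
    if wait < t2:
        return RACK_LOCAL
    return OFF_RACK


def locality(replica_nodes, node, rack_of):
    """Locality level of running a task with the given replicas on `node`."""
    if node in replica_nodes:
        return NODE_LOCAL
    if any(rack_of[n] == rack_of[node] for n in replica_nodes):
        return RACK_LOCAL
    return OFF_RACK


def assign_task(jobs, node, rack_of, t1, t2, elapsed):
    """Pick a task for a free slot on `node`.

    `jobs` must already be sorted by the queuing policy (e.g. fair sharing).
    `elapsed` is the time credited to a job each time it is skipped.
    Returns (job, task_id), or None if no job may launch here yet.
    """
    for job in jobs:
        if not job.pending:
            continue
        best = min(job.pending, key=lambda t: locality(job.pending[t], node, rack_of))
        level = locality(job.pending[best], node, rack_of)
        if level <= allowed_level(job.wait, t1, t2):
            del job.pending[best]
            if level == NODE_LOCAL:
                job.wait = 0.0  # assumption: a local launch resets the wait
            return job, best
        job.wait += elapsed     # skipped: increase the job's time waited
    return None
```

A job at the head of the queue with no local data on the free node is skipped in favor of a later job, and only launches non-locally once its accumulated wait passes T1 (rack-local) or T2 (off-rack).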

  16. Analysis • When is it worth it to wait, and for how long? • Waiting worth it if E(wait time) < cost to run non-locally • Simple model: Data-local scheduling opportunities arrive following a Poisson process • Expected response time gain from delay scheduling: $E(\text{gain}) = (1 - e^{-w/t})(d - t)$, where $w$ is the wait amount, $t$ is the expected time to get a local scheduling opportunity, and $d$ is the delay from running non-locally • Optimal wait time is infinity

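  17. Analysis Continued $E(\text{gain}) = (1 - e^{-w/t})(d - t)$, where $w$ is the wait amount, $t$ is the expected time to get a local scheduling opportunity, and $d$ is the delay from running non-locally • Typical values of t and d: t = (avg task length) / (file replication × tasks per node), d ≥ (avg task length) • Values annotated on the slide: file replication = 3, tasks per node = 6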
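
To get a feel for the numbers, the gain formula can be evaluated directly, reading it as (1 - e^(-w/t))(d - t) with w the wait amount, t the expected time to a local scheduling opportunity, and d the delay from running non-locally. This is a small sketch; the function names are hypothetical and the 60-second average task length is an assumed value chosen only for illustration.

```python
import math


def mean_time_to_local_slot(avg_task_length, replication, tasks_per_node):
    """t = (avg task length) / (file replication * tasks per node)."""
    return avg_task_length / (replication * tasks_per_node)


def expected_gain(w, t, d):
    """E(gain) = (1 - exp(-w / t)) * (d - t) for wait amount w."""
    return (1.0 - math.exp(-w / t)) * (d - t)


avg_task_length = 60.0  # seconds; assumed for illustration
t = mean_time_to_local_slot(avg_task_length, replication=3, tasks_per_node=6)
d = avg_task_length     # the slide gives d >= avg task length; take the lower bound

for w in (1.0, 5.0, 30.0):
    print(f"w = {w:5.1f}s  E(gain) = {expected_gain(w, t, d):6.2f}s")
```

Whenever d > t the gain grows with w and approaches d - t, which is why the model puts the optimal wait at infinity, while most of the benefit is already captured once w is a few multiples of t.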

  18. More Analysis • What if some nodes are occupied by long tasks? • Suppose we have: • Fraction $\varphi$ of active tasks are long • R replicas of each block • L task slots per node • For a given data block, P(all replicas blocked by long tasks) = $\varphi^{RL}$ • With R = 3 and L = 6, this is less than 2% if $\varphi$ < 0.8
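
The 2% figure can be checked numerically. This sketch assumes the slide's estimate is the long-task fraction raised to the power R × L (every one of the R × L slots that could serve the block is held by a long task, treated as independent events); that reading is the one under which the quoted bound comes out, and the function name is hypothetical.

```python
def prob_all_replicas_blocked(long_fraction, replicas, slots_per_node):
    """P(all slots on all replica nodes are held by long tasks) = fraction ** (R * L)."""
    return long_fraction ** (replicas * slots_per_node)


p = prob_all_replicas_blocked(0.8, replicas=3, slots_per_node=6)
print(f"P(blocked) at 80% long tasks: {p:.4f}")
```

Even with 80% of active tasks being long, the estimate stays just under 2%, so a short wait is still very likely to find a slot near the data.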

  19. Evaluation • 100-node EC2 cluster, 4 cores/node • Job submission schedule based on job sizes and inter-arrival times at Facebook • 100 jobs grouped into 9 “bins” of sizes • Three workloads: • IO-heavy, CPU-heavy, mixed • Three schedulers: • FIFO • Fair sharing • Fair + delay scheduling (wait time = 5s)

  20. Results for IO-Heavy Workload [Figure: job response times for small jobs (1-10 input blocks), medium jobs (50-800 input blocks), and large jobs (4800 input blocks); annotations: up to 5x, 40%]

  21. Results for IO-Heavy Workload

  22. Sticky Slots Microbenchmark • 5-50 big concurrent jobs on 100-node EC2 cluster • 5s delay scheduling [Figure annotation: 2x]

  23. Summary • Delay scheduling works well under two conditions: • Sufficient fraction of tasks are short relative to jobs • There are many locations where a task can run efficiently • Blocks replicated across nodes, multiple tasks/node • Generalizes beyond fair sharing & MapReduce • Applies to any queuing policy that gives ordered job list • Used in Hadoop Fair Scheduler to implement hierarchical policy • Applies to placement needs other than data locality

  24. Outline • Challenge 1: data locality • Solution: delay scheduling • Challenge 2: reduce slot hoarding • Solution: copy-compute splitting • Current work: Mesos

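  25. The Problem [Figure: map-slot and reduce-slot usage over time for Job 1 and Job 2]

  26. The Problem [Figure: map-slot and reduce-slot usage over time for Job 1 and Job 2; annotation: Job 2 maps done]

  27. The Problem [Figure: map-slot and reduce-slot usage over time for Job 1 and Job 2; annotation: Job 2 reduces done]

  28. The Problem [Figure: map-slot and reduce-slot usage over time for Jobs 1, 2 and 3; annotation: Job 3 submitted]

  29. The Problem [Figure: map-slot and reduce-slot usage over time for Jobs 1, 2 and 3; annotation: Job 3 maps done]

  30. The Problem [Figure: map-slot and reduce-slot usage over time for Jobs 1, 2 and 3]

  31. The Problem [Figure: map-slot and reduce-slot usage over time for Jobs 1, 2 and 3] • Problem: Job 3 can’t launch reduces until Job 1 finishes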

  32. Solutions to Reduce Slot Hoarding • Task killing (preemption): • If a job’s fair share is not met after a timeout, kill tasks from jobs over their share • Pros: simple, predictable • Cons: wasted work • Available in Hadoop 0.21 and CDH 3
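
The preemption rule on this slide can be sketched as a victim-selection function. This is an illustrative sketch only: `JobState` and `tasks_to_kill` are hypothetical names, and killing the most recently started tasks first (to limit wasted work) is an assumption of this sketch rather than something the slide specifies.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    name: str
    fair_share: int           # slots the job is entitled to
    running: list             # start times (seconds) of its running tasks
    starved_for: float = 0.0  # seconds the job has spent below its fair share


def tasks_to_kill(jobs, timeout):
    """Return (job name, task start time) pairs to kill.

    Slots are reclaimed only for jobs that have been below their fair share
    for at least `timeout` seconds, and only from jobs above their share.
    """
    needed = sum(
        j.fair_share - len(j.running)
        for j in jobs
        if len(j.running) < j.fair_share and j.starved_for >= timeout
    )
    victims = []
    for j in jobs:
        if needed <= 0:
            break
        surplus = len(j.running) - j.fair_share
        if surplus <= 0:
            continue
        count = min(surplus, needed)
        newest_first = sorted(j.running, reverse=True)[:count]  # assumption: kill newest
        victims.extend((j.name, start) for start in newest_first)
        needed -= count
    return victims
```

A job is never pushed below its own fair share by this rule, because at most its surplus is taken.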

  33. Solutions to Reduce Slot Hoarding • Copy-compute splitting: • Divide reduce tasks into two types of tasks: “copy tasks” and “compute tasks” • Separate admission control for these two • Larger # of copy slots / node, with per-job cap • Smaller # of compute slots that tasks “graduate” into • Pros: no wasted work, overlaps IO & CPU more • Cons: extra memory, may cause IO contention
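
The two admission checks can be sketched per node as follows. This is a minimal sketch under stated assumptions: the class `Node` and its methods are hypothetical, and it assumes a reduce task releases its copy slot at the moment it graduates into a compute slot, which the slide does not spell out.

```python
class Node:
    """Tracks copy and compute slots for reduce tasks on one node."""

    def __init__(self, copy_slots, compute_slots, per_job_copy_cap):
        self.copy_slots = copy_slots            # larger pool, IO-bound phase
        self.compute_slots = compute_slots      # smaller pool, CPU-bound phase
        self.per_job_copy_cap = per_job_copy_cap
        self.copying = {}                       # job name -> tasks in the copy phase
        self.computing = 0                      # tasks in the compute phase

    def try_start_copy(self, job):
        """Admit a reduce task into its copy phase if a slot and the job's cap allow."""
        in_use = sum(self.copying.values())
        if in_use >= self.copy_slots:
            return False
        if self.copying.get(job, 0) >= self.per_job_copy_cap:
            return False
        self.copying[job] = self.copying.get(job, 0) + 1
        return True

    def try_graduate(self, job):
        """Move one of the job's copy tasks into a compute slot, if one is free."""
        if self.copying.get(job, 0) == 0 or self.computing >= self.compute_slots:
            return False
        self.copying[job] -= 1  # assumption: the copy slot is released on graduation
        self.computing += 1
        return True

    def finish_compute(self):
        """A compute task finished and frees its slot."""
        if self.computing > 0:
            self.computing -= 1
```

The per-job cap is what keeps one job with many reduces from filling every copy slot, while the small compute pool bounds how many CPU-heavy phases run at once.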

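  34. Copy-Comp. Splitting in Single Job [Figure: map-slot and reduce-slot usage over time for a single job; annotation: Maps done]

  35. Copy-Comp. Splitting in Single Job [Figure: reduce slots split into copy and compute slots over time; annotations: Maps done, Only IO (copy), Only Comp. (compute)]

  36. Copy-Comp. Splitting in Single Job [Figure: map, copy, and compute slot usage over time for a single job; annotation: Maps done]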

  37. Copy-Comp. Splitting Evaluation • Initial experiments showed 10% gain in single job and 1.5-2x response time gains over naïve fair sharing in multi-job scenario • Copy-compute splitting not yet implemented anywhere (killing was good enough for now)

  38. Outline • Challenge 1: data locality • Solution: delay scheduling • Challenge 2: reduce slot hoarding • Solution: copy-compute splitting • Current work: Mesos

  39. Challenges in Hadoop Cluster Management • Isolation (both fault and resource isolation) • E.g. if co-locating production and ad-hoc jobs • Testing and deploying upgrades • JobTracker scalability (in larger clusters) • Running non-Hadoop jobs • MPI, streaming MapReduce, etc

  40. Mesos • Mesos is a cluster resource manager over which multiple instances of Hadoop and other parallel applications can run [Figure: Hadoop (production), Hadoop (ad-hoc), MPI, and other frameworks running on top of Mesos, which spans the cluster nodes]

  41. What’s Different About It? Other resource managers exist today, including • Batch schedulers (e.g. Torque, Sun Grid Engine) • VM schedulers (e.g. Eucalyptus) However, these have 2 problems with Hadoop-like workloads: • Data locality hurt due to static partitioning • Utilization compromised because jobs hold nodes for their full duration Mesos addresses these through two features: • Fine-grained sharing at the level of tasks • Two-level scheduling where jobs control placement
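
The idea of two-level scheduling, where the resource manager decides who is offered a free slot and each framework decides whether that placement suits it, can be sketched as below. This is a toy illustration and not the Mesos API: `FrameworkScheduler`, `on_offer`, and `offer_slot` are hypothetical names, and the accept rule (take a slot only on a preferred node) is an assumption chosen to mirror the data-locality concern.

```python
class FrameworkScheduler:
    """Second level: a framework decides whether an offered slot suits it."""

    def __init__(self, name, preferred_nodes, pending_tasks):
        self.name = name
        self.preferred_nodes = set(preferred_nodes)  # e.g. nodes holding its input data
        self.pending_tasks = pending_tasks

    def on_offer(self, node):
        """Accept the slot only if there is work to do and the node is preferred."""
        if self.pending_tasks > 0 and node in self.preferred_nodes:
            self.pending_tasks -= 1
            return True
        return False


def offer_slot(node, frameworks):
    """First level: offer a free slot on `node` to frameworks in priority order.

    Returns the name of the framework that accepted, or None if all declined.
    """
    for framework in frameworks:
        if framework.on_offer(node):
            return framework.name
    return None


production = FrameworkScheduler("hadoop-production", {"node1", "node2"}, pending_tasks=1)
adhoc = FrameworkScheduler("hadoop-adhoc", {"node3"}, pending_tasks=2)
placements = [offer_slot(n, [production, adhoc]) for n in ("node3", "node1", "node1")]
```

Because each framework can decline an offer, placement logic such as locality stays inside the framework while the sharing policy stays in the resource manager.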

  42. Mesos Architecture [Figure: Hadoop and MPI jobs submit to their own schedulers (Hadoop 0.19 scheduler, Hadoop 0.20 scheduler, MPI scheduler), which communicate with the Mesos master; each Mesos slave hosts Hadoop 19, Hadoop 20, or MPI executors, which run the tasks]

  43. Mesos Status • Implemented in 10,000 lines of C++ • Efficient event-driven messaging lets master scale to 10,000 scheduling decisions / second • Fault tolerance using ZooKeeper • Supports Hadoop, MPI & Torque • Test deployments at Twitter and Facebook • Open source, at www.github.com/mesos

  44. Thanks! • Any questions? • matei@eecs.berkeley.edu
