1 / 43

Sparrow

Sparrow. Kay Ousterhout, Patrick Wendell, Matei Zaharia , Ion Stoica. Distributed Low-Latency Scheduling. Sparrow schedules tasks in clusters using a decentralized, randomized approach. Sparrow schedules tasks in clusters using a decentralized, randomized approach

baruch
Download Presentation

Sparrow

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sparrow Kay Ousterhout, Patrick Wendell, MateiZaharia, Ion Stoica Distributed Low-Latency Scheduling

  2. Sparrow schedules tasks in clusters using a decentralized, randomized approach

  3. Sparrow schedules tasks in clusters using a decentralized, randomized approach support constraints and fair sharing and provides response times within 12% of ideal

  4. Scheduling Setting … Map Reduce/Spark/Dryad Job Task Task Task Map Reduce/Spark/Dryad Job … … Task Task …

  5. Job Latencies Rapidly Decreasing 2012: Impala query 2010: Dremel Query 2010: In-memory Spark query 2004: MapReduce batch job 2009: Hive query 2013: Spark streaming 10 sec. 10 min. 100 ms 1 ms

  6. Scheduling challenges: Millisecond Latency Quality Placement Fault Tolerant High Throughput

  7. 1000 16-core machines 2012: Impala query 2010: Dremel Query 2010: In-memory Spark query 2004: MapReduce batch job 2009: Hive query 2013: Spark streaming 10 sec. 10 min. 100 ms 1 ms 1.6K decisions/second 26 decisions/second 160K decisions/second 16M decisions/second Scheduler throughput

  8. Today: Completely Centralized Sparrow: Completely Decentralized Less centralization Millisecond Latency Quality Placement Fault Tolerant High Throughput ✗ ✓ ? ✓ ✓ ✗ ✓ ✗ ✓

  9. Sparrow Decentralized approach Existing randomized approaches Batch Sampling Late Binding Analytical performance evaluation Handling constraints Fairness and policy enforcement Within 12% of ideal on 100 machines

  10. Scheduling with Sparrow Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker

  11. Random Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker

  12. Simulated Results 100-task jobs in 10,000-node cluster, exp. task durations Omniscient: infinitely fast centralized scheduler

  13. Per-task sampling Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker Power of Two Choices

  14. Per-task sampling Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker Power of Two Choices

  15. Simulated Results 100-task jobs in 10,000-node cluster, exp. task durations

  16. Response Time Grows with Tasks/Job! 70% cluster load

  17. Per-Task Sampling Per-task Worker ✓ Task 1 Scheduler Worker Job Scheduler Worker Task 2 Worker Scheduler Worker Scheduler Worker ✓

  18. Per-task Sampling Batch Per-task 4 probes (d = 2) Worker ✓ Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker ✓ Place m tasks on the least loaded of dmslaves

  19. Per-task versus Batch Sampling 70% cluster load

  20. Simulated Results 100-task jobs in 10,000-node cluster, exp. task durations

  21. 80 ms Queue length poor predictor of wait time 155 ms Worker Worker 530 ms Poor performance on heterogeneous workloads

  22. Late Binding 4 probes (d = 2) Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker Place m tasks on the least loaded of dmslaves

  23. Late Binding 4 probes (d = 2) Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker Place m tasks on the least loaded of dmslaves

  24. Late Binding Worker requests task Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker Place m tasks on the least loaded of dmslaves

  25. Simulated Results 100-task jobs in 10,000-node cluster, exp. task durations

  26. What about constraints?

  27. Job Constraints Restrict probed machines to those that satisfy the constraint Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker

  28. Per-Task Constraints Probe separately for each task Worker Scheduler Worker Job Scheduler Worker Worker Scheduler Worker Scheduler Worker

  29. Technique Recap Worker Scheduler Worker Batch sampling + Late binding + Constraint handling Scheduler Worker Worker Scheduler Worker Scheduler Worker

  30. How does Sparrow perform on a real cluster?

  31. Spark on Sparrow Sparrow Scheduler Worker Worker Query: DAG of Stages Worker Job Worker Worker Worker

  32. Spark on Sparrow Sparrow Scheduler Worker Worker Query: DAG of Stages Worker Worker Job Worker Worker

  33. Spark on Sparrow Sparrow Scheduler Worker Worker Query: DAG of Stages Worker Worker Worker Job Worker

  34. How does Sparrow compare to Spark’s native scheduler? 100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load

  35. TPC-H Queries: Background TPC-H: Common benchmark for analytics workloads Shark: SQL execution engine Spark: Distributed in-memory analytics framework Sparrow

  36. TPC-H Queries Percentiles 100 16-core EC2 nodes, 10 schedulers, 80% load 95 75 50 25 5

  37. TPC-H Queries 100 16-core EC2 nodes, 10 schedulers, 80% load Within 12% of ideal Median queuing delay of 9ms

  38. Fault Tolerance Timeout: 100ms Failover: 5ms Re-launch queries: 15ms ✗ Scheduler 1 Spark Client 1 Spark Client 2 Scheduler 2

  39. When does Sparrow not work as well? High cluster load

  40. Related Work Centralized task schedulers: e.g., Quincy Two level schedulers: e.g., YARN, Mesos Coarse-grained cluster schedulers: e.g., Omega Load balancing: single task

  41. www.github.com/radlab/sparrow Worker Scheduler Worker Batch sampling + Late binding + Constraint handling Scheduler Worker Worker Scheduler Worker Scheduler Worker Sparrows provides near-ideal job response times without global visibility

  42. Backup Slides

  43. Policy Enforcement Can we do better without losing simplicity? Priorities Serve queues based on strict priorities Fair Shares Serve queues using weighted fair queuing Slave Slave High Priority User A (75%) User B (25%) Low Priority

More Related