The Case for Tiny Tasks in Compute Clusters
Kay Ousterhout*, Aurojit Panda*, Joshua Rosen*, Shivaram Venkataraman*, Reynold Xin*, Sylvia Ratnasamy*, Scott Shenker*+, Ion Stoica*
*UC Berkeley, +ICSI
Setting
[Figure: a MapReduce/Spark/Dryad job is divided into many parallel tasks]
Use smaller tasks!
[Figure: today's large tasks vs. tiny tasks]
Why? How? Where?
Why?
Problem: Skew and Stragglers
Contended machine? Data skew?
Benefit: Handling of Skew and Stragglers
[Figure: today's tasks vs. tiny tasks] As much as a 5.2x reduction in job completion time!
Problem: Batch and Interactive Sharing
Clusters are forced to trade off utilization and responsiveness!
[Figure: a high-priority interactive job arrives while low-priority batch tasks occupy the cluster]
Benefit: Improved Sharing
[Figure: today's tasks vs. tiny tasks] High-priority tasks are not subject to long wait times!
Benefits: Recap
(1) Straggler mitigation, previously addressed by Mantri (OSDI '10), Scarlett (EuroSys '11), SkewTune (SIGMOD '12), Dolly (NSDI '13), …
(2) Improved sharing, previously addressed by Quincy (SOSP '09), Amoeba (SoCC '12), …
How?
Schedule task
Scheduling requirements: high throughput (millions of task scheduling decisions per second), low latency (milliseconds), distributed scheduling (e.g., the Sparrow scheduler).
Launch task
Use an existing thread pool to launch tasks and cache task binaries: a task launch then costs only an RPC (<1 ms). A sketch follows.
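A minimal sketch of this launch path, with assumed names (Worker, binary_cache) that are illustrative rather than from the talk: a long-lived thread pool avoids per-task process startup, and a per-job cache of task code means only a job's first task pays the deserialization cost.

```python
# Sketch only: worker-side launch of tiny tasks on a pre-warmed thread pool.
import pickle
from concurrent.futures import ThreadPoolExecutor

class Worker:
    def __init__(self, num_slots=16):
        # Long-lived pool: no process fork or runtime startup per task.
        self.pool = ThreadPoolExecutor(max_workers=num_slots)
        self.binary_cache = {}  # job_id -> deserialized task function

    def launch(self, job_id, task_fn_bytes, args):
        fn = self.binary_cache.get(job_id)
        if fn is None:
            fn = pickle.loads(task_fn_bytes)  # paid once per job, not per task
            self.binary_cache[job_id] = fn
        # Launch cost is now just the scheduler RPC plus a queue insert.
        return self.pool.submit(fn, *args)
```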
Read input data
Smallest efficient file block size: 8 MB. Distribute file metadata (à la Flat Datacenter Storage, OSDI '12); a sketch follows.
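A rough sketch of what distributed metadata can look like, loosely modeled on FDS's locator-table idea (the details here are assumptions, not the talk's design): clients compute a block's server deterministically from a small, rarely-changing table, so no central metadata server is consulted for every 8 MB block.

```python
# Sketch: deterministic block-to-server placement via a shared locator table.
import hashlib

def block_server(locator_table, file_id: str, block_num: int) -> str:
    # Hash the file id once; offsetting by block number spreads a file's
    # blocks across servers. The table is the only shared metadata.
    h = int(hashlib.md5(file_id.encode()).hexdigest(), 16)
    return locator_table[(h + block_num) % len(locator_table)]

servers = ["srv-%d" % i for i in range(100)]
print(block_server(servers, "/logs/2013-02-03", block_num=7))
```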
Execute task + read data for next task
Tiny tasks mean tons of tiny transfers! Framework-controlled I/O enables optimizations such as pipelining, sketched below.
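A minimal sketch of framework-controlled pipelining, with assumed helper signatures read_block(task) and run(task, data): the framework prefetches task i+1's input while task i executes, hiding the many tiny transfers behind computation.

```python
# Sketch: overlap each task's execution with the next task's input read.
from concurrent.futures import ThreadPoolExecutor

def run_tasks_pipelined(tasks, read_block, run):
    if not tasks:
        return
    io_pool = ThreadPoolExecutor(max_workers=1)
    next_input = io_pool.submit(read_block, tasks[0])
    for i, task in enumerate(tasks):
        data = next_input.result()              # wait for this task's input
        if i + 1 < len(tasks):
            next_input = io_pool.submit(read_block, tasks[i + 1])  # prefetch
        run(task, data)                         # compute overlaps the next read
```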
How low can you go? With an 8 MB disk block to read, a task takes 100's of milliseconds.
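A back-of-the-envelope check on that floor, assuming roughly 100 MB/s of sequential disk bandwidth (a typical figure for the era, not a number from the talk):

```python
block_mb = 8            # smallest efficient block size from the talk
disk_mb_per_s = 100.0   # assumed sequential disk bandwidth
read_ms = 1000 * block_mb / disk_mb_per_s
print("8 MB block read: %.0f ms" % read_ms)  # ~80 ms, so task durations
                                             # bottom out near 100 ms
```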
Where?
Original Job vs. Tiny Tasks Job
[Figure: each original map task (1..N) is split into many tiny map tasks; each reduce task gathers its key groups (K1, K2, …, Kn) from all map outputs]
Original Reduce Phase
[Figure: Reduce Task 1 must process all values for key K1] What is the tiny-tasks equivalent of such a large reduce task?
Splitting Large Tasks
• Aggregation trees: work for functions that are associative and commutative (sketched below)
• Framework-managed temporary state store
• Ultimately, need to allow a small number of large tasks
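A minimal sketch of the aggregation-tree idea for a combine function that is associative and commutative (sum, max, …); the names are illustrative. Each reduce_chunk call touches only a few partial results, so it is small enough to run as its own tiny task.

```python
# Sketch: split one big reduce into a tree of small combine steps.
def reduce_chunk(chunk, combine):
    acc = chunk[0]
    for x in chunk[1:]:
        acc = combine(acc, x)
    return acc

def tree_aggregate(partials, combine, fan_in=2):
    level = list(partials)
    while len(level) > 1:
        # Each combine step below could be scheduled as a tiny task.
        level = [reduce_chunk(level[i:i + fan_in], combine)
                 for i in range(0, len(level), fan_in)]
    return level[0]

print(tree_aggregate(range(1, 101), lambda a, b: a + b))  # 5050
```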
Recap
Tiny tasks mitigate stragglers and improve sharing. Key mechanisms: distributed scheduling, launching tasks in an existing thread pool, distributed file metadata, and pipelined task execution.
Questions? Find me or Shivaram:
Benefit of Eliminating Stragglers (Based on a Facebook Trace)
5.2x improvement at the 95th percentile!
Why Not Preemption?
• Preemption only handles sharing (not stragglers)
• Task migration is time-consuming
• Tiny tasks improve fault tolerance
Dremel/Drill/Impala
• Similar goals and challenges (supporting short tasks)
• Dremel statically assigns tablets to machines and rebalances if the query dispatcher notices that a machine is processing a tablet slowly (standard straggler mitigation)
• Most jobs are expected to be interactive (no sharing)
Scheduling Throughput
10,000 machines × 16 cores/machine running 100 millisecond tasks means each core completes 10 tasks per second: 10,000 × 16 × 10 = 1.6 million task scheduling decisions per second.
Sparrow: Technique
To place m tasks, probe d·m randomly chosen slaves and put the tasks on the least loaded (e.g., m = 2 tasks with d = 2 means 4 probes); a sketch follows. More at tinyurl.com/sparrow-scheduler
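A toy sketch of that placement rule with simulated queue lengths (not Sparrow's actual implementation, which adds refinements such as late binding): sample d·m slaves, then assign the m tasks to the least loaded of them.

```python
# Sketch: Sparrow-style batch sampling (power of d choices).
import random

def place_tasks(worker_queues, m, d=2):
    probed = random.sample(range(len(worker_queues)), d * m)  # d*m probes
    # Pick the m probed workers with the shortest queues.
    best = sorted(probed, key=lambda w: worker_queues[w])[:m]
    for w in best:
        worker_queues[w] += 1   # enqueue one task per chosen worker
    return best

queues = [random.randint(0, 5) for _ in range(1000)]
print(place_tasks(queues, m=2))  # 4 probes when m = 2 and d = 2
```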
Sparrow: Performance on a TPC-H Workload
Within 12% of offline optimal, with a median queuing delay of 8 ms. More at tinyurl.com/sparrow-scheduler