
Natjam: Supporting Deadlines and Priorities in a Mapreduce Cluster

Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta, Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin. University of Illinois (Urbana-Champaign).

Presentation Transcript


  1. Natjam: Supporting Deadlines and Priorities in a Mapreduce Cluster. Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta, Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin. University of Illinois (Urbana-Champaign). Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu

  2. Hadoop Jobs have Priorities • Dual Priority Case • Production jobs (high priority) • Time sensitive • Directly affect criticality or revenue • Research jobs (low priority) • e.g., long term analysis • Example: Ad provider • Production: count clicks in ad click-through logs, then update ads (slow counts → show old ads → don’t get paid $$$) • Research: run machine learning analysis on daily and historical logs (is there a better way to place ads?) • Prioritize production jobs

  3. State-of-the-art: Separate clusters • Production cluster receives production jobs (high priority) • Research cluster receives research jobs (low priority) • Traces reveal large periods of under-utilization in each cluster • Long job completion times • Human involvement in job management • Goal: single consolidated cluster for all priorities and deadlines • Prioritize production jobs and yet affect research jobs least • Today’s Options: • Wait for research tasks to finish (e.g., Capacity Scheduler) → Prolongs production jobs • Kill research tasks (e.g., Fair Scheduler), which can lead to repeated work → Prolongs research jobs

  4. Natjam’s Techniques • Scale down research jobs by • Preempting some Reduce tasks • Fast on-demand automated checkpointing of task state • Later, reduces can resume where they left off • Focus on Reduces: Reduce tasks take longer, so more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook]) • Job Eviction Policies • Task Eviction Policies

  5. Natjam built into Hadoop YARN Architecture • Preemptor • Chooses Victim Job • Reclaims queue resources • Releaser • Chooses Victim Task • Local Suspender • Saves state of Victim Task • [Architecture diagram: Resource Manager (Capacity Scheduler, Preemptor); Node Managers A and B hosting Application Masters 1 and 2, their tasks, a Releaser and a Local Suspender each; labeled steps: ask container, preempt(), # containers to release, suspend, saved state, release() (empty container), resume()]

  6. Suspending and Resuming Tasks • Existing intermediate data used • Reduce inputs, stored at local host • Reduce outputs, stored on HDFS • Suspended task state saved locally, so resume can avoid network overhead • Checkpoint state saved • Key counter • Reduce input path • Hostname • List of suspended task attempt IDs • [Diagram: Task Attempt 1 (tmp/task_att_1) is suspended, its container is freed and its suspend state (key counter) is saved; resumed Task Attempt 2 (tmp/task_att_2) skips the inputs already counted by the key counter; outputs go to outdir/ on HDFS]
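
The key-counter checkpoint on slide 6 can be illustrated with a minimal Python sketch. It is not Natjam's code: `Checkpoint`, `run_reduce`, the `suspend_at` parameter (which simulates a suspend request arriving before a given key), and the in-memory `outdir` list standing in for HDFS are all hypothetical names introduced here. It assumes reduce inputs are replayed in the same key order on resume.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

KeyedInput = Tuple[str, List[int]]


@dataclass
class Checkpoint:
    """Hypothetical record holding the four fields named on the slide."""
    key_counter: int            # number of keys already reduced
    reduce_input_path: str      # local path of the reduce inputs
    hostname: str               # host that holds the inputs and this state
    suspended_attempt_ids: List[str] = field(default_factory=list)


def run_reduce(
    attempt_id: str,
    inputs: Sequence[KeyedInput],
    reduce_fn: Callable[[List[int]], int],
    output: List[Tuple[str, int]],
    input_path: str,
    hostname: str,
    resume_from: Optional[Checkpoint] = None,
    suspend_at: Optional[int] = None,
) -> Optional[Checkpoint]:
    """Reduce keys in order; return a Checkpoint if suspended, else None."""
    skip = resume_from.key_counter if resume_from else 0
    earlier = list(resume_from.suspended_attempt_ids) if resume_from else []
    for index, (key, values) in enumerate(inputs):
        if index < skip:
            continue  # this key was reduced by an earlier attempt
        if suspend_at is not None and index == suspend_at:
            # Simulated suspend request: record progress and free the container.
            return Checkpoint(index, input_path, hostname, earlier + [attempt_id])
        output.append((key, reduce_fn(values)))
    return None


inputs = [("a", [1, 2]), ("b", [3]), ("c", [4, 5]), ("d", [6])]
outdir: List[Tuple[str, int]] = []  # stands in for the HDFS output directory
ckpt = run_reduce("att_1", inputs, sum, outdir, "tmp/task_att_1", "nodeA", suspend_at=2)
done = run_reduce("att_2", inputs, sum, outdir, "tmp/task_att_1", "nodeA", resume_from=ckpt)
```

The second attempt skips the two keys already reduced, so no reduce work is repeated and each key appears once in the output.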

  7. Two-level Eviction Policies • On a container request in a full cluster: • Job Eviction @Preemptor • Task Eviction @Releaser • [Architecture diagram: Resource Manager (Capacity Scheduler, Preemptor); Node Managers A and B hosting Application Masters 1 and 2, their tasks, a Releaser and a Local Suspender each; labeled steps: preempt(), # containers to release, release()]

  8. Job Eviction Policies • Based on total amount of resources (e.g., containers) held by victim job (known at Resource Manager) • Least Resources (LR) → Large research jobs unaffected → Starvation for small research jobs (e.g., repeated production arrivals) • Most Resources (MR) → Small research jobs unaffected → Starvation for the largest research job • Probabilistically-weighted on Resources (PR) → Weigh jobs by number of containers: treats all tasks same, across jobs → Affects multiple research jobs
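
The three job eviction policies on slide 8 can be sketched as victim-selection functions over the number of containers each research job holds. This is an illustration only: the function names and the `held` mapping are hypothetical, and the sketch assumes at least one job holds a container.

```python
import random
from typing import Dict


def least_resources(held: Dict[str, int]) -> str:
    """LR: pick the job holding the fewest containers as the victim."""
    return min(held, key=lambda job: held[job])


def most_resources(held: Dict[str, int]) -> str:
    """MR: pick the job holding the most containers as the victim."""
    return max(held, key=lambda job: held[job])


def probabilistic_on_resources(held: Dict[str, int], rng: random.Random) -> str:
    """PR: pick a job with probability proportional to its container count.

    Equivalent to picking one container uniformly at random across all jobs
    and evicting from the job that owns it.
    """
    jobs = list(held)
    weights = [held[job] for job in jobs]
    return rng.choices(jobs, weights=weights, k=1)[0]


held = {"research-small": 4, "research-medium": 10, "research-large": 30}
lr_victim = least_resources(held)
mr_victim = most_resources(held)
pr_victim = probabilistic_on_resources(held, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the PR choice reproducible in experiments.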

  9. Task Eviction Policies • Based on time remaining (known at Application Master) • Shortest Remaining Time (SRT) → Leaves the tail of research job alone → Holds on to containers that would be released soon • Longest Remaining Time (LRT) → May lengthen the tail • Releases more containers earlier • However: SRT provably optimal under some conditions • Counter-intuitive. SRT = Longest-job-first scheduling.
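
Why SRT can beat LRT is easier to see in a toy model. The sketch below is an assumption-laden illustration, not the analysis from the paper: it assumes every evicted task is paused for the same duration `pause` and then resumes (all at the same time), while non-evicted tasks keep running. `pick_victims` and `job_completion` are hypothetical names.

```python
from typing import Dict, List, Set


def pick_victims(remaining: Dict[str, float], n: int, policy: str) -> List[str]:
    """Choose n tasks to suspend, ordered by estimated remaining time."""
    if policy not in ("SRT", "LRT"):
        raise ValueError("policy must be 'SRT' or 'LRT'")
    ordered = sorted(remaining, key=lambda t: remaining[t], reverse=(policy == "LRT"))
    return ordered[:n]


def job_completion(remaining: Dict[str, float], victims: Set[str], pause: float) -> float:
    """Toy model: a suspended task finishes `pause` seconds later than planned."""
    return max(t + (pause if task in victims else 0.0) for task, t in remaining.items())


remaining = {"t1": 10.0, "t2": 50.0, "t3": 200.0}  # seconds left per reduce task
srt = pick_victims(remaining, 1, "SRT")
lrt = pick_victims(remaining, 1, "LRT")
srt_finish = job_completion(remaining, set(srt), pause=100.0)
lrt_finish = job_completion(remaining, set(lrt), pause=100.0)
```

In this example SRT suspends `t1` and the job still ends when its longest task does, whereas LRT suspends `t3` and pushes the tail out by the full pause.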

  10. Eviction Policies in Practice • Task Eviction • SRT 20% faster than LRT for research jobs • Production job similar across SRT vs. LRT • Theorem: When research tasks resume simultaneously, SRT results in shortest job completion time. • Job Eviction • MR best • PR very close behind • LR 14%-23% worse than MR • MR + SRT best combination

  11. Natjam-R: Multiple Priorities • Special case of priorities: jobs with real-time deadlines • Best-effort only (no admission control) • Resource Manager keeps a single queue of jobs sorted by increasing priority (derived from deadline) • Periodically scans queue: evicts later job to give resources to earlier waiting job • Job Eviction Policies • Maximum Deadline First (MDF): Priority = Deadline • Prefers short deadline jobs → May miss deadlines, e.g., schedules a large job instead of a small job with a slightly larger deadline • Maximum Laxity First (MLF) • Priority = Laxity = Deadline minus Job’s Projected Completion Time • Pays attention to job’s resource requirements
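
The two priority definitions on slide 11 can be compared with a short Python sketch. The `Job` class and function names are hypothetical, and the sketch assumes that the deadline and the projected completion time are measured on the same clock (seconds from now) and that the job with the largest priority value is evicted first.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    name: str
    deadline: float              # seconds from now
    projected_completion: float  # seconds from now, given current resources


def mdf_priority(job: Job) -> float:
    """Maximum Deadline First: priority is the deadline itself."""
    return job.deadline


def mlf_priority(job: Job) -> float:
    """Maximum Laxity First: priority is deadline minus projected completion."""
    return job.deadline - job.projected_completion


def eviction_order(jobs: List[Job], priority: Callable[[Job], float]) -> List[str]:
    """Jobs with the largest priority value are the first eviction victims."""
    return [job.name for job in sorted(jobs, key=priority, reverse=True)]


jobs = [
    Job("A", deadline=100.0, projected_completion=40.0),   # laxity 60
    Job("B", deadline=110.0, projected_completion=105.0),  # laxity 5
]
mdf_order = eviction_order(jobs, mdf_priority)
mlf_order = eviction_order(jobs, mlf_priority)
```

Here MDF evicts B first because its deadline is later, while MLF evicts A first because A has more slack before its deadline.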

  12. MDF vs. MLF in Practice • 8 node cluster • Yahoo! trace experiments in paper • [Figure annotations: job deadlines; MDF prefers short deadlines; MLF moves in lockstep; misses all deadlines]

  13. Natjam vs. Alternatives • Microbenchmark: 7 node cluster • t=0s Research-XL (100% of cluster) • t=50s Production-S (25% of cluster) • [Figure: job completion time (seconds) per scheduler; annotations: Empty Cluster; 7% worse than Ideal; 40% better than Soft cap; 50% worse than ideal; 90% worse than ideal; 20% worse than ideal; 2% worse than ideal; 15% better than Killing]

  14. Large Experiments • 250 nodes @Yahoo!, Driven by Yahoo! traces • Natjam vs. Waiting for research tasks (Hadoop Capacity Scheduler: Soft cap) • Production jobs: 53% benefit, 97% delayed < 5 s • Research jobs: 63% benefit, very few outliers (low starvation) • Natjam vs. Killing research tasks • Production jobs: largely unaffected • Research jobs: • 38% finish faster than 100 s • 5th percentile faster than 750 s • Biggest improvement: 1880 s • Negligible starvation

  15. Related Work • Single cluster job scheduling has focused on: • Locality of Map tasks [Quincy, Delay Scheduling] • Speculative execution [LATE Scheduler] • Average fairness between queues [Capacity Scheduler, Fair Scheduler] • Recent work: elastic queues [Amoeba], but it uses Sailfish, which needs a special intermediate file system and does not work with Hadoop • MAPREDUCE-5269 JIRA: Preemption in Hadoop

  16. Takeaways • Natjam supports dual priority and arbitrary priorities (derived from deadlines) • SRT (Shortest remaining time) best policy for task eviction • MR (Most resources) best policy for job eviction • MDF (Maximum deadline first) best policy for job eviction in Natjam-R • 2-7% Overhead for dual priority case • Please see our poster + demo video later today!

  17. Backup slides

  18. Contributions • Our system Natjam allows us to • Maintain one cluster • With a production queue and a research queue • Prioritize production jobs and complete them quickly • While affecting research jobs the least • (Later: Extend to multiple priorities.)

  19. Hadoop 23’s Capacity Scheduler • Limitation: research jobs cannot scale down • Hadoop capacity shared using queues • Guaranteed capacity (G) • Maximum capacity (M) • Example • Production (P) queue: G 80%/M 80% • Research (R) queue: G 20%/M 40% • [Timeline diagrams for two cases, production job submitted first and research job submitted first; annotations: P takes 80%; R takes 40%; R can only grow to 40%; P cannot grow beyond 60%; (under-utilization)]
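
The arithmetic behind the slide 19 example can be sketched as follows. This is a deliberately simplified model, not the Capacity Scheduler's actual allocation logic: it assumes shares are whole percentages of the cluster, that a queue is capped by its maximum capacity, and that capacity already held by another queue cannot be reclaimed. `grantable_share` is a hypothetical helper.

```python
def grantable_share(queue_max: int, held_by_others: int) -> int:
    """Largest share (percent of cluster) a queue can reach without preemption."""
    return max(0, min(queue_max, 100 - held_by_others))


P_MAX = 80  # Production queue: maximum capacity 80%
R_MAX = 40  # Research queue: maximum capacity 40%

# Research job submitted first: it grows to its maximum capacity.
r_share = grantable_share(R_MAX, held_by_others=0)
idle_while_r_runs_alone = 100 - r_share

# A production job arriving later is limited to what research does not hold.
p_share = grantable_share(P_MAX, held_by_others=r_share)
```

With these numbers the research job stops at 40% (leaving 60% idle), and the later production job cannot exceed 60% even though its queue allows 80%.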

  20. Natjam Scheduler • Does not require Maximum capacity • Scales down research jobs by • Preempting Reduce tasks • Fast on-demand automated checkpointing of task state • Resumption where it left off • Focus on Reduces: Reduce tasks take longer, so more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook]) • P/R Guaranteed: 80%/20%: R takes 100%, then P takes 80% • P/R Guaranteed: 100%/0%: R takes 100%, then P takes 100% • Prioritize Production Jobs

  21. Yahoo! Hadoop Traces: CDF of differences (negative is good) • [Figure panels: 7-node cluster; 250-node Yahoo! cluster] • Only two starved jobs: 260 s and 390 s • Largest benefit: 1880 s
