Satisfying Strong Application Requirements in Data-Intensive Clouds. Ph.D Final Exam. Brian Cho
Motivating scenario: Using the data-intensive cloud • Researchers contract with defense agency to investigate ongoing suspicious activity • e.g., botnet attack, worm, etc. • Other applications: processing click logs, news items, etc. • Transfer large logs (TBs-PBs) from possible victim sites • Run computations on logs to find vulnerabilities and source of attack • Store data
Can today’s data-intensive cloud meet these demands? The researchers require: • Control over time and $ cost of transfer, to stay within the contracted budget and time • Prioritization of this time-sensitive job over other jobs in its cluster • Consistent updates and reads at data store • Current limitation: Systems are built to optimize key metrics at large scales, but not to meet these strong user requirements
Strong user requirements • Many real-world requirements are too important to relax • Time • $$$ • Priority • Data consistency • It is essential to treat these strong requirements as problem constraints • … not just as side effects of resource limitations in the cloud
Thesis statement • It is feasible to satisfy strong application requirements for data-intensive cloud computing environments, in spite of resource limitations, while simultaneously optimizing run-time metrics. • Strong application requirements: real-time deadlines, dollar budgets, data consistency, etc. • Resource limitations: finite compute nodes, limited bandwidth, high latency, frequent failures, etc. • Run-time metrics: throughput, latency, $ cost, etc.
Contributions: Practical solutions • Bulk Data Transfer • Computation • Key-value Storage
Pandora-A: Bulk Data Transfer via Internet and Shipping Networks • Minimize $ cost subject to time deadline • Transfer options • Internet links with proportional costs but limited bandwidth • Shipping links with fixed costs and shipping times depending on method (e.g. ground, air) • Solution • Transform into time-expanded network • Solve min-cost flow on network • Trace-driven experiments • Pandora-A solutions better than direct Internet or shipping
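To make the transformation concrete, here is a minimal Java sketch of building the time-expanded network's edges, assuming discrete time steps. All type and parameter names (Edge, internetGBPerStep, shipSteps, etc.) are illustrative, not from the dissertation, and fixed-cost shipping edges need handling beyond plain min-cost flow that this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the dissertation's code): nodes are (site, time
// step) pairs; edges carry either a proportional cost per GB (Internet) or
// a fixed cost per use (shipping).
class TimeExpandedNetwork {
    static class Edge {
        final int fromSite, fromStep, toSite, toStep;
        final double capacityGB;   // max data per use of this edge
        final double costPerGB;    // proportional $ cost (Internet links)
        final double fixedCost;    // fixed $ cost per shipment (shipping links)
        Edge(int fs, int ft, int ts, int tt, double cap, double perGB, double fixed) {
            fromSite = fs; fromStep = ft; toSite = ts; toStep = tt;
            capacityGB = cap; costPerGB = perGB; fixedCost = fixed;
        }
    }

    // internetGBPerStep[u][v]: GB one time step of bandwidth can carry (0 = no link)
    // internetCostPerGB[u][v]: $ per GB on Internet link u->v
    // shipSteps[u][v]: shipping latency in time steps (0 = no shipping link)
    // shipFixedCost[u][v]: $ per shipment on u->v
    static List<Edge> build(int sites, int steps,
                            double[][] internetGBPerStep, double[][] internetCostPerGB,
                            int[][] shipSteps, double[][] shipFixedCost) {
        List<Edge> edges = new ArrayList<>();
        for (int t = 0; t + 1 < steps; t++) {
            for (int u = 0; u < sites; u++) {
                // Holdover edge: data may simply stay at site u for one step.
                edges.add(new Edge(u, t, u, t + 1, Double.MAX_VALUE, 0, 0));
                for (int v = 0; v < sites; v++) {
                    if (internetGBPerStep[u][v] > 0)
                        edges.add(new Edge(u, t, v, t + 1,
                                internetGBPerStep[u][v], internetCostPerGB[u][v], 0));
                    if (shipSteps[u][v] > 0 && t + shipSteps[u][v] < steps)
                        edges.add(new Edge(u, t, v, t + shipSteps[u][v],
                                Double.MAX_VALUE, 0, shipFixedCost[u][v]));
                }
            }
        }
        return edges;
    }
}
```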
Pandora-B: Bulk Data Transfer via Internet and Shipping Networks • Minimize transfer time subject to $ budget • Bounded binary search on Pandora-A solutions • Bounds created by transforming time-expanded networks [Figure: binary search between lower bound LB and upper bound UB on the Dollar Cost ($) vs. Transfer Time T (hrs) curve, against budget B]
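A minimal sketch of the bounded binary search, assuming a Pandora-A solver (the hypothetical CostFn.minCost below) whose minimum $ cost is non-increasing as the deadline grows; lb and ub stand in for the LB/UB bounds obtained from the transformed time-expanded networks.

```java
// Sketch: find the smallest deadline whose Pandora-A min cost fits the budget.
class PandoraB {
    interface CostFn {
        // Minimum $ cost of any plan finishing within the deadline
        // (Double.POSITIVE_INFINITY if infeasible).
        double minCost(int deadlineSteps);
    }

    static int minTransferTime(CostFn pandoraA, double budget, int lb, int ub) {
        while (lb < ub) {
            int mid = lb + (ub - lb) / 2;
            if (pandoraA.minCost(mid) <= budget) {
                ub = mid;      // feasible within budget: try a tighter deadline
            } else {
                lb = mid + 1;  // over budget: allow more time
            }
        }
        return lb; // smallest deadline T with cost(T) <= budget
    }
}
```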
Vivace: Consistent data for congested geo-distributed systems • Strongly consistent key-value store • Low latency across geo-distributed data centers • Under congestion • New algorithms • Prioritize a small amount of critical information • To avoid delay due to congestion • Evaluated using a practical prioritization infrastructure
Natjam: Prioritizing production jobs in MapReduce/Hadoop • Mixed workloads • Production jobs • Time sensitive • Directly affect revenue • Research jobs • e.g., long term analysis • Example: Ad provider • Production: count clicks in ad click-through logs to update ads (slow counts → show old ads → don't get paid $$$) • Research: machine learning analysis of lots of historical logs ("Is there a better way to place ads?"), which needs a large cluster • → Prioritize production jobs
Contributions • Natjam prioritizes production jobs • While giving research jobs spare capacity • Suspend/Resume tasks in research jobs • Production jobs can gain resources immediately • Research jobs can use many resources at a time, without wasting work • Develop eviction policies that choose which tasks to suspend
Natjam Outline • Motivation • Contributions • Background: MapReduce/Hadoop • State-of-the-art • Solution: Suspend/Resume • Design • Evaluation
Background: MapReduce/Hadoop • Distributed computation on large cluster • Each job consists of Map and Reduce tasks • Job stages • Map tasks run computations in parallel • Shuffle combines intermediate Map outputs • Reduce tasks run computations in parallel [Diagram: Map (M) tasks feeding Reduce (R) tasks]
Background: MapReduce/Hadoop • Distributed computation on large cluster • Each job consists of Map and Reduce tasks • Job stages • Map tasks run computations in parallel • Shuffle combines intermediate Map outputs • Reduce tasks run computations in parallel • Map input/Reduce output stored in distributed file system (e.g. HDFS) • Scheduling: Which task to run on empty resources (slots) [Diagram: Map (M) and Reduce (R) tasks of Jobs 1-3 assigned to cluster slots]
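For concreteness, the canonical Hadoop word-count kernel below shows what Map and Reduce tasks compute: the shuffle stage groups the mappers' (word, 1) pairs by key before the reducers sum them. This uses the standard Hadoop MapReduce API (job driver setup omitted); it is an illustration, not code from the dissertation.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map task: emit (word, 1) for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the shuffled counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```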
State-of-the-art: Separate clusters • Submit production jobs to a production cluster • Submit research jobs to a research cluster
State-of-the-art: Separate clusters • Submit production jobs to a production cluster • Submit research jobs to a research cluster • Trace of job submissions to Yahoo production cluster • Periods of under-utilization, where research jobs could potentially fill in [Plot: # Reduce slots over time (hours:mins), showing demand dipping below the Reduce slot capacity during periods of under-utilization. Plot used with permission from Yahoo]
State-of-the-art: Single-cluster Hadoop scheduling • Ideally, • Enough capacity for production jobs • Run research tasks on all idle production slots • But, • Killing tasks (e.g. Fair Scheduler) can lead to wasted work [Plot: # Reduce slots over time (hours:mins), with killed research tasks marked as wasted work. Plot used with permission from Yahoo]
State-of-the-art: Single-cluster Hadoop scheduling • Ideally, • Enough capacity for production jobs • Run research tasks on all idle production slots • But, • Killing tasks (e.g. Fair Scheduler) can lead to wasted work • No preemption (e.g. Capacity Scheduler) can lead to production jobs waiting for resources [Plot: # Reduce slots over time (hours:mins), with intervals where production jobs aren't assigned resources. Plot used with permission from Yahoo]
Approach: Suspend/Resume • Suspend/Resume tasks within and across research jobs • Production jobs can gain resources immediately • Research jobs can use many resources at a time, without wasting work • Focus on Reduce tasks • Reduce tasks take longer, so more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook]) [Plot: # Reduce slots over time (hours:mins). Plot used with permission from Yahoo]
Goals: Prioritize production jobs • Requirement: Production jobs should have the same completion time as if they were executed in an exclusive production cluster • Possibly with a small overhead • Optimization: Research jobs should have the shortest completion time possible • Constraint: Finite cluster resources
Challenges • Avoid Suspend overhead • Would require production jobs to wait for resources • Avoid Resume overhead • Would delay research jobs from making progress • Optimize task evictions • Job completion time is the metric that users care about • Develop eviction policies that have the least impact on job completion times
Natjam Design • Motivation • Contributions • Background: MapReduce/Hadoop • State-of-the-art • Solution: Suspend/Resume • Design • Evaluation • Scheduler • Hadoop → Natjam • Architecture • Hadoop → Natjam • Suspend/Resume tasks • Eviction Policies • Task • Job
Background: Capacity Scheduler • Limitation: research jobs cannot scale down • Hadoop capacity shared using queues • Guaranteed capacity (G) • Maximum capacity (M)
Background: Capacity Scheduler • Limitation: research jobs cannot scale down • Hadoop capacity shared using queues • Guaranteed capacity (G) • Maximum capacity (M) • Example • Production (P) queue: G 80% / M 80% • Research (R) queue: G 20% / M 40% • Production job submitted first: P takes 80%; R can never take more than 40%, even once P finishes (under-utilization) • Research job submitted first: R grows to 40%; P cannot grow beyond 60% (under-utilization)
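A minimal Java sketch of the G/M semantics just described (illustrative, not actual Capacity Scheduler code): a queue may borrow idle capacity past its guarantee but never past its maximum, and nothing running is preempted to give capacity back.

```java
// Illustrative model of a Capacity Scheduler queue's capacity limits.
class QueueCaps {
    final double guaranteed; // e.g. 0.80 for production, 0.20 for research
    final double maximum;    // e.g. 0.80 for production, 0.40 for research
    double used;             // fraction of the cluster currently held

    QueueCaps(double guaranteed, double maximum) {
        this.guaranteed = guaranteed;
        this.maximum = maximum;
    }

    // A queue may take idle capacity only while below its hard maximum.
    boolean canGrow(double clusterIdle) {
        return clusterIdle > 0 && used < maximum;
    }
}
// Consequences from the slide: research is capped at 40% even when the rest
// of the cluster sits idle; and if research already holds 40%, production is
// stuck at 60% despite its 80% guarantee, because tasks are never preempted.
```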
Natjam Scheduler • Does not require Maximum capacity • Scales down research jobs
Natjam Scheduler • Does not require Maximum capacity • Scales down research jobs • P/R Guaranteed 80%/20%: R takes 100% at first; when the production job arrives, P takes 80% • P/R Guaranteed 100%/0%: R takes 100% at first; when the production job arrives, P takes 100% • → Prioritize production jobs
Background: Hadoop YARN architecture • Resource Manager • Application Master per application • Tasks are launched on containers of memory • Formerly, slots in Hadoop [Diagram: the Resource Manager's Capacity Scheduler grants containers on Nodes A and B via Node Managers; Application Masters 1 and 2 ask for containers and run their tasks in them, with one container empty]
Suspend/Resume architecture • Preemptor • Decides when resources should be reclaimed from queues • Chooses victim job • Releaser • Chooses task to evict • Local Suspender • Saves state • Promptly exits • Messaging overheads [Diagram: the Resource Manager's Preemptor calls preempt() and tells the victim Application Master's Releaser how many containers to release; the Releaser picks tasks via release(), their Local Suspenders save state and exit, and the saved state is later used by resume()]
Suspending and Resuming Tasks • When suspending, we must save enough state to be used when resuming the task. • By using existing intermediate data, we save only a small amount of state • Simple • Low overhead
Suspending and Resuming Tasks • Existing intermediate data used • Reduce inputs, stored at local host • Reduce outputs, stored on HDFS • Suspend state saved • Key counter • Reduce input path • Hostname • List of suspended task attempt IDs [Diagram: suspended Task Attempt 1 frees its container, saving its key counter and input path (tmp/task_att_1); resumed Task Attempt 2 re-reads the inputs from tmp/task_att_2, skips already-processed keys, and continues writing to outdir/ on HDFS]
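A minimal sketch of that suspend state as a Java record (field names are illustrative, not Natjam's actual class). Because reduce inputs stay on the local host and committed reduce outputs stay in HDFS, this small record is all that must be saved.

```java
import java.util.List;

// Illustrative per-task suspend state for a reduce task.
class ReduceSuspendState {
    long keyCounter;                  // input keys already fully reduced
    String reduceInputPath;           // local path of the fetched reduce inputs
    String hostname;                  // host holding those local inputs
    List<String> suspendedAttemptIds; // earlier attempts of this task
}
// On resume, a new task attempt is scheduled (preferably on `hostname`),
// re-opens `reduceInputPath`, skips the first `keyCounter` keys, and
// continues writing output where the suspended attempt left off.
```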
Two-level Eviction Policies • Job-level eviction • Chooses victim job • Task-level eviction • Chooses task to evict [Diagram: the Preemptor at the Resource Manager picks the victim job; that job's Releaser at the Application Master picks which task to release]
Task eviction policies • Based on time remaining • Last task to finish decides job completion time • Task that finishes earlier releases container earlier • Application Master keeps track of time remaining • Shortest Remaining Time (SRT): shortens the tail, but holds on to containers that would otherwise be released soon • Longest Remaining Time (LRT): releases containers as soon as possible, but may lengthen the tail
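A minimal sketch of the two policies as a victim-selection function (types and names are illustrative, not Natjam's code):

```java
import java.util.Comparator;
import java.util.List;

class TaskEviction {
    static class TaskInfo {
        String attemptId;
        long remainingMillis; // Application Master's estimate of time left
    }

    // SRT evicts the task with the least time remaining (keeps the job's
    // tail short); LRT evicts the task with the most time remaining (the
    // surviving short tasks release their containers soonest).
    static TaskInfo chooseVictim(List<TaskInfo> running, boolean srt) {
        Comparator<TaskInfo> byRemaining =
            Comparator.comparingLong(t -> t.remainingMillis);
        return srt ? running.stream().min(byRemaining).orElse(null)
                   : running.stream().max(byRemaining).orElse(null);
    }
}
```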
Job eviction policies • Based on amount of resources (e.g. memory) held by job • Resource Manager holds resource information • Least Resources (LR): large jobs benefit, but causes starvation even with small production jobs • Most Resources (MR): small jobs benefit, but large jobs may be delayed for a long time • Probabilistically-weighted on Resources (PR): avoids biasing tasks (the chance of eviction for a task is the same across all jobs, assuming a random task eviction policy), but many jobs may be delayed
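A matching sketch of the three job-level policies (again with illustrative types): PR picks a victim with probability proportional to the resources it holds, which is what makes every task equally likely to be evicted under a random task eviction policy.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class JobEviction {
    static class JobInfo {
        String jobId;
        long memoryHeld; // resources held, as tracked by the Resource Manager
    }

    enum Policy { LR, MR, PR }

    static JobInfo chooseVictim(List<JobInfo> jobs, Policy policy, Random rng) {
        Comparator<JobInfo> byResources =
            Comparator.comparingLong(j -> j.memoryHeld);
        switch (policy) {
            case LR: // Least Resources: evict from the smallest job
                return jobs.stream().min(byResources).orElse(null);
            case MR: // Most Resources: evict from the largest job
                return jobs.stream().max(byResources).orElse(null);
            default: // PR: victim probability proportional to resources held
                long total = jobs.stream().mapToLong(j -> j.memoryHeld).sum();
                long pick = (long) (rng.nextDouble() * total);
                for (JobInfo j : jobs) {
                    pick -= j.memoryHeld;
                    if (pick < 0) return j;
                }
                return jobs.isEmpty() ? null : jobs.get(jobs.size() - 1);
        }
    }
}
```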
Evaluation • Microbenchmarks • Trace-driven experiments • Natjam was implemented based on Hadoop 0.23 (YARN) • 7-node cluster in CCT
Microbenchmarks: Setup • Avg completion times on empty cluster • Research Job: ~200s • Production Job: ~70s • Job sizes: XL (100% of cluster), L (75%), M (50%), S (25%) • Task workloads within a job chosen uniformly from the range (1/2 of largest task, largest task]
Microbenchmark: Comparing Natjam to other techniques [Chart: job completion times (seconds) with Research-XL submitted at t=0s and Production-S at t=50s; annotations include 2%, 7%, 20%, 50%, and 90% more than ideal, 40% less than Soft cap, and 15% less than Killing]
Microbenchmark: Suspend overhead • 1.25 s (50%) increase due to messaging delays • The 4.7s increase in job completion time comes from task assignments, which happen in parallel • Assign Application Master • Assign Map tasks • Assign Reduce tasks
Microbenchmark: Task eviction policies [Chart: job completion times (seconds) with Research-XL submitted at t=0s and Production-S at t=50s; SRT is 17% less than Random] • Theorem 1: When production tasks are the same length, SRT results in the shortest job completion time.
Microbenchmark: Job eviction policies [Chart: job completion times (seconds) with Research-L and Research-S submitted at t=0s and Production-S at t=50s; Most Resources + SRT is a good fit] • Theorem 2: When tasks within each job are the same length, evicting from the minimum number of jobs results in the shortest average job completion time.
Trace-driven evaluation • Yahoo trace: scaled production cluster workload + scaled research cluster workload • Metric: job completion times
Trace-driven evaluation: Research jobs only [Chart annotation: 115 seconds]
Trace-driven evaluation: CDF of differences (negative is good)
Related Work • Single cluster job scheduling has focused on: • Locality of Map tasks [Quincy, Delay Scheduling] • Speculative execution [LATE Scheduler] • Average fairness between queues [Capacity Scheduler, Fair Scheduler] • Recent work: Elastic queues [Amoeba] • We solve the requirement of prioritizing production jobs
Natjam summary • Natjam prioritizes production jobs • Suspend/Resume tasks in research jobs • Eviction policies that choose which tasks to suspend • Evaluation • Microbenchmarks • Trace-driven experiments
Conclusion • Thesis: It is feasible to satisfy strong application requirements for data-intensive cloud computing environments, in spite of resource limitations, while simultaneously optimizing run-time metrics. • Contributions: Solutions that reinforce this statement in diverse data-intensive cloud settings.