
Shared Scan Batch Scheduling in Cloud Computing


Presentation Transcript


  1. Shared Scan Batch Scheduling in Cloud Computing

  2. Project Goals

  Eliminate redundant data processing for concurrent workflows that access the same dataset in the Cloud:
  • Batch MapReduce workflows to enable scan sharing
  • Single-pass scan of shared data segments
  • Alleviate contention and improve scalability
  • Utilize fewer map/reduce slots under load
  • Data-intensive workloads (tens of minutes to hours)
  • Joins across multiple datasets
  • User-specified rewards for early completion
  • Trade-offs between efficient resource utilization and deadlines

  3. Data-Driven Batch Scheduling

  [Figure: queries Q1, Q2, Q3 over the Turbulence DB are decomposed by data access into sub-queries, co-scheduled in batches, and produce results R1, R2, R3]
  • Throughput scales with contention (Astro. & Turbulence)
  • Decompose queries into sub-queries based on data access
  • Co-schedule sub-queries to amortize I/O
  • Evaluate data atoms based on a utility metric
  • Reorder based on contention vs. arrival order (CIDR'09); see the sketch below
  • Adaptive starvation resistance
  • Job-aware scheduling (queries with data dependencies) (SC'10)
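  A minimal sketch of the contention-based reordering idea, in Python, assuming a hypothetical pending map from data atom to its waiting sub-queries; here an atom's utility is simply its contention, i.e. how many waiting sub-queries one shared scan of it would serve.

    def next_atom(pending):
        """Pick the data atom whose single scan serves the most waiting sub-queries."""
        if not pending:
            return None
        # Utility here is contention: the number of sub-queries sharing the atom.
        return max(pending, key=lambda atom: len(pending[atom]))

    def schedule(pending):
        """Drain the queue, one shared scan per data atom (consumes `pending`)."""
        order = []
        while pending:
            atom = next_atom(pending)
            served = pending.pop(atom)   # one pass over `atom` answers all of `served`
            order.append((atom, served))
        return order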

  4. Application in Cloud Computing

  • Fixed Cloud (fixed resources)
    • Single-pass scan of shared data
    • Alleviate contention (use fewer map/reduce slots; shared loading and shuffling of data)
    • Earn rewards for early completion (soft deadlines)
    • Local improvement with simulated annealing and greedy ordering
  • Elastic Cloud
    • Machine charge = (# of machines) × (# of hours)
    • Speed-up factors with more machines (i.e. more parallelism)
    • Add machines to meet soft deadlines (see the sketch below)
    • Aggressive batching to minimize machine charge (efficiency)
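  A minimal sketch of the elastic-cloud trade-off, in Python, under an assumed linear speedup model (runtime = work / machines), which is an illustrative simplification rather than the system's actual model:

    import math

    def machine_charge(machines, hours):
        # Billing from the slide: (# of machines) x (# of hours),
        # with hours rounded up to whole billing units.
        return machines * math.ceil(hours)

    def pick_machines(work_hours, deadline_hours, max_machines):
        """Fewest machines that still meet the soft deadline under the
        assumed linear speedup."""
        for m in range(1, max_machines + 1):
            if work_hours / m <= deadline_hours:
                return m, machine_charge(m, work_hours / m)
        # Deadline unreachable: run as wide as allowed.
        return max_machines, machine_charge(max_machines, work_hours / max_machines)

  For example, 40 machine-hours of work against a 5-hour soft deadline gives pick_machines(40, 5, 32) = (8, 40): eight machines for five hours each.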

  5. Nova Workflow Platform

  [Stack diagram: App 1/App 2/App 3 run on Nova (advanced workflows), which builds on Oozie (simple workflows), Pig (dataflow), Hadoop M-R (processing), and HDFS (storage)]

  • What is Nova?
    • Content management and workflow scheduling for the Cloud
    • Leverages existing resources
      • Cloud data: HDFS/Zebra storage
      • Cloud computing: Oozie, Pig/MR/Hadoop
    • Users define complex workflows in Oozie that consume the data
  • Oozie: a workflow engine for coordinating Map-Reduce/Pig jobs in Hadoop (i.e. a workflow DAG in which nodes are MR tasks and edges are dataflows)

  Sample Pig:

    A = load 'input1' as (a, b, c);
    B = filter A by a > 5;
    store B into 'output1';
    C = group B by b;
    store C into 'output2';

  6. Sample Nova Workflow

  [Figure: crawler output feeds crawled pages (url, content); a candidate entity extractor task produces candidate entity occurrences (url, entity string); a join task against the editor-maintained entities table (entity id, entity string) produces validated entity occurrences (url, entity id); a group-wise count task produces entity occurrence counts (entity id, count); each intermediate is Nova Data, each step a Nova Task]

  7. Shared Scan via Workflow Merging

  [Figure: a Workflow Merger combines Nova Workflow 1 and Nova Workflow 2, which both scan input c2s0, into merged Nova Workflow 1.2, which scans c2s0 once; boxes denote input data (c1s0–c5s0), Pig/MR tasks, and output data]

  • Sample use cases in Nova (see the merging sketch below):
    • Concurrent research, production, and maintenance workflows over the same data
    • Content enrichment workflows (e.g. dedup, clustering) over news content
    • Webmap workflows consuming the same URL table
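  A minimal sketch of the merging step, in Python, with assumed names (Job, merge_by_shared_input): jobs that scan a common input are greedily folded into one merged job whose single scan feeds every member's pipeline.

    from dataclasses import dataclass, field

    @dataclass
    class Job:
        name: str
        inputs: set                                   # files this job scans
        members: list = field(default_factory=list)   # original jobs folded in

    def merge_by_shared_input(jobs):
        """Greedily fold jobs that share any input into one merged job,
        so each shared input is scanned once for the whole group."""
        merged = []
        for job in jobs:
            target = next((m for m in merged if m.inputs & job.inputs), None)
            if target is None:
                merged.append(Job(job.name, set(job.inputs), [job]))
            else:
                target.name += "+" + job.name
                target.inputs |= job.inputs
                target.members.append(job)
        return merged

    # e.g. two workflows that both scan c2s0 collapse into one merged job:
    w1 = Job("wf1", {"c1s0", "c2s0"})
    w2 = Job("wf2", {"c2s0", "c5s0"})
    print([m.name for m in merge_by_shared_input([w1, w2])])   # ['wf1+wf2']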

  8. Performance Impact

  • (1) Shared loading (network, redundant processing)
  • (2) Consolidated computation (shared startup/tear-down)
  • (3) Reducer parallelism (Max vs. Sum of # of reducers)

  [Figure: merged Map-Reduce plan over the shared input data — a Split(Tuple) operator routes each input tuple to the nested plans inside Map1 … Mapn, Combine(Tuple) feeds the shuffle, and a Demux(Tuple) operator routes tuples to the nested plans inside Reduce1 … Reducem, which write each job's output data; markers (1)–(3) show where the savings above arise]
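  A minimal sketch of the split/demux idea, in Python, with hypothetical nested-plan callables: the merged mapper tags each record's outputs with the owning job, and the merged reducer demultiplexes shuffled groups by that tag.

    def merged_map(record, nested_maps):
        """Run every nested map plan over one scanned record.

        nested_maps: dict of job_id -> map function (record -> list of (k, v)).
        """
        for job_id, map_fn in nested_maps.items():
            for key, value in map_fn(record):
                # Tag the shuffle key with the owning job.
                yield (job_id, key), value

    def merged_reduce(tagged_key, values, nested_reduces):
        """Demux: route each shuffled group to the owning job's reduce plan."""
        job_id, key = tagged_key
        return nested_reduces[job_id](key, values)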

  9. Completion Time by Scheduling Strategy

  Performance in Nova for different enrichment workflows (e.g. de-dup) on news content (SIGMOD'11)

  10. Utilization of Grid Resources (Slot Time)

  11. PigMix: Load Cost Savings

  12. PigMix: Estimating Makespan

  13. Ongoing Work

  • Starvation resistance
  • Account for heterogeneity in workflow sizes
  • Provide soft-deadline guarantees
  • Handle cascading failures
  • Prefer jobs with high load cost (less dilation, high slot-time savings, map-only jobs)
  • Predict workflow runtime and frequency
  • Robustness to inaccuracies in cost estimates
  • Conserve or expend Cloud resources based on deadline requirements and system load
  • Jobs that join/scan multiple input sources

  14. Questions?

  15. Nova Workflow Platform

  • Nova features:
    • Abstraction for complex workflows that consume data
    • Incrementally arriving data (logs, crawls, feeds, ...)
    • Incremental processing of arriving data (see the sketch below)
      • Stateless: shingle every newly-crawled page
      • Stateful: maintain inlink counts as the web grows
    • Scheduling of processing steps
      • Periodic: run the inlink counter once per week
      • Triggered: run the inlink counter after the link extractor
    • Provides provenance, metadata management, incremental processing (e.g. joins), data replication, and transactional guarantees
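  A minimal sketch contrasting the two incremental styles named above, in Python, with hypothetical record shapes: the stateless task processes each new page independently, while the stateful task folds newly extracted links into persistent inlink counts.

    from collections import Counter

    def shingle(page_text, k=4):
        """Stateless step: shingle one newly-crawled page into k-word
        shingles; no state carries over between batches."""
        words = page_text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def update_inlink_counts(state, new_links):
        """Stateful step: fold a batch of newly extracted (src, dst) links
        into persistent inlink counts."""
        state.update(dst for _src, dst in new_links)
        return state

    counts = Counter()   # persisted between workflow runs
    counts = update_inlink_counts(counts, [("a.com", "b.com"), ("c.com", "b.com")])
    print(counts["b.com"])   # 2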

  16. PigMix: Reducer Parallelism

  17. Optimizing for Shared Scan

  • Define a job J (i.e. a MapReduce or Pig job)
    • J scans files f(J) = (F1, …, Fk), with scan time s(Fi) per file
    • Fixed processing cost c(J)
    • d(J) defines a soft deadline for each job
      • Step: d is defined by n pairs (ti, pi) with 0 < ti < ti+1 and pi > pi+1; a job that completes by ti is awarded pi points
      • Linear decay: enforce eventual completion with negative points
  • Cost of a shared scan of jobs J1 and J2:
    c(J1) + c(J2) + ∑_{F ∈ f(J1) ∪ f(J2)} s(F)
  • Maximize points and minimize resources (see the sketch below)
    • Local improvement with simulated annealing and greedy ordering
    • Aggressive batching when load is high
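  A minimal sketch of the cost model and step reward, in Python, with hypothetical Job fields mirroring the definitions above; the batch cost charges each distinct file once, which is exactly the shared-scan saving, and local improvement is a simplified simulated annealing over batch order. A further simplification of the sketch: every job in a batch finishes when its batch does.

    import random
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        files: frozenset   # f(J): files the job scans
        proc_cost: float   # c(J): fixed processing cost
        deadline: list     # step reward: [(t1, p1), (t2, p2), ...], t ascending

    def batch_cost(jobs, scan_time):
        """c(J1) + ... + c(Jn) + sum of s(F) over the UNION of scanned files:
        each shared file is scanned once for the whole batch."""
        shared = frozenset().union(*(j.files for j in jobs))
        return sum(j.proc_cost for j in jobs) + sum(scan_time[f] for f in shared)

    def reward(job, finish_time):
        """Step reward: points of the earliest deadline tier still met."""
        for t, p in job.deadline:
            if finish_time <= t:
                return p
        return 0   # (a linear-decay variant would go negative here)

    def total_points(order, scan_time):
        now = points = 0.0
        for batch in order:
            now += batch_cost(batch, scan_time)
            points += sum(reward(j, now) for j in batch)
        return points

    def anneal(order, scan_time, steps=1000, temp=1.0, cooling=0.995):
        """Local improvement over batch order (needs at least two batches)."""
        best = cur = list(order)
        best_pts = cur_pts = total_points(cur, scan_time)
        for _ in range(steps):
            i, j = random.sample(range(len(cur)), 2)
            cand = list(cur)
            cand[i], cand[j] = cand[j], cand[i]        # swap two batches
            pts = total_points(cand, scan_time)
            if pts >= cur_pts or random.random() < temp:   # accept worse moves early
                cur, cur_pts = cand, pts
                if pts > best_pts:
                    best, best_pts = cand, pts
            temp *= cooling
        return best, best_pts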

  18. Performance Evaluation

  • Experimental setup
    • Nova with the Shared Scan module
    • 200-node Hadoop cluster
    • 128 MB HDFS block size
    • 1 GB RAM per node
    • 640 mapper and 320 reducer slots
  • Shingling workflow (offline content enrichment)
    • De-duplication of news
    • Filter and extract features from content
    • Cluster content by feature and pick one item per cluster
    • Execution of multiple de-dup workflows using different clustering algorithms
  • Scheduling strategies compared
    • Sequential-NoMerge (slower, conserves Grid resources)
    • Concurrent-NoMerge (fast, elastic Grid resources)
    • Merged (fast, conserves Grid resources)
