
Shared Scan Batch Scheduling in Cloud Computing


Presentation Transcript


  1. Shared Scan Batch Scheduling in Cloud Computing

  2. Project Goals

  Eliminate redundant data processing for concurrent workflows that access the same dataset in the Cloud:
  • Batch MapReduce workflows to enable scan sharing
  • Single-pass scan of shared data segments
  • Alleviate contention and improve scalability
  • Utilize fewer map/reduce slots under load
  • Data-intensive workloads (tens of minutes to hours)
  • Joins across multiple datasets
  • User-specified rewards for early completion
  • Trade-offs between efficient resource utilization and deadlines

  3. Data-Driven Batch Scheduling

  [Figure: queries Q1, Q2, Q3 over the Turbulence DB are decomposed by data access into sub-queries, co-scheduled in batches, and produce results R1, R2, R3]
  • Throughput scales with contention (Astro. & Turbulence)
  • Decompose queries into sub-queries based on data access
  • Co-schedule sub-queries to amortize I/O
  • Evaluate data atoms based on a utility metric
  • Reorder based on contention vs. arrival order (CIDR'09); see the sketch below
  • Adaptive starvation resistance
  • Job-aware scheduling (queries with data dependencies) (SC'10)
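  A minimal sketch of the contention-based reordering idea, in Python, assuming a hypothetical pending map from data atom to its waiting sub-queries; here an atom's utility is simply its contention, i.e. how many waiting sub-queries one shared scan of it would serve.

    def next_atom(pending):
        """Pick the data atom whose single scan serves the most waiting sub-queries."""
        if not pending:
            return None
        # Utility here is contention: the number of sub-queries sharing the atom.
        return max(pending, key=lambda atom: len(pending[atom]))

    def schedule(pending):
        """Drain the queue, one shared scan per data atom (consumes `pending`)."""
        order = []
        while pending:
            atom = next_atom(pending)
            served = pending.pop(atom)   # one pass over `atom` answers all of `served`
            order.append((atom, served))
        return order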

  4. Application in Cloud Computing

  • Fixed Cloud (fixed resources)
    • Single-pass scan of shared data
    • Alleviate contention (use fewer map/reduce slots; shared loading and shuffling of data)
    • Earn rewards for early completion (soft deadlines)
    • Local improvement with simulated annealing and greedy ordering
  • Elastic Cloud
    • Machine charge = (# of machines) × (# of hours)
    • Speed-up factors with more machines (i.e. more parallelism)
    • Add machines to meet soft deadlines (see the sketch below)
    • Aggressive batching to minimize machine charge (efficiency)
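  A minimal sketch of the elastic-cloud trade-off, in Python, under an assumed linear speedup model (runtime = work / machines), which is an illustrative simplification rather than the system's actual model:

    import math

    def machine_charge(machines, hours):
        # Billing from the slide: (# of machines) x (# of hours),
        # with hours rounded up to whole billing units.
        return machines * math.ceil(hours)

    def pick_machines(work_hours, deadline_hours, max_machines):
        """Fewest machines that still meet the soft deadline under the
        assumed linear speedup."""
        for m in range(1, max_machines + 1):
            if work_hours / m <= deadline_hours:
                return m, machine_charge(m, work_hours / m)
        # Deadline unreachable: run as wide as allowed.
        return max_machines, machine_charge(max_machines, work_hours / max_machines)

  For example, 40 machine-hours of work against a 5-hour soft deadline gives pick_machines(40, 5, 32) = (8, 40): eight machines for five hours each.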

  5. Nova Workflow Platform

  [Stack diagram: App 1/App 2/App 3 run on Nova (advanced workflows), which builds on Oozie (simple workflows), Pig (dataflow), Hadoop M-R (processing), and HDFS (storage)]

  • What is Nova?
    • Content management and workflow scheduling for the Cloud
    • Leverages existing resources
      • Cloud data: HDFS/Zebra storage
      • Cloud computing: Oozie, Pig/MR/Hadoop
    • Users define complex workflows in Oozie that consume the data
  • Oozie: a workflow engine for coordinating Map-Reduce/Pig jobs in Hadoop (i.e. a workflow DAG in which nodes are MR tasks and edges are dataflows)

  Sample Pig:

    A = load 'input1' as (a, b, c);
    B = filter A by a > 5;
    store B into 'output1';
    C = group B by b;
    store C into 'output2';

  6. Sample Nova Workflow

  [Figure: crawler output feeds crawled pages (url, content); a candidate entity extractor task produces candidate entity occurrences (url, entity string); a join task against the editor-maintained entities table (entity id, entity string) produces validated entity occurrences (url, entity id); a group-wise count task produces entity occurrence counts (entity id, count); each intermediate is Nova Data, each step a Nova Task]

  7. Shared Scan via Workflow Merging

  [Figure: a Workflow Merger combines Nova Workflow 1 and Nova Workflow 2, which both scan input c2s0, into merged Nova Workflow 1.2, which scans c2s0 once; boxes denote input data (c1s0–c5s0), Pig/MR tasks, and output data]

  • Sample use cases in Nova (see the merging sketch below):
    • Concurrent research, production, and maintenance workflows over the same data
    • Content enrichment workflows (e.g. dedup, clustering) over news content
    • Webmap workflows consuming the same URL table
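  A minimal sketch of the merging step, in Python, with assumed names (Job, merge_by_shared_input): jobs that scan a common input are greedily folded into one merged job whose single scan feeds every member's pipeline.

    from dataclasses import dataclass, field

    @dataclass
    class Job:
        name: str
        inputs: set                                   # files this job scans
        members: list = field(default_factory=list)   # original jobs folded in

    def merge_by_shared_input(jobs):
        """Greedily fold jobs that share any input into one merged job,
        so each shared input is scanned once for the whole group."""
        merged = []
        for job in jobs:
            target = next((m for m in merged if m.inputs & job.inputs), None)
            if target is None:
                merged.append(Job(job.name, set(job.inputs), [job]))
            else:
                target.name += "+" + job.name
                target.inputs |= job.inputs
                target.members.append(job)
        return merged

    # e.g. two workflows that both scan c2s0 collapse into one merged job:
    w1 = Job("wf1", {"c1s0", "c2s0"})
    w2 = Job("wf2", {"c2s0", "c5s0"})
    print([m.name for m in merge_by_shared_input([w1, w2])])   # ['wf1+wf2']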

  8. Performance Impact

  • (1) Shared loading (network, redundant processing)
  • (2) Consolidated computation (shared startup/tear-down)
  • (3) Reducer parallelism (Max vs. Sum of # of reducers)

  [Figure: merged Map-Reduce plan over the shared input data — a Split(Tuple) operator routes each input tuple to the nested plans inside Map1 … Mapn, Combine(Tuple) feeds the shuffle, and a Demux(Tuple) operator routes tuples to the nested plans inside Reduce1 … Reducem, which write each job's output data; markers (1)–(3) show where the savings above arise]
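  A minimal sketch of the split/demux idea, in Python, with hypothetical nested-plan callables: the merged mapper tags each record's outputs with the owning job, and the merged reducer demultiplexes shuffled groups by that tag.

    def merged_map(record, nested_maps):
        """Run every nested map plan over one scanned record.

        nested_maps: dict of job_id -> map function (record -> list of (k, v)).
        """
        for job_id, map_fn in nested_maps.items():
            for key, value in map_fn(record):
                # Tag the shuffle key with the owning job.
                yield (job_id, key), value

    def merged_reduce(tagged_key, values, nested_reduces):
        """Demux: route each shuffled group to the owning job's reduce plan."""
        job_id, key = tagged_key
        return nested_reduces[job_id](key, values)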

  9. Completion Time by Scheduling Strategy

  Performance in Nova for different enrichment workflows (e.g. de-dup) on news content (SIGMOD'11)

  10. Utilization of Grid Resources (Slot Time)

  11. PigMix: Load Cost Savings

  12. PigMix: Estimating Makespan

  13. Ongoing Work

  • Starvation resistance
  • Account for heterogeneity in workflow sizes
  • Provide soft-deadline guarantees
  • Handle cascading failures
  • Prefer jobs with high load cost (less dilation, high slot-time savings, map-only jobs)
  • Predict workflow runtime and frequency
  • Robustness to inaccuracies in cost estimates
  • Conserve or expend Cloud resources based on deadline requirements and system load
  • Jobs that join/scan multiple input sources

  14. Questions?

  15. Nova Workflow Platform

  • Nova features:
    • Abstraction for complex workflows that consume data
    • Incrementally arriving data (logs, crawls, feeds, ...)
    • Incremental processing of arriving data (see the sketch below)
      • Stateless: shingle every newly-crawled page
      • Stateful: maintain inlink counts as the web grows
    • Scheduling of processing steps
      • Periodic: run the inlink counter once per week
      • Triggered: run the inlink counter after the link extractor
    • Provides provenance, metadata management, incremental processing (e.g. joins), data replication, and transactional guarantees
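  A minimal sketch contrasting the two incremental styles named above, in Python, with hypothetical record shapes: the stateless task processes each new page independently, while the stateful task folds newly extracted links into persistent inlink counts.

    from collections import Counter

    def shingle(page_text, k=4):
        """Stateless step: shingle one newly-crawled page into k-word
        shingles; no state carries over between batches."""
        words = page_text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def update_inlink_counts(state, new_links):
        """Stateful step: fold a batch of newly extracted (src, dst) links
        into persistent inlink counts."""
        state.update(dst for _src, dst in new_links)
        return state

    counts = Counter()   # persisted between workflow runs
    counts = update_inlink_counts(counts, [("a.com", "b.com"), ("c.com", "b.com")])
    print(counts["b.com"])   # 2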

  16. PigMix: Reducer Parallelism

  17. Optimizing for Shared Scan

  • Define a job J (i.e. a MapReduce or Pig job)
    • J scans files f(J) = (F1, …, Fk), with scan time s(Fi) per file
    • Fixed processing cost c(J)
    • d(J) defines a soft deadline for each job
      • Step: d is defined by n pairs (ti, pi) with 0 < ti < ti+1 and pi > pi+1; a job that completes by ti is awarded pi points
      • Linear decay: enforce eventual completion with negative points
  • Cost of a shared scan of jobs J1 and J2:
    c(J1) + c(J2) + ∑_{F ∈ f(J1) ∪ f(J2)} s(F)
  • Maximize points and minimize resources (see the sketch below)
    • Local improvement with simulated annealing and greedy ordering
    • Aggressive batching when load is high
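  A minimal sketch of the cost model and step reward, in Python, with hypothetical Job fields mirroring the definitions above; the batch cost charges each distinct file once, which is exactly the shared-scan saving, and local improvement is a simplified simulated annealing over batch order. A further simplification of the sketch: every job in a batch finishes when its batch does.

    import random
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        files: frozenset   # f(J): files the job scans
        proc_cost: float   # c(J): fixed processing cost
        deadline: list     # step reward: [(t1, p1), (t2, p2), ...], t ascending

    def batch_cost(jobs, scan_time):
        """c(J1) + ... + c(Jn) + sum of s(F) over the UNION of scanned files:
        each shared file is scanned once for the whole batch."""
        shared = frozenset().union(*(j.files for j in jobs))
        return sum(j.proc_cost for j in jobs) + sum(scan_time[f] for f in shared)

    def reward(job, finish_time):
        """Step reward: points of the earliest deadline tier still met."""
        for t, p in job.deadline:
            if finish_time <= t:
                return p
        return 0   # (a linear-decay variant would go negative here)

    def total_points(order, scan_time):
        now = points = 0.0
        for batch in order:
            now += batch_cost(batch, scan_time)
            points += sum(reward(j, now) for j in batch)
        return points

    def anneal(order, scan_time, steps=1000, temp=1.0, cooling=0.995):
        """Local improvement over batch order (needs at least two batches)."""
        best = cur = list(order)
        best_pts = cur_pts = total_points(cur, scan_time)
        for _ in range(steps):
            i, j = random.sample(range(len(cur)), 2)
            cand = list(cur)
            cand[i], cand[j] = cand[j], cand[i]        # swap two batches
            pts = total_points(cand, scan_time)
            if pts >= cur_pts or random.random() < temp:   # accept worse moves early
                cur, cur_pts = cand, pts
                if pts > best_pts:
                    best, best_pts = cand, pts
            temp *= cooling
        return best, best_pts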

  18. Performance Evaluation

  • Experimental setup
    • Nova with the Shared Scan module
    • 200-node Hadoop cluster
    • 128 MB HDFS block size
    • 1 GB RAM per node
    • 640 mapper and 320 reducer slots
  • Shingling workflow (offline content enrichment)
    • De-duplication of news
    • Filter and extract features from content
    • Cluster content by feature and pick one item per cluster
    • Execution of multiple de-dup workflows using different clustering algorithms
  • Scheduling strategies compared
    • Sequential-NoMerge (slower, conserves Grid resources)
    • Concurrent-NoMerge (fast, elastic Grid resources)
    • Merged (fast, conserves Grid resources)
