
CMSC 34702 ML for Cluster Scheduling (1)

Presentation Transcript


  1. CMSC 34702 ML for Cluster Scheduling (1), Junchen Jiang, October 3, 2019

  2. Logistics • Sign up on Piazza • https://piazza.com/class/k15fawsrzma6ow • Choose your paper to present • Paper review format: • Paper summary (three sentences or less about the main idea, approach, or contribution) • Why should we accept the paper? (1-3 sentences on the 1-3 strongest things about the paper) • Why should we not accept the paper? (1-3 sentences on the 1-3 things that would most improve the paper)

  3. MapReduce: Simplified Data Processing on Large Clusters • CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics

  4. Cloud Computing Basics

  5. Scaling up vs. scaling out: origin of cloud computing • Scale-up: high-end servers (Sun Starfire, Enterprise, …; ~$1 million apiece), used by eBay, Amazon, … • Scale-out: many "Commercial Off-The-Shelf" (COTS) computers (Google had ~15,000 of them c. 2004)

  6. Price/Performance Comparison (c. 2004) Higher performance and cheaper! Too good to be true?

  7. Disadvantages of a cluster of COTS nodes? (Figure: a rack of COTS computers vs. a single high-end server, each built from CPU, RAM, and disk)

  8. New problems in distributed/cluster computing • Fault tolerance • Network traffic • Data consistency • Programming complexity • …

  9. Cluster computing needs a software stack. Typical software analytics stack, by layer (Google / Hadoop / Berkeley ecosystems): • Processing: MapReduce / MapReduce / Spark • Database: Bigtable / HBase / Shark • Data management: Google File System (GFS) / Hadoop Distributed File System (HDFS) / Alluxio • Resource management: Borg / YARN / Mesos

  10. MapReduce: Simplified Data Processing on Large Clusters. Cluster computing is popular, but it is hard to write complex, high-performance programs for it. MapReduce is the first system to provide an expressive programming interface that automatically optimizes low-level system details.

  11. Why is parallelization difficult? If the initial state is x=6, y=0, what happens when these threads finish running? Thread 1: void foo(){ x++; y = x; } Thread 2: void bar(){ y++; x += 3; } Multithreading = unpredictability (from https://www.youtube.com/watch?v=-vD6PUdf3Js)
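
A minimal Python sketch (not from the slides, which use C-style pseudocode) of the same two threads; the point is that the interleaving of the four statements, not the code itself, determines the final values:

```python
# Two threads mutate shared x and y; the result depends on interleaving.
import threading

x, y = 6, 0

def foo():
    global x, y
    x += 1      # Thread 1: x++
    y = x       # Thread 1: y = x

def bar():
    global x, y
    y += 1      # Thread 2: y++
    x += 3      # Thread 2: x += 3

t1 = threading.Thread(target=foo)
t2 = threading.Thread(target=bar)
t1.start(); t2.start()
t1.join(); t2.join()

# Depending on how the statements interleave, (x, y) may end up as
# (10, 7), (10, 8), or (10, 10): the program has no single well-defined result.
print(x, y)
```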

  12. Functional programming. (Figure: the imperative threads mutate shared x and y, while a pure function f simply maps inputs X, Y to outputs A, B.) • Imperative style: states can change (not idempotent), too many variable interdependencies • Functional style: no mutable variables, no changing state, no side effects

  13. Key functional programming ops: map & fold. (Figure: map applies f to X, Y, Z independently, producing X', Y', Z'; fold combines X, Y, Z into a single result through repeated applications of f.)
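
A minimal Python sketch of the two primitives in the diagram, with Python's built-in map and functools.reduce standing in for functional map and fold:

```python
from functools import reduce

xs = [1, 2, 3, 4]

# map: apply f to every element independently.
mapped = list(map(lambda v: v * v, xs))          # [1, 4, 9, 16]

# fold (reduce): combine all elements into a single result.
folded = reduce(lambda acc, v: acc + v, xs, 0)   # 10

# Because f has no side effects, each application of f is independent,
# so the map step can safely be run in parallel across elements.
print(mapped, folded)
```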

  14. MapReduce: an instantiation of "map" & "fold". map: (key_1, val_1) → [(key_a, val_11), (key_b, val_12)] and (key_2, val_2) → [(key_b, val_21), (key_c, val_22)]; reduce: (key_a, [val_11]) → (key_a, R([val_11])), (key_b, [val_12, val_21]) → (key_b, R([val_12, val_21])), (key_c, [val_22]) → (key_c, R([val_22])). Example: count word occurrences. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean, et al, OSDI'04

  15. Example: count word occurrences. map: (URL1, "personal computer") → [("personal", 1), ("computer", 1)] and (URL2, "computer science") → [("computer", 1), ("science", 1)]; reduce: → [("personal", 1), ("computer", 2), ("science", 1)]. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean, et al, OSDI'04
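
A toy, single-process sketch of this word-count example. The map_fn/reduce_fn pair mirrors the paper's interface, while run_mapreduce is only an illustrative stand-in for the distributed runtime:

```python
from collections import defaultdict

def map_fn(url, text):
    # (URL, document text) -> list of ("word", 1) pairs
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # ("word", [1, 1, ...]) -> ("word", total count)
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):     # map phase
            groups[k].append(v)             # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]   # reduce phase

docs = [("URL1", "personal computer"), ("URL2", "computer science")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('personal', 1), ('computer', 2), ('science', 1)]
```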

  16. Rationale behind the MapReduce interface: a minimalist approach. (Figure: applications and data-analytics algorithms such as Google Search, machine learning, graph mining, grep, sort, and word counting sit on top of the narrow Map & Reduce interface, which in turn sits on top of the cluster computing system, i.e., the MapReduce system.) With this narrow interface, application developers need not master all the intricacies of resource management & communication. Can you think of another example of the minimalist approach?

  17. What’s the contribution of the MapReduce System? Make it easier to write parallel programs “MapReduce: Simplified Data Processing on Large Clusters”, Jeff Dean, et al, OSDI’04

  18. What’s the contribution of the MapReduce System? Make it easier to write parallel programs An implementation of the interface that achieves high performance • Fault tolerance • Data locality • Load balancing • Straggler mitigation • Consistency • Data integrity “MapReduce: Simplified Data Processing on Large Clusters”, Jeff Dean, et al, OSDI’04

  19. System Architecture “MapReduce: Simplified Data Processing on Large Clusters”, Jeff Dean, et al, OSDI’04

  20. Performance: data locality. Co-locate workers with the data; co-locate reducers with mappers. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean, et al, OSDI'04
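
A hedged sketch of the locality preference, with hypothetical Worker/Split records rather than the paper's code: schedule a map task on a worker whose node already holds a replica of the input split, then fall back to the same rack, and only then accept a remote read.

```python
from collections import namedtuple

Worker = namedtuple("Worker", "host rack")
Split = namedtuple("Split", "replica_hosts replica_racks")

def pick_worker(split, idle_workers):
    for w in idle_workers:
        if w.host in split.replica_hosts:    # data already on this node
            return w
    for w in idle_workers:
        if w.rack in split.replica_racks:    # at least rack-local
            return w
    return idle_workers[0] if idle_workers else None   # remote read as last resort

workers = [Worker("n1", "r1"), Worker("n7", "r2")]
split = Split(replica_hosts={"n7"}, replica_racks={"r2"})
print(pick_worker(split, workers))   # Worker(host='n7', rack='r2')
```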

  21. Performance: Speeding up “Reducer” with “Combiner” When can “Combiner” help? “MapReduce: Simplified Data Processing on Large Clusters”, Jeff Dean, et al, OSDI’04
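
One answer: a combiner helps when the reduce function is commutative and associative, as summing counts is. Each map task can then pre-aggregate its own output locally, so far fewer intermediate pairs cross the network to the reducers. An illustrative sketch (not the paper's code):

```python
from collections import Counter

def map_with_combiner(url, text):
    local_counts = Counter(text.split())    # combine on the map side
    return list(local_counts.items())       # e.g. ("the", 3) instead of three ("the", 1) pairs

print(map_with_combiner("URL1", "the cat and the hat and the bat"))
# [('the', 3), ('cat', 1), ('and', 2), ('hat', 1), ('bat', 1)]
```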

  22. Fault tolerance: what if a map worker fails? Re-execute its in-progress and completed map tasks. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean, et al, OSDI'04

  23. Fault tolerance: what if a reduce worker fails? Re-execute only its in-progress reduce tasks. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean, et al, OSDI'04
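
A hedged sketch of the two re-execution rules, with made-up task records rather than Google's data structures: map output lives on the worker's local disk, so a dead worker invalidates even its completed map tasks, whereas completed reduce output already sits in the global file system and is safe.

```python
def handle_worker_failure(worker, map_tasks, reduce_tasks):
    for t in map_tasks:
        if t["worker"] == worker and t["state"] in ("in_progress", "completed"):
            t["state"], t["worker"] = "idle", None   # re-execute elsewhere
    for t in reduce_tasks:
        if t["worker"] == worker and t["state"] == "in_progress":
            t["state"], t["worker"] = "idle", None   # completed reduces are safe

map_tasks = [{"id": 0, "worker": "w1", "state": "completed"},
             {"id": 1, "worker": "w2", "state": "in_progress"}]
reduce_tasks = [{"id": 0, "worker": "w1", "state": "completed"},
                {"id": 1, "worker": "w1", "state": "in_progress"}]
handle_worker_failure("w1", map_tasks, reduce_tasks)
# Map task 0 and reduce task 1 go back to "idle"; reduce task 0 stays completed.
```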

  24. Fault tolerance: what if the master fails? Expose the failure to the user, who can retry the computation. "MapReduce: Simplified Data Processing on Large Clusters", Jeff Dean, et al, OSDI'04

  25. MapReduce summary • A minimalist approach • Many problems are easily expressible with the MapReduce primitives • Greatly simplifies fault tolerance & performance optimization • (Almost) completely transparent fault tolerance at a large scale • Dramatically eases the burden on programmers • Still needs users to step in in some cases…

  26. “Hyperparameters” of a cluster/cloud job How many physical machines? How much RAM, CPUs per machine? How much disk space? How much network bandwidth? …
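
One such setting is simply a point in a multi-dimensional space; the concrete values below are only illustrative:

```python
# An illustrative cloud-job configuration (values are made up for this example).
config = {
    "num_machines": 16,
    "machine_type": "r3.8xlarge",    # implies vCPUs, RAM, and local disk per machine
    "vcpus_per_machine": 32,
    "ram_gb_per_machine": 244,
    "disk_gb_per_machine": 640,
    "network_gbps": 10,
}
```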

  27. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. Cloud performance is sensitive to configurations, but there is no existing way to pick configurations optimally, quickly, and adaptively for any cloud job. CherryPick is the first systematic technique to meet these requirements, by modeling the performance-configuration relationship with a blackbox ML technique.

  28. Large space of cloud configurations: providers × machine types × cluster sizes • Amazon AWS: r3.8xlarge, i2.8xlarge, m4.8xlarge, c4.8xlarge, r4.8xlarge, c3.8xlarge, … • Microsoft Azure: A0, A1, A2, A3, A11, A12, D1, D2, D3, … • Google Cloud: n1-standard-4, n1-highmem-2, n1-highcpu-4, … Each provider also offers tens of cluster-size options. "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17

  29. Good configuration = high performance & low cost. (Figure: performance and cost across 66 configurations.) "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17

  30. Complex performance-configuration relationship “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17

  31. How do we find the best cloud configuration, i.e., the one that minimizes cost subject to a performance constraint, for a recurring job given its representative workload? "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17
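
Written out, and roughly following the paper's problem setup, the search is a constrained minimization, where x is a candidate configuration, P(x) its price per unit time, and T(x) the job's running time on x:

```latex
% Minimize total cost subject to a running-time constraint.
\[
\begin{aligned}
  \min_{x}\quad & C(x) = P(x)\, T(x) \\
  \text{s.t.}\quad & T(x) \le T_{\max}
\end{aligned}
\]
```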

  32. Key metrics of success “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17

  33. Strawmen • Exhaustive search: high overhead • Coordinate search: optimize one configuration dimension at a time (CPU, RAM, disk, network, etc.); not accurate, because performance/cost curves are non-convex across resources • Ernest [NSDI'16]: learn a model for each job type; not adaptive. "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17

  34. Why CherryPick “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17

  35. Basic idea: blackbox modeling. Loop: start with any config → run the config → update the blackbox config-performance model → choose the next config → when done, return the best config. "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17

  36. How about a model that accurately predicts performance for any given configuration? Insight: no need to be accurate everywhere. All the search loop needs is the top of the ranking (which configuration is better). "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17
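
A hedged Python sketch of this loop; run_job, update_model, and pick_next are placeholders for the pieces described on the surrounding slides (the workload run and the Bayesian-optimization machinery), not CherryPick's actual code:

```python
def search(candidates, run_job, update_model, pick_next, budget):
    """Return the cheapest configuration found within `budget` trial runs."""
    observations = []                     # [(config, cost), ...]
    config = candidates[0]                # start from any configuration
    for _ in range(budget):
        cost = run_job(config)            # run the representative workload
        observations.append((config, cost))
        model = update_model(observations)                    # refit the blackbox model
        config = pick_next(model, candidates, observations)   # acquisition step
    return min(observations, key=lambda o: o[1])[0]           # best config actually tried
```

In this view the model only has to rank configurations well enough to steer the search toward a good one, which is a much weaker requirement than predicting every configuration's performance accurately.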

  37. Bayesian Optimization “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17

  38. Bayesian Optimization “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17

  39. How to pick the next configuration? “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17
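
One standard answer in Bayesian optimization is an acquisition function such as expected improvement (EI): prefer configurations whose predicted cost is low (exploitation) or highly uncertain (exploration). A hedged sketch of EI for cost minimization, not CherryPick's exact implementation:

```python
import math

def expected_improvement(mean, std, best_cost):
    # Expected amount by which a config with predicted cost ~ N(mean, std)
    # improves on the best cost observed so far.
    if std == 0:
        return 0.0
    z = (best_cost - mean) / std
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best_cost - mean) * cdf + std * pdf   # exploitation + exploration terms

def pick_next(predictions, best_cost):
    # predictions: {config: (predicted_mean_cost, predicted_std)}
    return max(predictions,
               key=lambda c: expected_improvement(predictions[c][0],
                                                  predictions[c][1],
                                                  best_cost))

preds = {"4x r3.large": (90.0, 5.0), "8x c4.xlarge": (100.0, 30.0)}
print(pick_next(preds, best_cost=95.0))   # the high-uncertainty config wins here
```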

  40. Why does CherryPick work? "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17

  41. Does the "blackbox" behave reasonably? "CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics", Omid Alipourfard, et al, NSDI'17

  42. Conclusion “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”, Omid Alipourfard, et al, NSDI’17

  43. Reminder • Sign up on Piazza (you need to post paper summaries there) • Project proposal idea due in 12 days
