
MapReduce and Spark for Distributed Data Analytics

This lecture, from a course on topics in network systems, covers MapReduce and Spark for distributed data analytics, including MapReduce cluster scheduling, delay scheduling, and the Spark programming model.


Presentation Transcript


  1. CS434/534: Topics in Network Systems. MapReduce Cluster Scheduling; Delay Scheduling; Distributed Processing Beyond MapReduce: Spark. Yang (Richard) Yang, Computer Science Department, Yale University, 208A Watson. Email: yry@cs.yale.edu, http://zoo.cs.yale.edu/classes/cs434/. Acknowledgement: slides contain content from conference presentations by the authors of Delay Scheduling and Spark.

  2. Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow programming (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • Data analytics programming using Spark

  3. Admin • Project meetings • Thursday: 1:30-3:00 • Friday: 1:30-3:30 • Exam: date? • Remaining topics to cover?

  4. Recap: Basic (Google) DA Software Architecture • How to store a huge amount of data? -> Google File System (GFS) & BigTable • How to process and extract something from the data? -> MapReduce • How to provide availability and consistency? -> Paxos • How to preserve data privacy?

  5. Recap: GFS Architecture [Diagram: a GFS Master holding metadata; data chunks 1, 2, 3 replicated across Chunkserver1-Chunkserver5]

  6. Recap: GFS Insights / Key Features • Storage • Separation of data and metadata • Huge files -> 64 MB chunks -> fewer chunks -> reduced client-master interaction and smaller metadata • Multiple replicas to ensure availability • Read/write • Read: multiple replicas improve read throughput; the master can choose the nearest replica for a client read • Write: separation of data flow (pipelining) and control flow (ordering and commit) achieves high concurrency while still preserving ordering

  7. Recap: MapReduce Processing Example • A toy problem: word count • ~10 billion documents • Average document size is 20 KB => 10 billion docs = 200 TB • Sequential version: for each document d, for each line in d, for each word w in line: word_count[w]++ • Parallel version: for each chunk c (in parallel), for each document d in c, for each line l in d, for each word w in l: word_count[w]++ • Problem: need to merge the results from each chunk.

  8. Processing Example: Parallel Solution • The results from each chunk are partitioned (shuffled) to multiple mergers (reducers). [Diagram: partial counts from processors P1..PM, e.g., (the: 1, yale: 2, ...) and (happy: 10, the: 10, ...), shuffled to Merger 1..Merger M]

  9. Recap: Generic MapReduce Programming Model • Inspired by the map and reduce operations commonly used in functional programming languages such as LISP • Map(k, v) --> (k', v') • Group the (k', v') pairs by k' • Reduce(k', v'[]) --> v''
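Not on the original slides: a minimal single-machine Python sketch of this model, using the word-count example from slide 7. The function names (map_fn, reduce_fn, run_mapreduce) are illustrative only; a real MapReduce runtime distributes these phases across a cluster.

from collections import defaultdict

def map_fn(doc_id, text):
    # Map(k, v) --> (k', v'): emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce(k', v'[]) --> v'': sum the partial counts for one word
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):          # map phase
            groups[k2].append(v2)            # group (shuffle) values by k'
    return [reduce_fn(k2, vs) for k2, vs in groups.items()]   # reduce phase

docs = [("d1", "the yale the"), ("d2", "happy yale")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# -> [('the', 2), ('yale', 2), ('happy', 1)]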

  10. Map + Reduce • The figure assumes a single mapper and a single reducer. What if there are multiple mappers and multiple reducers?

  11. Simple Exercise: Stat • Compute sales statistics for a giant retailer, say Walmart: total sale of each category • Per sale transaction: <sale_id> -> time, item id, category, unit price, #items, city, state, ... • Mapper? • Reducer?
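Not from the slides: one possible answer sketch in the same Python style, assuming each input record carries (time, item_id, category, unit_price, num_items, ...). These functions can be dropped into the run_mapreduce sketch above.

def sales_map(sale_id, record):
    # record: (time, item_id, category, unit_price, num_items, city, state, ...)
    _, _, category, unit_price, num_items = record[:5]
    yield (category, unit_price * num_items)     # emit (category, revenue of this sale)

def sales_reduce(category, amounts):
    return category, sum(amounts)                # total sale per category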

  12. Let's use MapReduce to help Google Map • India • We want to compute the average temperature for each state (a code sketch follows slide 19 below)

  13. Let's use MapReduce to help Google Map • We want to compute the average temperature for each state

  14. Let's use MapReduce to help Google Map • Example outputs on the map: MP: 75, CG: 72, OR: 72

  15. Let's use MapReduce to help Google Map

  16. Let's use MapReduce to help Google Map

  17. Let's use MapReduce to help Google Map

  18. Let's use MapReduce to help Google Map

  19. Let's use MapReduce to help Google Map
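Not from the slides: a sketch of the average-temperature job in the same Python style, assuming each input record is (station_id, (state, temperature)). Emitting (sum, count) pairs lets the reducer compute the per-state average.

def temp_map(station_id, record):
    # record: (state, temperature) reading from one weather station
    state, temperature = record
    yield (state, (temperature, 1))              # emit (state, (temp, count))

def temp_reduce(state, pairs):
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    return state, total / count                  # average temperature for the state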

  20. Exercise: PageRank [Sergey Brin and Larry Page, 1998] • Problem: many Web pages may contain the searched keyword (e.g., Yale); how do we rank the pages when displaying search results? • Basic PageRank™ idea: the 10-90 rule • 10% of the time the surfer types in (jumps to) a random page • 90% of the time the surfer clicks (follows) a random link on the current page • PageRank ranks pages according to the frequencies (which we call the PageRanks) with which the surfer visits the pages

  21. Round-Based PageRank • Initialize arbitrary page ranks • Iterative algorithm to simulate visit redistribution • Assume the current-round page rank of page pi is PRc(pi) • Update for the next round: PRc+1(pi) = 0.1/N + 0.9 * sum over pages pj that link to pi of PRc(pj)/outdegree(pj), where N is the total number of pages
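Not on the slide: a small sequential Python sketch of this round-based update, with the graph represented as url -> list of outlinks. It assumes every page has at least one outlink and every linked page is also a key of the dict (dangling pages are ignored for simplicity).

def pagerank_rounds(outlinks, rounds=20):
    # outlinks: dict url -> list of urls it links to
    N = len(outlinks)
    pr = {u: 1.0 / N for u in outlinks}          # arbitrary initial ranks
    for _ in range(rounds):
        nxt = {u: 0.1 / N for u in outlinks}     # 10%: surfer jumps to a random page
        for u, links in outlinks.items():
            for v in links:
                nxt[v] += 0.9 * pr[u] / len(links)   # 90%: follows a random link on u
        pr = nxt
    return pr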

  22. Exercise: PageRank • What does a MapReduce job for PageRank look like?

  23. Simple Design 1 • State per page: url -> (pr, n, outlinks)
      map( key: url, value: (pr, n, outlinks) )
          for each outlink in outlinks
              emit( key: outlink, value: pr/n )
      reducer( key: url, value: prs )
          pra = 0
          for each pr in prs
              pra += pr
          pr = 0.1 / N + 0.9 * pra
          emit( key: url, value: (pr, n, outlinks) )   // problem: the reducer never actually receives n and outlinks

  24. Simple Design (revision) • State per page: url -> (pr, n, outlinks)
      map( key: url, value: (pr, n, outlinks) )
          for each outlink in outlinks
              emit( key: outlink, value: pr/n )
          emit( key: url, value: (n, outlinks) )       // also pass along the page's own structure
      reducer( key: url, value: prs )
          pra = 0
          for each item in prs
              if is_num(item)
                  pra += item
              else
                  (n, outlinks) = item
          pr = 0.1 / N + 0.9 * pra
          emit( key: url, value: (pr, n, outlinks) )
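Not from the slides: the revised design expressed in the single-machine Python sketch used earlier (the tagging of the page's own structure with a "state" marker is an illustrative choice). Each call to run_mapreduce(pages, pr_map, make_pr_reduce(N)) performs one PageRank round; feeding its output back in as the next round's input iterates.

def pr_map(url, state):
    pr, n, outlinks = state
    for out in outlinks:
        yield (out, pr / n)                      # rank contribution to each outlink
    yield (url, ("state", n, outlinks))          # also pass the page's own structure along

def make_pr_reduce(N):
    def pr_reduce(url, items):
        pra, n, outlinks = 0.0, 0, []
        for item in items:
            if isinstance(item, tuple) and item[0] == "state":
                _, n, outlinks = item            # the page's structure
            else:
                pra += item                      # a rank contribution
        return url, (0.1 / N + 0.9 * pra, n, outlinks)
    return pr_reduce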

  25. Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • MapReduce programming model • MapReduce cluster scheduling

  26. Basic MapReduce Architecture: Per-Job Scheduling • Two core components for each job • JobTracker: assigns tasks to different workers • TaskTracker: executes map and reduce programs [Diagram: one JobTracker coordinating five TaskTrackers, alongside the GFS Master and Chunkserver1-Chunkserver5]

  27. Basic MapReduce Architecture: Per-Job Scheduling [Diagram: the JobTracker assigns some TaskTrackers map tasks and others reduce tasks, co-located with the GFS Chunkservers]

  28. MapReduce Architecture: Bigger Picture • Multiple jobs run concurrently in the same cluster • A master coordinates resource allocation among jobs • Each node in the cluster has a fixed number of map slots and reduce slots in which it can run tasks • Typically, administrators set the number of slots to one or two per core • Each cluster node sends a heartbeat every few seconds to the master to report its number of free map and reduce slots • The master assigns tasks from jobs to free slots
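Not on the slide: a skeletal Python sketch of the master's bookkeeping described above. The names (Node, on_heartbeat, pick_task) are illustrative, and the policy for choosing which job's task to run is deliberately left abstract; it is the subject of the following slides.

class Node:
    def __init__(self, node_id, map_slots=2, reduce_slots=2):
        self.node_id = node_id
        self.free_map_slots = map_slots          # typically 1-2 slots per core
        self.free_reduce_slots = reduce_slots

def on_heartbeat(node, jobs, pick_task):
    # Called when a node's heartbeat reports free slots; 'pick_task' is the
    # scheduling policy (see the scheduling algorithms below).
    assignments = []
    while node.free_map_slots > 0:
        task = pick_task(jobs, node, kind="map")
        if task is None:
            break
        node.free_map_slots -= 1
        assignments.append(task)
    # reduce slots are handled analogously
    return assignments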

  29. Discussion • What should the master consider when assigning slots to jobs?

  30. Discussion: Fairness Scheduling Realization • Assume a total of S slots, and assign each job j a fair share of Fj slots • Assume event-driven programming: each free slot is assigned as it becomes available • How might you realize this policy? [Figure: jobs 1..N with fair shares F1, F2, ..., FN]

  31. Virtual Clock: A Conceptual Model • Basic idea • Compute the finishing time of each job according to its share of the resource, imagining that it runs on dedicated infrastructure • Sort jobs according to the finishing times computed above [Figure: jobs with shares F1..FN and remaining work K1..KN]
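Not on the slide: a minimal Python sketch of the virtual-clock idea, assuming each job j has a fair share of Fj slots and Kj remaining unit-length tasks; jobs are ordered by the finish time they would have on their dedicated share.

def virtual_finish_time(job):
    # If job j ran alone on its fair share of Fj slots, its Kj remaining
    # unit-length tasks would finish after roughly Kj / Fj slot-rounds.
    return job.remaining_tasks / job.fair_share

def pick_job(jobs):
    # Give the next free slot to the runnable job with the earliest
    # virtual finishing time (i.e., jobs served in sorted order).
    runnable = [j for j in jobs if j.remaining_tasks > 0]
    return min(runnable, key=virtual_finish_time, default=None)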

  32. Scheduling Alg 1 [Figure: algorithm listing, annotated with which steps provide fairness and which provide locality]

  33. Issue of Alg 1: Head-of-Line Scheduling (Low Locality for Small Jobs) • Discussion: intuition for why locality is low • Discussion: any small revision to improve locality?

  34. Scheduling Alg 2
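The original slide is a figure. Not from the transcript: a sketch of the delay-scheduling idea from the Zaharia et al. paper this lecture draws on: when the head-of-line job has no local task on the node that freed a slot, it is skipped, but only up to D times.

def delay_schedule(sorted_jobs, node, D):
    # sorted_jobs: jobs in fair-sharing order (head of line first);
    # each job tracks skip_count, how many times it has been passed over.
    for job in sorted_jobs:
        if not job.has_pending_tasks():
            continue
        if job.has_local_task(node):
            job.skip_count = 0
            return job.pop_local_task(node)      # preferred: node-local task
        if job.skip_count >= D:
            return job.pop_any_task()            # waited long enough: run non-local
        job.skip_count += 1                      # otherwise, skip this job for now
    return None                                  # no task launched for this slot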

  35. Locality Benefit of Delay • Assume: • Job j is at the top of the sorted list • pj = |Pj|/M: fraction of machines with a chunk for job j • All tasks are homogeneous • Probability that j does not get a local slot in D tries: (1 - pj)^D • Ex: Assume a job has data on 10% of the machines and D = 40. Probability of launching a local task: 1 - (1 - 0.1)^40 ≈ 0.985

  36. Computing D • Assume • The cluster has M nodes, each with L slots • A job has N tasks • Each chunk is replicated R times • Goal: achieve a locality target lambda • Choose D so that the expected locality l(D) for an N-task job is at least lambda

  37. Computing D • Consider j with N, N-1, ..., 1 tasks left to launch; assume the current number of remaining tasks is K • Probability that a random node has no chunk for the job (when R = 1): 1 - K/M • With R random replica copies, probability that a node has no chunk: (1 - K/M)^R • Locality probability of one try: pj = 1 - (1 - K/M)^R • Locality probability after D tries: 1 - (1 - pj)^D = 1 - (1 - K/M)^(R*D)

  38. Computing D • l(D) = (1/N) * sum_{K=1..N} [1 - (1 - K/M)^(R*D)] = 1 - (1/N) * sum_{K=1..N} (1 - K/M)^(R*D) >= 1 - (1/N) * sum_{K=1..N} e^(-R*D*K/M) >= 1 - (1/N) * e^(-R*D/M) / (1 - e^(-R*D/M)) • Setting l(D) >= lambda gives D >= (M/R) * ln( (1 + (1 - lambda)*N) / ((1 - lambda)*N) )

  39. Numerical Result for D • lambda = 0.95, N = 20, R = 3 => D/M = 0.23 • Time for a local task to launch • Each task takes T sec • Total S (= L*M) slots • Total slot arrivals per second: S/T = L*M/T • Time to wait for D slots: D / (S/T) = 0.23*T/L • Assume L = 8 => wait ≈ 0.029T, i.e., about 3% of one task's length
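Not on the slide: the arithmetic behind these numbers as a small Python check, assuming the bound derived on the previous slide.

import math

lam, N, R, L = 0.95, 20, 3, 8

# D/M from the bound: D >= (M/R) * ln( (1 + (1 - lam)*N) / ((1 - lam)*N) )
d_over_m = (1.0 / R) * math.log((1 + (1 - lam) * N) / ((1 - lam) * N))
print(round(d_over_m, 2))       # 0.23

# With S = L*M slots, each freeing every T seconds, slots free at rate L*M/T,
# so waiting for D = 0.23*M slots takes D / (L*M/T) = 0.23*T/L seconds.
print(round(d_over_m / L, 3))   # 0.029 -> about 3% of one task length T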

  40. Alg 3: Hadoop Fair Scheduler • Typically a cluster schedules resources not at the job level, but at the level of pools (e.g., production, accounting)

  41. Hadoop Fair Scheduler (Alg 3)

  42. Benefits of MapReduce Programming for DA • By factoring out commonly required components of parallel DA systems, it enables application-specific code to remain concise • Very simple, clean design

  43. Discussion: MapReduce Programming for DA • What types of applications may not work well using MapReduce (or, more broadly, what limitations of the MR programming model have we seen so far)?

  44. Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow programming (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • Data analytics programming using Spark

  45. Motivation: Performance • MapReduce does not fare well for applications that reuse a particular data set (working set) across multiple parallel operations (why?) • iterative algorithms (e.g., PageRank) • interactive applications • Discussion: How might you address this?

  46. Motivation: Usability • The MapReduce programming abstraction is too coarse-grained (only two basic APIs):
      url -> (pr, n, outlinks)
      map( key: url, value: (pr, n, outlinks) )
          for each outlink in outlinks
              emit( key: outlink, value: pr/n )
          emit( key: url, value: (n, outlinks) )
      reducer( key: url, value: pr_or_outlinks )
          pra = 0
          for each item in pr_or_outlinks
              if is_pr( item )
                  pra += item
              else
                  (n, outlinks) = item
          pr = 0.1 / N + 0.9 * pra
          emit( key: url, value: (pr, n, outlinks) )

  47. Spark • Improves MapReduce efficiency: in-memory, multi-step, coarse-grained computation graphs using high-level operators • Claims up to 100x faster • Improves MapReduce usability: richer APIs in Scala, Java, Python • Claims 2-5x less code

  48. Spark Programming Basic Structure • From allowing a job to specify only two steps (map, reduce) to multiple processing steps • Processing units are called Resilient Distributed Datasets (RDDs) • An RDD is a collection of Java or Python objects partitioned across a cluster • A collection of statically typed objects parameterized by an element type, e.g., RDD[Int] is an RDD of integers • Two ways of creating RDDs • From a file (e.g., HDFS file chunks) • Transforming an existing RDD (processing steps) • Transformations embed in a program naturally

  49. Spark Example
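The example on the original slide is a figure not captured in the transcript. Not from the slides: a PySpark sketch of the kind of example commonly used in Spark talks (interactive log mining); the HDFS path is a placeholder, and the "timeout" query is illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="LogMining")

# Build a base RDD from a file (path is a placeholder), then transform it.
lines = sc.textFile("hdfs://...")                        # RDD of strings
errors = lines.filter(lambda l: l.startswith("ERROR"))   # transformation (lazy)
errors.persist()                                         # keep the working set in memory

# Actions trigger computation; later queries reuse the in-memory RDD.
print(errors.count())
print(errors.filter(lambda l: "timeout" in l).count())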

  50. Transformations/Actions • Some operations, such as join, are only available on RDDs of key-value pairs.
