
MapReduce and Spark for Distributed Data Analytics

This lecture, from a course on topics in network systems, covers MapReduce and Spark for distributed data analytics, including MapReduce cluster scheduling, delay scheduling, and the Spark programming model.


Presentation Transcript


  1. CS434/534: Topics in Network Systems. MapReduce Cluster Scheduling; Delay Scheduling; Distributed Processing Beyond MapReduce: Spark. Yang (Richard) Yang, Computer Science Department, Yale University, 208A Watson. Email: yry@cs.yale.edu, http://zoo.cs.yale.edu/classes/cs434/. Acknowledgement: slides contain content from conference presentations by the authors of Delay Scheduling and Spark.

  2. Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow programming (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • Data analytics programming using Spark

  3. Admin • Project meetings • Thursday: 1:30-3:00 • Friday: 1:30-3:30 • Exam: date? • Remaining topics to cover?

  4. Recap: Basic (Google) DA Software Architecture • How to store a huge amount of data? -> Google File System (GFS) & BigTable • How to process and extract something from the data? -> MapReduce • How to provide availability and consistency? -> Paxos • How to preserve data privacy?

  5. Recap: GFS Architecture [Diagram: a GFS Master holding metadata; data chunks 1, 2, 3 replicated across Chunkserver1-Chunkserver5]

  6. Recap: GFS Insights / Key Features • Storage • Separation of data and metadata • Huge files -> 64 MB chunks -> fewer chunks -> reduced client-master interaction and smaller metadata • Multiple replicas to ensure availability • Read/write • Read: multiple replicas improve read throughput; the master can choose the nearest replica for a client read • Write: separation of data flow (pipelining) and control flow (ordering and commit) achieves high concurrency while still preserving ordering

  7. Recap: MapReduce Processing Example • A toy problem: word count • ~10 billion documents • Average document size is 20 KB => 10 billion docs = 200 TB • Sequential version: for each document d, for each line in d, for each word w in line: word_count[w]++ • Parallel version: for each chunk c (in parallel), for each document d in c, for each line l in d, for each word w in l: word_count[w]++ • Problem: need to merge the results from each chunk.

  8. Processing Example: Parallel Solution • The results from each chunk are partitioned (shuffled) to multiple mergers (reducers). [Diagram: partial counts from processors P1..PM, e.g., (the: 1, yale: 2, ...) and (happy: 10, the: 10, ...), shuffled to Merger 1..Merger M]

  9. Recap: Generic MapReduce Programming Model • Inspired by the map and reduce operations commonly used in functional programming languages such as LISP • Map(k, v) --> (k', v') • Group the (k', v') pairs by k' • Reduce(k', v'[]) --> v''
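Not on the original slides: a minimal single-machine Python sketch of this model, using the word-count example from slide 7. The function names (map_fn, reduce_fn, run_mapreduce) are illustrative only; a real MapReduce runtime distributes these phases across a cluster.

from collections import defaultdict

def map_fn(doc_id, text):
    # Map(k, v) --> (k', v'): emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce(k', v'[]) --> v'': sum the partial counts for one word
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):          # map phase
            groups[k2].append(v2)            # group (shuffle) values by k'
    return [reduce_fn(k2, vs) for k2, vs in groups.items()]   # reduce phase

docs = [("d1", "the yale the"), ("d2", "happy yale")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# -> [('the', 2), ('yale', 2), ('happy', 1)]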

  10. Map + Reduce • The figure assumes a single mapper and a single reducer. What if there are multiple mappers and multiple reducers?

  11. Simple Exercise: Stat • Compute sales statistics for a giant retailer, say Walmart: total sale of each category • Per sale transaction: <sale_id> -> time, item id, category, unit price, #items, city, state, ... • Mapper? • Reducer?
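Not from the slides: one possible answer sketch in the same Python style, assuming each input record carries (time, item_id, category, unit_price, num_items, ...). These functions can be dropped into the run_mapreduce sketch above.

def sales_map(sale_id, record):
    # record: (time, item_id, category, unit_price, num_items, city, state, ...)
    _, _, category, unit_price, num_items = record[:5]
    yield (category, unit_price * num_items)     # emit (category, revenue of this sale)

def sales_reduce(category, amounts):
    return category, sum(amounts)                # total sale per category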

  12. Let's use MapReduce to help Google Map • India • We want to compute the average temperature for each state (a code sketch follows slide 19 below)

  13. Let's use MapReduce to help Google Map • We want to compute the average temperature for each state

  14. Let's use MapReduce to help Google Map • Example outputs on the map: MP: 75, CG: 72, OR: 72

  15. Let's use MapReduce to help Google Map

  16. Let's use MapReduce to help Google Map

  17. Let's use MapReduce to help Google Map

  18. Let's use MapReduce to help Google Map

  19. Let's use MapReduce to help Google Map
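Not from the slides: a sketch of the average-temperature job in the same Python style, assuming each input record is (station_id, (state, temperature)). Emitting (sum, count) pairs lets the reducer compute the per-state average.

def temp_map(station_id, record):
    # record: (state, temperature) reading from one weather station
    state, temperature = record
    yield (state, (temperature, 1))              # emit (state, (temp, count))

def temp_reduce(state, pairs):
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    return state, total / count                  # average temperature for the state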

  20. Exercise: PageRank [Sergey Brin and Larry Page, 1998] • Problem: many Web pages may contain the searched keyword (e.g., Yale); how do we rank the pages when displaying search results? • Basic PageRank™ idea: the 10-90 rule • 10% of the time the surfer types in (jumps to) a random page • 90% of the time the surfer clicks (follows) a random link on the current page • PageRank ranks pages according to the frequencies (which we call the PageRanks) with which the surfer visits the pages

  21. Round-Based PageRank • Initialize arbitrary page ranks • Iterative algorithm to simulate visit redistribution • Assume the current-round page rank of page pi is PRc(pi) • Update for the next round: PRc+1(pi) = 0.1/N + 0.9 * sum over pages pj that link to pi of PRc(pj)/outdegree(pj), where N is the total number of pages
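Not on the slide: a small sequential Python sketch of this round-based update, with the graph represented as url -> list of outlinks. It assumes every page has at least one outlink and every linked page is also a key of the dict (dangling pages are ignored for simplicity).

def pagerank_rounds(outlinks, rounds=20):
    # outlinks: dict url -> list of urls it links to
    N = len(outlinks)
    pr = {u: 1.0 / N for u in outlinks}          # arbitrary initial ranks
    for _ in range(rounds):
        nxt = {u: 0.1 / N for u in outlinks}     # 10%: surfer jumps to a random page
        for u, links in outlinks.items():
            for v in links:
                nxt[v] += 0.9 * pr[u] / len(links)   # 90%: follows a random link on u
        pr = nxt
    return pr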

  22. Exercise: PageRank • What does a MapReduce job for PageRank look like?

  23. Simple Design 1 • State per page: url -> (pr, n, outlinks)
      map( key: url, value: (pr, n, outlinks) )
          for each outlink in outlinks
              emit( key: outlink, value: pr/n )
      reducer( key: url, value: prs )
          pra = 0
          for each pr in prs
              pra += pr
          pr = 0.1 / N + 0.9 * pra
          emit( key: url, value: (pr, n, outlinks) )   // problem: the reducer never actually receives n and outlinks

  24. Simple Design (revision) • State per page: url -> (pr, n, outlinks)
      map( key: url, value: (pr, n, outlinks) )
          for each outlink in outlinks
              emit( key: outlink, value: pr/n )
          emit( key: url, value: (n, outlinks) )       // also pass along the page's own structure
      reducer( key: url, value: prs )
          pra = 0
          for each item in prs
              if is_num(item)
                  pra += item
              else
                  (n, outlinks) = item
          pr = 0.1 / N + 0.9 * pra
          emit( key: url, value: (pr, n, outlinks) )
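Not from the slides: the revised design expressed in the single-machine Python sketch used earlier (the tagging of the page's own structure with a "state" marker is an illustrative choice). Each call to run_mapreduce(pages, pr_map, make_pr_reduce(N)) performs one PageRank round; feeding its output back in as the next round's input iterates.

def pr_map(url, state):
    pr, n, outlinks = state
    for out in outlinks:
        yield (out, pr / n)                      # rank contribution to each outlink
    yield (url, ("state", n, outlinks))          # also pass the page's own structure along

def make_pr_reduce(N):
    def pr_reduce(url, items):
        pra, n, outlinks = 0.0, 0, []
        for item in items:
            if isinstance(item, tuple) and item[0] == "state":
                _, n, outlinks = item            # the page's structure
            else:
                pra += item                      # a rank contribution
        return url, (0.1 / N + 0.9 * pra, n, outlinks)
    return pr_reduce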

  25. Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • MapReduce programming model • MapReduce cluster scheduling

  26. Basic MapReduce Architecture: Per-Job Scheduling • Two core components for each job • JobTracker: assigns tasks to different workers • TaskTracker: executes map and reduce programs [Diagram: one JobTracker coordinating five TaskTrackers, alongside the GFS Master and Chunkserver1-Chunkserver5]

  27. Basic MapReduce Architecture: Per-Job Scheduling [Diagram: the JobTracker assigns some TaskTrackers map tasks and others reduce tasks, co-located with the GFS Chunkservers]

  28. MapReduce Architecture: Bigger Picture • Multiple jobs run concurrently in the same cluster • A master coordinates resource allocation among jobs • Each node in the cluster has a fixed number of map slots and reduce slots in which it can run tasks • Typically, administrators set the number of slots to one or two per core • Each cluster node sends a heartbeat every few seconds to the master to report its number of free map and reduce slots • The master assigns tasks from jobs to free slots
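Not on the slide: a skeletal Python sketch of the master's bookkeeping described above. The names (Node, on_heartbeat, pick_task) are illustrative, and the policy for choosing which job's task to run is deliberately left abstract; it is the subject of the following slides.

class Node:
    def __init__(self, node_id, map_slots=2, reduce_slots=2):
        self.node_id = node_id
        self.free_map_slots = map_slots          # typically 1-2 slots per core
        self.free_reduce_slots = reduce_slots

def on_heartbeat(node, jobs, pick_task):
    # Called when a node's heartbeat reports free slots; 'pick_task' is the
    # scheduling policy (see the scheduling algorithms below).
    assignments = []
    while node.free_map_slots > 0:
        task = pick_task(jobs, node, kind="map")
        if task is None:
            break
        node.free_map_slots -= 1
        assignments.append(task)
    # reduce slots are handled analogously
    return assignments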

  29. Discussion • What should the master consider when assigning slots to jobs?

  30. Discussion: Fairness Scheduling Realization • Assume a total of S slots, and assign each job j a fair share of Fj slots • Assume event-driven programming: each free slot is assigned as it becomes available • How might you realize this policy? [Figure: jobs 1..N with fair shares F1, F2, ..., FN]

  31. Virtual Clock: A Conceptual Model • Basic idea • Compute the finishing time of each job according to its share of the resource, imagining that it runs on dedicated infrastructure • Sort jobs according to the finishing times computed above [Figure: jobs with shares F1..FN and remaining work K1..KN]
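Not on the slide: a minimal Python sketch of the virtual-clock idea, assuming each job j has a fair share of Fj slots and Kj remaining unit-length tasks; jobs are ordered by the finish time they would have on their dedicated share.

def virtual_finish_time(job):
    # If job j ran alone on its fair share of Fj slots, its Kj remaining
    # unit-length tasks would finish after roughly Kj / Fj slot-rounds.
    return job.remaining_tasks / job.fair_share

def pick_job(jobs):
    # Give the next free slot to the runnable job with the earliest
    # virtual finishing time (i.e., jobs served in sorted order).
    runnable = [j for j in jobs if j.remaining_tasks > 0]
    return min(runnable, key=virtual_finish_time, default=None)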

  32. Scheduling Alg 1 [Figure: algorithm listing, annotated with which steps provide fairness and which provide locality]

  33. Issue of Alg 1: Head-of-Line Scheduling (Low Locality for Small Jobs) • Discussion: intuition for why locality is low • Discussion: any small revision to improve locality?

  34. Scheduling Alg 2
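The original slide is a figure. Not from the transcript: a sketch of the delay-scheduling idea from the Zaharia et al. paper this lecture draws on: when the head-of-line job has no local task on the node that freed a slot, it is skipped, but only up to D times.

def delay_schedule(sorted_jobs, node, D):
    # sorted_jobs: jobs in fair-sharing order (head of line first);
    # each job tracks skip_count, how many times it has been passed over.
    for job in sorted_jobs:
        if not job.has_pending_tasks():
            continue
        if job.has_local_task(node):
            job.skip_count = 0
            return job.pop_local_task(node)      # preferred: node-local task
        if job.skip_count >= D:
            return job.pop_any_task()            # waited long enough: run non-local
        job.skip_count += 1                      # otherwise, skip this job for now
    return None                                  # no task launched for this slot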

  35. Locality Benefit of Delay • Assume: • Job j is at the top of the sorted list • pj = |Pj|/M: fraction of machines with a chunk for job j • All tasks are homogeneous • Probability that j does not get a local slot in D tries: (1 - pj)^D • Ex: Assume a job has data on 10% of the machines and D = 40. Probability of launching a local task: 1 - (1 - 0.1)^40 ≈ 0.985

  36. Computing D • Assume • The cluster has M nodes, each with L slots • A job has N tasks • Each chunk is replicated R times • Goal: achieve a locality target lambda • Choose D so that the expected locality l(D) for an N-task job is at least lambda

  37. Computing D • Consider j with N, N-1, ..., 1 tasks left to launch; assume the current number of remaining tasks is K • Probability that a random node has no chunk for the job (when R = 1): 1 - K/M • With R random replica copies, probability that a node has no chunk: (1 - K/M)^R • Locality probability of one try: pj = 1 - (1 - K/M)^R • Locality probability after D tries: 1 - (1 - pj)^D = 1 - (1 - K/M)^(R*D)

  38. Computing D • l(D) = (1/N) * sum_{K=1..N} [1 - (1 - K/M)^(R*D)] = 1 - (1/N) * sum_{K=1..N} (1 - K/M)^(R*D) >= 1 - (1/N) * sum_{K=1..N} e^(-R*D*K/M) >= 1 - (1/N) * e^(-R*D/M) / (1 - e^(-R*D/M)) • Setting l(D) >= lambda gives D >= (M/R) * ln( (1 + (1 - lambda)*N) / ((1 - lambda)*N) )

  39. Numerical Result for D • lambda = 0.95, N = 20, R = 3 => D/M = 0.23 • Time for a local task to launch • Each task takes T sec • Total S (= L*M) slots • Total slot arrivals per second: S/T = L*M/T • Time to wait for D slots: D / (S/T) = 0.23*T/L • Assume L = 8 => wait ≈ 0.029T, i.e., about 3% of one task's length
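Not on the slide: the arithmetic behind these numbers as a small Python check, assuming the bound derived on the previous slide.

import math

lam, N, R, L = 0.95, 20, 3, 8

# D/M from the bound: D >= (M/R) * ln( (1 + (1 - lam)*N) / ((1 - lam)*N) )
d_over_m = (1.0 / R) * math.log((1 + (1 - lam) * N) / ((1 - lam) * N))
print(round(d_over_m, 2))       # 0.23

# With S = L*M slots, each freeing every T seconds, slots free at rate L*M/T,
# so waiting for D = 0.23*M slots takes D / (L*M/T) = 0.23*T/L seconds.
print(round(d_over_m / L, 3))   # 0.029 -> about 3% of one task length T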

  40. Alg 3: Hadoop Fair Scheduler • Typically a cluster schedules resources not at the job level, but at the level of pools (e.g., production, accounting)

  41. Hadoop Fair Scheduler (Alg 3)

  42. Benefits of MapReduce Programming for DA • By factoring out commonly required components of parallel DA systems, it enables application-specific code to remain concise • Very simple, clean design

  43. Discussion: MapReduce Programming for DA • What types of applications may not work well using MapReduce (or, more broadly, what limitations of the MR programming model have we seen so far)?

  44. Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow programming (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • Data analytics programming using Spark

  45. Motivation: Performance • MapReduce does not fare well for applications that reuse a particular data set (working set) across multiple parallel operations (why?) • iterative algorithms (e.g., PageRank) • interactive applications • Discussion: How might you address this?

  46. Motivation: Usability • The MapReduce programming abstraction is too coarse-grained (only two basic APIs):
      url -> (pr, n, outlinks)
      map( key: url, value: (pr, n, outlinks) )
          for each outlink in outlinks
              emit( key: outlink, value: pr/n )
          emit( key: url, value: (n, outlinks) )
      reducer( key: url, value: pr_or_outlinks )
          pra = 0
          for each item in pr_or_outlinks
              if is_pr( item )
                  pra += item
              else
                  (n, outlinks) = item
          pr = 0.1 / N + 0.9 * pra
          emit( key: url, value: (pr, n, outlinks) )

  47. Spark • Improves MapReduce efficiency: in-memory, multi-step, coarse-grained computation graphs using high-level operators • Claims up to 100x faster • Improves MapReduce usability: richer APIs in Scala, Java, Python • Claims 2-5x less code

  48. Spark Programming Basic Structure • From allowing a job to specify only two steps (map, reduce) to multiple processing steps • Processing units are called Resilient Distributed Datasets (RDDs) • An RDD is a collection of Java or Python objects partitioned across a cluster • A collection of statically typed objects parameterized by an element type, e.g., RDD[Int] is an RDD of integers • Two ways of creating RDDs • From a file (e.g., HDFS file chunks) • Transforming an existing RDD (processing steps) • Transformations embed in a program naturally

  49. Spark Example
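The example on the original slide is a figure not captured in the transcript. Not from the slides: a PySpark sketch of the kind of example commonly used in Spark talks (interactive log mining); the HDFS path is a placeholder, and the "timeout" query is illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="LogMining")

# Build a base RDD from a file (path is a placeholder), then transform it.
lines = sc.textFile("hdfs://...")                        # RDD of strings
errors = lines.filter(lambda l: l.startswith("ERROR"))   # transformation (lazy)
errors.persist()                                         # keep the working set in memory

# Actions trigger computation; later queries reuse the in-memory RDD.
print(errors.count())
print(errors.filter(lambda l: "timeout" in l).count())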

  50. Transformations/Actions • Some operations, such as join, are only available on RDDs of key-value pairs.
