This lecture, from a course on topics in network systems, covers MapReduce and Spark for distributed data analytics, including cluster scheduling, delay scheduling, and fine-grained dataflow programming.
CS434/534: Topics in Network Systems
MapReduce Cluster Scheduling; Delay Scheduling; Distributed Processing Beyond MapReduce: Spark
Yang (Richard) Yang
Computer Science Department, Yale University
208A Watson
Email: yry@cs.yale.edu
http://zoo.cs.yale.edu/classes/cs434/
Acknowledgement: slides contain content from conference presentations by authors of Delay Scheduling and Spark.
Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow programming (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • Data analytics programming using Spark
Admin • Project meetings • Thursday: 1:30-3:00 • Friday: 1:30-3:30 • Exam: date? • Remaining topics to cover?
Recap: Basic (Google) DA Software Architecture • How to store a huge amount of data? Google File System & BigTable • How to process and extract something from the data? MapReduce • How to ensure availability and consistency? Paxos • How to preserve data privacy?
Recap: GFS Architecture [Figure: a GFS Master holding metadata, with Chunkservers 1-5 each storing replicated data chunks 1, 2, and 3 spread across the chunkservers]
Recap: GFS Insights Key Features • Storage • Separation of data and metadata • Huge files -> 64 MB per chunk -> fewer chunks • Reduces client-master interaction and metadata size • Multiple replicas to ensure availability • Read/write • Read: multiple replicas improve read throughput; the master can choose the nearest replica for a client's read • Write: separation of data flow (pipelining) and control flow (ordering and commit) to achieve high concurrency while still preserving ordering
Recap: MapReduce Processing Example • A toy problem: word count • ~10 billion documents • Average document size is 20KB => 10 billion docs = 200TB
for each document d
  for each line l in d
    for each word w in line l
      word_count[w]++;
// parallel over chunks
for each chunk c
  for each document d in c
    for each line l in d
      for each word w in line l
        word_count[w]++;
Problem: need to merge results from each chunk.
Processing Example: Parallel Solution [Figure: partial counts from parallel workers P1..PM (e.g., the: 1, yale: 2, ...; happy: 10, the: 10, ...) flowing to Mergers 1..M] The result from each chunk must be partitioned (shuffled) to multiple mergers (reducers):
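To make the shuffle concrete, here is a minimal sketch (not from the slides) of how a partition function can decide which merger receives each word, so that all counts for the same word land on the same merger:

# Minimal sketch (illustrative): hash partitioning for the shuffle.
# M is the number of mergers (reducers).
def partition(word, M):
    return hash(word) % M

# e.g., every worker sends its ("the", count) pairs to merger partition("the", M)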
Recap: Generic MapReduce Programming Model • Inspired by the map and reduce operations commonly used in functional programming languages like LISP
Map(k, v) --> (k', v')
Group (k', v') pairs by k'
Reduce(k', v'[]) --> v''
Input -> Map -> Group -> Reduce -> Output
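As a concrete instance of this model, here is a hedged single-machine sketch of word count (not the course's reference code); an in-memory dict stands in for the group-by-key step, and the function names are illustrative:

# Word count in the Map/Reduce model (a sketch, not the course's reference code).
from collections import defaultdict

def map_fn(doc_id, text):                 # Map(k, v) --> (k', v') pairs
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):              # Reduce(k', v'[]) --> v''
    return sum(counts)

def run(docs):                            # docs: {doc_id: text}
    groups = defaultdict(list)            # plays the role of "group by k'"
    for doc_id, text in docs.items():
        for word, count in map_fn(doc_id, text):
            groups[word].append(count)
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}

# e.g., run({"d1": "the yale the", "d2": "yale"}) -> {"the": 2, "yale": 2}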
Map + Reduce: the figure assumes a single mapper and a single reducer. What if there are multiple mappers and multiple reducers?
Simple Exercise: Stat • Compute sales statistics for a giant retailer, say Walmart: total sales of each category • Per-sale transaction: <sale_id> -> time, item id, category, unit price, #items, city, state, … • Mapper? • Reducer?
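One possible answer, sketched below; the record layout follows the bullet above but the field names are otherwise illustrative:

# Sketch: total sales per category (field names illustrative, not from the slides).
def map_fn(sale_id, record):
    # record: (time, item_id, category, unit_price, num_items, city, state, ...)
    _, _, category, unit_price, num_items, *rest = record
    yield (category, unit_price * num_items)

def reduce_fn(category, amounts):
    yield (category, sum(amounts))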
Let's use MapReduce to help Google Maps: we want to compute the average temperature for each state of India. [Figure: map of India annotated with per-state averages, e.g., MP: 75, CG: 72, OR: 72]
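A possible sketch for this exercise (the record format is an assumption); note that averages cannot simply be added across chunks, so the mapper emits the raw readings and the reducer averages them:

# Sketch: average temperature per state (record format is an assumption).
def map_fn(record_id, record):
    state, temperature = record           # e.g., ("MP", 75)
    yield (state, temperature)

def reduce_fn(state, temperatures):
    temps = list(temperatures)
    yield (state, sum(temps) / len(temps))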
Exercise: PageRank [Sergey Brin and Larry Page, 1998] • Problem: many Web pages may contain the searched keyword (e.g., Yale); how do we rank the pages when displaying search results? • Basic PageRank™ idea • The 10-90 rule • 10% of the time the surfer types in the address of a random page • 90% of the time the surfer clicks (follows) a random link on the current page • PageRank ranks pages according to the frequencies (which we call the pageranks) with which the surfer visits the pages
Round-Based PageRank • Initialize arbitrary page ranks • Iterative algorithm to simulate visit redistribution • Assume the current-round page rank of page pi is PRc(pi) • Update for the next round: PRc+1(pi) = 0.1/N + 0.9 * sum over pages pj that link to pi of PRc(pj)/nj, where N is the total number of pages and nj is the number of outlinks of pj
Exercise: PageRank • What does a MapReduce program for PageRank look like?
Simple Design 1
Each record: url -> (pr, n, outlinks)
map( key: url, value: (pr, n, outlinks) )
  for each outlink in outlinks
    emit( key: outlink, value: pr/n )

reduce( key: url, value: prs )
  pra = 0
  for each pr in prs
    pra += pr
  pr = 0.1 / N + 0.9 * pra
  emit( key: url, value: (pr, n, outlinks) )

Problem: the reducer never receives n and outlinks for the url, so it cannot emit them for the next round.
Simple Design (revision)
Each record: url -> (pr, n, outlinks)
map( key: url, value: (pr, n, outlinks) )
  for each outlink in outlinks
    emit( key: outlink, value: pr/n )
  emit( key: url, value: (n, outlinks) )

reduce( key: url, value: items )
  pra = 0
  for each item in items
    if is_num(item)
      pra += item
    else
      (n, outlinks) = item
  pr = 0.1 / N + 0.9 * pra
  emit( key: url, value: (pr, n, outlinks) )
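To see the revised design end to end, here is a runnable single-machine simulation of one PageRank round; the in-memory dictionary stands in for the shuffle, and the example graph is made up for illustration:

# Single-machine simulation of one PageRank round in the revised design (a sketch,
# not the actual Hadoop job); the "groups" dict plays the role of the shuffle.
from collections import defaultdict

def pagerank_round(pages, N):
    # pages: {url: (pr, n, outlinks)}
    groups = defaultdict(list)
    # Map phase
    for url, (pr, n, outlinks) in pages.items():
        for outlink in outlinks:
            groups[outlink].append(pr / n)          # rank contribution
        groups[url].append((n, outlinks))           # pass metadata through
    # Reduce phase
    new_pages = {}
    for url, items in groups.items():
        pra, n, outlinks = 0.0, 0, []
        for item in items:
            if isinstance(item, tuple):
                n, outlinks = item
            else:
                pra += item
        new_pages[url] = (0.1 / N + 0.9 * pra, n, outlinks)
    return new_pages

# Example: three pages, run a few rounds
pages = {"a": (1/3, 2, ["b", "c"]), "b": (1/3, 1, ["c"]), "c": (1/3, 1, ["a"])}
for _ in range(10):
    pages = pagerank_round(pages, N=3)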
Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • MapReduce programming model • MapReduce cluster scheduling
Basic MapReduce Architecture: Per-Job Scheduling • Two core components for each job • JobTracker: assigns tasks to different workers • TaskTracker: executes map and reduce programs [Figure: a JobTracker coordinating TaskTrackers that run alongside the GFS Master and Chunkservers 1-5]
Basic MapReduce Architecture: Per-Job Scheduling [Figure: the JobTracker assigning map and reduce roles to TaskTrackers co-located with the GFS Master and Chunkservers 1-5]
MapReduce Architecture: Bigger Picture • Multiple jobs run concurrently in the same cluster • A master coordinates resource allocation among jobs • Each node in the cluster has a fixed number of map slots and reduce slots in which it can run tasks • Typically, administrators set the number of slots to one or two per core • Each cluster node sends a heartbeat every few seconds to the master to report its number of free map and reduce slots • The master assigns tasks from jobs to free slots
Discussion • What should the master consider when assigning slots to jobs?
Discussion: Fairness Scheduling Realization • Assume a total of S slots, and assign each job j a fair share of Fj slots • Assume event-driven programming: assign each free slot as it becomes available • How may you realize the policy? [Figure: jobs 1..N with fair shares F1, F2, ..., FN]
Virtual Clock: A Conceptual Model • Basic idea • Compute the finishing time of each job according to its share of the resource, imagining that it runs on dedicated infrastructure with that share • Sort jobs according to the finishing times computed above [Figure: each job i shown with its share Fi and demand Ki]
Scheduling Alg 1 [Slide figure: pseudocode for a first scheduling algorithm, annotated with the two goals it trades off, fairness and locality; a hedged reconstruction follows below]
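The slide's pseudocode did not survive this export; below is a hedged reconstruction of what such a naive fair scheduler looks like (give each freed slot to the job farthest below its fair share, whether or not its next task has data on that node). The field names are illustrative:

# Hedged reconstruction of a naive fair scheduler (not the original slide's code):
# on a heartbeat reporting a free slot on a node, give the slot to the job
# farthest below its fair share, regardless of data locality.
def on_free_slot(node, jobs):
    # each job has: running_tasks (int), fair_share (int), pending_tasks (list)
    candidates = [j for j in jobs if j.pending_tasks]
    if not candidates:
        return None
    job = min(candidates, key=lambda j: j.running_tasks / j.fair_share)
    task = job.pending_tasks.pop(0)        # launched here even if non-local
    job.running_tasks += 1
    return (job, task)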
Issue of Alg 1: Head-of-Line Scheduling (Low Locality for Small Jobs) Discussion: intuition for the low locality. Discussion: any small revision to improve locality? (A sketch of one such revision follows below.)
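One such revision is delay scheduling, the subject of this lecture. A hedged sketch (field names illustrative): a job at the head of the line that has no local task for the free slot's node is skipped, but only up to D times, after which it launches non-locally:

# Hedged sketch of delay scheduling (illustrative field names):
# skip a job that cannot launch a local task, but at most D times in a row.
def on_free_slot(node, sorted_jobs, D):
    # sorted_jobs: jobs ordered by how far they are below their fair share
    for job in sorted_jobs:
        task = job.local_pending_task(node)   # pending task with a chunk on node
        if task is not None:                  # (removal from pending list omitted)
            job.skip_count = 0
            job.running_tasks += 1
            return (job, task)
        if job.skip_count >= D and job.pending_tasks:
            job.skip_count = 0                # waited long enough: go non-local
            job.running_tasks += 1
            return (job, job.pending_tasks.pop(0))
        job.skip_count += 1                   # skip this job, try the next one
    return None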
Locality Benefit of Delay • Assume: • Job j is at the top of the sorted list • pj = |Pj|/M: fraction of machines with a chunk for job j • all tasks are homogeneous • Probability that j still has not launched a local task after being skipped D times: (1 - pj)^D • Ex: Assume a job has data on 10% of the machines and D = 40. What is the probability of launching a local task?
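Working the example (a quick check, not shown on the slide): (1 - 0.1)^40 = 0.9^40 ≈ 0.015, so the job launches a local task with probability about 1 - 0.015 ≈ 98.5%.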
Computing D • Assume • The cluster has M nodes, each node has L slots • A job has N tasks • Each chunk is replicated R times • Goal: achieve a locality target λ • Choose D so that the expected locality l(D) for an N-task job is at least λ
Computing D • Consider j as it has N, N-1, …, 1 tasks left to launch; assume the current number of remaining tasks is K • Probability that a random node has no chunk for j (when R = 1): 1 - K/M • With R random replica copies, probability the node still has no chunk: (1 - K/M)^R • Locality probability of one try: pj = 1 - (1 - K/M)^R • Locality probability after D tries: 1 - (1 - pj)^D = 1 - (1 - K/M)^(RD)
Computing D
l(D) = (1/N) Σ_{K=1..N} [1 - (1 - K/M)^(RD)]
     = 1 - (1/N) Σ_{K=1..N} (1 - K/M)^(RD)
     >= 1 - (1/N) Σ_{K=1..N} e^(-RDK/M)
     >= 1 - (1/N) · 1/(e^(RD/M) - 1)
Requiring this bound to be at least λ gives D >= (M/R) ln(1 + 1/(N(1 - λ))).
Numerical Result for D • λ = 0.95, N = 20, R = 3 => D/M = (1/3) ln(1 + 1/(N(1-λ))) = (ln 2)/3 ≈ 0.23 • Time for a local task to launch • Each task takes T sec • Total S (= LM) slots • Slot arrivals per second: S/T = LM/T • Time to wait for D slot arrivals: D / (S/T) = DT/(LM) • Assume L = 8: wait ≈ 0.23M · T/(8M) ≈ 0.03T, i.e., about 3% of a task's duration
Alg 3: Hadoop Fair Scheduler • Typically a cluster schedules resources not at the job level, but at the pool level (e.g., production, accounting)
Benefits of MapReduce Programming for DA • By factoring out commonly required components of parallel DA systems, it enables application-specific code to remain concise • Very simple, clean design
Discussion: MapReduce Programming for DA • What types of applications may not work well using MapReduce (or, what are further limitations of the MR programming model we have seen so far)?
Outline • Admin and recap • Cloud data center (CDC) applications/services • Fine-grained dataflow programming (e.g., Web apps) • Coarse-grained dataflow (e.g., data analytics) • Data storage: Google File System (GFS) • Data analytics programming using MapReduce • Data analytics programming using Spark
Motivation: Performance • MapReduce does not fare well for applications that reuse a particular data set (working set) across multiple parallel operations (why?) • iterative algorithms (e.g., PageRank) • interactive applications Discussion: how might you address this?
Motivation: Usability • MapReduce programming abstraction is too coarse grained (only two basic APIs), so even a simple computation such as PageRank becomes awkward:

url -> pr, n, outlinks
map( key: url, value: (pr, n, outlinks) )
  for each outlink in outlinks
    emit( key: outlink, value: pr/n )
  emit( key: url, value: (n, outlinks) )

reduce( key: url, value: items )
  pra = 0
  for each item in items
    if is_num(item)
      pra += item
    else
      (n, outlinks) = item
  pr = 0.1 / N + 0.9 * pra
  emit( key: url, value: (pr, n, outlinks) )
Spark • Improves MapReduce efficiency: in-memory, multi-step, coarse-grained computation graphs using high-level operators • Claims 100x faster • Improves MapReduce usability: richer APIs in Scala, Java, Python • Claims 2.5 times less code
Spark Programming Basic Structure • From allowing a job to specify only two steps (map, reduce) to multiple processing steps • Processing units are called Resilient Distributed Datasets (RDDs) • An RDD is a collection of Java or Python objects partitioned across a cluster • A collection of statically typed objects parameterized by an element type, e.g., RDD[Int] is an RDD of integers • Two ways of creating RDDs • From a file (e.g., HDFS file chunks) • Transforming an existing RDD (a processing step) • Transformations embed in a program naturally (see the sketch below)
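A small PySpark sketch of this structure (the file path and record layout are assumptions; sc is a SparkContext, the standard PySpark entry point): an RDD is created from a file, transformed through several steps, cached in memory, and reused by more than one action:

# Illustrative PySpark sketch (file path and record layout are assumptions).
lines = sc.textFile("hdfs://namenode/logs/access.log")   # RDD created from a file
errors = lines.filter(lambda line: "ERROR" in line)      # transformation
pairs = errors.map(lambda line: (line.split()[0], 1))    # transformation to key-value pairs
counts = pairs.reduceByKey(lambda a, b: a + b)           # shuffle-based transformation

counts.cache()                 # keep the working set in memory across operations
print(counts.count())          # action: triggers the computation
top = counts.take(10)          # action: reuses the cached RDD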
Transformations/Actions • Some operations, such as join, are only available on RDDs of key-value pairs.
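For example, a small sketch of a key-value join in PySpark (the data is made up for illustration):

# Illustrative key-value join on RDDs (data is made up for the example).
page_ranks = sc.parallelize([("yale.edu", 0.8), ("example.com", 0.2)])
page_titles = sc.parallelize([("yale.edu", "Yale University"), ("example.com", "Example")])

joined = page_ranks.join(page_titles)      # only defined on key-value RDDs
print(joined.collect())                    # [('yale.edu', (0.8, 'Yale University')), ...]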