
Distributed Systems CS 15-440

Explore modern programming models like MapReduce, Pregel, and GraphLab that relieve programmers from the complexities of developing distributed programs. Learn about the MapReduce analytics engine: its basics, fault tolerance, combiner functions, and task and job scheduling.


Presentation Transcript


  1. Distributed Systems CS 15-440 • Programming Models - Part III • Lecture 16, Oct 24, 2014 • Mohammad Hammoud

  2. Today… • Last Session • Programming Models - Part II: MPI • Today’s Session • Programming Models - Part III: MapReduce

  3. Objectives • Discussion on Programming Models • MapReduce, Pregel and GraphLab • Message Passing Interface (MPI) • Types of Parallel Programs • Traditional models of parallel programming • Parallel computer architectures • Why parallelize our programs?

  4. The MapReduce Analytics Engine • MapReduce Basics • A Closer Look • Fault-Tolerance • Combiner Functions • Task & Job Scheduling

  5. On the Verge of A Disruptive Century: Breakthroughs • Ubiquitous Computing • Gene Sequencing and Biotechnology • Smaller, Faster, Cheaper Sensors • Faster Communication

  6. A Common Theme is Data

  7. We Live in a World of Data… • 72.9 Items Ordered/s @ Amazon • 24 PB/Day @ Google • 50 Million Tweets/Day • 2.9 Million Emails/s

  8. What Do We Do With Data? We want to do these seamlessly...

  9. Modern Programming Models • Recently, modern programming models were developed to: • Relieve programmers from concerns with many of the difficult aspects of developing distributed programs • Allow programmers to focus on ONLY the sequential portions of their applications’ algorithms • Examples of modern programming models • Hadoop MapReduce • Google’s Pregel • CMU’s GraphLab

  11. Hadoop MapReduce • MapReduce is one of the successful realizations of large-scale “data-parallel” programming frameworks • Hadoop is an open source implementation of Google’s MapReduce • Hadoop incorporates two components: • MapReduce • Hadoop Distributed File System (HDFS)

  12. Problem Scope • MapReduce is a programming model for Big Data processing • The power of MapReduce lies in its ability to scale to 100s and 1000s of machines, each with several processor cores • How large an amount of work? • Web-Scale data on the order of 100s of GBs, TBs or PBs • It is likely that the input data set will not fit on a single computer • Hence, a distributed file system (e.g., HDFS) is typically required

  13. Where to Store Big Data? • In a MapReduce cluster, data is distributed to all cluster nodes upon loading • HDFS splits large data files into chunks which are managed by different nodes in the cluster • Even though the file chunks (or blocks) are distributed across several machines, they are managed under the same namespace • [Figure: a large input file is split into chunks; the chunks are spread across Node 1, Node 2, and Node 3, yet remain part of one namespace]

  14. Commodity Clusters • MapReduce is designed to efficiently process Big Data using regular commodity computers • A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core commodity machines • Premise: MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster to solve Big Data problems

  15. Cluster Network • MapReduce assumes a tree-style, master-slave network topology • Nodes are spread over different racks contained in one or many data centers • A salient point is that the bandwidth between two nodes is dependent on their relative locations in the network topology • For example, nodes that are on the same rack will have higher bandwidth between them as opposed to nodes that are off-rack

  16. Computing Units: Tasks • MapReduce divides the workload into multiple independent tasks and automatically schedules them on cluster nodes • The work performed by each task is done in isolation from that of the other tasks • The amount of communication which can be performed by tasks is limited, mainly for scalability and fault-tolerance reasons • The communication overhead required to keep the data on the nodes synchronized at all times would prevent the model from performing reliably and efficiently at large scale

  17. MapReduce: A Systems View • [Figure: the dataset resides in HDFS as blocks; each block becomes an input split consumed by a Map task; every Map task writes one partition per Reduce task; the partitions are shuffled to the Reduce tasks, merged and sorted there, and the reduced output is written back to HDFS. The Map phase comprises the Map tasks; the Reduce phase comprises the Shuffle, Merge, and Reduce stages]

  18. Data Structure: Keys and Values • The programmer in MapReduce has to specify two functions, the Map function and the Reduce function, that implement the Mapper and the Reducer in a MapReduce program • In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs • The Map and Reduce functions receive and emit (K, V) pairs • [Figure: input splits of (K, V) pairs flow into the Map function, which emits intermediate (K’, V’) pairs; the Reduce function turns these into final (K’’, V’’) pairs]

  19. MapReduce: An Application View • [Figure: a text file containing the lines "Mohammad is delivering a lecture to the 15-440 class" and "The course name of 15-440 is Distributed Systems" is stored as two chunks; a Map function parses each chunk and counts its words, and a Reduce function iterates over the intermediate counts and sums them]
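
  The "parse & count / iterate & sum" flow on this slide is essentially a word count. Below is a minimal sketch of how such a Mapper and Reducer could be written against Hadoop's Java API; the class and variable names are illustrative, not taken from the lecture, and both classes are shown in one snippet for brevity.

  ```java
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map: parse one line of the split and emit an intermediate (word, 1) pair per token.
  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // intermediate (K', V') pair
        }
      }
    }
  }

  // Reduce: iterate over all counts grouped under one word and sum them.
  class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));  // final (K'', V'') pair
    }
  }
  ```

  The Map function handles the "parse & count" step on each chunk, and the Reduce function performs the "iterate & sum" step over the intermediate counts.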

  20. The MapReduce Analytics Engine • MapReduce Basics • A Closer Look • Fault-Tolerance • Combiner Functions • Task & Job Scheduling

  21. Hadoop MapReduce: A Closer Look • [Figure: on each node, files are loaded from the local HDFS store; the InputFormat breaks them into splits, RecordReaders (RR) convert each split into input (K, V) pairs, and Map functions emit intermediate (K, V) pairs; the shuffling process exchanges intermediate (K, V) pairs among all nodes; on the reduce side, the Partitioner and Sort steps feed each Reduce, and the final (K, V) pairs are written back to the HDFS store through the OutputFormat]

  22. Input Files • Input files are where the data for Map tasks is initially stored • The input files typically reside in a distributed file system (e.g., HDFS) • The format of input files is arbitrary: • Line-based log files • Binary files • Multi-line input records • Or something totally different

  23. InputFormat • How the input files are split up and read is defined by the InputFormat • InputFormat is a Hadoop class that does the following: • Selects the files that should be used for input • Defines the InputSplits that break up a file • Provides a factory for RecordReader objects that read the file

  24. InputFormat Types • Several InputFormats are provided with Hadoop, e.g., TextInputFormat (the default: the key is a line’s byte offset and the value is the line itself), KeyValueTextInputFormat (each line is split into a key and a value at a separator character), and SequenceFileInputFormat (reads Hadoop’s binary SequenceFile format)

  25. Input Splits • An input split describes a unit of work that comprises a single Map task in a MapReduce program • By default, the InputFormat breaks a file up into 64MB splits • Each Map task corresponds to a single input split • By dividing the file into splits, we allow several Map tasks to operate on a single large file in parallel • If the file is very large, this can improve performance significantly
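
  As a rough illustration of how the split size relates to the number of Map tasks, the sketch below pins the split size to the 64 MB figure from the slide when configuring a job. It assumes Hadoop 2's mapreduce API; the class and job names are made up for the example.

  ```java
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
      // Force each input split to be ~64 MB, so that one Map task
      // processes roughly 64 MB of the input file.
      Job job = Job.getInstance(new Configuration(), "split-size-demo");
      FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
      FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    }
  }
  ```

  With a 1 GB input file and 64 MB splits, the job would be divided into roughly 16 Map tasks, each processing one split in parallel.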

  26. RecordReader • The input split defines a slice of work but does not describe how to access it • The RecordReader class loads data from its source and converts it to (K, V) pairs suitable for reading by Mappers • The RecordReader is invoked repeatedly on the input until the entire split is consumed • Each invocation of the RecordReader leads to another call of the user-defined map function
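
  Conceptually, a map task drives its RecordReader with a simple loop: each record the reader produces triggers one call to the map function. The sketch below illustrates only this idea using simplified, made-up interfaces; it is not Hadoop's actual internal code.

  ```java
  // Minimal, self-contained sketch (not Hadoop's real classes): the task runner
  // repeatedly asks the RecordReader for the next (K, V) pair and hands each
  // pair to the user-defined map function until the split is consumed.
  interface SimpleRecordReader<K, V> {
    boolean nextKeyValue();   // advance to the next record, if any remain
    K getCurrentKey();
    V getCurrentValue();
  }

  interface SimpleMapper<K, V> {
    void map(K key, V value); // user-defined map function
  }

  class MapTaskRunnerSketch {
    static <K, V> void runMapTask(SimpleRecordReader<K, V> reader,
                                  SimpleMapper<K, V> mapper) {
      while (reader.nextKeyValue()) {   // one iteration per record in the split
        mapper.map(reader.getCurrentKey(), reader.getCurrentValue());
      }
    }
  }
  ```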

  27. Mappers and Reducers • The Mapper performs the user-defined work within the first phase of the MapReduce program (i.e., the Map phase) • Mappers process splits • The Reducer performs the user-defined work within the second phase of the MapReduce program (i.e., the Reduce phase) • Reducers process partitions • For each key in the partition assigned to a Reducer, the Reducer is triggered once

  28. Partitioner • Each Mapper may emit (K, V) pairs to any Reduce task • Therefore, the map nodes must all agree on where to send different pieces of intermediate output • The Partitioner class determines which Reduce task (or partition) a given (K, V) pair will be assigned to • The default Partitioner computes a hash value for a given key and assigns it to a partition based on this result
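
  A minimal sketch of a hash-based Partitioner for the word-count example; it mirrors what Hadoop's default HashPartitioner does, and the class name is illustrative.

  ```java
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Assigns each intermediate (K, V) pair to one of numPartitions Reduce tasks
  // by hashing the key, so all values for the same key reach the same Reducer.
  public class WordHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Mask the sign bit to get a non-negative hash, then take it modulo
      // the number of Reduce tasks.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }
  ```

  It would be selected in the driver with job.setPartitionerClass(WordHashPartitioner.class); every pair with the same key then lands in the same partition and, hence, at the same Reduce task.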

  29. Sort • Each Reduce task is responsible for reducing the values associated with (several) intermediate keys • Intermediate keys of a single partition are automatically merged & sorted by the MapReduce engine before they are presented to the respective Reduce task • Reduce tasks apply the user-defined reduce function on merged and sorted partitions and output the final results • The Reduce task can be viewed as consisting of three stages: the Shuffle, the Merge & Sort, and the Reduce stages

  30. OutputFormat • The OutputFormat class defines the way (K, V) pairs produced by Reducers are written to output files • The sub-classes of OutputFormat provided by Hadoop allow writing results to either the local disk or HDFS • Several OutputFormats are provided by Hadoop, e.g., TextOutputFormat (the default: writes each final (K, V) pair as a tab-separated line) and SequenceFileOutputFormat (writes binary SequenceFiles that later MapReduce jobs can read efficiently)
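
  Putting the pieces of this section together, a job driver is what wires the InputFormat, Mapper, Partitioner, Reducer, and OutputFormat into one MapReduce job. The sketch below reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier example and assumes Hadoop 2's mapreduce API.

  ```java
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCountDriver.class);

      // How input is split and read, and how final (K, V) pairs are written.
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);

      // User-defined Map and Reduce functions from the earlier sketches.
      job.setMapperClass(WordCountMapper.class);
      job.setReducerClass(WordCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      // Input lives in HDFS; results are written back to HDFS.
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
  ```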

  31. The MapReduce Analytics Engine • MapReduce Basics • A Closer Look • Fault-Tolerance • Combiner Functions • Task & Job Scheduling

  32. Combiner Functions • MapReduce applications are limited by the bandwidth available on the cluster • It pays off to minimize the data shuffled between Map and Reduce tasks • Hadoop allows users to specify combiner functions (just like reduce functions) to be run on Map outputs • [Figure: Map tasks (MT) on nodes (N) across racks (R) emit (Year, Temperature) pairs such as (1950, 0), (1950, 20), and (1950, 10); a combiner applied to each Map task’s output forwards only (1950, 20) to the Reduce task (RT), shrinking the data shuffled across the network]
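
  For the (year, temperature) pairs in the figure, taking the maximum is commutative and associative, so the same "max" logic can serve both as the combiner and as the final reduce function. A sketch with made-up class names, not code from the lecture:

  ```java
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Emits the maximum temperature seen for a year. Because "max" is
  // commutative and associative, this class can run both as a combiner
  // (on each Map task's local output) and as the final Reducer.
  public class MaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable t : temps) {
        max = Math.max(max, t.get());
      }
      context.write(year, new IntWritable(max));
    }
  }
  ```

  In the driver it would be enabled with job.setCombinerClass(MaxTemperatureReducer.class); for the Map output (1950, 0), (1950, 20), (1950, 10), the combiner then ships only (1950, 20) across the network.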

  33. The MapReduce Analytics Engine • MapReduce Basics • A Closer Look • Fault-Tolerance • Combiner Functions • Task & Job Scheduling

  34. Task Scheduling in MapReduce • MapReduce adopts a master-slave architecture • The master node is referred to as the JobTracker (JT) • Each slave node is referred to as a TaskTracker (TT) • MapReduce adopts a pull-based scheduling strategy (rather than a push-based one) • I.e., JT does not push Map and Reduce tasks to TTs; rather, TTs pull them by making pertaining requests • [Figure: TaskTrackers with task slots send requests to the JobTracker, which replies with tasks (T0, T1, T2) drawn from its task queue]

  35. Map and Reduce Task Scheduling • Every TT sends a heartbeat message periodically to JT encompassing a request for a Map or a Reduce task • Map Task Scheduling: • JT satisfies requests for Map tasks via attempting to schedule Map tasks in the vicinity of their input splits (i.e., it exploits data locality) • Reduce Task Scheduling: • However, JT simply assigns the next yet-to-run Reduce task to a requesting TT regardless of TT’s network location and its implied effect on the reducer’s shuffle time (i.e., it does not exploit data locality)

  36. Job Scheduling in MapReduce • In MapReduce, an application is represented by one or many jobs • A job consists of one or many Map and Reduce tasks • Hadoop MapReduce comes with various choices of job schedulers: • FIFO Scheduler: schedules jobs in order of submission • Fair Scheduler: aims at giving every user a “fair” share of the cluster capacity over time • Capacity Scheduler: similar to the Fair Scheduler but does not apply job preemption

  37. The MapReduce Analytics Engine • MapReduce Basics • A Closer Look • Fault-Tolerance • Combiner Functions • Task & Job Scheduling

  38. Fault Tolerance in Hadoop: Node Failures • MapReduce can guide jobs toward a successful completion even when jobs are run on large clusters where the probability of failures increases • Hadoop MapReduce achieves fault-tolerance through restarting tasks • If a TT fails to communicate with JT for a period of time (by default, 1 minute), JT will assume that the TT in question has crashed • If the job is still in the Map phase, JT asks another TT to re-execute all Map tasks that previously ran at the failed TT • If the job is in the Reduce phase, JT asks another TT to re-execute all Reduce tasks that were in progress on the failed TT

  39. Fault Tolerance in Hadoop: Speculative Execution • A MapReduce job is dominated by the slowest task • MapReduce attempts to locate slow tasks (or stragglers) and run replicated (or speculative) tasks that will optimistically commit before the corresponding stragglers • In general, this strategy is known as task resiliency or task replication (as opposed to data replication), but in Hadoop it is referred to as speculative execution • Only one copy of a straggler is allowed to be speculated • Whichever copy (among the two copies) of a task commits first becomes the definitive copy, and the other copy is killed by JT

  40. But, How to Locate Stragglers? • Hadoop monitors each task’s progress using a progress score between 0 and 1 • If a task’s progress score is less than (average – 0.2), and the task has run for at least 1 minute, it is marked as a straggler • [Figure: over time, task T1 with progress score 2/3 is not a straggler, while task T2 with progress score 1/12 is marked as a straggler]
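
  The rule on this slide can be written down directly. The small self-contained sketch below applies it to the two tasks from the figure; the class and method names are illustrative, not Hadoop code.

  ```java
  // Sketch of the straggler heuristic as described on the slide: a task is a
  // straggler if its progress score is more than 0.2 below the average score
  // and it has been running for at least one minute.
  public class StragglerCheck {
    static boolean isStraggler(double progressScore,
                               double averageScore,
                               long runtimeMillis) {
      return progressScore < (averageScore - 0.2) && runtimeMillis >= 60_000;
    }

    public static void main(String[] args) {
      // The two tasks from the figure: T1 with PS = 2/3, T2 with PS = 1/12.
      double average = (2.0 / 3 + 1.0 / 12) / 2;                    // = 0.375
      System.out.println(isStraggler(2.0 / 3, average, 120_000));   // T1: false
      System.out.println(isStraggler(1.0 / 12, average, 120_000));  // T2: true
    }
  }
  ```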

  41. To this End…

  42. What Makes MapReduce Unique? • MapReduce is characterized by: • Its simplified programming model which allows the user to quickly write and test distributed systems • Its efficient and automatic distribution of data and workload across cluster machines • Its flat scalability curve • After a MapReduce program is written and executed on a 10-machine cluster, very little (if any) work is required to make the same program run on a 1000-machine cluster • Communication overhead is minimized as much as possible

  43. Comparison With Traditional Models

  44. Next Class • Discussion on Programming Models • MapReduce, Pregel and GraphLab • Message Passing Interface (MPI) • Types of Parallel Programs • Traditional models of parallel programming • Parallel computer architectures • Why parallelize our programs?
