MapReduce Programming Model
HP Cluster Computing Challenges
• Programmability: need to parallelize algorithms manually
  • Must look at problems from a parallel standpoint
  • Tightly coupled problems require frequent communication (more of the slow part!)
  • We want to decouple the problem
    • Increase data locality, balance the workload, etc.
• Parallel efficiency: communication is the fundamental difficulty
  • Distributing data, updating shared resources, communicating results
  • Machines have separate memories, so there is no usual inter-process communication – the network must be used
  • This introduces inefficiencies: overhead, waiting, etc.
Programming Models: What is MPI?
• MPI: Message Passing Interface (www.mpi.org)
  • World’s most popular high-performance distributed API
  • MPI is the “de facto standard” in scientific computing
  • C and FORTRAN bindings; version 2 released in 1997
• What is MPI good for?
  • Abstracts away common network communications
  • Allows lots of control without bookkeeping
  • Freedom and flexibility come with complexity
    • 300 subroutines, but serious programs can be written with fewer than 10
• Basics:
  • One executable runs on every node (SPMD: single program, multiple data)
  • Each node’s process is assigned a rank ID number
  • Call API functions to send messages
• Send/Receive of a block of data (in an array) – a minimal sketch follows below:
  • MPI_Send(start, count, datatype, dest, tag, comm_context)
  • MPI_Recv(start, count, datatype, source, tag, comm_context, status)
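As a concrete illustration of the send/receive basics above, here is a minimal sketch using the mpi4py Python bindings rather than the C API shown on the slide; the two-rank layout, tag value, and script name are assumptions for illustration only.

# Run with: mpiexec -n 2 python send_recv.py   (assumes exactly two ranks)
from mpi4py import MPI

comm = MPI.COMM_WORLD          # the communicator (the "comm_context" above)
rank = comm.Get_rank()         # each process is assigned a rank ID number

if rank == 0:
    data = [1, 2, 3, 4, 5]
    comm.send(data, dest=1, tag=0)        # blocking send to rank 1
elif rank == 1:
    data = comm.recv(source=0, tag=0)     # blocking receive from rank 0
    print("rank 1 received", data)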
Challenges with MPI
• Poor programmability:
  • Essentially socket programming, with support for data structures
  • Programmers need to take care of everything: data distribution, inter-process communication, orchestration
  • Blocking communication can cause deadlock (a sketch of this deadlock and one fix follows below):
    • Proc1: MPI_Receive(Proc2, A); MPI_Send(Proc2, B);
    • Proc2: MPI_Receive(Proc1, B); MPI_Send(Proc1, A);
• Potential for high parallel efficiency, but
  • Large overhead from communication mismanagement
    • Time spent blocking is wasted cycles
    • Can overlap computation with non-blocking communication
  • Load imbalance is possible! Dead machines?
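The deadlock pattern above, and one way out, sketched with the mpi4py bindings (again an illustrative assumption, not part of the slides; assumes exactly two ranks):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank                 # the peer rank (assumes exactly two ranks)
payload = {"from": rank}

# Deadlock-prone version: both ranks block in recv(), each waiting for a
# message the other has not sent yet.
#   incoming = comm.recv(source=other, tag=0)
#   comm.send(payload, dest=other, tag=0)

# One fix: post a non-blocking send first, then receive, then complete the send.
req = comm.isend(payload, dest=other, tag=0)   # non-blocking send
incoming = comm.recv(source=other, tag=0)      # blocking receive
req.wait()                                     # wait for the send to complete
print("rank", rank, "received", incoming)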
Google’s MapReduce
• Large-Scale Data Processing
  • Want to use hundreds or thousands of CPUs... but this needs to be easy!
• MapReduce provides
  • Automatic parallelization & distribution
  • Fault tolerance
  • I/O scheduling
  • Monitoring & status updates
Programming Concept
• Map
  • Apply a function to each value in a data set to create a new list of values
  • Example: square x = x * x
    • map square [1,2,3,4,5] returns [1,4,9,16,25]
• Reduce
  • Combine the values in a data set to produce a single new value
  • Example: sum (for each elem in the list, total += elem)
    • reduce sum [1,2,3,4,5] returns 15 (the sum of the elements)
(A short Python illustration of both follows below.)
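The same two concepts in a few lines of Python (an illustration added here, not part of the original slides):

from functools import reduce

def square(x):
    return x * x

# Map: apply square to each element, producing a new list.
print(list(map(square, [1, 2, 3, 4, 5])))                      # [1, 4, 9, 16, 25]

# Reduce: fold the elements into a single value with a running total.
print(reduce(lambda total, x: total + x, [1, 2, 3, 4, 5], 0))  # 15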
Map/Reduce
• map(key, val) is run on each item in the input set
  • Processes an input key/value pair
  • Emits a set of intermediate new-key / new-val pairs
• reduce(key, vals) is run once for each unique key emitted by map()
  • Combines all intermediate values for that particular key
  • Emits the final output: a set of merged output values (usually just one)
Count words in docs: An Example
• Input consists of (url, contents) pairs
• map(key=url, val=contents):
  • For each word w in contents, emit (w, “1”)
• reduce(key=word, values=uniq_counts):
  • Sum all “1”s in the values list
  • Emit result “(word, sum)”
Count, Illustrated
• map(key=url, val=contents): for each word w in contents, emit (w, “1”)
• reduce(key=word, values=uniq_counts): sum all “1”s in the values list, emit result “(word, sum)”
• Input documents: “see bob throw” and “see spot run”
• map emits: (see, 1), (bob, 1), (throw, 1), (see, 1), (spot, 1), (run, 1)
• reduce groups by key and sums: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
MapReduce WordCount Example
• “Mapper” nodes are responsible for the map function
    map(String input_key, String input_value):
      // input_key : document name (or line of text)
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");
• “Reducer” nodes are responsible for the reduce function
    reduce(String output_key, Iterator intermediate_values):
      // output_key : a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
• Data resides on a distributed file system (DFS)
(A runnable Python sketch of this example follows below.)
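For reference, the same logic as a small self-contained Python simulation; this is purely illustrative, neither Google's implementation nor the Hadoop API, and the function names are assumptions.

from collections import defaultdict

def mapper(input_key, input_value):
    # input_key: document name, input_value: document contents
    for word in input_value.split():
        yield (word, 1)                                 # EmitIntermediate(w, "1")

def reducer(output_key, intermediate_values):
    return (output_key, sum(intermediate_values))       # Emit(AsString(result))

def run_wordcount(documents):
    # "Map" phase plus an in-memory shuffle that groups values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in mapper(name, contents):
            groups[key].append(value)
    # "Reduce" phase: one call per unique key.
    return [reducer(key, values) for key, values in sorted(groups.items())]

print(run_wordcount({"doc1": "see bob throw", "doc2": "see spot run"}))
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]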
Execution Overview
• How is this distributed?
  • Partition the input key/value pairs into chunks, run map() tasks in parallel
  • After all map()s are complete, consolidate all emitted values for each unique emitted key
  • Now partition the space of output map keys, and run reduce() in parallel (a sketch of this key partitioning follows below)
• If a map() or reduce() task fails, re-execute it!
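A sketch of the key-partitioning step in Python (illustrative only; the hash choice and variable names are assumptions, and the real system uses its own partitioning function): each intermediate key is assigned to one of R reduce tasks, so all values for a given key end up at the same reducer.

import zlib

def partition(key, num_reduce_tasks):
    # All occurrences of the same key hash to the same reduce task.
    # zlib.crc32 gives a hash that is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

intermediate = [("see", 1), ("bob", 1), ("throw", 1), ("see", 1), ("spot", 1), ("run", 1)]
R = 3
buckets = {r: [] for r in range(R)}
for key, value in intermediate:
    buckets[partition(key, R)].append((key, value))
print(buckets)   # each reduce task receives a disjoint subset of the keys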
MapReduce WordCount Diagram
• [Diagram: seven input files containing the words “ah”, “er”, “if”, “or”, “uh” feed the map() function shown above, which emits a (word, "1") pair per occurrence; the shuffle groups the pairs by key (ah: 1,1,1,1; er: 1; if: 1,1; or: 1,1,1; uh: 1); reduce() sums each group, producing the final counts 4 (ah), 1 (er), 2 (if), 3 (or), 1 (uh)]
Map
• Reads the contents of its assigned portion of the input file
• Parses and prepares the data for input to the map function (e.g. reads <a /> links out of HTML)
• Passes the data into the map function and saves the result in memory (e.g. <target, source> pairs)
• Periodically writes completed work to local disk
• Notifies the Master of this partially completed work (intermediate data)
Reduce
• Receives notification from the Master of partially completed work
• Retrieves intermediate data from the map machine via remote read
• Sorts the intermediate data by key (e.g. by target page)
• Iterates over the intermediate data (a sketch of this sort-and-iterate step follows below)
  • For each unique key, sends the corresponding set of values through the reduce function
• Appends the result of the reduce function to the final output file (on GFS)
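A sketch of the reduce-side sort-and-iterate step (illustrative Python; the sample data is hypothetical and assumes the intermediate pairs have already been fetched): sorting by key lets the worker stream through the data and hand each unique key's values to the reduce function in one pass.

from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs fetched from map workers (hypothetical data).
intermediate = [("spot", 1), ("see", 1), ("bob", 1), ("see", 1), ("throw", 1), ("run", 1)]

intermediate.sort(key=itemgetter(0))               # sort by key
for key, group in groupby(intermediate, key=itemgetter(0)):
    values = [v for _, v in group]
    print(key, sum(values))                        # reduce = sum; result appended to output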
Task Granularity & Pipelining
• Fine-granularity tasks: map tasks >> machines
  • Minimizes time for fault recovery
  • Can pipeline shuffling with map execution
  • Better dynamic load balancing
• Often uses 200,000 map tasks & 5,000 reduce tasks
  • Running on 2,000 machines
Fault Tolerance / Workers
Handled via re-execution
• Detect failure: periodically ping workers
  • Any machine that does not respond is considered “dead”
• Re-execute completed + in-progress map tasks
  • Their intermediate data is stored on the failed machine’s local disk, which becomes unreachable
• Re-execute in-progress reduce tasks
• Task completion is committed through the master
Robust: once lost 1600 of 1800 machines, yet the job finished OK
Master Failure
• Could be handled by having the master write periodic checkpoints of its data structures
• But this isn’t done yet
  • (master failure is unlikely)
Refinement: Redundant Execution
Slow workers significantly delay completion time
• Other jobs consuming resources on the machine
• Bad disks with soft errors transfer data slowly
• Weird things: processor caches disabled (!!)
Solution: near the end of a phase, schedule redundant executions of the in-progress tasks
• Whichever copy finishes first “wins”
Dramatically shortens job completion time
Refinement: Locality Optimization
• Master scheduling policy:
  • Asks GFS for the locations of the replicas of the input file blocks
  • Map task inputs are typically split into 64 MB chunks (the GFS block size)
  • Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack
• Effect
  • Thousands of machines read input at local disk speed
  • Without this, rack switches would limit the read rate
Refinement: Skipping Bad Records
• Map/Reduce functions sometimes fail for particular inputs
  • Best solution is to debug & fix
  • Not always possible – e.g. third-party source libraries
• On a segmentation fault:
  • Send a UDP packet to the master from the signal handler
  • Include the sequence number of the record being processed
• If the master sees two failures for the same record:
  • The next worker is told to skip that record (a sketch of this bookkeeping follows below)
  • (It is acceptable to ignore a few records when doing statistical analysis on a large data set)
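A sketch of the master-side bookkeeping this implies (illustrative Python only, not Google's code; the class, method names, and record IDs are assumptions, while the two-failure threshold comes from the slide): count failure reports per record and tell workers to skip a record once it has failed twice.

from collections import defaultdict

FAILURE_THRESHOLD = 2                     # skip after two failures on the same record

class Master:
    def __init__(self):
        self.failures = defaultdict(int)  # record sequence number -> failure count
        self.skip_set = set()             # records workers should skip

    def report_failure(self, record_seqno):
        # Called when a worker's signal handler reports a crash on this record.
        self.failures[record_seqno] += 1
        if self.failures[record_seqno] >= FAILURE_THRESHOLD:
            self.skip_set.add(record_seqno)

    def should_skip(self, record_seqno):
        return record_seqno in self.skip_set

master = Master()
master.report_failure(1042)
master.report_failure(1042)
print(master.should_skip(1042))   # True: the next worker is told to skip record 1042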
Other Refinements
• Sorting guarantees within each reduce partition
• Compression of intermediate data
• Combiner
  • Useful for saving network bandwidth (a brief illustration follows below)
• Local execution for debugging/testing
• User-defined counters
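A brief illustration of the combiner idea (a Python sketch, not an actual MapReduce/Hadoop API): pre-aggregate a map task's output locally before the shuffle so fewer (key, value) pairs cross the network. For WordCount, the combiner can use the same summing logic as the reducer.

from collections import Counter

def map_wordcount(contents):
    # Raw map output: one ("word", 1) pair per occurrence.
    return [(word, 1) for word in contents.split()]

def combine(pairs):
    # Local pre-aggregation on the map worker: sum counts per word
    # before anything is sent over the network.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

raw = map_wordcount("ah ah er ah or ah")
print(len(raw), len(combine(raw)))   # 6 raw pairs shrink to 3 combined pairs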
MapReduce: In Summary
• Now it’s easy to program for many CPUs
  • Communication management is effectively gone
  • I/O scheduling is done for us
  • Fault tolerance and monitoring
    • Machine failures, suddenly-slow machines, etc. are handled
  • Can be much easier to design and program!
  • Can cascade several (many?) MapReduce tasks
• But... it further restricts the set of solvable problems
  • Might be hard to express a problem in MapReduce
  • Data parallelism is key
    • Need to be able to break the problem up by data chunks
  • MapReduce is closed-source (to Google), written in C++
    • Hadoop is an open-source, Java-based rewrite