
Map/Reduce Programming Model


Presentation Transcript


  1. Map/Reduce Programming Model Ahmed Abdelsadek

  2. Outline
  • Introduction
  • What is Map/Reduce?
  • Framework Architecture
  • Map/Reduce Algorithm Design
  • Tools and Libraries built on top of Map/Reduce

  3. Introduction
  • Big Data
  • Scaling 'out' not 'up'
  • Scaling 'everything' linearly with data size
  • Data-intensive applications

  4. Map/Reduce
  • Origins: Google Map/Reduce, Hadoop Map/Reduce
  • The Map and Reduce functions are both defined with respect to data structured in (key, value) pairs.

  5. Mapper
  • The Map function takes a key/value pair, processes it, and generates zero or more output key/value pairs.
  • The input and output types of the mapper can be different from each other.

  6. Reducer
  • The Reduce function takes a key and the series of all values associated with it, processes them, and generates zero or more output key/value pairs.
  • The input and output types of the reducer can be different from each other.

  7. Mappers/Reducers
  • map: (k1, v1) -> [(k2, v2)]
  • reduce: (k2, [v2]) -> [(k3, v3)]

  8. WordCount Example
  • Problem: count the number of occurrences of every word in a text collection.

    Map(docid a, doc d)
        for all term t in doc d do
            Emit(term t, count 1)

    Reduce(term t, counts [c1, c2, ...])
        sum = 0
        for all count c in counts [c1, c2, ...] do
            sum = sum + c
        Emit(term t, count sum)
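  The same algorithm as a small, runnable sketch in plain Python (an illustration of the programming model, not the Hadoop API; all names are ours):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(docid, doc):
        # Emit (term, 1) for every term occurrence in the document.
        for term in doc.split():
            yield (term, 1)

    def reduce_fn(term, counts):
        # Sum all partial counts for one term.
        yield (term, sum(counts))

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map phase over all input records.
        pairs = [kv for k, v in records for kv in map_fn(k, v)]
        # Shuffle and sort: group intermediate pairs by key.
        pairs.sort(key=itemgetter(0))
        result = []
        for key, group in groupby(pairs, key=itemgetter(0)):
            result.extend(reduce_fn(key, [v for _, v in group]))
        return result

    print(run_mapreduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn))
    # -> [('a', 2), ('b', 2), ('c', 1)]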

  9. Map/Reduce Framework Architecture and Execution Overview

  10. Architecture - Overview
  • Map/Reduce runs on top of a distributed file system (DFS): GFS for Google's implementation, HDFS for Hadoop

  11. Data Flow

  12. Job Timeline

  13-22. Job Work Flow (a sequence of figure-only animation frames stepping through the execution of a job)

  23. Fault Tolerance
  • Task fails: re-execute it
  • TaskTracker fails: remove the node from the pool of TaskTrackers and re-schedule its tasks
  • JobTracker fails: a single point of failure, so the job fails

  24. Map/Reduce Framework Features
  • Locality: move the code to the data
  • Task granularity: the number of map and reduce tasks should be much larger than the number of machines (though not too much larger!), which enables dynamic load balancing
  • Backup tasks: near completion, launch redundant copies of the remaining tasks to avoid waiting on slow workers

  25. Map/Reduce Framework Features
  • Skipping bad records: skip records that repeatedly cause failures
  • Local execution: debug in isolation
  • Status information: progress of the computation
  • User counters: report progress, periodically propagated to the master node

  26. Hadoop Streaming and Pipes
  • APIs to MapReduce that let you write your map and reduce functions in languages other than Java
  • Hadoop Streaming
  • Uses Unix standard streams as the interface between Hadoop and your program
  • You can use any language that can read standard input and write to standard output
  • Hadoop Pipes (for C++)
  • Pipes uses sockets as the channel to communicate with the process running the C++ map or reduce function
  • JNI is not used
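  For example, the WordCount mapper and reducer for Hadoop Streaming can be two small Python scripts that read standard input and write tab-separated key-value pairs to standard output (a sketch; the script names are illustrative):

    # --- mapper.py: emit "word<TAB>1" for every word read from stdin ---
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # --- reducer.py: streaming delivers mapper output sorted by key, ---
    # --- so all counts for one word arrive as a contiguous run        ---
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

  The job is then launched with the hadoop-streaming jar, passing the two scripts via the -mapper and -reducer options along with -input and -output paths.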

  27. Keep in Mind
  • The programmer has little control over many aspects of execution:
  • Where a mapper or reducer runs (i.e., on which node in the cluster)
  • When a mapper or reducer begins or finishes
  • Which input key-value pairs are processed by a specific mapper
  • Which intermediate key-value pairs are processed by a specific reducer

  28. Map/Reduce Algorithm Design

  29. Partitioners
  • Divide up the intermediate key space
  • Simplest: hash value of the key mod the number of reducers
  • Assigns roughly the same number of keys to each reducer
  • Considers only the key and ignores the value, so it may yield large differences in the number of values sent to each reducer
  • A more complex partitioning algorithm can handle the imbalance in the amount of data associated with each key
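  A minimal sketch of the simple hash partitioner (illustrative, not the Hadoop implementation; crc32 stands in for any stable hash function):

    from zlib import crc32

    def hash_partition(key, num_reducers):
        # A stable hash of the key, mod the number of reducers: every
        # occurrence of the same key is routed to the same reducer.
        return crc32(key.encode("utf-8")) % num_reducers

    print(hash_partition("apple", 4))  # same reducer for every "apple" pair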

  30. Combiners
  • In the WordCount example, the amount of intermediate data is larger than the input collection itself
  • Combiners are an optimization: local aggregation before the shuffle and sort phase
  • For WordCount: compute a local count for each word over all the documents processed by one mapper
  • Think of combiners as "mini-reducers"
  • However, combiners and reducers are not always interchangeable
  • Combiner input and output pairs have the same types as mapper output pairs (which are also the reducer input types)
  • A combiner may be invoked zero, one, or multiple times
  • A combiner can emit any number of key-value pairs
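  In WordCount the combiner can reuse the reducer logic, because addition is commutative and associative; a sketch of running it over one mapper's local output (names are ours):

    from itertools import groupby
    from operator import itemgetter

    def combine(local_pairs):
        # Aggregate (word, 1) pairs from a single mapper before the
        # shuffle, shrinking what must cross the network.
        pairs = sorted(local_pairs, key=itemgetter(0))
        for word, group in groupby(pairs, key=itemgetter(0)):
            yield (word, sum(c for _, c in group))

    print(list(combine([("a", 1), ("b", 1), ("a", 1), ("a", 1)])))
    # -> [('a', 3), ('b', 1)]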

  31. Complete View of Map/Reduce

  32. Local Aggregation
  • Network and disk latency are high!
  • Framework features that help local aggregation:
  • A single (Java) Mapper object handles all the (key, value) pairs in an input split, so state can be preserved across multiple calls of the map() method
  • Share in-object data structures and counters
  • Run initialization and finalization code around all map() calls in a single task
  • JVM reuse across multiple tasks on the same machine

  33. Basic WordCount Example

  34. Per-Document Aggregation
  • Use an associative array inside the map() call to sum up term counts within a single document
  • Emit one key-value pair for each unique term, instead of one pair for each term occurrence in the document
  • Substantial savings in the number of intermediate key-value pairs emitted
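  A sketch of the per-document version of the mapper (plain Python; Counter plays the associative array):

    from collections import Counter

    def map_fn(docid, doc):
        counts = Counter()                 # local to this single map() call
        for term in doc.split():
            counts[term] += 1
        for term, n in counts.items():     # one pair per unique term
            yield (term, n)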

  35. Per-Mapper Aggregation
  • Use an associative array inside the Mapper object to sum up term counts across multiple documents

  36. In-Mapper Combining
  • Pros
  • More control over when local aggregation occurs and how exactly it takes place (recall: there are no guarantees on when, or whether, combiners run)
  • More efficient than using actual combiners: no additional overhead for object creation or for serializing, reading, and writing the key-value pairs
  • Cons
  • Breaks the functional programming model (not a big deal!)
  • Scalability bottleneck: needs sufficient memory to store the intermediate results
  • Solution: block and flush, i.e., emit and clear the buffer every N key-value pairs processed or every M bytes used, as in the sketch below
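  A sketch of in-mapper combining with block-and-flush (the Mapper class, close() hook, and FLUSH_EVERY threshold are illustrative, not the Hadoop API):

    from collections import Counter

    class Mapper:
        FLUSH_EVERY = 100_000                # flush after this many documents

        def __init__(self):                  # initialization, once per task
            self.counts = Counter()
            self.seen = 0

        def map(self, docid, doc):           # called for every input pair
            for term in doc.split():
                self.counts[term] += 1       # state preserved across calls
            self.seen += 1
            if self.seen % self.FLUSH_EVERY == 0:
                yield from self._flush()     # bound the memory footprint

        def close(self):                     # finalization, once per task
            yield from self._flush()

        def _flush(self):
            for term, n in self.counts.items():
                yield (term, n)
            self.counts.clear()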

  37. Correctness with Local Aggregation
  • Combiners are viewed as optional optimizations: the correctness of the algorithm should not depend on their computations
  • Combiners and reducers are not interchangeable, unless the reduce computation is both commutative and associative
  • Make sure of the semantics of your aggregation algorithm
  • Notice, for example, computing a mean: averaging partial averages does not, in general, give the overall average
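  A quick numeric illustration of why this matters, using the mean:

    # Averaging partial averages is wrong, because the mean is neither
    # associative nor commutative as a reduce operation.
    def mean(xs):
        return sum(xs) / len(xs)

    data = [1, 2, 3, 4, 10]
    print(mean(data))                              # 4.0 (correct)
    print(mean([mean([1, 2]), mean([3, 4, 10])]))  # ~3.58 (wrong!)

    # Safe local aggregation: combine (sum, count) pairs and divide
    # only in the reducer.
    partials = [(3, 2), (17, 3)]                   # (sum, count) per combiner
    print(sum(s for s, _ in partials) / sum(c for _, c in partials))  # 4.0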

  38. Pairs and Stripes
  • In some problems, a common approach is to construct complex keys and values to achieve more efficiency
  • Example: building a word co-occurrence matrix from a large document collection
  • Formally, the co-occurrence matrix of a corpus is a square N x N matrix, where N is the number of unique words in the corpus
  • Cell Mij contains the number of times word wi co-occurred with word wj

  39. Pairs Approach
  • Mapper: emits each co-occurring word pair as the key, with the integer one as the value
  • Reducer: sums up all the values associated with the same co-occurring word pair
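  A sketch of the pairs mapper and reducer (here the co-occurrence window is the whole line, an illustrative choice):

    def map_pairs(docid, doc):
        words = doc.split()
        for i, wi in enumerate(words):
            for wj in words[:i] + words[i + 1:]:   # every co-occurring word
                yield ((wi, wj), 1)

    def reduce_pairs(pair, counts):
        yield (pair, sum(counts))                  # total count for (wi, wj)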

  40. Pairs Approach
  • The pairs algorithm generates a massive number of key-value pairs
  • Combiners have few opportunities to perform local aggregation
  • The sparsity of the key space also limits the effectiveness of in-memory combining

  41. Stripes Approach
  • Store co-occurrence information in an associative array
  • Mapper: emits words as keys and associative arrays as values
  • Reducer: performs an element-wise sum of all associative arrays with the same key
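  A sketch of the stripes mapper and reducer (Counter plays the associative array; the window is again the whole line):

    from collections import Counter

    def map_stripes(docid, doc):
        words = doc.split()
        for i, wi in enumerate(words):
            yield (wi, Counter(words[:i] + words[i + 1:]))  # one stripe per word

    def reduce_stripes(word, stripes):
        total = Counter()
        for stripe in stripes:
            total.update(stripe)          # element-wise sum of the arrays
        yield (word, total)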

  42. Stripes Approach
  • Much more compact representation
  • Far fewer intermediate key-value pairs
  • More opportunities to perform local aggregation
  • But the associative arrays may cause scalability bottlenecks in the algorithm

  43. Which approach is faster?
  • APW (Associated Press Worldstream): a corpus of 2.27 million documents totaling 5.7 GB
  • In the published experiments on this corpus, the stripes approach completed substantially faster than the pairs approach

  44. Computing Relative Frequencies
  • In the previous example, the (wi, wj) co-occurrence count may be high just because one of the words is very common!
  • Solution: compute relative frequencies, defined below
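  In symbols, the relative frequency of wj given wi divides the pair count by the marginal count of wi (the standard definition, with N(.,.) denoting co-occurrence counts):

    f(w_j \mid w_i) = \frac{N(w_i, w_j)}{\sum_{w'} N(w_i, w')}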

  45. Relative Frequencies with Stripes
  • Straightforward! In the reducer:
  • Sum the counts of all words that co-occur with the key word
  • Divide each count by that sum to get the relative frequency (see the sketch below)
  • Lessons:
  • Use complex data structures to coordinate distributed computations
  • Appropriate structuring of keys and values brings together all the pieces of data required to perform a computation
  • Drawback? As before, this algorithm assumes that each associative array fits into memory (a scalability bottleneck!)
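  A sketch of the stripes reducer extended to emit relative frequencies:

    from collections import Counter

    def reduce_relative_stripes(word, stripes):
        total = Counter()
        for stripe in stripes:
            total.update(stripe)                    # element-wise sum
        marginal = sum(total.values())              # count of all pairs (word, *)
        yield (word, {wj: n / marginal for wj, n in total.items()})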

  46. Relative Frequencies with Pairs
  • The reducer receives (wi, wj) as the key and the counts as the value
  • From this alone it is not possible to compute f(wj | wi)
  • Hint: reducers, like mappers, can preserve state across multiple keys
  • Solution: at the reducer side, buffer in memory all the words that co-occur with wi, in essence building the associative array of the stripes approach
  • Problem? Word pairs can arrive in any arbitrary order!
  • Solution: we must define the sort order of the pairs: keys are sorted first by the left word, and then by the right word
  • So that when the left word changes, the reducer can sum the counts, calculate and emit the results, and flush its memory

  47. Relative Frequencies with Pairs
  • Problem? Pairs with the same left word may be sent to different reducers!
  • Solution? We must ensure that all pairs with the same left word are sent to the same reducer
  • How? A custom partitioner that pays attention only to the left word and partitions based on its hash alone
  • Will it work? Yes!
  • Drawback? Still a scalability bottleneck: the reducer buffers in memory all the words that co-occur with wi

  48. Relative Frequencies with Pairs
  • Another approach, with no bottlenecks?
  • Can we compute, or rather already 'have', the sum before processing the pair counts?
  • The notions of 'before' and 'after' can be expressed in the ordering of the key-value pairs
  • The insight lies in properly sequencing the data presented to the reducer: the programmer defines the sort order of keys so that data needed earlier is presented to the reducer earlier
  • So we now need two things:
  • Compute the sum for a given word wi
  • Send that sum to the reducer before any word pair with wi as its left side

  49. Relative Frequencies with Pairs
  • How?
  • To get the sum: modify the mapper to additionally emit a 'special' key (wi, *) with a value of one for every pair it produces
  • To ensure the order: define the sort order of the keys so that pairs with the special symbol, of the form (wi, *), come before any other key-value pairs whose left word is wi
  • In addition: a partitioner that pays attention only to the left word

  50. Relative Frequencies with Pairs
  • Example: a complete sketch of the pattern follows
  • Memory bottlenecks? No! The reducer only keeps the current marginal sum
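  A sketch of the whole order-inversion pattern in plain Python ("*" serves as the special marginal symbol; it sorts before every letter, which gives exactly the required key order):

    from itertools import groupby
    from operator import itemgetter
    from zlib import crc32

    def map_pairs_rel(docid, doc):
        words = doc.split()
        for i, wi in enumerate(words):
            for wj in words[:i] + words[i + 1:]:
                yield ((wi, wj), 1)
                yield ((wi, "*"), 1)        # contributes to the marginal of wi

    def left_word_partition(key, num_reducers):
        # Partition on the left word only, so (wi, *) and all (wi, wj)
        # pairs reach the same reducer.
        return crc32(key[0].encode("utf-8")) % num_reducers

    def reduce_pairs_rel(sorted_pairs):
        marginal = 0
        for (wi, wj), group in groupby(sorted_pairs, key=itemgetter(0)):
            total = sum(c for _, c in group)
            if wj == "*":
                marginal = total            # arrives before any real (wi, wj)
            else:
                yield ((wi, wj), total / marginal)

    # The framework's shuffle and sort would deliver this order per reducer:
    pairs = sorted(map_pairs_rel("d1", "a b a"), key=itemgetter(0))
    print(dict(reduce_pairs_rel(pairs)))
    # -> {('a', 'a'): 0.5, ('a', 'b'): 0.5, ('b', 'a'): 1.0}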
