Map/Reduce Programming Model Ahmed Abdelsadek
Outline • Introduction • What is Map/Reduce? • Framework Architecture • Map/Reduce Algorithm Design • Tools and Libraries built on top of Map/Reduce
Introduction • Big Data • Scaling ‘out’ not ‘up’ • Scaling ‘everything’ linearly with data size • Data-intensive applications
Map/Reduce • Origins • Google Map/Reduce • Hadoop Map/Reduce • The Map and Reduce functions are both defined with respect to data structured in (key, value) pairs.
Mapper • The Map function takes a key/value pair, processes it, and generates zero or more output key/value pairs. • The input and output types of the mapper can be different from each other.
Reducer • The Reduce function takes a key and a series of all values associated with it, processes it, and generates zero or more output key/value pairs. • The input and output types of the reducer can be different from each other.
Mappers/Reducers • map: (k1; v1) -> [(k2; v2)] • reduce: (k2; [v2]) -> [(k3; v3)]
WordCount Example • Problem: count the number of occurrences of every word in a text collection.
  Map(docid a, doc d)
    for all term t in doc d do
      Emit(term t, count 1)
  Reduce(term t; counts [c1, c2, …])
    sum = 0
    for all count c in counts [c1, c2, …] do
      sum = sum + c
    Emit(term t, count sum)
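The WordCount pseudocode can be tried out in a single process. The following is a minimal Python sketch, not Hadoop code: `run_job` is a hypothetical in-memory driver that stands in for the framework's shuffle and sort, and whitespace splitting is an assumed tokenization.

```python
from collections import defaultdict

def map_fn(docid, doc):
    """Emit (term, 1) for every term in the document."""
    for term in doc.split():  # assumption: terms are whitespace-separated
        yield term, 1

def reduce_fn(term, counts):
    """Sum all counts that were emitted for one term."""
    yield term, sum(counts)

def run_job(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)  # stands in for shuffle and sort
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_job(docs, map_fn, reduce_fn)
print(word_counts)
```

The driver mirrors the signatures map: (k1; v1) -> [(k2; v2)] and reduce: (k2; [v2]) -> [(k3; v3)]; in a real cluster the grouping step is distributed across machines.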
Architecture - Overview • Map/Reduce runs on top of DFS
Fault Tolerance • Task fails • Re-execution • TaskTracker fails • The node is removed from the pool of TaskTrackers • Its tasks are rescheduled • JobTracker fails • Single point of failure: the job fails
Map/Reduce Framework Features • Locality • Move code to the data • Task Granularity • The number of map and reduce tasks should be much larger than the number of machines, but not excessively so! • Dynamic load balancing! • Backup Tasks • Avoid slow workers • Scheduled when the job is near completion
Map/Reduce Framework Features • Skipping bad records • Many failures on the same record • Local execution • Debug in isolation • Status information • Progress of computations • User Counters, report progress • Periodically propagated to the master node
Hadoop Streaming and Pipes • APIs to MapReduce that allow you to write your map and reduce functions in languages other than Java • Hadoop Streaming • Uses Unix standard streams as the interface between Hadoop and your program • You can use any language that can read standard input and write to standard output • Hadoop Pipes (for C++) • Pipes uses sockets as the channel to communicate with the process running the C++ map or reduce function • JNI is not used
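To make the standard-streams idea concrete, here is a hedged Python sketch of a streaming-style word count. It assumes tab-separated `key<TAB>value` text records and that the framework delivers mapper output to the reducer sorted by key; the function names and the `main` helper are illustrative, not part of any Hadoop API.

```python
import sys
from itertools import groupby

def stream_mapper(lines):
    """Turn raw text lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def main(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: run one stage over the standard streams."""
    fn = stream_mapper if stage == "map" else stream_reducer
    for record in fn(stdin):
        stdout.write(record + "\n")

# Local dry run: sorted() stands in for the framework's sort between stages.
mapped = sorted(stream_mapper(["a b a", "b c"]))
reduced = list(stream_reducer(mapped))
print(reduced)
```

Because both stages only read lines and write lines, the same logic could be written in any language that handles standard input and output.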
Keep in Mind • Programmer has little control over many aspects of execution • Where a mapper or reducer runs (i.e., on which node in the cluster). • When a mapper or reducer begins or finishes • Which input key-value pairs are processed by a specific mapper. • Which intermediate key-value pairs are processed by a specific reducer.
Partitioners • Divide up the intermediate key space. • Simplest: hash value of the key mod the number of reducers • Assigns roughly the same number of keys to each reducer • Only considers the key and ignores the value • May yield large differences in the number of values sent to each reducer • A more complex partitioning algorithm is needed to handle the imbalance in the amount of data associated with each key
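A hash partitioner and the skew it can produce are easy to sketch in Python. This is an illustration under assumptions: `zlib.crc32` is used only because it is stable across runs (Python's built-in `hash` for strings is randomized per process), and the skewed data set is made up.

```python
import zlib
from collections import Counter

def hash_partition(key, num_reducers):
    """Stable hash of the key modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Made-up skewed intermediate data: "the" carries far more values than the rest.
pairs = ([("the", 1)] * 1000 + [("cat", 1)] * 10
         + [("dog", 1)] * 10 + [("emu", 1)] * 10)

# Number of values each reducer would receive.
load = Counter(hash_partition(key, 2) for key, _ in pairs)
print(dict(load))
```

Every occurrence of a key lands on the same reducer, but whichever reducer receives "the" gets at least 1000 of the 1030 values, because the partitioner looks only at keys and not at how many values each key has.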
Combiners • In the WordCount example, the amount of intermediate data is larger than the input collection itself • Combiners are an optimization for local aggregation before the shuffle and sort phase • Compute a local count for a word over all the documents processed by the mapper • Think of combiners as “mini-reducers” • However, combiners and reducers are not always interchangeable • Combiner input and output pairs must have the same types as the mapper output pairs • These are the same types as the reducer input pairs • A combiner may be invoked zero, one, or multiple times • A combiner can emit any number of key-value pairs
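The constraints on combiners can be checked on a toy example. The sketch below assumes the WordCount setting, where local summation is a valid combiner; `combine` is a hypothetical helper, not a framework API.

```python
from collections import defaultdict

def combine(pairs):
    """Locally sum counts per word; input and output are both (word, count) pairs."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

once = combine(mapper_output)  # combiner invoked one time
twice = combine(once)          # invoked again on its own output
print(once, twice)
```

Since the output has the same shape as the input, the combiner can run zero, one, or several times without changing what the reducer finally computes, while the number of pairs sent over the network shrinks from four to two.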
Local Aggregation • Network and disk latency are high! • Features that help local aggregation • Single (Java) Mapper object for multiple (key, value) pairs in an input split (preserves state across multiple calls of the map() method) • Share in-object data structures and counters • Initialization and finalization code across all map() calls in a single task • JVM reuse across multiple tasks on the same machine
Per-Document Aggregation • Associative array inside the map() call to sum up term counts within a single document • Emits a key-value pair for each unique term, instead of emitting a key-value pair for each term in the document • Substantial savings in the number of intermediate key-value pairs emitted
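As a small illustration, per-document aggregation might look like this in Python, with `collections.Counter` playing the role of the associative array (whitespace tokenization is an assumption):

```python
from collections import Counter

def map_per_document(docid, doc):
    """Emit one (term, count) pair per unique term in the document."""
    counts = Counter(doc.split())  # associative array local to this map() call
    for term, count in counts.items():
        yield term, count

emitted = list(map_per_document("d1", "to be or not to be"))
print(emitted)
```

The six-term document yields four pairs instead of six; the savings grow with the amount of repetition inside each document.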
Per-Mapper Aggregation • Associative array inside the Mapper object to sum up term counts across multiple documents
In-Mapper Combining • Pros • More control over when local aggregation occurs and exactly how it takes place (recall: no guarantees on combiners) • More efficient than using actual combiners • No additional overhead from object creation, serializing, reading, and writing the key-value pairs • Cons • Breaks the functional programming model (not a big deal!) • Scalability bottleneck • Needs sufficient memory to store intermediate results • Solution: block and flush, i.e. emit the partial results after every N key-value pairs have been processed or after every M bytes have been used.
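A rough Python sketch of in-mapper combining with block-and-flush follows. The class, the `emit` callback, and the `flush_every` parameter are hypothetical names; flushing is triggered here by a count of processed terms (the "every N" variant), and `close()` stands in for the task's finalization hook.

```python
class InMapperCombiningMapper:
    """Keeps term counts in the mapper object across map() calls."""

    def __init__(self, emit, flush_every=1000):
        self.emit = emit                # callback receiving (term, count)
        self.flush_every = flush_every  # N: flush after this many terms
        self.counts = {}                # state preserved across map() calls
        self.processed = 0

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] = self.counts.get(term, 0) + 1
            self.processed += 1
            if self.processed % self.flush_every == 0:
                self.flush()            # bound memory use

    def flush(self):
        for term, count in self.counts.items():
            self.emit(term, count)
        self.counts.clear()

    def close(self):
        """Finalization: emit whatever is still buffered."""
        self.flush()

emitted = []
mapper = InMapperCombiningMapper(lambda k, v: emitted.append((k, v)), flush_every=3)
mapper.map("d1", "a b a")
mapper.map("d2", "b c")
mapper.close()
print(emitted)
```

A flush may split the count for one term into several partial counts, which is harmless because the reducer sums them anyway.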
Correctness with Local Aggregation • Combiners are viewed as optional optimizations • The correctness of the algorithm should not depend on the combiner's computations • Combiners and reducers are not interchangeable • Unless the reduce computation is both commutative and associative • Make sure of the semantics of your aggregation algorithm • Notice for example
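One standard illustration of this pitfall (chosen here as an assumption, not necessarily the example the slide showed) is computing a mean: reusing a mean-computing reducer as a combiner gives the wrong answer, whereas carrying (sum, count) pairs is safe.

```python
def mean(values):
    return sum(values) / len(values)

values = [1, 2, 3, 4, 5]
true_mean = mean(values)

# Wrong: a mean of partial means, as if the reducer were reused as a combiner.
wrong = mean([mean([1, 2]), mean([3, 4, 5])])

def combine_sum_count(partials):
    """Merge (sum, count) pairs; this operation is commutative and associative."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

# Right: combiners pass (sum, count) along; only the reducer divides.
partial_a = combine_sum_count([(v, 1) for v in [1, 2]])
partial_b = combine_sum_count([(v, 1) for v in [3, 4, 5]])
total, count = combine_sum_count([partial_a, partial_b])
right = total / count
print(true_mean, wrong, right)
```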
Pairs and Stripes • In some problems, a common approach is to construct complex keys and values to achieve more efficiency • Example: the problem of building a word co-occurrence matrix from a large document collection • Formally, the co-occurrence matrix of a corpus is a square N x N matrix where N is the number of unique words in the corpus • Cell Mij contains the number of times word Wi co-occurred with word Wj
Pairs Approach • Mapper: emits each co-occurring word pair as the key and the integer one as the value • Reducer: sums up all the values associated with the same co-occurring word pair
Pairs Approach • Pairs algorithm generates a massive number of key-value pairs • Combiners have few opportunities to perform local aggregation • The sparsity of the key space also limits the effectiveness of in-memory combining
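The pairs approach can be sketched in Python as follows. Two assumptions are made for illustration: two words "co-occur" when they appear at different positions of the same document, and `run_pairs` is a hypothetical in-memory stand-in for the framework.

```python
from collections import defaultdict

def pairs_map(docid, doc):
    """Emit ((w, u), 1) for every ordered pair of distinct positions."""
    words = doc.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1

def pairs_reduce(pair, counts):
    """Sum the ones emitted for a single word pair."""
    yield pair, sum(counts)

def run_pairs(docs):
    """Hypothetical single-process driver: map, group by pair, reduce."""
    groups = defaultdict(list)
    for docid, doc in docs:
        for pair, one in pairs_map(docid, doc):
            groups[pair].append(one)
    return {k: v for pair in sorted(groups)
            for k, v in pairs_reduce(pair, groups[pair])}

docs = [("d1", "a b c"), ("d2", "a b")]
cooccurrence = run_pairs(docs)
print(cooccurrence)
```

Even these two tiny documents produce eight intermediate pairs spread over six distinct keys, which hints at why combiners find little to aggregate.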
Stripes Approach • Store co-occurrence information in an associative array • Mapper: emits words as keys and associative arrays as values • Reducer: element-wise sum of all associative arrays of the same key
Stripes Approach • Much more compact representation • Far fewer intermediate key-value pairs • More opportunities to perform local aggregation • May become a scalability bottleneck, since each associative array must fit in memory.
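For comparison, here is a Python sketch of the stripes approach under the same illustrative assumptions (same-document co-occurrence, a hypothetical in-memory driver named `run_stripes`), with `collections.Counter` as the associative array.

```python
from collections import Counter, defaultdict

def stripes_map(docid, doc):
    """Emit (w, stripe) where the stripe counts the words co-occurring with w."""
    words = doc.split()
    for i, w in enumerate(words):
        stripe = Counter(u for j, u in enumerate(words) if j != i)
        yield w, stripe

def stripes_reduce(word, stripes):
    """Element-wise sum of all stripes that share the same key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts
    yield word, total

def run_stripes(docs):
    """Hypothetical single-process driver: map, group by word, reduce."""
    groups = defaultdict(list)
    for docid, doc in docs:
        for word, stripe in stripes_map(docid, doc):
            groups[word].append(stripe)
    return {k: v for word in sorted(groups)
            for k, v in stripes_reduce(word, groups[word])}

docs = [("d1", "a b c"), ("d2", "a b")]
stripes = run_stripes(docs)
print(stripes)
```

On this toy input the mapper emits five key-value pairs, one per word position, and every value is a whole associative array, which is where the memory pressure comes from.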
Which approach is faster? • APW (Associated Press Worldstream): corpus of 2.27 million documents totaling 5.7 GB
Computing Relative Frequencies • In the previous example, (Wi,Wj) co-occurrence may be high just because one of the words is very common! • Solution: Compute relative frequencies
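Written in terms of the co-occurrence matrix cells Mij, the relative frequency can be stated as follows (the summation index k is notation introduced here for illustration):

```latex
f(W_j \mid W_i) = \frac{M_{ij}}{\sum_{k} M_{ik}}
```

The denominator is the total number of co-occurrences in which Wi appears as the left word, so the values of f(Wj | Wi) for a fixed Wi sum to one.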
Relative Frequencies with Stripes • Straightforward! • In the reducer: • Sum the counts of all words that co-occur with the key word • Divide each count by that sum to get the relative frequency! • Lessons: • Use of complex data structures to coordinate distributed computations • Appropriate structuring of keys and values brings together all the pieces of data required to perform a computation • Drawback? • As before, this algorithm assumes that each associative array fits into memory (scalability bottleneck!)
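The reducer described above might be sketched in Python like this; the function name and the sample stripes are made up for illustration.

```python
from collections import Counter

def relative_frequency_reduce(word, stripes):
    """Sum the stripes for one word, then divide each count by the marginal."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    marginal = sum(total.values())  # all co-occurrences of the key word
    yield word, {u: count / marginal for u, count in total.items()}

sample_stripes = [Counter({"b": 1, "c": 1}), Counter({"b": 1})]
(word, freqs), = relative_frequency_reduce("a", sample_stripes)
print(word, freqs)
```

Everything needed for the division arrives in one reduce call, which is exactly why the stripe for a very frequent word has to fit in memory.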
Relative Frequencies with Pairs • The reducer receives (Wi, Wj) as the key and the counts as the value • From this alone it is not possible to compute f(Wj | Wi) • Hint: reducers, like mappers, can preserve state across multiple keys • Solution: on the reducer side, buffer in memory all the words that co-occur with Wi • In essence, this builds the associative array of the stripes approach • Problem? • Word pairs can arrive in arbitrary order! • Solution: we must define the sort order of the pairs • Keys are sorted first by the left word, and then by the right word • So that, when the left word changes: • Sum, calculate and emit the results, then flush the memory
Relative Frequencies with Pairs • Problem? • Pairs with the same left word may be sent to different reducers! • Solution? • We must ensure that all pairs with the same left word are sent to the same reducer • How? • Custom partitioners!! • The partitioner pays attention to the left word only and partitions based on its hash • Will it work? • Yeah! • Drawback? • Still a scalability bottleneck!
Relative Frequencies with Pairs • Another approach? With no bottlenecks? • Can we compute or ‘have’ the sum before processing the pair counts? • The notion of ‘before’ and ‘after’ can be seen in the ordering of the key-value pairs • The insight lies in properly sequencing the data presented to the reducer • The programmer should define the sort order of keys so that data needed earlier is presented earlier to the reducer • So now, we need two things • Compute the sum for a given word Wi • Send that sum to the reducer before any word pair where Wi is the left word
Relative Frequencies with Pairs • How? • To get the sum • Modify the mapper to additionally emit a ‘special’ key of the form (Wi, *), with a value of one • To ensure the order • Define the sort order of the keys so that pairs with the special symbol, of the form (Wi, *), are ordered before any other key-value pairs where the left word is Wi • In addition: • The partitioner pays attention to the left word only
Relative Frequencies with Pairs • Example • Memory bottlenecks? • No!
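The special-key technique from the preceding slides can be simulated in one process. In the Python sketch below, `MARGINAL`, `sort_key`, `left_word_partition`, and `run_single_reducer` are hypothetical names; co-occurrence again means "same document", and the literal word "*" is assumed never to occur in the text.

```python
import zlib
from itertools import groupby

MARGINAL = "*"  # special symbol; assumed never to be a real word

def order_inversion_map(docid, doc):
    """Emit each pair count plus a matching (w, *) count for the marginal."""
    words = doc.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1
                yield (w, MARGINAL), 1

def sort_key(key):
    """Sort by left word, with (w, *) ahead of every other (w, u)."""
    left, right = key
    return (left, right != MARGINAL, right)

def left_word_partition(key, num_reducers):
    """Partition on the left word only, so (w, *) travels with its pairs."""
    return zlib.crc32(key[0].encode("utf-8")) % num_reducers

class RelativeFrequencyReducer:
    """Keeps only one number of state: the marginal of the current left word."""

    def __init__(self):
        self.marginal = 0

    def reduce(self, key, counts):
        total = sum(counts)
        if key[1] == MARGINAL:
            self.marginal = total  # arrives before the pairs it applies to
        else:
            yield key, total / self.marginal

def run_single_reducer(docs):
    """Hypothetical driver: sort all intermediate pairs, feed one reducer."""
    intermediate = [kv for docid, doc in docs
                    for kv in order_inversion_map(docid, doc)]
    intermediate.sort(key=lambda kv: sort_key(kv[0]))
    reducer = RelativeFrequencyReducer()
    output = {}
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        for k, v in reducer.reduce(key, [c for _, c in group]):
            output[k] = v
    return output

docs = [("d1", "a b c"), ("d2", "a b")]
frequencies = run_single_reducer(docs)
print(frequencies)
```

The reducer never buffers the words that co-occur with Wi; it only remembers the current marginal, which is why this variant avoids the memory bottleneck.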