Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Presented By: Niketan R. Pansare (np6@rice.edu)

Outline • Introduction: Map-Reduce, Databases • Motivation & Contributions of this paper. • Map-Reduce-Merge framework • Implementation of relational operators • Conclusion

Introduction: Map Reduce • Programming model: • Processing large data sets • Cheap, unreliable hardware • Distributed processing (recovery, coordination, …) • High degree of transparency • The computation takes a set of input key/value pairs and produces a set of output key/value pairs. • The user of the MapReduce library expresses the computation as two functions: map and reduce. (Note: map here is not the higher-order function, but the function passed to it.) • map (k1, v1) -> list(k2, v2) • reduce (k2, list(v2)) -> list(v2)
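The map and reduce signatures above can be illustrated with a small single-process sketch. It counts URL accesses in a toy log. The driver `run_map_reduce` and the sample data are assumptions made for illustration; a real MapReduce library distributes this work across machines and handles recovery.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: log line number (unused), value: the requested URL.
    # map (k1, v1) -> list(k2, v2)
    yield (value, 1)

def reduce_fn(key, values):
    # key: URL, values: all counts emitted for that URL.
    # reduce (k2, list(v2)) -> list(v2)
    yield sum(values)

def run_map_reduce(records, map_fn, reduce_fn):
    # Hypothetical in-memory driver: group map output by key, then reduce.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: list(reduce_fn(k2, vs)) for k2, vs in sorted(groups.items())}

log = [(0, "/a"), (1, "/b"), (2, "/a")]
counts = run_map_reduce(log, map_fn, reduce_fn)
print(counts)
```

The user only writes `map_fn` and `reduce_fn`; grouping by key is the framework's job.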

Map Reduce: Big Picture

MapReduce: Pros/Cons • Extremely good for data processing tasks on homogeneous data sets (e.g., distributed grep, count of URL access frequency). • Bad for heterogeneous data sets (e.g., Employee, Department, Sales, …) • Can process heterogeneous data sets by “homogenization” (i.e., inserting two extra attributes: a key and a data-source tag) • Lots of extra disk space • Excessive map-reduce communication • Limited to queries that can be rewritten as equi-joins
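A minimal sketch of homogenization, assuming two toy relations (employees and departments) that are not in the original slides. Every record gets a join key and a data-source tag so that a single reducer can tell the two sources apart and perform an equi-join.

```python
from collections import defaultdict

employees = [("e1", "d1"), ("e2", "d2"), ("e3", "d1")]  # (emp_id, dept_id)
departments = [("d1", "Sales"), ("d2", "R&D")]          # (dept_id, dept_name)

def homogenize(employees, departments):
    # Emit (key, (source_tag, payload)) so both sources share one schema.
    for emp_id, dept_id in employees:
        yield (dept_id, ("EMP", emp_id))
    for dept_id, name in departments:
        yield (dept_id, ("DEPT", name))

def reduce_join(key, tagged_values):
    # Split the mixed values back out by tag, then pair them up.
    emps = [v for tag, v in tagged_values if tag == "EMP"]
    depts = [v for tag, v in tagged_values if tag == "DEPT"]
    return [(e, key, d) for e in emps for d in depts]

groups = defaultdict(list)
for key, tagged in homogenize(employees, departments):
    groups[key].append(tagged)

joined = [row for key in sorted(groups) for row in reduce_join(key, groups[key])]
print(joined)
```

The extra tag on every record is where the added disk space and communication come from, and the grouping by key is why only equi-joins fit this pattern.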

Introduction: Databases • Simple extension of set theory (excluding normalization and indexing; see Codd’s paper from 1970). • Supports the following operators: • 4 mathematical operations (Union, Intersection, Difference, Cartesian Product) • 4 extension operators (Selection, Projection, Join, Relational Division) proposed by Codd. • Others: Aggregation, Group by, Order by • Example: select s_name from Student s, Courses c where s.s_course_id = c.c_course_id AND c.c_course_name = 'COMP 620'

What interests me the most? • Join: It is one of the most frequently occurring operations and perhaps one of the most difficult to optimize. • Properties of join: • The output lineage differs from the input lineages. (Cannot easily be plugged into MapReduce) • Many different ways to implement it: • Nested loop join (naïve algorithm) • Hash join (smaller relation is hashed) • Sort-Merge join (both relations sorted on join attribute)

Motivation for this paper • Databases are slow when processing large amounts of data. • Why? The fastest databases are usually implemented on shared-everything SMP architectures (e.g., Itanium: 64 processors, 128 cores, no clusters). Bottleneck: memory access (cache + memory + disk). • Then why not go for shared-nothing? • Joining two relations is difficult with current frameworks (i.e., Map-Reduce) • Why not extend it?

Contribution of this paper • Processes heterogeneous data sets using an extension of the MapReduce framework. • Adds a Merge phase • Hierarchical work-flows • Supports different join algorithms. • Map-Reduce-Merge is “relationally complete”.

Map-Reduce-Merge • Map-Reduce: • Map: (k1, v1) → [(k2, v2)] • Reduce: (k2, [v2]) → [v3] • becomes Map-Reduce-Merge: • Map: (k1, v1) → [(k2, v2)] • Reduce: (k2, [v2]) → (k2, [v3]) • Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
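The three signatures can be sketched in a single process as follows. The helper names (`map_reduce`, `merge`), the toy employee and department data, and the choice to merge only pairs whose keys are equal are all assumptions for illustration; the framework itself lets the programmer decide which reducer outputs meet in which merger.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map: (k1, v1) -> [(k2, v2)];  Reduce: (k2, [v2]) -> (k2, [v3])
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [(k2, reduce_fn(k2, vs)) for k2, vs in sorted(groups.items())]

def merge(left, right, merge_fn):
    # Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5])
    # Simplification: only pairs with k2 == k3 are handed to merge_fn.
    right_by_key = dict(right)
    return [merge_fn((k, v3s), (k, right_by_key[k]))
            for k, v3s in left if k in right_by_key]

# Lineage 1: (emp_id, (dept_id, bonus)) reduced to total bonus per department.
emp = [("e1", ("d1", 100)), ("e2", ("d2", 50)), ("e3", ("d1", 25))]
emp_out = map_reduce(emp, lambda k, v: [(v[0], v[1])], lambda k, vs: [sum(vs)])

# Lineage 2: (dept_id, multiplier) passed through unchanged.
dept = [("d1", 2), ("d2", 3)]
dept_out = map_reduce(dept, lambda k, v: [(k, v)], lambda k, vs: list(vs))

merged = merge(emp_out, dept_out,
               lambda l, r: (l[0], [a * b for a in l[1] for b in r[1]]))
print(merged)
```

Note that Reduce now keeps its key in the output, which is what allows Merge to line up the two lineages.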

Programmer-Definable operations • Partition selector - which data should go to which merger? • Processor - process data on an individual source. • Merger - analogous to the map and reduce definitions, define logic to do the merge operation. • Configurable iterators - how to step through each of the lists as you merge

Projection • Return a subset of the attributes of the data passed in. • Mapper can handle this: • Map: (k1, v1) → [(k2, v2)] • v2 may have a different schema than v1.

Aggregation • At the Reduce phase, Map-Reduce performs the sort-by-key and group-by-key functions to ensure that the input to a reducer is a set of tuples t = (k, [v]) in which [v] is the collection of all the values associated with the key k. • Therefore, reducer can implement “group-by” and “aggregate” operators.
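As a rough illustration of the sort-by-key and group-by-key step, the sketch below uses `itertools.groupby` to build the (k, [v]) tuples and then applies an averaging reducer. The `shuffle` helper and the sales data are hypothetical stand-ins for what the framework does between map and reduce.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    # Sort by key, then group by key, yielding t = (k, [v]).
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_avg(key, values):
    # Aggregate: average of all values that share this key.
    return (key, sum(values) / len(values))

sales = [("east", 10), ("west", 4), ("east", 20), ("west", 6)]
averages = [reduce_avg(k, vs) for k, vs in shuffle(sales)]
print(averages)
```

Swapping `reduce_avg` for a sum, count, min, or max gives the other common aggregates without touching the grouping step.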

Selection • If selection condition involves only the attributes of one data source, can implement in mappers. • If it’s on aggregates or a group of values contained in one data source, can implement in reducers. • If it involves attributes or aggregates from both data sources, implement in mergers.

Set union / intersection / difference • Let each of the two MapReduces emit a sorted list of unique elements. • A naïve merge step, like the one in merge sort, can then perform set union/intersection/difference (i.e., a single iteration over the two sorted lists).
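A minimal sketch of that merge step, assuming both inputs are already sorted and duplicate-free. The function names and the three boolean flags are illustrative choices; the flags select which of the three cases (left only, both, right only) are emitted.

```python
def merge_sets(a, b, left_only, both, right_only):
    # Walk two sorted, unique lists in lockstep, as in merge sort.
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            if left_only:
                out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            if right_only:
                out.append(b[j])
            j += 1
        else:
            if both:
                out.append(a[i])
            i += 1
            j += 1
    # Whatever remains exists in only one of the inputs.
    if left_only:
        out.extend(a[i:])
    if right_only:
        out.extend(b[j:])
    return out

def union(a, b):
    return merge_sets(a, b, True, True, True)

def intersection(a, b):
    return merge_sets(a, b, False, True, False)

def difference(a, b):
    return merge_sets(a, b, True, False, False)

a, b = [1, 3, 5, 7], [3, 4, 7, 9]
print(union(a, b), intersection(a, b), difference(a, b))
```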

Cartesian Products • Set the reducers up to output the two sets you want the Cartesian product of. • Each merger will get one partition F from the first set of reducers, and the full set of partitions S from the second. • Each merger emits F x S.
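The merger assignment described above might look like this in a single process. The partition contents and the `merger` function are assumptions for illustration; the point is only that each merger pairs one partition F of the first set with all partitions of the second.

```python
from itertools import product

def merger(f_partition, s_partitions):
    # One merger: a single partition F from the first reducers,
    # plus every partition of S from the second reducers.
    full_s = [row for part in s_partitions for row in part]
    return list(product(f_partition, full_s))

first = [["a", "b"], ["c"]]   # reducer outputs of the first set, by partition
second = [[1], [2, 3]]        # reducer outputs of the second set, by partition

# One merger per partition of the first set; together they cover F x S.
result = [pair for f in first for pair in merger(f, second)]
print(result)
```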

Sort-Merge Join • Map: partition records into key ranges according to the values of the join attributes on which you’re sorting, aiming for an even distribution of records to reducers. • Reduce: sort the data within each key range. • Merge: join the sorted data for each key range.
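The three phases can be sketched as below, assuming a fixed list of range boundaries and toy student/course records keyed by course id (both hypothetical). Map range-partitions, Reduce sorts each range, and Merge joins the matching ranges with a standard merge join that handles duplicate keys.

```python
import bisect

def range_partition(records, boundaries):
    # Map phase: bucket i holds keys below boundaries[i]; the last bucket is open-ended.
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets

def merge_join(left, right):
    # Merge phase: both inputs are sorted on the join key.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], rv) for _, rv in right[j:j_end])
                i += 1
            j = j_end
    return out

students = [(3, "Ann"), (12, "Bob"), (3, "Cy"), (20, "Dee")]       # (course_id, name)
courses = [(12, "COMP 620"), (3, "COMP 101"), (15, "COMP 200")]    # (course_id, title)
boundaries = [10]  # assumed split point: keys < 10 and keys >= 10

joined = []
for s_part, c_part in zip(range_partition(students, boundaries),
                          range_partition(courses, boundaries)):
    # Reduce phase: sort each key range on the join key.
    s_sorted = sorted(s_part, key=lambda r: r[0])
    c_sorted = sorted(c_part, key=lambda r: r[0])
    joined.extend(merge_join(s_sorted, c_sorted))
print(joined)
```

Because both relations use the same boundaries, matching keys always land in the same range, so each merger only needs its own pair of sorted partitions.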

Hash Join/Nested loop • Map: use the same hash function for both sets of mappers. • Reduce: produce a hash table from the values mapped. (For nested loop: don’t build a hash table.) • Merge: operates on corresponding hash buckets. Use one bucket as the build set and the other as the probe set. (For nested loop: do loop-match instead of hash-probe.)
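A small sketch of both variants, assuming integer join keys, four buckets, and toy department/employee data (all illustrative). Both relations are partitioned with the same hash function, and each pair of corresponding buckets is joined either by build-and-probe or by a nested loop.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count

def partition(records, num_buckets=NUM_BUCKETS):
    # Map phase: the same hash function is applied to both relations.
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets

def hash_join_bucket(build, probe):
    # Build a hash table on one bucket, then probe it with the other.
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(key, bv, pv) for key, pv in probe for bv in table.get(key, [])]

def nested_loop_bucket(left, right):
    # Nested-loop variant: compare every pair instead of probing a table.
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

departments = [(1, "Sales"), (2, "R&D")]                      # smaller relation: build side
employees = [(1, "Ann"), (2, "Bob"), (1, "Cy"), (3, "Dee")]   # probe side

pairs = list(zip(partition(departments), partition(employees)))
hash_joined = sorted(row for b, p in pairs for row in hash_join_bucket(b, p))
loop_joined = sorted(row for b, p in pairs for row in nested_loop_bucket(b, p))
print(hash_joined)
```

Both variants return the same rows; they differ only in how much work each merger does per bucket.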

Conclusions • The Map-Reduce-Merge programming model retains Map-Reduce’s many great features, while adding relational algebra to the list of database principles it upholds. • Suggestion (or more like future work): use Map-Reduce-Merge as a framework for parallel databases.

Thank You. Questions?

References • Jeffrey Dean and Sanjay Ghemawat (Google Labs). "MapReduce: Simplified Data Processing on Large Clusters" • Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker (Yahoo and UCLA). "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters". Proc. of ACM SIGMOD, pp. 1029–1040, 2007 • David DeWitt and Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html. Retrieved 2008-08-27 • Ralf Lämmel (Microsoft). "Google's MapReduce Programming Model -- Revisited" • E. F. Codd (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387 • http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=107061802 • http://cs.baylor.edu/~speegle/5335/2007slides/MapReduceMerge.pdf
