
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. Presented by: Niketan R. Pansare (np6@rice.edu). Outline: Introduction (Map-Reduce, Databases); Motivation and contributions of this paper; Map-Reduce-Merge framework; Implementation of relational operators; Conclusion.


Presentation Transcript


  1. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Presented By: Niketan R. Pansare (np6@rice.edu)

  2. Outline • Introduction: Map-Reduce, Databases • Motivation & Contributions of this paper. • Map-Reduce-Merge framework • Implementation of relational operators • Conclusion

  3. Introduction: Map Reduce • Programming model: • Processing large data sets • Cheap, unreliable hardware • Distributed processing (recovery, coordination, …) • High degree of transparency • The computation takes a set of input key/value pairs and produces a set of output key/value pairs. • The user of the MapReduce library expresses the computation as two functions: map and reduce. (Note: map here is not the higher-order function, but a function passed to it.) • map (k1, v1) -> list(k2, v2) • reduce (k2, list(v2)) -> list(v2)
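
The two signatures on this slide can be illustrated with a small in-memory word count. This is only a single-process sketch: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names chosen for illustration, and a real MapReduce runtime would distribute the grouping step across machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map (k1, v1) -> list(k2, v2): emit (word, 1) for every word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce (k2, list(v2)) -> list(v2): collapse the counts into a single total
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Stand-in for the framework: group intermediate values by key, then reduce
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}

docs = [("d1", "map reduce map"), ("d2", "reduce merge")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Here `word_counts` maps each word to a one-element list, matching the `list(v2)` output type of reduce.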

  4. Map Reduce: Big Picture

  5. MapReduce: Pros/Cons • Extremely good for data processing tasks on homogeneous data sets (e.g., distributed grep, count of URL access frequency). • Bad for heterogeneous data sets (e.g., Employee, Department, Sales, …) • Can process heterogeneous data sets by “homogenization” (i.e., inserting two extra attributes into every record: a key and a data-source tag), but at a cost: • Lots of extra disk space • Excessive map-reduce communication • Limited to queries that can be rewritten as equi-joins
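
A rough sketch of what homogenization looks like in plain MapReduce, assuming two toy tables and hypothetical helper names (`homogenize`, `reduce_join`). Every record is wrapped with a join key and a source tag so that a single reducer can tell the two sources apart, which is where the extra space and communication come from.

```python
from collections import defaultdict

# Toy data, assumed for illustration only
employees = [{"dept_id": 1, "name": "Ann"}, {"dept_id": 2, "name": "Bob"}]
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "R&D"}]

def homogenize(records, key_field, source_tag):
    # Add the two extra attributes: the join key and a data-source tag
    return [(r[key_field], (source_tag, r)) for r in records]

tagged = (homogenize(employees, "dept_id", "emp")
          + homogenize(departments, "dept_id", "dept"))

# Shuffle: both sources now travel through the same key space
groups = defaultdict(list)
for key, tagged_record in tagged:
    groups[key].append(tagged_record)

def reduce_join(key, tagged_records):
    # The reducer separates the sources by tag, then pairs them (an equi-join)
    emps = [r for tag, r in tagged_records if tag == "emp"]
    depts = [r for tag, r in tagged_records if tag == "dept"]
    return [(e["name"], d["dept"]) for e in emps for d in depts]

joined = [row for key in sorted(groups) for row in reduce_join(key, groups[key])]
```

Because records meet only when their keys are equal, this pattern naturally expresses equi-joins and little else.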

  6. Introduction: Databases • Simple extension of set theory (excluding normalization and indexing; see Codd’s paper from 1970). • Supports the following operators: • 4 mathematical operations (Union, Intersection, Difference, Cartesian Product) • 4 extension operators (Selection, Projection, Join, Relational Division) proposed by Codd. • Others: Aggregation, Group-by, Order-by • Example: select s_name from Student s, Courses c where s.s_course_id = c.c_course_id AND c.c_course_name = 'COMP 620'

  7. What interests me the most? • Join: It is one of the most frequently occurring operations and perhaps one of the most difficult to optimize. • Properties of join: • Output lineage differs from the input lineage. (Cannot easily be plugged into MapReduce) • Many different ways to implement it: • Nested loop join (naïve algorithm) • Hash join (smaller relation is hashed) • Sort-Merge join (both relations sorted on join attribute)

  8. Motivation for this paper • Databases are slow when processing large amounts of data. • Why? The fastest databases are usually implemented on shared-everything SMP architectures (e.g., Itanium 64 processors, 128 cores, no clusters). Bottleneck: memory access (Cache + Memory + Disk). • Then why not go for shared-nothing? • Joining two relations is difficult with current frameworks (i.e., Map-Reduce) • Why not extend it?

  9. Contribution of this paper • Process heterogeneous data sets using extension of MapReduce framework. • Added Merge phase • Hierarchical work-flows • Supported different join algorithms. • Map-Reduce-Merge is “relationally complete”.

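  10. Map-Reduce-Merge • In MapReduce: • Map: (k1, v1) → [(k2, v2)] • Reduce: (k2, [v2]) → [v3] • In Map-Reduce-Merge this becomes: • Map: (k1, v1) → [(k2, v2)] • Reduce: (k2, [v2]) → (k2, [v3]) • Merge: ((k2, [v3]), (k3, [v4])) → [(k4, v5)] • Reduce now keeps its key so that Merge can match reduced outputs from two different sources.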
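
As a minimal single-process sketch of the three phases, the following runs two independent map-reduce pipelines and then merges their keyed outputs. The driver names (`map_reduce`, `merge`, `merge_fn`) and the toy data are assumptions made for illustration; they are not the paper's API, and the all-pairs loop in `merge` ignores the partition selection a real implementation would use.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce keeps its key: (k2, [v2]) -> (k2, [v3])
    return [(k2, reduce_fn(k2, vs)) for k2, vs in sorted(groups.items())]

def merge(left, right, merge_fn):
    # Offer every pair of reduced records to merge_fn, which decides what to emit
    out = []
    for l in left:
        for r in right:
            out.extend(merge_fn(l, r))
    return out

bonuses = [("e1", (1, 100)), ("e2", (1, 50)), ("e3", (2, 70))]  # emp_id -> (dept_id, bonus)
depts = [(1, "Sales"), (2, "R&D")]                              # dept_id -> name

# First lineage: total bonus per department; second lineage: department names
left = map_reduce(bonuses, lambda k, v: [(v[0], v[1])], lambda k, vs: [sum(vs)])
right = map_reduce(depts, lambda k, v: [(k, v)], lambda k, vs: vs)

def merge_fn(l, r):
    (k2, v3s), (k3, v4s) = l, r
    if k2 != k3:
        return []
    return [(name, total) for total in v3s for name in v4s]

result = merge(left, right, merge_fn)
```

The output pairs have a different shape from either input, which is the point of the extra phase.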

  11. Programmer-Definable operations • Partition selector - which data should go to which merger? • Processor - process data on an individual source. • Merger - analogous to the map and reduce definitions, define logic to do the merge operation. • Configurable iterators - how to step through each of the lists as you merge

  12. Projection • Return a subset of the attributes of the data passed in. • Mapper can handle this: • Map: (k1, v1) → [(k2, v2)] • v2 may have a different schema than v1.
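
A projection mapper can be sketched in a few lines. The record layout and the `project_map` name are hypothetical; the only point is that the emitted value has fewer attributes than the input value.

```python
def project_map(key, record, attributes=("name", "course_id")):
    # Map: (k1, v1) -> [(k2, v2)], where v2 keeps only the requested attributes
    return [(key, {a: record[a] for a in attributes})]

students = [(1, {"name": "Ann", "course_id": 620, "gpa": 3.9})]
projected = [pair for k, v in students for pair in project_map(k, v)]
```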

  13. Aggregation • At the Reduce phase, Map-Reduce performs the sort-by-key and group-by-key functions to ensure that the input to a reducer is a set of tuples t = (k, [v]) in which [v] is the collection of all the values associated with the key k. • Therefore, reducer can implement “group-by” and “aggregate” operators.
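
The sort-by-key and group-by-key steps can be imitated locally with `sorted` and `itertools.groupby`. This is an in-memory sketch with made-up data; `sum_reducer` is a hypothetical reducer standing in for any aggregate.

```python
from itertools import groupby
from operator import itemgetter

# Mapper output: (key, value) pairs in arbitrary order
mapped = [("sales", 10), ("hr", 5), ("sales", 7), ("hr", 1)]

# Sort by key, then group by key, so each reducer input is (k, [v])
mapped_sorted = sorted(mapped, key=itemgetter(0))
grouped = [(k, [v for _, v in pairs])
           for k, pairs in groupby(mapped_sorted, key=itemgetter(0))]

def sum_reducer(key, values):
    # "group by key, aggregate with SUM"
    return (key, sum(values))

totals = [sum_reducer(k, vs) for k, vs in grouped]
```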

  14. Selection • If selection condition involves only the attributes of one data source, can implement in mappers. • If it’s on aggregates or a group of values contained in one data source, can implement in reducers. • If it involves attributes or aggregates from both data sources, implement in mergers.
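
The three cases can be sketched as one small filter per phase. The function names, field names, and thresholds below are invented for illustration.

```python
def select_in_mapper(key, record):
    # Condition on attributes of a single source: filter while mapping
    return [(key, record)] if record["salary"] > 50 else []

def select_in_reducer(key, values):
    # Condition on an aggregate of one source (similar to SQL HAVING)
    total = sum(values)
    return [total] if total > 100 else []

def select_in_merger(left, right):
    # Condition that compares values from both sources
    (left_key, total_sales), (right_key, budget) = left, right
    if left_key == right_key and total_sales > budget:
        return [(left_key, total_sales, budget)]
    return []
```

For example, `select_in_merger(("d1", 130), ("d1", 100))` keeps the pair because the first source's total exceeds the second source's budget for the same key.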

  15. Set union / intersection / difference • Let each of the two MapReduces emit a sorted list of unique elements. • A simple merge step, like the one in merge sort, can then perform set union/intersection/difference by iterating over the two sorted lists in a single pass.
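
A minimal sketch of that single pass, assuming both inputs are already sorted and free of duplicates. The function name `merge_set_op` and its string-valued `op` parameter are illustrative choices.

```python
def merge_set_op(a, b, op):
    # a and b are sorted lists of unique elements; op is "union",
    # "intersection", or "difference" (meaning a minus b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            if op in ("union", "intersection"):
                out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] appears only in a
            if op in ("union", "difference"):
                out.append(a[i])
            i += 1
        else:
            # b[j] appears only in b
            if op == "union":
                out.append(b[j])
            j += 1
    # Drain whatever remains on either side
    if op in ("union", "difference"):
        out.extend(a[i:])
    if op == "union":
        out.extend(b[j:])
    return out
```

Each element is visited once, so the cost is linear in the combined length of the two lists.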

  16. Cartesian Products • Set the reducers up to output the two sets you want the Cartesian product of. • Each merger will get one partition F from the first set of reducers, and the full set of partitions S from the second. • Each merger emits F x S.
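
In a local sketch, each merger pairs its own partition F with the concatenation of all partitions of S. The partition contents and the `merger` function name are assumptions for illustration.

```python
from itertools import product

# Reducer outputs, one inner list per partition (toy data)
first_partitions = [["a1", "a2"], ["a3"]]
second_partitions = [["b1"], ["b2", "b3"]]

def merger(f_partition, all_s_partitions):
    # One partition F from the first set, the full set S from the second
    s_full = [s for part in all_s_partitions for s in part]
    return list(product(f_partition, s_full))

# Running one merger per F partition covers the whole product exactly once
pairs = [p for f in first_partitions for p in merger(f, second_partitions)]
```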

  17. Sort-Merge Join • Map: partition records into key ranges according to the values of the join attributes, aiming for an even distribution of records to reducers. • Reduce: sort the data within each key range. • Merge: join the sorted data for each key range.
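
The three steps can be sketched in one process as follows. The range boundaries, the helper names (`range_partition`, `merge_join`, `sort_merge_join`), and the sample relations are all assumptions; a real deployment would pick boundaries by sampling the key distribution.

```python
from operator import itemgetter

def range_partition(records, boundaries):
    # Map: place each (key, value) in the bucket for its key range
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        idx = sum(key >= b for b in boundaries)
        buckets[idx].append((key, value))
    return buckets

def merge_join(left, right):
    # Merge: both inputs are sorted by key; advance two cursors in step
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, pair it with each left row
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def sort_merge_join(r, s, boundaries):
    out = []
    for rb, sb in zip(range_partition(r, boundaries), range_partition(s, boundaries)):
        # Reduce: sort each bucket by key; Merge: join matching key ranges
        out.extend(merge_join(sorted(rb, key=itemgetter(0)),
                              sorted(sb, key=itemgetter(0))))
    return out

students = [(620, "Ann"), (101, "Bob"), (620, "Cy")]
courses = [(101, "Intro"), (620, "COMP 620"), (300, "Algo")]
enrolled = sort_merge_join(students, courses, boundaries=[300])
```

Because both relations use the same boundaries, matching keys always land in corresponding buckets.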

  18. Hash Join/Nested loop • Map: use the same hash function for both sets of mappers. • Reduce: produce a hash table from the values mapped. (For nested loop: Don’t hash) • Merge: operates on corresponding hash buckets. Use one bucket as a build set, and the other as a probe. (For nested loop: do loop-match instead of hash-probe.)
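
A local sketch of both variants, assuming integer join keys and hypothetical helper names. The first relation's bucket is always used as the build side here for simplicity; choosing the smaller side would be the usual refinement.

```python
from collections import defaultdict

def hash_partition(records, num_buckets):
    # Map: the same hash function is applied to both relations
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets

def hash_join_bucket(build, probe):
    # Build a hash table from one bucket, then probe it with the other
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]

def nested_loop_bucket(left, right):
    # Nested-loop variant: compare every pair instead of probing a table
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def hash_join(r, s, num_buckets=4, join_bucket=hash_join_bucket):
    out = []
    # Merge: only corresponding buckets are ever compared
    for rb, sb in zip(hash_partition(r, num_buckets), hash_partition(s, num_buckets)):
        out.extend(join_bucket(rb, sb))
    return out

students = [(620, "Ann"), (101, "Bob"), (620, "Cy")]
courses = [(101, "Intro"), (620, "COMP 620"), (300, "Algo")]
by_hash = hash_join(students, courses)
by_loop = hash_join(students, courses, join_bucket=nested_loop_bucket)
```

Both variants produce the same set of rows; only the per-bucket matching strategy differs.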

  19. Conclusions • Map-Reduce-Merge programming model retains Map-Reduce’s many great features, while adding relational algebra to the list of database principles it upholds. • Suggestion (or more like future-work) to use Map-Reduce-Merge as framework for parallel databases.

  20. Thank You. Questions ?

  21. References • Jeffrey Dean and Sanjay Ghemawat (Google Labs). "MapReduce: Simplified Data Processing on Large Clusters". • Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker (Yahoo and UCLA). "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters". Proc. of ACM SIGMOD, pp. 1029–1040, 2007. • David DeWitt and Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html. Retrieved 2008-08-27. • Ralf Lämmel (Microsoft). "Google's MapReduce Programming Model -- Revisited". • E. F. Codd (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387. • http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=107061802 • http://cs.baylor.edu/~speegle/5335/2007slides/MapReduceMerge.pdf
