Joins in MapReduce

Shamik Bose, Department of Computer Science, Florida State University

Motivation
• MapReduce is the framework of choice for big data analytics
• Legacy systems do not perform well on very large amounts of data
• Joins are computationally expensive, yet unavoidable in applications

Types of joins
• Reduce-side join
• Map-side join
• Broadcast join
• Fuzzy join

Reduce-side join: map operation
• The join itself takes place on the reduce side
• The map phase is used only to pre-process the data
• The mapper reads one tuple at a time
• The join key (column) becomes the output key of the map function
• The rest of the tuple becomes the value
• The key and the value are both tagged with the name of the parent dataset
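The map step above can be sketched in plain Python (not the Hadoop API). The function name `tagged_map`, the integer tags, and the `key_index` parameter are assumptions made for illustration only.

```python
def tagged_map(dataset_tag, record, key_index):
    """Emit one tagged (key, value) pair for a single input tuple.

    dataset_tag: hypothetical integer identifying the parent dataset (0 or 1)
    record:      one input tuple
    key_index:   position of the join column inside the tuple
    """
    join_key = record[key_index]
    # Everything except the join column becomes the value.
    rest = record[:key_index] + record[key_index + 1:]
    # Both the key and the value carry the tag of the parent dataset.
    yield (join_key, dataset_tag), (dataset_tag, rest)


if __name__ == "__main__":
    for pair in tagged_map(0, ("u1", "alice"), key_index=0):
        print(pair)
```

Tagging the key lets the framework order tuples by dataset inside a key group, while tagging the value lets the reducer tell the two datasets apart.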

[Figure: tuples of Dataset 0 and Dataset 1, each labeled with the tag of its parent dataset]

Reduce-side join: partitioning and grouping
• The default partitioner is overridden
  • Partitioning is done only on the key, not on the tag
  • Tuples with the same key therefore go to the same reducer
• The grouping function is also overridden
  • Ensures that keys that differ only in their tag are placed in the same group by the reducer
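A minimal sketch of the two overrides, simulated in plain Python rather than with Hadoop's partitioner and comparator classes. The names `partition` and `group_for_reducer`, and the use of CRC32 as the hash, are assumptions for illustration.

```python
import zlib
from itertools import groupby


def partition(composite_key, num_reducers):
    """Choose a reducer from the join key only; the tag is ignored."""
    join_key, _tag = composite_key
    return zlib.crc32(str(join_key).encode("utf-8")) % num_reducers


def group_for_reducer(pairs):
    """Sort by (join key, tag) but group by the join key alone.

    pairs: iterable of ((join_key, tag), value) as emitted by the mappers.
    Yields (join_key, [values...]) with values ordered by dataset tag.
    """
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for join_key, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield join_key, [value for _key, value in group]
```

Because the sort still uses the tag, all tuples of one dataset reach the reducer before any tuple of the other, which is what makes the buffering strategy on the reduce side workable.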


Reduce-side join: reduce operation
• The reducer invokes the reduce() function once for each key group
• Tuples of the dataset that arrives first are buffered
• Each reduce() call joins the values from the buffer with the data from the stream
• The tag values are required to make sure that values from the same dataset are not joined with each other
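The reduce step can be sketched as follows, assuming the values of a key group arrive ordered by tag (all tuples of the first dataset before any tuple of the second). The function name `reduce_join` and the shape of the output tuples are illustrative assumptions.

```python
def reduce_join(join_key, tagged_values):
    """Join one key group of a two-way reduce-side join.

    tagged_values: iterable of (tag, rest_of_tuple), ordered by tag.
    """
    buffered = []      # tuples of the dataset that arrives first
    first_tag = None
    for tag, rest in tagged_values:
        if first_tag is None:
            first_tag = tag
        if tag == first_tag:
            # Same dataset as the buffer: never joined with itself.
            buffered.append(rest)
        else:
            # Streamed tuple of the other dataset: join with the buffer.
            for left in buffered:
                yield (join_key,) + left + rest


if __name__ == "__main__":
    group = [(0, ("alice",)), (1, ("book",)), (1, ("pen",))]
    for row in reduce_join("u1", group):
        print(row)
```

Only one dataset is held in memory per key; the other is consumed as a stream.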

Reduce-side join for multi-way joins
• Can be carried out in two ways
  • One-shot join
  • Cascade join
• One-shot join: similar to the reduce-side join, with a few changes
  • The list of tables is passed as an argument to the job
  • The map and grouping phases are similar
  • The reducer dynamically creates buffers to hold all but the last dataset
  • To reduce the chance of memory overflow, buffers are periodically written to disk
• Cascade join
  • Iterative version of the reduce-side join
  • Requires setting up a separate job for each two-way join
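The cascade approach can be illustrated with a small in-memory sketch in which each call to `two_way_join` stands in for one complete reduce-side join job. Both function names, and the assumption that the join key is the first field of every tuple, are made up for this example.

```python
from collections import defaultdict
from functools import reduce


def two_way_join(left, right):
    """Stand-in for one two-way join job; joins on the first field."""
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row[1:])
    return [l + r for l in left for r in index.get(l[0], [])]


def cascade_join(tables):
    """Join N tables with N - 1 consecutive two-way joins."""
    return reduce(two_way_join, tables)


if __name__ == "__main__":
    users = [("u1", "alice"), ("u2", "bob")]
    orders = [("u1", "book")]
    cities = [("u1", "paris"), ("u2", "rome")]
    print(cascade_join([users, orders, cities]))
```

The output of each job becomes the left input of the next, so intermediate results are materialized between jobs, unlike in the one-shot join.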

Map-side join: overview
• Alternative to the reduce-side join
• Requirements on the datasets
  • Datasets must be sorted with the same comparator
  • Datasets must be partitioned using the same partitioner
  • The number of partitions must be the same in all datasets

Map-side join: operation
• All constraints can be satisfied by simple Hadoop jobs
  • Datasets are passed through IdentityMapper() and IdentityReducer()
  • This does not pre-process the data
  • It only ensures that the data conforms to the constraints
• The join takes place as follows
  • Each mapper considers one partition of a dataset
  • The corresponding partition of the other dataset is scanned
  • If matches for the join key are found, the tuples are joined
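Since both partitions are sorted by the same comparator, one way to scan the corresponding partition is a merge join. The sketch below is a plain-Python illustration, not Hadoop's own implementation; the name `merge_join` and the assumption that the join key is the first field are hypothetical.

```python
def merge_join(left, right):
    """Join two partitions that are both sorted on their first field."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand tuples sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left-hand tuple of the key with that run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r[1:])
                i += 1
            j = j_end
    return out
```

Each partition is read once, in order, which is why the sorting and partitioning constraints are required before the join can run on the map side.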

Broadcast join: overview
• One of the datasets should be small enough to fit in main memory
  • e.g., a list of items by a manufacturer (S) joined against sales records (R)
  • S << R
• Also called in-memory join
• The overhead of transferring data from mappers to reducers can be avoided
• The small dataset is replicated on every machine

Broadcast join: operation
• The Hadoop options -files or -archives are used to send the small dataset to each machine when the job is invoked
• The map() function is called for each tuple of the larger dataset
• Each <key,value> pair is matched against the smaller dataset
• If matches are found, the tuples are joined and returned to the invoking function
• A further optimization is to store the local dataset in a hash table
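A minimal sketch of the hash-table optimization, assuming the small dataset has already been copied to the local machine and that the join key is the first field of every tuple. The names `build_hash_table` and `broadcast_map` are illustrative, not part of Hadoop.

```python
from collections import defaultdict


def build_hash_table(small_dataset):
    """Load the small dataset S into memory, indexed by join key."""
    table = defaultdict(list)
    for key, *rest in small_dataset:
        table[key].append(tuple(rest))
    return table


def broadcast_map(record, table):
    """Map function for one tuple of the large dataset R."""
    key, *rest = record
    # Constant-time lookup replaces a scan of the small dataset.
    for match in table.get(key, []):
        yield (key, *rest, *match)


if __name__ == "__main__":
    items = [("i1", "acme"), ("i2", "bolt")]   # S: items by manufacturer
    table = build_hash_table(items)
    for sale in [("i1", 3), ("i9", 1), ("i2", 7)]:   # R: sales records
        print(list(broadcast_map(sale, table)))
```

The table is built once per mapper and reused for every tuple, so no shuffle or reduce phase is needed.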

Map-side vs. reduce-side vs. broadcast

Fuzzy joins
• Most joins in relational databases are equi-joins
• For unclustered data, similarity joins are also necessary
• All elements of a dataset that are within a similarity threshold are returned
• Several similarity predicates are available
  • Hamming distance
  • Edit distance
  • Jaccard distance

Hamming distance
• For a Hamming distance algorithm, the problem statement is as follows: given a set S of b-bit strings and a threshold d, find the set $\{(s_1, s_2) \mid s_1, s_2 \in S,\ HD(s_1, s_2) \le d\}$
• HD(s1,s2) is the number of positions at which the two strings differ
• E.g., 'cat' and 'cot' have a Hamming distance of 1
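The definition can be checked with a short sketch. `naive_fuzzy_join` is a hypothetical single-machine, brute-force version that compares every pair; it only illustrates the problem statement and is not one of the MapReduce algorithms from the references.

```python
from itertools import combinations


def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def naive_fuzzy_join(strings, d):
    """All unordered pairs of distinct elements within distance d."""
    return [(s, t) for s, t in combinations(strings, 2)
            if hamming_distance(s, t) <= d]


if __name__ == "__main__":
    print(hamming_distance("cat", "cot"))
    print(naive_fuzzy_join(["0000", "0001", "0111", "1111"], d=1))
```

The brute-force version performs a quadratic number of comparisons, which is the cost that distributed fuzzy-join algorithms aim to spread across machines.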

Applications of fuzzy joins
• Recommendation engines
• Collaborative filtering
• Clustering algorithms

References
• Jairam Chandar, Join Algorithms using Map-Reduce, University of Edinburgh
• Foto N. Afrati, Jeffrey D. Ullman, Optimizing Joins in a Map-Reduce Environment, EDBT 2010
• Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya Parameswaran, Jeffrey D. Ullman, Fuzzy Joins using MapReduce

Questions?
