Join Using MapReduce • Cloud Group, WAMDM • Youzhong MA • May 11, 2012
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Overview about Join Using MapReduce • Basic Join • Complex Join
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Introduction
• k nearest neighbor join (kNN join): given two data sets R and S, for every point q in R, the kNN join returns the k nearest points of q from S.
• Example: a 3-NN join for q returns the pairs (q, p1), (q, p3), (q, p4).
• Applications: data mining, spatial databases, etc.
• Efficient Parallel kNN Joins for Large Data in MapReduce [Chi Zhang et al. EDBT’2012]
Introduction
• Exact kNN join:
  • knn(r, S) = the set of k nearest neighbors of r from S.
  • knnJ(R, S) = {(r, knn(r, S)) | ∀r ∈ R}.
• Approximate kNN join:
  • aknn(r, S) = an approximate kNN set of r from S.
  • Let p be the k-th NN of r in knn(r, S) and p′ the k-th NN of r in aknn(r, S); aknn(r, S) is a c-approximation of knn(r, S) if d(r, p) ≤ d(r, p′) ≤ c · d(r, p).
  • aknnJ(R, S) = {(r, aknn(r, S)) | ∀r ∈ R}.
Exact kNN join: Block Nested Loop Join
• Block nested loop join (BNLJ) based method (a sketch follows below):
  • Partition R and S, each into n equal-sized disjoint blocks.
  • Perform BNLJ for each of the n² possible (Ri, Sj) block pairs.
  • For every record in R, merge the n local kNN results into its global kNN result.
[Figure: R split into R1, R2 and S into S1, S2; BNLJ is run on each of the four (Ri, Sj) pairs]
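To make the scheme concrete, here is a minimal single-machine sketch of the BNLJ-based kNN join in Python (my own illustration, not the paper's Hadoop code): `local_knn` is the per-block-pair brute-force step, and the merge keeps the k globally best neighbors per record.

```python
import heapq
from itertools import product

def local_knn(r_block, s_block, k):
    """Brute-force local kNN: for each r in the R block, the k closest points in the S block."""
    out = {}
    for rid, r in r_block:
        dists = [(sum((a - b) ** 2 for a, b in zip(r, s)) ** 0.5, sid) for sid, s in s_block]
        out[rid] = heapq.nsmallest(k, dists)
    return out

def bnlj_knn_join(R, S, n, k):
    """Partition R and S into n blocks each, run local kNN on all n*n block pairs, merge."""
    r_blocks = [R[i::n] for i in range(n)]   # n roughly equal-sized disjoint blocks
    s_blocks = [S[i::n] for i in range(n)]
    global_knn = {}
    for r_blk, s_blk in product(r_blocks, s_blocks):
        for rid, local in local_knn(r_blk, s_blk, k).items():
            merged = global_knn.get(rid, []) + local
            global_knn[rid] = heapq.nsmallest(k, merged)   # keep the global k best
    return global_knn

# R and S: lists of (id, point) with points as coordinate tuples.
R = [("r1", (0.0, 0.0)), ("r2", (5.0, 5.0))]
S = [("s1", (1.0, 0.0)), ("s2", (4.0, 5.0)), ("s3", (9.0, 9.0))]
print(bnlj_knn_join(R, S, n=2, k=2))
```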
Exact kNN join: Block Nested Loop Join
• Two-round MapReduce algorithm, Round 1: map each record of R and S to the buckets of all the block pairs it participates in; each reducer performs the local BNLJ kNN for its (Ri, Sj) pair.
Exact kNN join: Block Nested Loop Join
• Two-round MapReduce algorithm, Round 2: for each r ∈ R, collect the n local results, e.g. from BNLJ(R1, S1) and BNLJ(R1, S2) for r ∈ R1, and merge them into the global knn(r, S).
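In MapReduce terms, round 1 could look like the following hedged sketch (Hadoop-Streaming-style Python, my reconstruction from the block-pair description above): each R record is replicated to n buckets, each S record likewise, and every reducer runs one local BNLJ.

```python
# Round 1 mapper: a record of block Ri of R goes to buckets (i, 0..n-1);
# a record of block Sj of S goes to buckets (0..n-1, j).
def map_round1(record, n):
    origin, block_id, rec_id, point = record        # origin is 'R' or 'S'
    if origin == 'R':
        for j in range(n):
            yield (block_id, j), (origin, rec_id, point)
    else:
        for i in range(n):
            yield (i, block_id), (origin, rec_id, point)

# Round 1 reducer: one bucket = one (Ri, Sj) pair; run the local BNLJ kNN.
def reduce_round1(bucket_key, values, k):
    r_recs = [(rid, p) for o, rid, p in values if o == 'R']
    s_recs = [(sid, p) for o, sid, p in values if o == 'S']
    for rid, local in local_knn(r_recs, s_recs, k).items():  # local_knn: see sketch above
        yield rid, local

# Round 2 then groups the local lists by rid and keeps the k smallest distances.
```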
Exact kNN join: Block R-tree Join
• Use a spatial index (R-tree) to improve performance:
  • Build an R-tree index over the S block of each bucket to speed up the kNN computations (a sketch follows below).
  • Otherwise identical to the BNLJ algorithm: simply replace BNLJ with block R-tree join (BRJ) in the first round.
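For BRJ, the per-bucket step can use an off-the-shelf R-tree. A sketch using the third-party Python `rtree` package (an assumption for illustration; the paper's implementation runs inside Hadoop):

```python
from rtree import index  # third-party wrapper around libspatialindex

def local_knn_rtree(r_block, s_block, k):
    """BRJ local step: index the S block once, then answer each kNN query from the tree."""
    idx = index.Index()
    for i, (sid, (x, y)) in enumerate(s_block):
        idx.insert(i, (x, y, x, y), obj=sid)   # points stored as degenerate rectangles
    return {rid: [hit.object for hit in idx.nearest((x, y, x, y), k, objects=True)]
            for rid, (x, y) in r_block}
```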
Approximate kNN Join
• Problems with the exact kNN join solution:
  • Too much communication and computation (n² buckets required).
• Goal: a solution requiring only O(n) buckets, so we settle for approximate answers.
• Approach: space-filling curve based methods ([YLK10], dubbed zkNN).
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and kNN-joins in large relational databases (almost) for free. ICDE, 2010.
Approximate kNN Join: Z-order kNN join
• The idea of zkNN (a sketch of the Z-value transform follows below):
  • Transform d-dimensional points to 1-D values using the Z-value.
  • Map a d-dimensional kNN join query to 1-D range queries.
  • Use multiple randomly shifted copies of the data to improve spatial locality.
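A minimal sketch of the Z-value transform (assuming non-negative integer coordinates that fit in a fixed number of bits): interleaving the coordinate bits maps nearby points to nearby 1-D values, so kNN becomes a short scan over the sorted Z-values.

```python
def z_value(point, bits=16):
    """Interleave the bits of each coordinate to get the 1-D Z-order value."""
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * len(point) + dim)
    return z

# Points close in space tend to get close Z-values; the random shifts repair
# the cases where the curve separates spatially close neighbors.
points = [(3, 5), (3, 6), (10, 2)]
print(sorted(points, key=z_value))
```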
Approximate kNN Join: H-zkNNJ
• Apply zkNN for joins in MapReduce (H-zkNNJ), a partition-based algorithm.
• Partitioning policy: achieve communication and computation costs linear in the number of blocks n of each input data set.
• Partitioning by z-values: partition the input data sets Ri and Si into {Ri,1, ..., Ri,n} and {Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n−1}.
[Figure: searching a small neighborhood around the query's z-value, k = 2]
Approximate kNN Join: H-zkNNJ
• Choice of partitioning values:
  • Each block of Ri and Si shares the same boundaries, so we only search a small neighborhood and minimize communication.
  • Goal: load balance, i.e. evenly partition Ri and Si.
• Computation of partitioning values:
  • The (n − 1) quantiles of a data set D give an even partition, but sorting D to retrieve them is expensive.
  • We propose a sampling-based method to estimate the quantiles (sketched below) and prove that both estimations are close to the true ranks with high probability.
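A minimal sketch of the sampling idea (my simplification; the paper derives the exact sampling rate and probability bound): draw a small random sample of the z-values, sort the sample, and read the (n − 1) approximate quantiles off it as partition boundaries.

```python
import random

def approx_partition_values(zvals, n, sample_rate=0.01):
    """Estimate the (n - 1) quantile boundaries of zvals from a random sample."""
    sample = sorted(z for z in zvals if random.random() < sample_rate)
    sample = sample or sorted(zvals)               # fall back on tiny inputs
    return [sample[i * len(sample) // n] for i in range(1, n)]

# Each of the n z-value ranges between consecutive boundaries then holds
# roughly |zvals| / n records, giving load balance without a full sort.
```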
Approximate kNN Join: H-zkNNJ
• The H-zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
• Round 1: construct the randomly shifted copies of R and S, namely Ri and Si for i ∈ [1, α], and generate the partitioning values for each Ri and Si.
Approximate kNN Join: H-zkNNJ
• Round 2: partition each Ri and Si into blocks and compute the candidate points Ci(r) for knn(r, S), for every r ∈ R.
• Round 3: determine knn(r, C(r)) for every r ∈ R, i.e. the kNN of r within the combined candidate sets (r, Ci(r)) emitted by round 2 (see the sketch below).
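The core of rounds 2 and 3, sketched under my own simplifications: within a block, the candidates for r are its k z-predecessors and k z-successors in S (per shifted copy); round 3 keeps the k candidates with the smallest true distances across all copies.

```python
import bisect
import heapq

def z_candidates(z_r, s_sorted, k):
    """s_sorted: list of (z_value, sid, point) sorted by z_value.
    Candidates = the k predecessors and k successors of z_r on the curve."""
    pos = bisect.bisect_left(s_sorted, (z_r,))
    return s_sorted[max(0, pos - k):pos + k]

def finalize_knn(r_point, candidate_sets, k):
    """Round 3: merge the candidates from all shifted copies, keep the true kNN."""
    best = {}
    for cands in candidate_sets:
        for _z, sid, p in cands:
            d = sum((a - b) ** 2 for a, b in zip(r_point, p)) ** 0.5
            best[sid] = min(best.get(sid, d), d)
    return heapq.nsmallest(k, ((d, sid) for sid, d in best.items()))
```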
Experiments
[Figures: experimental setup, running time, and communication cost for the exact and approximate methods]
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Motivating Scenarios
• Detecting plagiarism: before publishing a journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included.
• Near-duplicate elimination: the archive of a search engine can contain multiple copies of the same page. Reasons: re-crawling, different hosts holding the same redundant copies of a page, etc.
• Efficient Parallel Set-Similarity Joins Using MapReduce [Rares Vernica et al. SIGMOD’2010]
Problem Statement
• Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) > λ.
• Solution: a similarity join.
• Some of the collections are enormous:
  • Google N-gram database: ~1 trillion records
  • GenBank: 416 GB of data
  • Facebook: 400 million active users
• Try to process this data in a parallel, distributed way => MapReduce.
Set-Similarity Join (SSJoin)
• SSJoin: a powerful primitive for supporting (string-)similarity joins.
• Input: two collections of sets, S1 = {word1, ..., wordn}, ..., Sn and T1 = {word1, ..., wordn}, ..., Tn.
• Goal: identify all pairs of highly similar sets, e.g. SSJoinpred with pred: sim(Si, Ti) > 0.3.
Set-Similarity Join
• Most SSJoin algorithms are signature-based:
  • Signatures have a filtering effect: the SSJoin algorithm compares only candidates, not all pairs.
  • Signatures ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever sim(r, s) ≥ λ.
• One possible signature scheme: prefix filtering (a sketch follows below).
  • Compute a global ordering of tokens, e.g.: Marat < W. < Safin < Rafael < Nadal < P. < Smith < John.
  • Compute the signature of each input set by taking a prefix of its tokens in this order:
    Sign({John, W., Smith}) = [W., Smith]
    Sign({Marat, Safin}) = [Marat, Safin]
    Sign({Rafael, P., Nadal}) = [Rafael, Nadal]
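A minimal sketch of prefix filtering for Jaccard similarity, assuming the standard prefix length |x| − ⌈λ·|x|⌉ + 1 (the slide leaves the prefix length abstract):

```python
import math

def signature(tokens, rank, lam):
    """Sort tokens by the global (increasing-frequency) order and keep the prefix."""
    ordered = sorted(tokens, key=lambda t: rank[t])
    prefix_len = len(ordered) - math.ceil(lam * len(ordered)) + 1
    return ordered[:prefix_len]

# Global ordering from the slide: Marat < W. < Safin < Rafael < Nadal < P. < Smith < John
rank = {t: i for i, t in enumerate(["Marat", "W.", "Safin", "Rafael", "Nadal", "P.", "Smith", "John"])}
print(signature({"John", "W.", "Smith"}, rank, lam=0.5))   # ['W.', 'Smith']
```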
Set-Similarity Join
• Filtering phase: before doing the actual SSJoin, cluster/group the candidates.
• Run the SSJoin on each cluster => less workload:
  • cluster/bucket 1: {Smith, John}, {John, W., Smith}, ...
  • cluster/bucket 2: {Safin, Marat, Michailowitsc}, ..., {Marat, Safin}
  • cluster/bucket N: {Nadal, Rafael, Parera}, {Rafael, P., Nadal}
Parallel Set-Similarity Join
• The method comprises 3 stages:
  • Stage I (Token Ordering): compute data statistics for good signatures.
  • Stage II (RID-Pair Generation): group candidates based on signatures and compute the SSJoin.
  • Stage III (Record Join): generate the actual pairs of joined records.
Stage I: Data Statistics (Basic Token Ordering)
• Creates a global ordering of the tokens in the join column, based on their frequency.
• 2 MapReduce cycles:
  • 1st: computing token frequencies
  • 2nd: ordering the tokens by their frequencies
[Figure: records with RIDs and tokens a, b, c; global ordering based on frequency]
Basic Token Ordering – 1st MapReduce cycle
• map: tokenize the join value of each record; emit each token with a count of 1.
• reduce: for each token, compute the total count (frequency).
Basic Token Ordering – 2nd MapReduce cycle
• map: interchange key and value, i.e. emit (frequency, token).
• reduce (use only 1 reducer): emit the tokens in frequency order (a condensed sketch of both cycles follows below).
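Both cycles condensed into a single-machine sketch (cycle 1 is essentially word count; cycle 2 is a sort through a single reducer):

```python
from collections import Counter

def basic_token_ordering(records):
    # Cycle 1: map emits (token, 1); reduce sums the counts per token.
    freq = Counter(tok for rec in records for tok in rec.split())
    # Cycle 2: map swaps to (count, token); the single reducer emits tokens
    # in ascending frequency order, which becomes the global ordering.
    return [tok for cnt, tok in sorted((c, t) for t, c in freq.items())]

records = ["A B C", "B C", "C"]
print(basic_token_ordering(records))   # ['A', 'B', 'C']
```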
Stage II: RID-Pair Generation – Map Phase
• Scan the input records and, for each record:
  • project it on RID and the join attribute
  • tokenize it
  • extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
  • route the tokens to the appropriate reducer
Routing: using individual tokens
• Treats each token as a key.
• For each record, generates a (key, value) pair for each of its prefix tokens (mapper sketch below).
• Example: given the global ordering "A B C", the record "A B C" has a prefix of length 2: A, B => generate/emit 2 (key, value) pairs: (A, (1, A B C)) and (B, (1, A B C)).
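The per-token routing as a mapper sketch (reusing the `signature` helper from the prefix-filter sketch above):

```python
def map_route_by_token(rid, tokens, rank, lam):
    """Emit one (token, (rid, tokens)) pair per prefix token,
    cf. the (A, (1, A B C)) / (B, (1, A B C)) example above."""
    for tok in signature(tokens, rank, lam):
        yield tok, (rid, tokens)

# Grouped-token routing (next slides) keys on token groups rather than
# individual tokens, trading replication for grouping quality.
```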
Grouping/Routing: using individual tokens
• Advantage: high-quality grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer).
• Disadvantage: high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work).
Routing: using grouped tokens
• Example: given the global ordering "A B C", the record "A B C" has a prefix of length 2: A, B. Suppose A and B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs: (X, (1, A B C)) and (Y, (1, A B C)).
Grouping/Routing: using grouped tokens
• Advantage: replication of data is not so pervasive.
• Disadvantage: the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity).
RID-Pair Generation: Reduce Phase
• This is the core of the entire method.
• Each reducer processes one or more buckets of candidates.
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate: if the similarity of the two candidates >= threshold => output their RIDs together with their similarity (sketch below).
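The reduce-side verification sketched with Jaccard as the similarity metric (one common choice; the paper supports several set-similarity functions):

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def reduce_rid_pairs(bucket, lam):
    """bucket: list of (rid, tokens) records that share at least one signature token."""
    for (rid1, t1), (rid2, t2) in combinations(bucket, 2):
        sim = jaccard(t1, t2)
        if sim >= lam:
            yield (rid1, rid2), sim
```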
Stage III: Generate pairs of joined records
• Until now we only have pairs of RIDs, but we need the actual records.
• Uses 2 MapReduce cycles:
  • 1st cycle: fills in the record information for each half of each pair
  • 2nd cycle: brings together the previously filled-in halves
Handling Insufficient Memory
• Map-based block processing: the map function replicates the blocks and interleaves them in the order they will be processed by the reducer.
• Reduce-based block processing: the map function sends each block exactly once.
Evaluation
• Cluster: 10-node IBM x3650, running Hadoop.
• Data sets: DBLP (1.2M publications), CITESEERX (1.3M publications).
• Abbreviations: BTO = basic token ordering, BK = basic kernel, BRJ = basic record join, PK = PPJoin+ kernel, OPRJ = one-phase record join.
• Best algorithm: BTO-PK-OPRJ; the most expensive stage is RID-pair generation.
• With fixed data size and varying cluster size, BTO-PK-OPRJ again gives the best time.
Conclusion
• kNN join is computation-intensive; the key is to minimize communication and computation with an effective filtering strategy that reduces the candidate pairs.
  • Parallel kNN Joins [EDBT’2012]: space-filling curve based methods ([YLK10], dubbed zkNN) reduce the n² required buckets to O(n).
  • Efficient Parallel Set-Similarity Joins [SIGMOD’2010]: the prefix-filtering principle reduces the candidate pairs.
• A good partition strategy is needed to achieve good load balance.
  • Parallel kNN Joins [EDBT’2012]: evenly partition the data set using a sampling method.
  • Efficient Parallel Set-Similarity Joins [SIGMOD’2010]: global token ordering based on frequency.
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Problem statement
• Trajectory: a trajectory T is a sequence of pairs (typically a sampled location with its timestamp).
• Trajectory join: given two sets of trajectories R and S and a threshold ε, the result of the trajectory join query is the subset V of pairs (Tr, Ts), Tr ∈ R, Ts ∈ S, such that the distance D(Tr, Ts) ≤ ε for any pair in V and a given user-defined distance function D.
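The distance function D is left user-defined; as one concrete assumed choice, here is a sketch of the average pointwise Euclidean distance between two trajectories resampled to the same length:

```python
def traj_dist(T1, T2):
    """Average pointwise Euclidean distance; assumes both trajectories are
    resampled to the same length (one of many possible choices for D)."""
    assert len(T1) == len(T2)
    step = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return sum(step(p, q) for p, q in zip(T1, T2)) / len(T1)

T1 = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
T2 = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
print(traj_dist(T1, T2))   # 1.0
```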
Solutions
• Naïve approaches:
  • Block nested loop join (BNLJ) based method
  • Block nested loop join + sliding window
[Figure: R and S partitioned into blocks R1, R2 and S1, S2; BNLJ run on every (Ri, Sj) pair, as in the kNN join above]
Improved approaches
• Symbolic representation for trajectories based on the Piecewise Aggregate Approximation (PAA) technique (a sketch of PAA follows below).
• Challenge: data skew.
• Solutions:
  • Use hierarchical PAA to filter the candidate pairs recursively.
  • Divide dense PAA cells into sub-partitions.
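PAA itself is simple: split a sequence into w equal segments and keep each segment's mean. A minimal sketch for one coordinate of a trajectory; trajectories whose PAA representations are already far apart can be pruned before any exact distance computation:

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: the w segment means of the input sequence."""
    n = len(series)
    bounds = [i * n // w for i in range(w + 1)]
    return [sum(series[lo:hi]) / (hi - lo) for lo, hi in zip(bounds, bounds[1:])]

print(paa([1, 2, 3, 4, 5, 6, 7, 8], w=4))   # [1.5, 3.5, 5.5, 7.5]
```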
Thank you! Questions?