Join Using MapReduce • Cloud Group, WAMDM • Youzhong MA • May 11, 2012
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Overview about Join Using MapReduce • Basic Join • Complex Join
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Introduction
• k nearest neighbor join (kNN join): given two data sets R and S, for every point q in R, the kNN join returns the k nearest points of q from S.
• Example: a 3-NN join for q returns the pairs (q, p1), (q, p3), (q, p4).
• Applications: data mining, spatial databases, etc.
• Efficient Parallel kNN Joins for Large Data in MapReduce [Chi Zhang et al. EDBT’2012]
Introduction
• Exact kNN join:
  • knn(r, S) = the set of k nearest neighbors of r from S.
  • knnJ(R, S) = {(r, knn(r, S)) | ∀r ∈ R}.
• Approximate kNN join:
  • aknn(r, S) = an approximate kNN set of r from S.
  • Let p be the k-th NN of r in knn(r, S) and p′ the k-th NN of r in aknn(r, S); aknn(r, S) is a c-approximation of knn(r, S) if d(r, p) ≤ d(r, p′) ≤ c · d(r, p).
  • aknnJ(R, S) = {(r, aknn(r, S)) | ∀r ∈ R}.
Exact kNN join: Block Nested Loop Join
• Block nested loop join (BNLJ) based method (a sketch follows below):
  • Partition R and S, each into n equal-sized disjoint blocks.
  • Perform BNLJ for each of the n² possible (Ri, Sj) block pairs.
  • For every record in R, merge the n local kNN results into its global kNN result.
[Figure: R split into R1, R2 and S into S1, S2; BNLJ is run on each of the four (Ri, Sj) pairs]
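To make the scheme concrete, here is a minimal single-machine sketch of the BNLJ-based kNN join in Python (my own illustration, not the paper's Hadoop code): `local_knn` is the per-block-pair brute-force step, and the merge keeps the k globally best neighbors per record.

```python
import heapq
from itertools import product

def local_knn(r_block, s_block, k):
    """Brute-force local kNN: for each r in the R block, the k closest points in the S block."""
    out = {}
    for rid, r in r_block:
        dists = [(sum((a - b) ** 2 for a, b in zip(r, s)) ** 0.5, sid) for sid, s in s_block]
        out[rid] = heapq.nsmallest(k, dists)
    return out

def bnlj_knn_join(R, S, n, k):
    """Partition R and S into n blocks each, run local kNN on all n*n block pairs, merge."""
    r_blocks = [R[i::n] for i in range(n)]   # n roughly equal-sized disjoint blocks
    s_blocks = [S[i::n] for i in range(n)]
    global_knn = {}
    for r_blk, s_blk in product(r_blocks, s_blocks):
        for rid, local in local_knn(r_blk, s_blk, k).items():
            merged = global_knn.get(rid, []) + local
            global_knn[rid] = heapq.nsmallest(k, merged)   # keep the global k best
    return global_knn

# R and S: lists of (id, point) with points as coordinate tuples.
R = [("r1", (0.0, 0.0)), ("r2", (5.0, 5.0))]
S = [("s1", (1.0, 0.0)), ("s2", (4.0, 5.0)), ("s3", (9.0, 9.0))]
print(bnlj_knn_join(R, S, n=2, k=2))
```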
Exact kNN join: Block Nested Loop Join
• Two-round MapReduce algorithm, Round 1: map each record of R and S to the buckets of all the block pairs it participates in; each reducer performs the local BNLJ kNN for its (Ri, Sj) pair.
Exact kNN join: Block Nested Loop Join
• Two-round MapReduce algorithm, Round 2: for each r ∈ R, collect the n local results, e.g. from BNLJ(R1, S1) and BNLJ(R1, S2) for r ∈ R1, and merge them into the global knn(r, S).
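In MapReduce terms, round 1 could look like the following hedged sketch (Hadoop-Streaming-style Python, my reconstruction from the block-pair description above): each R record is replicated to n buckets, each S record likewise, and every reducer runs one local BNLJ.

```python
# Round 1 mapper: a record of block Ri of R goes to buckets (i, 0..n-1);
# a record of block Sj of S goes to buckets (0..n-1, j).
def map_round1(record, n):
    origin, block_id, rec_id, point = record        # origin is 'R' or 'S'
    if origin == 'R':
        for j in range(n):
            yield (block_id, j), (origin, rec_id, point)
    else:
        for i in range(n):
            yield (i, block_id), (origin, rec_id, point)

# Round 1 reducer: one bucket = one (Ri, Sj) pair; run the local BNLJ kNN.
def reduce_round1(bucket_key, values, k):
    r_recs = [(rid, p) for o, rid, p in values if o == 'R']
    s_recs = [(sid, p) for o, sid, p in values if o == 'S']
    for rid, local in local_knn(r_recs, s_recs, k).items():  # local_knn: see sketch above
        yield rid, local

# Round 2 then groups the local lists by rid and keeps the k smallest distances.
```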
Exact kNN join: Block R-tree Join
• Use a spatial index (R-tree) to improve performance:
  • Build an R-tree index over the S block of each bucket to speed up the kNN computations (a sketch follows below).
  • Otherwise identical to the BNLJ algorithm: simply replace BNLJ with block R-tree join (BRJ) in the first round.
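For BRJ, the per-bucket step can use an off-the-shelf R-tree. A sketch using the third-party Python `rtree` package (an assumption for illustration; the paper's implementation runs inside Hadoop):

```python
from rtree import index  # third-party wrapper around libspatialindex

def local_knn_rtree(r_block, s_block, k):
    """BRJ local step: index the S block once, then answer each kNN query from the tree."""
    idx = index.Index()
    for i, (sid, (x, y)) in enumerate(s_block):
        idx.insert(i, (x, y, x, y), obj=sid)   # points stored as degenerate rectangles
    return {rid: [hit.object for hit in idx.nearest((x, y, x, y), k, objects=True)]
            for rid, (x, y) in r_block}
```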
Approximate kNN Join
• Problems with the exact kNN join solution:
  • Too much communication and computation (n² buckets required).
• Goal: a solution requiring only O(n) buckets, so we settle for approximate answers.
• Approach: space-filling curve based methods ([YLK10], dubbed zkNN).
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and kNN-joins in large relational databases (almost) for free. ICDE, 2010.
Approximate kNN Join: Z-order kNN join
• The idea of zkNN (a sketch of the Z-value transform follows below):
  • Transform d-dimensional points to 1-D values using the Z-value.
  • Map a d-dimensional kNN join query to 1-D range queries.
  • Use multiple randomly shifted copies of the data to improve spatial locality.
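A minimal sketch of the Z-value transform (assuming non-negative integer coordinates that fit in a fixed number of bits): interleaving the coordinate bits maps nearby points to nearby 1-D values, so kNN becomes a short scan over the sorted Z-values.

```python
def z_value(point, bits=16):
    """Interleave the bits of each coordinate to get the 1-D Z-order value."""
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * len(point) + dim)
    return z

# Points close in space tend to get close Z-values; the random shifts repair
# the cases where the curve separates spatially close neighbors.
points = [(3, 5), (3, 6), (10, 2)]
print(sorted(points, key=z_value))
```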
Approximate kNN Join: H-zkNNJ
• Apply zkNN for joins in MapReduce (H-zkNNJ), a partition-based algorithm.
• Partitioning policy: achieve communication and computation costs linear in the number of blocks n of each input data set.
• Partitioning by z-values: partition the input data sets Ri and Si into {Ri,1, ..., Ri,n} and {Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n−1}.
[Figure: searching a small neighborhood around the query's z-value, k = 2]
Approximate kNN Join: H-zkNNJ
• Choice of partitioning values:
  • Each block of Ri and Si shares the same boundaries, so we only search a small neighborhood and minimize communication.
  • Goal: load balance, i.e. evenly partition Ri and Si.
• Computation of partitioning values:
  • The (n − 1) quantiles of a data set D give an even partition, but sorting D to retrieve them is expensive.
  • We propose a sampling-based method to estimate the quantiles (sketched below) and prove that both estimations are close to the true ranks with high probability.
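A minimal sketch of the sampling idea (my simplification; the paper derives the exact sampling rate and probability bound): draw a small random sample of the z-values, sort the sample, and read the (n − 1) approximate quantiles off it as partition boundaries.

```python
import random

def approx_partition_values(zvals, n, sample_rate=0.01):
    """Estimate the (n - 1) quantile boundaries of zvals from a random sample."""
    sample = sorted(z for z in zvals if random.random() < sample_rate)
    sample = sample or sorted(zvals)               # fall back on tiny inputs
    return [sample[i * len(sample) // n] for i in range(1, n)]

# Each of the n z-value ranges between consecutive boundaries then holds
# roughly |zvals| / n records, giving load balance without a full sort.
```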
Approximate kNN Join: H-zkNNJ
• The H-zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
• Round 1: construct the randomly shifted copies of R and S, namely Ri and Si for i ∈ [1, α], and generate the partitioning values for each Ri and Si.
Approximate kNN Join: H-zkNNJ
• Round 2: partition each Ri and Si into blocks and compute the candidate points Ci(r) for knn(r, S), for every r ∈ R.
• Round 3: determine knn(r, C(r)) for every r ∈ R, i.e. the kNN of r within the combined candidate sets (r, Ci(r)) emitted by round 2 (see the sketch below).
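The core of rounds 2 and 3, sketched under my own simplifications: within a block, the candidates for r are its k z-predecessors and k z-successors in S (per shifted copy); round 3 keeps the k candidates with the smallest true distances across all copies.

```python
import bisect
import heapq

def z_candidates(z_r, s_sorted, k):
    """s_sorted: list of (z_value, sid, point) sorted by z_value.
    Candidates = the k predecessors and k successors of z_r on the curve."""
    pos = bisect.bisect_left(s_sorted, (z_r,))
    return s_sorted[max(0, pos - k):pos + k]

def finalize_knn(r_point, candidate_sets, k):
    """Round 3: merge the candidates from all shifted copies, keep the true kNN."""
    best = {}
    for cands in candidate_sets:
        for _z, sid, p in cands:
            d = sum((a - b) ** 2 for a, b in zip(r_point, p)) ** 0.5
            best[sid] = min(best.get(sid, d), d)
    return heapq.nsmallest(k, ((d, sid) for sid, d in best.items()))
```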
Experiments
[Figures: experimental setup, running time, and communication cost for the exact and approximate methods]
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Motivating Scenarios
• Detecting plagiarism: before publishing a journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included.
• Near-duplicate elimination: the archive of a search engine can contain multiple copies of the same page. Reasons: re-crawling, different hosts holding the same redundant copies of a page, etc.
• Efficient Parallel Set-Similarity Joins Using MapReduce [Rares Vernica et al. SIGMOD’2010]
Problem Statement
• Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) > λ.
• Solution: a similarity join.
• Some of the collections are enormous:
  • Google N-gram database: ~1 trillion records
  • GenBank: 416 GB of data
  • Facebook: 400 million active users
• Try to process this data in a parallel, distributed way => MapReduce.
Set-Similarity Join (SSJoin)
• SSJoin: a powerful primitive for supporting (string-)similarity joins.
• Input: two collections of sets, S1 = {word1, ..., wordn}, ..., Sn and T1 = {word1, ..., wordn}, ..., Tn.
• Goal: identify all pairs of highly similar sets, e.g. SSJoinpred with pred: sim(Si, Ti) > 0.3.
Set-Similarity Join
• Most SSJoin algorithms are signature-based:
  • Signatures have a filtering effect: the SSJoin algorithm compares only candidates, not all pairs.
  • Signatures ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever sim(r, s) ≥ λ.
• One possible signature scheme: prefix filtering (a sketch follows below).
  • Compute a global ordering of tokens, e.g.: Marat < W. < Safin < Rafael < Nadal < P. < Smith < John.
  • Compute the signature of each input set by taking a prefix of its tokens in this order:
    Sign({John, W., Smith}) = [W., Smith]
    Sign({Marat, Safin}) = [Marat, Safin]
    Sign({Rafael, P., Nadal}) = [Rafael, Nadal]
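A minimal sketch of prefix filtering for Jaccard similarity, assuming the standard prefix length |x| − ⌈λ·|x|⌉ + 1 (the slide leaves the prefix length abstract):

```python
import math

def signature(tokens, rank, lam):
    """Sort tokens by the global (increasing-frequency) order and keep the prefix."""
    ordered = sorted(tokens, key=lambda t: rank[t])
    prefix_len = len(ordered) - math.ceil(lam * len(ordered)) + 1
    return ordered[:prefix_len]

# Global ordering from the slide: Marat < W. < Safin < Rafael < Nadal < P. < Smith < John
rank = {t: i for i, t in enumerate(["Marat", "W.", "Safin", "Rafael", "Nadal", "P.", "Smith", "John"])}
print(signature({"John", "W.", "Smith"}, rank, lam=0.5))   # ['W.', 'Smith']
```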
Set-Similarity Join
• Filtering phase: before doing the actual SSJoin, cluster/group the candidates.
• Run the SSJoin on each cluster => less workload:
  • cluster/bucket 1: {Smith, John}, {John, W., Smith}, ...
  • cluster/bucket 2: {Safin, Marat, Michailowitsc}, ..., {Marat, Safin}
  • cluster/bucket N: {Nadal, Rafael, Parera}, {Rafael, P., Nadal}
Parallel Set-Similarity Join
• The method comprises 3 stages:
  • Stage I (Token Ordering): compute data statistics for good signatures.
  • Stage II (RID-Pair Generation): group candidates based on signatures and compute the SSJoin.
  • Stage III (Record Join): generate the actual pairs of joined records.
Stage I: Data Statistics (Basic Token Ordering)
• Creates a global ordering of the tokens in the join column, based on their frequency.
• 2 MapReduce cycles:
  • 1st: computing token frequencies
  • 2nd: ordering the tokens by their frequencies
[Figure: records with RIDs and tokens a, b, c; global ordering based on frequency]
Basic Token Ordering – 1st MapReduce cycle
• map: tokenize the join value of each record; emit each token with a count of 1.
• reduce: for each token, compute the total count (frequency).
Basic Token Ordering – 2nd MapReduce cycle
• map: interchange key and value, i.e. emit (frequency, token).
• reduce (use only 1 reducer): emit the tokens in frequency order (a condensed sketch of both cycles follows below).
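Both cycles condensed into a single-machine sketch (cycle 1 is essentially word count; cycle 2 is a sort through a single reducer):

```python
from collections import Counter

def basic_token_ordering(records):
    # Cycle 1: map emits (token, 1); reduce sums the counts per token.
    freq = Counter(tok for rec in records for tok in rec.split())
    # Cycle 2: map swaps to (count, token); the single reducer emits tokens
    # in ascending frequency order, which becomes the global ordering.
    return [tok for cnt, tok in sorted((c, t) for t, c in freq.items())]

records = ["A B C", "B C", "C"]
print(basic_token_ordering(records))   # ['A', 'B', 'C']
```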
Stage II: RID-Pair Generation – Map Phase
• Scan the input records and, for each record:
  • project it on RID and the join attribute
  • tokenize it
  • extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
  • route the tokens to the appropriate reducer
Routing: using individual tokens
• Treats each token as a key.
• For each record, generates a (key, value) pair for each of its prefix tokens (mapper sketch below).
• Example: given the global ordering "A B C", the record "A B C" has a prefix of length 2: A, B => generate/emit 2 (key, value) pairs: (A, (1, A B C)) and (B, (1, A B C)).
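The per-token routing as a mapper sketch (reusing the `signature` helper from the prefix-filter sketch above):

```python
def map_route_by_token(rid, tokens, rank, lam):
    """Emit one (token, (rid, tokens)) pair per prefix token,
    cf. the (A, (1, A B C)) / (B, (1, A B C)) example above."""
    for tok in signature(tokens, rank, lam):
        yield tok, (rid, tokens)

# Grouped-token routing (next slides) keys on token groups rather than
# individual tokens, trading replication for grouping quality.
```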
Grouping/Routing: using individual tokens
• Advantage: high-quality grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer).
• Disadvantage: high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work).
Routing: using grouped tokens
• Example: given the global ordering "A B C", the record "A B C" has a prefix of length 2: A, B. Suppose A and B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs: (X, (1, A B C)) and (Y, (1, A B C)).
Grouping/Routing: using grouped tokens
• Advantage: replication of data is not so pervasive.
• Disadvantage: the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity).
RID-Pair Generation: Reduce Phase
• This is the core of the entire method.
• Each reducer processes one or more buckets of candidates.
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate: if the similarity of the two candidates >= threshold => output their RIDs together with their similarity (sketch below).
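The reduce-side verification sketched with Jaccard as the similarity metric (one common choice; the paper supports several set-similarity functions):

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def reduce_rid_pairs(bucket, lam):
    """bucket: list of (rid, tokens) records that share at least one signature token."""
    for (rid1, t1), (rid2, t2) in combinations(bucket, 2):
        sim = jaccard(t1, t2)
        if sim >= lam:
            yield (rid1, rid2), sim
```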
Stage III: Generate pairs of joined records
• Until now we only have pairs of RIDs, but we need the actual records.
• Uses 2 MapReduce cycles:
  • 1st cycle: fills in the record information for each half of each pair
  • 2nd cycle: brings together the previously filled-in halves
Handling Insufficient Memory
• Map-based block processing: the map function replicates the blocks and interleaves them in the order they will be processed by the reducer.
• Reduce-based block processing: the map function sends each block exactly once.
Evaluation
• Cluster: 10-node IBM x3650, running Hadoop.
• Data sets: DBLP (1.2M publications), CITESEERX (1.3M publications).
• Abbreviations: BTO = basic token ordering, BK = basic kernel, BRJ = basic record join, PK = PPJoin+ kernel, OPRJ = one-phase record join.
• Best algorithm: BTO-PK-OPRJ; the most expensive stage is RID-pair generation.
• With fixed data size and varying cluster size, BTO-PK-OPRJ again gives the best time.
Conclusion
• kNN join is computation-intensive; the key is to minimize communication and computation with an effective filtering strategy that reduces the candidate pairs.
  • Parallel kNN Joins [EDBT’2012]: space-filling curve based methods ([YLK10], dubbed zkNN) reduce the n² required buckets to O(n).
  • Efficient Parallel Set-Similarity Joins [SIGMOD’2010]: the prefix-filtering principle reduces the candidate pairs.
• A good partition strategy is needed to achieve good load balance.
  • Parallel kNN Joins [EDBT’2012]: evenly partition the data set using a sampling method.
  • Efficient Parallel Set-Similarity Joins [SIGMOD’2010]: global token ordering based on frequency.
Outline • Overview about Join Using MapReduce • Details • Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT’2012] • Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD’2010] • Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE’2012] • Conclusion • Trajectory Similarity Join Using MapReduce
Problem statement
• Trajectory: a trajectory T is a sequence of pairs (typically a sampled location with its timestamp).
• Trajectory join: given two sets of trajectories R and S and a threshold ε, the result of the trajectory join query is the subset V of pairs (Tr, Ts), Tr ∈ R, Ts ∈ S, such that the distance D(Tr, Ts) ≤ ε for any pair in V and a given user-defined distance function D.
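The distance function D is left user-defined; as one concrete assumed choice, here is a sketch of the average pointwise Euclidean distance between two trajectories resampled to the same length:

```python
def traj_dist(T1, T2):
    """Average pointwise Euclidean distance; assumes both trajectories are
    resampled to the same length (one of many possible choices for D)."""
    assert len(T1) == len(T2)
    step = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return sum(step(p, q) for p, q in zip(T1, T2)) / len(T1)

T1 = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
T2 = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
print(traj_dist(T1, T2))   # 1.0
```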
Solutions
• Naïve approaches:
  • Block nested loop join (BNLJ) based method
  • Block nested loop join + sliding window
[Figure: R and S partitioned into blocks R1, R2 and S1, S2; BNLJ run on every (Ri, Sj) pair, as in the kNN join above]
Improved approaches
• Symbolic representation for trajectories based on the Piecewise Aggregate Approximation (PAA) technique (a sketch of PAA follows below).
• Challenge: data skew.
• Solutions:
  • Use hierarchical PAA to filter the candidate pairs recursively.
  • Divide dense PAA cells into sub-partitions.
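PAA itself is simple: split a sequence into w equal segments and keep each segment's mean. A minimal sketch for one coordinate of a trajectory; trajectories whose PAA representations are already far apart can be pruned before any exact distance computation:

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: the w segment means of the input sequence."""
    n = len(series)
    bounds = [i * n // w for i in range(w + 1)]
    return [sum(series[lo:hi]) / (hi - lo) for lo, hi in zip(bounds, bounds[1:])]

print(paa([1, 2, 3, 4, 5, 6, 7, 8], w=4))   # [1.5, 3.5, 5.5, 7.5]
```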
Thank you! Questions?