320 likes | 509 Views
Similarity join problem with Pass-Join-K using Hadoop. ---BY Yu Haiyang. Outline. Background The introduction of Pass-Join-K Combining Pass-Join-K with Hadoop. Background. Similarity join: Find all similar pairs from two sets. Data Cleaning. Query Relaxation Spellchecking.
E N D
Similarity join problem with Pass-Join-K using Hadoop ---BY Yu Haiyang
Outline • Background • The introduction of Pass-Join-K • Combining Pass-Join-K with Hadoop http://datamining.xmu.edu.cn
Background • Similarity join: Find all similar pairs from two sets. • Data Cleaning. • Query Relaxation • Spellchecking “PO BOX 23, Main St.” “P.O. Box 23, Main St” “imformation” “information” http://datamining.xmu.edu.cn
Background How to define similarity? Jaccard distance Cosine distance Edit distance 2014/11/13 http://datamining.xmu.edu.cn 4/32
Background Edit distance The minimum number of edit operations (insertion, deletion, and substitution) to transform one string to another. Insertion Bod Body Substitution Baby Body 2014/11/13 http://datamining.xmu.edu.cn 5/32
Background How does the edit distance compare with other two? Accuracy: {“abcdefg”,”gfedcba”} Verification time: O(mn) -> O(m+n) 2014/11/13 http://datamining.xmu.edu.cn 6/32
Background • Find similar pairs • We have two string sets ,one is {vldb,sigmod,….} ,the other is {pvldb,icde,…}. • Find some candidate pairs , and then verify these pairs. {<vldb,pvldb>,<vldb,icde>,<vldb,..>,<sigmod,pvldb>,<sigmod,icde>,….} <vldb,pvldb> Yes <vldb,icde> No http://datamining.xmu.edu.cn
Background So we have to: Finding candidate pairs. There are O(N2) if we do not prune some pairs. verifying these pairs. O(mn) 2014/11/13 http://datamining.xmu.edu.cn 8/32
Introduction of Pass-Join-K • Some obvious pruning techniques • Length –based: threshold = 2,<“ab”,”abcee”> • Shift-based: <“abcd”,”cdef”> http://datamining.xmu.edu.cn
Introduction of Pass-Join-K Partition-based pruning technique We suppose the threshold tau = 2, K=2and we have a pair <“abcdefghijk”,”abdefghk”> 2014/11/13 http://datamining.xmu.edu.cn 10/32
Introduction of Pass-Join-K Partition Scheme We have seen that the longer the substrings are, the harder they could be marched. So we break the string into tau+k parts and each part while its length equals length/(tau+k) or length/(tau+k)+1. 2014/11/13 http://datamining.xmu.edu.cn 11/32
Introduction of Pass-Join-K Partition Scheme So we break the string into tau+k parts and each part while its length equals length/(tau+k) or length/(tau+k)+1. 2014/11/13 http://datamining.xmu.edu.cn 12/32
Introduction of Pass-Join-K Partition Scheme r = “abcdefghijk” s = “abdefghk” def L11 1 3 4 2 r r r r 2014/11/13 http://datamining.xmu.edu.cn 13/32
Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; a b d e f g h k 2014/11/13 http://datamining.xmu.edu.cn 14/32
Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; 2014/11/13 http://datamining.xmu.edu.cn 15/32
Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; 2014/11/13 http://datamining.xmu.edu.cn 16/32
Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; 2014/11/13 http://datamining.xmu.edu.cn 17/32
Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; a b d e f g h k 2014/11/13 http://datamining.xmu.edu.cn 18/32
Introduction of Pass-Join-K Substring Selection So what we do is to deduce the number of substrings. More pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 2014/11/13 http://datamining.xmu.edu.cn 19/32
Introduction of Pass-Join-K Verification DP( Dynamic programming) D(m,n)=max(D(m,n-1)+1,D(m-1,n)+1,D(m-1,n-1)+flag) where flag = 1 when sm=rn , s and r are both strings. 2014/11/13 http://datamining.xmu.edu.cn 20/32
Introduction of Pass-Join-K Verification Here we suppose tau = 3 and k = 1; Tauleft = 3 Tauright = 3-3=0 2014/11/13 http://datamining.xmu.edu.cn 21/32
Combining Pass-Join-K with Hadoop Inverted index tree in hadoop (abc, 1, 11,r) (def,2,11,r) (ghi,3,11,r) (jk,4,11,r) L11 1 3 4 2 r r r r 2014/11/13 http://datamining.xmu.edu.cn 22/32
Combining Pass-Join-K with Hadoop Substrings in hadoop Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate some records such as (a,1,5,s),(a,2,6,s)(a,3,7,s),(ab,1,8,s),…,(ab,1,11,s),… 2014/11/13 http://datamining.xmu.edu.cn 23/32
Combining Pass-Join-K with Hadoop Substrings in hadoop Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate more than 2*tau*(tau+k)*m records where m is the average number that substring for each segment, such as (a,1,5,s),(a,1,6,s)(a,1,7,s),(ab,1,8,s),…,(ab,1,11,s),… 2014/11/13 http://datamining.xmu.edu.cn 24/32
Combining Pass-Join-K with Hadoop Data flows in hadoop 2014/11/13 http://datamining.xmu.edu.cn 25/32
Combining Pass-Join-K with Hadoop How to improve the performance ? We have known that as k increased , the pairs we need to verity would be decrease. As k increased, more than (tau+k+1)/(tau+k) records should be translated in Mapper phase. 2014/11/13 http://datamining.xmu.edu.cn 26/32
Combining Pass-Join-K with Hadoop Here we have 2 ways to improve our algorithm. Finding a dataset that the candidate pairs number are large enough or making tau are large enough. Decreasing the data which were generated in Mapper phase. 2014/11/13 http://datamining.xmu.edu.cn 27/32
Combining Pass-Join-K with Hadoop Decrease the data flows 2014/11/13 http://datamining.xmu.edu.cn 28/32
Combining Pass-Join-K with Hadoop Decrease the data flows The inverted index record was formulated as (substring,segmentNumber, LengthInf, Id, flag) Each record’s length is length(substring)+4*sizeof(int), and substring sometimes could be so long. Hash(substring) -> integer, then record length is 5*sizeof(int) 2014/11/13 http://datamining.xmu.edu.cn 29/32
Combining Pass-Join-K with Hadoop Decrease the data flows The substring would generate some similar records such as (a,1,5,s),(a,1,6,s)(a,1,7,s)… Each substring would generate tau+k similar segments, so we combine them as ,for example, (a,1,5,7,s). So we make the (tau+k)*4*sizeof(int) to 5*sizeof(int). 2014/11/13 http://datamining.xmu.edu.cn 30/32
Combining Pass-Join-K with Hadoop Decrease the data flows So by using two steps we have seen before, we have reduced the (length(substring)+4*sizeof(int))*(tau+k) to 5 times sizeof(int) 2014/11/13 http://datamining.xmu.edu.cn 31/32
Email: yhycai@gmail.com Thanks for patience http://datamining.xmu.edu.cn