Large-scale Similarity Join with Edit-distance Constraints

By Yu Haiyang

Outline: Background; Introduction of Pass-Join-K; Combining Pass-Join-K with Hadoop. (2014/10/21, http://datamining.xmu.edu.cn)

Background: Similarity join. Find all similar pairs from two sets. Applications: data cleaning, query relaxation, spellchecking.

Background: How to define similarity? Jaccard distance (bag-of-words model), cosine distance, edit distance.

Background: Edit distance. The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another. Insertion: Bod -> Body. Substitution: Baby -> Body.

Background: How does edit distance compare with the other two? Accuracy: edit distance is sensitive to character order, as in {"abcdefg", "gfedcba"}, two strings with the same characters in reverse order. Verification time: rises from O(m+n) to O(mn).
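To make the accuracy point concrete, here is a minimal sketch that scores the pair from the slide with character-level Jaccard and cosine similarity. Treating each string as a bag of characters is an assumption made for illustration (the slides do not say which tokens are used), and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over character-count vectors."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[ch] * count_b[ch] for ch in count_a)
    norm_a = sqrt(sum(v * v for v in count_a.values()))
    norm_b = sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The pair from the slide: same characters, opposite order.
pair = ("abcdefg", "gfedcba")
print(jaccard_similarity(*pair))
print(cosine_similarity(*pair))
```

Both measures rate the two strings as fully similar (up to floating-point rounding for the cosine), because neither looks at character order. Edit distance does, which is the accuracy advantage the slide points to.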

Background: Find similar pairs. We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}. Find candidate pairs first, then verify them. Candidates: {<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, …}. Verification: <vldb,pvldb> yes; <vldb,icde> no.

Background: So we have to (1) find candidate pairs, of which there are O(N^2) unless some are pruned, and (2) verify these pairs, at O(mn) per pair.

Introduction of Pass-Join-K: Partition-based pruning technique. Suppose the threshold tau = 2 and K = 1, and we have the pair <"abcde", "ace">.

Introduction of Pass-Join-K: Partition-based pruning technique. Suppose the threshold tau = 2 and K = 2, and we have the pair <"abcdefghijk", "abdefghk">.
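The idea behind both examples can be sketched as a pigeonhole filter: if one string is cut into tau + K segments and the pair is within tau edits, at least K segments survive unchanged and must occur in the other string. The code below is only an illustration under assumptions: the helper names are hypothetical, the longer segments are placed first, and a plain substring test stands in for the position-aware matching a real implementation would use.

```python
def partition(r: str, parts: int) -> list[str]:
    """Split r into `parts` contiguous segments whose lengths differ by at most 1.

    Assumption: the longer segments come first.
    """
    base, extra = divmod(len(r), parts)
    segments, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        segments.append(r[start:start + length])
        start += length
    return segments


def matching_segments(r: str, s: str, tau: int, k: int) -> int:
    """Count the segments of r that occur somewhere in s.

    An empty segment (when r is shorter than tau + k) counts as a match,
    which keeps the filter from discarding valid pairs.
    """
    return sum(1 for seg in partition(r, tau + k) if seg in s)


def is_candidate(r: str, s: str, tau: int, k: int) -> bool:
    """Keep the pair only if at least k segments of r occur in s."""
    return matching_segments(r, s, tau, k) >= k


print(is_candidate("abcde", "ace", tau=2, k=1))
print(is_candidate("abcdefghijk", "abdefghk", tau=2, k=1))
print(is_candidate("abcdefghijk", "abdefghk", tau=2, k=2))
```

Under these assumptions the second pair passes with K = 1 (the segment "def" matches) but is discarded with K = 2, since only one of the four segments occurs in "abdefghk". Requiring more matching segments removes more candidates before verification.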

Introduction of Pass-Join-K: Some obvious pruning techniques. Length-based: threshold = 2, <"ab", "abcee">. Shift-based: <"abcd", "cdef">.
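Both filters fit in a few lines. This is a hedged sketch: the function names are hypothetical, and the shift check reflects one common reading of shift-based pruning (a shared substring whose start positions differ by more than tau cannot be aligned within tau edits). The slide gives no threshold for the shift example, so the value 1 used below is an assumption.

```python
def length_prune(s: str, r: str, tau: int) -> bool:
    """True if the pair can be discarded: the lengths differ by more than tau."""
    return abs(len(s) - len(r)) > tau


def shift_prune(pos_in_s: int, pos_in_r: int, tau: int) -> bool:
    """True if a shared substring can be ignored: its start positions
    in the two strings are more than tau apart."""
    return abs(pos_in_s - pos_in_r) > tau


# Length-based example from the slide: lengths 2 and 5, threshold 2.
print(length_prune("ab", "abcee", tau=2))

# Shift-based example: "cd" starts at index 2 in "abcd" and index 0 in "cdef".
print(shift_prune("abcd".find("cd"), "cdef".find("cd"), tau=1))
```

The length filter discards a whole pair, whereas the shift filter only discards one particular substring match; the pair may still qualify through another match.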

Introduction of Pass-Join-K: Partition Scheme. We have seen that the longer the substrings are, the harder they are to match. So we break the string into tau+k parts as evenly as possible: each part has length floor(length/(tau+k)) or floor(length/(tau+k))+1.
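A small sketch of the even partition, returning where each segment starts and how long it is. The function name is hypothetical, and the choice to give the extra character to the first segments rather than the last is an assumption; the text above only fixes the two possible lengths.

```python
def segment_layout(length: int, tau: int, k: int) -> list[tuple[int, int]]:
    """Return (start, segment_length) for each of the tau + k segments.

    Starts are 0-based. Assumption: segments that get the extra
    character come first.
    """
    parts = tau + k
    base, extra = divmod(length, parts)
    layout, start = [], 0
    for i in range(parts):
        seg_len = base + (1 if i < extra else 0)
        layout.append((start, seg_len))
        start += seg_len
    return layout


print(segment_layout(11, tau=3, k=1))
print(segment_layout(8, tau=3, k=1))
```

For a string of length 11 with tau = 3 and k = 1 this yields three segments of length 3 and one of length 2; for length 8 it yields four segments of length 2.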

Introduction of Pass-Join-K: Partition Scheme. [Figure: partition scheme]

Introduction of Pass-Join-K: Substring Selection. Here we suppose tau = 3 and k = 1, with the example string "abdefghk".

Introduction of Pass-Join-K: Substring Selection. So the goal is to reduce the number of substrings. For more pruning techniques, please read our paper: "Pass-Join-K: A Similarity Join Algorithm with Multi-Segment Matching" (《Pass-Join-K多分段匹配的相似性连接算法》).

Introduction of Pass-Join-K: Verification. DP (dynamic programming): D(m,n) = min(D(m,n-1)+1, D(m-1,n)+1, D(m-1,n-1)+flag), where flag = 0 when s_m = r_n and flag = 1 otherwise, with base cases D(i,0) = i and D(0,j) = j. Here s and r are both strings, and s_m and r_n are their m-th and n-th characters.
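The recurrence translates directly into a table-filling routine. This is a plain textbook sketch of the O(mn) dynamic program, not the optimized verification used in Pass-Join-K; the function name is chosen for illustration.

```python
def edit_distance(s: str, r: str) -> int:
    """Edit distance between s and r via the standard O(mn) dynamic program."""
    m, n = len(s), len(r)
    # d[i][j] is the distance between the first i chars of s and the first j of r.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            d[i][j] = min(
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j] + 1,         # deletion
                d[i - 1][j - 1] + flag,  # match or substitution
            )
    return d[m][n]


print(edit_distance("vldb", "pvldb"))
print(edit_distance("abcdefghijk", "abdefghk"))
```

A pair is reported as similar when the returned distance is at most tau.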

Introduction of Pass-Join-K: Verification. Here we suppose tau = 3 and k = 1: tau_left = 3, tau_right = 3 - 3 = 0.
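One possible reading of these numbers is that verification is split around a matched segment: the part left of the match is checked against the full threshold, and whatever distance it uses up is subtracted from the budget for the right part. The sketch below follows that reading only; it is an assumption, not a statement of the authors' exact procedure, and all names and the example strings are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            flag = 0 if ca == cb else 1
            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + flag))
        prev = curr
    return prev[-1]


def verify_around_match(s: str, r: str, s_pos: int, r_pos: int,
                        seg_len: int, tau: int) -> bool:
    """Verify a pair given a segment of length seg_len that matches at
    s[s_pos:] and r[r_pos:] (0-based positions)."""
    d_left = edit_distance(s[:s_pos], r[:r_pos])
    if d_left > tau:
        return False  # left part alone exceeds the threshold
    tau_right = tau - d_left  # budget remaining for the right part
    d_right = edit_distance(s[s_pos + seg_len:], r[r_pos + seg_len:])
    return d_right <= tau_right


# "def" starts at index 2 in s and index 3 in r.
s, r = "abdefghk", "abcdefghijk"
print(verify_around_match(s, r, s_pos=2, r_pos=3, seg_len=3, tau=3))
print(verify_around_match(s, r, s_pos=2, r_pos=3, seg_len=3, tau=2))
```

Because the left and right distances add up to an upper bound on the full edit distance, a `True` result guarantees the pair is within tau. The benefit is that each DP runs on a shorter string and the right-hand check can be skipped once the left side exhausts the budget.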

Combining Pass-Join-K with Hadoop: Big data. Big files. Large number of files.

Combining Pass-Join-K with Hadoop: Inverted index tree in Hadoop. Records for a string r of length 11: (abc, 1, 11, r, IFlag), (def, 2, 11, r, IFlag), (ghi, 3, 11, r, IFlag), (jk, 4, 11, r, IFlag). [Figure: index tree rooted at L11, with segment nodes 1 to 4, each pointing to r]

Combining Pass-Join-K with Hadoop: Substrings in Hadoop. Suppose tau = 3, k = 1, and s = "abdefghk", so length(s) = 8. We have to generate records such as (a, 1, 5, s, SFlag), (a, 2, 6, s, SFlag), (a, 3, 7, s, SFlag), (ab, 1, 8, s, SFlag), …, (ab, 1, 11, s, SFlag), …

Combining Pass-Join-K with Hadoop: Data flows in Hadoop. [Figure: data flows in Hadoop]
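The flow can be imitated in memory to show how index records and substring records meet. This sketch is an assumption-laden simulation, not Hadoop code: it groups records by (segment, segment number, string length) the way a shuffle would, then pairs IFlag ids with SFlag ids in each group. The index records and the ("ab", 1, 11, s, SFlag) record come from the slides; the ("def", 2, 11, s, SFlag) record and all function names are added for illustration.

```python
from collections import defaultdict

# Index-side records for r, as listed on the inverted-index slide.
index_records = [
    ("abc", 1, 11, "r", "IFlag"),
    ("def", 2, 11, "r", "IFlag"),
    ("ghi", 3, 11, "r", "IFlag"),
    ("jk", 4, 11, "r", "IFlag"),
]

# Probe-side records for s = "abdefghk". The second one is hypothetical.
probe_records = [
    ("ab", 1, 11, "s", "SFlag"),
    ("def", 2, 11, "s", "SFlag"),
]


def shuffle(records):
    """Group records by (segment, segment number, string length)."""
    groups = defaultdict(list)
    for segment, seg_no, length, string_id, flag in records:
        groups[(segment, seg_no, length)].append((flag, string_id))
    return groups


def reduce_candidates(groups):
    """Pair every indexed string with every probe string sharing a key."""
    pairs = set()
    for values in groups.values():
        indexed = [sid for flag, sid in values if flag == "IFlag"]
        probes = [sid for flag, sid in values if flag == "SFlag"]
        pairs.update((i, p) for i in indexed for p in probes)
    return pairs


print(reduce_candidates(shuffle(index_records + probe_records)))
```

Only the key ("def", 2, 11) holds both an IFlag and an SFlag record, so the single candidate pair (r, s) is emitted and would then go on to verification.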

Combining Pass-Join-K with Hadoop: [segmentString, segmentNumber, stringLength, FLAG], [DirNumber, ID]

Thanks for your patience. Email: yhycai@gmail.com
