Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints

Chuan Xiao, Wei Wang, Xuemin Lin
University of New South Wales & NICTA, Australia
CSE@UNSW

Motivation
• Data Cleaning
  • typo: ‘Steven Spielberg’ vs ‘Stephen Spielburg’
  • multiple representation: ‘harbor’ vs ‘harbour’
• Bioinformatics
  • DNA/protein sequence
    • AAAGTCTGAC…
    • AAACTCTGAC…

More Applications
• detect spam
  • SPAM EMAIL TEMPLATE: Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>
• identify plagiarism
  • First text: Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
  • Second text: Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

Outline
• Motivation
• Problem Definition
• Algorithms
• Experiments
• Conclusions

Edit Similarity Join
• Focus on similarity join on strings with an edit distance threshold (d)
  • edit distance ≤ d ⇒ two strings are similar
• Problem Definition
  • Given two collections of strings S and T, the edit similarity join problem is to compute { <s, t> | s ∈ S, t ∈ T, ed(s,t) ≤ d }
  • Consider the self-join case here
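The problem definition can be made concrete with a brute-force baseline. The following is a minimal sketch, not the method of the talk: it assumes unit-cost insertions, deletions and substitutions, and the names `edit_distance` and `naive_self_join` are illustrative only.

```python
from itertools import combinations


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance computed with a two-row dynamic program."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def naive_self_join(strings, d):
    """Return all index pairs (i, j), i < j, with ed(strings[i], strings[j]) <= d."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(strings), 2)
            if edit_distance(s, t) <= d]


if __name__ == "__main__":
    data = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
    print(naive_self_join(data, 1))
```

Every pair is verified here, which is the quadratic cost that the filters on the following slides try to avoid.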

q-gram Based Filtering [Gravano et al. VLDB01]
• Naïve algorithm
  • compute edit distance: O(n²) time complexity
  • do this for N²/2 pairs
• q-gram based filtering
  • filter-and-refine
  • length filtering
    • | len(s) - len(t) | ≤ d
• Example, q=3: ‘New_Zealand’ → New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and
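A small sketch of q-gram extraction and length filtering may help here. It assumes 1-based q-gram positions and no padding characters at the string ends; the function names are illustrative and not taken from the cited work.

```python
def qgrams(s: str, q: int):
    """Return the positional q-grams of s as (gram, 1-based start position) pairs."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def passes_length_filter(s: str, t: str, d: int) -> bool:
    """A pair can only have ed(s, t) <= d if the lengths differ by at most d."""
    return abs(len(s) - len(t)) <= d


if __name__ == "__main__":
    print([gram for gram, _ in qgrams("New_Zealand", 3)])
    print(passes_length_filter("Austria", "Australia", 1))
```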

Matching q-grams
• count filtering
  • at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d
  • d edit operations destroy at most q*d q-grams ⇒ similar strings share most q-grams
• position filtering
  • positions of common q-grams should be within d
• Implemented on RDBMS
  • best performance when q is small, such as q=2 or 3
• (figure: the 3-grams of ‘New_Zealand’, New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and, with the matching q-grams marked)
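The count filter can be sketched as follows. This is a simplified illustration that counts common q-grams as a multiset intersection and leaves out position filtering, so it is weaker than the combined filter described on the slide; the helper names are assumptions.

```python
from collections import Counter


def qgram_list(s: str, q: int):
    """All q-grams of s, without positions and without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d, as given on the slide."""
    return max(len(s), len(t)) - q + 1 - q * d


def common_qgrams(s: str, t: str, q: int) -> int:
    """Size of the multiset intersection of the two q-gram collections."""
    overlap = Counter(qgram_list(s, q)) & Counter(qgram_list(t, q))
    return sum(overlap.values())


def passes_count_filter(s: str, t: str, q: int, d: int) -> bool:
    """Keep the pair only if it shares at least LB(s,t) q-grams (positions ignored)."""
    return common_qgrams(s, t, q) >= count_lower_bound(s, t, q, d)


if __name__ == "__main__":
    print(passes_count_filter("Austria", "Australia", 2, 1))
    print(passes_count_filter("Australia", "Australiana", 2, 1))
```

Because positions are ignored, the sketch can only over-count matches, so it never discards a pair that the full filter would keep.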

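Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
• Bottleneck: generating candidate pairs which share at least LB(s,t) matching q-grams
• Prefix Filter
  • sort q-grams by a global ordering, such as idf
  • (figure: the sorted q-gram arrays Qs and Qt, each with l q-grams, split into a prefix of q*d+1 q-grams and a suffix of l-q*d-1 = LB(s,t)-1 q-grams)
  • if the two prefixes share no q-gram, at most LB(s,t)-1 q-grams can match, so the pair cannot be a result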
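Candidate generation with a prefix filter might look like the sketch below. It is an illustration under stated assumptions rather than the cited implementations: the global ordering is increasing document frequency with ties broken by the q-gram text, q-grams are matched by content only, and every string is assumed long enough that LB(s,t) is positive.

```python
from collections import Counter, defaultdict


def qgram_list(s: str, q: int):
    """All q-grams of s, without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def prefix_filter_candidates(strings, q: int, d: int):
    """Return index pairs (i, j), i < j, whose (q*d+1)-prefixes share a q-gram."""
    grams = [qgram_list(s, q) for s in strings]
    # Document frequency of each q-gram: rare q-grams come first (idf-like ordering).
    df = Counter(g for gs in grams for g in set(gs))
    index = defaultdict(list)  # q-gram -> ids of earlier strings with it in their prefix
    candidates = set()
    for sid, gs in enumerate(grams):
        ordered = sorted(gs, key=lambda g: (df[g], g))
        prefix = set(ordered[:q * d + 1])
        for g in prefix:
            for other in index[g]:
                candidates.add((other, sid))
        for g in prefix:
            index[g].append(sid)
    return candidates


if __name__ == "__main__":
    data = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
    print(sorted(prefix_filter_candidates(data, 2, 1)))
```

On the five strings of the All-Pairs-Ed example (d=1, q=2), this ordering yields the candidate pairs <a,b>, <b,c> and <d,e>.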

All-Pairs-Ed Algorithm [Bayardo et al. WWW07]
• (figure, pipeline) Indexed Record Set → Prefix Filter / Cand-1 Generation → Count Filter / Cand-2 Generation → Verification / Edit Distance → Result Pairs

Example – All-Pairs-Ed
• d=1, q=2, prefix_len = q*d+1 = 3
• a=‘Austria’, Qa={ri, Au, us, …}
• b=‘Australia’, Qb={ra, li, Au, …}
• c=‘Australiana’, Qc={na, ra, li, …}
• d=‘New_Zealand’, Qd={_Z, Ze, Ne, …}
• e=‘New_Sealand’, Qe={_S, Se, Ne, …}
• after prefix filter: <a,b> <b,c> <d,e>
• after count filter: <b,c> <d,e>
• after edit distance verification: <d,e>

Ed-Join
• Idea
  • mismatching q-grams provide useful information

Location-Based Filtering
• Idea: reduce prefix length
• Example, d=1, q=2
  • s=‘Austria’
  • t=‘Australia’
  • (figure: the sorted q-gram arrays Qs and Qt labelled with q-gram locations, 5 and 1 for Qs, 5 and 7 for Qt; one part is marked as pruned)

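Minimum Prefix Length
• standard prefix length: q*d+1
• minimum prefix: the shortest prefix of Qs whose q-grams need at least d+1 edit operations to destroy them
• found by sequential search
• Further optimization: binary search within [d+1, q*d+1]
• Example, d=2, q=2: min. prefix len. = 4 (figure: q-gram array Qs with entries 1 to 6)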
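The search for the minimum prefix length can be sketched as below. The sketch assumes that the q-grams of a string are already sorted by the global ordering and represented by their 1-based start positions, and it uses a greedy left-to-right scan to count the fewest edit operations that can destroy a set of q-grams; both function names are illustrative.

```python
def min_edit_errors(positions, q: int) -> int:
    """Fewest edit operations that destroy all q-grams starting at the given positions."""
    count, covered_up_to = 0, 0
    for pos in sorted(positions):
        if pos > covered_up_to:
            # One edit at position pos+q-1 hits every q-gram starting in [pos, pos+q-1].
            count += 1
            covered_up_to = pos + q - 1
    return count


def min_prefix_length(ordered_positions, q: int, d: int) -> int:
    """Shortest prefix (in global order) that needs at least d+1 edits to destroy."""
    upper = min(q * d + 1, len(ordered_positions))
    for length in range(d + 1, upper + 1):  # sequential search over [d+1, q*d+1]
        if min_edit_errors(ordered_positions[:length], q) > d:
            return length
    return upper


if __name__ == "__main__":
    # 'Austria' with prefix order ri, Au, us -> start positions 5, 1, 2 (d=1, q=2)
    print(min_prefix_length([5, 1, 2], 2, 1))
    # 'New_Zealand' with prefix order _Z, Ze, Ne -> start positions 4, 5, 1
    print(min_prefix_length([4, 5, 1], 2, 1))
```

Since the number of required edits never decreases as the prefix grows, the sequential loop could be replaced by a binary search over the same range.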

Limit of Count/Loc.-Based Filter
• Clustered edit operations
  • s=‘…please submit by Aug…’
  • t=‘…please submit by Sep…’
  • 4 mismatching q-grams if q=2 ⇒ retained (d=2)
• Non-clustered edit operations
  • s’=‘…please submit by Aug…’
  • t’=‘…pleese supmit bi Aug…’
  • 6 mismatching q-grams if q=2 ⇒ pruned (d=2)
• Each pair shows 3 substitutions, but clustered edit operations destroy fewer q-grams ⇒ count/location-based filtering is less effective

Content-Based Filtering
• Probing Window
  • compare the character frequencies of s and t within the probing window using L1 distance
  • An edit operation increases the L1 distance within the probing window by at most two
  • L1 distance should be ≤ 2d if ed(s, t) ≤ d
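The bound above can be checked with a short sketch. It assumes the probing window covers the same character positions in both strings and is given as a half-open index range; window selection is not covered, and the function names and the truncated example strings are illustrative.

```python
from collections import Counter


def l1_distance(u: str, v: str) -> int:
    """L1 distance between the character frequency counts of u and v."""
    cu, cv = Counter(u), Counter(v)
    return sum(abs(cu[ch] - cv[ch]) for ch in cu.keys() | cv.keys())


def content_filter_prunes(s: str, t: str, start: int, end: int, d: int) -> bool:
    """True if the window [start, end) proves ed(s, t) > d, i.e. its L1 distance exceeds 2d."""
    return l1_distance(s[start:end], t[start:end]) > 2 * d


if __name__ == "__main__":
    s = "please submit by Aug"
    t = "please submit by Sep"
    # The window over the last three characters holds the clustered edits.
    print(l1_distance(s[17:20], t[17:20]))
    print(content_filter_prunes(s, t, 17, 20, 2))
```

With d=2 the window over ‘Aug’ and ‘Sep’ has L1 distance 6 > 2d, so a pair that the count filter retains is pruned here.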

Select Probing Window
• Example, d=3, q=3
  • (figure: probing windows over s and t; one window gives L1 = 2, another gives L1 = 8 > 2d ⇒ pruned)

Example – Ed-Join
• d=1, q=2
• a=‘Austria’
• b=‘Australia’
• c=‘Australiana’
• d=‘New_Zealand’
• e=‘New_Sealand’
• prefixes of length q*d+1, as in the All-Pairs-Ed example
  • Qa={ri, Au, us, …}
  • Qb={ra, li, Au, …}
  • Qc={na, ra, li, …}
  • Qd={_Z, Ze, Ne, …}
  • Qe={_S, Se, Ne, …}
• shortened prefixes shown for Ed-Join
  • Qa={ri, Au, …}
  • Qb={ra, li, …}
  • Qc={na, ra, …}
  • Qd={_Z, Ze, Ne, …}
  • Qe={_S, Se, Ne, …}
• after prefix filter: <b,c> <d,e>
• after count filter: <b,c> <d,e>
• after content-based filter: <d,e>
• after edit distance verification: <d,e>

Experiment Settings
• Environment
  • Intel Xeon X3220 2.4GHz CPU, 4GB RAM
  • Debian 4.1, GCC 4.1.2 with -O3
• Algorithm
  • All-Pairs-Ed [Bayardo et al. WWW07]
  • PartEnum [Arasu et al. VLDB06]
  • Ed-Join / Ed-Join-l
• Dataset

Experiment – Large Threshold
• UNIREF, Running Time

Experiment – q
• TREC, Running Time
• q=8 achieves best performance for TREC

Experiment – with PartEnum
• (figure panels: d=1, d=2, d=3)

Conclusions
• Contributions
  • an efficient algorithm for edit similarity join
  • exploit mismatching q-grams
    • location-based filtering – non-clustered edit ops.
    • content-based filtering – clustered edit ops.
  • longer q-grams perform best for stand-alone implementation
• Future work
  • other similarity measures, e.g., used in DNA/protein alignment

Thank you!
Additional Materials Available at http://www.cse.unsw.edu.au/~weiw/project/simjoin.html

Related Work
• q-gram Based Filtering
  • L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
• Algorithms for Set Similarity Join
  • Index-based approaches
    • S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
    • C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
  • Prefix-based approaches
    • S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
    • R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
    • C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
  • PartEnum
    • A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

Related Work
• Edit Distance Computation
  • R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.
  • W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980.
  • G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415, 1999.
  • E. Ukkonen. On approximate string matching. In FCT, 1983.

Experiment – Pruning Power
