Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. Chuan Xiao, Wei Wang, Xuemin Lin. University of New South Wales & NICTA Australia. Motivation. Data Cleaning: typo; multiple representation: ‘harbor’ vs ‘harbour’. Bioinformatics: DNA/protein sequence
Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints
Chuan Xiao, Wei Wang, Xuemin Lin
University of New South Wales & NICTA Australia
CSE@UNSW
Motivation
• Data Cleaning
  • typo: ‘Steven Spielberg’ vs ‘Stephen Spielburg’
  • multiple representation: ‘harbor’ vs ‘harbour’
• Bioinformatics
  • DNA/protein sequence
    • AAAGTCTGAC…
    • AAACTCTGAC…
More Applications
• detect spam
  SPAM EMAIL TEMPLATE
  Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>
• identify plagiarism
  Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
  Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.
Outline
• Motivation
• Problem Definition
• Algorithms
• Experiments
• Conclusions
Edit Similarity Join
• Focus on similarity join on strings with edit distance threshold (d)
  • edit distance ≤ d ⇒ two strings are similar
• Problem Definition
  • Given two collections of strings S and T, the edit similarity join problem is to compute { <s, t> | s ∈ S, t ∈ T, ed(s,t) ≤ d }
  • Consider the self-join case here
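The problem definition can be made concrete with a brute-force baseline. The sketch below is only an illustration, not the Ed-Join algorithm: it uses the textbook dynamic program for edit distance and compares every pair in a self-join. The function names `edit_distance` and `naive_self_join` are hypothetical and chosen for this example.

```python
from itertools import combinations

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic dynamic program, O(|s|*|t|) time."""
    prev = list(range(len(t) + 1))           # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                            # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def naive_self_join(strings, d):
    """All index pairs (i, j), i < j, with ed(strings[i], strings[j]) <= d."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(strings), 2)
            if edit_distance(s, t) <= d]

words = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
print(naive_self_join(words, 1))
```

Every pair is verified here, which is exactly the cost that the filtering techniques in the following slides try to avoid.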
Outline
• Motivation
• Problem Definition
• Algorithms
• Experiments
• Conclusions
q-gram Based Filtering [Gravano et al. VLDB01]
• Naïve algorithm
  • compute edit distance: O(n^2) time complexity
  • do this for N^2/2 pairs
• q-gram based filtering
  • filter-and-refine
  • length filtering
    • | len(s)-len(t) | ≤ d
• Example (q=3): ‘New_Zealand’ → New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and
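A small sketch of q-gram extraction and the length filter described above. The helper names `qgrams` and `passes_length_filter` are assumptions made for this example, and positions are numbered from 1 to match the slides.

```python
def qgrams(s: str, q: int):
    """Positional q-grams of s as (gram, position) pairs, positions starting at 1."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]

def passes_length_filter(s: str, t: str, d: int) -> bool:
    """Strings whose lengths differ by more than d cannot be within edit distance d."""
    return abs(len(s) - len(t)) <= d

print([g for g, _ in qgrams("New_Zealand", 3)])
print(passes_length_filter("Austria", "Australia", 1))
```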
Matching q-grams
• count filtering
  • at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d
  • d edit operations destroy at most q*d q-grams, so similar strings share most of their q-grams
• position filtering
  • positions of common q-grams should be within d
• Implemented on RDBMS
  • best performance when small q, such as q=2,3
[Figure: the q-grams of ‘New_Zealand’ (New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and) with the matching q-grams marked]
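The count filter can be sketched as follows. This is a simplified illustration that counts common q-grams as a multiset intersection and ignores positions, so it does not include the position filter. The names `qgram_bag`, `count_lower_bound`, and `passes_count_filter` are hypothetical.

```python
from collections import Counter

def qgram_bag(s: str, q: int) -> Counter:
    """Multiset of the q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def count_lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d."""
    return max(len(s), len(t)) - q + 1 - q * d

def passes_count_filter(s: str, t: str, q: int, d: int) -> bool:
    """Keep the pair only if it shares at least LB(s,t) q-grams (positions ignored)."""
    common = sum((qgram_bag(s, q) & qgram_bag(t, q)).values())
    return common >= count_lower_bound(s, t, q, d)

# With d=1 and q=2, 'Austria' and 'Australia' share 5 q-grams but need 6.
print(passes_count_filter("Austria", "Australia", 2, 1))
```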
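Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
• Bottleneck: generating candidate pairs which share at least LB(s,t) matching q-grams
• Prefix Filter
  • sort q-grams by global ordering, such as idf
  • if the first q*d+1 q-grams of Qs and Qt share no q-gram, at most LB(s,t)-1 q-grams can match, so the pair is pruned
[Figure: sorted q-gram arrays Qs and Qt, each split into a prefix of q*d+1 q-grams and a remainder of l-q*d-1 = LB(s,t)-1 q-grams]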
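A minimal sketch of prefix-filter candidate generation with an inverted index over prefix q-grams. It assumes the global ordering is ascending document frequency (rarest q-gram first, ties broken alphabetically) as a stand-in for idf; that tie-breaking is an assumption, so the prefixes it produces need not match the ones shown on the example slides. All function names are hypothetical.

```python
from collections import Counter

def qgram_list(s: str, q: int):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_order(strings, q):
    """Rank q-grams by ascending document frequency, ties broken alphabetically."""
    df = Counter(g for s in strings for g in set(qgram_list(s, q)))
    ranked = sorted(df, key=lambda g: (df[g], g))
    return {g: rank for rank, g in enumerate(ranked)}

def prefix(s: str, q: int, d: int, order):
    """First q*d+1 q-grams of s under the global ordering."""
    grams = sorted(qgram_list(s, q), key=lambda g: order[g])
    return grams[:q * d + 1]

def prefix_candidates(strings, q, d):
    """Pairs (j, i), j < i, whose prefixes share at least one q-gram."""
    order = build_order(strings, q)
    index = {}            # q-gram -> ids of earlier strings with it in their prefix
    candidates = set()
    for i, s in enumerate(strings):
        for g in set(prefix(s, q, d, order)):
            for j in index.get(g, []):
                candidates.add((j, i))
            index.setdefault(g, []).append(i)
    return candidates

words = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
print(sorted(prefix_candidates(words, q=2, d=1)))
```

The candidate set is a superset of the true result, so each surviving pair still has to go through the count filter and edit distance verification.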
All-Pairs-Ed Algorithm [Bayardo et al. WWW07]
[Figure: pipeline] Indexed Record Set → Prefix Filter (Cand-1 Generation) → Count Filter (Cand-2 Generation) → Edit Distance (Verification) → Result Pairs
Example – All-Pairs-Ed
• d=1, q=2, prefix_len = q*d+1 = 3
• Strings
  • a=‘Austria’
  • b=‘Australia’
  • c=‘Australiana’
  • d=‘New_Zealand’
  • e=‘New_Sealand’
• Prefixes
  • Qa={ri, Au, us, …}
  • Qb={ra, li, Au, …}
  • Qc={na, ra, li, …}
  • Qd={_Z, Ze, Ne, …}
  • Qe={_S, Se, Ne, …}
• after prefix filter: <a,b> <b,c> <d,e>
• after count filter: <b,c> <d,e>
• after edit distance verification: <d,e>
Ed-Join
• Idea
  • mismatching q-grams provide useful information
Location-Based Filtering
• Idea: reduce prefix length
• Example, d=1, q=2
  • s=‘Austria’
  • t=‘Australia’
[Figure: prefix q-grams of Qs and Qt labelled with their locations (5 and 1 in Qs, 5 and 7 in Qt); the pair is pruned]
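Minimum Prefix Length
• the standard prefix has q*d+1 q-grams
• the minimum prefix is the shortest prefix of Qs whose q-grams need at least d+1 edit operations to destroy them
  • found by sequential search
  • Further optimization: binary search within [d+1, q*d+1]
• Example, d=2, q=2: min. prefix len. = 4
[Figure: q-gram array Qs with positions 1 to 6]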
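The sequential search can be sketched as below, assuming the q-grams of a string are given by their 1-based start locations, listed in global order. The greedy routine counts the fewest edit operations needed to destroy a set of q-grams: one edit placed on the last character of the leftmost surviving q-gram also destroys every q-gram that overlaps that character. The names `min_edit_errors` and `min_prefix_length`, and the location lists in the usage lines, are hypothetical and not taken from the slides.

```python
def min_edit_errors(locations, q):
    """Fewest edit operations that destroy all q-grams starting at these locations."""
    count, covered_up_to = 0, 0
    for loc in sorted(locations):
        if loc > covered_up_to:          # this q-gram survives all edits so far
            count += 1
            covered_up_to = loc + q - 1  # edit its last character
    return count

def min_prefix_length(locations_in_global_order, q, d):
    """Shortest prefix whose q-grams need at least d+1 edits to destroy."""
    upper = min(q * d + 1, len(locations_in_global_order))
    for length in range(d + 1, upper + 1):   # sequential search
        if min_edit_errors(locations_in_global_order[:length], q) >= d + 1:
            return length
    return upper

# Illustrative locations only: with d=2, q=2 the first four q-grams suffice.
print(min_prefix_length([1, 2, 4, 6, 3], q=2, d=2))
```

Because adding q-grams never lowers the number of edits required, the same test can drive the binary search within [d+1, q*d+1] mentioned on the slide.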
Limit of Count/Loc.-Based Filter
• Clustered edit operations
  • s=‘…please submit by Aug…’
  • t=‘…please submit by Sep…’
  • 4 mismatching q-grams if q=2: retained (d=2)
• Non-clustered edit operations
  • s’=‘…please submit by Aug…’
  • t’=‘…pleese supmit bi Aug…’
  • 6 mismatching q-grams if q=2: pruned (d=2)
• Clustered edit operations destroy fewer q-grams, so count/location-based filtering is less effective
Content-Based Filtering
• Probing Window
• An edit operation increases L1 distance within the probing window by at most two
• L1 distance should be ≤ 2d if ed(s, t) ≤ d
[Figure: strings s and t with a probing window]
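A rough sketch of the content-based test, assuming the L1 distance is taken between the character-frequency histograms of the two strings inside the same probing window (the slide does not spell out what the distance is computed over). The window bounds are passed in by the caller; how Ed-Join selects windows is not modelled here. The names `l1_distance` and `content_filter_prunes` are hypothetical.

```python
from collections import Counter

def l1_distance(u: str, v: str) -> int:
    """L1 distance between the character-frequency histograms of u and v."""
    cu, cv = Counter(u), Counter(v)
    return sum(abs(cu[c] - cv[c]) for c in cu.keys() | cv.keys())

def content_filter_prunes(s: str, t: str, lo: int, hi: int, d: int) -> bool:
    """True if the window [lo, hi) (0-based, half-open) shows that ed(s, t) > d."""
    return l1_distance(s[lo:hi], t[lo:hi]) > 2 * d

s = "please submit by Aug"
t = "please submit by Sep"
# The window covers the clustered edits 'Aug' vs 'Sep'.
print(content_filter_prunes(s, t, 17, 20, d=2))
```

This is the clustered case that count and location-based filtering retain: three substitutions sit next to each other, yet the window's L1 distance of 6 exceeds 2d = 4.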
Select Probing Window
• Example, d=3, q=3
[Figure: two candidate probing windows over s and t; one has L1 = 2, the other has L1 = 8 > 2d, so the pair is pruned]
Example – Ed-Join
• d=1, q=2
• Strings
  • a=‘Austria’
  • b=‘Australia’
  • c=‘Australiana’
  • d=‘New_Zealand’
  • e=‘New_Sealand’
• Prefixes of length q*d+1
  • Qa={ri, Au, us, …}
  • Qb={ra, li, Au, …}
  • Qc={na, ra, li, …}
  • Qd={_Z, Ze, Ne, …}
  • Qe={_S, Se, Ne, …}
• Shortened prefixes
  • Qa={ri, Au, …}
  • Qb={ra, li, …}
  • Qc={na, ra, …}
  • Qd={_Z, Ze, Ne, …}
  • Qe={_S, Se, Ne, …}
• after prefix filter: <b,c> <d,e>
• after count filter: <b,c> <d,e>
• after content-based filter: <d,e>
• after edit distance verification: <d,e>
Outline
• Motivation
• Problem Definition
• Algorithms
• Experiments
• Conclusions
Experiment Settings
• Environment
  • Intel Xeon X3220 2.4GHz CPU, 4GB RAM
  • Debian 4.1, GCC 4.1.2 with -O3
• Algorithm
  • All-Pairs-Ed [Bayardo et al. WWW07]
  • PartEnum [Arasu et al. VLDB06]
  • Ed-Join / Ed-Join-l
• Dataset
Experiment – Large Threshold
• UNIREF, Running Time
Experiment - q
• TREC, Running Time
• q=8 achieves best performance for TREC
Experiment - with PartEnum
[Figure: three panels, d=1, d=2, d=3]
Conclusions
• Contributions
  • an efficient algorithm for edit similarity join
  • exploit mismatching q-grams
    • location-based filtering – non-clustered edit ops.
    • content-based filtering – clustered edit ops.
  • longer q-grams perform best for stand-alone implementation
• Future work
  • other similarity measures, e.g., used in DNA/protein alignment
Thank you!
Additional Materials Available at http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
Related Work
• q-gram Based Filtering
  • L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
• Algorithms to Set Similarity Join
  • Index-based approaches
    • S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
    • C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
  • Prefix-based approaches
    • S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
    • R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
    • C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
  • PartEnum
    • A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
Related Work
• Edit Distance Computation
  • R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.
  • W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980.
  • G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415, 1999.
  • E. Ukkonen. On approximate string matching. In FCT, 1983.
Experiment – Pruning Power