350 likes | 479 Views
Modified by weiw. Approximate String Joins in a Database (Almost) for Free. Published in VLDB 2001. The Need for String Joins. Substantial amounts of data in existing DBMSs are strings Often, there is a need to correlate data stored in different tables
E N D
Modified by weiw Approximate String Joins in a Database (Almost) for Free Published in VLDB 2001
The Need for String Joins • Substantial amounts of data in existing DBMSs are strings • Often, there is a need to correlate data stored in different tables • Example: Find common customers across various customer databases Columbia University Computer Science Dept.
Problems with Exact String Joins • Typing mistakes (e.g., John vs. Jonh) • No standard way of recording string data • Standard equijoins do not “forgive such mistakes” = ∅ ⋈ Columbia University Computer Science Dept.
Google “britany spears” Approximate String Joins • We want to join tuples with “similar” string fields • Similarity measure: Edit Distance • Each Insertion, Deletion, Replacement increases distance by one Columbia University Computer Science Dept.
Edit Distance • Also known as “Levenshtein distance”: http://en.wikipedia.org/wiki/Levenshtein_distance • Distance is shortest sequence of edit commands that transform s to t. • Simplest set of operations: • Copy character from s over to t • Delete a character in s (cost 1) • Insert a character in t (cost 1) • Substitute one character for another (cost 1) Columbia University Computer Science Dept.
Levenstein distance - example • distance(“William Cohen”, “Willliam Cohon”) s alignment t op cost Columbia University Computer Science Dept.
Levenstein distance - example • distance(“William Cohen”, “Willliam Cohon”) s gap alignment t op cost Columbia University Computer Science Dept.
D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //delete D(i,j-1)+1 //insert = min Computing Levenstein distance - 1 D(i,j) = score of best alignment from s1..si to t1..tj Columbia University Computer Science Dept.
Computing Levenstein distance - 2 D(i,j) = score of best alignment from s1..si to t1..tj D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //delete D(i,j-1)+1 //insert = min (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j Columbia University Computer Science Dept.
Our Focus: Approximate String Joins over Relational DBMSs • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K • Solve the problem in a database-friendly way (if possible with an existing "vanilla" RDBMS) Columbia University Computer Science Dept.
Related Work • Data Cleaning • Hernandez and Stolfo, DMKD Journal, 2 (1), 1998 • Monge and Elkan, SIGMOD DMKD Workshop, 1997 • ... • Approximate String Matching • Baeza-Yates and Gonnet, SPIRE 1999 • Sutinen and Tarhio, ESA’95, CPM’96 • Smith and Waterman, Journal of Molecular Biology 147, 1981 • Ukkonen, TCS 92(1), 1992 • Ullman, Computer Journal, 1977 • … Columbia University Computer Science Dept.
Current Approaches for Processing Approximate String Joins No native support for approximate joins in RDBMSs Two existing (straightforward) solutions: • Join data outside of DBMS • Join data via user-defined functions (UDFs) inside the DBMS Columbia University Computer Science Dept.
Approximate String Joins outside of a DBMS • Export data • Join outside of DBMS • Import the result Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality Disadvantages: • Substantial amounts of data to be exported/imported • Cannot be easily integrated with further processing steps in the DBMS Columbia University Computer Science Dept.
Approximate String Joins with UDFs • Write a UDF to check if two strings match within distance K • Write an SQL statement that applies the UDF to the string pairs SELECT R.stringAttr, S.stringAttr FROM R, S WHERE edit_distance(R.stringAttr, S.stringAttr, K) Main advantage: Ease of implementation Main disadvantage: UDF applied to entire cross-product of relations Columbia University Computer Science Dept.
Our Approach: Approximate String Joins over an Unmodified RDBMS • Preprocess data and generate auxiliary tables • Perform join exploiting standard RDBMS capabilities Advantages • No modification of underlying RDBMS needed. • Can leverage the RDBMS query optimizer. • Much more efficient than the approach based on naive UDFs Columbia University Computer Science Dept.
Our Approach: Intuition and Roadmap • Intuition: • Similar strings have many common substrings • Use exact joins to perform approximate joins (current DBMSs are good for exact joins) • A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977] • Roadmap: • Break strings into substrings of length q (q-grams) • Perform an exact join on the q-grams • Find candidate string pairs based on the results • Check only candidate pairs with a UDF to obtain final answer Columbia University Computer Science Dept.
What is a “Q-gram”? • Q-gram: A sequence of q characters of the original string • Example for q=3 • String with length L → L + q - 1 q-grams • Similar stringshave a many common q-grams vacations { ##v #va vac aca cat ati tio ion ons ns$ s$$ } Columbia University Computer Science Dept.
Q-grams and Edit Distance Operations • With no edits: L + q - 1 common q-grams • Replacement: (L + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$} • Insertion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$} • Deletion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$} Columbia University Computer Science Dept.
“Q-gram Distance” and Edit Distance • In the “q-gram space” the distance for a pair of strings (S1,S2) is defined as: • |Maximum number of q-grams| - |Common q-grams| • Each Edit Distance operation affects a q-gram in the following ways: • It destroys it, or • It leaves it intact, or • It shifts it by one position • Each Edit Distance operation destroys at most q q-grams • For Edit Distance = K, the “q-gram distance” is at most Kq Columbia University Computer Science Dept.
Number of Common Q-grams and Edit Distance • For Edit Distance = K, there could be at most K replacements, insertions, deletions • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common • But no the other way around! (e.g.?) • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals) Columbia University Computer Science Dept.
Using a DBMS for Q-gram Joins • If we have the q-grams in the DBMS, we can perform this counting efficiently. • Create auxiliary tables with tuples of the form: <sid, strlen, qgram> and join these tables • A GROUP BY – HAVING COUNT clause can perform the counting / filtering Columbia University Computer Science Dept.
Eliminating Candidate Pairs: COUNT FILTERING SQL for this filter: (parts omitted for clarity) SELECT R.sid, S.sid FROM R, S WHERER.qgram=S.qgram GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives. Columbia University Computer Science Dept.
Eliminating Candidate Pairs Further: LENGTH FILTERING Strings with length difference larger than K cannot be within Edit Distance K SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as LENGTH FILTERING Columbia University Computer Science Dept.
Exploiting Q-gram Positions for Filtering • Consider strings aabbzzaacczz and aacczzaabbzz • Strings are at edit distance 4 • Strings have identical q-grams for q=3 Problem: Matching q-grams that are at different positions in both strings • Either q-grams do not "originate" from same q-gram, or • Too many edit operations "caused" spurious q-grams at various parts of strings to match Columbia University Computer Science Dept.
POSITION FILTERING - Filtering using positions • Keep the position of the q-grams <sid, strlen, pos, qgram> • Do not match q-grams that are more than K positions away SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K AND abs(R.pos - S.pos)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as POSITION FILTERING Columbia University Computer Science Dept.
The Actual, Complete SQL Statement SELECT R1.string, S1.string, R1.sid, S1.sid FROM R1, S1, R, S, WHERE R1.sid=R.sid AND S1.sid=S.sid AND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=K AND abs(R.pos - S.pos)<=K GROUP BY R1.sid, S1.sid, R1.string, S1.string HAVING COUNT(*) >= (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q Columbia University Computer Science Dept.
Note: (1) we do not add #/$ to save space (2) still need to go thru edit distance check via UDF Example (K = 1) Question: what if S3 = dxb? R1 S1 Join R S gram |R.len-S.len|≤K |R.pos-S.pos|≤K Group-By count(*) ≥ mincount mincount = max(4,3)-1-1*2 = 1 mincount = max(3,3)-1-1*2 = 0
Experimental Results: Data • Three sets of customer data from AT&T Worldnet Service • 3 customer data sets from AT&T Worldnet: • (a) set1 with about 40K strings • (b) set2 and (c) set3 with about 30K strings each Columbia University Computer Science Dept.
DBMS Setup • Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server • Materialized the q-gram tables with entries <sid, qgram, pos> (less then 2 minutes per table) • Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index) The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible Columbia University Computer Science Dept.
Query Plans Generated by RDBMS • Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets) • Q-gram approach: usually sort-merge joins • In our prototype implementation, sort-merge joins is the fastest as well Columbia University Computer Science Dept.
Naïve UDFs vs. Filtering For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs Columbia University Computer Science Dept.
Effect of Filters (Candidate Set) LENGTH FILTERING: 40-70% reduction for set1 (small length deviation) 90-98% reductions for set2, set3 (big length deviation) +COUNT FILTERING: > 99% reduction POSITION FILTERING: ~ 50% reduction (additionally) Columbia University Computer Science Dept.
Effect of Filters (Candidate Set) – Best Q • For the given data sets q=2 worked best • q=2 is close to the theoretical approximations as well • q=2 is small enough to avoid, as much as possible, the space overhead for the auxiliary tables Columbia University Computer Science Dept.
Effect of Filters (Q-gram Join Size) • COUNT FILTERING is applied last (in HAVING clause) • For efficiency, it is important to have a small join size for the q-gram tables • LENGTH FILTERING cuts the size by a factor of 2 to 10 • +POSITION FILTERING cuts the size by a factor of ~100 Columbia University Computer Science Dept.
Conclusions • We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression • Our technique does not require modifying the underlying RDBMS • Our technique exploits the RDBMS's query optimizer • Our technique significantly outperforms existing approaches Many opportunities for improvements and future work! Columbia University Computer Science Dept.