310 likes | 330 Views
Find common customers across multiple customer databases by joining tuples with similar string fields. Use the Edit Distance similarity measure and solve the problem in a database-friendly way.
E N D
The Need for String Joins • Substantial amounts of data in existing DBMSs are strings • Often, there is a need to correlate data stored in different tables • Example: Find common customers across various customer databases Columbia University Computer Science Dept.
Problems with Exact String Joins • Typing mistakes (e.g., John vs. Jonh) • No standard way of recording string data • Standard equijoins do not “forgive such mistakes” = ∅ ⋈ Columbia University Computer Science Dept.
Approximate String Joins • We want to join tuples with “similar” string fields • Similarity measure: Edit Distance • Each Insertion, Deletion, Replacement increases distance by one Columbia University Computer Science Dept.
Our Focus: Approximate String Joins over Relational DBMSs • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K • Solve the problem in a database-friendly way (if possible with an existing "vanilla" RDBMS) Columbia University Computer Science Dept.
Related Work • Data Cleaning • Hernandez and Stolfo, DMKD Journal, 2 (1), 1998 • Monge and Elkan, SIGMOD DMKD Workshop, 1997 • ... • Approximate String Matching • Baeza-Yates and Gonnet, SPIRE 1999 • Sutinen and Tarhio, ESA’95, CPM’96 • Smith and Waterman, Journal of Molecular Biology 147, 1981 • Ukkonen, TCS 92(1), 1992 • Ullman, Computer Journal, 1977 • … Columbia University Computer Science Dept.
Current Approaches for Processing Approximate String Joins No native support for approximate joins in RDBMSs Two existing (straightforward) solutions: • Join data outside of DBMS • Join data via user-defined functions (UDFs) inside the DBMS Columbia University Computer Science Dept.
Approximate String Joins outside of a DBMS • Export data • Join outside of DBMS • Import the result Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality Disadvantages: • Substantial amounts of data to be exported/imported • Cannot be easily integrated with further processing steps in the DBMS Columbia University Computer Science Dept.
Approximate String Joins with UDFs • Write a UDF to check if two strings match within distance K • Write an SQL statement that applies the UDF to the string pairs SELECT R.stringAttr, S.stringAttr FROM R, S WHERE edit_distance(R.stringAttr, S.stringAttr, K) Main advantage: Ease of implementation Main disadvantage: UDF applied to entire cross-product of relations Columbia University Computer Science Dept.
Our Approach: Approximate String Joins over an Unmodified RDBMS • Preprocess data and generate auxiliary tables • Perform join exploiting standard RDBMS capabilities Advantages • No modification of underlying RDBMS needed. • Can leverage the RDBMS query optimizer. • Much more efficient than the approach based on naive UDFs Columbia University Computer Science Dept.
Our Approach: Intuition and Roadmap • Intuition: • Similar strings have many common substrings • Use exact joins to perform approximate joins (current DBMSs are good for exact joins) • A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977] • Roadmap: • Break strings into substrings of length q (q-grams) • Perform an exact join on the q-grams • Find candidate string pairs based on the results • Check only candidate pairs with a UDF to obtain final answer Columbia University Computer Science Dept.
What is a “Q-gram”? • Q-gram: A sequence of q characters of the original string Example for q=3 vacations {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$} String with length L → L + q - 1 q-grams • Similar stringshave a many common q-grams Columbia University Computer Science Dept.
Q-grams and Edit Distance Operations • With no edits: L + q - 1 common q-grams • Replacement: (L + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$} • Insertion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$} • Deletion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$} Columbia University Computer Science Dept.
“Q-gram Distance” and Edit Distance • In the “q-gram space” the distance for a pair of strings (S1,S2) is defined as: • |Maximum number of q-grams| - |Common q-grams| • Each Edit Distance operation affects a q-gram in the following ways: • It destroys it, or • It leaves it intact, or • It shifts it by one position • Each Edit Distance operation destroys at most q q-grams • For Edit Distance = K, the “q-gram distance” is at most Kq Columbia University Computer Science Dept.
Number of Common Q-grams and Edit Distance • For Edit Distance = K, there could be at most K replacements, insertions, deletions • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals) Columbia University Computer Science Dept.
Using a DBMS for Q-gram Joins • If we have the q-grams in the DBMS, we can perform this counting efficiently. • Create auxiliary tables with tuples of the form: <sid, strlen, qgram> and join these tables • A GROUP BY – HAVING COUNT clause can perform the counting / filtering Columbia University Computer Science Dept.
Eliminating Candidate Pairs: COUNT FILTERING SQL for this filter: (parts omitted for clarity) SELECT R.sid, S.sid FROM R, S WHERER.qgram=S.qgram GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives. Columbia University Computer Science Dept.
Eliminating Candidate Pairs Further: LENGTH FILTERING Strings with length difference larger than K cannot be within Edit Distance K SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as LENGTH FILTERING Columbia University Computer Science Dept.
Exploiting Q-gram Positions for Filtering • Consider strings aabbzzaacczz and aacczzaabbzz • Strings are at edit distance 4 • Strings have identical q-grams for q=3 Problem: Matching q-grams that are at different positions in both strings • Either q-grams do not "originate" from same q-gram, or • Too many edit operations "caused" spurious q-grams at various parts of strings to match Columbia University Computer Science Dept.
POSITION FILTERING - Filtering using positions • Keep the position of the q-grams <sid, strlen, pos, qgram> • Do not match q-grams that are more than K positions away SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K AND abs(R.pos - S.pos)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as POSITION FILTERING Columbia University Computer Science Dept.
The Actual, Complete SQL Statement SELECT R1.string, S1.string, R1.sid, S1.sid FROM R1, S1, R, S, WHERE R1.sid=R.sid AND S1.sid=S.sid AND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=K AND abs(R.pos - S.pos)<=K GROUP BY R1.sid, S1.sid, R1.string, S1.string HAVING COUNT(*) >= (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q Columbia University Computer Science Dept.
Experimental Results: Data • Three sets of customer data from AT&T Worldnet Service • 3 customer data sets from AT&T Worldnet: • (a) set1 with about 40K strings • (b) set2 and (c) set3 with about 30K strings each Columbia University Computer Science Dept.
DBMS Setup • Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server • Materialized the q-gram tables with entries <sid, qgram, pos> (less then 2 minutes per table) • Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index) The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible Columbia University Computer Science Dept.
Query Plans Generated by RDBMS • Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets) • Q-gram approach: usually sort-merge joins • In our prototype implementation, sort-merge joins is the fastest as well Columbia University Computer Science Dept.
Naïve UDFs vs. Filtering For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs Columbia University Computer Science Dept.
Effect of Filters (Candidate Set) LENGTH FILTERING: 40-70% reduction for set1 (small length deviation) 90-98% reductions for set2, set3 (big length deviation) +COUNT FILTERING: > 99% reduction POSITION FILTERING: ~ 50% reduction (additionally) Columbia University Computer Science Dept.
Effect of Filters (Candidate Set) – Best Q • For the given data sets q=2 worked best • q=2 is close to the theoretical approximations as well • q=2 is small enough to avoid, as much as possible, the space overhead for the auxiliary tables Columbia University Computer Science Dept.
Effect of Filters (Q-gram Join Size) • COUNT FILTERING is applied last (in HAVING clause) • For efficiency, it is important to have a small join size for the q-gram tables • LENGTH FILTERING cuts the size by a factor of 2 to 10 • +POSITION FILTERING cuts the size by a factor of ~100 Columbia University Computer Science Dept.
Extensions: Substring Joins • Substring approximate joins: • Length filtering is not applicable • Substring matches have fewer common q-grams (no “overflow” q-grams) • Position filtering is not directly applicable (it needs tuning depending on the substring location) • We also propose a fast in-memory filter (see paper) • The result is not trivial! • Exploiting position of q-grams extensively to find possible alignments Columbia University Computer Science Dept.
Extensions: Block Moves • Match “AT&T Corp” with “Corp AT&T” as a unit cost operation. • We can allow block moves: • Length filtering is still applicable • The threshold for count filtering is different • Position filtering is not applicable (the q-gram may move far away) Columbia University Computer Science Dept.
Conclusions • We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression • Our technique does not require modifying the underlying RDBMS • Our technique exploits the RDBMS's query optimizer • Our technique significantly outperforms existing approaches Many opportunities for improvements and future work! Columbia University Computer Science Dept.