Approximate String Joins in a Database (Almost) for Free

Modified by weiw Approximate String Joins in a Database (Almost) for Free Published in VLDB 2001

The Need for String Joins • Substantial amounts of data in existing DBMSs are strings • Often, there is a need to correlate data stored in different tables • Example: Find common customers across various customer databases Columbia University Computer Science Dept.

Problems with Exact String Joins • Typing mistakes (e.g., John vs. Jonh) • No standard way of recording string data • Standard equijoins do not “forgive such mistakes” = ∅ ⋈ Columbia University Computer Science Dept.

Google “britany spears” Approximate String Joins • We want to join tuples with “similar” string fields • Similarity measure: Edit Distance • Each Insertion, Deletion, Replacement increases distance by one Columbia University Computer Science Dept.

Edit Distance • Also known as “Levenshtein distance”: http://en.wikipedia.org/wiki/Levenshtein_distance • Distance is shortest sequence of edit commands that transform s to t. • Simplest set of operations: • Copy character from s over to t • Delete a character in s (cost 1) • Insert a character in t (cost 1) • Substitute one character for another (cost 1) Columbia University Computer Science Dept.

Levenstein distance - example • distance(“William Cohen”, “Willliam Cohon”) s alignment t op cost Columbia University Computer Science Dept.

Levenstein distance - example • distance(“William Cohen”, “Willliam Cohon”) s gap alignment t op cost Columbia University Computer Science Dept.

D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //delete D(i,j-1)+1 //insert = min Computing Levenstein distance - 1 D(i,j) = score of best alignment from s1..si to t1..tj Columbia University Computer Science Dept.

Computing Levenstein distance - 2 D(i,j) = score of best alignment from s1..si to t1..tj D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //delete D(i,j-1)+1 //insert = min (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j Columbia University Computer Science Dept.

Our Focus: Approximate String Joins over Relational DBMSs • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K • Solve the problem in a database-friendly way (if possible with an existing "vanilla" RDBMS) Columbia University Computer Science Dept.

Related Work • Data Cleaning • Hernandez and Stolfo, DMKD Journal, 2 (1), 1998 • Monge and Elkan, SIGMOD DMKD Workshop, 1997 • ... • Approximate String Matching • Baeza-Yates and Gonnet, SPIRE 1999 • Sutinen and Tarhio, ESA’95, CPM’96 • Smith and Waterman, Journal of Molecular Biology 147, 1981 • Ukkonen, TCS 92(1), 1992 • Ullman, Computer Journal, 1977 • … Columbia University Computer Science Dept.

Current Approaches for Processing Approximate String Joins No native support for approximate joins in RDBMSs Two existing (straightforward) solutions: • Join data outside of DBMS • Join data via user-defined functions (UDFs) inside the DBMS Columbia University Computer Science Dept.

Approximate String Joins outside of a DBMS • Export data • Join outside of DBMS • Import the result Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality Disadvantages: • Substantial amounts of data to be exported/imported • Cannot be easily integrated with further processing steps in the DBMS Columbia University Computer Science Dept.

Approximate String Joins with UDFs • Write a UDF to check if two strings match within distance K • Write an SQL statement that applies the UDF to the string pairs SELECT R.stringAttr, S.stringAttr FROM R, S WHERE edit_distance(R.stringAttr, S.stringAttr, K) Main advantage: Ease of implementation Main disadvantage: UDF applied to entire cross-product of relations Columbia University Computer Science Dept.

Our Approach: Approximate String Joins over an Unmodified RDBMS • Preprocess data and generate auxiliary tables • Perform join exploiting standard RDBMS capabilities Advantages • No modification of underlying RDBMS needed. • Can leverage the RDBMS query optimizer. • Much more efficient than the approach based on naive UDFs Columbia University Computer Science Dept.

Our Approach: Intuition and Roadmap • Intuition: • Similar strings have many common substrings • Use exact joins to perform approximate joins (current DBMSs are good for exact joins) • A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977] • Roadmap: • Break strings into substrings of length q (q-grams) • Perform an exact join on the q-grams • Find candidate string pairs based on the results • Check only candidate pairs with a UDF to obtain final answer Columbia University Computer Science Dept.

What is a “Q-gram”? • Q-gram: A sequence of q characters of the original string • Example for q=3 • String with length L → L + q - 1 q-grams • Similar stringshave a many common q-grams vacations { ##v #va vac aca cat ati tio ion ons ns$ s$$ } Columbia University Computer Science Dept.

Q-grams and Edit Distance Operations • With no edits: L + q - 1 common q-grams • Replacement: (L + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$} • Insertion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$} • Deletion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$} Columbia University Computer Science Dept.

“Q-gram Distance” and Edit Distance • In the “q-gram space” the distance for a pair of strings (S1,S2) is defined as: • |Maximum number of q-grams| - |Common q-grams| • Each Edit Distance operation affects a q-gram in the following ways: • It destroys it, or • It leaves it intact, or • It shifts it by one position • Each Edit Distance operation destroys at most q q-grams • For Edit Distance = K, the “q-gram distance” is at most Kq Columbia University Computer Science Dept.

Number of Common Q-grams and Edit Distance • For Edit Distance = K, there could be at most K replacements, insertions, deletions • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common • But no the other way around! (e.g.?) • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals) Columbia University Computer Science Dept.

Using a DBMS for Q-gram Joins • If we have the q-grams in the DBMS, we can perform this counting efficiently. • Create auxiliary tables with tuples of the form: <sid, strlen, qgram> and join these tables • A GROUP BY – HAVING COUNT clause can perform the counting / filtering Columbia University Computer Science Dept.

Eliminating Candidate Pairs: COUNT FILTERING SQL for this filter: (parts omitted for clarity) SELECT R.sid, S.sid FROM R, S WHERER.qgram=S.qgram GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives. Columbia University Computer Science Dept.

Eliminating Candidate Pairs Further: LENGTH FILTERING Strings with length difference larger than K cannot be within Edit Distance K SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as LENGTH FILTERING Columbia University Computer Science Dept.

Exploiting Q-gram Positions for Filtering • Consider strings aabbzzaacczz and aacczzaabbzz • Strings are at edit distance 4 • Strings have identical q-grams for q=3 Problem: Matching q-grams that are at different positions in both strings • Either q-grams do not "originate" from same q-gram, or • Too many edit operations "caused" spurious q-grams at various parts of strings to match Columbia University Computer Science Dept.

POSITION FILTERING - Filtering using positions • Keep the position of the q-grams <sid, strlen, pos, qgram> • Do not match q-grams that are more than K positions away SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K AND abs(R.pos - S.pos)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as POSITION FILTERING Columbia University Computer Science Dept.

The Actual, Complete SQL Statement SELECT R1.string, S1.string, R1.sid, S1.sid FROM R1, S1, R, S, WHERE R1.sid=R.sid AND S1.sid=S.sid AND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=K AND abs(R.pos - S.pos)<=K GROUP BY R1.sid, S1.sid, R1.string, S1.string HAVING COUNT(*) >= (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q Columbia University Computer Science Dept.

Note: (1) we do not add #/$ to save space (2) still need to go thru edit distance check via UDF Example (K = 1) Question: what if S3 = dxb? R1 S1 Join R S gram |R.len-S.len|≤K |R.pos-S.pos|≤K Group-By count(*) ≥ mincount mincount = max(4,3)-1-1*2 = 1 mincount = max(3,3)-1-1*2 = 0

Experimental Results: Data • Three sets of customer data from AT&T Worldnet Service • 3 customer data sets from AT&T Worldnet: • (a) set1 with about 40K strings • (b) set2 and (c) set3 with about 30K strings each Columbia University Computer Science Dept.

DBMS Setup • Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server • Materialized the q-gram tables with entries <sid, qgram, pos> (less then 2 minutes per table) • Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index) The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible Columbia University Computer Science Dept.

Query Plans Generated by RDBMS • Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets) • Q-gram approach: usually sort-merge joins • In our prototype implementation, sort-merge joins is the fastest as well Columbia University Computer Science Dept.

Naïve UDFs vs. Filtering For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs Columbia University Computer Science Dept.

Effect of Filters (Candidate Set) LENGTH FILTERING: 40-70% reduction for set1 (small length deviation) 90-98% reductions for set2, set3 (big length deviation) +COUNT FILTERING: > 99% reduction POSITION FILTERING: ~ 50% reduction (additionally) Columbia University Computer Science Dept.

Effect of Filters (Candidate Set) – Best Q • For the given data sets q=2 worked best • q=2 is close to the theoretical approximations as well • q=2 is small enough to avoid, as much as possible, the space overhead for the auxiliary tables Columbia University Computer Science Dept.

Effect of Filters (Q-gram Join Size) • COUNT FILTERING is applied last (in HAVING clause) • For efficiency, it is important to have a small join size for the q-gram tables • LENGTH FILTERING cuts the size by a factor of 2 to 10 • +POSITION FILTERING cuts the size by a factor of ~100 Columbia University Computer Science Dept.

Conclusions • We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression • Our technique does not require modifying the underlying RDBMS • Our technique exploits the RDBMS's query optimizer • Our technique significantly outperforms existing approaches Many opportunities for improvements and future work! Columbia University Computer Science Dept.

Approximate String Joins in a Database (Almost) for Free

Approximate String Joins in a Database (Almost) for Free

Presentation Transcript

Efficient Approximate Search on String Collections Part II

The STRING database

Faster Approximate String Matching over Compressed Text

Approximate String Matching using Compressed Suffix Arrays

XML data management and approximate string matching

Approximate String Matching

The Flamingo Software Package on Approximate String Queries

A String in the Labyrinth

JOINs

Joins in Hadoop

Efficient Approximate Search on String Collections Part I

Approximate Mean Value Analysis of a Database Grid Application

Two Different Approximate String Matching Problems and Their Algorithms

JOINS

String Similarity Measures and Joins with Synonyms

Fast Approximate Database Searching of Polypeptide Structures

Class string and String Stream Processing

Approximate Boyer-Moore String Matching

XML data management and approximate string matching

Approximate String Matching

Joins

Efficient Approximate Search on String Collections Part II