730 likes | 840 Views
Searching and Integrating Information on the Web. Seminar 3: Data Cleansing Professor Chen Li UC Irvine. Paper readings. Efficient merge and purge: Hernandez and Stolfo, SIGMOD 1995 Approximate String Joins in a Database (Almost) for Free, Gravano et al, VLDB 2001,
E N D
Searching and Integrating Information on the Web Seminar 3: Data Cleansing Professor Chen Li UC Irvine
Paper readings • Efficient merge and purge: Hernandez and Stolfo, SIGMOD 1995 • Approximate String Joins in a Database (Almost) for Free, Gravano et al, VLDB 2001, • Efficient Record Linkage in Large Data Sets, Liang Jin, Chen Li, Sharad Mehrotra, DASFAA, 2003 • Sunita Sarawagi Anuradha Bhamidipaty, Interactive Deduplication Using Active Learning. Sarawagi and Bhamidipaty, KDD 2003 Seminar 3
Motivation • Correlate data from different data sources (e.g., data integration) • Data is often dirty • Needs to be cleansed before being used • Example: • A hospital needs to merge patient records from different data sources • They have different formats, typos, and abbreviations Seminar 3
Example Table R Table S • Find records from different datasets that could be the same entity Seminar 3
Another Example • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981) • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981 Seminar 3
Record linkage Problem statement: “Given two relations, identify the potentially matched records • Efficiently and • Effectively” Seminar 3
Challenges • How to define good similarity functions? • Many functions proposed (edit distance, cosine similarity, …) • Domain knowledge is critical • Names: “Wall Street Journal” and “LA Times” • Address: “Main Street” versus “Main St” • How to do matching efficiently • Offline join version • Online (interactive) search • Nearest search • Range search Seminar 3
Outline • Supporting string-similarity joins using RDBMS • Using mapping techniques • Interactive deduplication Seminar 3
Edit Distance • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Seminar 3
Approximate String Joins • We want to join tuples with “similar” string fields • Similarity measure: Edit Distance • Each Insertion, Deletion, Replacement increases distance by one Seminar 3
Focus: Approximate String Joins over Relational DBMSs • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K • Solve the problem in a database-friendly way (if possible with an existing "vanilla" RDBMS) Seminar 3
Current Approaches for Processing Approximate String Joins No native support for approximate joins in RDBMSs Two existing (straightforward) solutions: • Join data outside of DBMS • Join data via user-defined functions (UDFs) inside the DBMS Seminar 3
Approximate String Joins outside of a DBMS • Export data • Join outside of DBMS • Import the result Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality Disadvantages: • Substantial amounts of data to be exported/imported • Cannot be easily integrated with further processing steps in the DBMS Seminar 3
Approximate String Joins with UDFs • Write a UDF to check if two strings match within distance K • Write an SQL statement that applies the UDF to the string pairs SELECT R.stringAttr, S.stringAttr FROM R, S WHERE edit_distance(R.stringAttr, S.stringAttr, K) Main advantage: Ease of implementation Main disadvantage: UDF applied to entire cross-product of relations Seminar 3
Our Approach: Approximate String Joins over an Unmodified RDBMS • Preprocess data and generate auxiliary tables • Perform join exploiting standard RDBMS capabilities Advantages • No modification of underlying RDBMS needed. • Can leverage the RDBMS query optimizer. • Much more efficient than the approach based on naive UDFs Seminar 3
Intuition and Roadmap • Intuition: • Similar strings have many common substrings • Use exact joins to perform approximate joins (current DBMSs are good for exact joins) • A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977] • Roadmap: • Break strings into substrings of length q (q-grams) • Perform an exact join on the q-grams • Find candidate string pairs based on the results • Check only candidate pairs with a UDF to obtain final answer Seminar 3
What is a “Q-gram”? • Q-gram: A sequence of q characters of the original string Example for q=3 vacations {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$} String with length L → L + q - 1 q-grams • Similar stringshave a many common q-grams Seminar 3
Q-grams and Edit Distance Operations • With no edits: L + q - 1 common q-grams • Replacement: (L + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$} • Insertion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$} • Deletion: (Lmax + q – 1) - q common q-grams Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$} Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$} Seminar 3
Number of Common Q-grams and Edit Distance • For Edit Distance = K, there could be at most K replacements, insertions, deletions • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals) Seminar 3
Using a DBMS for Q-gram Joins • If we have the q-grams in the DBMS, we can perform this counting efficiently. • Create auxiliary tables with tuples of the form: <sid, strlen, qgram> and join these tables • A GROUP BY – HAVING COUNT clause can perform the counting / filtering Seminar 3
Eliminating Candidate Pairs: COUNT FILTERING SQL for this filter: (parts omitted for clarity) SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives. Seminar 3
Eliminating Candidate Pairs Further: LENGTH FILTERING Strings with length difference larger than K cannot be within Edit Distance K SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as LENGTH FILTERING Seminar 3
Exploiting Q-gram Positions for Filtering • Consider strings aabbzzaacczz and aacczzaabbzz • Strings are at edit distance 4 • Strings have identical q-grams for q=3 Problem: Matching q-grams that are at different positions in both strings • Either q-grams do not "originate" from same q-gram, or • Too many edit operations "caused" spurious q-grams at various parts of strings to match Seminar 3
POSITION FILTERING - Filtering using positions • Keep the position of the q-grams <sid, strlen, pos, qgram> • Do not match q-grams that are more than K positions away SELECT R.sid, S.sid FROM R, S WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K AND abs(R.pos - S.pos)<=K GROUP BY R.sid, S.sid HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q We refer to this filter as POSITION FILTERING Seminar 3
The Actual, Complete SQL Statement SELECT R1.string, S1.string, R1.sid, S1.sid FROM R1, S1, R, S, WHERE R1.sid=R.sid AND S1.sid=S.sid AND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=K AND abs(R.pos - S.pos)<=K GROUP BY R1.sid, S1.sid, R1.string, S1.string HAVING COUNT(*) >= (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q Seminar 3
Summary of 1st paper • Introduced a technique for mapping approximate string joins into a “vanilla” SQL expression • Our technique does not require modifying the underlying RDBMS Seminar 3
Outline • Supporting string-similarity joins using RDBMS • Using mapping techniques • Interactive deduplication Seminar 3
Single-attribute Case • Given • two sets of strings, R and S • a similarity function f between strings (metric space) • Reflexive: f(s1,s2) = 0 iff s1=s2 • Symmetric: f(s1,s2) = d(s2, s1) • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3) • a threshold k • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k. R S Seminar 3
Nested-loop? • Not desirable for large data sets • 5 hours for 30K strings! Seminar 3
Our 2-step approach • Step 1: map strings (in a metric space) to objects in a Euclidean space • Step 2: do a similarity join in the Euclidean space Seminar 3
Advantages • Applicable to many metric similarity functions • Use edit distance as an example • Other similarity functions also tried, e.g., q-gram-based similarity • Open to existing algorithms • Mapping techniques • Join techniques Seminar 3
Step 1 Map strings into a high-dimensional Euclidean space Metric Space Euclidean Space Seminar 3
Mapping: StringMap • Input: A list of strings • Output: Points in a high-dimensional Euclidean space that preserve the original distances well • A variation of FastMap • Each step greedily picks two strings (pivots) to form an axis • All axes are orthogonal Seminar 3
Can it preserve distances? • Data Sources: • IMDB star names: 54,000 • German names: 132,000 • Distribution of string lengths: Seminar 3
Can it preserve distances? • Use data set 1 (54K names) as an example • k=2, d=20 • Use k’=5.2 to differentiate similar and dissimilar pairs. Seminar 3
Choose Dimensionality d Increase d? • Good : • better to differentiate similar pairs from dissimilar ones. • Bad : • Step 1: Efficiency ↓ • Step 2: “curse of dimensionality” Seminar 3
# of pairs within distance w Cost= # of similar pairs Choose dimensionality d using sampling • Sample 1Kx1K strings, find their similar pairs (within distance k) • Calculate maximum of their new distances w • Define “Cost” of finding a similar pair: Seminar 3
Choose Dimensionality d d=15 ~ 25 Seminar 3
Choose new threshold k’ • Closely related to the mapping property • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’. • Choose k’ using sampling • Sample 1Kx1K strings, find similar pairs • Calculate their maximum new distance as k’ • repeat multiple times, choose their maximum Seminar 3
New threshold k’ in step 2 d=20 Seminar 3
Step 2: Similarity Join • Input: Two sets of points in Euclidean space. • Output: Pairs of two points whose distance is less than new threshold k’. • Many join algorithms can be used Seminar 3
Example • Adopted an algorithm by Hjaltason and Samet. • Building two R-Trees. • Traverse two trees, find points whose distance is within k’. • Pruning during traversal (e.g., using MinDist). Seminar 3
Final processing • Among the pairs produced from the similarity-join step, check their edit distance. • Return those pairs satisfying the threshold k Seminar 3
Running time Seminar 3
Recall • Recall: (#of found similar pairs)/(#of all similar pairs) Seminar 3
Multi-attribute linkage • Example: title + name + year • Different attributes have different similarity functions and thresholds • Consider merge rules in disjunctive format: Seminar 3
Evaluation strategies • Many ways to evaluate rules • Finding an optimal one: NP-hard • Heuristics: • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct. • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds. Seminar 3
Summary of 2nd paper • A novel two-step approach to record linkage. • Many existing mapping and join algorithms can be adopted • Applicable to many distance metrics. • Time and space efficient. • Multi-attribute case studied Seminar 3
Outline • Supporting string-similarity joins using RDBMS • Using mapping techniques • Interactive deduplication Seminar 3
Matching Functions Calculate similarity scores, thresholds Tedious coding Learning-based Methods Require large-sized training set for accuracy (static training set) Difficult to provide a covering and challenging training set that will bring out the subtlety of deduplication function Problems with Existing Deduplication Methods Seminar 3