VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007. Chen Li (UC Irvine), Bin Wang (Northeastern University), Xiaochun Yang (Northeastern University). Presented by Jae-won Lee.
Introduction
• Many applications have an increasing need to support approximate string queries on data collections
• Examples of approximate string queries
  • Data cleaning: the same entity can be represented in slightly different forms
    • “PO BOX 23” and “P.O. Box 23”
  • Query relaxation: errors in the query, inconsistencies in the data, limited knowledge about the data
    • “Steven Spielburg” and “Steve Spielberg”
  • Spellchecking: find potential candidates for a possibly mistyped word
Center for E-Business Technology
Introduction
• Dilemma of Choosing Gram Length
  • The gram length can greatly affect the performance of string matching
  • Increasing the gram length
    • Makes the inverted lists shorter, which may decrease the time to merge them
    • Lowers the threshold on the number of common grams, which makes the filter less selective
[Figure: inverted lists of 2-grams (at, ch, ck, ic, ri, st, ta, ti, tu, uc) and 3-grams (ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck) over the strings rich (0), stick (1), stich (2), stuck (3), static (4); # of common grams ≥ 3 for 2-grams, ≥ 1 for 3-grams]
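The trade-off can be made concrete with a small sketch. It builds fixed-length q-gram inverted lists over the five example strings and evaluates the standard count-filter bound (|s| - q + 1) - k * q. The helper names `build_index` and `lower_bound`, and the choice of a length-6 query with k = 1, are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # string ids 0..4

def build_index(strings, q):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i + q]].add(sid)
    return {gram: sorted(ids) for gram, ids in index.items()}

def lower_bound(length, q, k):
    """Fewest q-grams a string of this length must share with any string
    within edit distance k (the usual count-filter bound)."""
    return (length - q + 1) - k * q

for q in (2, 3):
    index = build_index(STRINGS, q)
    avg_len = sum(len(ids) for ids in index.values()) / len(index)
    # Longer grams: more, shorter lists, but a weaker (smaller) bound.
    print(q, len(index), round(avg_len, 2), lower_bound(6, q, 1))
```

Under these assumptions the bound drops from 3 to 1 when q goes from 2 to 3, which agrees with the thresholds shown in the figure.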
VGRAM: Main Idea
• We analyze the frequencies of variable-length grams in the strings and select a set of grams, called the gram dictionary
• For a string, we generate a set of grams of variable lengths using the gram dictionary
• Challenges
  • How to generate variable-length grams?
  • How to construct a high-quality gram dictionary?
  • What is the relationship between the similarity of two strings and the similarity of their gram sets?
  • How to adopt VGRAM in existing algorithms?
Challenge 1: Generating Variable-Length Grams
• Example
  • String s = universal
  • D = {ni, ivr, sal, uni, vers}
  • qmin = 2, qmax = 4
• Start with position p = 1 and VG = {}
• The longest substring starting at u that appears in D is uni, so the algorithm inserts (1, uni) into VG
• Move to the next character n; the longest substring is ni
  • However, this candidate (2, ni) is subsumed by the previous one, so the algorithm does not insert it into VG
• Move to the next character i; no substring starting at this character matches a gram in D, so the algorithm produces (3, iv) of length qmin = 2
• Final set VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
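The walk-through above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `vgen` is chosen here, a plain set lookup stands in for whatever dictionary structure the real system uses, and "subsumed" is taken to mean that the candidate's character range lies inside the range of a gram already in VG.

```python
def vgen(s, dictionary, qmin, qmax):
    """Generate variable-length positional grams (1-based positions)."""
    grams = []
    for p in range(1, len(s) - qmin + 2):
        start = p - 1
        t = None
        # Look for the longest dictionary gram starting at position p.
        for length in range(min(qmax, len(s) - start), qmin - 1, -1):
            candidate = s[start:start + length]
            if candidate in dictionary:
                t = candidate
                break
        if t is None:
            # No match: fall back to the gram of minimum length.
            t = s[start:start + qmin]
        end = p + len(t) - 1
        # Skip the candidate if an existing gram already covers its range.
        subsumed = any(p0 <= p and end <= p0 + len(g0) - 1 for p0, g0 in grams)
        if not subsumed:
            grams.append((p, t))
    return grams

D = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", D, qmin=2, qmax=4))
# [(1, 'uni'), (3, 'iv'), (4, 'vers'), (7, 'sal')]
```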
Challenge 2: Constructing Gram Dictionary
• Step 1: Collecting gram frequencies with length in [qmin = 2, qmax = 4]
[Figure: trie of grams with leaf nodes; labelled entries: st (0, 1, 3), sti (0, 1), stu (3), stic (0, 1), stuc (3)]
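As a rough illustration of Step 1, the sketch below counts every gram whose length lies in [qmin, qmax]. It makes two simplifying assumptions: a flat `Counter` stands in for the trie on the slide, and the frequency of a gram is its total number of occurrences. The input reuses the five strings from the Introduction, since the strings behind the trie figure are not given in the text.

```python
from collections import Counter

def gram_frequencies(strings, qmin, qmax):
    """Count occurrences of all grams with length in [qmin, qmax]."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq

freq = gram_frequencies(["rich", "stick", "stich", "stuck", "static"], 2, 4)
for gram in ("st", "sti", "stic", "stuc"):
    print(gram, freq[gram])
```

A trie stores the same counts while sharing prefixes, which is what makes the pruning in Step 2 a matter of removing subtrees.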
Challenge 2: Constructing Gram Dictionary
• Step 2: Selecting High-Quality Grams
  • If a gram g has a low frequency, we eliminate from the tree all the extended grams of g
  • If a gram is very frequent, we keep some of its extended grams
Challenge 2: Constructing Gram Dictionary
• Pruning the tree using a frequency threshold T = 2
• If the frequency of a node (one that has a leaf-node child) is ≤ T, its extended grams are removed
Challenge 2: Constructing Gram Dictionary
• Pruning the tree using a frequency threshold T = 2
• If the frequency of a node (one that has a leaf-node child L) is > T
  • A pruning policy is used to select a maximal subset of children to remove
    • SmallFirst: choose children with the smallest frequencies
    • LargeFirst: choose children with the largest frequencies
    • Random: randomly choose children so that L.freq is not greater than T
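The three policies differ only in the order in which children are considered. The sketch below is one possible reading, under a stated assumption: the frequencies of removed children are folded into the leaf-node frequency L.freq, and a child may be removed only while L.freq stays at or below T. The function name `select_children` and its parameters are hypothetical.

```python
import random

def select_children(child_freqs, leaf_freq, threshold, policy, rng=None):
    """Return (labels of removed children, resulting leaf frequency)."""
    items = list(child_freqs.items())
    if policy == "SmallFirst":
        items.sort(key=lambda kv: kv[1])
    elif policy == "LargeFirst":
        items.sort(key=lambda kv: kv[1], reverse=True)
    elif policy == "Random":
        (rng or random).shuffle(items)
    else:
        raise ValueError(f"unknown policy: {policy}")
    removed = []
    for label, f in items:
        # Assumption: removing a child adds its frequency to the leaf.
        if leaf_freq + f <= threshold:
            removed.append(label)
            leaf_freq += f
    return removed, leaf_freq

children = {"a": 1, "b": 3, "c": 2}
print(select_children(children, 0, 3, "SmallFirst"))
print(select_children(children, 0, 3, "LargeFirst"))
```

Because the loop keeps scanning after a child that does not fit, the selected subset is maximal for the chosen order: no remaining child could be added without exceeding T.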
Challenge 3: Similarity of Gram Sets
• Analyze the effect of an edit operation at the i-th character on the positional grams
• These effects are stored in the NAG vector (the vector of the number of affected grams)
• Category 1: positional gram (p, g) with p < i-qmax+1 or p+|g|-1 > i+qmax-1
• Category 2: positional gram (p, g) with p ≤ i ≤ p+|g|-1
• Category 3: positional gram (p, g) on the left of the i-th character
• Category 4: positional gram (p, g) on the right of the i-th character
[Figure: string s with a deletion at position i; Category 2 covers position i, Categories 3 and 4 lie to its left and right within i-qmax+1 and i+qmax-1, and Category 1 lies outside that window on both sides]
Challenge 3: Similarity of Gram Sets
• Example
  • s = universal, D = {ni, ivr, sal, uni, vers}, qmin = 2, qmax = 4
  • VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
• A deletion on the 5th character e in the string s
  • i-qmax+1 = 2, i+qmax-1 = 8
• Positional grams (1, uni) and (7, sal) are in category 1
  • The starting position is before 2, or the ending position is after 8
  • These grams are not affected by the deletion
• (4, vers) is in category 2
• (3, iv) is in category 3
  • Since there is an extension of iv in D (ivr), (3, iv) could be affected by the deletion (potentially affected)
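The four categories reduce to a few comparisons on the start and end positions of a gram. The sketch below applies them to the deletion example above; the function name `categorize` is chosen here, and positions are 1-based as on the slides.

```python
def categorize(p, gram, i, qmax):
    """Category (1-4) of positional gram (p, gram) for an edit at position i."""
    end = p + len(gram) - 1
    if p < i - qmax + 1 or end > i + qmax - 1:
        return 1  # outside the window: not affected
    if p <= i <= end:
        return 2  # contains the i-th character
    # Inside the window but not covering i: left (3) or right (4) of it.
    return 3 if end < i else 4

VG = [(1, "uni"), (3, "iv"), (4, "vers"), (7, "sal")]
for p, g in VG:
    print((p, g), categorize(p, g, i=5, qmax=4))
```

Categories 1 and 2 cannot overlap, because a gram that both covers position i and reaches outside the window would have to be longer than qmax.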
Challenge 3: Similarity of Gram Sets
• # of grams affected by each operation
• We want to transform string s into string s’ with 2 edit operations
  • At most 4 grams can be affected
[Figure: number of grams affected at each position of s = universal, for a deletion/substitution at a character and for an insertion at a gap (_ u _ n _ i _ v _ e _ r _ s _ a _ l _); values as extracted: 1 1 1 1 2 1 1 1 1 1 0 0 1 1 2 1 1 2 1; axes: # of edit operations vs. # of grams]
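One way to arrive at the bound of 4 is to assume that k edit operations affect at most as many grams as the k largest per-position counts added together. The sketch below makes that assumption explicit. The counts are copied from the figure residue on this slide; their left-to-right order is uncertain, but the result does not depend on it. The name `nag_bound` is hypothetical.

```python
import heapq

# Per-position counts of affected grams, as extracted from the slide.
COUNTS = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 1, 1, 2, 1, 1, 2, 1]

def nag_bound(per_position_counts, k):
    """Upper bound on grams affected by k edits: sum of the k largest counts.

    Assumption: each edit is charged to the worst remaining position."""
    return sum(heapq.nlargest(k, per_position_counts))

print(nag_bound(COUNTS, 2))  # 4, matching "at most 4 grams" on the slide
```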
Challenge 4: Adopting the VGRAM Technique
• Example of an algorithm based on inverted lists
• Query: EditDistance(shtick, ?) ≤ 1
• With variable-length (2-4) grams
  • VG(q) = {(1, sh), (2, ht), (3, tick)}, extracted using the gram dictionary
  • Lower bound on # of common grams = |VG(q)| - NAG(q, k) = 3 - 2 = 1
• With fixed-length 2-grams
  • Lower bound on # of common grams = (|s1| - q + 1) - k * q = (6 - 2 + 1) - 1 * 2 = 3
[Figure: inverted lists for the variable-length (2-4) grams (… ck, ic, ich, … tic, tick …) and for the 2-grams (… ck, ic, … ti …) over the strings rich (0), stick (1), stich (2), stuck (3), static (4)]
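The fixed-length side of this example can be run end to end. The sketch below is a simplified count filter, not the algorithm from the paper: it counts, for each string, how many of the query's 2-grams appear in its inverted lists and keeps the strings that reach the lower bound. It treats grams as sets (ignoring repeated grams), and `qgrams` and `count_filter` are names chosen here. The last lines reproduce the VGRAM bound from the slide using its given values |VG(q)| = 3 and NAG(q, 1) = 2.

```python
from collections import Counter

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # string ids 0..4

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_filter(query, strings, q, k):
    """Ids of strings sharing at least (|query| - q + 1) - k*q grams with query."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        return list(range(len(strings)))  # the bound filters nothing
    index = {}
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):  # simplification: ignore multiplicity
            index.setdefault(g, []).append(sid)
    hits = Counter()
    for g in set(qgrams(query, q)):
        hits.update(index.get(g, []))
    return [sid for sid, c in sorted(hits.items()) if c >= threshold]

print(count_filter("shtick", STRINGS, q=2, k=1))  # [1], i.e. "stick"

vg_q = [(1, "sh"), (2, "ht"), (3, "tick")]  # from the slide
nag_q_1 = 2                                 # NAG(q, 1), from the slide
print(len(vg_q) - nag_q_1)                  # 1
```

Candidates that pass the filter still need an exact edit-distance check; the filter only guarantees that no true answer is dropped.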
Experiments
• Data Sets
  • Data set 1: Texas Real Estate Commission
    • 151K person names, average length = 33
  • Data set 2: English dictionary from the Aspell spellchecker for Cygwin
    • 149,165 words, average length = 8
  • Data set 3: DBLP Bibliography
    • 277K titles, average length = 62
VGRAM Overhead
• Data set 3
[Figures: Index Size; Construction Time]
Benefits of Using Variable-Length Grams
• Data set 1
[Figures: Construction Time/Size; Query Time]
Effect of qmax
• Data set 1
[Figures: Construction Time / Query Time; Query Performance]
Effect of Frequency Threshold
• Data set 1
[Figures: Index Size; Query Time; Construction Time]
Conclusion
• We developed VGRAM to improve the performance of approximate string queries
  • Variable-length grams, high-quality grams
• We gave a full specification of the technique
  • Index structure
  • How to generate grams for a string using the index structure
  • Relationship between the similarity of two strings and the similarity of their grams
• We showed how to adopt this technique in a variety of existing algorithms