Using a cost-based approach, this study explores the selection of variable-length grams to support efficient approximate queries in string collections. The study analyzes the effects of adding grams on index and query performance, and proposes a quantitative approach for constructing high-quality gram dictionaries.
Northeastern University, China • Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently • Xiaochun Yang, Bin Wang, Chen Li
Approximate selection queries • Example query: Schwarrzenger • Query errors: limited knowledge about data, typos, limited input device (cell phone) • Data errors: typos, web data, OCR • Similarity functions: edit distance, Jaccard, cosine, … • Applications: spellchecking, query relaxation, …
Performance is a big issue • Answer queries interactively • Many queries on a server
Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments
q-grams • String: bingon • 2-grams: bi, in, ng, go, on
q-gram inverted lists (2-grams) • Strings (id: string): 1: bingo, 2: bioinng, 3: bitingin, 4: biting, 5: boing, 6: going
Query processing (2-grams) • Strings (id: string): 1: bingo, 2: bioinng, 3: bitingin, 4: biting, 5: boing, 6: going • Query: ED(bingon, ?) ≤ 1 • # of common grams ≥ 3
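The query-processing slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`qgrams`, `build_index`, `candidates`, `edit_distance`) are hypothetical, no padding characters are used, and each inverted list stores a string id at most once. The threshold follows the fixed-length bound, (# of query grams) - k·q, which gives 5 - 1·2 = 3 for bingon.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Overlapping q-grams of s, without padding characters."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted lists: gram -> set of ids of strings containing that gram."""
    index = defaultdict(set)
    for sid, s in strings.items():
        for g in qgrams(s, q):
            index[g].add(sid)
    return index

def candidates(query, k, index, q=2):
    """Count filter: keep ids sharing at least (#query grams - k*q) grams."""
    grams = qgrams(query, q)
    threshold = len(grams) - k * q
    counts = defaultdict(int)
    for g in grams:
        for sid in index.get(g, ()):
            counts[sid] += 1
    return threshold, sorted(sid for sid, c in counts.items() if c >= threshold)

def edit_distance(a, b):
    """Standard dynamic-programming edit distance, used to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# String collection from the slides.
strings = {1: "bingo", 2: "bioinng", 3: "bitingin",
           4: "biting", 5: "boing", 6: "going"}
index = build_index(strings)

threshold, cands = candidates("bingon", 1, index)
answers = [sid for sid in cands if edit_distance("bingon", strings[sid]) <= 1]
print(threshold, cands, answers)
```

The filter only prunes: every surviving candidate still has to be verified with the real edit distance, which is why a tighter lower bound (fewer candidates) translates directly into less verification work.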
VGRAM: variable-length grams [VLDB07] • [2,3]-gram dictionary [Figure: trie of the gram dictionary, nodes numbered 1 to 33] • Example string: bingon
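As a rough sketch of how a gram dictionary turns a string into variable-length grams: at each start position take the longest dictionary gram of length qmin to qmax, fall back to the plain qmin-gram if none matches, and drop any gram that lies entirely inside one already produced. The function name `vgrams` and the one-entry dictionary `{"ing"}` are assumptions made for illustration; the dictionary on the slide is stored as a trie and its contents are not recoverable here.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Decompose s into positional variable-length grams (start, gram).

    Assumed rule: longest dictionary match of length qmin..qmax at each
    position, defaulting to the qmin-gram; subsumed grams are skipped.
    """
    grams = []
    last_end = 0  # exclusive end of the furthest-reaching gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # default when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram)
        if end > last_end:  # not contained in a previously emitted gram
            grams.append((p, gram))
            last_end = end
    return grams

# Hypothetical [2,3]-gram dictionary containing a single 3-gram.
print(vgrams("bingon", {"ing"}, 2, 3))
# With an empty dictionary the result is the ordinary 2-grams.
print(vgrams("bingon", set(), 2, 3))
```

With the 3-gram ing in the dictionary, the 2-gram ng disappears because it is contained in ing, so bingon yields four grams instead of five.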
Adopting VGRAM in algorithms • [Figure labels: string, gram dictionary, grams, VGRAM lower bound; example string bingon with # of common grams ≥ 3]
Contributions of this study • Tightening lower bounds using dynamic programming • Cost-based quantitative approach • Analyze and estimate query performance when adding each gram • Automatically find high-quality grams • [Figure labels: string collection, high-quality gram, gram dictionary]
Calculating lower bound: fixed length (q) • Example string: biinding • If ed(s1, s2) ≤ k, then # of common grams ≥ (# of grams of s1) - k·q
Calculating lower bound: variable lengths • Example string: biinding, with per-character values 2 2 2 1 1 3 3 1 • lower bound = (# of grams of s1) - NAG(s1, k)
Too pessimistic? • k-Max: summation of the k largest values • Example string: biinding, with per-character values 2 2 2 1 1 3 3 1 • NAG(s, 2) = 3 + 3 = 6
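The k-Max estimate is simple enough to state as code. The sketch below uses the per-character values shown for biinding; the helper names `nag_kmax` and `lower_bound` are hypothetical, and clamping the bound at zero is an added assumption, not something stated on the slides.

```python
def nag_kmax(affected, k):
    """k-Max estimate of NAG(s, k): sum of the k largest per-character values."""
    return sum(sorted(affected, reverse=True)[:k])

def lower_bound(num_grams, nag):
    """Lower bound on common grams: (# of grams of s1) - NAG(s1, k).
    Clamped at 0 here (an assumption) since a negative bound prunes nothing."""
    return max(0, num_grams - nag)

# Per-character values for "biinding" from the slide.
affected = [2, 2, 2, 1, 1, 3, 3, 1]
nag_vector = [nag_kmax(affected, k) for k in range(3)]
print(nag_vector)
```

k-Max adds the two largest values even when they belong to neighbouring characters whose affected grams overlap, so the same gram can be counted twice. That double counting is what makes the estimate pessimistic.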
Tightening lower bound • Dynamic programming: tightening NAG(s, k) • Subproblems: NAG(s[1, j], i) • [Figure: string s with positions 1 to j and the i-th edit operation op_i]
Dynamic programming • Recurrence function • [Figure: string s with positions 1 to j, the value B[j], and the edit operations op_{i-1} and op_i]
Dynamic programming • Example string: biinding, with per-character values 2 2 2 1 1 3 3 1 • NAG vector computed for k = 0, k = 1, k = 2
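The recurrence from the slides is not reproduced in this transcript, so the following is only a sketch of the idea under a deliberately simplified model: an edit at character position j is assumed to affect exactly the grams that cover j, and the dynamic program counts each affected gram once. The names `coverage`, `nag_kmax`, and `nag_dp` are hypothetical, and the model ignores the subtler effects of insertions and of the gram dictionary that the actual method accounts for.

```python
def coverage(n, grams):
    """Number of grams covering each position; grams are (start, length)."""
    return [sum(1 for st, ln in grams if st <= j < st + ln) for j in range(n)]

def nag_kmax(n, grams, k):
    """k-Max estimate: sum of the k largest per-position counts."""
    return sum(sorted(coverage(n, grams), reverse=True)[:k])

def nag_dp(n, grams, k):
    """Max number of distinct grams hit by at most k edit positions."""
    def fresh(j, prev):
        # Grams covering j that start after prev. Because grams are
        # contiguous, these are exactly the grams not hit by any
        # chosen position at or before prev.
        return sum(1 for st, ln in grams if prev < st <= j < st + ln)

    if k == 0 or n == 0:
        return 0
    # best[j]: most grams hit with the current number of edits, rightmost at j.
    best = [fresh(j, -1) for j in range(n)]
    answer = max(best)
    for _ in range(2, k + 1):
        best = [max((best[p] + fresh(j, p) for p in range(j)), default=0)
                for j in range(n)]
        answer = max(answer, max(best))
    return answer

# 3-grams of the 6-character string "bingon": bin, ing, ngo, gon.
grams3 = [(i, 3) for i in range(4)]
print(nag_kmax(6, grams3, 2), nag_dp(6, grams3, 2))
```

On this example k-Max reports 6 affected grams for k = 2 although the string has only four grams in total, while the dynamic program reports 4. A smaller NAG value gives a larger lower bound on common grams and therefore fewer candidates.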
Effects on inverted lists • Gram dictionary before: ab, bc • Add gram abc • Gram dictionary after: ab, bc, abc • [Figure: inverted lists for strings of the form --abc--, --ab--, and --bc--]
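To make the effect concrete, the sketch below builds inverted lists before and after adding abc to the dictionary. Everything here is illustrative: the three strings are invented stand-ins for the --abc--, --ab--, and --bc-- cases, and `vgrams` is a hypothetical decomposition helper (longest dictionary match, qmin-gram fallback, subsumed grams skipped), not the authors' code.

```python
from collections import defaultdict

def vgrams(s, dictionary, qmin, qmax):
    """Variable-length grams of s under the assumed longest-match rule."""
    grams, last_end = [], 0
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback qmin-gram
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if p + len(gram) > last_end:  # skip grams inside an earlier gram
            grams.append(gram)
            last_end = p + len(gram)
    return grams

def inverted_lists(strings, dictionary, qmin=2, qmax=3):
    """gram -> sorted list of ids of strings whose decomposition has that gram."""
    lists = defaultdict(set)
    for sid, s in strings.items():
        for g in vgrams(s, dictionary, qmin, qmax):
            lists[g].add(sid)
    return {g: sorted(ids) for g, ids in lists.items()}

# Hypothetical strings: one containing abc, one only ab, one only bc.
strings = {1: "xabcy", 2: "xaby", 3: "xbcy"}

before = inverted_lists(strings, {"ab", "bc"})
after = inverted_lists(strings, {"ab", "bc", "abc"})
for g in ("ab", "bc", "abc"):
    print(g, before.get(g, []), "->", after.get(g, []))
```

String 1 moves out of the ab and bc lists and into the new abc list, so the existing lists shrink while one new list is created. Strings that contain only ab or only bc are untouched.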
Effects on query performance • Shorten (or leave unchanged) the query’s inverted lists • Change the lower bound • Change the # of candidates
Effects on query’s inverted lists • Gram dictionary before: ab, bc • Add gram abc • Gram dictionary after: ab, bc, abc • For a query Q, adding the new gram abc either leaves the query’s inverted lists unchanged or shortens them
Effects on lower bound • Query: Q, ED(Q, ?) ≤ 1
Effects on # of candidates • A change in the lower bound changes the # of candidates • Gram dictionary before: ab, bc • Add gram abc • Gram dictionary after: ab, bc, abc • Query Q
Construct a gram dictionary [VLDB07] • qmin = 2, qmax = 4
Cost-based construction • qmin = 2
Data sets • Environment: GNU C++, Dell GX620 PC with an Intel Pentium 2.40GHz Dual Core CPU, 2GB memory, 250GB disk, Ubuntu (Linux) OS • Index structures were assumed to be in memory
Effect of tightening lower bound • Dataset: 1M actor names • Gram dictionary construction: 100,000 sample strings, 5,000 queries, qmin = 4
Comparison with algorithm Prune [VLDB07] • Dataset: 1M article titles • Prune: qmin = 5, qmax = 7, T = 2000, LargeFirst policy • GramGen: 1% sampling ratio, 2,000 queries (qmin = 5 automatically determined)
Choosing qmin Construct gram dictionary: (a) 3,000 queries, (b) sample ratio=2%
Conclusions • Tightening lower bound • Dynamic programming • Analysis of how adding a gram affects • Index structure • Performance of queries • Efficient algorithm • Automatically generating a high-quality gram dictionary
Thank you Questions or Comments?
Related work • Approximate String Matching • q-Grams, q-Samples • Inside DBMS • Substring matching • Set similarity join • Estimation • Selectivity of SQL LIKE substring queries • Approximate string answers