
Cost-Based Variable-Length-Gram Selection for String Collections

Using a cost-based approach, this study explores the selection of variable-length grams to support efficient approximate queries in string collections. The study analyzes the effects of adding grams on index and query performance, and proposes a quantitative approach for constructing high-quality gram dictionaries.



Presentation Transcript


  1. Northeastern University, China. Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li

  2. Approximate selection queries. Example query: "Schwarrzenger". Query errors: • Limited knowledge about data • Typos • Limited input device (cell phone). Data errors: • Typos • Web data • OCR. Similarity functions: • Edit distance • Jaccard • Cosine • … Applications: • Spellchecking • Query relaxation • …

  3. Performance is a big issue • Answer queries interactively • Many queries on a server

  4. Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

  5. q-grams. Example: the 2-grams of the string "bingon" are bi, in, ng, go, on.

  6. q-gram inverted lists (2-grams) over the string collection (id: string): 1: bingo, 2: bioinng, 3: bitingin, 4: biting, 5: boing, 6: going.
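The inverted lists on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `qgrams` and `build_inverted_lists` are hypothetical, and each list is assumed to store a string id once even when the gram occurs several times in that string.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Return the overlapping q-grams of s, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of ids of strings containing it."""
    lists = defaultdict(list)
    for sid in sorted(strings):
        # set(): record each string id at most once per gram (assumption)
        for gram in set(qgrams(strings[sid], q)):
            lists[gram].append(sid)
    return dict(lists)

# String collection from the slide.
STRINGS = {1: "bingo", 2: "bioinng", 3: "bitingin",
           4: "biting", 5: "boing", 6: "going"}
INDEX = build_inverted_lists(STRINGS)
```

Looking up `INDEX["go"]` returns the ids of the strings that contain the 2-gram "go".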

  7. Query processing (2-grams) over the same collection (1: bingo, 2: bioinng, 3: bitingin, 4: biting, 5: boing, 6: going) • Query: ED(bingon, ?) ≤ 1 • Condition on candidates: # of common grams >= 3
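The query on this slide can be traced with a small count-filter sketch. It assumes the fixed-length bound stated later in the slides (# of common grams >= # of query grams - k * q), counts common grams as a multiset intersection, and then verifies candidates with a standard edit-distance routine. The function names are hypothetical, and a real system would merge inverted lists far more efficiently.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(strings, query, k, q=2):
    """Return (threshold, candidate ids, verified answer ids)."""
    # Inverted lists with per-string gram counts: gram -> {string id: count}.
    index = defaultdict(dict)
    for sid, s in strings.items():
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram][sid] = count
    threshold = len(qgrams(query, q)) - k * q
    if threshold <= 0:
        candidates = sorted(strings)  # the filter cannot prune anything
    else:
        common = Counter()
        for gram, m in Counter(qgrams(query, q)).items():
            for sid, c in index.get(gram, {}).items():
                common[sid] += min(m, c)
        candidates = sorted(sid for sid, n in common.items() if n >= threshold)
    answers = [sid for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return threshold, candidates, answers

STRINGS = {1: "bingo", 2: "bioinng", 3: "bitingin",
           4: "biting", 5: "boing", 6: "going"}
RESULT = search(STRINGS, "bingon", k=1)
```

The gap between the candidate list and the verified answers is the cost that a tighter lower bound is meant to reduce.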

  8. VGRAM: variable-length grams [VLDB07]. [Figure: trie of a [2,3]-gram dictionary and the decomposition of the string "bingon" into variable-length grams.]

  9. Adopting VGRAM in algorithms. [Figure: gram dictionary trie with decompositions of the string "bingon"; labels: string, grams, VGRAM, gram dictionary, lower bound; example condition: # of common grams >= 3.]

  10. Contributions of this study • Tightening lower bounds using dynamic programming • Cost-based quantitative approach • Analyze and estimate query performance when adding each gram • Automatically find high-quality grams [Diagram labels: string collection, gram dictionary, high-quality gram]

  11. Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

  12. Calculating lower bound: fixed length (q). Example string: "biinding". If ed(s1, s2) <= k, then # of common grams >= (# of grams of s1) - k * q.

  13. Calculating lower bound: variable lengths. Example string: "biinding", with one value per character: 2 2 2 1 1 3 3 1. Lower bound = (# of grams of s1) - NAG(s1, k).

  14. Too pessimistic? • k-Max: summation of the k largest values. Example string: "biinding", with values 2 2 2 1 1 3 3 1, so NAG(s, 2) = 3 + 3 = 6.
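The k-Max estimate above amounts to summing the k largest per-character values. The sketch below reproduces the slide's arithmetic. The function names and the `num_grams` argument are hypothetical, since the slides do not state how many grams "biinding" has.

```python
def nag_k_max(affected, k):
    """k-Max estimate of NAG(s, k): sum of the k largest per-position values."""
    return sum(sorted(affected, reverse=True)[:k])

def lower_bound(num_grams, affected, k):
    """Lower bound on common grams: (# of grams of s) - NAG(s, k)."""
    return num_grams - nag_k_max(affected, k)

# Per-character values for "biinding" as shown on the slide.
AFFECTED = [2, 2, 2, 1, 1, 3, 3, 1]
NAG_2 = nag_k_max(AFFECTED, 2)  # 3 + 3
```

k-Max may count the same gram twice when two edit positions touch overlapping grams, which is why the slide asks whether the estimate is too pessimistic.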

  15. Tightening lower bound • Dynamic programming: tightening NAG(s, k) • Subproblems: NAG(s[1, j], i) [Figure: string s with positions 1..j and the i-th operation op_i]

  16. Dynamic programming • Recurrence function [the recurrence itself is not preserved in this transcript; figure labels: op_i, op_{i-1}, B[j], string s, positions 1..j]

  17. Dynamic programming. Example string: "biinding", with values 2 2 2 1 1 3 3 1. [Figure: NAG vector computed for k=0, k=1, k=2]
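Because the recurrence did not survive in this transcript, the following is not the authors' recurrence. It is an illustrative dynamic program under a simplifying assumption: an edit at position j affects a contiguous range of gram indices (lo_j, hi_j), and both endpoints are non-decreasing in j. The ranges in `RANGES` are hypothetical, chosen only so that their sizes match the values 2 2 2 1 1 3 3 1 from the slide.

```python
def nag_dp(ranges, k):
    """Max number of distinct grams affected by at most k edit positions.

    ranges[j] = (lo, hi): inclusive gram-index range affected by an edit at
    position j. Assumes lo and hi are non-decreasing in j.
    """
    n = len(ranges)
    if k == 0 or n == 0:
        return 0
    # cur[j]: best count using exactly i positions, the rightmost being j.
    cur = [hi - lo + 1 for lo, hi in ranges]  # i = 1
    best = max(cur)
    for _ in range(2, k + 1):
        nxt = [0] * n
        for j, (lo, hi) in enumerate(ranges):
            for jp in range(j):
                prev_hi = ranges[jp][1]
                # Only grams beyond the previous range's end are newly affected.
                gain = max(0, hi - max(lo - 1, prev_hi))
                nxt[j] = max(nxt[j], cur[jp] + gain)
        cur = nxt
        best = max(best, max(cur))
    return best

# Hypothetical gram ranges for the 8 positions of "biinding".
RANGES = [(0, 1), (0, 1), (1, 2), (2, 2), (3, 3), (3, 5), (4, 6), (6, 6)]
SIZES = [hi - lo + 1 for lo, hi in RANGES]
NAG_VECTOR = [nag_dp(RANGES, k) for k in range(3)]  # k = 0, 1, 2
```

Under these assumed ranges, the value for k = 2 is smaller than the k-Max sum of 6, because the two positions with value 3 share grams. That overlap is the kind of slack a tighter NAG(s, k) removes.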

  18. Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

  19. Effects on inverted lists. Gram dictionary before: ab, bc. Gram dictionary after adding gram abc: ab, bc, abc. [Figure: strings of the form --abc--, --ab--, --bc-- and their inverted lists]
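The effect of adding the gram abc can be seen in a toy simulation. The decomposition below is a simplified sketch, assumed rather than taken from the slides: at each position it takes the longest dictionary gram starting there (or the qmin-gram if none), and skips a gram that is positionally contained in the previous one. The example strings and function names are hypothetical.

```python
from collections import defaultdict

def decompose(s, dictionary, qmin=2):
    """Simplified variable-length gram decomposition (assumed behaviour)."""
    grams, last_end = [], -1
    for p in range(len(s) - qmin + 1):
        best = s[p:p + qmin]  # default: the qmin-gram at this position
        for g in dictionary:
            if len(g) > len(best) and s.startswith(g, p):
                best = g
        end = p + len(best) - 1
        if end > last_end:  # skip grams subsumed by the previous gram
            grams.append(best)
            last_end = end
    return grams

def inverted_lists(strings, dictionary, qmin=2):
    lists = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in decompose(s, dictionary, qmin):
            lists[g].add(sid)
    return dict(lists)

# Hypothetical strings of the forms --abc--, --ab--, --bc--.
STRINGS = ["zabcz", "xabx", "ybcy"]
BEFORE = inverted_lists(STRINGS, set())     # only 2-grams
AFTER = inverted_lists(STRINGS, {"abc"})    # gram abc added
```

In this sketch the string containing abc leaves the lists of ab and bc and appears only in the new list of abc, so existing lists shrink or stay the same.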

  20. Effects on query performance • Decrease query’s inverted list • Change lower bound • Change # of candidates

  21. Effects on query’s inverted lists. Gram dictionary before: ab, bc. Gram dictionary after adding gram abc: ab, bc, abc. • Adding a new gram abc either leaves the inverted lists of query Q unchanged or shrinks them

  22. Effects on lower bound • Query: Q, ED(Q, ?) ≤ 1 [Figure: query Q shown twice]

  23. Effects on # of candidates • Change lower bound → change # of candidates. Gram dictionary before: ab, bc. Gram dictionary after adding gram abc: ab, bc, abc. [Figure: query Q]

  24. Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

  25. Construct a gram dictionary [VLDB07]: qmin=2, qmax=4

  26. Cost-based construction: qmin=2

  27. Outline • Motivation • Tightening lower bound of common strings • Effects of adding a gram on index and queries • Cost-based construction of gram dictionary • Experiments

  28. Data sets. Environment: GNU C++, Dell GX620 PC with an Intel Pentium 2.40GHz Dual Core CPU, 2GB memory, 250GB disk, Ubuntu (Linux) OS. Index structures were assumed to be in memory.

  29. Effect of Tightening Lower Bound. Dataset: 1M actor names. Gram dictionary construction: 100,000 sample strings, 5000 queries, qmin = 4.

  30. Comparison with algorithm Prune [VLDB07]. Dataset: 1M article titles. Prune: qmin=5, qmax=7, T=2000, LargeFirst policy. GramGen: 1% sampling ratio, 2000 queries (qmin=5 automatically determined).

  31. Choosing qmin. Gram dictionary construction: (a) 3,000 queries, (b) sample ratio = 2%.

  32. Conclusions • Tightening lower bound • Dynamic programming • Analysis of how adding a gram affects • Index structure • Performance of queries • Efficient algorithm • Automatically generating a high-quality gram dictionary

  33. Thank you Questions or Comments?

  34. Related work • Approximate String Matching • q-Grams, q-Samples • Inside DBMS • Substring matching • Set similarity join • Estimation • Selectivity of SQL LIKE substring queries • Approximate string answers
