
Efficient Approximate Search on String Collections

Efficient Approximate Search on String Collections. Marios Hadjieleftheriou. Chen Li. Outline. Part 1: Motivation and preliminaries Inverted list based algorithms Part 2: Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions.

Presentation Transcript


  1. Efficient Approximate Search on String Collections
     Marios Hadjieleftheriou, Chen Li

  2. Outline
     Part 1:
     • Motivation and preliminaries
     • Inverted list based algorithms
     Part 2:
     • Gram signature algorithms
     • Length normalized algorithms
     • Selectivity estimation
     • Conclusion and future directions

  3. Web Search
     • Errors in queries
     • Errors in data
     • Bring query and meaningful results closer together
     • Example: actual queries gathered by Google, http://www.google.com/jobs/britney.html

  4. Record Linkage
     • Link records between two collections, R and S
     • Similarity functions: edit distance, Jaccard, cosine, …

  5. Document Cleaning
     • Example of an error in a document: the name should be “Niels Bohr”
     • Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope

  6. Demos • http://directory.uci.edu/ • http://psearch.ics.uci.edu/advanced/ • http://psearch.ics.uci.edu/

  7. State-of-the-art: Oracle 10g and older
     • Supported by Oracle Text
     • Create the table:
         CREATE TABLE engdict(word VARCHAR(20), len INT);
     • Create preferences for text indexing:
         begin
           ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
           ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
           ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
           ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
           ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
           ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
         end;
         /
     • Create the index:
         CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
           INDEXTYPE IS ctxsys.context
           PARAMETERS ('Wordlist STEM_FUZZY_PREF');
     • Usage:
         SELECT * FROM engdict
         WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
     • Limitation: cannot handle errors in the first letters, e.g. Katherine versus Catherine

  8. Microsoft SQL Server [CGG+05] • Data cleaning tools available in SQL Server 2005 • Part of Integration Services • Supports fuzzy lookups • Uses data flow pipeline of transformations • Similarity function: tokens with TF/IDF scores

  9. Lucene • Using Levenshtein Distance (Edit Distance). • Example: roam~0.8 • Prefix filtering followed by a scan (Efficiency?)

  10. Problem Formulation Find strings similar to a given string • Performance is important! • 10 ms: 100 queries per second (QPS) • 5 ms: 200 QPS

  11. Similarity Functions • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance • Hamming distance • Jaccard similarity • Soundex • TF/IDF, BM25, DICE • See [KSS06] for an excellent survey

  12. Edit Distance
      • A widely used metric to define string similarity
      • ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
      • Example: s1 = “Tom Hanks”, s2 = “Ton Hank”, ed(s1,s2) = 2
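
The definition above can be computed with the standard dynamic program. The following is a minimal sketch, not code from the presentation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    # prev[j] is the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The slide's example needs one substitution (m → n) and one deletion (the final s).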

  13. Outline • Motivation and preliminaries • Inverted list based algorithms • List-merging algorithms • VGRAM • List-compression techniques • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

  14. “q-grams” of strings
      • Example: the 2-grams of “universal” are un, ni, iv, ve, er, rs, sa, al

  15. Edit operation’s effect on grams
      • Fixed length q: k operations could affect k * q grams
      • Example string: universal
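
A small sketch of the claim above: it extracts fixed-length grams and counts how many of them a single substitution destroys. The helper names and the misspelling "univarsal" are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def lost_grams(s: str, t: str, q: int) -> int:
    """Number of grams of s (as a multiset) that do not survive in t."""
    return sum((Counter(qgrams(s, q)) - Counter(qgrams(t, q))).values())


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']

# One substitution (k = 1) with q = 2 destroys at most k * q = 2 grams.
print(lost_grams("universal", "univarsal", 2))  # 2
```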

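  16. q-gram inverted lists
      Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
      2-gram inverted lists (gram → string ids):
      • at → 4
      • ch → 0, 2
      • ck → 1, 3
      • ic → 0, 1, 2, 4
      • ri → 0
      • st → 1, 2, 3, 4
      • ta → 4
      • ti → 1, 2, 4
      • tu → 3
      • uc → 3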

  17. Searching using inverted lists
      • Query: “shtick”, ED(shtick, ?) ≤ 1
      • 2-grams of the query: sh, ht, ti, ic, ck
      • Grams found in the index: ti, ic, ck
      • Candidates: strings with # of common grams >= 3
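
The search above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are made up here, the count threshold is taken as (number of query grams) - k * q, which is what yields 3 for this query, and surviving candidates would still need to be verified with a real edit-distance computation.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in sorted(set(qgrams(s, q))):
            index[g].append(sid)
    return index


def find_candidates(index, num_strings, query, q, k):
    """Ids of strings sharing enough grams with the query to be within ED k."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        return list(range(num_strings))  # the count filter cannot prune
    counts = Counter()
    # Repeated query grams may overcount; that is safe, since candidates
    # are verified afterwards.
    for g in qgrams(query, q):
        for sid in index.get(g, ()):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings, 2)
cands = find_candidates(index, len(strings), "shtick", 2, 1)
print([strings[i] for i in cands])  # ['stick']
```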

  18. T-occurrence Problem
      • Inverted lists are sorted in ascending order
      • Merge the lists
      • Find elements whose occurrences ≥ T

  19. Example
      • T = 4
      • List elements: 1 3 5 10 13 10 13 15 5 7 13 13 15
      • Result: 13

  20. List-Merging Algorithms
      • HeapMerger, MergeOpt [SK04]
      • ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

  21. Heap-based Algorithm
      • Push the current element of each list to a min-heap
      • Count # of occurrences of each element using the heap
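
A minimal sketch of the heap-based merge. The function name is illustrative, and the five input lists are an assumption: the transcript flattens the example's numbers, and this split is the one consistent with the "3 long lists, 2 short lists" of the MergeOpt example.

```python
import heapq


def heap_merge(lists, T):
    """Return elements that occur in at least T of the ascending lists."""
    # Heap entries are (element, list index, position within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every copy of the smallest element, advancing those lists.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Assumed split of the flattened example numbers into five lists.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(heap_merge(lists, 4))  # [13]
```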

  22. MergeOpt Algorithm [SK04]
      • Long lists: the T-1 longest lists, probed by binary search
      • Short lists: the remaining lists, merged

  23. Example of MergeOpt
      • List elements: 1 3 5 10 13 10 13 15 5 7 13 13 15
      • Long lists: 3
      • Short lists: 2
      • Count threshold T ≥ 4
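
A hedged sketch of the MergeOpt idea: any element that occurs T times must appear in at least one short list, because there are only T-1 long lists. For brevity this sketch counts occurrences in the short lists with a `Counter` rather than a heap, and the five-list split of the example numbers is an assumption.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return elements occurring in at least T of the ascending lists."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]

    # Merge the short lists (a Counter stands in for the heap here).
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)

    result = []
    for elem in sorted(counts):
        c = counts[elem]
        remaining = len(long_lists)
        for lst in long_lists:
            if c + remaining < T:
                break  # even a hit in every remaining list cannot reach T
            pos = bisect_left(lst, elem)  # binary search in a long list
            if pos < len(lst) and lst[pos] == elem:
                c += 1
            remaining -= 1
        if c >= T:
            result.append(elem)
    return result


# Assumed split of the example numbers: 3 long lists and 2 short lists.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_opt(lists, 4))  # [13]
```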

  24. ScanCount
      • Array of # of occurrences, indexed by string id
      • Scan every list; increment the count of each string id by 1
      • List elements: 1 3 5 10 13 10 13 15 5 7 13 13 15
      • Count threshold T ≥ 4
      • Result: id 13 (count 4); id 15 only reaches count 2
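
ScanCount is simple enough to sketch in a few lines. The function signature is illustrative, and the split of the example numbers into five lists is an assumption (the counts do not depend on where the list boundaries fall).

```python
def scan_count(lists, T, num_ids):
    """Return ids that occur at least T times across all lists."""
    counts = [0] * num_ids  # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(scan_count(lists, 4, 16))  # [13]
```

The lists need not be sorted, and no heap is maintained; the price is a counter array as large as the string collection.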

  25. List-Merging Algorithms
      • HeapMerger, MergeOpt [SK04]
      • ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

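  26. MergeSkip algorithm [BK02, LLL08]
      • Keep the current element of each list in a min-heap
      • Pop T-1 elements
      • Jump: move each popped list to its first element greater than or equal to the new top of the heap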

  27. Example of MergeSkip
      • Figure: min-heap over the current list elements, with a Jump step
      • Count threshold T ≥ 4
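
A sketch of the skipping idea, written from the slide's pop/jump description. The details (how ties are handled, when to stop) are assumptions of this sketch. The input reuses the earlier five-list example, not the lists in this slide's figure, which cannot be recovered from the transcript.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return elements occurring in at least T of the ascending lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:  # advance each popped list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Not enough copies: pop until T-1 elements are out of the heap.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists remain, so no further results
        # Anything below the new top can occur only in the T-1 popped
        # lists, so those lists may jump straight to it.
        target = heap[0][0]
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_skip(lists, 4))  # [13]
```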

  28. DivideSkip Algorithm [LLL08]
      • Long lists: binary search
      • Short lists: MergeSkip

  29. How many lists are treated as long lists?
      • Long lists: lookup; short lists: merge
      • A good balance in the tradeoff: # of long lists = T / (μ log M + 1)

  30. Length Filtering
      • Ed(s,t) ≤ 2
      • s: length 10
      • t: length 19
      • t is pruned by length only: the lengths differ by 9, and each edit operation changes the length by at most 1

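  31. Positional Filtering
      • Ed(s,t) ≤ 2
      • s contains gram (ab,1); t contains gram (ab,12)
      • The positions differ by 11 > 2, so this shared gram cannot be counted as a match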
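
Both filters reduce to a comparison against the edit-distance threshold k. A minimal sketch, with function names chosen here for illustration:

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Each edit operation changes the length by at most 1."""
    return abs(len_s - len_t) <= k


def passes_position_filter(pos_s: int, pos_t: int, k: int) -> bool:
    """k edit operations shift a surviving gram by at most k positions."""
    return abs(pos_s - pos_t) <= k


# Length filtering example: lengths 10 and 19, Ed(s,t) <= 2.
print(passes_length_filter(10, 19, 2))    # False

# Positional filtering example: (ab,1) in s and (ab,12) in t, Ed(s,t) <= 2.
print(passes_position_filter(1, 12, 2))   # False
```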

  32. Filter tree [LLL08]
      • root
      • Length level: 1, 2, 3, …, n
      • Gram level: aa, ab, …, zy, zz
      • Position level: 1, 2, …, m
      • Leaf: inverted list, e.g. 5, 12, 17, 28, 44

  33. Surprising experimental results (DBLP) Adding position filter could increase running time

  34. Filters fragment inverted lists
      • Applying filters splits each list into several smaller lists, each merged separately
      • Saving: reduced list size
      • Cost: tree traversal, more merging

  35. Outline • Motivation and preliminaries • Inverted list based algorithms • List-merging algorithms • VGRAM [LWY07,YWL08] • List-compression techniques • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

  36. 2-grams -> 3-grams?
      Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
      3-gram inverted lists (gram → string ids):
      • ati → 4
      • ich → 0, 2
      • ick → 1
      • ric → 0
      • sta → 4
      • sti → 1, 2
      • stu → 3
      • tat → 4
      • tic → 1, 2, 4
      • tuc → 3
      • uck → 3
      • Query: “shtick”, ED(shtick, ?) ≤ 1
      • 3-grams of the query: sht, hti, tic, ick
      • Grams found in the index: tic, ick
      • # of common grams >= 1

  37. Observation 1: dilemma of choosing “q”
      • Figure: the 2-gram inverted lists for rich, stick, stich, stuck, static
      • Increasing “q” causes:
      • Longer grams → shorter lists
      • Smaller # of common grams of similar strings
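
The dilemma can be made concrete on the five example strings. This sketch compares q = 2 with q = 3; the helper names are illustrative, and the threshold formula (number of query grams minus k * q) is the one that reproduces the thresholds 3 and 1 shown on the earlier slides.

```python
def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def index_stats(strings, q):
    """Return (number of inverted lists, total number of list entries)."""
    lists = {}
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            lists.setdefault(g, []).append(sid)
    return len(lists), sum(len(ids) for ids in lists.values())


def count_threshold(query, q, k):
    """Minimum # of common grams for a string within edit distance k."""
    return (len(query) - q + 1) - k * q


strings = ["rich", "stick", "stich", "stuck", "static"]
for q in (2, 3):
    n_lists, n_entries = index_stats(strings, q)
    print(q, n_lists, n_entries, count_threshold("shtick", q, 1))
# prints "2 10 20 3", then "3 11 15 1"
```

With q = 3 the lists are shorter on average (15 entries over 11 lists instead of 20 over 10), but the threshold drops from 3 to 1, so the count filter prunes far less.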

  38. Observation 2: skew distributions of gram frequencies • DBLP: 276,699 article titles • Popular 5-grams: ation (>114K times), tions, ystem, catio

  39. VGRAM: Main idea
      • Grams with variable lengths (between qmin and qmax)
      • zebra: ze(123)
      • corrasion: co(5213), cor(859), corr(171)
      • Advantages:
      • Reduce index size
      • Reduce running time
      • Adoptable by many algorithms

  40. Challenges • Generating variable-length grams? • Constructing a high-quality gram dictionary? • Relationship between string similarity and their gram-set similarity? • Adopting VGRAM in existing algorithms?

  41. Challenge 1: String → Variable-length grams?
      • [2,4]-gram dictionary: ni, ivr, sal, uni, vers
      • Fixed-length 2-grams of “universal”
      • Variable-length grams of “universal”
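
One plausible way to answer the question, sketched under explicit assumptions: at each position take the longest dictionary gram that starts there, fall back to the qmin-gram if none does, and drop any gram that lies entirely inside an earlier one. The slide only poses the question, so this greedy rule and the name `vgen` are assumptions of the sketch.

```python
def vgen(s, dictionary, qmin, qmax):
    """Decompose s into (gram, start position) pairs of variable length."""
    grams = []
    max_end = 0  # furthest end position covered by a gram kept so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback: the shortest allowed gram
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]  # longest dictionary match at p
                break
        end = p + len(gram)
        # Positions only move right, so a gram is contained in an earlier
        # one exactly when it does not extend past max_end.
        if end > max_end:
            grams.append((gram, p))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, 2, 4))
# [('uni', 0), ('iv', 2), ('vers', 3), ('sal', 6)]
```

Under these rules “universal” yields four grams instead of the eight fixed-length 2-grams.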

  42. Representing gram dictionary as a trie
      • Dictionary: ni, ivr, sal, uni, vers

  43. Challenge 2: Constructing gram dictionary
      • Step 1: Collecting frequencies of grams with length in [qmin, qmax]
      • Gram trie with frequencies: st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3
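
Step 1 can be sketched with a plain dictionary in place of the trie. Two assumptions are made here: the numbers on the slide are read as ids of the strings containing each gram, and the four strings below are a hypothetical collection picked so that the output reproduces those id lists.

```python
from collections import defaultdict


def gram_occurrences(strings, qmin, qmax):
    """Map every gram with length in [qmin, qmax] to the ids containing it."""
    occ = defaultdict(set)
    for sid, s in enumerate(strings):
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                occ[s[i:i + q]].add(sid)
    return {g: sorted(ids) for g, ids in occ.items()}


# Hypothetical collection; not stated on the slide.
strings = ["stick", "stich", "such", "stuck"]
occ = gram_occurrences(strings, 2, 4)
for g in ("st", "sti", "stu", "stic", "stuc"):
    print(g, occ[g])
```

The frequency of a gram is then `len(occ[g])`, which is the quantity the pruning step compares against its threshold.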

  44. Step 2: selecting grams • Pruning trie using a frequency threshold F (e.g., 2)

  45. Step 2: selecting grams (cont) Threshold T = 2

  46. Final gram dictionary A cost-based approach to choosing a gram dictionary [YWL08]

  47. Challenge 3: Edit operation’s effect on grams
      • Fixed length q: k operations could affect k * q grams
      • Example string: universal

  48. Deletion affects variable-length grams
      • Deletion at position i
      • Affected: grams within [i - qmax + 1, i + qmax - 1]
      • Not affected: grams on either side of that range

  49. Grams affected by a deletion
      • [2,4]-grams: ni, ivr, sal, uni, vers
      • String: universal
      • Deletion at position i: which grams within [i - qmax + 1, i + qmax - 1] are affected?

  50. Grams affected by a deletion (cont)
      • Deletion at position i: which grams within [i - qmax + 1, i + qmax - 1] are affected?
      • Figure: trie of grams and trie of reversed grams
