Efficient Merging and Filtering Algorithms for Approximate String Searches

Jiaheng Lu, University of California, Irvine. Joint work with Chen Li and Yiming Lu.


Presentation Transcript


  1. Efficient Merging and Filtering Algorithms for Approximate String Searches. Jiaheng Lu, University of California, Irvine. Joint work with Chen Li and Yiming Lu.

  2. Example: a movie database • Find movies starring Schwarrzenger.

  3. Data may not be clean • Data integration and cleaning: Relation R, Relation S

  4. Problem definition: approximate string searches • Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, … • Query q: Schwarrzenger • Output: strings s that satisfy Sim(q,s) ≤ δ • Sim functions: edit distance, Jaccard coefficient, and cosine similarity

  5. Outline • Problem motivation • Preliminaries • Grams • Inverted lists • Merge algorithms • Filtering techniques • Conclusion

  6. String → Grams • q-grams • For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
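A minimal Python sketch of gram extraction. The helper name `qgrams` is hypothetical, and the input "universal" is inferred from the grams listed on the slide rather than stated in the transcript.

```python
def qgrams(s, q=2):
    """Return the overlapping q-grams of s, in order of position."""
    # A string shorter than q has no q-grams, so the range is empty.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, which is why the 9-character example produces 8 grams.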

  7. Inverted lists • Convert strings to gram inverted lists • Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static • 2-gram inverted lists: at → 4; ch → 0, 2; ck → 1, 3; ic → 0, 1, 2, 4; ri → 0; st → 1, 2, 3, 4; ta → 4; ti → 1, 2, 4; tu → 3; uc → 3
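The conversion from strings to gram inverted lists might be sketched as follows. The function names are hypothetical, and deduplicating repeated grams within one string is a simplifying assumption of this sketch, not something the slide specifies.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q=2):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # set() keeps each id at most once per list (assumption of this sketch).
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
print(index["ic"])  # [0, 1, 2, 4]
print(index["st"])  # [1, 2, 3, 4]
```

Because ids are assigned in increasing order while scanning, every list comes out sorted, which the merge algorithms rely on.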

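  8. Main Example • Query: stick, with ed(s,q) ≤ 1 • Query grams: (st,ti,ic,ck) • Inverted lists from the data: ck → 1,3; ic → 0,1,2,4; st → 1,2,3,4; ta → 4; ti → 1,2,4; … • Merge the lists of the query grams • Candidates: ids with count ≥ 2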
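The example above can be reproduced with a short sketch. It assumes the standard count-filter bound, which the slide does not spell out: a string within edit distance k of the query shares at least (number of query grams) - k·q grams with it. That gives the threshold 4 - 1·2 = 2 used here. All function names are hypothetical, and the per-string counting is a plain stand-in for merging inverted lists.

```python
def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (threshold, candidate ids, verified answer ids)."""
    grams = qgrams(query, q)
    threshold = len(grams) - k * q  # count-filter bound (assumption, see above)
    candidates = []
    for sid, s in enumerate(strings):
        s_grams = set(qgrams(s, q))
        shared = sum(1 for g in grams if g in s_grams)
        if shared >= threshold:
            candidates.append(sid)
    # Candidates are only a superset; verify each with the real distance.
    answers = [sid for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return threshold, candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search(strings, "stick", k=1))
# (2, [1, 2, 3, 4], [1, 2, 3])
```

The string "static" (id 4) passes the count filter but fails verification, which shows why merging yields candidates rather than final answers.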

  9. Problem definition: Merge • Lists are in ascending order • Find elements whose # of occurrences ≥ T

  10. Example • T = 4 • Lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15) • Result: 13

  11. Contributions • Three new merge algorithms • New finding: use filters wisely

  12. Outline • Problem motivation • Preliminaries • Merge algorithms • Two previous algorithms • Our proposed three algorithms • Filtering techniques • Conclusion

  13. Five Merge Algorithms • Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004] • New: ScanCount, MergeSkip, DivideSkip

  14. Heap-based Algorithm • Push to heap (min-heap) • Count the # of occurrences of each element using the heap
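A minimal sketch of the heap-based merge, assuming each list is strictly ascending. The name `heap_merge` and the five-list input (the split of the running example that matches the later MergeOpt slide) are illustrative assumptions.

```python
import heapq


def heap_merge(lists, T):
    """Return the ids that occur on at least T of the sorted lists."""
    # The heap holds one (value, list index, position) entry per list.
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every entry equal to the current minimum and count them.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.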

  15. MergeOpt Algorithm • Long lists (the T-1 longest): binary search • Short lists: merge

  16. Example of MergeOpt [Sarawagi et al 2004] • Long lists (3): (1 3 5 10 13), (10 13 15), (5 7 13) • Short lists (2): (13), (15) • Count threshold: T = 4
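The long/short split can be sketched as below. This is an illustration under assumptions, not the original implementation: the short lists are counted with a `Counter` instead of a heap to keep the code brief, and the name `merge_opt` is hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return ids on at least T lists, probing the T-1 longest lists lazily."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # Any id on >= T lists must appear on a short list, because only
    # T-1 lists are long. So the short lists supply all candidates.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)
    result = []
    for sid in sorted(counts):
        total = counts[sid]
        for lst in long_lists:
            pos = bisect_left(lst, sid)  # binary search in a long list
            if pos < len(lst) and lst[pos] == sid:
                total += 1
        if total >= T:
            result.append(sid)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With T = 4 only ids 13 and 15 become candidates, so the three long lists are probed twice each instead of being scanned in full.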

  17. Can we run faster?

  18. Five Merge Algorithms • Previous: HeapMerger, MergeOpt • New: ScanCount, MergeSkip, DivideSkip

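  19. ScanCount Example • An array indexed by string id stores the # of occurrences; each id read from a list increments its entry by 1 • Lists: (1 3 5 10 13), (10 13 15), (5 7 13), (13), (15) • Counts: id 13 → 4 (result!), id 14 → 0, id 15 → 2 • Count threshold: T = 4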
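A minimal sketch of ScanCount, assuming string ids are integers in the range 0 to `num_strings` - 1. The function and parameter names are hypothetical.

```python
def scan_count(lists, T, num_strings):
    """Return ids that occur on at least T lists, using a count array."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Report an id exactly once, when its count first reaches T.
            if counts[sid] == T:
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_strings=16))  # [13]
```

The scan needs no heap and no sorted input, at the cost of an array as large as the string collection.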

  20. Five Merge Algorithms • Previous: HeapMerger, MergeOpt • New: ScanCount, MergeSkip, DivideSkip

  21. MergeSkip algorithm • Pop T-1 elements from the min-heap • In each popped list, jump to the first element greater than or equal to the new top of the heap

  22. Example of MergeSkip [figure: a min-heap over the current elements of the lists; popped lists jump ahead] • Count threshold: T = 4
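A sketch of MergeSkip under stated assumptions: lists are strictly ascending, the jump is done with binary search, and the input is the five-list running example from the earlier slides (the numbers in this slide's figure did not survive extraction). The name `merge_skip` is hypothetical.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids on at least T lists, skipping ids that cannot qualify."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than T live lists, no further id can reach the threshold.
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop until T-1 entries are out; none of them can qualify
            # unless it equals the new heap top.
            while len(popped) < T - 1 and heap:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            target = heap[0][0]
            for _, i, pos in popped:
                # Jump to the first element >= target in each popped list.
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On this input the first step pops 1, 5, and 10, then all three lists jump straight to 13, so ids 3 and 7 are never visited.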

  23. Skip is safe • Min-heap • # of occurrences of skipped elements ≤ T-1

  24. Five Merge Algorithms • Previous: HeapMerger, MergeOpt • New: ScanCount, MergeSkip, DivideSkip

  25. DivideSkip Algorithm • Long lists: binary search • Short lists: MergeSkip

  26. How many lists are treated as long lists? • Long lists: lookup • Short lists: merge

  27. Decide L value • A good balance in the tradeoff: # of long lists L = T / (μ log M + 1)
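DivideSkip with this choice of L might be sketched as follows. Several points are assumptions, not taken from the slides: M is read as the length of the longest list, the logarithm is taken base 2, L is rounded down and capped at T-1, and μ is passed in as a tuning parameter with no recommended value. All names are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning {id: count} for ids on >= T lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    counts = {}
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            counts[top] = len(popped)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while len(popped) < T - 1 and heap:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            target = heap[0][0]
            for _, i, pos in popped:
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return counts


def divide_skip(lists, T, mu):
    """Return ids on at least T lists (assumes T >= 1 and mu >= 0)."""
    by_len = sorted(lists, key=len, reverse=True)
    M = len(by_len[0]) if by_len else 0
    L = int(T / (mu * math.log2(M) + 1)) if M > 0 else 0
    L = max(0, min(L, T - 1, len(by_len)))  # keep short-list threshold >= 1
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    # An id on >= T lists is on at most L long lists, hence on >= T-L short ones.
    for sid, count in sorted(merge_skip_counts(short_lists, T - L).items()):
        for lst in long_lists:
            pos = bisect_left(lst, sid)
            if pos < len(lst) and lst[pos] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4, mu=1.0))  # [13]
```

A larger μ lowers L, which shifts work from lookups in long lists toward merging; a smaller μ does the opposite.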

  28. Experimental data sets • DBLP data • IMDB data • Google Web corpus

  29. Performance (DBLP) • DivideSkip is the best one

  30. # of accessed elements (DBLP) • DivideSkip is the best one

  31. Outline • Problem motivation • Preliminaries • Merge algorithms • Filtering techniques • Length, positional filters • Filter tree • Conclusion and future work

  32. Length Filtering • Ed(s,t) ≤ 2 • s: length 10 • t: length 19 • Pruned by length only!
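The length filter rests on the fact that each edit operation changes the length of a string by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch, with a hypothetical function name:

```python
def passes_length_filter(s, t, k):
    """Return False when the lengths alone rule out ed(s, t) <= k."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # stand-in for the slide's string of length 10
t = "b" * 19  # stand-in for the slide's string of length 19
print(passes_length_filter(s, t, k=2))  # False
```

The pair from the slide differs in length by 9, so it is pruned without looking at any character.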

  33. Positional Filtering • Ed(s,t) ≤ 2 • s: (ab,1) • t: (ab,12)
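A sketch of the positional filter, assuming 1-based gram positions as in the slide's (ab,1) and (ab,12). Two equal grams are only treated as a match when their positions differ by at most k. The helper names are hypothetical.

```python
def positional_qgrams(s, q=2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def positions_compatible(gram_s, gram_t, k):
    """True when the grams are equal and their positions are within k."""
    (text_s, pos_s), (text_t, pos_t) = gram_s, gram_t
    return text_s == text_t and abs(pos_s - pos_t) <= k


print(positions_compatible(("ab", 1), ("ab", 12), k=2))  # False
```

In the slide's example the two occurrences of "ab" are 11 positions apart, so they do not count as a shared gram for Ed(s,t) ≤ 2.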

  34. Filter tree • root • Length level: 1, 2, 3, …, n • Gram level: aa, ab, …, zy, zz • Position level: 1, 2, …, m • Inverted list: 5, 12, 17, 28, 44
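One way to picture the filter tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. This is only an illustrative sketch. The name `build_filter_tree` and the use of the five strings from the inverted-list slide are assumptions.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """Build tree[length][gram][position] -> ascending list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            # Positions are 1-based, matching the positional-filter slide.
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(tree[4]["ic"][2])  # [0]
print(tree[5]["ic"][3])  # [1, 2]
print(tree[6]["ic"][5])  # [4]
```

The single inverted list for "ic" (0, 1, 2, 4) is split into three smaller lists here, so a query touches fewer ids per list but may have to merge more lists.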

  35. Surprising experimental results (DBLP) • Why does adding the position filter increase the running time?

  36. Filters fragment inverted lists • Applying filters • Cost: tree traversal, more merging • Saving: reduced total list size

  37. Conclusion • Three new merge algorithms • We run faster • Interesting finding: Do not abuse filters!

  38. Related work • Approximate string matching [Navarro 2001] • Fuzzy lookup in Varied length Grams [Li et al 2007]

  39. References 1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, "Efficient Exact Set-similarity Joins" in VLDB 2006 2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning" in SIGMOD 2003 3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, "Approximate string joins in a database (almost) for free" in VLDB 2001

  40. References 4. [Li 2007] C. Li, B. Wang and X. Yang, "VGRAM: Improving performance of approximate queries on string collections using variable-length grams" in VLDB 2007 5. [Navarro 2001] G. Navarro, "A guided tour to approximate string matching" in ACM Computing Surveys 2001 6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, "Efficient set joins on similarity predicates" in ACM SIGMOD 2004
