
Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists

Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists. Jialong Han Co-authored with Jiaheng Lu, Xiaofeng Meng Renmin University of China. Introduction: An Example. A dictionary of strings we are interested in E.g. product names, postal addresses…


Presentation Transcript


  1. Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists Jialong Han Co-authored with Jiaheng Lu, Xiaofeng Meng Renmin University of China

  2. Introduction: An Example • A dictionary of strings we are interested in • E.g. product names, postal addresses… • We want to locate their “approximate appearances” in a series of documents. • See the meaning of “approximate appearance” in the following example:

  3. Problem Definition • Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). • Here we call r a piece of evidence for m. • Similarity() is a function measuring the similarity of two strings • Strings are viewed as sets of tokens (words) • An example for Sim(): Jaccard similarity:
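A minimal sketch of the problem definition, assuming a weighted Jaccard similarity wt(r ∩ m) / wt(r ∪ m) over token sets. The slide's own formula did not survive in this transcript, so the weighted form is an assumption. The function names, the `max_len` cap on substring length, and the default weight of 1.0 for unknown tokens are also illustrative assumptions. The dictionary and weights reuse the camera example given later in the deck. This brute-force version checks every substring against every dictionary string, which is exactly the cost the filtering techniques aim to avoid.

```python
from typing import Dict, List, Set, Tuple


def weighted_jaccard(r: Set[str], m: Set[str], wt: Dict[str, float]) -> float:
    """Assumed similarity: wt(r ∩ m) / wt(r ∪ m); unknown tokens weigh 1.0."""
    union = r | m
    if not union:
        return 0.0

    def total(tokens: Set[str]) -> float:
        return sum(wt.get(t, 1.0) for t in tokens)

    return total(r & m) / total(union)


def extract_naive(doc: List[str], dictionary: Dict[str, List[str]],
                  wt: Dict[str, float], delta: float,
                  max_len: int) -> List[Tuple[int, int, str]]:
    """Return (start, end, rid) for each substring doc[start:end] that has
    evidence rid with similarity >= delta. No filtering is applied."""
    hits = []
    for i in range(len(doc)):
        for j in range(i + 1, min(i + max_len, len(doc)) + 1):
            m = set(doc[i:j])
            for rid, r in dictionary.items():
                if weighted_jaccard(set(r), m, wt) >= delta:
                    hits.append((i, j, rid))
    return hits


# Dictionary and weights from the deck's camera example; DOC is made up.
WT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
      "slr": 2.0, "eos": 7.0, "5d": 9.0}
R = {"r1": "canon eos 5d digital camera".split(),
     "r2": "nikon digital slr camera".split(),
     "r3": "canon slr camera".split()}
DOC = "buy canon eos digital camera today".split()
```

Under these assumptions, `extract_naive(DOC, R, WT, 0.5, 5)` reports “canon eos digital camera” (token positions 1 to 5) with evidence r1, since the shared tokens weigh 11 out of a union weight of 20.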

  4. Outline • Introduction • State-of-the-art techniques • The filtration-verification framework • K-signature scheme • Inverted Signature-based Hashtable • Our algorithms and evaluations • Conclusion

  5. Why pre-pruning is needed • We need to spot evidence to decide whether a substring m should be extracted • Simply verifying m against every dictionary string may be inefficient • Pre-pruning followed by post-verification is beneficial • But should it be running-speed-oriented or filtering-power-oriented? • Less time or fewer survivors?

  6. The issue of compromise comes again • Balance between the two stages should be reached: • Strong (weak) filtration power: more (less) filtration time • Fewer (more) candidates: less (more) verification time • Overall performance = Tf + Tv

  7. Outline • Introduction • State-of-the-art techniques • The filtration-verification framework • K-signature scheme • Inverted Signature-based Hashtable • Our algorithms and evaluations • Conclusion

  8. K-signature scheme • Proposed by Chakrabarti et al. (SIGMOD 2008) • Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s) • Observation: if r cannot match m, r is likely to have insufficient signature overlap with m • K is a parameter for tuning filtration power • Potential evidence loss • A counter-example was found for k=3 • We could only prove that it works for k=1 and k=∞

  9. Outline • Introduction • State-of-the-art techniques • The filtration-verification framework • K-signature scheme • Inverted Signature-based Hashtable • Our algorithms and evaluations • Conclusion

  10. Inverted Signature-based Hashtable • Proposed by Chakrabarti et al. (SIGMOD 2008) • Each dictionary string is encoded into a solid 0-1 matrix • A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle) • Bitwise-or all solid matrices to get the matrix of R • Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R. • Formalized into an NP-complete problem • The solution gives filtering power that is too weak

  11. Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion

  12. Our proposed theorem • If Sim(m,r) ≥ δ, what do we have? • wt(Sig(m)∩Sig(r)) ≥ τ(m): too strict! • wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}: proved by us • So the threshold does not remain constant • It involves the unknown evidence • Our solution: use inverted lists to count sig-token overlaps. • Note that sig-tokens usually have low document frequency (e.g. IDF as weights)

  13. Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion

  14. Signature-based Inverted Lists • Lists indexed by sig-tokens • Each sig-token of a string creates a node (containing the string’s id) in the corresponding list. • E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }. • wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9). • [Figure: one inverted list per sig-token, labeled 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0), each holding nodes with string ids 1 to 3]
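The index construction described on this slide can be sketched as follows. The signature sets in `SIGS` are hypothetical: the transcript does not say which tokens of r1, r2 and r3 were chosen as sig-tokens, so these are only one plausible choice of top-weighted tokens. `build_sil` is an illustrative name, not the authors' code.

```python
from collections import defaultdict
from typing import Dict, List


def build_sil(signatures: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Build signature-based inverted lists: each sig-token of a string
    adds one node (the string's id) to that token's list."""
    sil = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            sil[token].append(rid)
    return dict(sil)


# ASSUMPTION: hypothetical signature sets for r1, r2, r3 (not from the slides).
SIGS = {
    1: ["5d", "eos", "canon"],      # r1 = "canon eos 5d digital camera"
    2: ["nikon", "slr", "camera"],  # r2 = "nikon digital slr camera"
    3: ["canon", "slr"],            # r3 = "canon slr camera"
}
SIL = build_sil(SIGS)
```

Tokens that are not a sig-token of any string, such as “digital” under this assumed choice, get no list at all, which keeps the index small.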

  15. Filtration by SIL • Using an array called “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r)) • E.g. m = “canon eos digital camera”, δ = 0.8 • [Figure: the inverted lists for 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0) feeding the accumulator; one accumulator entry is marked “Qualified!”]
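A rough sketch of the accumulator-based filtration, assuming the inverted lists and Sig(m) are already available. The transcript does not define τ, so `tau_m` and `tau_r` are caller-supplied placeholders, and the numeric values below are made up purely to exercise the code. The list contents and Sig(m) are likewise assumed. The candidate test uses the condition wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)} from the theorem slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def filter_by_sil(sil: Dict[str, List[int]], wt: Dict[str, float],
                  sig_m: List[str], tau_m: float,
                  tau_r: Dict[int, float]) -> Tuple[Dict[int, float], List[int]]:
    """Scan the list of every sig-token of m, add that token's weight to the
    accumulator entry of each string in the list, then keep the strings
    whose accumulated weight reaches min(tau_m, tau_r[rid])."""
    acc = defaultdict(float)
    for token in sig_m:
        for rid in sil.get(token, []):
            acc[rid] += wt[token]
    candidates = sorted(rid for rid, w in acc.items()
                        if w >= min(tau_m, tau_r[rid]))
    return dict(acc), candidates


# ASSUMPTION: list contents, Sig(m) and all tau values are illustrative only.
SIL = {"5d": [1], "eos": [1], "canon": [1, 3],
       "nikon": [2], "slr": [2, 3], "camera": [2]}
WT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
      "slr": 2.0, "eos": 7.0, "5d": 9.0}
SIG_M = ["eos", "canon"]           # assumed Sig("canon eos digital camera")
TAU_M = 8.0                        # placeholder threshold for m
TAU_R = {1: 16.0, 2: 4.0, 3: 3.0}  # placeholder thresholds per string

ACC, CANDIDATES = filter_by_sil(SIL, WT, SIG_M, TAU_M, TAU_R)
```

Only strings that appear in at least one scanned list ever receive an accumulator entry, so the cost depends on list lengths rather than on the dictionary size. Survivors still have to be verified with the real similarity function.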

  16. Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion

  17. EvITER: Progressive Computation • Recall we are checking all substrings • Some of them are quite similar, so they share duplicate computation • An intuition: if m has potential evidence r, then m t (m extended by the next token t) is very likely to match r • Formally we proved that • Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t} • We have ES(m t) ⊆ ES(m) ∪ list[t]

  18. Example • Document M: “… canon eos digital camera lens …”, with m = “canon eos digital camera” and t = “lens” • ES(m) = {r1}; list[t] for lens (3.0) contains 22 and 53 • We know that only r1, r22, r53 can possibly match “canon eos digital camera lens”
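The containment ES(m t) ⊆ ES(m) ∪ list[t] suggests an incremental step like the one below. This is a sketch under assumptions, not the authors' EvITER code: `next_evidence` and the `still_qualifies` callback (standing in for whatever filtering condition is applied to the extended substring) are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Set


def next_evidence(es_m: Iterable[int], lists: Dict[str, List[int]], t: str,
                  still_qualifies: Callable[[int], bool]) -> Set[int]:
    """Compute the potential evidence of m extended by token t.

    Only strings in ES(m) or in list[t] can qualify, so the filter is
    re-run on that union instead of on the whole dictionary."""
    pool = set(es_m) | set(lists.get(t, ()))
    return {rid for rid in pool if still_qualifies(rid)}


# Values from the slide's example: ES(m) = {r1}, list["lens"] = [22, 53].
LISTS = {"lens": [22, 53]}
POOL = next_evidence({1}, LISTS, "lens", lambda rid: True)
```

With a filter that accepts everything, `POOL` is the full union {1, 22, 53}, matching the three strings named on the slide; a real filter would shrink it further.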

  19. Flow of Evidence • EvITER for “Evidence ITERATION” …

  20. Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion

  21. The Static Threshold Problem • How does this index work so far? • -“Get ready for δ=0.8 please.” • -“Please wait 30min for index generation…” • -“Ready!” • -“Document M1, δ=0.8. Go!” • -“…Extraction complete.” • -“Document M2, and I want δ=0.9…” • -“Sorry, please wait another 30min for index regeneration…” • -“:-(”

  22. The Static Threshold Problem • This one seems better • -“Get ready for δ>=0.8 please.” • -“Please wait 30min for index generation…” • -“Ready!” • -“Document M1, δ=0.8. Go!” • -“…Extraction complete.” • -“Document M2, and I want δ=0.9…” • -“…Extraction complete.” • -“:-)”

  23. Supporting Dynamic Thresholds • An Observation • When δ decreases, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking. • I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>. • We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u value. • For any given δ, we only need to retrieve a prefix of each list to get all “active nodes”
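The prefix retrieval described above might look like the following sketch. The u values are invented for illustration, the function names are hypothetical, and treating a node as active when u ≥ δ (rather than u > δ) is an assumption about the boundary case. A linear `takewhile` is used for clarity; because each list is sorted, a binary search would find the same cut-off.

```python
from itertools import takewhile
from typing import Dict, Iterable, List, Tuple


def build_dynamic_lists(
        nodes: Iterable[Tuple[str, int, float]]) -> Dict[str, List[Tuple[float, int]]]:
    """Group (sig_token, rid, u) nodes by token and sort each list by
    descending u, so the active nodes always form a prefix."""
    lists: Dict[str, List[Tuple[float, int]]] = {}
    for token, rid, u in nodes:
        lists.setdefault(token, []).append((u, rid))
    for entries in lists.values():
        entries.sort(key=lambda node: node[0], reverse=True)
    return lists


def active_nodes(lists: Dict[str, List[Tuple[float, int]]],
                 token: str, delta: float) -> List[int]:
    """Return the ids in the prefix of list[token] that is active at delta."""
    prefix = takewhile(lambda node: node[0] >= delta, lists.get(token, []))
    return [rid for _, rid in prefix]


# ASSUMPTION: made-up u values, only to show the prefix behaviour.
NODES = [("canon", 3, 0.85), ("canon", 1, 0.95),
         ("slr", 3, 0.70), ("slr", 2, 0.90)]
LISTS = build_dynamic_lists(NODES)
```

One index built this way serves any δ at or above the lowest threshold it was prepared for: `active_nodes(LISTS, "canon", 0.9)` reads only the first node, while δ = 0.8 reads both.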

  24. Experimental Datasets • DBLP: 274,788 Paper titles • 1,838,973 URLs

  25. Balance should be reached • Recall our two stages of filtration and verification

  26. Performance (DBLP)

  27. Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion

  28. Conclusion • Our method causes no false negatives • Our method achieves a good balance between the two phases of filtration and verification • We also propose EvITER to eliminate duplicate computation • Our method is both effective and efficient

  29. Thank You ! Q&A

  30. References • [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006. • [2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008. • [3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006. • [4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. • [5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. • [6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.

  31. References • [7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008. • [8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB, 2007. • [9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88, 2001. • [10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004. • [11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001. • [12] E. Sutinen and J. Tarhio. On using q-grams locations in approximate string matching. In ESA, pages 327-340, 1995. • [13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.
