Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists Jialong Han Co-authored with Jiaheng Lu, Xiaofeng Meng Renmin University of China
Introduction: An Example • A dictionary of strings we are interested in • E.g. product names, postal addresses… • We want to locate their “approximate appearances” in a series of documents. • The meaning of “approximate appearance” is illustrated by the following example:
Problem Definition • Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). • Here we call r a piece of evidence for m. • Similarity() is a function measuring the similarity of two strings • Strings are viewed as sets of tokens (words) • An example for Sim(): Jaccard similarity
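The Jaccard formula itself did not survive on this slide. As a minimal sketch, assuming the standard definition |r ∩ m| / |r ∪ m| over token sets, similarity could be computed as below. The function name `jaccard` and the optional `weights` map (for a weighted variant) are illustrative assumptions, not taken from the talk.

```python
def jaccard(r_tokens, m_tokens, weights=None):
    """Jaccard similarity of two token collections, viewed as sets.

    If `weights` (token -> weight) is given, set sizes are replaced by
    summed token weights; tokens missing from the map count as 1.0.
    """
    r, m = set(r_tokens), set(m_tokens)

    def wt(tokens):
        if weights is None:
            return float(len(tokens))
        return float(sum(weights.get(t, 1.0) for t in tokens))

    union = wt(r | m)
    return wt(r & m) / union if union else 0.0


r = "canon eos 5d digital camera".split()
m = "canon eos digital camera".split()
print(jaccard(r, m))  # 4 shared tokens out of 5 distinct ones
```

With δ = 0.8, this pair would just qualify under the unweighted definition; under a weighted definition the outcome depends on the token weights.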
Outline • Introduction • State-of-the-art techniques • The filtration-verification framework • K-signature scheme • Inverted Signature-based Hashtable • Our algorithms and evaluations • Conclusion
Why pre-pruning is needed • We need to spot evidence to decide whether a substring m should be extracted • Simple verification against all dictionary strings may be inefficient • Pre-pruning followed by post-verification is beneficial • But should it be running-speed-oriented or filtering-power-oriented? • Less time or fewer survivors?
The issue of compromise comes again • Balance between the two stages should be reached: • Strong (weak) filtration power costs more (less) filtration time • Fewer (more) candidates cost less (more) verification time • Overall performance = Tf + Tv
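The two stages can be sketched as follows. This is only an illustration of the filtration-verification idea, not the method of the talk: the filter used here (keep any dictionary string sharing at least one token with m) and the names `filter_then_verify` and `jaccard` are assumptions chosen for simplicity.

```python
def jaccard(a, b):
    """Unweighted Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def filter_then_verify(m, dictionary, delta):
    """Return (candidates, evidence) for substring m."""
    m_tokens = set(m.split())
    # Filtration: a cheap test that may keep non-matching strings but, for
    # delta > 0, never drops a true match (a match must share a token with m).
    candidates = [r for r in dictionary if m_tokens & set(r.split())]
    # Verification: the exact similarity check, run only on the survivors.
    evidence = [r for r in candidates if jaccard(r.split(), m_tokens) >= delta]
    return candidates, evidence


R = ["canon eos 5d digital camera", "nikon digital slr camera", "canon slr camera"]
candidates, evidence = filter_then_verify("canon eos digital camera", R, 0.8)
print(len(candidates), evidence)
```

This deliberately weak filter lets every dictionary string through in the example, so all the work lands on verification; a stronger filter shifts time from Tv to Tf.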
K-signature scheme • Proposed by Chakrabarti et al. (SIGMOD 2008) • Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s) • Observation: if r cannot match m, r is likely to have insufficient signature overlap with m • K is a parameter for tuning filtration power • Potential evidence loss • A counter-example was found for k = 3 • We could only prove that the scheme works for k = 1 and k = ∞
Inverted Signature-based Hashtable • Proposed by Chakrabarti et al. (SIGMOD 2008) • Each dictionary string is encoded into a solid 0-1 matrix • A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle) • Bitwise-OR all solid matrices to get the matrix of R • Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R. • Formalized into an NP-complete problem • The resulting solution has too weak filtering power
Outline • Introduction • State-of-the-art techniques • Our algorithms and evaluations • Corrected filtering conditions • EvSCAN: Filtration by SIL • EvITER: Incremental optimization on EvSCAN • Supporting Dynamic Thresholds • Conclusion
Our proposed theorem • If Sim(m, r) ≥ δ, what do we have? • wt(Sig(m) ∩ Sig(r)) ≥ τ(m): too strict! • wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}: proved by us • So the threshold does not remain constant • It involves the unknown evidence r • Our solution: use inverted lists to count sig-token overlaps. • Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists • Lists indexed by sig-tokens • Each sig-token of a string creates a node (containing the string’s id) in the corresponding list. • E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }. • wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9). • Lists in the figure (sig-token, weight: string ids): 5d, 9.0: 1 • canon, 2.0: 1, 3 • camera, 1.0: 2 • eos, 7.0: 1 • nikon, 2.0: 2 • slr, 2.0: 2, 3
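A minimal sketch of how such lists might be built, assuming the signature set Sig(r) of every dictionary string is already known. The signature sets below are read off the slide's figure and should be treated as an assumption; how Sig(r) is derived from δ is not shown here, and `build_sil` is a hypothetical name.

```python
from collections import defaultdict


def build_sil(signatures):
    """Build signature-based inverted lists: sig-token -> list of string ids."""
    sil = defaultdict(list)
    for rid in sorted(signatures):        # visit strings in id order
        for token in signatures[rid]:     # one node per sig-token of the string
            sil[token].append(rid)
    return dict(sil)


# Assumed signature sets for r1, r2, r3 (given as input, not computed here).
signatures = {
    1: ["5d", "eos", "canon"],
    2: ["nikon", "slr", "camera"],
    3: ["canon", "slr"],
}
sil = build_sil(signatures)
print(sil["canon"])  # [1, 3]
```

Tokens that are nobody's sig-token (here "digital") get no list at all, which keeps the index small.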
Filtration by SIL • Using an array called “accumulator” to compute the overlapped sig weight wt(Sig(m) ∩ Sig(r)) • E.g. m = “canon eos digital camera”, δ = 0.8 • [Figure: the signature-based inverted lists (5d, 9.0; canon, 2.0; camera, 1.0; eos, 7.0; nikon, 2.0; slr, 2.0) are scanned into the accumulator, which shows the values 2.0, 9.0, 0 and 2.0; one entry is marked “Qualified!”]
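The accumulator step might look like the sketch below. Several things are assumptions made for illustration: the contents of the lists, the choice Sig(m) = {canon, eos}, and the `threshold` callback, which stands in for the bound min{τ(m), τ(r)} whose definition is not given on the slides. The constant 5.0 used in the example is purely hypothetical.

```python
def filter_by_sil(sil, weights, sig_m, threshold):
    """Scan the lists of m's sig-tokens and accumulate overlapped sig weight.

    Returns (accumulator, candidates): accumulator maps a string id to
    wt(Sig(m) ∩ Sig(r)); candidates are the ids whose accumulated weight
    reaches threshold(rid).
    """
    accumulator = {}
    for token in set(sig_m):
        for rid in sil.get(token, []):    # every node in the token's list
            accumulator[rid] = accumulator.get(rid, 0.0) + weights[token]
    candidates = {rid for rid, w in accumulator.items() if w >= threshold(rid)}
    return accumulator, candidates


# Assumed index contents and weights.
sil = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}
weights = {"5d": 9.0, "canon": 2.0, "camera": 1.0,
           "eos": 7.0, "nikon": 2.0, "slr": 2.0}

acc, cands = filter_by_sil(sil, weights, ["canon", "eos"], lambda rid: 5.0)
print(acc, cands)
```

Only the lists of m's own sig-tokens are touched, so dictionary strings that share no sig-token with m never appear in the accumulator.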
EvITER: Progressive Computation • Recall that we are checking all substrings • Some of them are quite similar, so they share duplicate computation • An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r • Formally, we proved the following • Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t} • We have ES(m·t) ⊆ ES(m) ∪ list[t]
Example • Document M: “… canon eos digital camera lens …”, with m = “canon eos digital camera” and t = “lens” • ES(m) = {r1}; list[lens] (weight 3.0) = {r22, r53} • We know that only r1, r22, r53 can possibly match “canon eos digital camera lens”
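The containment ES(m·t) ⊆ ES(m) ∪ list[t] suggests an incremental step like the sketch below: candidates for the extended substring are sought only in the union, never in the whole dictionary. The name `extend_evidence` and the `still_qualifies` callback (a placeholder for whatever filtering condition is applied to m·t) are hypothetical.

```python
def extend_evidence(es_m, list_t, still_qualifies):
    """Compute the potential evidence of m·t from that of m.

    es_m:            ids of potential evidence for m
    list_t:          ids of dictionary strings in the list of token t
    still_qualifies: predicate rid -> bool, re-checked against m·t
    """
    pool = set(es_m) | set(list_t)        # the only ids that can survive
    return {rid for rid in pool if still_qualifies(rid)}


# Values from the example slide: ES(m) = {r1}, list[lens] = {r22, r53}.
es_next = extend_evidence({1}, [22, 53], lambda rid: True)
print(sorted(es_next))  # [1, 22, 53]
```

With a permissive predicate the result is the full union, which is the upper bound stated on the slide; a real filtering condition can only shrink it.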
Flow of Evidence • EvITER for “Evidence ITERATION” …
The Static Threshold Problem • How does this index work so far? • - “Get ready for δ = 0.8 please.” • - “Please wait 30 min for index generation…” • - “Ready!” • - “Document M1, δ = 0.8. Go!” • - “…Extraction complete.” • - “Document M2, and I want δ = 0.9…” • - “Sorry, please wait another 30 min for index regeneration…” • - “:-(”
The Static Threshold Problem • This one seems better • - “Get ready for δ ≥ 0.8 please.” • - “Please wait 30 min for index generation…” • - “Ready!” • - “Document M1, δ = 0.8. Go!” • - “…Extraction complete.” • - “Document M2, and I want δ = 0.9…” • - “…Extraction complete.” • - “:-)”
Supporting Dynamic Thresholds • An observation • As δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking. • I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>. • We record u<sig-token, rid> in each node and sort the nodes of each list in descending order of their u values. • For any given δ, we only need to retrieve a prefix of each list to get all “active” nodes
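Prefix retrieval on one list could be sketched as follows. The node layout `(u, rid)`, the function name `active_prefix`, the u values in the example, and the strict comparison δ < u are all assumptions for illustration; the slides only state that a node is active when δ is below its u.

```python
from itertools import takewhile


def active_prefix(nodes, delta):
    """Return the ids of the active nodes of one inverted list.

    nodes: list of (u, rid) pairs sorted by u in descending order.
    A node is treated as active when delta < u, so the active nodes
    always form a prefix of the list.
    """
    return [rid for u, rid in takewhile(lambda node: node[0] > delta, nodes)]


# Hypothetical list for one sig-token, already sorted by u descending.
nodes = [(0.95, 1), (0.90, 3), (0.85, 2), (0.70, 4)]
print(active_prefix(nodes, 0.8))  # [1, 3, 2]
print(active_prefix(nodes, 0.9))  # [1]
```

Because the scan stops at the first inactive node, one index built for the lowest supported δ can serve any higher threshold without regeneration.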
Experimental Datasets • DBLP: 274,788 Paper titles • 1,838,973 URLs
Balance should be reached • Recall our two stages of filtration and verification
Conclusion • Our method causes no false negatives • Our method achieves a good balance between the two phases of filtration and verification • We also propose EvITER to eliminate duplicate computation • Our method is both effective and efficient
Thank You ! Q&A
References • [1] A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006. • [2] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008. • [3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006. • [4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. • [5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. • [6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
References • [7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008. • [8] C. Li, B. Wang, X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB, 2007. • [9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001. • [10] S. Sarawagi, A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004. • [11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001. • [12] E. Sutinen and J. Tarhio. On using q-grams locations in approximate string matching. In ESA, pages 327-340, 1995. • [13] W. Wang, C. Xiao, X. Lin, C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.