Fast Indexes and Algorithms For Set Similarity Selection Queries

Fast Indexes and Algorithms For Set Similarity Selection Queries. M. Hadjieleftheriou Chandel N. Koudas D. Srivastava. Strings as sets. s 1 = “Main St. Maine”: ‘Main’ ‘St.’ ‘Maine’ ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ … s 2 = “Main St. Main”: ‘Main’ ‘St.’ ‘Main’


Presentation Transcript

  1. Fast Indexes and Algorithms For Set Similarity Selection Queries • M. Hadjieleftheriou, Chandel, N. Koudas, D. Srivastava

  2. Strings as sets • s1 = “Main St. Maine”: • Words: ‘Main’ ‘St.’ ‘Maine’ • 3-grams: ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ … • s2 = “Main St. Main”: • Words: ‘Main’ ‘St.’ ‘Main’ • How similar are s1 and s2?
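A minimal sketch of the two decompositions on this slide, assuming whitespace-separated words and overlapping 3-grams over the raw string. The helper names `words` and `qgrams` are illustrative and do not come from the talk.

```python
def words(s):
    """Split a string into whitespace-separated word tokens."""
    return s.split()


def qgrams(s, q=3):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
word_tokens = words(s1)
gram_tokens = qgrams(s1)
```

Whether duplicates are kept (a bag) or dropped (a set) is a separate choice; this sketch keeps them as a list.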

  3. TF/IDF weighted similarity • Inverse Document Frequency (idf): • ‘Main’ is common • ‘Maine’ is not • idf(t) = log2[1 + N / df(t)] • Term Frequency (tf): • ‘Main’ appears twice in s2 • Similarity: • Inner Product
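The idf formula can be sketched as follows, assuming N is the number of sets in the collection and df(t) is the number of sets that contain token t. The toy collection and the function name `idf_table` are made up for illustration.

```python
import math
from collections import Counter


def idf_table(token_sets):
    """Compute idf(t) = log2(1 + N / df(t)) for every token in the collection."""
    n = len(token_sets)
    # df(t): number of sets containing t (each set counts once per token)
    df = Counter(t for s in token_sets for t in set(s))
    return {t: math.log2(1 + n / df[t]) for t in df}


# Hypothetical collection: 'Main' is common, 'Maine' is rare.
collection = [
    {"Main", "St.", "Maine"},
    {"Main", "St."},
    {"Main", "Ave."},
    {"Elm", "St."},
]
idf = idf_table(collection)
```

In this toy collection the rare token 'Maine' receives a larger weight than the common token 'Main', which is the behaviour the slide describes.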

  4. Is TF important? • Information retrieval: • Given a query string retrieve relevant documents • Relational databases: • Given a query string retrieve relevant strings • In practice TF is small in many applications

  5. IDF similarity • Query q = {t1, …, tn} • Set s = {r1, …, rm} • Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$ • $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / (\mathrm{len}(s)\,\mathrm{len}(q))$ • IDF is as good as TF/IDF in practice!
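A small sketch of the IDF similarity, assuming len(s) is the square root of the sum of squared idf values of the tokens in s, and I(q, s) sums squared idf values over the tokens shared by q and s. The idf values below are made up for illustration.

```python
import math


def length(s, idf):
    """len(s): Euclidean norm of the idf weights of the tokens in s."""
    return math.sqrt(sum(idf[t] ** 2 for t in s))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf mass of shared tokens, normalized by both lengths."""
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (length(q, idf) * length(s, idf))


idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}  # hypothetical weights
q = {"Main", "St."}
s = {"Main", "St.", "Maine"}
score = idf_similarity(q, s, idf)  # 2 / (sqrt(2) * sqrt(6))
```

A set compared with itself scores 1, and adding a rare token to s lowers the score because len(s) grows.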

  6. How can I build an index? • Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$ • Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$ • So • Decompose strings into tokens • Compute the idf of each token • Create one inverted list per token • Sort lists by string id: Do a merge join • Sort lists by w: Run TA/NRA
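The sort-by-id option can be sketched as below, assuming one inverted list per token holding (string id, w(t, s)) pairs in increasing id order. The function names, the toy data, and the use of `heapq.merge` as the merge join are illustrative choices, not the talk's implementation.

```python
import heapq
import math
from itertools import groupby


def build_index(strings, idf):
    """One inverted list per token; entries are (string id, w(t, s))."""
    index = {}
    for sid, s in enumerate(strings):
        len_s = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            # ids are appended in increasing order, so each list is sorted by id
            index.setdefault(t, []).append((sid, idf[t] / len_s))
    return index


def merge_join(q, index, idf, tau):
    """Merge the query's lists by id and sum w(t, s) * w(t, q) per string."""
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    lists = [[(sid, w * idf[t] / len_q) for sid, w in index.get(t, [])] for t in q]
    merged = heapq.merge(*lists, key=lambda e: e[0])
    scores = {sid: sum(c for _, c in grp)
              for sid, grp in groupby(merged, key=lambda e: e[0])}
    return {sid: sc for sid, sc in scores.items() if sc >= tau}


idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Elm": 2.0}  # hypothetical weights
strings = [{"Main", "St.", "Maine"}, {"Main", "St."}, {"Elm", "St."}]
index = build_index(strings, idf)
hits = merge_join({"Main", "St."}, index, idf, tau=0.5)
```

The merge join reads every entry of every query list, which is the cost that the sort-by-w algorithms try to avoid.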

  7. Example: Sort by id

  8. Example: Sort by w • NRA: • Round robin list accesses • Main memory hash table • Computes lower and upper bounds per entry
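A simplified NRA-style sketch for a threshold query, under these assumptions: each query list holds (string id, contribution) pairs sorted by decreasing contribution, the lower bound of an entry is the sum of contributions seen so far, and the upper bound adds the last value read from every list where the entry has not been seen. The function name `nra`, the stopping test, and the numbers are illustrative.

```python
def nra(lists, tau):
    """Round-robin sorted access with per-entry lower and upper bounds."""
    pos = [0] * len(lists)          # next unread position per list
    frontier = [0.0] * len(lists)   # bound on any unread contribution per list
    lower, seen_in = {}, {}         # the main-memory hash table

    def upper(sid):
        return lower[sid] + sum(f for i, f in enumerate(frontier)
                                if i not in seen_in[sid])

    while True:
        progressed = False
        for i, lst in enumerate(lists):
            if pos[i] < len(lst):
                sid, c = lst[pos[i]]
                pos[i] += 1
                progressed = True
                lower[sid] = lower.get(sid, 0.0) + c
                seen_in.setdefault(sid, set()).add(i)
                # an exhausted list cannot contribute anything further
                frontier[i] = c if pos[i] < len(lst) else 0.0
            else:
                frontier[i] = 0.0
        unseen_possible = sum(frontier) >= tau
        undecided = any(lower[s] < tau <= upper(s) for s in lower)
        if not progressed or not (unseen_possible or undecided):
            break
    return {s for s in lower if lower[s] >= tau}, sum(pos)


# Hypothetical contributions w(t, s) * w(t, q) for a two-token query.
lists = [
    [(1, 0.5), (0, 0.29), (3, 0.1), (4, 0.05)],
    [(1, 0.5), (0, 0.29), (2, 0.22), (5, 0.05)],
]
result, reads = nra(lists, tau=0.5)
```

In this toy run the scan stops after three rounds, before the last entry of each list is read, because no unseen or partially seen entry can still reach the threshold.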

  9. Semantic properties of IDF • Order Preservation: • For all t1 ≠ t2: if w(t1, s) < w(t1, r), then w(t2, s) < w(t2, r) • Length Boundedness: • Query q, set s, threshold $\tau$ • $I(q, s) \ge \tau \Rightarrow \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau$
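Length Boundedness follows directly from the definition of I(q, s). A short derivation, writing $\tau$ for the threshold and using only the fact that the shared tokens are a subset of both q and s:

```latex
\begin{align*}
\sum_{t \in q \cap s} \mathrm{idf}(t)^2
  &\le \min\bigl(\mathrm{len}(q)^2,\ \mathrm{len}(s)^2\bigr) \\
I(q, s) &\le \frac{\mathrm{len}(q)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
         = \frac{\mathrm{len}(q)}{\mathrm{len}(s)},
\qquad
I(q, s) \le \frac{\mathrm{len}(s)^2}{\mathrm{len}(s)\,\mathrm{len}(q)}
         = \frac{\mathrm{len}(s)}{\mathrm{len}(q)} \\
I(q, s) \ge \tau
  &\;\Longrightarrow\;
  \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \frac{\mathrm{len}(q)}{\tau}
\end{align*}
```

Order Preservation has a similar one-line reason: within any list, $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$ depends on s only through len(s), so every list orders its sets by length.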

  10. Improved NRA • Order Preservation determines if a given set appears in a list or not • Example: list ti yields s1, then s2; if list tk yields s2 first, then s1 does not appear in tk • Length Boundedness restricts the search to a small portion of each list

  11. Something surprising • Lemma: NRA reads arbitrarily more elements than iNRA • Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property

  12. Any other strategies? • NRA style is breadth-first • Try depth-first: • Sort query lists in decreasing idf order • Let q = {t1, …, tn} and idf(t1) > idf(t2) > … > idf(tn) • Let $\lambda_i$ be the maximum length a set s in ti can have s.t. $I(q, s) \ge \tau$, assuming that s exists in all lists tk with k > i • $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 / (\tau\,\mathrm{len}(q))$ • $\lambda_i$ is a natural cutoff point • $\lambda_1 > \lambda_2 > \dots > \lambda_n$
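The per-list cutoffs can be computed with one backward pass over the query's idf values. This is a sketch under the assumption that the cutoff for list i is the suffix sum of squared idf values from position i onward, divided by the threshold times len(q). The names `cutoffs`, `lam`, and `tau` are illustrative.

```python
import math


def cutoffs(q_idfs, tau):
    """q_idfs: idf values of the query tokens in decreasing order.

    Returns one cutoff per list: suffix sum of idf^2 / (tau * len(q)).
    """
    len_q = math.sqrt(sum(v ** 2 for v in q_idfs))
    lam, suffix = [], 0.0
    for v in reversed(q_idfs):
        suffix += v ** 2
        lam.append(suffix / (tau * len_q))
    return lam[::-1]


q_idfs = [3.0, 2.0, 1.0]  # hypothetical idf values, already sorted
lam = cutoffs(q_idfs, tau=0.5)
```

The first cutoff equals len(q) divided by the threshold, because the full suffix sum is len(q) squared; it therefore coincides with the upper end of the Length Boundedness range.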

  13. Shortest-First • Sort q = {t1, …, tn} in decreasing idf order • Let C be the candidate set • For 1 <= i <= n • Skip to first entry with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$ • Compute $\lambda_i$ • Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$ • Repeat • s = pop next element from ti • Maintain lower/upper bounds of entries in C • Until $\mathrm{len}(s) > \max(\max_{c \in C} \mathrm{len}(c), \lambda_i)$
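A compact sketch of the Shortest-First loop, under several assumptions that are not stated on the slide: each inverted list stores (len(s), string id) pairs sorted by increasing length, the skip step uses binary search, the stopping length is fixed once per list, and candidates whose upper bound falls below the threshold are pruned after each list. All names and the toy data are illustrative.

```python
import bisect
import math


def build_index(strings, idf):
    """One inverted list per token; entries (len(s), id) sorted by length."""
    index = {}
    for sid, s in enumerate(strings):
        len_s = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            index.setdefault(t, []).append((len_s, sid))
    for lst in index.values():
        lst.sort()
    return index


def shortest_first(q, index, idf, tau):
    tokens = sorted(q, key=lambda t: (-idf[t], t))  # decreasing idf
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    # suffix[i]: sum of idf^2 over tokens i..n-1 (mass still obtainable)
    suffix = [0.0] * (len(tokens) + 1)
    for i in range(len(tokens) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + idf[tokens[i]] ** 2
    cand = {}  # id -> [len(s), matched idf^2 mass so far]
    for i, t in enumerate(tokens):
        lst = index.get(t, [])
        lam = min(suffix[i] / (tau * len_q), len_q / tau)
        stop = max(max((e[0] for e in cand.values()), default=0.0), lam)
        start = bisect.bisect_left(lst, (tau * len_q,))  # skip short sets
        for len_s, sid in lst[start:]:
            if len_s > stop:
                break
            if sid in cand:
                cand[sid][1] += idf[t] ** 2
            elif len_s <= lam:  # a new set can only qualify below the cutoff
                cand[sid] = [len_s, idf[t] ** 2]
        # drop candidates whose best possible score is below the threshold
        cand = {sid: e for sid, e in cand.items()
                if (e[1] + suffix[i + 1]) / (e[0] * len_q) >= tau}
    return {sid: e[1] / (e[0] * len_q) for sid, e in cand.items()}


idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Elm": 2.0}  # hypothetical weights
strings = [{"Main", "St.", "Maine"}, {"Main", "St."}, {"Elm", "St."}]
index = build_index(strings, idf)
hits = shortest_first({"Main", "St."}, index, idf, tau=0.5)
```

Each list is scanned to its stopping length before the next list is opened, which is the depth-first behaviour that distinguishes this strategy from round-robin access.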

  14. Comparison with NRA • Lemma: Let q = {t1, …, tn} and d the maximum depth SF descends over all lists. In the worst case iNRA will read (d – 1)(n – 1) elements more than SF • But surprisingly

  15. A hybrid strategy • Run iNRA normally • Use $\lambda_i$ and max len C to stop reading from a particular list • This guarantees that iNRA stops with or before SF • Drawback of NRA variants: • Very high bookkeeping cost compared to SF

  16. Experiments • DBLP, IMDB and YellowPages datasets • Actors, movies, authors, businesses etc. • Vary threshold, query size, query strings and mistakes • Test wall-clock time, pruning power • Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based

  17. Wall-clock time vs. Threshold

  18. Wall-clock time vs. Query size • Plot; series shown: TA, SF, NRA, Sort-by-id, iTA

  19. Space

  20. Conclusion • Proposed a simplified TF/IDF measure • Identified strong monotonicity properties • Used the properties to design efficient algorithms • SF works best overall in practice • Achieves sub-second answers in most practical cases

  21. Q&A

  22. Pruning power vs. Threshold

  23. Pruning power vs. Query size • Plot; series shown: iTA, TA, NRA
