Fast Indexes and Algorithms For Set Similarity Selection Queries

Fast Indexes and Algorithms For Set Similarity Selection Queries • M. Hadjieleftheriou, Chandel, N. Koudas, D. Srivastava

Strings as sets • s1 = “Main St. Maine”: • ‘Main’ ‘St.’ ‘Maine’ • ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ … • s2 = “Main St. Main”: • ‘Main’ ‘St.’ ‘Main’ • How similar are s1 and s2?
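The two decompositions on this slide (word tokens and overlapping 3-grams) can be sketched as follows. The function names `word_tokens` and `qgrams` are illustrative choices, not names from the talk.

```python
def word_tokens(s):
    """Split on whitespace; punctuation stays attached, as in 'St.'."""
    return s.split()


def qgrams(s, q=3):
    """All overlapping substrings of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

print(word_tokens(s1))
print(qgrams(s1)[:7])
# Tokens shared by the two strings when each is viewed as a set of words.
print(set(word_tokens(s1)) & set(word_tokens(s2)))
```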

TF/IDF weighted similarity • Inverse Document Frequency (idf): • ‘Main’ is common • ‘Maine’ is not • idf(t) = log2[1 + N / df(t)] • Term Frequency (tf): • ‘Main’ appears twice in s2 • Similarity: • Inner Product
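A minimal sketch of the idf formula above, where N is the number of strings in the collection and df(t) is the number of strings containing token t. The four-string collection is made up for illustration only.

```python
import math
from collections import Counter


def idf_table(collection):
    """idf(t) = log2(1 + N / df(t)) for every token in the collection."""
    n = len(collection)
    # df counts each token at most once per string, hence set(tokens).
    df = Counter(t for tokens in collection for t in set(tokens))
    return {t: math.log2(1 + n / d) for t, d in df.items()}


# Hypothetical toy collection (not data from the talk).
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
    ["Park", "St."],
]
idf = idf_table(collection)
tf_main_in_s2 = Counter(collection[1])["Main"]  # term frequency of 'Main' in s2

print(idf["Main"], idf["Maine"], tf_main_in_s2)
```

In this toy collection the rare token 'Maine' receives a higher idf than the common token 'Main', which is the behaviour the slide describes.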

Is TF important? • Information retrieval: • Given a query string retrieve relevant documents • Relational databases: • Given a query string retrieve relevant strings • In practice TF is small in many applications

IDF similarity • Query q = {t1, …, tn} • Set s = {r1, …, rm} • Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$ • $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$ • IDF is as good as TF/IDF in practice!
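The length and similarity definitions above translate directly into code. This is a sketch under the assumption that idf values are already available in a dictionary; the values used below are invented for illustration.

```python
import math


def length(s, idf):
    """len(s): square root of the sum of squared idf values of the tokens in s."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(s)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf of the shared tokens, normalised by both lengths."""
    common = set(q) & set(s)
    return sum(idf[t] ** 2 for t in common) / (length(q, idf) * length(s, idf))


# Hypothetical idf values (assumed, not from the talk).
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}
q = {"Main", "St.", "Maine"}
s = {"Main", "St."}

print(idf_similarity(q, s, idf))
print(idf_similarity(q, q, idf))
```

A set compared with itself scores 1 (up to floating-point rounding), and two sets with no common token score 0.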

How can I build an index? • Let w(t, s) = idf(t) / len(s) • Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$ • So • Decompose strings into tokens • Compute the idf of each token • Create one inverted list per token • Sort lists by string id: Do a merge join • Sort lists by w: Run TA/NRA
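The "sort by string id, then merge join" option can be sketched as below. The helper names `build_index` and `select_by_merge`, the threshold parameter `tau`, and the toy idf values are assumptions for illustration; the sketch also assumes every query token has an idf value.

```python
import heapq
import math
from collections import defaultdict
from itertools import groupby


def build_index(sets, idf):
    """One inverted list per token holding (string id, w(t, s)), sorted by id."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):  # ascending ids keep every list sorted by id
        len_s = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            index[t].append((sid, idf[t] / len_s))
    return index


def select_by_merge(q, index, idf, tau):
    """Merge the query's lists by id and keep the sets with I(q, s) >= tau."""
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    # Each stream carries the partial score w(t, s) * w(t, q) for one token.
    streams = [[(sid, w * idf[t] / len_q) for sid, w in index.get(t, [])]
               for t in q]
    results = []
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        score = sum(part for _, part in group)
        if score >= tau:
            results.append((sid, score))
    return results


# Hypothetical data (assumed, not from the talk).
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Park": 1.5, "Ave.": 1.5}
sets = [{"Main", "St.", "Maine"}, {"Main", "St."},
        {"Park", "Ave."}, {"Maine", "Park"}]
index = build_index(sets, idf)
q = {"Main", "St.", "Maine"}
print(select_by_merge(q, index, idf, 0.5))
```

The merge join reads every entry of every query list, which is the baseline the weight-sorted strategies try to improve on.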

Example: Sort by id

Example: Sort by w • NRA: • Round robin list accesses • Main memory hash table • Computes lower and upper bounds per entry
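A simplified sketch of an NRA-style threshold selection follows, assuming lists sorted by decreasing partial score w(t, s) w(t, q). It keeps a hash table of seen entries, reads the lists round robin, and stops once no unseen set can reach the threshold and every seen set is decided. The stopping rule, the function name `nra_select`, the parameter `tau`, and the data are assumptions of this sketch rather than the exact procedure from the talk.

```python
import math
from collections import defaultdict


def nra_select(q, sets, idf, tau):
    """Return (sorted ids with I(q, s) >= tau, number of list entries read)."""
    q = list(q)
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    lengths = [math.sqrt(sum(idf[t] ** 2 for t in s)) for s in sets]
    # One list per query token, sorted by decreasing partial score.
    lists = {
        t: sorted(((idf[t] ** 2 / (lengths[sid] * len_q), sid)
                   for sid, s in enumerate(sets) if t in s), reverse=True)
        for t in q
    }
    pos = {t: 0 for t in q}
    seen = defaultdict(dict)  # sid -> {token: partial score read so far}

    def frontier(t):
        """Largest partial score still unread in list t (0 if exhausted)."""
        return lists[t][pos[t]][0] if pos[t] < len(lists[t]) else 0.0

    def undecided(parts):
        lower = sum(parts.values())
        upper = lower + sum(frontier(t) for t in q if t not in parts)
        return lower < tau <= upper

    reads = 0
    while any(pos[t] < len(lists[t]) for t in q):
        for t in q:  # one round-robin pass over the lists
            if pos[t] < len(lists[t]):
                part, sid = lists[t][pos[t]]
                seen[sid][t] = part
                pos[t] += 1
                reads += 1
        unseen_upper = sum(frontier(t) for t in q)
        if unseen_upper < tau and not any(undecided(p) for p in seen.values()):
            break
    answers = sorted(sid for sid, p in seen.items() if sum(p.values()) >= tau)
    return answers, reads


# Hypothetical data (assumed, not from the talk).
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Park": 1.5, "Ave.": 1.5}
sets = [{"Main", "St.", "Maine"}, {"Main", "St."},
        {"Park", "Ave."}, {"Maine", "Park"}]
q = {"Main", "St.", "Maine"}
print(nra_select(q, sets, idf, 0.5))
```

Recomputing the bounds of every seen entry after each round is what makes the bookkeeping of NRA-style algorithms expensive.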

Semantic properties of IDF • Order Preservation: • For all t1 ≠ t2: if w(t1, s) < w(t1, r), then w(t2, s) < w(t2, r) • Length Boundedness: • Query q, set s, threshold τ • $I(q, s) \ge \tau \Rightarrow \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau$
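Because w(t, s) = idf(t) / len(s), a list sorted by decreasing w is also sorted by increasing length, so Length Boundedness picks out one contiguous window of each list. A minimal sketch using binary search, with `length_window` and `tau` (the threshold) as illustrative names and made-up lengths:

```python
import bisect


def length_window(sorted_lengths, len_q, tau):
    """Index range [lo, hi) of entries with tau*len_q <= len(s) <= len_q/tau."""
    lo = bisect.bisect_left(sorted_lengths, tau * len_q)
    hi = bisect.bisect_right(sorted_lengths, len_q / tau)
    return lo, hi


# Hypothetical list of set lengths in increasing order (assumed values).
sorted_lengths = [0.5, 1.0, 1.5, 2.0, 3.0, 5.0]
lo, hi = length_window(sorted_lengths, len_q=2.0, tau=0.5)
print(sorted_lengths[lo:hi])
```

Entries outside the window cannot reach the threshold and never need to be read.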

Improved NRA • Order Preservation determines if a given set appears in a list or not • ti: encounter s1, then s2 • tk: encounter s2 first, so s1 cannot appear in tk • Length Boundedness restricts the search to a small portion of the lists

Something surprising • Lemma: NRA can read arbitrarily more elements than iNRA • Lemma: NRA can read arbitrarily more elements than any algorithm that uses the Length Boundedness property

Any other strategies? • NRA style is breadth-first • Try depth-first: • Sort query lists in decreasing idf order • Let q = {t1, …, tn} and idf(t1) > idf(t2) > … > idf(tn) • Let $\lambda_i$ be the maximum length a set s in ti can have s.t. $I(q, s) \ge \tau$, assuming that s exists in all tk > ti • $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 / \left(\tau\,\mathrm{len}(q)\right)$ • $\lambda_i$ is a natural cutoff point • $\lambda_1 > \lambda_2 > \dots > \lambda_n$
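The per-list cutoffs can be computed with one pass over the suffix sums of squared idf values. In this sketch the cutoff for list i is written `lam[i]` and the threshold `tau`; both names, and the idf values, are assumptions made for illustration.

```python
import math


def cutoffs(query_idfs, tau):
    """Cutoff per query list, with lists ordered by decreasing idf.

    lam[i] = (sum of idf(t_k)^2 for k >= i) / (tau * len(q)), 0-based i.
    """
    idfs = sorted(query_idfs, reverse=True)
    len_q = math.sqrt(sum(v * v for v in idfs))
    lam = [0.0] * len(idfs)
    suffix = 0.0
    for i in range(len(idfs) - 1, -1, -1):
        suffix += idfs[i] ** 2
        lam[i] = suffix / (tau * len_q)
    return lam


# Hypothetical idf values of three query tokens (assumed).
lam = cutoffs([1.0, 3.0, 2.0], tau=0.5)
print(lam)
```

The first cutoff equals len(q) / tau, which matches the upper bound given by Length Boundedness, and each later cutoff is strictly smaller.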

Shortest-First • Sort q = {t1, …, tn} in decreasing idf order • Let candidate set C • For 1 <= i <= n • Skip to first entry with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$ • Compute $\lambda_i$ • Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$ • Repeat • s = pop next element from ti • Maintain lower/upper bounds of entries in C • Until $\mathrm{len}(s) > \max(\text{max len } C, \lambda_i)$
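The steps above can be sketched as follows, with lists sorted by increasing set length. This is one possible reading of the slide, not the authors' implementation: `lam` stands for the per-list cutoff, `tau` for the threshold, candidates are pruned once per list rather than continuously, and the data are invented.

```python
import bisect
import math


def shortest_first(q, sets, idf, tau):
    """Return (sorted ids with I(q, s) >= tau, number of list entries read)."""
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    lengths = [math.sqrt(sum(idf[t] ** 2 for t in s)) for s in sets]
    # Inverted lists of (len(s), id), sorted by increasing length.
    lists = {t: sorted((lengths[sid], sid)
                       for sid, s in enumerate(sets) if t in s) for t in q}
    order = sorted(q, key=lambda t: idf[t], reverse=True)
    # suffix[i] = sum of idf^2 over the lists order[i:], used for lam and bounds.
    suffix = [0.0] * (len(order) + 1)
    for i in range(len(order) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + idf[order[i]] ** 2

    cand = {}  # candidate set C: id -> lower bound (score accumulated so far)
    reads = 0
    for i, t in enumerate(order):
        lam = min(suffix[i] / (tau * len_q), len_q / tau)
        max_len_c = max((lengths[sid] for sid in cand), default=0.0)
        stop = max(max_len_c, lam)
        # Skip to the first entry with len(s) >= tau * len(q).
        start = bisect.bisect_left(lists[t], (tau * len_q,))
        for len_s, sid in lists[t][start:]:
            if len_s > stop:
                break
            reads += 1
            # New candidates are admitted only up to lam; known ones are updated.
            if sid in cand or len_s <= lam:
                cand[sid] = cand.get(sid, 0.0) + idf[t] ** 2 / (len_s * len_q)
        # Drop candidates whose upper bound can no longer reach tau.
        cand = {sid: low for sid, low in cand.items()
                if low + suffix[i + 1] / (lengths[sid] * len_q) >= tau}
    return sorted(cand), reads


# Hypothetical data (assumed, not from the talk).
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Park": 1.5, "Ave.": 1.5}
sets = [{"Main", "St.", "Maine"}, {"Main", "St."},
        {"Park", "Ave."}, {"Maine", "Park"}]
q = {"Main", "St.", "Maine"}
print(shortest_first(q, sets, idf, 0.5))
```

The only state kept per candidate is one accumulated score, which illustrates why the bookkeeping is lighter than in NRA-style algorithms. Scores that fall exactly on the threshold may be affected by floating-point rounding in this sketch.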

Comparison with NRA • Lemma: Let q = {t1, …, tn} and let d be the maximum depth to which SF descends over all lists. In the worst case, iNRA reads (d – 1)(n – 1) more elements than SF • But surprisingly

A hybrid strategy • Run iNRA normally • Use $\lambda_i$ and max len C to stop reading from a particular list • This guarantees that iNRA stops with or before SF • Drawback of NRA variants: • Very high bookkeeping cost compared to SF

Experiments • DBLP, IMDB and YellowPages datasets • Actors, movies, authors, businesses etc. • Vary threshold, query size, query strings and mistakes • Test wall-clock time, pruning power • Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based

Wall-clock time vs. Threshold

Wall-clock time vs. Query size (plot; legend: TA, SF, NRA, Sort-by-id, iTA)

Space

Conclusion • Proposed a simplified TF/IDF measure • Identified strong monotonicity properties • Used the properties to design efficient algorithms • SF works best overall in practice • Achieves sub-second answers in most practical cases

Q&A

Pruning power vs. Threshold

Pruning power vs. Query size (plot; legend: iTA, TA, NRA)
