Fast Indexes and Algorithms For Set Similarity Selection Queries • M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava
Strings as sets • s1 = “Main St. Maine”: • ‘Main’ ‘St.’ ‘Maine’ • ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ … • s2 = “Main St. Main”: • ‘Main’ ‘St.’ ‘Main’ • How similar are s1 and s2?
TF/IDF weighted similarity • Inverse Document Frequency (idf): • ‘Main’ is common • ‘Maine’ is not • idf(t) = log2[1 + N / df(t)], where N is the number of sets and df(t) is the number of sets containing t • Term Frequency (tf): • ‘Main’ appears twice in s2 • Similarity: • Inner Product
Is TF important? • Information retrieval: • Given a query string retrieve relevant documents • Relational databases: • Given a query string retrieve relevant strings • In practice TF is small in many applications
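IDF similarity • Query q = {t1, …, tn} • Set s = {r1, …, rm} • Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$ • $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$ • IDF is as good as TF/IDF in practice!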
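The definitions on this slide can be turned into a few lines of Python. This is only a minimal sketch: the function names (`idf_table`, `length`, `idf_similarity`) and the three-set toy database are assumptions made for illustration, and every token is assumed to be present in the idf table.

```python
import math
from collections import Counter

def idf_table(sets):
    """idf(t) = log2(1 + N / df(t)) for every token in a collection of sets."""
    n = len(sets)
    df = Counter(t for s in sets for t in set(s))
    return {t: math.log2(1 + n / d) for t, d in df.items()}

def length(s, idf):
    """len(s) = square root of the sum of squared idf values of the tokens in s."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(s)))

def idf_similarity(q, s, idf):
    """I(q, s): squared idf mass of the shared tokens, normalized by both lengths."""
    shared = set(q) & set(s)
    return sum(idf[t] ** 2 for t in shared) / (length(q, idf) * length(s, idf))

# Hypothetical toy database of tokenized strings.
db = [{"Main", "St.", "Maine"}, {"Main", "St."}, {"Main", "Ave."}]
idf = idf_table(db)
self_sim = idf_similarity(db[0], db[0], idf)    # a set compared with itself
cross_sim = idf_similarity(db[1], db[2], idf)   # the two sets share only 'Main'
```

In this toy data ‘Main’ occurs in every set and so gets the smallest idf, while ‘Maine’ occurs once and gets a larger one, which mirrors the "common vs. not common" point made on the TF/IDF slide.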
How can I build an index? • Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$ • Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$ • So: • Decompose strings into tokens • Compute the idf of each token • Create one inverted list per token • Sort lists by string id: do a merge join • Sort lists by w: run TA/NRA
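The "sort lists by string id, then merge join" option can be sketched as follows. The helper names (`build_index`, `merge_join`), the hand-picked idf values, and the small floating-point tolerance are all assumptions of this sketch; query tokens are assumed to have an idf entry.

```python
import heapq
import math
from itertools import groupby

def build_index(db, idf):
    """One inverted list per token, holding (string id, w(t, s)) sorted by id."""
    index = {}
    for sid, s in enumerate(db):
        slen = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            index.setdefault(t, []).append((sid, idf[t] / slen))
    return index  # ids are appended in increasing order, so each list is sorted

def merge_join(q, index, idf, tau):
    """Merge the query's lists by string id and sum w(t, s) * w(t, q) per id."""
    qlen = math.sqrt(sum(idf[t] ** 2 for t in q))
    streams = [
        [(sid, w_ts * idf[t] / qlen) for sid, w_ts in index.get(t, [])]
        for t in q
    ]
    hits = {}
    for sid, group in groupby(heapq.merge(*streams), key=lambda p: p[0]):
        score = sum(part for _, part in group)
        if score >= tau - 1e-9:  # tolerance for floating-point rounding
            hits[sid] = score
    return hits

# Hypothetical idf values and database, chosen only for illustration.
idf = {"Main": 1.0, "St.": 1.3, "Maine": 2.0, "Ave.": 2.0}
db = [{"Main", "St.", "Maine"}, {"Main", "St."}, {"Main", "Ave."}]
index = build_index(db, idf)
hits = merge_join({"Main", "St."}, index, idf, tau=0.6)
```

The merge join touches every entry of every query list, which is the cost the sort-by-w strategies on the following slides try to avoid.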
Example: Sort by w • NRA: • Round robin list accesses • Main memory hash table • Computes lower and upper bounds per entry
Semantic properties of IDF • Order Preservation: • For all $t_1 \ne t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$ • Length Boundedness: • Query q, set s, threshold $\tau$ • $I(q, s) \ge \tau \Rightarrow \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau$
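Length Boundedness translates directly into a range restriction on each inverted list. Because w(t, s) = idf(t) / len(s), a list sorted by decreasing w is also sorted by increasing len(s), so the qualifying range can be located by binary search. The sketch below assumes a plain Python list of lengths in increasing order; `length_window` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

def length_window(lengths, qlen, tau):
    """Return the slice [lo, hi) of entries with tau*len(q) <= len(s) <= len(q)/tau.

    `lengths` must be sorted in increasing order.
    """
    lo = bisect_left(lengths, tau * qlen)    # first entry that is long enough
    hi = bisect_right(lengths, qlen / tau)   # one past the last entry that is short enough
    return lo, hi

# Hypothetical list of set lengths for one token, sorted increasingly.
lengths = [1.0, 2.0, 3.0, 4.0, 8.0]
loose = length_window(lengths, qlen=4.0, tau=0.5)   # bounds [2.0, 8.0]
tight = length_window(lengths, qlen=4.0, tau=0.8)   # bounds [3.2, 5.0]
```

A higher threshold narrows the window, so fewer list entries need to be read at all.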
Improved NRA (iNRA) • Order Preservation determines whether a given set can appear in a list • Example: list ti encounters s1, then s2; if list tk encounters s2 first, then s1 is not in tk • Length Boundedness restricts the search to a small portion of each list
Something surprising • Lemma: NRA reads arbitrarily more elements than iNRA • Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property
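Any other strategies? • NRA style is breadth-first • Try depth-first: • Sort query lists in decreasing idf order • Let q = {t1, …, tn} with $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \dots > \mathrm{idf}(t_n)$ • Let $\lambda_i$ be the maximum length a set s in list $t_i$ can have s.t. $I(q, s) \ge \tau$, assuming that s appears in all lists $t_k$ with $k \ge i$ • $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 / \left(\tau\,\mathrm{len}(q)\right)$ • $\lambda_i$ is a natural cutoff point • $\lambda_1 > \lambda_2 > \dots > \lambda_n$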
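The cutoffs are suffix sums of squared idf values divided by τ·len(q), so they can be computed in one pass. The following is a small sketch; `cutoffs` is a hypothetical name and the idf values in the example are made up.

```python
import math

def cutoffs(query_idfs, tau):
    """Return [lambda_1, ..., lambda_n] for a query given its tokens' idf values."""
    idfs = sorted(query_idfs, reverse=True)        # decreasing idf order
    qlen = math.sqrt(sum(x * x for x in idfs))     # len(q)
    suffix = sum(x * x for x in idfs)              # sum of idf(t_k)^2 over k >= i
    lams = []
    for x in idfs:
        lams.append(suffix / (tau * qlen))
        suffix -= x * x                            # drop t_i before moving to t_(i+1)
    return lams

lams = cutoffs([1.0, 3.0, 2.0], tau=0.5)  # made-up idf values
```

Note that the first cutoff equals len(q) / τ, which is exactly the upper bound given by Length Boundedness, and that each later list has a strictly smaller cutoff.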
Shortest-First • Sort q = {t1, …, tn} in decreasing idf order • Let C be the candidate set • For 1 <= i <= n • Skip to the first entry with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$ • Compute $\lambda_i$ • Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$ • Repeat • s = pop next element from ti • Maintain lower/upper bounds of entries in C • Until $\mathrm{len}(s) > \max(\max_{c \in C} \mathrm{len}(c), \lambda_i)$
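One possible reading of the Shortest-First steps is sketched below. It is not the authors' implementation: the index layout (lists of `(len(s), id)` pairs sorted by length), the `EPS` tolerance, the pruning of candidates by upper bound after each list, and the toy data are all assumptions made to keep the example self-contained.

```python
import math
from bisect import bisect_left

EPS = 1e-9  # tolerance for floating-point comparisons (assumption of this sketch)

def build_length_sorted_index(db, idf):
    """One list per token holding (len(s), id), sorted by increasing length."""
    index = {}
    for sid, s in enumerate(db):
        slen = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            index.setdefault(t, []).append((slen, sid))
    for postings in index.values():
        postings.sort()
    return index

def shortest_first(q, index, idf, tau):
    tokens = sorted(set(q), key=lambda t: idf[t], reverse=True)
    qlen = math.sqrt(sum(idf[t] ** 2 for t in tokens))
    remaining = sum(idf[t] ** 2 for t in tokens)  # idf^2 mass of lists not yet read
    cand = {}                                     # id -> [len(s), matched idf^2 mass]
    for t in tokens:
        lam = min(remaining / (tau * qlen), qlen / tau) + EPS
        stop = max(max((c[0] for c in cand.values()), default=0.0), lam)
        postings = index.get(t, [])
        pos = bisect_left(postings, (tau * qlen - EPS,))  # skip sets that are too short
        while pos < len(postings) and postings[pos][0] <= stop:
            slen, sid = postings[pos]
            if sid in cand:
                cand[sid][1] += idf[t] ** 2       # raise the lower bound
            elif slen <= lam:
                cand[sid] = [slen, idf[t] ** 2]   # new candidate within the cutoff
            pos += 1
        remaining = max(remaining - idf[t] ** 2, 0.0)
        # Drop candidates whose upper bound can no longer reach the threshold.
        cand = {sid: c for sid, c in cand.items()
                if (c[1] + remaining) / (c[0] * qlen) >= tau - EPS}
    return {sid: c[1] / (c[0] * qlen) for sid, c in cand.items()}

# Made-up idf values and sets, for illustration only.
idf = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 1.5}
db = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"b", "c", "d"}, {"a"}]
index = build_length_sorted_index(db, idf)
hits = shortest_first({"a", "b", "c"}, index, idf, tau=0.7)
```

In this sketch the matched mass of a candidate acts as its lower bound and the matched mass plus the unread mass as its upper bound. Each list is read depth-first, one after the other, and reading stops as soon as the entries are longer than both the cutoff and the longest candidate still in C.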
Comparison with NRA • Lemma: Let q = {t1, …, tn} and let d be the maximum depth to which SF descends over all lists. In the worst case iNRA will read (d – 1)(n – 1) more elements than SF • But surprisingly
A hybrid strategy • Run iNRA normally • Use $\lambda_i$ and the maximum length in C to stop reading from a particular list • This guarantees that iNRA stops with or before SF • Drawback of NRA variants: • Very high bookkeeping cost compared to SF
Experiments • DBLP, IMDB and YellowPages datasets • Actors, movies, authors, businesses etc. • Vary threshold, query size, query strings and mistakes • Test wall-clock time, pruning power • Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based
Wall-clock time vs. Query size • [Plot; series: TA, SF, NRA, Sort-by-id, iTA]
Conclusion • Proposed a simplified TF/IDF measure • Identified strong monotonicity properties • Used the properties to design efficient algorithms • SF works best overall in practice • Achieves sub-second answers in most practical cases
Pruning power vs. Query size • [Plot; series: iTA, TA, NRA]