Engineering a Set Intersection Algorithm for Information Retrieval
Alex Lopez-Ortiz, UNB / InterNAP
Joint work with Ian Munro and Erik Demaine
Overview
• Web Search Engine Basics
• Algorithms for set operations
• Theoretical Analysis
• Experimental Analysis
• Engineering an Improved Algorithm
• Conclusions
Web Search Engine Basics
• Crawl: sequential gathering process
• Document ID (DocID) for each web page
[Figure: four crawled pages with DocIDs 1 to 4. Page 1 lists "Cool sites: SIGIR, SIGACT, SIGCOMM"; pages 2, 3 and 4 are the SIGIR, SIGACT and SIGCOMM pages. URL shown: http://acm.org/home.html]
Indexing: list of entries of type <word, docID1, docID2, ...>
E.g.
• <cool, 1>
• <SIGACT, 1, 3>
• <SIGCOMM, 1, 4>
• <SIG, 1, 2, 3, 4>
Postings set: set of docIDs containing a word or pattern.
• SIGACT: {1, 3}
• SIGCOMM: {1, 4}
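As a rough illustration of the index entries and postings sets above, the following sketch builds a word index over four tiny pages. The page texts, `build_index`, and `pattern_postings` are hypothetical names and data chosen for this example; they only loosely mirror the pages in the figure.

```python
from collections import defaultdict

# Hypothetical page contents, loosely modeled on the four example pages.
pages = {
    1: "cool sites SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}

def build_index(pages):
    """Map each word to the sorted list of docIDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def pattern_postings(index, pattern):
    """Postings set for a pattern: docIDs of every word containing it."""
    ids = set()
    for word, doc_ids in index.items():
        if pattern in word:
            ids.update(doc_ids)
    return sorted(ids)

index = build_index(pages)
print(index["SIGACT"])                 # [1, 3]
print(index["SIGCOMM"])                # [1, 4]
print(pattern_postings(index, "SIG"))  # [1, 2, 3, 4]
```

Keeping each postings list sorted is what later makes merging and searching possible without extra work.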
Search Engine Basics (cont.)
Postings set stored implicitly/explicitly in a string matching data structure
• PAT tree/array
• Inverted word index
• Suffix trees
• KMP (grep)
...
String Matching Problem
• Different performance characteristics for each solution
• Time/Space tradeoff (empirical)
• Linear time/linear space lower bound [Demaine/L-O, SODA 2001]
Search Engine Basics (cont.)
A user query is of the form: keyword1 op keyword2 op … op keywordn, where each op is one of {and, or}
E.g. computer and science or internet
Evaluating a Boolean Query
The interpretation of a boolean query is the mapping:
• keyword → postings set
• and → ∩ (set intersection)
• or → ∪ (set union)
E.g. {computer} ∩ {science} ∪ {internet}
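A minimal sketch of this mapping follows. The postings sets are made-up docIDs, `evaluate` is a hypothetical helper, and the strict left-to-right evaluation order is an assumption of the sketch, since the slides do not state operator precedence.

```python
# Hypothetical postings sets; the docIDs are made up for illustration.
postings = {
    "computer": {1, 2, 5, 8},
    "science": {2, 3, 5},
    "internet": {4, 5, 9},
}

def evaluate(query, postings):
    """Evaluate 'kw1 op kw2 op ...' strictly left to right (an assumption)."""
    tokens = query.split()
    result = set(postings.get(tokens[0], set()))
    for op, keyword in zip(tokens[1::2], tokens[2::2]):
        other = postings.get(keyword, set())
        if op == "and":
            result &= other   # and -> set intersection
        elif op == "or":
            result |= other   # or -> set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

print(evaluate("computer and science or internet", postings))  # [2, 4, 5, 9]
```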
Set Operations for Web Search Engines
• Average postings set size > 10 million
• Postings sets are sorted
Intersection Time Complexity
• Worst case linear in the size of the postings sets: Θ(n)
  {1,3,5,7} ∩ {1,3,5,7}
• On size of output?
  {1,3,5,7} ∩ {2,4,6,8}
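The two examples can be tried with a plain linear merge of sorted lists. This is a textbook merge written for illustration (`merge_intersect` is a hypothetical name), and the step counter is added only to show that the cost tracks input size rather than output size.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists with a linear merge; also count loop steps."""
    i = j = steps = 0
    out = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, steps

print(merge_intersect([1, 3, 5, 7], [1, 3, 5, 7]))  # ([1, 3, 5, 7], 4)
print(merge_intersect([1, 3, 5, 7], [2, 4, 6, 8]))  # ([], 7)
```

The second call produces an empty output yet takes more steps than the first, so output size alone does not bound the work.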
Adaptive Algorithms
• Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact?
Examples
{1,2,3,4} ∩ {5,6,7,8}
Much Ado About Nothing
• A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.
• E.g.
  • A = {1,3,5,7}
  • B = {2,4,6,8}
  • a1 < b1 < a2 < b2 < a3 < b3 < a4 < b4
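For two sorted, disjoint sets, one natural proof uses a single comparison at each point where the merged order switches from one set to the other. The sketch below counts those switch points; `proof_length` is a hypothetical helper, it assumes exactly two disjoint sets, and it is not offered as the authors' general definition for k sets.

```python
def proof_length(a, b):
    """Count comparisons in the 'one comparison per run boundary' argument.

    Assumes a and b are sorted and disjoint. Merging them splits the
    elements into alternating runs from a and b; comparing the last element
    of each run with the first element of the next run separates the sets.
    """
    merged = sorted([(x, "a") for x in a] + [(y, "b") for y in b])
    boundaries = 0
    for (_, prev_src), (_, src) in zip(merged, merged[1:]):
        if src != prev_src:
            boundaries += 1
    return boundaries

print(proof_length([1, 3, 5, 7], [2, 4, 6, 8]))  # 7: a1 < b1 < a2 < ... < b4
print(proof_length([1, 2, 3, 4], [5, 6, 7, 8]))  # 1: a4 < b1
```

The interleaved pair needs a chain of seven comparisons, while the separated pair is settled by one, which is the gap an adaptive algorithm tries to exploit.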
Adaptive Algorithms
• In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in k · |shortest proof of non-intersection| steps.
• Ideal for crawled, “bursty” data sets
How does it work?
• <SIGACT, 1, 3, i, n>
[Figure: the postings of SIGACT shown as positions 1, _, 3, ..., i, ..., n within the DocID universe set.]
Measuring Performance
• 100MB Web Crawl
• 5000 queries from Google
Baseline Standard Algorithm
• Sort sets by size
• Candidate answer set is the smallest set
• For each set S in increasing order by size
  • For each element e in the candidate set
    • Binary search for e in S
    • If e is not found, remove it from the candidate set
    • Remove elements before e in S
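The steps above can be sketched as follows, assuming each set is a sorted list of distinct docIDs. `baseline_intersect` is a hypothetical name; "remove elements before e in S" is modeled by a moving lower bound `lo` rather than by deleting elements.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists: the smallest set is the candidate list, and
    each candidate is binary-searched in every other set, smallest first."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)
    candidates = list(ordered[0])
    for s in ordered[1:]:
        lo = 0  # elements of s before index lo are already ruled out
        survivors = []
        for e in candidates:
            lo = bisect_left(s, e, lo)       # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                survivors.append(e)          # e stays in the candidate set
        candidates = survivors
        if not candidates:
            break                            # nothing left to test
    return candidates

print(baseline_intersect([[6, 7, 10, 11, 14],
                          [4, 8, 10, 11, 15],
                          [1, 2, 4, 5, 7, 8, 9]]))  # []
```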
Side by Side
[Figure: side-by-side comparison; labels: Middle Bound, Lower Bound.]
Possible Improvements
• Adaptive performs best with two or three sets
• Traditional algorithm often terminates after the first pair of sets
• Galloping seems better than binary search
• Adaptive keeps a dynamic definition of “smallest set”
• Candidate elements aggressively tested
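For readers unfamiliar with galloping, here is a minimal sketch of a doubling search over a sorted list. `gallop` is a hypothetical helper written for illustration, not code from the talk; it doubles the jump after each probe and then finishes with a binary search inside the last bracket.

```python
from bisect import bisect_left

def gallop(s, target, start=0):
    """Return the first index i >= start with s[i] >= target (len(s) if none).

    Jumps ahead by 1, 2, 4, ... positions from the last probe until it
    reaches an element >= target or runs off the end, then binary-searches
    the bracket between the last two probes.
    """
    if start >= len(s):
        return len(s)
    if s[start] >= target:
        return start
    lo, step = start, 1          # invariant: s[lo] < target
    while lo + step < len(s) and s[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(s, target, lo + 1, min(lo + step, len(s)))

s = [1, 2, 4, 5, 7, 8, 9, 12]
print(gallop(s, 7))      # 4
print(gallop(s, 10))     # 7
print(gallop(s, 6, 2))   # 4
```

When the target lies close to the starting position, the search touches only a few elements, whereas a binary search over the whole remaining list always pays a cost logarithmic in its length.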
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9}
Experimental Results
Test each possible improvement orthogonally
• Cyclic or Two Smallest
• Symmetric
• Update Smallest
• Advance on Common Element
• Gallop Factor/Binary Search
Small Adaptive
Combines the best of Adaptive and Two-Smallest
• Two-smallest
• Symmetric
• Advance on common element
• Update on smallest
• Gallop with factor 2
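The sketch below is one possible reading of how these ingredients fit together, not the authors' implementation. It assumes sorted lists of distinct docIDs, re-ranks the sets by remaining size before each candidate ("update on smallest"), searches with a doubling gallop, and steps past a match in every set ("advance on common element"). The names `gallop` and `small_adaptive` are hypothetical.

```python
from bisect import bisect_left

def gallop(s, target, start):
    """First index i >= start with s[i] >= target, found by doubling steps."""
    if start >= len(s) or s[start] >= target:
        return min(start, len(s))
    lo, step = start, 1          # invariant: s[lo] < target
    while lo + step < len(s) and s[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(s, target, lo + 1, min(lo + step, len(s)))

def small_adaptive(sets):
    """Sketch of a Small-Adaptive-style intersection of sorted lists."""
    if not sets:
        return []
    pos = [0] * len(sets)        # pos[i]: first unexamined index in sets[i]
    result = []
    while True:
        # Update on smallest: rank sets by how many elements remain.
        order = sorted(range(len(sets)), key=lambda i: len(sets[i]) - pos[i])
        small = order[0]
        if pos[small] == len(sets[small]):
            return result        # a set is exhausted, no more matches possible
        candidate = sets[small][pos[small]]
        pos[small] += 1
        in_all = True
        for i in order[1:]:      # second-smallest set is tested first
            pos[i] = gallop(sets[i], candidate, pos[i])
            if pos[i] == len(sets[i]):
                return result    # candidate and everything larger is absent
            if sets[i][pos[i]] != candidate:
                in_all = False
                break
            pos[i] += 1          # advance on common element
        if in_all:
            result.append(candidate)

print(small_adaptive([[6, 7, 10, 11, 14],
                      [4, 8, 10, 11, 15],
                      [1, 2, 4, 5, 7, 8, 9]]))  # []
```

Ranking by remaining elements rather than by original length means a long list that has been mostly skipped can become the source of the next candidate.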
Small Adaptive
• Small Adaptive is faster than Two-Smallest
• Aggregate speed-up: 2.9x (measured in comparisons)
• Faster than Adaptive
Conclusions
• Faster intersection algorithm for Web Search Engines
• Adaptive measure for set operations
• Information theoretic “middle bound”
• Standard speed-up techniques for other settings
THE END
Query Log
[Figure: number of queries for each total size; axis labels: total # of elements in a query, number of queries.]
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9, 12}