Engineering a Set Intersection Algorithm for Information Retrieval

Engineering a Set Intersection Algorithm for Information Retrieval. Alex Lopez-Ortiz UNB / InterNAP. Joint work with Ian Munro and Erik Demaine. Overview. Web Search Engine Basics Algorithms for set operations Theoretical Analysis Experimental Analysis


Presentation Transcript


  1. Engineering a Set Intersection Algorithm for Information Retrieval Alex Lopez-Ortiz UNB / InterNAP Joint work with Ian Munro and Erik Demaine

  2. Overview • Web Search Engine Basics • Algorithms for set operations • Theoretical Analysis • Experimental Analysis • Engineering an Improved Algorithm • Conclusions

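  3. Web Search Engine Basics • Crawl: sequential gathering process • Document ID (DocID) for each web page [Figure: four crawled pages with DocIDs 1 to 4. Page 1 lists "Cool sites: SIGIR, SIGACT, SIGCOMM"; pages 2, 3 and 4 are labelled SIGIR, SIGACT and SIGCOMM. Example URL: http://acm.org/home.html]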

  4. Indexing: List of entries of type <word, docID1, docID2, ...> E.g. <cool, 1> <SIGACT, 1, 3> <SIGCOMM, 1, 4> <SIG, 1, 2, 3, 4> [Figure: the four crawled pages with DocIDs 1 to 4; page 1 lists "Cool sites: SIGIR, SIGACT, SIGCOMM"]
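A minimal sketch of how entries of this shape could be produced, assuming each crawled page is available as a DocID mapped to its text. The function name `build_index` and the page contents are illustrative assumptions, not taken from the talk. The sketch indexes whole words only, so a pattern entry such as <SIG, 1, 2, 3, 4> would need substring matching on top of it.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the sorted list of DocIDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            index[word].add(doc_id)
    # Sorted lists, since later slides rely on postings being sorted.
    return {word: sorted(ids) for word, ids in index.items()}

# Toy crawl loosely modelled on the four pages in the figure (contents assumed).
pages = {
    1: "cool sites SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}
index = build_index(pages)
```

With these assumed pages, `index["SIGACT"]` and `index["SIGCOMM"]` correspond to the <SIGACT, 1, 3> and <SIGCOMM, 1, 4> entries on the slide.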

  5. Postings set: Set of docIDs containing a word or pattern. E.g. SIGACT: {1,3}, SIGCOMM: {1,4} [Figure: the four crawled pages with DocIDs 1 to 4; page 1 lists "Cool sites: SIGIR, SIGACT, SIGCOMM"]

  6. Search Engine Basics (cont.) Postings set stored implicitly/explicitly in a string matching data structure • PAT tree/array • Inverted word index • Suffix trees • KMP (grep) ...

  7. String Matching Problem • Different performance characteristics for each solution • Time/Space tradeoff (empirical) • Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

  8. Search Engine Basics (cont.) A user query is of the form: keyword1 op keyword2 op … op keywordn, where op is one of {and, or}. E.g. computer and science or internet

  9. Evaluating a Boolean Query The interpretation of a boolean query is the mapping: • keyword → postings set • and → ∩ (set intersection) • or → ∪ (set union) E.g. {computer} ∩ {science} ∪ {internet}
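The mapping can be sketched directly with Python sets. The slides do not state operator precedence, so this sketch assumes strict left-to-right evaluation; the function name `evaluate` and the postings data are hypothetical.

```python
def evaluate(query, postings):
    """Evaluate 'k1 and k2 or k3 ...' left to right (precedence is an assumption)."""
    tokens = query.split()
    result = set(postings.get(tokens[0], ()))
    # Tokens alternate: keyword, operator, keyword, operator, ...
    for op, keyword in zip(tokens[1::2], tokens[2::2]):
        other = set(postings.get(keyword, ()))
        if op == "and":
            result &= other   # set intersection
        elif op == "or":
            result |= other   # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Made-up postings sets for the slide's example query.
postings = {"computer": [1, 2, 5], "science": [2, 5, 7], "internet": [3]}
answer = evaluate("computer and science or internet", postings)
```

Built-in sets ignore the sorted order of the postings; the later slides are about exploiting that order.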

  10. Set Operations for Web Search Engines • Average postings set size > 10 million • Postings sets are sorted

  11. Intersection Time Complexity • Worst case linear on size of postings sets: Θ(n) {1,3,5,7} ∩ {1,3,5,7} • On size of output? {1,3,5,7} ∩ {2,4,6,8}
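For reference, the textbook linear merge of two sorted lists is sketched below; it is a standard technique and not necessarily the code used in the talk. On the second example the output is empty, yet the merge still walks through essentially every element, which suggests output size alone does not bound the work.

```python
def intersect_merge(a, b):
    """Intersect two sorted lists with a single linear pass over both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # a[i] is smaller than everything left in b
        else:
            j += 1   # b[j] is smaller than everything left in a
    return out
```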

  12. Adaptive Algorithms • Assume the intersection is empty. What is the min number of comparisons needed to ascertain this fact? Examples {1,2,3,4} ∩ {5,6,7,8}

  13. Much ado About Nothing • A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection. • E.g. A={1,3,5,7}, B={2,4,6,8}: a1 < b1 < a2 < b2 < a3 < b3 < a4 < b4
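To make the two examples concrete, the sketch below counts the cross-set comparisons in a chain of the kind shown above: one comparison at each point where the merged order switches from one set to the other. The function name is hypothetical, and the sketch only covers two disjoint sorted sets; it is an illustration, not the definition used in the SODA paper.

```python
def chain_proof_length(a, b):
    """Count cross-set comparisons in a chain proof for two disjoint sorted lists."""
    if set(a) & set(b):
        raise ValueError("sets intersect, so no proof of non-intersection exists")
    # Tag each element with its source set, then look at the merged order.
    merged = sorted([(x, "a") for x in a] + [(x, "b") for x in b])
    # One comparison is needed wherever consecutive elements come from different sets.
    return sum(1 for (_, s), (_, t) in zip(merged, merged[1:]) if s != t)
```

Under this count, the fully interleaved sets {1,3,5,7} and {2,4,6,8} need 7 comparisons (the seven `<` signs in the chain), while {1,2,3,4} and {5,6,7,8} need only one, a4 < b1.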

  14. Adaptive Algorithms • In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in k · |shortest proof of non-intersection| steps. • Ideal for crawled, “bursty” data sets

  15. How does it work? • <SIGACT, 1, 3, i, n> [Figure: the postings 1, _, 3, ..., i, ..., n laid out over the DocID universe set]

  16. Measuring Performance • 100MB Web Crawl • 5000 queries from Google

  17. Baseline Standard Algorithm • Sort sets by size • Candidate answer set is smallest set • For each set S in increasing order by size • For each element e in candidate set • Binary search for e in S • If e is not found remove from candidate set • Remove elements before e in S
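The steps above can be sketched as follows, assuming each postings set is a sorted Python list without duplicates. "Remove elements before e in S" is modelled by a moving lower bound `lo` rather than by deleting anything; the function name is an assumption.

```python
from bisect import bisect_left

def intersect_baseline(sets):
    """Filter the smallest sorted list against every other list, smallest first."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)          # sort sets by size
    candidates = list(ordered[0])            # candidate answer set = smallest set
    for s in ordered[1:]:
        lo = 0                               # elements of s before lo are discarded
        survivors = []
        for e in candidates:
            lo = bisect_left(s, e, lo)       # binary search for e in the rest of s
            if lo < len(s) and s[lo] == e:
                survivors.append(e)          # e stays in the candidate set
        candidates = survivors
    return candidates
```

For example, `intersect_baseline([[1, 2, 3, 4, 5], [2, 4], [2, 3, 4, 9]])` starts from the candidates `[2, 4]` and keeps both.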

  18. Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

  19. Lower Bound: Adaptive/Shortest Proof

  20. Middle Bound: Adaptive/ Encoding of Shortest Proof

  21. Side by Side Middle Bound Lower Bound

  22. Possible Improvements • Adaptive performs best in two-three sets • Traditional algorithm often terminates after first pair of sets • Galloping seems better than binary search • Adaptive keeps a dynamic definition of “smallest set” • Candidate elements aggressively tested
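Galloping (also called exponential or doubling search) probes at growing distances from the current position and then binary-searches inside the bracket it finds, so its cost depends on how far the target is rather than on the length of the whole list. A minimal sketch, with the name `gallop` and its parameters as assumptions; `factor` is expected to be an integer of at least 2:

```python
from bisect import bisect_left

def gallop(s, target, lo=0, factor=2):
    """Return the first index i >= lo with s[i] >= target (len(s) if none)."""
    step = 1
    hi = lo
    # Probe further and further ahead until we overshoot the target or the list.
    while hi < len(s) and s[hi] < target:
        lo = hi + 1        # everything up to hi is known to be < target
        hi += step
        step *= factor
    # The answer now lies in s[lo:hi] or at hi itself; finish with binary search.
    return bisect_left(s, target, lo, min(hi, len(s)))
```

The result matches `bisect_left(s, target, lo)`; only the probing order differs.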

  23. Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9}

  24. Experimental Results Test orthogonally each possible improvement • Cyclic or Two Smallest • Symmetric • Update Smallest • Advance on Common Element • Gallop Factor/Binary Search

  25. Binary Search vs. Gallop

  26. Advance on Common Element

  27. Small Adaptive Combines best of Adaptive and Two-Smallest • Two-smallest • Symmetric • Advance on common element • Update on smallest • Gallop with factor 2
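The slide lists ingredients rather than pseudocode, so the following is only one plausible reading, not the authors' implementation. It takes the next element of the set with the fewest remaining elements, gallops for it (factor 2) in the other sets in order of remaining size, steps past the element wherever it is found, and re-ranks the sets by remaining size on every round. How the "symmetric" option fits in is not modelled. Inputs are assumed to be sorted lists without duplicates; all names are hypothetical.

```python
from bisect import bisect_left

def gallop(s, target, lo):
    """First index i >= lo with s[i] >= target, found by doubling steps."""
    step, hi = 1, lo
    while hi < len(s) and s[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    return bisect_left(s, target, lo, min(hi, len(s)))

def small_adaptive_sketch(sets):
    """Intersect sorted lists, always drawing the candidate from the smallest remainder."""
    if not sets:
        return []
    pos = [0] * len(sets)    # pos[i]: index of the first remaining element of sets[i]
    result = []
    while True:
        # Re-rank the sets by how many elements each still has left.
        order = sorted(range(len(sets)), key=lambda i: len(sets[i]) - pos[i])
        first = order[0]
        if pos[first] >= len(sets[first]):
            return result    # some set is exhausted: nothing more can match
        e = sets[first][pos[first]]
        pos[first] += 1
        in_all = True
        for i in order[1:]:
            pos[i] = gallop(sets[i], e, pos[i])
            if pos[i] >= len(sets[i]):
                return result            # sets[i] has no element >= e left
            if sets[i][pos[i]] != e:
                in_all = False           # e is eliminated; start a new round
                break
            pos[i] += 1                  # advance on common element
        if in_all:
            result.append(e)
```

On the example sets {6, 7, 10, 11, 14}, {4, 8, 10, 11, 15} and {1, 2, 4, 5, 7, 8, 9}, the sketch returns an empty list, as does a plain set intersection.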

  28. Small Adaptive

  29. Small Adaptive • Small Adaptive is faster than Two-Smallest • Aggregate speed-up 2.9x comparisons • Faster than Adaptive

  30. Conclusions • Faster intersection algorithm for Web Search Engines • Adaptive measure for set operations • Information theoretic “middle bound” • Standard speed-up techniques for other settings THE END

  31. Query Log [Plot: number of queries for each total size, against the total # of elements in a query]

  32. Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9, 12}
