The Power of Prefix Search (with a nice open problem). Talk at ADS 2007 in Bertinoro, October 3rd. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany.
Overview
• Part 1
  • Definition of our prefix search problem
  • Applications
  • Demos of our search engine
• Part 2
  • Problem definition again
  • One way to solve it
  • Another way to solve it
  • Your way to solve it
Part 1: Definition, Applications, Demos
Problem Definition: Formal
• Context-Sensitive Prefix Search
• Preprocess
  • a given collection of text documents such that queries of the following kind can be processed efficiently
• Given
  • an arbitrary set of documents D
  • and a range of words W
• Compute
  • all word-in-document pairs (w, d) such that w ∈ W and d ∈ D
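As a reference point for the definition above, here is a minimal brute-force sketch in Python. It is not the method of the talk, only a direct restatement of the problem; the function name, the dictionary layout of the collection, and the trick of writing a prefix as the string range [prefix, prefix + "\uffff"] are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

def prefix_search_bruteforce(
    docs: Dict[int, List[str]],  # doc id -> words of that document (assumed layout)
    D: Set[int],                 # arbitrary set of document ids
    lo: str,                     # lower end of the word range W (inclusive)
    hi: str,                     # upper end of the word range W (inclusive)
) -> List[Tuple[str, int]]:
    """Return all pairs (w, d) with w in W and d in D, sorted by doc id, then word."""
    result = set()
    for d in D:
        for w in docs.get(d, []):
            if lo <= w <= hi:
                result.add((w, d))
    return sorted(result, key=lambda pair: (pair[1], pair[0]))

# Toy collection (made up for this example).
docs = {
    13: ["probabilistic", "algebra"],
    17: ["probabilistic", "graph"],
    88: ["probabilistic", "algebra", "algorithm"],
    99: ["algorithm"],
}

# The prefix alg* expressed as a word range.
pairs = prefix_search_bruteforce(docs, {13, 17, 88}, "alg", "alg\uffff")
print(pairs)  # [('algebra', 13), ('algebra', 88), ('algorithm', 88)]
```

This scans every word of every document in D, so it only serves as a correctness baseline for the index structures discussed in Part 2.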
Problem Definition: Visual
• Data is given as
  • documents containing words
  • documents have ids (D1, D2, …)
  • words have ids (A, B, C, …)
• Query
  • given a sorted list of doc ids
  • and a range of word ids
• Answer
  • all matching word-in-doc pairs
  • with scores
  • and positions
[Figure: a collection of documents, each shown as a doc id followed by its word ids, for example D17: B W U K A, D88: P A E G Q, D13: A O E W H. The query consists of the sorted doc id list D13 D17 D88 … and the word range C D E F G H.]
Application 1: Autocompletion
• After each keystroke
  • display completions of the last query word that lead to the best hits, together with the best such hits
  • e.g., for the query "probabilistic alg", display "algorithm" and "algebra" and show hits for both
Application 2: Error Correction
• As before, but also …
  • … display spelling variants of completions that would lead to a hit
  • e.g., for the query "probabilistic algorithm", also consider a document containing "probalistic aigorithm"
• Implementation
  • if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index
      aigorithm  Doc. 17
    also add
      algorithm::aigorithm  Doc. 17
Application 3: Query Expansion
• As before, but also …
  • … display words related to completions that would lead to a hit
  • e.g., for the query "russia metal", also consider documents containing "russia aluminium"
• Implementation
  • for, say, every occurrence of aluminium in the index
      aluminium  Doc. 17
    also add (once for every occurrence)
      s:67:aluminium  Doc. 17
    and (only once for the whole collection)
      s:aluminium:67  Doc. 00
Application 4: Faceted Search
• As before, but also …
  • … along with the completions and hits, display a breakdown of the result set by various categories
  • e.g., for the query "algorithm", show (prominent) authors of articles containing these words
• Implementation
  • for, say, an article by Camil Demetrescu that appeared in SODA 2006, add
      author:Camil_Demetrescu  Doc. 17
      venue:SODA  Doc. 17
      year:2006  Doc. 17
  • also add
      camil:author:Camil_Demetrescu  Doc. 17
      demetrescu:author:Camil_Demetrescu  Doc. 17
    etc.
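The extra index entries for faceted search can be generated mechanically. The following is a minimal sketch, assuming a hypothetical helper `facet_entries` and a simple (word, doc id) tuple format; only the shape of the artificial words (author:…, venue:…, year:…, and the name-part prefixes) is taken from the slide.

```python
from typing import List, Tuple

def facet_entries(doc_id: int, authors: List[str], venue: str, year: int) -> List[Tuple[str, int]]:
    """Build the artificial (word, doc id) index entries for one article."""
    entries = []
    for name in authors:
        tag = "author:" + name.replace(" ", "_")
        entries.append((tag, doc_id))
        # One extra entry per name part, so that typing "dem" can complete
        # to demetrescu:author:Camil_Demetrescu.
        for part in name.lower().split():
            entries.append((part + ":" + tag, doc_id))
    entries.append(("venue:" + venue, doc_id))
    entries.append(("year:" + str(year), doc_id))
    return entries

entries = facet_entries(17, ["Camil Demetrescu"], "SODA", 2006)
for word, doc in entries:
    print(word, "Doc.", doc)
```

With such entries in the index, a facet breakdown becomes an ordinary prefix query: the prefix author: restricted to the current hit set yields the authors of the matching articles.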
Application 5: Semantic Search (SIGIR’07)
• As before, but also …
  • … display “semantic” completions
  • e.g., for the query "beatles musician", display instances of the class musician that occur together with the word beatles
• Implementation
  • cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
  • tricky combination of completions and joins
• and still more applications …
Part 2: Solutions and Open Problem
Solution 1: Inverted Index
• For example, probab* alg*
  • given the documents: D13, D17, D88, … (ids of hits for probab*)
  • and the word range: C D E F G (ids for alg*)
• Iterate over all words from the given range
    C (algae)      D8, D23, D291, ...
    D (algarve)    D24, D36, D165, ...
    E (algebra)    D13, D24, D88, ...
    F (algol)      D56, D129, D251, ...
    G (algorithm)  D3, D15, D88, ...
• Intersect each list with the given one and merge the results
    D13  D88  D88  …
    E    E    G    …
• Running time: |D| ∙ |W| + log |W| ∙ merge volume
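The inverted-index procedure on this slide can be sketched as follows. This is an illustrative Python version, not the talk's C++ code; the function names and the dictionary representation of the inverted lists are assumptions, and the lists contain only the ids visible on the slide.

```python
import heapq
from typing import Dict, List, Tuple

def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted lists of doc ids with a linear two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def inv_prefix_search(inverted: Dict[str, List[int]],
                      words_in_range: List[str],
                      D: List[int]) -> List[Tuple[int, str]]:
    """One intersection per word in the range, then a multi-way merge by doc id."""
    per_word = []
    for w in words_in_range:
        hits = intersect_sorted(D, inverted.get(w, []))
        per_word.append([(d, w) for d in hits])
    return list(heapq.merge(*per_word))

# Inverted lists as far as they are shown on the slide.
inverted = {
    "algae": [8, 23, 291],
    "algarve": [24, 36, 165],
    "algebra": [13, 24, 88],
    "algol": [56, 129, 251],
    "algorithm": [3, 15, 88],
}
result = inv_prefix_search(inverted, sorted(inverted), [13, 17, 88])
print(result)  # [(13, 'algebra'), (88, 'algebra'), (88, 'algorithm')]
```

The loop touches every word in the range even when its list contributes nothing, which is where the |D| ∙ |W| term in the running time comes from.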
A General Idea
• Precompute inverted lists for ranges of words (e.g., a list for A-D)
• Note
  • each prefix corresponds to a word range
  • ideally precompute list for each possible prefix
  • too much space
  • but lots of redundancy
Solution 2: AutoTree (SPIRE’06 / JIR’07)
• Trick 1: Relative bit vectors
  • the i-th bit of the root node corresponds to the i-th doc
  • the i-th bit of any other node corresponds to the i-th set bit of its parent node
      aachen-zyskowski  1111111111111…  (annotated bit: corresponds to doc 5)
      maakeb-zyskowski  1001000111101…  (annotated bit: corresponds to doc 5)
      maakeb-stream     1001110…        (annotated bit: corresponds to doc 10)
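To make the relative addressing concrete, the sketch below maps a bit position in a node back to a doc id by repeatedly looking up the i-th set bit of the parent. The helper names `select` and `doc_of_bit` are hypothetical, positions are 1-based, and the bit strings are only the visible prefixes from the slide; a real implementation would use constant-time rank/select structures rather than this linear scan.

```python
from typing import List

def select(bits: str, i: int) -> int:
    """Return the 1-based position of the i-th set bit in a bit string."""
    count = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            count += 1
            if count == i:
                return pos
    raise ValueError("fewer than i set bits")

def doc_of_bit(path: List[str], i: int) -> int:
    """Map the i-th bit of the last node on a root-to-node path to a doc id.

    path[0] is the root's bit vector, path[-1] the node's own bit vector.
    """
    for parent_bits in reversed(path[:-1]):
        i = select(parent_bits, i)
    return i  # the i-th bit of the root corresponds to the i-th doc

root = "1111111111111"        # aachen-zyskowski
child = "1001000111101"       # maakeb-zyskowski
grandchild = "1001110"        # maakeb-stream

print(doc_of_bit([root, child], 5))              # 5
print(doc_of_bit([root, child, grandchild], 5))  # 10
```

Under this reading, the 5th bit of the second node refers to doc 5 and the 5th bit of the third node refers to doc 10, which agrees with the annotations on the slide.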
Solution 2: AutoTree (SPIRE’06 / JIR’07)
• Trick 2: Push up the words
  • For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[Figure: example query with D = 5, 7, 10 and W = max*, processed top-down through three nodes.
  Node 1: bits 1 1 1 1 1 1 1 1 1 1 …; stored words include aachen, advance, algol, algorithm, art.
  Node 2: bits 1 0 0 0 1 0 0 1 1 1 …; stored words include manner, manning, maximal, maximum; D = 5, 10 (→ 2, 5); report: maximum.
  Node 3: bits 1 0 0 1 1 …; stored words include maple, mazza, middle; D = 5; report: Ø → STOP.]
Solution 2: AutoTree (SPIRE’06 / JIR’07)
• Trick 3: Divide into blocks
  • and build a tree over each block as shown before
• Theorem:
  • query processing time O(|D| + |output|)
  • uses no more space than an inverted index
• AutoTree Summary:
  + output-sensitive
  − not IO-efficient (heavy use of bit-rank operations)
  − compression not optimal
Parenthesis
• Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice
  • very simple code
  • lists are highly compressible
  • perfect locality of access
• Number of operations is a deceptive measure
  • 100 disk seeks take about half a second
  • in that time one can read 200 MB of contiguous data (if stored compressed)
  • main memory: 100 non-local accesses vs. a 10 KB block of data
Solution 3: HYB (SIGIR’06 / IR’07)
• Flat division of word range into blocks
    list for A-D | list for E-J | list for K-N
Solution 3: HYB (SIGIR’06 / IR’07)
• Flat division of word range into blocks
• Replace doc ids by gaps and words by frequency ranks
• Encode both gaps and ranks such that x → log2 x bits
    gap +0 → 0        1st (A) → 0
    gap +1 → 10       2nd (C) → 10
    gap +2 → 110      3rd (D) → 111
                      4th (B) → 110
• [Figure: an actual block of HYB]
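The two transformations applied before encoding can be sketched as below. This is a minimal illustration with made-up block contents; the function names are hypothetical, the first gap is taken relative to 0, and ties in frequency are broken alphabetically, all of which are assumptions. The bit-level codes from the slide are not reproduced here.

```python
from collections import Counter
from typing import List, Tuple

def to_gaps(doc_ids: List[int]) -> List[int]:
    """Replace a sorted doc id sequence (repeats allowed) by differences."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def to_ranks(words: List[str]) -> Tuple[List[int], List[str]]:
    """Replace each word by its frequency rank within the block (1 = most frequent)."""
    freq = Counter(words)
    order = sorted(freq, key=lambda w: (-freq[w], w))
    rank = {w: r for r, w in enumerate(order, start=1)}
    return [rank[w] for w in words], order

# A made-up block: parallel lists of doc ids and word ids, sorted by doc id.
doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9]
words = ["A", "A", "C", "A", "D", "C", "B", "A", "C", "D"]

gaps = to_gaps(doc_ids)
ranks, order = to_ranks(words)
print(gaps)   # [1, 2, 0, 2, 0, 1, 1, 1, 0, 1]
print(ranks)  # [1, 1, 2, 1, 3, 2, 4, 1, 2, 3]
print(order)  # ['A', 'C', 'D', 'B']
```

Both sequences now consist mostly of small numbers, which is what makes a code that spends about log2 x bits on a value x effective.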
Solution 3: HYB (SIGIR’06 / IR’07)
• Flat division of word range into blocks
• Theorem:
  • Let n = number of documents, m = number of words
  • If blocks are chosen of equal volume ε ∙ n
  • Then query time ε ∙ n and empirical entropy H_HYB ~ (1 + ε) ∙ H_INV
• HYB Summary:
  + IO-efficient (mere scans of data)
  + very good compression
  − not output-sensitive
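Query processing over a HYB block can be pictured as a single linear scan. The sketch below assumes an uncompressed block stored as (doc id, word) pairs sorted by doc id, and a query word range that falls inside one block; the function name and the sample block are hypothetical.

```python
from typing import List, Tuple

def hyb_scan(block: List[Tuple[int, str]], D: List[int],
             lo: str, hi: str) -> List[Tuple[int, str]]:
    """Scan one block once; keep entries whose doc is in D and whose word is in [lo, hi]."""
    out = []
    i = 0  # cursor into the sorted doc id list D
    for d, w in block:
        while i < len(D) and D[i] < d:
            i += 1
        if i == len(D):
            break  # no doc ids left in D, nothing further can match
        if D[i] == d and lo <= w <= hi:
            out.append((d, w))
    return out

# Made-up block covering the words starting with "alg".
block = [
    (3, "algorithm"), (8, "algae"), (13, "algebra"), (15, "algorithm"),
    (24, "algarve"), (24, "algebra"), (88, "algebra"), (88, "algorithm"),
]
D = [13, 17, 88]

whole_block = hyb_scan(block, D, "alg", "alg\uffff")
sub_range = hyb_scan(block, D, "algo", "algo\uffff")
print(whole_block)  # [(13, 'algebra'), (88, 'algebra'), (88, 'algorithm')]
print(sub_range)    # [(88, 'algorithm')]
```

The scan reads the whole block regardless of how many pairs match, which illustrates why HYB is IO-efficient but not output-sensitive.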
Open Problem
• A solution for context-sensitive prefix search which is both output-sensitive and IO-efficient
  • Note: the interesting queries are those with large D and W but small result set
• Similar situation for substring search / suffix arrays
  • all algorithms with good compression have poor locality of access
• But prefix search is easier …
  • … and more relevant for text search
Thank you!
INV vs. HYB: Space Consumption
• Theorem: The empirical entropy of INV is
    $\sum_i n_i \cdot \left( \frac{1}{\ln 2} + \log_2 \frac{n}{n_i} \right)$
• Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is
    $\sum_i n_i \cdot \left( \frac{1+\varepsilon}{\ln 2} + \log_2 \frac{n}{n_i} \right)$
• $n_i$ = number of documents containing the i-th word, $n$ = number of documents
• Nice match of theory and practice
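The two entropy formulas are easy to evaluate numerically. The sketch below uses made-up document frequencies and a made-up ε, purely to show how the two bounds relate; the function names are hypothetical.

```python
import math
from typing import List

def entropy_inv(n: int, ni: List[int]) -> float:
    """Empirical entropy of INV in bits: sum of n_i * (1/ln 2 + log2(n / n_i))."""
    return sum(x * (1 / math.log(2) + math.log2(n / x)) for x in ni)

def entropy_hyb(n: int, ni: List[int], eps: float) -> float:
    """Empirical entropy of HYB in bits: sum of n_i * ((1 + eps)/ln 2 + log2(n / n_i))."""
    return sum(x * ((1 + eps) / math.log(2) + math.log2(n / x)) for x in ni)

# Made-up numbers: 1000 documents, three words with these document frequencies.
n = 1000
ni = [500, 100, 10]
eps = 0.1

h_inv = entropy_inv(n, ni)
h_hyb = entropy_hyb(n, ni, eps)
print(round(h_inv), round(h_hyb))
```

By the formulas, the difference between the two is exactly ε ∙ Σ n_i / ln 2, so HYB never exceeds (1 + ε) times the INV entropy.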
INV vs. HYB: Query Time
• Experiment: type ordinary queries from left to right
    db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ...
• [Table: query times for INV and HYB; values not recovered]
• HYB beats INV by an order of magnitude
Engineering
• With HYB, every query is essentially one block scan
  • perfect locality of access, no sorting or merging, etc.
  • balanced ratio of read, decompression, processing, etc.
• Careful implementation in C++
  • Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
System Design: High Level View
• Compute Server (C++)
• Web Server (PHP)
• User Client (JavaScript)
• Debugging such an application is hell!
Basic Problem Definition
• Definition: Context-sensitive prefix search and completion
• Given a query consisting of
  • sorted list D of doc ids: Doc15, Doc183, Doc185, Doc17351, …
  • range W of word ids: Word1893 – Word7329
• Compute as a result
  • all (w, d) with w ∈ W, d ∈ D, sorted by doc id
      Doc15     Doc15     Doc17351  ...
      Word7014  Word5112  Word2011  …
• Refinements
  • positions: Pos12, Pos73, Pos44, ...
  • scores: 0.7, 0.3, 0.5, ...
Basic Problem Definition
• For example, dblp uni
  • set D = document ids from result for dblp
  • range W = word ids of all words starting with uni
  → multi-dimensional query processed as sequence of 1½-dimensional queries
• For example, intersect completions of results for conf:sigir author: and conf:sigmod author:
  → efficient, because completions are from small range
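The reduction of a multi-word query to a sequence of prefix searches can be sketched as follows. This is a brute-force illustration with a made-up collection; the function name is hypothetical, and treating every query word (not only the last one) as a prefix is a simplifying assumption.

```python
from typing import Dict, List, Tuple

def answer_query(docs: Dict[int, List[str]], query: str) -> Tuple[List[int], List[str]]:
    """Process the query words left to right; each step is one prefix search.

    Returns the final hit set and the completions of the last query word.
    """
    D = set(docs)  # start with all documents
    pairs: List[Tuple[str, int]] = []
    for prefix in query.split():
        # One context-sensitive prefix search: doc set D, word range "prefix*".
        pairs = [(w, d) for d in sorted(D) for w in sorted(set(docs[d]))
                 if w.startswith(prefix)]
        D = {d for _, d in pairs}  # hits of this step feed the next step
    completions = sorted({w for w, _ in pairs})
    return sorted(D), completions

# Made-up collection.
docs = {
    1: ["dblp", "university"],
    2: ["dblp", "union"],
    3: ["dblp", "graph"],
    4: ["university"],
}

hits, completions = answer_query(docs, "dblp uni")
print(hits)         # [1, 2]
print(completions)  # ['union', 'university']
```

Because the hit set of each step becomes the D of the next, a search engine that caches the result for dblp only has to run one new prefix search when the user types uni.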
Conclusions
• Context-sensitive prefix search and completion
  • is a fundamental operation
  • supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, …
  • efficient support via HYB index
    • very good compression properties
    • perfect locality of access
• Some open issues
  • integrate top-k query processing
  • what else can we do with it?
  • very short prefixes