The CompleteSearch Engine: Interactive, Efficient, and Towards IR & DB Integration
Talk at the University of Trier, February 13, 2007
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
Joint work with Alexandru Chitea, Deb Majumdar, Christian Mortensen, Fabian Suchanek, Markus Tetzlaff, Thomas Warken, Ingmar Weber, …
IR versus DB (simplified view)
• IR system (search engine)
  • single data structure and query algorithm, optimized for ranked retrieval on textual data
  • highly compressible and high locality of access
  • ranking is an integral part
  • can't do even simple selects, joins, etc.
  → scales very well, but special-purpose
• DB system (relational)
  • variety of indices and query algorithms, to suit all sorts of complex queries on structured data
  • space overhead and limited locality of access
  • no integrated ranked retrieval
  • can do complex selects, joins, … (SQL)
  → general-purpose, but slow on large data
Our work (in a nutshell)
• The CompleteSearch engine
  • novel data structure and query algorithm for context-sensitive prefix search and completion
  • highly compressible and high locality of access
  • IR-style ranked retrieval
  • DB-style selects and joins, a natural blend of the two
  • sub-second query times for up to a terabyte on a single machine
  • no transactions, recovery, etc.; intended for low dynamics (few insertions/deletions)
  • other open issues at the end of the talk …
  → fairly general-purpose and scales very well
Context-Sensitive Autocompletion
• Complete to words that would lead to a hit
  • saves typing, avoids overspecification of the query, shows which formulations are actually used, enables error correction, etc.
• Complete to phrases (see the sketch after this list)
  • for the phrase uni trier, add the word uni_trier to the index
• Complete to subwords
  • for the compound word eigenproblem, add the word problem to the index
• Complete to arbitrary substrings
  • there are standard techniques, but usually not worth it (in text search)
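As a rough illustration of this index augmentation (not the CompleteSearch indexer itself), a tokenizer could emit the artificial words alongside the ordinary ones. The phrase dictionary and the compound-splitting rule below are assumptions made purely for the example.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Emit the words to index for one document: the ordinary words plus
// artificial words for phrase and subword completion.
std::vector<std::string> tokensForIndex(const std::vector<std::string>& docWords) {
    // Assumed phrase dictionary; how phrases are chosen is not specified here.
    static const std::set<std::pair<std::string, std::string>> phrases = {
        {"uni", "trier"}
    };
    std::vector<std::string> out;
    for (size_t i = 0; i < docWords.size(); ++i) {
        out.push_back(docWords[i]);  // the ordinary word itself
        // phrase completion: also index uni_trier for the phrase "uni trier"
        if (i + 1 < docWords.size() && phrases.count({docWords[i], docWords[i + 1]}))
            out.push_back(docWords[i] + "_" + docWords[i + 1]);
        // subword completion (crude rule for illustration): index a known
        // suffix as its own word, e.g. "problem" for "eigenproblem"
        const std::string suffix = "problem";
        const std::string& w = docWords[i];
        if (w.size() > suffix.size() &&
            w.compare(w.size() - suffix.size(), suffix.size(), suffix) == 0)
            out.push_back(suffix);
    }
    return out;
}

int main() {
    for (const auto& t : tokensForIndex({"the", "eigenproblem", "at", "uni", "trier"}))
        std::cout << t << '\n';
}
```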
Semantic Completion
• Complete to instances of categories
  • for the author Henning Fernau, add henning:fernau::author and fernau::henning:author
• Complete to names of categories
  • for the author Henning Fernau, add author:henning_fernau
• Refine the search result by category (faceted search)
  • add ct:conference:stacs
  • add ct:author:henning_fernau
  • add ct:year:2005
  • proactively launch a query with ct: appended
DB-style Joins
• Find authors who have published at both SIGIR and SIGMOD
  • must collect information from several documents
  • no way to do this with standard keyword search
• With our context-sensitive prefix completion, we can launch
  conference:sigir author:* and conference:sigmod author:*
  and intersect the lists of completions (not documents); see the sketch below
• In this way, any kind of join can be realized
  • note that adding conference:stacs author:henning_fernau year:2005 etc. effectively creates a table with schema (conference, author, year, publication)
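A minimal sketch of the join step under these assumptions: each of the two queries returns a sorted, deduplicated list of author word ids as completions, and the join is just an intersection of those completion lists. The ids and helper names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

using WordId = uint32_t;

// Intersect two sorted completion lists (of word ids, not documents).
std::vector<WordId> intersectCompletions(const std::vector<WordId>& a,
                                         const std::vector<WordId>& b) {
    std::vector<WordId> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));
    return result;
}

int main() {
    // Word ids of the author:... completions for each conference (made up).
    std::vector<WordId> sigirAuthors  = {11, 42, 57, 90};
    std::vector<WordId> sigmodAuthors = {23, 42, 90, 101};
    for (WordId w : intersectCompletions(sigirAuthors, sigmodAuthors))
        std::cout << w << '\n';   // authors with papers at both SIGIR and SIGMOD
}
```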
Incorporating Ontologies (ongoing work)
• Consider an entity like John Lennon, who we know was a
  • singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
• We cannot add all the annotations to every occurrence of John Lennon
  • the index size would explode
  • better to keep the annotations separately
• But we can
  • add entity:john_lennon for every occurrence
  • in a special document about him, add entity:john_lennon along with class:songwriter, class:musician, class:person, …
• And then intersect the completions of, for example,
  • beatles entity: and class:musician entity:
Related Engines
• suggests whole queries from a precompiled list
Related Engines
• similar to Google Suggest, plus it proactively snaps to one query and shows its result
Context-Sensitive Prefix Search
• Data is given as
  • documents containing words
  • documents have ids (D1, D2, …)
  • words have ids (A, B, C, …)
• Query
  • given a sorted list of doc ids (e.g., D13, D17, D88, …)
  • and a range of word ids (e.g., C–H)
• Answer
  • all matching word-in-doc pairs
  • with scores
  (a toy reference implementation follows below)
[Figure: sample documents D1–D98, each shown as a box of word ids, with the matching documents D13, D17, D88 highlighted]
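As a reference point for the query semantics only, a toy in-memory implementation could look as follows. The flat list of word-in-doc pairs is an assumption for the example; the actual index is of course organized very differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

using DocId  = uint32_t;
using WordId = uint32_t;

struct Posting { DocId doc; WordId word; };

// Given a sorted list D of doc ids and a word-id range [wLow, wHigh],
// report every (word, doc) pair with doc in D and word in the range.
std::vector<Posting> answer(const std::vector<Posting>& collection,  // all word-in-doc pairs
                            const std::vector<DocId>& D,             // sorted doc ids
                            WordId wLow, WordId wHigh) {
    std::vector<Posting> result;
    for (const Posting& p : collection)
        if (p.word >= wLow && p.word <= wHigh &&
            std::binary_search(D.begin(), D.end(), p.doc))
            result.push_back(p);
    return result;
}

int main() {
    std::vector<Posting> coll = {{13, 3}, {13, 5}, {17, 2}, {88, 4}, {90, 9}};
    std::vector<DocId> D = {13, 17, 88};
    for (const Posting& p : answer(coll, D, /*wLow=*/3, /*wHigh=*/6))
        std::cout << "doc " << p.doc << ", word " << p.word << '\n';
}
```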
Solution via an Inverted Index (INV)
• For example, db*
  • given the sorted list of all document ids
  • given the range of word ids matching db*
• Iterate over all words from W
  • Word 781 (dbms): Doc. 16, Doc. 53, Doc. 591, ...
  • Word 782 (db2): Doc. 3, Doc. 66, Doc. 765, ...
  • Word 783 (dbase): Doc. 25, Doc. 98, Doc. 221, ...
  • Word 784 (dbis): Doc. 67, Doc. 189, Doc. 221, ...
  • Word 785 (dblp): Doc. 16, Doc. 110, Doc. 141, ...
• Have to merge the lists
  • Doc. 3, Doc. 16, Doc. 16, Doc. 25, …
  • Word 782, Word 781, Word 785, Word 783, …
• query time = output size ∙ log(size of W)
Solution via an Inverted Index (INV)
• For example, db* uni*
  • given the doc id list: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for db*)
  • given the range of word ids matching uni*
• Iterate over all words from W
  • Word 578 (uniform): Doc. 8, Doc. 23, Doc. 291, ...
  • Word 579 (unit): Doc. 24, Doc. 36, Doc. 165, ...
  • Word 580 (uni trier): Doc. 3, Doc. 18, Doc. 66, ...
  • Word 581 (unique): Doc. 56, Doc. 129, Doc. 251, ...
  • Word 582 (university): Doc. 18, Doc. 21, Doc. 25, ...
• Intersect each list with D, then merge (see the sketch below)
  • Doc. 3, Doc. 18, Doc. 18, Doc. 25, …
  • Word 580, Word 580, Word 582, Word 582, …
• query time = size of D ∙ size of W + merging
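The INV procedure just described can be sketched as follows, with made-up posting lists: intersect each word's list with D, then merge the surviving pairs by doc id. This is an illustration of the baseline, not of the actual engine.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>

using DocId  = uint32_t;
using WordId = uint32_t;

// One sorted posting list per word in the range W, intersected with D,
// then merged into a single list of (doc, word) pairs sorted by doc id.
std::vector<std::pair<DocId, WordId>>
invPrefixQuery(const std::vector<std::pair<WordId, std::vector<DocId>>>& lists,
               const std::vector<DocId>& D) {
    std::vector<std::pair<DocId, WordId>> result;
    for (const auto& [word, docs] : lists) {
        std::vector<DocId> common;
        std::set_intersection(docs.begin(), docs.end(), D.begin(), D.end(),
                              std::back_inserter(common));   // intersect with D
        for (DocId d : common) result.emplace_back(d, word);
    }
    std::sort(result.begin(), result.end());                  // the merge step
    return result;
}

int main() {
    std::vector<std::pair<WordId, std::vector<DocId>>> lists = {
        {580, {3, 18, 66}},      // "uni trier"
        {582, {18, 21, 25}},     // "university"
    };
    std::vector<DocId> D = {3, 16, 18, 25};                    // hits for db*
    for (auto [d, w] : invPrefixQuery(lists, D))
        std::cout << "doc " << d << ", word " << w << '\n';
}
```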
The Inverted Index (INV) — Problems
• Asymptotic time complexity is bad (for our problem)
  • with INV we either have to merge/sort a lot
  • or intersect the same list over and over again
• Still a tough baseline to beat in practice
  • highly compressible: half the space on disk means half the time to read it
  • very good locality of access: the ratio of random-access time to sequential-access time is about 50,000 for disk, and still up to 100 for main memory
  • simple code: instruction cache, branch prediction, etc. work in its favor
A Tree-Based Index (AutoTree) [SPIRE 2006]
• Output-sensitive behaviour
  • query time proportional to the size of the result list
  • anytime algorithm: produces a result element in every step
• Beats the inverted index by a factor of 5
  • but only in main memory
  • heavy use of bit-rank data structures (to compute the number of set bits before a given position in constant time)
A Hybrid Index (HYB) [SIGIR 2006]
• HYB has a block for each word range; conceptually, a block holds the word-in-doc pairs of that range
• Replace doc ids by gaps and word ids by frequency ranks
• Encode both gaps and ranks with a prefix code, so that a value with relative frequency x takes about log2(1/x) bits (see the sketch below), for example:
  • gaps: +0 → 0, +1 → 10, +2 → 110
  • ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
• [Figure: an actual block of HYB]
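A minimal sketch of the block-encoding idea: doc ids become gaps, word ids become frequency ranks, and both are written with a simple prefix code. The unary-style code below mirrors the gap column above for illustration only; the actual HYB code assignment may differ.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Unary-style prefix code: 0 -> "0", 1 -> "10", 2 -> "110", ...
std::string unaryStyleCode(uint32_t v) {
    return std::string(v, '1') + '0';
}

int main() {
    std::vector<uint32_t> docIds = {1, 1, 2, 4};   // doc ids of one block (made up)
    std::vector<uint32_t> ranks  = {1, 3, 2, 1};   // frequency rank of each word (made up)
    std::string bits;
    uint32_t prev = 0;
    for (size_t i = 0; i < docIds.size(); ++i) {
        bits += unaryStyleCode(docIds[i] - prev);  // store the gap, not the doc id
        bits += unaryStyleCode(ranks[i] - 1);      // store the rank, not the word id
        prev = docIds[i];
    }
    std::cout << bits << '\n';                     // the encoded block as a bit string
}
```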
INV vs. HYB — Space Consumption
• Theorem: The empirical entropy of INV is Σ_i n_i ∙ (1/ln 2 + log2(n/n_i))
• Theorem: The empirical entropy of HYB with block size ε∙n is Σ_i n_i ∙ ((1+ε)/ln 2 + log2(n/n_i))
  (n_i = number of documents containing the i-th word, n = number of documents)
• Nice match of theory and practice (a toy calculation of the two bounds follows below)
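To make the two bounds concrete, here is a toy calculation for an assumed document count and per-word document frequencies (all numbers hypothetical). It only illustrates that HYB pays an extra ε/ln 2 bits per posting on top of the INV bound.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    double n = 1e6;                                   // number of documents (assumed)
    std::vector<double> ni = {5e5, 1e5, 1e4, 1e3};    // documents per word (assumed)
    double eps = 0.1;                                 // HYB block size = eps * n
    double inv = 0, hyb = 0;
    for (double x : ni) {
        inv += x * (1.0 / std::log(2.0) + std::log2(n / x));          // INV bound
        hyb += x * ((1.0 + eps) / std::log(2.0) + std::log2(n / x));  // HYB bound
    }
    std::printf("INV: %.0f bits, HYB: %.0f bits\n", inv, hyb);
}
```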
INV vs. HYB — Query Time
• Experiment: type ordinary queries from left to right
  • db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ...
• [Figure: query processing times for INV and HYB]
• HYB beats INV by an order of magnitude
Engineering
• With HYB, every query is essentially one block scan
  • perfect locality of access, no sorting or merging, etc.
  • balanced ratio of read, decompression, processing, etc.
• Careful implementation in C++
• Experiment: sum over an array of 10 million 4-byte integers (on a Linux PC with approx. 2 GB/sec memory bandwidth); a sketch of such a microbenchmark follows below
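The microbenchmark referred to above can be sketched as follows. This is a minimal stand-in, not the original experiment code, and the reported scan rate depends entirely on the machine.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int32_t> a(10'000'000, 1);            // 10 million 4-byte integers
    auto t0 = std::chrono::steady_clock::now();
    int64_t sum = std::accumulate(a.begin(), a.end(), int64_t{0});
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gbPerSec = a.size() * sizeof(int32_t) / secs / 1e9;
    std::printf("sum = %lld, scanned at %.2f GB/s\n", (long long)sum, gbPerSec);
}
```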
System Design — High Level View
• Compute Server (C++) ↔ Web Server (PHP) ↔ User Client (JavaScript)
• Debugging such an application is hell!
Conclusions
• Summary
  • central mechanism for context-sensitive range search
  • very efficient in space and time, scales very well
  • combines IR-style ranked retrieval with DB-style selects and joins
  • supports interactive / semantic / faceted / ontology search
• On our TODO list
  • achieve both output-sensitivity and locality of access
  • integrate top-k query processing
  • find out which SQL queries can be supported efficiently
  • deal with high dynamics (many insertions/deletions)
Thank you!
Basic Problem Definition
• Definition: Context-sensitive prefix search and completion
• Given a query consisting of
  • a sorted list D of doc ids, e.g., Doc15, Doc183, Doc185, Doc17351, …
  • a range W of word ids, e.g., Word1893 – Word7329
• Compute as a result
  • all pairs (w, d) with w Є W, d Є D, and w occurring in d, sorted by doc id
  • e.g., (Word7014, Doc15), (Word5112, Doc15), (Word2011, Doc17351), …
• Refinements
  • positions, e.g., Pos12, Pos73, Pos44, …
  • scores, e.g., 0.7, 0.3, 0.5, …
Basic Problem Definition
• For example, dblp uni
  • set D = document ids from the result for dblp
  • range W = word ids of all words starting with uni
  → a multi-dimensional query is processed as a sequence of 1½-dimensional queries
• For example, intersect the completions of the results for conf:sigir author: and conf:sigmod author:
  → efficient, because the completions are from a small range
Conclusions
• Context-sensitive prefix search and completion
  • is a fundamental operation
  • supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, …
  • efficient support via the HYB index
    • very good compression properties
    • perfect locality of access
• Some open issues
  • integrate top-k query processing
  • what else can we do with it?
  • very short prefixes