Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Type Less, Find More: Fast Autocompletion Search with a Succinct Index • at Google in Mountain View, USA, August 14 • Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany • joint work with Ingmar Weber

It's useful … • Basic Autocompletion • saves typing • no more information than necessary • find out about formulations used (googlism, googlearchy) • error correction (googel)

It's more useful … • Complete to phrases • phrase mountain view → add word mountain_view to index • Complete to subwords • compound word eigenproblem → add word problem to index • Complete to category names • author Edleno Moura → add moura:edleno::author edleno::moura:author • Faceted search • add ct:conference:sigir • add ct:author:edleno_moura • add ct:year:2005 • all via the same mechanism

Related Engines

Basic Problem Definition • Query • a set D of documents (= hits for the first part of the query) • a range W of words (= potential completions of last word) • Answer • all documents D' from D, containing a word from W • all words W' from W, contained in a document from D • Extensions (see paper at SIGIR'06) • ranking (best hits from D' and best completions from W') • positional information (proximity queries) • First try: inverted index (INV)

Processing 1-word queries with INV • For example, goog* • D: all documents • W: all words matching goog* • Iterate over all words from W:
google: Doc. 18, Doc. 53, Doc. 591, ...
googlearchy: Doc. 3, Doc. 66, Doc. 765, ...
googles: Doc. 25, Doc. 98, Doc. 221, ...
googling: Doc. 67, Doc. 189, Doc. 221, ...
googlism: Doc. 16, Doc. 110, Doc. 141, ...
• Merge the document lists (expensive!): D' = Doc. 3, Doc. 16, Doc. 18, Doc. 25, … • Output all words from the range as completions (trivial for 1-word queries): W' = google, googlearchy, googles, …
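The 1-word case can be sketched in a few lines of Python. This is only an illustration, not the implementation from the talk: the names `INV`, `VOCAB`, `word_range` and `complete_one_word` are made up here, and the posting lists are truncated to the doc ids shown on the slide.

```python
import bisect
import heapq

# Toy inverted index: word -> sorted list of doc ids (truncated slide data).
# "mountain" is included only to show that it falls outside the range goog*.
INV = {
    "google": [18, 53, 591],
    "googlearchy": [3, 66, 765],
    "googles": [25, 98, 221],
    "googling": [67, 189, 221],
    "googlism": [16, 110, 141],
    "mountain": [3, 18, 66],
}
VOCAB = sorted(INV)


def word_range(prefix):
    """Return all vocabulary words with the given prefix (a contiguous range)."""
    lo = bisect.bisect_left(VOCAB, prefix)
    hi = bisect.bisect_left(VOCAB, prefix + "\uffff")
    return VOCAB[lo:hi]


def complete_one_word(prefix):
    """Return (D', W') for a 1-word prefix query, where D is all documents."""
    words = word_range(prefix)
    docs = []
    # Merge all posting lists of the range and drop duplicate doc ids.
    for doc in heapq.merge(*(INV[w] for w in words)):
        if not docs or docs[-1] != doc:
            docs.append(doc)
    return docs, words


docs, words = complete_one_word("goog")
print(docs[:4])   # [3, 16, 18, 25]
print(words[:3])  # ['google', 'googlearchy', 'googles']
```

The merge touches every posting in the range, which is the expensive part; the completions are simply the words of the range.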

Processing multi-word queries with INV • For example, goog* mou* • D: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for goog*) • W: all words matching mou* • Iterate over all words from W:
mould: Doc. 8, Doc. 23, Doc. 291, ...
mount: Doc. 24, Doc. 36, Doc. 165, ...
mountain: Doc. 3, Doc. 18, Doc. 66, ...
mounting: Doc. 56, Doc. 129, Doc. 251, ...
moura: Doc. 18, Doc. 21, Doc. 25, ...
• Intersect each list with D, then merge: D' = Doc. 3, Doc. 18, Doc. 25, … • Output all words with non-empty intersection: W' = mountain, moura • Most intersections are empty, but INV has to compute them all!
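A minimal sketch of this procedure, assuming plain sorted Python lists for the posting lists. The function names `intersect` and `complete_inv` are illustrative, and the data is the truncated slide example.

```python
def intersect(a, b):
    """Intersect two sorted doc-id lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def complete_inv(D, candidates):
    """Return (D', W') given the hits D and word -> posting list for the range W."""
    hits = set()
    completions = []
    for word in sorted(candidates):
        # One intersection per potential completion, empty or not.
        common = intersect(D, candidates[word])
        if common:
            completions.append(word)
            hits.update(common)
    # Sorting the collected ids plays the role of the final merge step.
    return sorted(hits), completions


D = [3, 16, 18, 25]  # hits for goog* (truncated)
MOU = {
    "mould": [8, 23, 291],
    "mount": [24, 36, 165],
    "mountain": [3, 18, 66],
    "mounting": [56, 129, 251],
    "moura": [18, 21, 25],
}
print(complete_inv(D, MOU))  # ([3, 18, 25], ['mountain', 'moura'])
```

Three of the five intersections come back empty here, yet each one still costs a pass over D and the word's list.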

INV — Problems • Asymptotic time complexity is bad (for our problem) • many intersections (one per potential completion) • has to merge/sort (the non-empty intersections) • Still hard to beat INV in practice • highly compressible • half the space on disk means half the time to read it • INV has very good locality of access • the ratio random access time/sequential access time is 50,000 for disk, and still 100 for main memory • simple code • instruction cache, branch prediction, etc.

A Hybrid Index (HYB) • Basic Idea: have lists for ranges of words
mould – moura: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
• Problem: not enough to show completions • Solution: store the word(s) along with each doc id
mould – moura: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ... with words mould, moura, mount, mould, mountain, mounting, moura
• But this looks very wasteful
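To illustrate the idea, here is a rough sketch of answering a query from a single block. It assumes a block is a list of (doc id, word) pairs sorted by doc id; the name `complete_hyb` and the block contents are made up for this example (the pairs are derived from the truncated INV lists of the previous slide, not from the block pictured above).

```python
def complete_hyb(D, block, prefix):
    """Return (D', W') with one sequential scan over a (doc id, word) block."""
    d_set = set(D)
    hits = []
    completions = set()
    for doc, word in block:
        if doc in d_set and word.startswith(prefix):
            # The block is sorted by doc id, so duplicates are adjacent.
            if not hits or hits[-1] != doc:
                hits.append(doc)
            completions.add(word)
    return hits, sorted(completions)


# Hypothetical block for the word range mould – moura.
BLOCK = [
    (3, "mountain"), (8, "mould"), (18, "mountain"), (18, "moura"),
    (21, "moura"), (23, "mould"), (24, "mount"), (25, "moura"),
]
print(complete_hyb([3, 16, 18, 25], BLOCK, "mou"))
# ([3, 18, 25], ['mountain', 'moura'])
```

Compared with INV, there is one scan of one list instead of one intersection per candidate word, and the hits come out already sorted.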

HYB — Details • HYB has a block for each word range, conceptually: • Replace doc ids by gaps and words by frequency ranks: • Encode both gaps and ranks such that x → log2 x bits: gaps +0 → 0, +1 → 10, +2 → 110; ranks 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110 • An actual block of HYB • How well does it compress? Which block size?
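The first transformation (doc ids to gaps, words to frequency ranks) might look as follows. This sketch stops before the bit-level encoding; the function name, the tie-breaking rule and the choice of measuring the first gap from 0 are assumptions made for the example.

```python
from collections import Counter


def to_gaps_and_ranks(block):
    """Turn a sorted (doc id, word) block into gaps, ranks and a rank table."""
    counts = Counter(word for _, word in block)
    # Rank 1 = most frequent word; ties broken alphabetically (an assumption).
    by_freq = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i + 1 for i, w in enumerate(by_freq)}
    gaps, ranks = [], []
    prev = 0  # assumption: the first gap is measured from doc id 0
    for doc, word in block:
        gaps.append(doc - prev)  # gap 0 means "same document as before"
        ranks.append(rank[word])
        prev = doc
    return gaps, ranks, by_freq


# Hypothetical block contents.
block = [(3, "mountain"), (16, "mould"), (18, "mountain"),
         (18, "moura"), (25, "moura"), (25, "mountain")]
gaps, ranks, table = to_gaps_and_ranks(block)
print(gaps)   # [3, 13, 2, 0, 7, 0]
print(ranks)  # [1, 3, 1, 2, 2, 1]
print(table)  # ['mountain', 'moura', 'mould']
```

Small gaps and small ranks dominate, which is what lets a variable-length code spend few bits on most entries.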

INV vs. HYB — Space Consumption • Theorem: The empirical entropy of INV is $\sum_i n_i \cdot (1/\ln 2 + \log_2(n/n_i))$ • Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot ((1+\varepsilon)/\ln 2 + \log_2(n/n_i))$ • $n_i$ = number of documents containing the i-th word, $n$ = number of documents • Nice match of theory and practice
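To get a feel for the two expressions, they can be evaluated directly. The sketch below just transcribes the formulas as stated on the slide; the function names and the document frequencies are hypothetical, not measurements from the talk.

```python
import math


def inv_entropy_bits(doc_freqs, n):
    """Sum over words of n_i * (1/ln 2 + log2(n/n_i))."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def hyb_entropy_bits(doc_freqs, n, eps):
    """Sum over words of n_i * ((1 + eps)/ln 2 + log2(n/n_i))."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni))
               for ni in doc_freqs)


# Hypothetical collection: n documents, n_i = documents containing word i.
n = 1000
doc_freqs = [500, 120, 30, 30, 4, 1]
eps = 0.01

inv = inv_entropy_bits(doc_freqs, n)
hyb = hyb_entropy_bits(doc_freqs, n, eps)
# The two expressions differ by exactly eps / ln 2 bits per posting.
print(hyb - inv, eps / math.log(2) * sum(doc_freqs))
```

Under these formulas the overhead of HYB over INV is eps/ln 2 bits per posting, so a small relative block size keeps the two within a small additive term.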

INV vs. HYB — Query Time • Theoretical analysis → see paper at SIGIR'06 • Experiment: type ordinary queries from left to right • go, goo, goog, googl, google, google mo, google mou, ... • Result (INV vs. HYB): HYB better by an order of magnitude

System Design — High Level View • Compute Server (C++) • Web Server (PHP) • User Client (JavaScript) • Debugging such an application is hell!

Summary of Results • Properties of HYB • highly compressible (just like INV) • fast prefix-completion queries (perfect locality of access) • fast indexing (no full inversion necessary) • Autocompletion and more • phrase and subword completion, semantic completion, XML support, … • faceted search (Workshop Talk on Thursday) • efficient DB joins: author[sigir sigmod] NEW all with one and the same (efficient) mechanism
