Heavy-Tailed Distribution and Multi-Keyword Queries

Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui
Microsoft Corporation
SIGIR 2007
2008. 07. 31.
Summarized by JongHeum Yeon, IDS Lab., Seoul National University

INTRODUCTION
• Inverted index in information retrieval
  • T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
  • "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
  • Search "what", "is", "it"
  • {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
• Some queries require costly deep traversal of long lists on websites with large product catalogs (Amazon, eBay, …)
• The challenge is to reduce the worst-case overhead required to process arbitrary keyword queries
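The inverted-index example above can be reproduced with a short sketch. The function names `build_inverted_index` and `search` are illustrative only and do not come from the paper; tokenization is assumed to be a plain whitespace split.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return index


def search(index, keywords):
    """Return the IDs of documents that contain every keyword."""
    # Start from the shortest posting list so the running result stays small.
    lists = sorted((index.get(w, set()) for w in keywords), key=len)
    if not lists:
        return set()
    result = set(lists[0])
    for postings in lists[1:]:
        result &= postings
    return result


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
hits = search(index, ["what", "is", "it"])  # documents 0 and 1
```

Sets stand in for posting lists here; a real engine would keep sorted, disk-resident lists, which is where the cost of long lists comes from.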

Motivating Scenario
• More frequent terms have relatively long inverted lists
• Intersections of long inverted lists are very slow relative to other queries
• Figure
  • 20 million products
  • Frequency: F (> 900K), M (50K), L (< 1K)

Problem Statement
• Given a document collection, propose a set of indexes to materialize such that:
  • The time for intersecting keywords does not exceed a given threshold Δ
  • The additional indexes are no larger than k (a small factor) times the size of the original inverted index
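One way to read the two requirements as inequalities is sketched below. This is an interpretation of the slide, not the paper's own formulation: M is a hypothetical symbol for the set of materialized keyword combinations, cost(Q) is the processing cost of a query Q, size(·) counts matching documents, and γ is the global vocabulary. It also assumes that the index for a combination C holds size(C) postings.

```latex
\begin{align*}
\mathrm{cost}(Q) &\le \Delta
  && \text{for every supported query } Q, \\
\sum_{C \in M} \mathrm{size}(C) &\le k \cdot \sum_{w \in \gamma} \mathrm{size}(w)
  && \text{(space budget for the additional indexes).}
\end{align*}
```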

INDEX STRUCTURE AND USAGE
• Notation
  • Query Q
  • words(Q) = {w1, … , wl}
  • kmax: maximum number of terms in a query
  • γ: global vocabulary
  • π: global ordering
    • Given a keyword combination C = {w1, … , wl}, sort its words by the global ordering so that permutations of the same combination are not treated as distinct
  • size(Q): number of items (= documents) whose text contains all keywords of a query Q
  • size(w): for a single word w, the number of documents containing w
  • |Q|: number of keywords a query Q contains
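The role of the global ordering π can be illustrated with a minimal sketch: sorting a combination by π gives every permutation of the same keywords one canonical key. The helper `canonical_key` and the `rank` dictionary are hypothetical names; the slides only say that such an ordering exists, not how it is defined.

```python
def canonical_key(combination, rank):
    """Return the keywords of a combination as a tuple sorted by global rank."""
    return tuple(sorted(set(combination), key=lambda w: rank[w]))


# Hypothetical global ordering pi: each vocabulary word mapped to its position.
rank = {"a": 0, "banana": 1, "is": 2, "it": 3, "what": 4}

key1 = canonical_key(["what", "is", "it"], rank)
key2 = canonical_key(["it", "what", "is"], rank)
# Both permutations collapse to the same key, so only one index entry is needed.
```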

Cost Model
• Cost = disk seeks to the beginning of posting lists + scanning postings
• Unit of cost: scanning a single posting in an inverted index
• Δ: cost bound
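As a rough sketch of this cost model, the cost of reading a set of posting lists in full is one seek per list plus one unit per posting scanned. The function name `id_intersection_cost` is illustrative; the default seek cost of 1000 posting-scans mirrors the CostSeek value reported in the experiments, and the list lengths in the example are made-up inputs.

```python
def id_intersection_cost(list_lengths, cost_seek=1000):
    """Cost of fully reading each posting list: one seek per list plus one unit per posting."""
    return len(list_lengths) * cost_seek + sum(list_lengths)


def within_bound(list_lengths, delta, cost_seek=1000):
    """Check whether reading the given lists stays within the cost bound delta."""
    return id_intersection_cost(list_lengths, cost_seek) <= delta


# Two posting lists of 900,000 and 50,000 entries: 2 seeks + 950,000 scans.
example_cost = id_intersection_cost([900_000, 50_000])
```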

Processing Strategies
• Execution strategies
  • ID-intersection
    • Retrieves the inverted lists of all queried keywords and intersects them
    • |Q| seeks to disk, reading each list in its entirety
  • Post-filtering
    • Used when a keyword wi in Q is very rare
    • Reads the text of the items containing wi via its inverted list, then verifies the remaining keyword constraints against the text
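The two strategies can be compared on a toy collection with the sketch below. It is an in-memory approximation under stated assumptions: sets stand in for on-disk posting lists, text is tokenized by whitespace, and the function names `id_intersection` and `post_filtering` are labels chosen here rather than the paper's API.

```python
docs = ["it is what it is", "what is it", "it is a banana"]

# Build a small inverted index: word -> set of document IDs.
index = {}
for doc_id, text in enumerate(docs):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)


def id_intersection(index, query):
    """Read the full posting list of every keyword and intersect them."""
    lists = [index.get(w, set()) for w in query]
    return set.intersection(*lists) if lists else set()


def post_filtering(index, docs, query):
    """Read only the rarest keyword's list, then check the other keywords in the text."""
    if not query:
        return set()
    rare = min(query, key=lambda w: len(index.get(w, ())))
    rest = [w for w in query if w != rare]
    matches = set()
    for doc_id in index.get(rare, ()):
        tokens = set(docs[doc_id].split())
        if all(w in tokens for w in rest):
            matches.add(doc_id)
    return matches
```

Both functions return the same documents; they differ in what they read. Post-filtering touches one short list plus the text of a few candidates, which pays off only when the chosen keyword is rare.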

Index Structure
• Materialize inverted indexes for combinations of frequent keywords, but only for a small fraction of them
• Match list: for each vocabulary item w, a list of all keyword combinations containing w for which the corresponding inverted index has been materialized

Query Processing
• Query Q = {w1, … , wl}
• If Q contains a rare keyword: use the post-filtering strategy
• Otherwise: retrieve all match-list entries
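The dispatch between the two cases might look like the following sketch. Several parts are assumptions made for illustration: the name `plan_query`, the `rare_threshold` parameter (the slides do not say where the rare/frequent cut-off lies for query processing), and the representation of a match list as a dictionary from a word to tuples of keywords. The sketch stops at collecting the applicable match-list entries, since the slides do not describe how the retrieved entries are combined afterwards.

```python
def plan_query(query, size, match_list, rare_threshold):
    """Pick a strategy and, for frequent-only queries, the usable materialized combinations."""
    words = set(query)
    # A rare keyword makes post-filtering cheap, so no match-list lookup is needed.
    if any(size.get(w, 0) < rare_threshold for w in words):
        return ("post-filtering", [])
    # Otherwise collect every materialized combination fully contained in the query.
    candidates = set()
    for w in words:
        for combo in match_list.get(w, ()):
            if set(combo) <= words:
                candidates.add(combo)
    return ("intersection", sorted(candidates))


# Hypothetical document frequencies and match lists for a tiny vocabulary.
size = {"it": 900, "is": 800, "what": 700, "banana": 3}
match_list = {
    "it": [("is", "it"), ("it", "what")],
    "is": [("is", "it")],
    "what": [("it", "what")],
}

plan_frequent = plan_query(["it", "is"], size, match_list, rare_threshold=50)
plan_rare = plan_query(["banana", "it"], size, match_list, rare_threshold=50)
```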

EXPERIMENTS
• Evaluation of query cost
  • Materialized the index structure for 10K frequent words
  • kmax = 4, CostSeek = 1000
  • Δ: cost of scanning 20% of the number of postings
  • Speed-ups
    • 18x (2 keywords)
    • 14x (4 keywords)
• Evaluation of index sizes
  • 899M postings
  • No additional indexes for keywords occurring in fewer than 50 documents
  • 141K keywords for indexing
  • Multi-keyword index structures contained 734M postings
• Accuracy of intersection-size estimation
  • Match list covers 99.3%
