Heavy-Tailed Distribution and Multi-Keyword Queries • Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui • Microsoft Corporation • SIGIR 2007 • 2008. 07. 31. • Summarized by JongHeum Yeon, IDS Lab., Seoul National University
INTRODUCTION • Inverted Index in Information Retrieval • T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana" • "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1} • Search "what", "is", "it" • {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1} • Some queries require costly deep traversal of long lists on websites with large product catalogs (Amazon, eBay, …) • The challenge is to reduce the worst-case overhead of processing arbitrary keyword queries
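The toy example on this slide can be reproduced with a short sketch. The function names `build_inverted_index` and `search` are illustrative choices, not names from the paper, and whitespace tokenization is an assumption made for brevity.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():  # assumption: plain whitespace tokenization
            index[word].add(doc_id)
    return index


def search(index, keywords):
    """Return the IDs of documents containing every keyword."""
    postings = [index.get(w, set()) for w in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


docs = ["it is what it is", "what is it", "it is a banana"]  # T0, T1, T2
index = build_inverted_index(docs)
result = search(index, ["what", "is", "it"])
print(sorted(result))  # [0, 1]
```

The result matches the intersection {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1} given on the slide.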
Motivating Scenario • More frequent terms have relatively long inverted lists • Intersecting long inverted lists is very slow relative to processing other queries • Figure (not reproduced) • 20 million products • Keyword frequency classes : F (> 900K), M (50K), L (< 1K)
Problem Statement • Given a document collection, propose a set of indexes to materialize such that • The time for intersecting keywords does not exceed a given threshold Δ • The additional indexes are no larger than k (a small factor) times the size of the original inverted index
INDEX STRUCTURE AND USAGE • Notation • Query Q • words(Q) = {w1, … , wl} • kmax : maximum number of terms in a query • γ : global vocabulary • π : global ordering • Given a keyword combination C = {w1, … , wl}, sort its words by the global ordering so that permutations of the same combination are not treated as distinct • size(Q) : number of items (= documents) whose text contains all keywords of a query Q • size(w) : for a single word w, the number of documents containing w • |Q| : number of keywords in a query Q
Cost Model • Cost = disk seeks to the beginning of posting lists + scanning postings • Unit of cost : scanning a single posting in an inverted index • Δ : cost bound
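A minimal sketch of this cost model, applied to a plain intersection that seeks to each queried list once and scans it fully. The function names are hypothetical, and the seek cost of 1000 posting-scan units is borrowed from the CostSeek value reported on the experiments slide; the example list sizes and the bound are made up for illustration.

```python
COST_SEEK = 1000  # cost of one disk seek, in units of "scan one posting"


def intersection_cost(list_sizes, cost_seek=COST_SEEK):
    """One seek per inverted list, plus one unit per posting scanned."""
    return len(list_sizes) * cost_seek + sum(list_sizes)


def within_bound(list_sizes, delta, cost_seek=COST_SEEK):
    """True if the query can be answered without exceeding the bound delta."""
    return intersection_cost(list_sizes, cost_seek) <= delta


# Illustrative sizes only: one very frequent and one mid-frequency keyword.
cost = intersection_cost([900_000, 50_000])
print(cost)                                       # 952000
print(within_bound([900_000, 50_000], 200_000))   # False
print(within_bound([800, 600], 200_000))          # True
```

Under these assumptions the seek term only matters for short lists; for frequent keywords the scan term dominates, which is why long lists drive the worst case.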
Processing Strategies • Execution Strategies • ID-intersection • Retrieves the inverted lists of all queried keywords and intersects them • Requires |Q| disk seeks and reads each list in its entirety • Post-filtering • Used when a keyword wi in Q is very rare • Retrieves the documents containing wi via its inverted list, then verifies the remaining keyword constraints against the document text
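The two strategies can be sketched as follows. This is an in-memory illustration under simplifying assumptions (sorted Python lists stand in for on-disk posting lists, and whitespace tokenization stands in for real text matching); the function names are hypothetical.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def id_intersection(index, keywords):
    """Read the full inverted list of every keyword and intersect them."""
    lists = [index.get(w, []) for w in keywords]
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect_sorted(result, postings)
    return result


def post_filter(index, docs, keywords):
    """Fetch documents for the rarest keyword, then check the rest in the text."""
    rare = min(keywords, key=lambda w: len(index.get(w, [])))
    others = [w for w in keywords if w != rare]
    matches = []
    for doc_id in index.get(rare, []):
        tokens = set(docs[doc_id].split())
        if all(w in tokens for w in others):
            matches.append(doc_id)
    return matches


docs = ["it is what it is", "what is it", "it is a banana"]
index = {}
for doc_id, text in enumerate(docs):
    for word in set(text.split()):
        index.setdefault(word, []).append(doc_id)  # IDs arrive in sorted order

print(id_intersection(index, ["what", "is", "it"]))  # [0, 1]
print(post_filter(index, docs, ["banana", "is"]))    # [2]
```

Post-filtering touches only the short list of the rare keyword plus the matching documents' text, so it avoids scanning the long lists of the other keywords.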
Index Structure • Materialize inverted indexes for combinations of frequent keywords, but only for a small fraction of those combinations • Match list : for each vocabulary item w, a list of all keyword combinations containing w for which the corresponding inverted index has been materialized
Query Processing • Query Q = {w1, … , wl} • Q contains a rare keyword : post-filtering strategy • Otherwise : retrieve all match-list entries
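A rough sketch of this dispatch step, under several assumptions that are not in the slides: the global ordering π is taken to be alphabetical, "rare" means fewer than `rare_threshold` documents, and after the match-list lookup only the materialized combinations fully contained in the query are kept as candidates. How the paper chooses among the candidates is not shown here. All names and the sample data are hypothetical.

```python
def canonical(words):
    """Sort a keyword combination by the global ordering (assumed alphabetical)."""
    return tuple(sorted(set(words)))


def build_match_lists(materialized_combos):
    """For each word, list the materialized combinations that contain it."""
    match_lists = {}
    for combo in materialized_combos:
        for w in combo:
            match_lists.setdefault(w, []).append(combo)
    return match_lists


def plan_query(query, size, match_lists, rare_threshold):
    """Pick a strategy and, if applicable, the usable materialized combinations."""
    words = canonical(query)
    if any(size.get(w, 0) < rare_threshold for w in words):
        return ("post-filtering", [])
    query_set = set(words)
    candidates = set()
    for w in words:
        for combo in match_lists.get(w, []):
            if query_set.issuperset(combo):  # combination must be part of the query
                candidates.add(combo)
    return ("match-list", sorted(candidates))


# Made-up document frequencies and materialized combinations, for illustration.
size = {"red": 5000, "shoes": 8000, "leather": 4000, "zanzibar": 3}
combos = [canonical(["shoes", "red"]), canonical(["shoes", "leather"])]
match_lists = build_match_lists(combos)

print(plan_query(["shoes", "red"], size, match_lists, 50))
print(plan_query(["red", "zanzibar"], size, match_lists, 50))
```

Canonicalizing each combination with the global ordering means that ("shoes", "red") and ("red", "shoes") map to the same key, so each combination needs to be stored only once.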
EXPERIMENTS • Evaluation of Query Cost • Materialized the index structure : 10K frequent words • kmax = 4, CostSeek = 1000 • Δ : cost of scanning 20% of the number of postings • Speed-ups • 18x (2 keywords) • 14x (4 keywords) • Evaluation of Index Sizes • 899M postings • No additional indexes for keywords occurring in fewer than 50 documents • 141K keywords for indexing • Multi-keyword index structures contained 734M postings • Accuracy of Intersection-size Estimation • Match list covers 99.3%