ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines The 31st Annual International ACM SIGIR Conference Singapore, 21 July 2008
Motivation • Caching – crucial for a web search engine (WSE) to save resources • Results caching: • Is efficient with real queries • But its hit rate is limited due to singletons (queries seen only once) • How to increase the hit rate further? – index pruning
Contents • ResIn architecture • Original query stream vs. query stream after the results cache (misses) • Static pruned index: • Term pruning • Document pruning • A combination of both • Conclusion
ResIn architecture • We study Results Caching and Index Pruning together • … to reduce latency and load on back-end servers • Query processing, case 1: answered from the main index – the front end passes the query to the broker, which sends it to the back-end servers (each holding a term cache over its part of the main index) and merges their top results into the final result [Architecture diagram: front end → broker → back ends with term caches → main index]
ResIn architecture • We study Results Caching and Index Pruning together • … to reduce latency and load on back-end servers • Query processing, case 2: answered from the results cache – the broker first checks the results cache; a hit is returned immediately, a miss is forwarded to the back ends [Diagram: results cache added at the broker, in front of the back ends and the main index]
ResIn architecture • We study Results Caching and Index Pruning together • … to reduce latency and load on back-end servers • Query processing, case 3: answered from the pruned index – on a results-cache miss the broker tries the pruned index; only if that also misses does the query go to the main index [Diagram: results cache and pruned index at the broker, in front of the main index]
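To make the three cases above concrete, here is a minimal sketch of the broker logic in Python. The results_cache, pruned_index, and main_index objects and their get/put/topk methods are hypothetical stand-ins, not interfaces from the paper; the sketch only fixes the order in which the components are consulted.

```python
def process_query(query, results_cache, pruned_index, main_index):
    """ResIn query flow: results cache first, then the pruned index,
    and only on a double miss the full back-end (main) index."""
    result = results_cache.get(query)
    if result is not None:                # case 2: results-cache hit
        return result

    result = pruned_index.topk(query)     # case 3: pruned index can guarantee
    if result is None:                    #         the correct top-k, or returns None
        result = main_index.topk(query)   # case 1: full evaluation on the back ends

    results_cache.put(query, result)      # misses populate the results cache
    return result
```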
Original query stream (all queries) vs. query stream after the results cache (misses)
All queries vs. Misses: Experimental setup • Real query log used to test the results cache and generate a "miss-log": 185M queries from yahoo.co.uk are replayed through an LRU results cache; hits (e.g., the repeated query "britney spears") are answered by the cache, and the remaining misses form the miss-log [Diagram: original query log (all queries: Q1 britney spears, Q2 sigir 2007, Q3 britney spears, Q4 sigir 2008) → results cache (LRU) → miss-log (misses: Q1, Q2, Q4); Q3 is a hit]
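A minimal sketch of how such a miss-log can be produced, assuming an in-memory LRU cache keyed by the normalized query string; the cache capacity and the toy log below are illustrative, not the values used in the experiments.

```python
from collections import OrderedDict

def split_miss_log(queries, capacity):
    """Replay a query stream through an LRU results cache and return the
    sub-stream of misses, i.e. the queries the back end still has to answer."""
    cache = OrderedDict()            # query -> cached result (placeholder value here)
    misses = []
    for q in queries:
        if q in cache:
            cache.move_to_end(q)     # hit: refresh recency, nothing reaches the back end
        else:
            misses.append(q)         # miss: goes to the back end and into the miss-log
            cache[q] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used entry
    return misses

# Toy version of the slide's example: the repeated query is a hit.
log = ["britney spears", "sigir 2007", "britney spears", "sigir 2008"]
print(split_miss_log(log, capacity=1000))   # ['britney spears', 'sigir 2007', 'sigir 2008']
```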
All queries vs. Misses: Number of terms in a query • Average number of terms: 2.4 for all queries, 3.2 for misses • Most single-term queries are hits in the results cache • Queries with many terms are unlikely to be hits
All queries vs. Misses: Query result size distribution • Randomly selected 2000 queries from all queries and from misses: • Avg. result size for misses is ~100 times smaller than for all queries • Approx. half of the misses return fewer than 5000 results – SMALL! • Similar results with a "small" UK document collection (78M documents)
All queries vs. Misses: Term popularity distribution • Each point → avg. popularity of 1000 consecutive terms • The order of terms for misses is the same as for all queries • Terms that were popular before the results cache remain popular after it • Log sizes: 185M queries (all queries), 41M (misses)
Static pruned index • Smaller version of the main index, returns: • the top-k response that is the same as the main index's, or • a miss otherwise • Assumes Boolean query processing • Types of pruning (see the sketch below): • Term pruning – full posting lists for selected terms • Document pruning – truncated posting lists • Term+Document pruning – combination of both [Diagram: posting lists for t1–t4 under the full index, term pruning (some lists kept in full, others dropped), document pruning (every list truncated), and T+D pruning (both)]
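The three pruning flavours can be sketched on a toy in-memory index, here a dict mapping each term to a list of (doc_id, score) postings; the representation and function names are assumptions for illustration, not the implementation used in the paper.

```python
def term_prune(index, keep_terms):
    """Term pruning: keep the *full* posting list of every selected term,
    drop all other terms."""
    return {t: plist for t, plist in index.items() if t in keep_terms}

def doc_prune(index, max_len):
    """Document pruning: keep only the max_len highest-scoring postings
    of every list; no term is dropped, every list is truncated."""
    return {t: sorted(plist, key=lambda p: p[1], reverse=True)[:max_len]
            for t, plist in index.items()}

def term_doc_prune(index, keep_terms, max_len):
    """Term+Document pruning: truncated lists for the selected terms only."""
    return doc_prune(term_prune(index, keep_terms), max_len)
```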
Term Pruning: Performance • Term pruning based on profit(t) = popularity(t) / df(t) • Answers a query only if all query terms are in the pruned index • Performs well for all queries • For misses as well: e.g., it can process almost 50% of the queries with 25% of the index (UK document collection, 78M documents) – see the sketch below
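A sketch of this selection policy, assuming per-term popularity counts from the query log and document frequencies are available, together with a postings budget for the pruned index; the greedy budget handling is an illustrative assumption.

```python
def select_terms(popularity, df, budget_postings):
    """Rank terms by profit(t) = popularity(t) / df(t) and keep full posting
    lists, highest profit first, until the postings budget is used up."""
    ranked = sorted(popularity, key=lambda t: popularity[t] / df[t], reverse=True)
    kept, used = set(), 0
    for t in ranked:
        if used + df[t] <= budget_postings:
            kept.add(t)
            used += df[t]
    return kept

def answerable(query_terms, kept):
    """The term-pruned index can answer a query only if *all* its terms were kept."""
    return all(t in kept for t in query_terms)
```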
Results Caching + Term Pruning • Results caching performance is independent of the collection size • Results cache capacity is up to 10% of the full index size
Term pruning: Frequent terms in misses • MinDF (df of the least frequent query term) correlates with the result size • MaxDF (df of the most frequent query term) is high for most of the misses • Many misses contain at least one frequent term • => the term-pruned index has to include large posting lists [Diagram: example query "Gleb Flavio Vassilis Ricardo" with one posting list per term, of very different lengths, marking MinDF (shortest list) and MaxDF (longest list)]
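For illustration only: MinDF and MaxDF are simply the extremes of the query terms' document frequencies. Under Boolean AND, MinDF upper-bounds the result size, while a high MaxDF means the pruned index must hold at least one long posting list to answer the query. The df values below are made up.

```python
def df_stats(query_terms, df):
    """Return (MinDF, MaxDF) of a query given a term -> document frequency map."""
    dfs = [df[t] for t in query_terms]
    return min(dfs), max(dfs)

# Hypothetical document frequencies for the example query from the slide.
df = {"gleb": 1_200, "flavio": 48_000, "vassilis": 30_000, "ricardo": 2_400_000}
print(df_stats(["gleb", "flavio", "vassilis", "ricardo"], df))   # (1200, 2400000)
```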
Document pruning • Based on Fagin's top-k intersection algorithm • Keeps only the postings with high scores: • Sufficient to compute the top-k results for some queries • Determining correctness of the result requires computing a scoring threshold – LATENCY! [Diagram: posting lists for t1, t2, t3 sorted by score and truncated; top-2 results D1, D2; score threshold = s(D2,t1) + s(D1,t2) + s(D2,t3), the sum of the scores of the last kept posting in each list]
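A toy sketch of this correctness check, assuming each query term's truncated posting list is sorted by descending score. It follows the idea on the slide – the last kept score of each list bounds the unknown contributions – but it is not the algorithm from the paper and ignores tie-breaking.

```python
def pruned_topk_or_miss(pruned_lists, k):
    """Return a guaranteed-correct top-k from truncated posting lists, or None.

    pruned_lists: term -> list of (doc_id, score), sorted by descending score.
    A document missing from a truncated list may still match below the cut,
    so its unknown contribution is bounded by that list's last kept score.
    """
    last = {t: pl[-1][1] for t, pl in pruned_lists.items()}   # cut-off score per list
    tau = sum(last.values())              # bound for documents pruned from every list

    docs = {d for pl in pruned_lists.values() for d, _ in pl}
    exact, bound = {}, {}                 # fully known scores vs. upper bounds
    for d in docs:
        score, complete = 0.0, True
        for t, pl in pruned_lists.items():
            s = dict(pl).get(d)
            if s is None:                 # unseen here: bound by the cut-off score
                score += last[t]
                complete = False
            else:
                score += s
        (exact if complete else bound)[d] = score

    top = sorted(exact.items(), key=lambda x: x[1], reverse=True)[:k]
    worst_case = max([tau] + list(bound.values()))
    if len(top) == k and top[-1][1] >= worst_case:
        return top                        # safe: no pruned document can score higher
    return None                           # miss: re-evaluate the query on the main index
```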
Document pruning: Experimental setup • Scoring function: a query-dependent text score combined with a query-independent document score • pr(d) – query-independent score of document d (pagerank) • ω, k – normalization constants: ω ∈ {0, 10, 20}, k = 1 • We try different values of PLLmax – the maximum posting-list length – and choose the one that maximizes the hit rate • We only look at an upper bound for the hit rate: are the original top-10 results found in the kept top portions of all posting lists?
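The scoring formula itself is not reproduced on this slide; purely as an assumed illustration consistent with the bullets above (a query-dependent text score plus an ω-weighted, k-normalized pagerank term), it could look like:

```latex
% Assumed illustrative form, NOT the paper's exact definition:
% a query-dependent text score plus an omega-weighted, k-normalized pagerank term.
s(d, q) \;=\; s_{\text{text}}(d, q) \;+\; \omega \cdot \frac{\mathrm{pr}(d)}{\mathrm{pr}(d) + k}
```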
Document pruning: performance • Doc. pruning needs high pagerank weights • It performs better for All queries than for Misses
Term+Document pruning: Performance • T+D pruning is the best but expensive (high latency) • profit2 is better than profit1 • Improvement over document pruning is marginal for misses unless the pagerank weight is very high
Conclusions • Results caching: • delivers good hit rates with a constant capacity • but the hit rate is limited because of singletons • Index pruning: • has no limit on the hit rate, • but the pruned-index size grows with the document collection – more expensive • Static index pruning is an addition to results caching, not a replacement • Term pruning performs well for misses too => "compatible" with the results cache • Document pruning: all queries – OK; misses – only with high pagerank weights • Term+Document pruning slightly improves over document pruning • Lesson learned: it is important to consider the interaction between the components
Thank you! Questions?