Improved Techniques for Result Caching in Web Search Engines Qinqing Gan Torsten Suel Presenter: Arghyadip ● Konark
Summary: Result caching in web search engines • Query result caching in search engines to improve query processing performance. • To increase the effective throughput of the entire search engine system. • Discussion of various weighted, unweighted, and hybrid query result caching techniques. • Performance evaluation.
Query Processing • Main challenge for query processing is the significant size of the index data that must be read for a query • Need to optimize to scale with users and data • Caching is one such optimization • Result caching: has this query occurred before? • List caching: has the index data for this term been accessed before?
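As a rough illustration of the difference, here is a minimal Python sketch (class and method names are hypothetical, not from the paper): a result cache keys on the whole query string, while a list cache keys on individual index terms.

```python
# Minimal sketch of the two cache layers; class and method names are
# illustrative, not taken from the paper.

class ResultCache:
    """Caches the final result page, keyed on the full query string."""
    def __init__(self):
        self.entries = {}          # query string -> cached result page

    def lookup(self, query):
        return self.entries.get(query)   # hit only if this exact query was seen

class ListCache:
    """Caches posting lists, keyed on individual index terms."""
    def __init__(self):
        self.entries = {}          # term -> cached inverted list

    def lookup(self, query):
        # A partial hit is possible: some terms cached, others read from disk.
        return {t: self.entries.get(t) for t in query.split()}
```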
Related Work • A number of subsequent papers on result caching (cache hit ratio only): • Baeza-Yates et al. (SPIRE 2003, 2007, SIGIR 2003) • Fagni et al. (TOIS 2006) • Lempel/Moran (WWW 2003) • Saraiva et al. (SIGIR 2001) • Xie/O'Hallaron (Infocom 2002) • Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache • Baeza-Yates et al. (SPIRE 2007) use some features for a cache admission policy
Caching Basics • LRU: least recently used • LFU: least frequently used • Both can be implemented using basic data structures • Score defined as the time of the last occurrence of the query (LRU) or the past frequency of the query (LFU) • Evict the query with the smallest score • Recency (LRU) vs. frequency (LFU) • Various hybrids combine two or more of these policies.
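A minimal Python sketch of the two basic policies using only standard-library data structures; the fixed capacity and the simple score bookkeeping are illustrative choices, not from the paper.

```python
from collections import OrderedDict, Counter

# Sketch of the two basic policies: a fixed-size cache evicts the entry with
# the smallest score (oldest access for LRU, lowest past count for LFU).

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()          # maintains recency order

    def get(self, query):
        if query in self.data:
            self.data.move_to_end(query)   # refresh recency on a hit
            return self.data[query]
        return None

    def put(self, query, result):
        if query in self.data:
            self.data.move_to_end(query)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict least recently used entry
        self.data[query] = result

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.freq = Counter()              # score = past frequency of the query

    def get(self, query):
        if query in self.data:
            self.freq[query] += 1
            return self.data[query]
        return None

    def put(self, query, result):
        if query not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda q: self.freq[q])  # smallest score
            del self.data[victim]
        self.data[query] = result
        self.freq[query] += 1
```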
SDC (Static and Dynamic Caching) • Fagni et al. (TOIS 2006) • The cache is divided into a static part, filled with the results of the most frequent past queries (LFU-style), and a dynamic part managed by LRU • The split between the two parts is controlled by a parameter alpha (here alpha = 0.7)
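A rough sketch of the SDC idea, assuming alpha is the fraction of cache slots given to the static part; warm-up and refresh details of Fagni et al. are omitted.

```python
from collections import Counter, OrderedDict

# Sketch of SDC: a static part frozen to the most frequent training queries,
# plus a dynamic LRU part for everything else. Parameter meaning (alpha =
# static fraction) is an assumption for this sketch.

class SDCCache:
    def __init__(self, capacity, training_queries, alpha=0.7):
        static_size = int(alpha * capacity)
        top = Counter(training_queries).most_common(static_size)
        self.static = {q: None for q, _ in top}       # results filled lazily
        self.dynamic = OrderedDict()                   # plain LRU part
        self.dynamic_size = capacity - static_size

    def get(self, query):
        if query in self.static:
            return self.static[query]                  # None until first computed
        if query in self.dynamic:
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        return None

    def put(self, query, result):
        if query in self.static:
            self.static[query] = result                # static slots are never evicted
            return
        if query not in self.dynamic and len(self.dynamic) >= self.dynamic_size:
            self.dynamic.popitem(last=False)           # evict LRU entry
        self.dynamic[query] = result
```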
Characteristics of Queries (AOL Query Log) • Query frequencies follow a Zipf distribution • While a few queries are quite frequent, most queries occur only once or a few times • [Figure: query frequency vs. rank, plotted on a double logarithmic scale]
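A quick way to check this behaviour on a query log is to plot frequency against rank on a double logarithmic scale; the file name and one-query-per-line format below are assumptions, not the actual AOL log layout.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Count query occurrences and plot frequency vs. rank on a log-log scale;
# under a Zipf law the curve is roughly a straight line.
with open("queries.txt") as f:                 # one query per line (assumed format)
    counts = Counter(line.strip() for line in f)

freqs = sorted(counts.values(), reverse=True)  # frequencies ordered by rank
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("query rank")
plt.ylabel("query frequency")
plt.title("Query frequency distribution")
plt.show()
```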
Characteristics of Queries • Query traces exhibit some amount of burstiness, i.e., repeated occurrences of the same query tend to cluster closely in time • A significant part of this burstiness is due to the same user reissuing a query to the engine. • With an assumed query arrival rate of 132 queries per minute, most queries repeat within a few minutes to an hour
Only Cache Hit? • On a result cache miss, the query must be recomputed from scratch. • Frequent admission and eviction (cache churn) occurs.
Key ideas: • Study result caching as a weighted caching problem, evaluated both by hit ratio and by cost saving • Hybrid algorithms for weighted caching
Weighted Caching • Assume all cache entries have same size. • Standard caching: all entries also same cost • Weighted caching: different costs. • Result caching: some queries more expensive to recompute than others • In fact, costs highly skewed • Should keep expensive results longer
Weighted Caching Algorithms • LFU_w: evict the entry with the smallest value of past frequency * cost (weighted version of LFU) • Landlord: a weighted version of LRU (Young, Cao/Irani 1998) • On insertion, give the entry a deadline equal to its cost • Evict the entry with the smallest deadline, and deduct this deadline from all other deadlines in the cache • SDC_w: combination of LFU_w and Landlord.
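A minimal sketch of the Landlord policy described above, assuming the caller supplies each query's processing cost and that a hit refreshes the entry's deadline to its full cost; for simplicity it deducts the evicted deadline from every entry, whereas a practical implementation would keep a single global offset.

```python
# Sketch of the Landlord policy; the "refresh deadline on a hit" rule is an
# assumption of this sketch, and the per-query cost is supplied by the caller.

class LandlordCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.results = {}     # query -> cached result
        self.deadline = {}    # query -> remaining credit ("rent" left)

    def get(self, query, cost):
        if query in self.results:
            self.deadline[query] = cost      # refresh credit on a hit
            return self.results[query]
        return None

    def put(self, query, result, cost):
        if query not in self.results:
            while len(self.results) >= self.capacity:
                victim = min(self.deadline, key=self.deadline.get)
                paid = self.deadline[victim]
                for q in self.deadline:      # charge the evicted deadline to all tenants
                    self.deadline[q] -= paid
                del self.results[victim]
                del self.deadline[victim]
        self.results[query] = result
        self.deadline[query] = cost          # new entry starts with credit = cost
```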
New Hybrid Algorithms • SDC • lru_lfu • landlord_lfu_w
Weighted Caching and Power Laws • Problem with weighted caching under high skew • Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1 • LFU_w gives both the same priority: is that right? • Lottery analogy: • Multiple rounds, one winner per round • Some people buy more tickets than others, but each person buys the same number each week • Given the past history of winners, guess the future winners • Suppose ticket sales are Zipfian
Weighted Caching and Power Laws • Compare: smoothing techniques in language models • Three solutions: • Good-Turing estimator • Estimator derived from power law • Pragmatic: fit correction factors from real data • Last solution subsumes others
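For reference, a minimal sketch of the (unsmoothed) Good-Turing estimate mentioned above, which predicts that an item observed r times will occur about (r+1)·N_{r+1}/N_r times in a fresh sample of the same size; the example counts at the bottom are made up.

```python
from collections import Counter

# Unsmoothed Good-Turing adjustment of raw counts; real uses smooth the
# N_r counts for large r, which is omitted here.

def good_turing_adjusted_counts(frequencies):
    """frequencies: iterable of per-query occurrence counts."""
    n_r = Counter(frequencies)                     # N_r: number of items seen exactly r times
    adjusted = {}
    for r in n_r:
        if n_r.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)                 # fall back to the raw count
    return adjusted

# Example with invented counts from a skewed (Zipf-like) query log
print(good_turing_adjusted_counts([1, 1, 1, 1, 2, 2, 3, 10]))
```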
Weighted Zipfian Caching • E.g., in LFU_w, the priority score becomes cost * frequency * g(frequency), where g() is a correction factor that adjusts the observed past frequency toward the expected future frequency
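A small sketch of how such a priority might be computed, assuming g() is a lookup table of correction factors fitted from a held-out portion of the query log; the table values below are invented for illustration.

```python
# g[i]: multiplicative correction for an entry seen i times so far; low past
# frequencies are discounted because a single occurrence under a Zipf law is
# weak evidence of future popularity. Values are illustrative only.
g = {1: 0.3, 2: 0.6, 3: 0.8}

def priority(cost, frequency):
    correction = g.get(min(frequency, 3), 1.0)   # frequencies >= 3 capped in this sketch
    return cost * frequency * correction

# Eviction then removes the cached query with the smallest priority() value.
```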
Dataset and Evaluations • 2006 AOL query log with 36 million queries • 4 GB of data collected as HTML pages from Quora • The Lemur search engine has no built-in support for result caching • Plan to develop weighted LRU, LFU, and SDC result caching on top of Lemur • Compare the performance of all the above caching variants with different weights assigned to hit ratio and load • Evaluate which weight metric works best