Improved Techniques for Result Caching in Web Search Engines Qinqing Gan Torsten Suel Presenter: Arghyadip ● Konark
Summary: Result caching in web search engines • Query result caching in search engines to improve query processing performance. • To increase the effective throughput of the entire search engine system. • Discussion of various weighted, unweighted, and hybrid query result caching techniques. • Performance evaluation.
Query Processing • Main challenge for query processing is the significant size of the index data that must be read for a query • Need to optimize to scale with users and data • Caching is one such optimization • Result caching: has this query occurred before? • List caching: has the index data for this term been accessed before?
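As a rough illustration of the difference, here is a minimal Python sketch (class and method names are hypothetical, not from the paper): a result cache keys on the whole query string, while a list cache keys on individual index terms.

```python
# Minimal sketch of the two cache layers; class and method names are
# illustrative, not taken from the paper.

class ResultCache:
    """Caches the final result page, keyed on the full query string."""
    def __init__(self):
        self.entries = {}          # query string -> cached result page

    def lookup(self, query):
        return self.entries.get(query)   # hit only if this exact query was seen

class ListCache:
    """Caches posting lists, keyed on individual index terms."""
    def __init__(self):
        self.entries = {}          # term -> cached inverted list

    def lookup(self, query):
        # A partial hit is possible: some terms cached, others read from disk.
        return {t: self.entries.get(t) for t in query.split()}
```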
Related Work • A number of subsequent papers on result caching (cache hit ratio only): • Baeza-Yates et al. (SPIRE 2003, 2007, SIGIR 2003) • Fagni et al. (TOIS 2006) • Lempel/Moran (WWW 2003) • Saraiva et al. (SIGIR 2001) • Xie/O'Hallaron (Infocom 2002) • Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache • Baeza-Yates et al. (SPIRE 2007) use some features for a cache admission policy
Caching Basics • LRU: least recently used • LFU: least frequently used • Both can be implemented using basic data structures • Score defined as the time of the last occurrence of the query (LRU) or the past frequency of the query (LFU) • Evict the query with the smallest score • Recency (LRU) vs. frequency (LFU) • Various hybrids combine two or more of these policies.
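A minimal Python sketch of the two basic policies using only standard-library data structures; the fixed capacity and the simple score bookkeeping are illustrative choices, not from the paper.

```python
from collections import OrderedDict, Counter

# Sketch of the two basic policies: a fixed-size cache evicts the entry with
# the smallest score (oldest access for LRU, lowest past count for LFU).

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()          # maintains recency order

    def get(self, query):
        if query in self.data:
            self.data.move_to_end(query)   # refresh recency on a hit
            return self.data[query]
        return None

    def put(self, query, result):
        if query in self.data:
            self.data.move_to_end(query)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict least recently used entry
        self.data[query] = result

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.freq = Counter()              # score = past frequency of the query

    def get(self, query):
        if query in self.data:
            self.freq[query] += 1
            return self.data[query]
        return None

    def put(self, query, result):
        if query not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda q: self.freq[q])  # smallest score
            del self.data[victim]
        self.data[query] = result
        self.freq[query] += 1
```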
SDC (Static and Dynamic Caching) • Fagni et al. (TOIS 2006) • The cache is divided into a static part, filled with the results of the most frequent past queries (LFU-style), and a dynamic part managed by LRU • The split between the two parts is controlled by a parameter alpha (here alpha = 0.7)
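A rough sketch of the SDC idea, assuming alpha is the fraction of cache slots given to the static part; warm-up and refresh details of Fagni et al. are omitted.

```python
from collections import Counter, OrderedDict

# Sketch of SDC: a static part frozen to the most frequent training queries,
# plus a dynamic LRU part for everything else. Parameter meaning (alpha =
# static fraction) is an assumption for this sketch.

class SDCCache:
    def __init__(self, capacity, training_queries, alpha=0.7):
        static_size = int(alpha * capacity)
        top = Counter(training_queries).most_common(static_size)
        self.static = {q: None for q, _ in top}       # results filled lazily
        self.dynamic = OrderedDict()                   # plain LRU part
        self.dynamic_size = capacity - static_size

    def get(self, query):
        if query in self.static:
            return self.static[query]                  # None until first computed
        if query in self.dynamic:
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        return None

    def put(self, query, result):
        if query in self.static:
            self.static[query] = result                # static slots are never evicted
            return
        if query not in self.dynamic and len(self.dynamic) >= self.dynamic_size:
            self.dynamic.popitem(last=False)           # evict LRU entry
        self.dynamic[query] = result
```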
Characteristics of Queries (AOL Query Log) • Query frequencies follow a Zipf distribution • While a few queries are quite frequent, most queries occur only once or a few times • [Figure: query frequency vs. rank, plotted on a double logarithmic scale]
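A quick way to check this behaviour on a query log is to plot frequency against rank on a double logarithmic scale; the file name and one-query-per-line format below are assumptions, not the actual AOL log layout.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Count query occurrences and plot frequency vs. rank on a log-log scale;
# under a Zipf law the curve is roughly a straight line.
with open("queries.txt") as f:                 # one query per line (assumed format)
    counts = Counter(line.strip() for line in f)

freqs = sorted(counts.values(), reverse=True)  # frequencies ordered by rank
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("query rank")
plt.ylabel("query frequency")
plt.title("Query frequency distribution")
plt.show()
```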
Characteristics of Queries • Query traces exhibit some amount of burstiness, i.e., repeated occurrences of the same query tend to cluster closely in time • A significant part of this burstiness is due to the same user reissuing a query to the engine. • With an assumed query arrival rate of 132 queries per minute, most queries repeat within a few minutes to an hour
Only Cache Hit? • On a result cache miss, the query must be recomputed from scratch. • Frequent admission and eviction (cache churn) occurs.
Key ideas: • Study result caching as a weighted caching problem, evaluated both by hit ratio and by cost saving • Hybrid algorithms for weighted caching
Weighted Caching • Assume all cache entries have same size. • Standard caching: all entries also same cost • Weighted caching: different costs. • Result caching: some queries more expensive to recompute than others • In fact, costs highly skewed • Should keep expensive results longer
Weighted Caching Algorithms • LFU_w: evict the entry with the smallest value of past frequency * cost (weighted version of LFU) • Landlord: a weighted version of LRU (Young, Cao/Irani 1998) • On insertion, give the entry a deadline equal to its cost • Evict the entry with the smallest deadline, and deduct this deadline from all other deadlines in the cache • SDC_w: combination of LFU_w and Landlord.
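A minimal sketch of the Landlord policy described above, assuming the caller supplies each query's processing cost and that a hit refreshes the entry's deadline to its full cost; for simplicity it deducts the evicted deadline from every entry, whereas a practical implementation would keep a single global offset.

```python
# Sketch of the Landlord policy; the "refresh deadline on a hit" rule is an
# assumption of this sketch, and the per-query cost is supplied by the caller.

class LandlordCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.results = {}     # query -> cached result
        self.deadline = {}    # query -> remaining credit ("rent" left)

    def get(self, query, cost):
        if query in self.results:
            self.deadline[query] = cost      # refresh credit on a hit
            return self.results[query]
        return None

    def put(self, query, result, cost):
        if query not in self.results:
            while len(self.results) >= self.capacity:
                victim = min(self.deadline, key=self.deadline.get)
                paid = self.deadline[victim]
                for q in self.deadline:      # charge the evicted deadline to all tenants
                    self.deadline[q] -= paid
                del self.results[victim]
                del self.deadline[victim]
        self.results[query] = result
        self.deadline[query] = cost          # new entry starts with credit = cost
```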
New Hybrid Algorithms • SDC • lru_lfu • landlord_lfu_w
Weighted Caching and Power Laws • Problem with weighted caching under high skew • Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1 • LFU_w gives both the same priority: is that right? • Lottery analogy: • Multiple rounds, one winner per round • Some people buy more tickets than others, but each person buys the same number each week • Given the past history of winners, guess the future winners • Suppose ticket sales are Zipfian
Weighted Caching and Power Laws • Compare: smoothing techniques in language models • Three solutions: • Good-Turing estimator • Estimator derived from power law • Pragmatic: fit correction factors from real data • Last solution subsumes others
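For reference, a minimal sketch of the (unsmoothed) Good-Turing estimate mentioned above, which predicts that an item observed r times will occur about (r+1)·N_{r+1}/N_r times in a fresh sample of the same size; the example counts at the bottom are made up.

```python
from collections import Counter

# Unsmoothed Good-Turing adjustment of raw counts; real uses smooth the
# N_r counts for large r, which is omitted here.

def good_turing_adjusted_counts(frequencies):
    """frequencies: iterable of per-query occurrence counts."""
    n_r = Counter(frequencies)                     # N_r: number of items seen exactly r times
    adjusted = {}
    for r in n_r:
        if n_r.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)                 # fall back to the raw count
    return adjusted

# Example with invented counts from a skewed (Zipf-like) query log
print(good_turing_adjusted_counts([1, 1, 1, 1, 2, 2, 3, 10]))
```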
Weighted Zipfian Caching • E.g., in LFU_w, the priority score becomes cost * frequency * g(frequency), where g() is a correction factor that adjusts the observed past frequency toward the expected future frequency
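A small sketch of how such a priority might be computed, assuming g() is a lookup table of correction factors fitted from a held-out portion of the query log; the table values below are invented for illustration.

```python
# g[i]: multiplicative correction for an entry seen i times so far; low past
# frequencies are discounted because a single occurrence under a Zipf law is
# weak evidence of future popularity. Values are illustrative only.
g = {1: 0.3, 2: 0.6, 3: 0.8}

def priority(cost, frequency):
    correction = g.get(min(frequency, 3), 1.0)   # frequencies >= 3 capped in this sketch
    return cost * frequency * correction

# Eviction then removes the cached query with the smallest priority() value.
```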
Dataset and Evaluations • 2006 AOL query log with 36 million queries • 4 GB of data collected as HTML pages from Quora • The Lemur search engine has no built-in support for result caching • Plan to develop weighted LRU, LFU, and SDC result caching on top of Lemur • Compare the performance of all the above caching variants with different weights assigned to hit ratio and load • Evaluate which weight metric works best