Search Engine Caching Rank-preserving two-level caching for scalable search engines, Patricia Correia Saraiva et al., September 2001 http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/383952.383959 Predictive Caching and Prefetching of Query Results in Search Engines, Ronny Lempel and Shlomo Moran, September 2001 http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/775152.775156 Presented by Adam "So, is this gonna be on the test?" Edelman
The Problem • The User: "I want my results now!" • But... • Over 4 billion web pages • Over 1 million queries per minute • How do we keep response times down as the web grows?
Search Engine Statistics • 63.7% of the search phrases appear only once in a billion-query log • The 25 most popular queries in the log account for 1.5% of the submissions • Considerable time and processing power can be saved through well-implemented caching
Search Engine Statistics • 58% of the users view only the first page of results (the top-10 results) • No more than 12% of users browse through more than 3 result pages. • We do not need to cache large result sets for a given query
What do we Cache? • 36% of all queries have been retrieved before • Can we apply caching even if the query does not exactly match any previous query?
What do we Cache? • Saraiva et al. propose a two-level cache • In addition to caching query results, we also cache inverted lists for popular terms
Query Cache Implementation • Store only the first 50 references per query • ~25KB per query • Query logs show that miss ratios do not drastically improve once the query result cache exceeds 10MB
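The query-result cache above can be sketched as a small LRU map. This is an illustrative sketch, not Saraiva et al.'s code: the class and method names are mine, and the entry count is simply derived from the ~25KB-per-query and 10MB figures on the slide.

```python
from collections import OrderedDict

class QueryResultCache:
    """LRU cache of the first 50 result references per query (sketch)."""
    ENTRY_BYTES = 25 * 1024            # ~25KB per cached query (from the slide)
    CAPACITY_BYTES = 10 * 1024 * 1024  # ~10MB total (from the slide)

    def __init__(self):
        self.max_entries = self.CAPACITY_BYTES // self.ENTRY_BYTES  # ~409 queries
        self._cache = OrderedDict()

    def get(self, query):
        if query in self._cache:
            self._cache.move_to_end(query)   # mark as most recently used
            return self._cache[query]
        return None

    def put(self, query, results):
        self._cache[query] = results[:50]    # store only the first 50 references
        self._cache.move_to_end(query)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used query
```

At ~25KB per entry, a 10MB budget holds only a few hundred distinct queries, which is why the skewed query distribution (1.5% of submissions from 25 queries) still makes this worthwhile.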
Inverted List Cache Implementation • For this data set, 50-75% of inverted lists contain only documents where the term appears exactly once • Use a 4KB inverted list entry per term • More work needs to be done • Asymptotic behavior is apparent once the cache exceeds 200MB • Use 250MB for the IL cache
Two-Level Cache Implementation • Combine the previous two caches • 270MB total cache • Accounts for only 6.5% of the overall index size • Tested over a log of 100K queries to TodoBR
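The two-level lookup order can be sketched as follows. This is only an illustration of the flow described on the slides; the helper names (`fetch_list_from_disk`, `rank`) and the minimal `DictCache` stand-in are hypothetical, not Saraiva et al.'s implementation.

```python
class DictCache:
    """Minimal unbounded cache stand-in for either level (illustrative only)."""
    def __init__(self):
        self._d = {}
    def get(self, k):
        return self._d.get(k)
    def put(self, k, v):
        self._d[k] = v

def process_query(query, query_cache, il_cache, fetch_list_from_disk, rank):
    results = query_cache.get(query)
    if results is not None:            # level 1 hit: no index work at all
        return results
    lists = {}
    for term in query.split():
        postings = il_cache.get(term)  # level 2: per-term inverted lists
        if postings is None:
            postings = fetch_list_from_disk(term)
            il_cache.put(term, postings)
        lists[term] = postings
    results = rank(lists)              # merge/rank postings into result pages
    query_cache.put(query, results)
    return results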
Two-Level Cache Results • Compared against a 270MB cache of only query results, a 270MB cache of only inverted lists, and no cache • Queries processed reduced by 62% • A 21% improvement over the query-result-only cache • Page fetches from the database reduced by 95% • A 3% improvement over the inverted-list-only cache
Two-Level Cache Results • At more than 20 queries per second, the two-level cache performs only 20% of the disk reads of no cache • The two-level cache can handle 64 queries per second, versus 22 per second with no cache
How do we cache? • Saraiva et al. use a least recently used (LRU) replacement policy for cache maintenance • Users search in sessions; the next query will probably be related to the previous one • Can we use this to improve caching?
Probability Driven Cache (PDC) • Lempel and Moran propose a cache based on the probability of a page being requested
Page Least Recently Used (PLRU) • Allocate a page queue that can accommodate a certain number of result pages • When the queue is full and a new page needs to be cached, the least recently used page is removed from the cache • Achieves hit ratios around 30% for warm, large caches
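PLRU is plain LRU applied to result pages, keyed by (query, page number). A minimal sketch, assuming a caller-supplied `fetch_page` function; the class name is mine, not Lempel and Moran's:

```python
from collections import OrderedDict

class PLRUCache:
    """Page LRU: a fixed-size queue of result pages, evicting the LRU page."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = OrderedDict()          # (query, page_no) -> page; tail = MRU

    def request(self, query, page_no, fetch_page):
        key = (query, page_no)
        if key in self.queue:
            self.queue.move_to_end(key)     # hit: refresh recency
            return self.queue[key], True
        page = fetch_page(query, page_no)   # miss: compute/fetch the page
        self.queue[key] = page
        if len(self.queue) > self.capacity:
            self.queue.popitem(last=False)  # evict the least recently used page
        return page, False
```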
Page Segmented LRU (PSLRU) • Maintains two LRU segments, a protected segment and a probationary segment • Pages are first placed in the probationary segment; if requested again, they are moved to the protected segment • Pages evicted from the protected segment are moved to the probationary segment • Pages evicted from the probationary segment are removed from the cache • Consistently outperforms PLRU, although the difference is very small
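The two-segment flow above can be sketched with two ordered maps. A sketch under the slide's description only; segment capacities and names are illustrative assumptions:

```python
from collections import OrderedDict

class PSLRUCache:
    """Segmented LRU over result pages: probationary + protected (sketch)."""
    def __init__(self, prob_cap, prot_cap):
        self.prob_cap, self.prot_cap = prob_cap, prot_cap
        self.prob, self.prot = OrderedDict(), OrderedDict()

    def request(self, key, fetch):
        if key in self.prot:
            self.prot.move_to_end(key)             # hit in protected segment
            return self.prot[key], True
        if key in self.prob:                       # second hit: promote
            page = self.prob.pop(key)
            self.prot[key] = page
            if len(self.prot) > self.prot_cap:     # protected LRU is demoted,
                old_key, old_page = self.prot.popitem(last=False)
                self._insert_probationary(old_key, old_page)
            return page, True
        page = fetch(key)                          # miss: enter on probation
        self._insert_probationary(key, page)
        return page, False

    def _insert_probationary(self, key, page):
        self.prob[key] = page
        self.prob.move_to_end(key)
        if len(self.prob) > self.prob_cap:
            self.prob.popitem(last=False)          # dropped from the cache entirely
```

The protected segment shields repeatedly requested pages from one-off queries, which is why PSLRU edges out plain PLRU.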
Topic LRU (TLRU) • Let t(q) denote the topic of the query q • Once the cache is warm, every cached result page of topic t(q) is moved to the tail of the queue when q arrives • Each topic's pages will therefore reside contiguously in the queue
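TLRU can be sketched as LRU plus a topic-wide refresh on every query. The `topic_of` function is a hypothetical stand-in for Lempel and Moran's t(q) (how topics are actually assigned is not specified on the slide):

```python
from collections import OrderedDict

class TLRUCache:
    """Topic LRU (sketch): a query refreshes all cached pages of its topic."""
    def __init__(self, capacity, topic_of):
        self.capacity, self.topic_of = capacity, topic_of
        self.queue = OrderedDict()      # (query, page_no) -> page; tail = MRU

    def request(self, query, page_no, fetch_page):
        topic = self.topic_of(query)
        for key in list(self.queue):    # move the whole topic to the tail
            if self.topic_of(key[0]) == topic:
                self.queue.move_to_end(key)
        key = (query, page_no)
        hit = key in self.queue
        if not hit:
            self.queue[key] = fetch_page(query, page_no)
            self.queue.move_to_end(key)
            if len(self.queue) > self.capacity:
                self.queue.popitem(last=False)
        return self.queue[key], hit
```

Because related pages are refreshed together, pages of the session's current topic survive eviction even if they were individually requested long ago.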
Topic SLRU (TSLRU) • All pages are initially inserted in the probationary segment • In addition to promoting pages from probationary to protected, we also promote all pages of t(q)
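TSLRU combines both ideas: segmented LRU plus topic-wide promotion. A self-contained sketch under the same assumptions as before (`topic_of` and all names are mine):

```python
from collections import OrderedDict

class TSLRUCache:
    """Topic SLRU (sketch): a query for topic t(q) promotes every
    probationary page of that topic to the protected segment."""
    def __init__(self, prob_cap, prot_cap, topic_of):
        self.prob_cap, self.prot_cap = prob_cap, prot_cap
        self.topic_of = topic_of
        self.prob, self.prot = OrderedDict(), OrderedDict()

    def request(self, query, page_no, fetch_page):
        topic = self.topic_of(query)
        # promote every probationary page sharing the query's topic
        for key in [k for k in self.prob if self.topic_of(k[0]) == topic]:
            self._promote(key)
        key = (query, page_no)
        if key in self.prot:
            self.prot.move_to_end(key)
            return self.prot[key], True
        page = fetch_page(query, page_no)   # miss: insert as probationary
        self.prob[key] = page
        if len(self.prob) > self.prob_cap:
            self.prob.popitem(last=False)
        return page, False

    def _promote(self, key):
        self.prot[key] = self.prob.pop(key)
        if len(self.prot) > self.prot_cap:  # demote the protected LRU page
            old_key, old_page = self.prot.popitem(last=False)
            self.prob[old_key] = old_page
            if len(self.prob) > self.prob_cap:
                self.prob.popitem(last=False)
```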