Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems
Christian Zimmer, Srikanta Bedathur, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany
http://www.mpi-inf.mpg.de
DASFAA Conference 2008
Outline of the Talk
• Motivation
• System Architecture
• Caching Framework
• Exact Caching (EC)
• Approximate Caching (AC)
• Experimental Evaluation
• Conclusions & Open Issues
Motivation
Basics
• High potential of P2P-based Information Retrieval (P2P IR) systems:
  • benefits in general: scalable, efficient, resilient to failures and dynamics, democratic, privacy preserving, and resilient to authoritarian controls
  • benefits from intellectual input of users: click streams, query logs, bookmarks, etc.
• Performance challenges:
  • providing high-quality results (recall & precision)
  • enabling high scalability (number of participating peers & huge amounts of data)
  • unreliable networks: slow response times, intermittent loss of good results
  • extra load on network: many peers must be contacted for good recall
Motivation (cont'd)
Caching of Results
• Traditional performance booster (using previous query executions to help in the future)
• Remember popular items to avoid computing / fetching
Typical Issues
• What to cache?
  • value of cached items
  • inverted lists / full results / partial results
• Where to cache?
  • on querying peers, every node along lookup path (UIC), spread to neighbors (DiCAS), on good nodes (View Trees)
• How much to cache?
  • buffer size
• When to drop from cache?
  • buffering policy
• Goals of caching?
  • response time improvements, query result-quality improvement
System Architecture
Maintaining Metadata
• Autonomous peers with local index (local search engine)
• Distributed global directory layered on top of distributed hash table (DHT)
• DHT partitions term space such that each peer is responsible for a subset of terms
• Peers distribute per-term summaries (Posts) to global directory (size of the index, number of documents containing this term, etc.)
• Directory manages aggregated statistical information in compact form
[Figure: Minerva Search Architecture, showing peers P1-P8 and directory entries D(a), D(b), D(c)]
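The directory layout above can be sketched in a few lines. This is a minimal illustration, not the Minerva implementation: `responsible_peer`, `Directory`, and the Post fields `doc_freq` and `index_size` are hypothetical names, and a hash modulo the number of peers stands in for a real DHT lookup.

```python
import hashlib
from collections import defaultdict


def responsible_peer(term, peers):
    """Map a term to one directory peer via a stable hash.

    Assumption: hash-modulo stands in for a real DHT lookup.
    """
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return peers[int(digest, 16) % len(peers)]


class Directory:
    """Toy global directory: per-term Posts, grouped by responsible peer."""

    def __init__(self, peers):
        self.peers = list(peers)
        # directory peer -> term -> list of Posts
        self.store = defaultdict(lambda: defaultdict(list))

    def publish(self, peer, term, doc_freq, index_size):
        """A peer publishes its summary (Post) for one term."""
        post = {"peer": peer, "doc_freq": doc_freq, "index_size": index_size}
        self.store[responsible_peer(term, self.peers)][term].append(post)

    def peerlist(self, term):
        """Return all Posts published for a term (the term's peerlist)."""
        return list(self.store[responsible_peer(term, self.peers)][term])


directory = Directory(["P1", "P2", "P3"])
directory.publish("P1", "a", doc_freq=12, index_size=1000)
directory.publish("P2", "a", doc_freq=3, index_size=500)
```

Because the mapping depends only on the term, every peer independently agrees on which directory peer to contact for it.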
System Architecture
Query Execution
• Multi-term query a b c
• Peerlist requests to retrieve metadata from directory (metadata retrieval)
• Compute most promising peers for complete query (e.g., CORI, DTF)
• Complete query forwarded to these peers, which execute the query locally (local result retrieval)
• Local results returned and merged into global query result
[Figure: Minerva Search Architecture, showing peers P1-P8, directory entries D(a), D(b), D(c), and the query a b c]
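The two phases can be sketched as follows. This is an illustration under stated assumptions: the peer ranking simply sums per-term document frequencies as a stand-in for CORI or DTF (neither is implemented here), local scores are assumed to be comparable across peers, and `select_peers` and `merge_results` are hypothetical names.

```python
import heapq


def select_peers(peerlists, max_peers):
    """Rank peers for the complete query.

    peerlists: term -> list of (peer, doc_freq) pairs from the directory.
    Assumption: summed doc_freq is a placeholder for CORI/DTF scoring.
    """
    totals = {}
    for posts in peerlists.values():
        for peer, doc_freq in posts:
            totals[peer] = totals.get(peer, 0) + doc_freq
    ranked = sorted(totals, key=lambda p: (-totals[p], p))
    return ranked[:max_peers]


def merge_results(local_results, k):
    """Merge per-peer local results into a global top-k list.

    local_results: peer -> list of (doc, score) pairs.
    A document returned by several peers keeps its best score.
    """
    best = {}
    for results in local_results.values():
        for doc, score in results:
            best[doc] = max(best.get(doc, float("-inf")), score)
    return heapq.nlargest(k, best.items(), key=lambda item: (item[1], item[0]))


peerlists = {"a": [("P1", 5), ("P2", 1)], "b": [("P2", 2), ("P3", 10)]}
chosen = select_peers(peerlists, max_peers=2)
```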
Caching Framework
Main Goals
• Caching for result-quality improvement
• Integration of result caching with query routing (reduces message traffic)
• Cache placement for seamless reuse
• Aggressive result-reuse under certain conditions
Where and What to Cache?
• Potential locations for caching:
  • Query initiator or additional overlays: limited utility to network
  • Directory: choose one directory peer involved in query execution using a deterministic scheme (avoids load balancing concerns)
• Caching full results:
  • Metadata of results (URL, statistics, etc.)
  • Set of source peers contributing to cached results
Caching Framework (cont'd)
Extending Query Execution
• Query routing:
  • initiating peer sends full query to all directory peers responsible for query terms
  • directory checks availability of cached result and, if available, returns it to initiator
• Adding / updating cache:
  • query initiator computes full query result and cached result for top-k items
  • initiator determines directory peer responsible for maintaining cached result
  • directory peer incorporates received cached result in its cache
Two Caching Strategies based on Caching Framework
• Exact Caching (EC):
  • P2P counterpart of traditional result caching
• Approximate Caching (AC):
  • aggressively reuse cached results of query subsets
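The slides do not spell out the deterministic scheme for picking the directory peer that maintains a cached result. One possible scheme, shown purely as an assumption, is to use the directory peer of the lexicographically smallest query term. `cache_term`, `cache_holder`, and the `directory_of` mapping are hypothetical.

```python
def cache_term(query_terms):
    """Pick the query term whose directory peer maintains the cached result.

    Assumption: the lexicographically smallest term; any rule that depends
    only on the term set would be equally deterministic.
    """
    terms = set(query_terms)
    if not terms:
        raise ValueError("query must contain at least one term")
    return min(terms)


def cache_holder(query_terms, directory_of):
    """Return the directory peer that maintains the cache for this query.

    directory_of: term -> directory peer (hypothetical lookup table).
    """
    return directory_of[cache_term(query_terms)]


directory_of = {"a": "P3", "b": "P7", "c": "P1"}
holder = cache_holder(["c", "a", "b"], directory_of)
```

Any initiator issuing the same term set arrives at the same holder regardless of term order, and that holder is already contacted during metadata retrieval.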
Exact Caching (EC)
Main Property
• Only used if stored result was generated by exactly the same query
Caching Approach
• After query execution: cached results stored at directory (by selecting one directory peer)
• Request for a b c by another peer
• Metadata retrieval additionally returns the cached result
  • Initiator satisfied: saves additional communication at same result-quality
  • Improving: local result retrieval from additional peers
  • Updating cached result
[Figure: peers P1-P8 and directory entries D(a), D(b), D(c); two peers issue the query a b c]
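A directory-side cache for EC can be sketched as below. This is a minimal sketch: the class and field names are hypothetical, and treating a query as an unordered term set (so that "a b c" and "c b a" count as the same query) is an assumption.

```python
class ExactCache:
    """Toy exact-match result cache held by one directory peer."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def key(query_terms):
        # Assumption: a query is identified by its set of terms.
        return frozenset(query_terms)

    def lookup(self, query_terms):
        """Return the cached entry for exactly this query, or None."""
        return self._entries.get(self.key(query_terms))

    def update(self, query_terms, top_k, source_peers):
        """Store or replace the cached result after a query execution
        or an improvement step."""
        self._entries[self.key(query_terms)] = {
            "results": list(top_k),        # result metadata (doc, score)
            "sources": set(source_peers),  # peers that contributed
        }


cache = ExactCache()
cache.update(["a", "b", "c"], [("d1", 0.9)], {"P1", "P4"})
entry = cache.lookup(["c", "b", "a"])
```

Keeping the set of source peers with the entry lets an improvement step skip peers that already contributed.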
Approximate Caching (AC)
Limitation of Exact Caching
• EC only applicable when exact query was executed before
• Approximate Caching tries to overcome this issue if cached result for complete query is not available
Caching Approach
• Aggressively retrieve and combine cached results of subsets of requested query to approximate full query
• Avoids local result retrieval
• Metadata retrieval:
  • querying peer requests peerlists for all query terms
  • directory peers return all existing maximal cached results for subsets of query term set
  • querying peer only considers cached results for maximal subqueries received from directory
• By design:
  • directory peers for the query terms are responsible for all possible subqueries
  • if the AC result is not satisfactory, metadata retrieval has already been done
Approximate Caching (AC) (cont'd)
An Example
• Request for a b c d
• No cached result for full query, but directory stores cached results for subqueries
• Metadata retrieval additionally returns all cached results for maximal subqueries
• To combine subquery results, querying peer only considers maximal ones
Unsatisfactory Approximate Result
• Querying peer retrieves local results from top-ranked peers for full query
[Figure: peers P1-P8 and directory entries D(a), D(b), D(c), D(d) holding cached results for subqueries of the query a b c d]
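Selecting the maximal subqueries can be sketched as a set computation. This is an illustration only: `maximal_subqueries` is a hypothetical helper, and the cached term sets in the usage lines are made up for the example rather than read off the figure.

```python
def maximal_subqueries(query_terms, cached_queries):
    """Return cached term sets that are subsets of the query and are not
    contained in any other such cached subset."""
    query = frozenset(query_terms)
    candidates = {frozenset(c) for c in cached_queries}
    candidates = {s for s in candidates if s and s.issubset(query)}
    maximal = [
        s for s in candidates
        if not any(s != other and s.issubset(other) for other in candidates)
    ]
    # Sort for a deterministic order.
    return sorted(maximal, key=lambda s: sorted(s))


cached = [("a", "c", "d"), ("c",), ("b", "c", "d"), ("b", "c"), ("b", "d")]
maximal = maximal_subqueries(["a", "b", "c", "d"], cached)
# Only {a, c, d} and {b, c, d} remain; the others are contained in them.
```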
Approximate Caching (AC) (cont'd)
How to Combine Cached Results of Different Subqueries
• Having determined the document set contained in all cached results for maximal subqueries, documents need to be ranked to obtain the approximate result for the full query
• Consider document scores $\mathrm{score}_{d,p,q}$ from cached results for document $d$ as local result of peer $p$ concerning (sub-)query $q$
Final Score Computation
• To rank the document set and get the approximate result:
  • $\mathrm{score}_d = \max_{p,q} \left( |q| \cdot \mathrm{score}_{d,p,q} \right)$
  • takes different query sizes into account: longer queries are more selective and approximate the full query better
  • more than one cached result can include a document: only consider maximal score
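The final score computation translates directly into code. This is a sketch under one assumption: the document set is taken to be the union of all documents appearing in the cached results. The function name and the input layout are hypothetical.

```python
def combine_cached_results(cached):
    """Rank documents from cached subquery results.

    cached: iterable of (subquery_terms, peer, {doc: score}) triples.
    Implements score_d = max over (p, q) of |q| * score_{d,p,q}.
    """
    combined = {}
    for subquery, _peer, doc_scores in cached:
        weight = len(set(subquery))  # |q|: longer subqueries count more
        for doc, score in doc_scores.items():
            weighted = weight * score
            combined[doc] = max(combined.get(doc, float("-inf")), weighted)
    # Highest score first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


cached = [
    (("a", "c", "d"), "P1", {"d1": 0.5, "d2": 0.25}),
    (("b", "c", "d"), "P2", {"d2": 0.5, "d3": 0.25}),
    (("a", "b"), "P3", {"d1": 1.0}),
]
ranking = combine_cached_results(cached)
```

In this made-up input, d1 gets max(3 * 0.5, 2 * 1.0) = 2.0 and d2 gets max(3 * 0.25, 3 * 0.5) = 1.5, which shows both the length weighting and the max over several cached results.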
Experimental Evaluation
Experimental Setup
• P2P IR benchmark recently proposed for P2P system evaluation [ExpDB 2006]
  • more than 800,000 documents from Wikipedia
  • 99 Google Zeitgeist queries (1-3 query terms)
  • documents distributed to 1,000 peers (with controlled overlap)
• In addition: AOL query-log (real-world log with time ordering)
• Result retrieval returns top-25 local results per peer; final result contains top-50 documents for full query
Measurements
• Relative recall: fraction of ideal result documents included in results of P2P query processing
  • ideal result: top-50 result documents of a centralized query execution over the combined document collection
• Network resource consumption: total network traffic incurred during query processing
  • number of messages transferred across network
  • number of communication rounds
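Relative recall as defined above is a simple set ratio. A minimal sketch, with a hypothetical function name and made-up document ids:

```python
def relative_recall(p2p_results, ideal_results):
    """Fraction of ideal (centralized) result documents that also appear
    in the result of P2P query processing."""
    ideal = set(ideal_results)
    if not ideal:
        raise ValueError("ideal result must not be empty")
    return len(ideal.intersection(p2p_results)) / len(ideal)


# Two of the four ideal documents were found by the P2P query.
recall = relative_recall(["d1", "d2", "d5"], ["d1", "d2", "d3", "d4"])
```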
Experimental Evaluation (cont'd)
I. Improving Recall with Exact Caching (EC)
• Focus on query result improvement by asking additional peers
• Updated cached result stored in directory
• Initial query processing disseminates query to 5% of network; each improvement step considers up to 5% additional network peers
• Relative recall averaged over all 99 Zeitgeist queries
Experimental Evaluation (cont'd)
II. Cache Management Strategies
• Assumes bounded cache space at directory peers, such that cache management policy influences recall for Exact Caching strategy
• Cache at each directory peer restricted to three cached results
• Synthetic query workload from Zeitgeist queries:
  • all 9,180 possible one- and two-term queries formed from the single query terms
  • assuming a power law distribution (total of 102,158 requests)
• Cache replacement strategies: LFU, LRU, FIFO, RAN, UNL (upper bound), and NOC (lower bound)
• Measures: overall relative recall and cache hit ratio
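As an illustration of one of the listed policies, a bounded LRU cache with room for three results can be sketched as follows. This is not the evaluated implementation: the class name, the query-as-term-set key, and the hit-ratio bookkeeping are assumptions.

```python
from collections import OrderedDict


class LRUResultCache:
    """Bounded result cache with least-recently-used replacement."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.requests = 0
        self.hits = 0

    def get(self, query_terms):
        """Look up a cached result and record hit/miss statistics."""
        self.requests += 1
        key = frozenset(query_terms)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return self._entries[key]

    def put(self, query_terms, result):
        """Insert or refresh a result, evicting the LRU entry if full."""
        key = frozenset(query_terms)
        self._entries[key] = result
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0


cache = LRUResultCache(capacity=3)
for term in ("a", "b", "c"):
    cache.put([term], ["result for " + term])
cache.get(["a"])            # touch "a" so that "b" becomes the LRU entry
cache.put(["d"], ["result for d"])  # evicts "b"
```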
Experimental Evaluation (cont'd)
III. Cost Analysis
• Network cost analysis: per-query network traffic, number of messages, and communication rounds in three scenarios:
  • No Caching (NC): standard query processing (5% of network)
  • EC Single-Step (EC-SS): Exact Caching without query result improvement
  • EC Multi-Step (EC-MS): Exact Caching with query result improvement up to 50% of network in 5% steps
• For details (different phases, assumptions, etc.), see the paper
Experimental Evaluation (cont'd)
Metric                      | NC (No Caching) | EC-SS (EC Single-Step) | EC-MS (EC Multi-Step)
average relative recall     | 0.32            | 0.32                   | 0.71 (+122%)
network traffic (per query) | 55.3 Kbytes     | 23.1 Kbytes (-58.2%)   | 41.0 Kbytes (-25.9%)
messages (per query)        | 106             | 25.7 (-75.8%)          | 61.4 (-42.1%)
response time (rounds)      | 2               | 1.19 (-40.3%)          | 1.60 (-20.0%)
Experimental Evaluation (cont'd)
IV. Approximate Caching Scenarios
• 4,000 randomly generated 3- and 4-term queries from benchmark query set
• Comparison of 5 scenarios against standard query routing (SQR):
• Effectiveness of AC in terms of relative recall, depending on the number of peers that contributed to the cached subquery result
Experimental Evaluation (cont'd)
V. Real-World Query-Log
• Using AOL query-log to have time-order of queries: overall 57,344 requests with 39,640 unique queries
• Combination of EC and AC
• Results: ~25% hit rate & recall improvement from 0.45 to 0.52
Experimental Evaluation (cont'd)
VI. Impact of Churn
• On benefits of EC-MS
• Different churn rates: fraction of peers that leave the network
Conclusions & Open Issues
Conclusions
• Introduced simple, yet effective, caching framework to take advantage of previous work of peers in P2P network
• Exact Caching (EC):
  • possibility to improve recall, or to reduce response time / network cost
  • experiments used Wikipedia benchmark and real-world query-log
  • investigated various cache replacement strategies and considered churn in P2P
• Approximate Caching (AC):
  • aggressive reuse of cached results of subqueries if full query results are not available
  • satisfying outcomes place demands on the existing cached results
Open Issues
• Proactive caching (anticipate interesting queries, e.g., from existing logs)
• Maintaining cache freshness (new or better results are available)
• Replication (metadata and/or documents)
Thank You For Your Attention!
Questions or Comments?