Gossip-based Search Selection in Hybrid Peer-to-Peer Networks M. Zaharia and S. Keshav, D. R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada matei@matei.ca, keshav@uwaterloo.ca IPTPS 2006, Feb 28th 2006
The Search Problem • Decentralized system of nodes, each of which stores copies of documents • Keyword-based search • Each document is identified by a set of keywords (e.g. song title) • Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”) • Example • Song: “Here Comes the Sun” • keywords: “Here”, “Comes”, “The”, “Sun” • Query: “Here” AND “Sun” • Responses: “Here Comes the Sun”, “The Sun is Here”
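As a minimal sketch of this matching rule (assuming lower-cased keywords; the class and method names are hypothetical, not from the paper):

```java
import java.util.*;

class AndQueryMatcher {
    // A document matches an "AND query" when its keyword set is a
    // superset of the query's keyword set.
    static List<String> match(Map<String, Set<String>> index, Set<String> query) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : index.entrySet()) {
            if (doc.getValue().containsAll(query)) {
                results.add(doc.getKey());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("Here Comes the Sun", Set.of("here", "comes", "the", "sun"));
        index.put("The Sun is Here", Set.of("the", "sun", "is", "here"));
        // Query "Here" AND "Sun" matches both titles.
        System.out.println(match(index, Set.of("here", "sun")));
    }
}
```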
Metrics • Success rate • fraction of queries that return a result, conditional on a result being available • Number of results found • no more than a desired maximum Rmax • Response time • for the first result and for the Rmax-th result • Bandwidth cost • includes the costs of index creation, query propagation, and result fetching
Key Workload Characteristics • Document popularities follow a Zipfian distribution • Some documents are more widely copied than others • These documents are also requested more often • Some nodes have much faster connections and much longer connection durations than others
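To make the workload concrete, here is a small sketch of Zipfian popularity sampling, where rank i is drawn with probability proportional to 1/i^s (the parameters are illustrative; the evaluation later uses s = 1.0):

```java
import java.util.Random;

class ZipfSampler {
    private final double[] cdf;          // cumulative probabilities over ranks 1..n
    private final Random rng = new Random();

    ZipfSampler(int n, double s) {
        cdf = new double[n];
        double norm = 0;
        for (int i = 1; i <= n; i++) norm += 1.0 / Math.pow(i, s);
        double cum = 0;
        for (int i = 1; i <= n; i++) {
            cum += (1.0 / Math.pow(i, s)) / norm;
            cdf[i - 1] = cum;
        }
    }

    int sample() {                        // returns a document rank in [1, n]
        double u = rng.nextDouble();
        for (int i = 0; i < cdf.length; i++) {
            if (u <= cdf[i]) return i + 1;
        }
        return cdf.length;
    }
}
```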
So… • Retrieve popular documents with the least work • Offload work to better-connected and longer-lived peers How can we do that?
Hybrid P2P network [Loo, IPTPS 2004] • Flood queries for popular documents • Use the DHT for rare documents • Only publish rare documents to the DHT index (architecture figure: bootstrap nodes, DHT, ultrapeers, and peers)
How to know document popularity? • PIERSearch uses • Observations of • result size history • keyword frequency • keyword pair frequency • Sampling of neighboring nodes • These are all local • Global knowledge is better
More on global knowledge • Want a histogram of document popularity • i.e. the number of ultrapeers that index each document • we only care about popular documents, so we can truncate the tail • On getting a query, sum the histogram values for all matching document titles and divide by the number of ultrapeers • If this exceeds a threshold, then flood; else use the DHT* * modulo rare documents with common keywords; see the paper
Example • Assume 100 ultrapeers and only two documents • Suppose the title 'Here Comes the Sun' has count 15 (15 ultrapeers index it) and 'You are my Sun' has count 2 • Query 'Sun' has sum (15 + 2)/100 = 0.17 • Query 'Are My' has sum 2/100 = 0.02 • If the threshold is 0.05, then the first query is flooded and for the second we use the DHT
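A minimal sketch of this flood-vs-DHT decision, reproducing the numbers above (the class and method names are assumptions, not the paper's code):

```java
import java.util.*;

class SearchSelector {
    // histogram maps a document title to the number of ultrapeers indexing it.
    static boolean shouldFlood(Map<String, Integer> histogram,
                               Set<String> matchingTitles,
                               int numUltrapeers, double threshold) {
        int sum = 0;
        for (String title : matchingTitles) {
            sum += histogram.getOrDefault(title, 0);
        }
        double estimatedPopularity = (double) sum / numUltrapeers;
        return estimatedPopularity > threshold;  // flood if popular, else use the DHT
    }

    public static void main(String[] args) {
        Map<String, Integer> h = Map.of("Here Comes the Sun", 15, "You are my Sun", 2);
        // Query "Sun" matches both titles: (15 + 2)/100 = 0.17 > 0.05, so flood.
        System.out.println(shouldFlood(h, h.keySet(), 100, 0.05));
        // Query "Are My" matches one title: 2/100 = 0.02 <= 0.05, so use the DHT.
        System.out.println(shouldFlood(h, Set.of("You are my Sun"), 100, 0.05));
    }
}
```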
How to compute the histogram? • Central server • centralizes load and introduces a single point of failure • Compute on an induced tree • brittle to failures • Gossip • pick a random node and exchange partial histograms • can result in double counting
Double counting problem (figure): ultrapeers A, B, and C index documents {a, b}, {a, c}, and {a, d} respectively. Summing partial histograms during gossip yields a count of 5 for document a (a:5 b:1 c:1 d:1), even though only three ultrapeers index it.
Avoiding double counting • When an ultrapeer indexes a document title it hasn't indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail; call this CT • Gossip CT values for all titles with other ultrapeers to compute maxCT • because max is an extremal value, there is no double counting • (Flajolet-Martin) The count of the number of ultrapeers with the document is roughly 2^maxCT • Example • 1000 nodes • Chances are good that one will see 10 consecutive heads • It gossips '10'
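A minimal sketch of the coin-flipping estimator described above (the names and the value of k are assumptions):

```java
import java.util.Random;

class CoinToss {
    static final Random rng = new Random();

    // Count heads before the first tail, flipping at most k times (the CT value).
    static int tossCT(int k) {
        int heads = 0;
        while (heads < k && rng.nextBoolean()) heads++;
        return heads;
    }

    // Flajolet-Martin style estimate: with maxCT the maximum CT gossiped
    // by any ultrapeer for a title, the number of ultrapeers indexing
    // that title is roughly 2^maxCT.
    static long estimateCount(int maxCT) {
        return 1L << maxCT;
    }

    public static void main(String[] args) {
        // Simulate 1000 ultrapeers each tossing for the same title;
        // one of them is likely to see about 10 consecutive heads.
        int maxCT = 0;
        for (int i = 0; i < 1000; i++) maxCT = Math.max(maxCT, tossCT(32));
        System.out.println("maxCT = " + maxCT + ", estimate ~ " + estimateCount(maxCT));
    }
}
```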
Approximate histograms • Use coin-flipping trick for each document • Note that there can be up to 50% error • Gossip partial histograms • Concatenate histograms • Truncate low-count documents
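A sketch of how two gossiping ultrapeers might combine partial histograms, assuming merge-by-max over CT values and an illustrative truncation cutoff:

```java
import java.util.*;

class HistogramGossip {
    // Merge two partial histograms of CT values: take the max per title
    // (max avoids double counting), then drop low-count documents so the
    // histogram stays small; only popular titles matter.
    static Map<String, Integer> merge(Map<String, Integer> a,
                                      Map<String, Integer> b, int minCT) {
        Map<String, Integer> merged = new HashMap<>(a);
        b.forEach((title, ct) -> merged.merge(title, ct, Math::max));
        merged.values().removeIf(ct -> ct < minCT);  // truncate the tail
        return merged;
    }
}
```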
What about the threshold? • If chosen too low, flood too often! • If chosen too high, flood too rarely! • Threshold is time dependent and load dependent • No easy way to choose it
Adaptive thresholding • Associate utility with the performance of a query • Threshold should maximize utility • For some queries, use both flooding and DHT and compare utilities • This will tell us how to move the threshold in the future
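One way this could look in code; the update rule, step size, and starting threshold below are assumptions, since the paper defines the actual utility function:

```java
class AdaptiveThreshold {
    private double threshold = 0.05;   // initial value, illustrative
    private static final double STEP = 0.005;

    // For a probe query answered by BOTH flooding and the DHT, compare
    // the measured utilities and nudge the threshold toward whichever
    // strategy performed better.
    void update(double floodUtility, double dhtUtility) {
        if (floodUtility > dhtUtility) {
            threshold -= STEP;   // flooding won: flood more often
        } else {
            threshold += STEP;   // DHT won: flood more rarely
        }
        threshold = Math.max(0.0, Math.min(1.0, threshold));  // clamp to [0, 1]
    }

    double get() { return threshold; }
}
```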
Evaluation • Built an event-driven simulator, in Java, for search over generic peer-to-peer network architectures. • Simulates each query, response, and document download. • Uses user lifetime and bandwidth distributions observed in real systems. • Generates random exact queries based on the fetch-at-most-once model (Zipfian with a flattened head) • can also use query traces from real systems.
Parameters • 3 peers join every 4 seconds • Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents • Peers emit queries on average once every 300 seconds, requesting at most 25 results • Zipf parameter of 1.0. • 1.7 million queries over a 22 hour period
Simulation stability • A stable population is achieved at 20,000 seconds • Variance of all results is under 5% and is omitted from the plots for clarity
Trace-based simulation • Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003 • ~230,000 distinct queries • ~200,000 distinct keywords • ~672,000 distinct documents
Conclusions • Gossip is an effective way to compute global state • Utility functions provide simple ‘knobs’ to control performance and balance competing objectives • Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants” • Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two
Questions?
Simulator Speedup • Fast I/O routines • Java creates temporary objects during string concatenation. A custom, large StringBuffer for string concatenation greatly improves performance. • Batch database uploads • prepared statements turn out to be much less efficient than importing a table from a tab-separated text file. • Avoid keyword search for exact queries • Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes
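A sketch of the buffer-reuse trick described above (the class and method names are illustrative, not from the simulator's source):

```java
// Reuse one large StringBuffer instead of concatenating with '+',
// which allocates temporary objects on every concatenation.
class FastLogger {
    private final StringBuffer buf = new StringBuffer(1 << 20);  // ~1 MB, reused

    void logQuery(long time, String peer, String query) {
        buf.setLength(0);                       // reset without reallocating
        buf.append(time).append('\t')
           .append(peer).append('\t')
           .append(query).append('\n');
        System.out.print(buf);                  // or write to a file channel
    }
}
```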