YouSearch – Searching a Network of Personal Webservers
Mayank Bawa, Roberto J. Bayardo Jr., Sridhar Rajagopalan, Eugene J. Shekita: Make it Fresh, Make it Quick – Searching a Network of Personal Webservers
Simon Pinkel
Talk Outline
• Introduction
• Internals
• Performance & Tuning
• Conclusions & Future Work
1.1 Personal Webservers for Content Sharing
• Why the web?
  • Simplicity of WWW protocols (e.g. HTTP)
  • Maturity of the Internet (e.g. DNS)
• Why personal?
  • Commoditization of personal computers
  • Ubiquity of the browser interface
Examples:
• Corporation: 1,500 people within IBM use a personal webserver
• University (Saarbrücken): every research group maintains its own (personal) webserver to publish its papers
1.2 Search Issues on Personal Webservers
• Transience of the content
• Search by navigation is ineffective:
  • Content is often poorly arranged
  • A large fraction of the data is diverse (non-HTML)
  • Typically many files are not reachable by links
• (Crawl-based) web search engines?
  • Crawl frequency vs. transience:
    • host is offline at crawl time
    • host is offline at query time
    • short life cycles of documents
  • Results are stale and incomplete
1.3 Problems with Existing (P2P) File Sharing Systems
• Kazaa/Gnutella: query flooding
  • YouSearch is intended to run on a corpus that does not support wide content replication
• Napster: centralized index
  • YouSearch is capable of indexing file names and content
• Both use their own proprietary protocols
  • YouSearch is ("at its core") web-compatible
2.1 Overview
YouSearch consists of:
• Peer nodes that run YouSearch-enabled webservers
• Browsers that search YouSearch-enabled content
• A lightweight, centralized component called the "Registrar", whose purpose is to store the network state
[Figure: network diagram showing Peers 1 to 6, a Browser, and the Registrar]
2.2 Indexing
The indexing process looks as follows:
• The Inspector examines the shared files
• If necessary, the Indexer updates the local disk-based index
• The Summarizer summarizes the content information: it obtains a list of terms T from the Indexer and creates a corresponding Bloom filter:
  • create a bit vector V of length L, with all bits set to 0
  • using a hash function H: Term → {1,...,L}, map each term t in T into V, i.e. set V(H(t)) = 1
  • V is sent to the Registrar
• The Registrar's Summary Manager aggregates V into a structure that maps bit positions to sets of peers
But what about hash conflicts?
• Introduce k independent hash functions H[i]: Term → {1,...,L} with their respective bit vectors V[1],...,V[k], each of length L with all bits initially set to 0
• Map each term t in T into all k bit vectors and send V[1],...,V[k] to the Registrar
• The Summary Manager aggregates V[1],...,V[k] into structures that map bit positions to sets of peers
• A term t can occur at a peer only if V[i](H[i](t)) = 1 for all i; a false positive is still possible, but it now requires a collision in every one of the k vectors
[Figure: a Peer (Inspector, Indexer, Summarizer) and the Registrar (Summary Manager)]
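The summarization and aggregation steps above can be sketched as follows. This is a minimal illustration, not the YouSearch implementation: the salted SHA-256 hash, the names `bit_position`, `summarize` and `SummaryManager`, and the default values of `K` and `L` are all assumptions made for the example. Bit positions are 0-based here, whereas the slide uses {1,...,L}, and each bit vector is stored as the set of its set positions.

```python
import hashlib

K = 3       # number of hash functions / bit vectors (assumed value)
L = 65536   # bits per vector (assumed value)


def bit_position(term: str, i: int, length: int = L) -> int:
    """Hypothetical H[i]: salt SHA-256 with i to get the i-th hash function."""
    digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % length


def summarize(terms, k: int = K, length: int = L):
    """Peer side: build k bit vectors, each kept as a set of set bit positions."""
    vectors = [set() for _ in range(k)]
    for term in terms:
        for i in range(k):
            vectors[i].add(bit_position(term, i, length))
    return vectors


class SummaryManager:
    """Registrar side: for each vector i, map bit position -> set of peers."""

    def __init__(self, k: int = K, length: int = L):
        self.k = k
        self.length = length
        self.index = [dict() for _ in range(k)]

    def register(self, peer: str, vectors) -> None:
        # Drop the peer's previous summary before storing the new one.
        for table in self.index:
            for peers in table.values():
                peers.discard(peer)
        for i, vector in enumerate(vectors):
            for pos in vector:
                self.index[i].setdefault(pos, set()).add(peer)

    def candidates(self, term: str) -> set:
        """Peers with all k bits set for the term (may contain false positives)."""
        result = None
        for i in range(self.k):
            pos = bit_position(term, i, self.length)
            peers = self.index[i].get(pos, set())
            result = set(peers) if result is None else result & peers
        return result or set()
```

A peer that actually holds a term is always returned by `candidates`; a peer that does not hold it is returned only if all k of its vectors happen to collide on that term.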
2.3 Querying
• Bob asks Alice for "pdf group:YouSearchTeam"
• Alice transforms the query into the canonical form "{(keywords,{pdf}),(group,{YouSearchTeam})}" and sends it to the Registrar
• The Query Manager computes the set R of relevant peers and sends it back to Alice
• Alice refines R if necessary (group, site), then contacts all peers in R directly
• These peers run the query on their local content and return the results to Alice
• While Alice is still gathering results, she already passes them on to Bob, so Bob does not perceive the latency
[Figure: Browser (Bob), Peer 1 (Alice) with Web Interface, Canonical Tx, Result Gatherer and Result Display, the Registrar with its Query Manager, and Peers 2 to 6; a legend marks the peers in group YouSearchTeam]
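The query path can be illustrated with a small sketch. The function names `canonicalize`, `relevant_peers` and `refine`, the `field:value` token syntax beyond the single example on the slide, and the AND semantics across keywords are assumptions made for illustration; the slides do not specify them.

```python
def canonicalize(query: str) -> dict:
    """Turn 'pdf group:YouSearchTeam' into a field -> set-of-values mapping."""
    canonical = {}
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and field and value:
            # Qualified token such as group:YouSearchTeam
            canonical.setdefault(field.lower(), set()).add(value)
        else:
            # Plain token: treated as a keyword
            canonical.setdefault("keywords", set()).add(token.lower())
    return canonical


def relevant_peers(canonical: dict, candidates_for) -> set:
    """Registrar side (assumed AND semantics): intersect per-keyword peer sets.

    candidates_for is a hypothetical callable: keyword -> set of peers.
    """
    sets = [candidates_for(keyword) for keyword in canonical.get("keywords", set())]
    return set.intersection(*sets) if sets else set()


def refine(peers: set, canonical: dict, group_members: dict) -> set:
    """Querying peer side: keep only peers that belong to a requested group."""
    groups = canonical.get("group")
    if not groups:
        return set(peers)
    allowed = set().union(*(group_members.get(g, set()) for g in groups))
    return peers & allowed
```

For the slide's example, `canonicalize("pdf group:YouSearchTeam")` yields the keyword set `{"pdf"}` and the group set `{"YouSearchTeam"}`, which corresponds to the canonical form shown above.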
2.4 Caching Results
Every time a global query is answered, the querying peer
• caches the URL set U
• informs the Registrar, which adds (query, IP address) to its cache table
• deletes cache entries after a short lifetime (TTL)
• informs the Registrar again
When a global query is issued,
• the Registrar looks up its cache mappings
• computes all caching peers
• picks one at random
• and sends its IP address to the querying peer,
• which gathers the cached result from this peer
[Figure: Browser, Peers 1 to 3 and the Registrar's Query Manager, with a cache table mapping queries (e.g. pdf) to results (<urls>)]
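A sketch of the Registrar's cache table described above. The class name `CacheTable` and its methods are hypothetical. In the slides the caching peer itself expires its entries and notifies the Registrar; the expiry timestamp kept on the Registrar side here is an added safeguard and an assumption of this sketch. The default TTL of 300 seconds matches the 5-minute TTL mentioned in the caching measurements.

```python
import random
import time


class CacheTable:
    """Maps a canonical query string to the peers that cache its result."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self.entries = {}           # query -> {ip_address: expiry time}

    def add(self, query: str, ip: str) -> None:
        """Called when a peer reports that it cached the result of a query."""
        self.entries.setdefault(query, {})[ip] = self.clock() + self.ttl

    def remove(self, query: str, ip: str) -> None:
        """Called when a peer reports that it dropped its cache entry."""
        self.entries.get(query, {}).pop(ip, None)

    def pick(self, query: str, rng=random):
        """Return the address of one live caching peer at random, or None."""
        now = self.clock()
        live = [ip for ip, expiry in self.entries.get(query, {}).items() if expiry > now]
        return rng.choice(live) if live else None
```

Picking a caching peer at random spreads repeated queries across all peers that hold the result instead of always loading the same one.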
2.5 Failure Management
The Registrar
• periodically attempts to contact all peers
• if peer Alice suspects that Carol is offline, Alice informs the Registrar, and Carol moves up in the Registrar's check queue
• if Carol does not respond to the Registrar, she is removed from the network
The Peers
• also send messages to the Registrar to confirm their status
• if the Registrar answers "offline", the peer starts a new session
[Figure: Alice, Carol and the Registrar]
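The Registrar's check queue might look like the sketch below. The class name `CheckQueue`, the choice to move a suspected peer all the way to the front, and the `is_alive` probe callback are assumptions; the slides only say that a suspected peer "moves up" in the queue.

```python
from collections import deque


class CheckQueue:
    """Round-robin liveness checks, with suspected peers checked first."""

    def __init__(self, peers=()):
        self.queue = deque(peers)

    def report_suspect(self, peer: str) -> None:
        """A peer reported that `peer` seems offline: check it next."""
        if peer in self.queue:
            self.queue.remove(peer)
            self.queue.appendleft(peer)

    def check_next(self, is_alive):
        """Probe the peer at the front; requeue it if alive, else drop it.

        is_alive is a hypothetical callable: peer -> bool.
        Returns (peer, alive), or None if the queue is empty.
        """
        if not self.queue:
            return None
        peer = self.queue.popleft()
        alive = bool(is_alive(peer))
        if alive:
            self.queue.append(peer)   # goes to the back for the next round
        return peer, alive
```

A peer dropped this way would have to start a new session, which matches the peer-side behaviour on the slide.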
3.1.1 Time to Answer Queries
[Figure: time to answer queries, in seconds]
3.1.3 Gains from Caching
Experiment:
• 25 queries were issued at one peer
• then the same 25 queries were repeated at different peers
Real-world example:
• Out of 1,500 logged queries, 3.54% (53) were served from peer caches (TTL = 5 min)
• 31.31% (~470) were asked more than once
3.1.4 Load on Registrar
• Consider: n peers each send k Bloom filters of size L (bits) every T seconds if their content has changed. Let f be the fraction of peers whose content has changed within a time interval of length T.
• Average inbound traffic at the Registrar: f*n*k*L bits every T seconds
• In the current implementation: k = 3, L = 65,536 bits, T = 300 s; let f = 20%
• Assuming a T1 line (1.54 Mb/s) with 20% network overhead: n = (80% * 1.54 Mb/s * T) / (f*k*L) = 9,856 peers
• T3 line (44.736 Mb/s): 286,310 peers
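The capacity estimate can be reproduced with a few lines of arithmetic. The function name `max_peers` is hypothetical. The sketch assumes k = 3 Bloom filters per peer and 1 Mb = 2^20 bits; this combination is an inference, chosen because it reproduces the peer counts quoted on the slide.

```python
def max_peers(link_mbps: float,
              k: int = 3,
              filter_bits: int = 65536,
              period_s: float = 300.0,
              changed_fraction: float = 0.2,
              usable_share: float = 0.8) -> float:
    """Largest n such that f*n*k*L bits fit into the usable link budget per period."""
    link_bps = link_mbps * 2**20                         # assumption: 1 Mb = 2^20 bits
    budget_bits = usable_share * link_bps * period_s     # bits available every T seconds
    bits_per_peer = changed_fraction * k * filter_bits   # expected bits per peer per period
    return budget_bits / bits_per_peer


print(round(max_peers(1.54)))      # T1 line
print(int(max_peers(44.736)))      # T3 line
```

With these assumptions the T1 line yields about 9,856 peers and the T3 line about 286,310 peers, which agrees with the numbers above.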
3.2 Tuning
Example:
• YouSearch is released with Bloom filter size L = 512 bits
• Now we discover that a significant fraction of peers have most of the 512 bits set
• Remedy: increase L. But then every peer needs to adjust its parameters (such as the number of Bloom filters, timeout durations, or the frequency of sending Bloom filters and checking for index updates), or install new software
YouSearch's proposal: the Tuning Manager
• installed at each peer
• reacts to changes pushed by a centralized component, the Administrator, and updates a parameter state file
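How a Tuning Manager might apply a pushed change to its parameter state file is sketched below. Everything specific here is an assumption: the slides do not describe the file format, so the sketch uses JSON, and the function name `apply_update` and the parameter names in the usage example are made up for illustration.

```python
import json
import os
import tempfile


def apply_update(path: str, changes: dict) -> dict:
    """Merge pushed parameter changes into the state file and return the result."""
    params = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            params = json.load(f)
    params.update(changes)

    # Write to a temporary file first, then swap it in, so that a crash
    # during the write cannot leave a half-written state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)
    os.replace(tmp_path, path)
    return params
```

For the example on the slide, the Administrator would push something like `{"bloom_filter_bits": 65536}` (a hypothetical parameter name), and each peer would pick up the new filter size from its state file without a software update.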
4.1 Conclusions
• P2P-hybrid architecture to
  • provide search on transient, rapidly evolving content
  • produce fast, fresh and complete results
• Lightweight central component
  • option of distribution across multiple hosts
  • small space/processing requirements
4.2 Future Work
Desired extensions:
• Partitioning the content into public and private (only for authenticated peers/users)
• Including text snippets in the displayed results
• Maintaining cached results for popular queries
• Global ranking (the current implementation only ranks locally)