P2P Content Search: Give the Web Back to the People
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer
Presented by Mariam John, CSE 6392, 06/27/2006
Why P2P Web Search
• Full-fledged web search is under the control of centralized search engines.
• There is growing concern about the world's dependency on a few quasi-monopolistic search engines and their susceptibility to commercial interests, spam, censorship, etc.
• A P2P search engine might be more robust than centralized search, as the demise of a single server or site is unlikely to paralyze the entire search system.
• All this leads to the postulation that "the Web should be given back to the people".
Challenges: Is P2P Web Search Likely to Work?
• A P2P search system has two main resource constraints: storage and bandwidth.
• The approach is to distribute a conceptually global keyword index across a DHT-style network.
• From a query processing and IR viewpoint, one of the key issues is query routing: given a query, to which other peers should it be forwarded to obtain the top-k ranked result set?
• This decision requires statistical information about the data contents in the network. It can be made fairly efficient by utilizing a DHT-based distributed directory (see the sketch below).
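To make the index distribution concrete, here is a minimal Python sketch of how a Chord-style DHT maps terms onto responsible peers. The ring size, the SHA-1 hash, and the peer names are illustrative assumptions, not details taken from the Minerva implementation.

```python
import hashlib

RING_SIZE = 2**32  # size of the Chord-style identifier ring (assumption)

def term_to_id(term: str) -> int:
    """Hash a term onto the identifier ring, as a Chord-style DHT would."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def responsible_peer(term: str, peer_ids: list[int]) -> int:
    """Return the id of the peer responsible for this term: the first peer
    clockwise from the term's position on the ring (Chord successor)."""
    key = term_to_id(term)
    candidates = sorted(peer_ids)
    for pid in candidates:
        if pid >= key:
            return pid
    return candidates[0]  # wrap around the ring

# Example: three hypothetical peers share responsibility for the term space.
peers = [term_to_id(name) for name in ("peerA", "peerB", "peerC")]
print(responsible_peer("database", peers))
```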
Challenges
• Efficiency of P2P query routing is only one side of the coin. What about the quality of the search results?
• The goal is to be as good as centralized search engines.
• The P2P approach faces the challenge that the index lists and statistical information that lead to good search results are scattered across the network.
System Architecture - Minerva
• Minerva is a fully operational distributed search engine consisting of autonomous peers, where:
• Each peer has a local document collection, indexed by inverted lists, one for each keyword or term.
• A conceptually global but physically distributed directory, layered on top of a Chord-style distributed hash table (DHT), manages aggregated information about the peers' local knowledge in compact form.
• The Chord DHT partitions the term space such that each peer is responsible for the statistics and metadata of a randomized subset of terms within the directory (see the data-structure sketch below).
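A rough sketch of the directory-side data structures: per-term Posts carrying simple statistics, collected into a PeerList per term. The slide does not spell out which statistics Minerva publishes, so the field names below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    """Per-term summary a peer publishes to the directory (fields are illustrative)."""
    peer_id: str
    doc_freq: int          # number of local documents containing the term
    collection_size: int   # size of the peer's local collection
    max_tf: int            # highest term frequency observed locally

@dataclass
class DirectoryEntry:
    """State a directory peer keeps for one term it is responsible for."""
    term: str
    peer_list: list[Post] = field(default_factory=list)

    def add_post(self, post: Post) -> None:
        # Replace any older Post from the same peer, then keep the PeerList
        # ordered by a simple quality estimate (here: document frequency).
        self.peer_list = [p for p in self.peer_list if p.peer_id != post.peer_id]
        self.peer_list.append(post)
        self.peer_list.sort(key=lambda p: p.doc_freq, reverse=True)
```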
Directory Maintenance
• In the publishing process, each peer distributes per-term summaries (Posts) of its local index to the global directory (sketched below).
• For each term, the DHT determines the responsible directory peer, and this peer maintains a PeerList of all Posts for the term.
• Directory information is proactively replicated to ensure a certain degree of redundancy.
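The publishing step might look roughly like this, reusing the Post and DirectoryEntry types from the previous sketch. The in-memory dict stands in for the DHT routing that delivers each Post to the responsible directory peer.

```python
def publish_local_index(peer_id: str,
                        local_index: dict[str, list[tuple[str, int]]],
                        directory: dict[str, DirectoryEntry]) -> None:
    """Distribute per-term summaries (Posts) of a peer's local inverted index
    to the global directory (simulated here as a flat in-memory dict)."""
    collection_size = len({doc for postings in local_index.values()
                           for doc, _ in postings})
    for term, postings in local_index.items():
        post = Post(peer_id=peer_id,
                    doc_freq=len(postings),
                    collection_size=collection_size,
                    max_tf=max(tf for _, tf in postings))
        # In the real system the Post is routed via the DHT to the peer
        # responsible for `term`; the dict lookup stands in for that step.
        entry = directory.setdefault(term, DirectoryEntry(term))
        entry.add_post(post)
```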
Query Execution
• A query with multiple terms is processed as follows:
• The query is first executed locally using the peer's own index.
• If the user considers this result unsatisfactory, the peer issues a PeerList request to the directory for each query term separately, to look up potentially promising remote peers.
• The query is then executed completely on each of the selected remote peers (see the routing sketch below).
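A naive routing sketch under the same assumptions as before: it merges the per-term PeerLists and credits each peer by its per-term document frequency. Minerva's actual peer-selection scoring is more elaborate; this only illustrates the directory lookup and merge.

```python
def route_query(query_terms: list[str],
                directory: dict[str, DirectoryEntry],
                max_peers: int = 3) -> list[str]:
    """Pick promising remote peers for a multi-term query by fetching the
    PeerList of each term from the directory and merging the results."""
    scores: dict[str, int] = {}
    for term in query_terms:
        entry = directory.get(term)
        if entry is None:
            continue
        for post in entry.peer_list:
            # Naive merge: credit a peer by its per-term document frequency.
            scores[post.peer_id] = scores.get(post.peer_id, 0) + post.doc_freq
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_peers]

# The query is then executed completely on each selected peer, and the
# query initiator combines the returned local top-k results.
```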
Query Routing
• Most query routing techniques work well on disjoint data collections.
• What happens when autonomous peers crawl the Web independently of each other?
• The result is overlap: the same information may be indexed by many peers.
Exploiting Correlations in Queries
• Directory information about term correlations can be exploited for query routing in several ways.
• First method:
• Treat correlated term combinations as keys for the DHT-based overlay network.
• The query initiator can locate the responsible directory peer by simply hashing the key and using standard DHT lookup routing.
..cont'd
• The directory entry directly provides the query initiator with a query-specific PeerList that reflects the best peers for the entire query (sketched below).
• What happens if this PeerList is too short?
• The query initiator always has the fallback option of decomposing the query into its individual terms and retrieving PeerLists for each term.
• What is the problem with the above method?
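A sketch of this first method, reusing the DirectoryEntry type from the earlier sketch. The key-normalization scheme and the "too short" threshold are assumptions chosen for illustration, not values from the paper.

```python
def combination_key(terms: list[str]) -> str:
    """Canonical DHT key for a correlated term combination, e.g. 'harry_potter'.
    (The normalization scheme is an assumption; any canonical form works.)"""
    return "_".join(sorted(t.lower() for t in terms))

def lookup_combination(terms: list[str],
                       directory: dict[str, DirectoryEntry]) -> list[Post]:
    """Try the query-specific PeerList first; fall back to per-term lookups
    if no combination entry exists or its PeerList is too short."""
    entry = directory.get(combination_key(terms))
    if entry is not None and len(entry.peer_list) >= 3:  # threshold is illustrative
        return entry.peer_list
    # Fallback: decompose into individual terms and merge their PeerLists.
    merged: list[Post] = []
    for t in terms:
        merged.extend(directory.get(t, DirectoryEntry(t)).peer_list)
    return merged
```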
..cont'd
• Second method: still collect PeerLists for highly correlated term combinations, but look up the directory for each query term separately.
• Whenever a directory peer also holds a good PeerList for the entire query, this information is returned to the query initiator together with the per-term PeerList (sketched below).
• This causes no additional communication cost and provides the query initiator with the best available information on all individual terms as well as on the entire query.
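The piggybacking idea of the second method might look like this on the directory side, again reusing the earlier types and the combination_key helper. In the real system the combination PeerList is only available at whichever directory peer happens to store it; the flat dict below merely simulates that peer's local state.

```python
def per_term_lookup_with_piggyback(term: str,
                                   full_query: list[str],
                                   directory: dict[str, DirectoryEntry]):
    """Handle a per-term PeerList request: return the requested per-term list,
    and piggyback any PeerList held for the full query's term combination,
    at no extra communication cost."""
    per_term = directory.get(term, DirectoryEntry(term)).peer_list
    combo_entry = directory.get(combination_key(full_query))
    combo_list = combo_entry.peer_list if combo_entry is not None else []
    return per_term, combo_list
```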
Conclusion
• Research efforts in the area of P2P content search are driven by the desire to "give the Web back to the people".
• This paper has explored the theme of leveraging the "power of users" in a P2P Web search engine.
• Observing user and community behavior is one potential key towards better search result quality.
References
• Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, "MINERVA: Collaborative P2P Search," In VLDB, 2005.
• Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, "P2P Content Search: Give the Web Back to the People."
• Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, "Improving Collection Selection with Overlap-Awareness," In SIGIR, 2005.