Result Merging in a Peer-to-Peer Web Search Engine MINERVA
Master thesis project
Speaker: Sergey Chernov
Supervisor: Prof. Dr. Gerhard Weikum
Saarland University, Max Planck Institute for Computer Science, Database and Information Systems Group
Overview
1. Result merging problem in the MINERVA system
2. Selected merging strategies: GIDF, ICF, CORI, LM
3. Our approach: result merging with the preference-based language model
4. Summary and future work
Problems of present Web search engines
• Size of the indexable Web:
• The Web is huge; it is difficult for a single engine to cover all of it
• Timely re-crawls are required
• The Deep Web remains largely unindexed
• Monopoly of Google:
• Controls about 80% of web search requests
• Sites may be censored by the engine
• Make use of Peer-to-Peer technology:
• Exploit previously unused CPU/memory/disk power
• Keep up-to-date results for small portions of the Web
• Cover the Deep Web with specialized web crawlers
MINERVA project
• MINERVA is a Peer-to-Peer Web search engine
• Each peer Pi runs a local search engine Si with an index on its crawled pages Ci
• Each peer publishes representative statistics into a distributed directory based on the Chord protocol (Chord ring)
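The slide only names the Chord-based directory; the following is a minimal sketch (not MINERVA's actual code) of how per-term statistics could be assigned to directory peers by consistent hashing on an identifier ring. All names and the ring size are illustrative assumptions.

```python
import hashlib

# Hypothetical sketch: map each term's statistics to a directory peer
# by consistent hashing on a Chord-like identifier ring.
RING_BITS = 32
RING_SIZE = 2 ** RING_BITS

def ring_id(key: str) -> int:
    """Hash a string (term or peer name) onto the identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE

def responsible_peer(term: str, peers: list[str]) -> str:
    """Return the peer whose ring id is the closest successor of the term's id."""
    term_pos = ring_id(term)
    ring = sorted(peers, key=ring_id)
    for peer in ring:
        if ring_id(peer) >= term_pos:
            return peer
    return ring[0]  # wrap around the ring

peers = ["P1", "P2", "P3", "P4", "P5", "P6"]
print(responsible_peer("merging", peers))  # directory peer holding the stats for "merging"
```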
Query processing in distributed IR
• Selection: for a query q and the set of peers P, choose the subset P' of peers most "relevant" for q
• Retrieval: each selected peer Pi returns a ranked result list Ri of documents from its local index
• Merging: the lists R1, R2, R3, ... are combined into a single merged result list RM
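As a rough illustration of this three-step pipeline (the function names are hypothetical, not MINERVA's API):

```python
# Hypothetical sketch of the selection -> retrieval -> merging pipeline.
def process_query(q, peers, select, retrieve, merge, k=10):
    selected = select(q, peers)                        # P': peers most "relevant" for q
    result_lists = [retrieve(q, p) for p in selected]  # Ri: ranked local results from each peer
    return merge(result_lists)[:k]                     # RM: single merged top-k list
```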
Naive merging approaches
• How can we combine the results from the peers?
• 1. Retrieve the k documents with the highest similarity scores
Problem: scores from different peers are incomparable
• 2. Take the same number of documents from each peer
Problem: databases differ in quality
• 3. Fetch the best documents from the peers, re-rank them, and select the top-k
A good solution in principle, but how do we compute the final scores?
• The first two strategies are sketched in code below
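For concreteness, a minimal sketch of the first two naive strategies (plain Python, illustrative only):

```python
import heapq
from itertools import zip_longest

# result_lists: one list per peer, each a list of (doc_id, local_score),
# already sorted by descending local score.

def merge_by_raw_score(result_lists, k):
    """Strategy 1: take the k globally highest local scores.
    Flawed because local scores from different peers are not comparable."""
    all_docs = [doc for lst in result_lists for doc in lst]
    return heapq.nlargest(k, all_docs, key=lambda d: d[1])

def merge_round_robin(result_lists, k):
    """Strategy 2: interleave the lists, same number of documents per peer.
    Flawed because it ignores differences in database quality."""
    merged = []
    for rank_slice in zip_longest(*result_lists):
        merged.extend(doc for doc in rank_slice if doc is not None)
    return merged[:k]
```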
Result merging problem
• Objective: make scores from different peers fully comparable
• Solution: replace all collection-dependent statistics with global ones
• Baseline: estimate document scores as if all documents were placed in a single database (DB1+2+3), so that the merged score of a document approximates its single-database score
• Difficulty: the peers' document sets overlap, which distorts the statistics used for score estimation
• Methods:
• GIDF – Global Inverted Document Frequency
• ICF – Inverted Collection Frequency
• CORI – merging as used in the CORI system
• LM – Language Modeling
• Final goal: find the method that produces the most accurate scores and is most robust to overlap
Selected result merging methods (1)
• GIDF: replace the local IDF with a Global Inverted Document Frequency computed from the statistics of all peers:
DFi – number of documents containing the term on peer i, |Di| – total number of documents on peer i
• ICF: replace IDF with an Inverted Collection Frequency value:
CF – number of peers whose collection contains the term, |C| – number of collections (peers) in the system
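The slide gives only the inputs of these formulas; a minimal sketch follows, assuming the common log-based definitions (the exact normalization used in the thesis may differ):

```python
import math

def gidf(df_per_peer, docs_per_peer):
    """GIDF: IDF computed from globally aggregated document frequencies.
    df_per_peer[i]   = DFi, documents on peer i containing the term
    docs_per_peer[i] = |Di|, total documents on peer i"""
    global_df = sum(df_per_peer)
    global_docs = sum(docs_per_peer)
    return math.log(global_docs / global_df) if global_df else 0.0

def icf(cf, num_collections):
    """ICF: inverted collection frequency.
    cf = CF, number of peers whose collection contains the term
    num_collections = |C|, number of collections (peers) in the system"""
    return math.log(num_collections / cf) if cf else 0.0
```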
Selected result merging methods (2)
• CORI: COllection Retrieval Inference network merging
DatabaseRank – obtained during the database selection step
LocalScore – score computed with local statistics
the constants are heuristics tuned for the INQUERY search engine
• LM: Language Modeling
λ – smoothing parameter, a heuristic trade-off between two models:
P(q | global language model built from all documents on all peers)
P(q | document language model)
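A minimal sketch of how these scores are typically combined, assuming the commonly cited CORI merging constants (0.4 / 1.4) and Jelinek-Mercer-style smoothing for the LM score; the thesis's exact variants may differ:

```python
import math

def cori_merged_score(local_score, database_rank):
    """CORI-style merging heuristic, assuming the commonly cited constants
    tuned for the INQUERY engine (the thesis's exact variant may differ)."""
    return (local_score + 0.4 * local_score * database_rank) / 1.4

def lm_score(query_terms, doc_tf, doc_len, global_tf, global_len, lam=0.5):
    """Smoothed language-model score: mixes the document model with the
    global model built from all documents on all peers."""
    log_prob = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len if doc_len else 0.0
        p_global = global_tf.get(t, 0) / global_len
        mixed = lam * p_doc + (1 - lam) * p_global
        log_prob += math.log(mixed) if mixed > 0 else float("-inf")
    return log_prob
```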
Experimental setup
• TREC-2002, 2003 and 2004 Web Track datasets
• 4 topics
• 50 peers, 10-15 per topic
• documents are replicated twice
• 25 title queries from the topic distillation tasks of the 2002 and 2003 Web Tracks
• 3 database selection algorithms:
• RANDOM – peers ranked in random order
• CORI – the de-facto standard
• IDEAL – manually created ranking
Experiments – all database rankings; LM is the best merging method
Experiments – IDEAL database ranking, the best merging method (LM), limited statistics
Preference-based language model (1)
• 1. Execute the query on the best peer
• 2. Assume the first top-k results are relevant (pseudo-relevance feedback)
• 3. Estimate a preference-based LM on these top-k documents
• 4. Compute the cross-entropy between the LM of each document and the preference-based LM
• 5. Combine this ranking with the LM merging method
• Steps 2-4 are sketched in code below
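A minimal sketch of steps 2-4, assuming a simple maximum-likelihood preference model and standard cross-entropy; the estimation details in the thesis may differ:

```python
import math
from collections import Counter

def preference_model(top_k_docs):
    """Steps 2-3: build a term distribution from the pseudo-relevant
    top-k documents returned by the best peer (each doc is a list of terms)."""
    counts = Counter()
    for doc_terms in top_k_docs:
        counts.update(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cross_entropy(pref_model, doc_terms, eps=1e-9):
    """Step 4: cross-entropy between the preference model and a document's
    maximum-likelihood language model; lower is better."""
    doc_counts = Counter(doc_terms)
    doc_len = sum(doc_counts.values())
    return -sum(p * math.log(doc_counts.get(t, 0) / doc_len + eps)
                for t, p in pref_model.items())
```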
Preference-based language model (2)
• a globally normalized similarity score and a preference-based similarity score are combined into the final result merging score, where:
• Q – query
• tk – term in Q
• G – entire document set over all peers
• Dij – document
• U – set of pseudo-relevant documents
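The formulas themselves are not reproduced on the slide; the following is only a hedged sketch of one plausible form, using the symbols above. The mixing weight β, the smoothing, and the shape of the preference score are assumptions, not the thesis's definitive formulation.

```latex
% Hedged sketch, not the thesis's exact formulas.
% Globally normalized LM score of document D_{ij} for query Q:
\mathit{Score}_{\mathrm{global}}(Q, D_{ij}) =
  \sum_{t_k \in Q} \log\bigl(\lambda\, P(t_k \mid D_{ij}) + (1-\lambda)\, P(t_k \mid G)\bigr)

% Preference-based score from the pseudo-relevant set U:
\mathit{Score}_{\mathrm{pref}}(Q, D_{ij}) =
  \sum_{t_k \in Q} P(t_k \mid U)\, \log P(t_k \mid D_{ij})

% Final merging score as a weighted combination (\beta assumed):
\mathit{Score}(Q, D_{ij}) =
  \beta\, \mathit{Score}_{\mathrm{global}}(Q, D_{ij})
  + (1-\beta)\, \mathit{Score}_{\mathrm{pref}}(Q, D_{ij})
```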
Experiments – IDEAL database ranking, preference-based language model merging
Conclusions
• All merging algorithms are very close in absolute retrieval effectiveness
• Language-modeling methods are more effective than TF*IDF-based methods
• Limited statistics are a reasonable choice in a peer-to-peer setting
• Pseudo-relevance feedback from topically organized collections slightly improves retrieval quality