Result Merging in a Peer-to-Peer Web Search Engine MINERVA. Master thesis project. Speaker: Sergey Chernov. Supervisor: Prof. Dr. Gerhard Weikum. Saarland University, Max Planck Institute for Computer Science, Database and Information Systems Group.
Overview 1. Result merging problem in the MINERVA system 2. Selected merging strategies: GIDF, ICF, CORI, LM 3. Our approach: result merging with the preference-based language model 4. Summary and future work
Problems of present Web Search Engines • Size of the indexable Web: • The Web is huge; it is difficult to cover all of it • Timely re-crawls are required • Deep Web • Monopoly of Google: • Controls 80% of web search requests • Sites may be censored by the engine • Make use of Peer-to-Peer technology: • Exploit previously unused CPU/memory/disk power • Keep up-to-date results for small portions of the Web • Conquer the Deep Web with specialized web crawlers
MINERVA project • MINERVA is a Peer-to-Peer Web search engine • Distributed directory based on the Chord protocol [Figure: Chord ring connecting peers P1 to P6. Legend: Pi – peer with local search engine, Ci – index on crawled pages, Si – representative statistics]
Query processing in distributed IR • Selection: <P,q> → <P’,q> • Retrieval: <P’,q> → <<R1, R2, R3>,q> • Merging: <<R1, R2, R3>,q> → Rm • q – query, P – set of peers, P’ – subset of peers most “relevant” for q, Ri – ranked result list of documents from Pi, Rm – merged result list of documents [Figure: peers P1 to P6; the selected peers P1’, P2’, P3’ return result lists R1, R2, R3, which are merged into Rm]
Naive merging approaches • How can we combine results from peers? • 1. Retrieve the k documents with the highest similarity scores. Problem: scores are incomparable across peers • 2. Take the same number of documents from each peer. Problem: databases differ in quality • 3. Fetch the best documents from the peers, re-rank them and select the top-k. Problem: a good solution, but how do we compute the final scores?
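The first two naive approaches can be sketched in a few lines to show why they fall short. This is only an illustration: the function names, the `(doc_id, score)` tuple layout and the toy scores are assumptions, not part of MINERVA.

```python
from itertools import zip_longest


def merge_by_raw_score(result_lists, k):
    """Approach 1: pool all hits and keep the k highest local scores."""
    pooled = [hit for hits in result_lists.values() for hit in hits]
    return sorted(pooled, key=lambda hit: hit[1], reverse=True)[:k]


def merge_round_robin(result_lists, k):
    """Approach 2: take hits from each peer in turn, rank by rank."""
    merged = []
    for tier in zip_longest(*result_lists.values()):
        merged.extend(hit for hit in tier if hit is not None)
    return merged[:k]


# Hypothetical result lists: each peer scores on its own scale.
results = {
    "P1": [("a", 12.0), ("b", 9.5)],  # unbounded local scores
    "P2": [("c", 0.9), ("d", 0.7)],   # local scores in [0, 1]
}

top_raw = merge_by_raw_score(results, 2)
top_rr = merge_round_robin(results, 2)
```

With these toy inputs the raw-score merge returns only documents from P1, because its scores live on a larger scale, while the round-robin merge gives every peer the same share regardless of how good its database is.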
Result merging problem Objective: make scores completely comparable. Solution: replace all collection-dependent statistics with global ones. Baseline: estimate document scores as if all documents were placed in a single database. Difficulty: overlap between the peers' document sets influences the statistics used for score estimation. Methods: • GIDF – Global Inverse Document Frequency • ICF – Inverse Collection Frequency • CORI – merging used in the CORI system • LM – Language Modeling Final goal: find the method which produces the most accurate scores and is robust to overlap. [Figure: peers P1, P2, P3 with databases DB1, DB2, DB3 and an overlapping document set, compared with peer P holding a single database DB1+2+3; LeftScore ≈ RightScore]
Selected result merging methods (1) • GIDF: compute the Global Inverse Document Frequency: DFi – number of documents containing a particular term on peer i, |Di| – overall number of documents on peer i. • ICF: replace IDF with the Inverse Collection Frequency value: CF – number of peers containing a particular term, |C| – number of collections (peers) in the system
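The formulas themselves did not survive in this transcript, so the following sketch uses the common definitions built from the quantities named on the slide, which is an assumption: GIDF as log(Σ|Di| / ΣDFi) summed over all peers, and ICF as log(|C| / CF). The function names and the zero-frequency guard are hypothetical.

```python
import math


def gidf(df_per_peer, docs_per_peer):
    """Assumed form: log of total documents over total document frequency."""
    total_df = sum(df_per_peer)      # sum of DF_i over all peers
    total_docs = sum(docs_per_peer)  # sum of |D_i| over all peers
    if total_df == 0:
        return 0.0  # term occurs nowhere; illustrative guard, not from the source
    return math.log(total_docs / total_df)


def icf(cf, num_collections):
    """Assumed form: log of number of collections over collection frequency."""
    if cf == 0:
        return 0.0  # illustrative guard, not from the source
    return math.log(num_collections / cf)


# Toy numbers: a term found in 10, 0 and 40 documents on three peers.
gidf_value = gidf([10, 0, 40], [100, 200, 200])
icf_value = icf(cf=5, num_collections=50)
```

Note that both sums count a replicated document once per peer that holds it, which is one way overlap can distort the estimate relative to a single database.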
Selected result merging methods (2) • CORI: COllection Retrieval Inference network: DatabaseRank – obtained during the database selection step, LocalScore – score computed with local statistics; the constants are heuristics tuned for the INQUERY search engine • LM: Language Modeling: λ – smoothing parameter, a heuristic tradeoff between two models, P(q | global language model from all documents on all peers) and P(q | document language model)
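Since the slide's formulas are missing here, the sketch below falls back on the forms usually given in the literature, and both are assumptions about what the slide showed: the CORI merging heuristic (D' + 0.4·D'·C') / 1.4 over min-max normalized scores, and a query likelihood with linear (Jelinek-Mercer) smoothing. Putting λ on the document model rather than the global model is also an assumption, as are all function and parameter names.

```python
import math


def min_max(value, lowest, highest):
    """Min-max normalization to [0, 1]; returns 0.0 for a degenerate range."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


def cori_merge_score(local_score, local_min, local_max, db_rank, db_min, db_max):
    """Commonly cited CORI merge: (D' + 0.4 * D' * C') / 1.4."""
    d = min_max(local_score, local_min, local_max)  # normalized LocalScore
    c = min_max(db_rank, db_min, db_max)            # normalized DatabaseRank
    return (d + 0.4 * d * c) / 1.4


def lm_score(query_terms, doc_tf, doc_len, global_tf, global_len, lam=0.5):
    """log P(q | D) with the document model mixed with the global model."""
    log_p = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len
        p_global = global_tf.get(term, 0) / global_len
        mixed = lam * p_doc + (1 - lam) * p_global
        if mixed == 0.0:
            return float("-inf")  # term unseen everywhere
        log_p += math.log(mixed)
    return log_p


best = cori_merge_score(1.0, 0.0, 1.0, 1.0, 0.0, 1.0)
example_lm = lm_score(["a"], {"a": 1}, 2, {"a": 1}, 4, lam=0.5)
```

Because the global model is the same for every peer, two documents with identical term counts get identical LM scores no matter which peer returned them, which is the comparability the merging step is after.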
Experimental setup • TREC 2002, 2003 and 2004 Web Track datasets • 4 topics • 50 peers, 10-15 per topic • documents are replicated twice • 25 title queries from the topic distillation task of the 2002 and 2003 Web Tracks • 3 database selection algorithms • RANDOM – peers chosen at random • CORI – de facto standard • IDEAL – manually created
Experiments – all database rankings, the best LM merging method
Experiments – IDEAL database ranking, the best LM merging method, limited statistics
Preference-based language model (1) • 1. Execute the query on the best peer • 2. Assume the top-k results are relevant (pseudo-relevance feedback) • 3. Estimate the preference-based LM on these top-k documents • 4. Compute the cross-entropy between the LM of each document and the preference-based LM • 5. Combine this ranking with the LM merging method
Preference-based language model (2) • globally normalized similarity score • preference-based similarity score • both are combined into the final result merging score • where: • Q – query • tk – term in Q • G – entire document set over all peers • Dij – document • U – set of pseudo-relevant documents
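The steps of the preference-based approach can be sketched as follows. The exact formulas are not preserved in this transcript, so every detail here is an assumption: maximum-likelihood unigram models, linear smoothing of the document model with the global model over G, and a hypothetical weight `beta` for a linear combination of the two scores.

```python
import math
from collections import Counter


def estimate_lm(documents):
    """Maximum-likelihood unigram model over a list of token lists."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def cross_entropy(pref_model, doc_model, global_model, lam=0.5):
    """-sum_t P(t|U) * log P(t|D), with D smoothed by the global model.

    Assumes every term of the preference model occurs in the global model,
    which holds when U is a subset of G.
    """
    total = 0.0
    for term, p_pref in pref_model.items():
        p_doc = lam * doc_model.get(term, 0.0) + (1 - lam) * global_model.get(term, 0.0)
        total -= p_pref * math.log(p_doc)
    return total


def combined_score(lm_log_score, entropy, beta=0.5):
    """Hypothetical combination: higher LM score and lower entropy rank higher."""
    return beta * lm_log_score - (1 - beta) * entropy


# Toy data: U is the top-k list from the best peer, G is all documents.
top_k = [["p2p", "search"], ["p2p", "merge"]]
all_docs = top_k + [["cooking", "recipes"]]
pref_lm = estimate_lm(top_k)
global_lm = estimate_lm(all_docs)

on_topic = cross_entropy(pref_lm, estimate_lm([["p2p", "search", "merge"]]), global_lm)
off_topic = cross_entropy(pref_lm, estimate_lm([["cooking", "recipes"]]), global_lm)
```

A document whose vocabulary resembles the pseudo-relevant set gets a lower cross-entropy than an unrelated one, so subtracting the entropy nudges on-topic documents upward in the merged list.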
Experiments – IDEAL database ranking, preference-based language model merging
Conclusions • All merging algorithms are very close in absolute retrieval effectiveness • Language modeling methods are more effective than TF*IDF-based methods • Limited statistics are a reasonable choice in a peer-to-peer setting • Pseudo-relevance feedback information from the topically organized collections slightly improves retrieval quality