Query Routing in Peer-to-Peer Web Search Engine
International Max Planck Research School for Computer Science
Speaker: Pavel Serdyukov
Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender
Talk Outline • Motivation • Proposed Search Engine architecture • Query routing and database selection • Similarity-based measures • Example: GlOSS • Document-frequency-based measures • Example: CORI • Evaluation of methods • Proposals • Conclusion
Problems of present Web Search Engines • Size of the indexable Web: • The Web is huge; it is difficult to cover it all • Timely re-crawls are required • Technical limits • The Deep Web • Monopoly of Google: • Controls 80% of web search requests • Paid sites are updated more frequently and get a higher ranking • Sites may be censored by the engine
Make use of Peer-to-Peer technology • Exploit previously unused CPU/memory/disk power • Provide up-to-date results for small portions of the Web • Conquer the Deep Web with personalized and specialized web crawlers • [Figure: each peer runs its own crawler; a global directory, distributed over a Chord ring, stores for each keyword (e.g. "computer", "elephant", "cancer") a ranking of peer usefulness (richness)] • The global directory must be shared among the peers!
Query routing • Goal: find peers with relevant documents • Known before as the Database Selection Problem • Not all techniques are applicable to P2P
Database Selection Problem • 1st inference: Is this document relevant? • It is a subjective user judgment; we model it • We use only representations of user needs and documents (keywords, inverted indices) • 2nd inference: A database has the potential to satisfy the query if it • has many documents (size-based naive approach) • has many documents containing all query words • has a high number of them above a given similarity • has a high summed similarity over them
Measuring usefulness • The number of documents with all query words is unknown • No full document representations are available, • only database summaries (representatives) • The 3rd inference (usefulness) is built on top of the previous two • Steps of database selection (sketched below) • Rely on sensible 1st and 2nd inferences • Choose database representatives for the 3rd inference • Calculate usefulness measures • Choose the most useful databases
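A compact way to read these steps as a routing procedure; this is a sketch under my own naming (select_databases, usefulness, db_summaries are illustrative, not from the talk):

```python
def select_databases(query, db_summaries, usefulness, top_k):
    """Rank databases by a usefulness measure computed from their summaries
    (representatives) and route the query to the top_k highest-ranked ones."""
    scored = [(usefulness(query, summary), db)
              for db, summary in db_summaries.items()]
    scored.sort(reverse=True)               # most useful databases first
    return [db for _, db in scored[:top_k]]
```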
Similarity-based measures • Definition: usefulness is the sum of the document similarities that exceed a threshold l (see the formula below) • Simplest case: the summed weight of the query terms across the collection • no assumptions about word co-occurrence • l = 0
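Written out, with sim(q, d) denoting the similarity of document d in collection C to query q and w(t, ·) a term weight (the notation is mine, not from the slides):

```latex
\mathrm{Usefulness}_l(q, C) \;=\; \sum_{d \in C,\; \mathrm{sim}(q,d) > l} \mathrm{sim}(q, d)
\qquad\text{and, for } l = 0:\qquad
\mathrm{Usefulness}_0(q, C) \;=\; \sum_{t \in q} w(t, q) \sum_{d \in C} w(t, d)
```

The second form is the "simplest case" above: when sim is an inner product of term weights and l = 0, the sum over documents collapses into a per-term sum over the collection, so no word co-occurrence statistics are needed.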
GlOSS • High-correlation assumption: • Sort all n query terms Ti in descending order of their DFs • DFn documents → contain Tn, Tn-1, …, T1 • DFn-1 – DFn documents → contain Tn-1, Tn-2, …, T1, …, • DF1 – DF2 documents → contain only T1 • Use averaged term weights to calculate document similarity (see the sketch below) • l > 0 • l is query dependent • l is collection dependent • usually because local IDFs differ • Proposal: use global term importance • Usually l is set to 0 in experiments
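A sketch of how the high-correlation estimate could be computed from a database summary, assuming per-term document frequencies df[t], average term weights avg_w[t], and query weights q_w[t] (all names are illustrative, not from the talk):

```python
def gloss_sum(query_terms, df, avg_w, q_w, l=0.0):
    """Estimate GlOSS usefulness Sum(l) under the high-correlation assumption.

    df[t]    : document frequency of term t in the collection
    avg_w[t] : average weight of t over the documents containing it
    q_w[t]   : weight of t in the query
    l        : similarity threshold
    """
    # Sort query terms by document frequency, most frequent first:
    # T1 has the largest DF, Tn the smallest.
    terms = sorted(query_terms, key=lambda t: df[t], reverse=True)
    n = len(terms)

    usefulness = 0.0
    # Bucket j (counted from the rare end): df[Tj] - df[Tj+1] documents
    # are assumed to contain exactly the terms T1..Tj.
    for j in range(n, 0, -1):
        count = df[terms[j - 1]] - (df[terms[j]] if j < n else 0)
        if count <= 0:
            continue
        # Estimated similarity of a document in this bucket: it contains
        # T1..Tj, each with its average weight.
        est_sim = sum(q_w[t] * avg_w[t] for t in terms[:j])
        if est_sim > l:
            usefulness += count * est_sim
    return usefulness
```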
Problems of similarity-based measures • Is this inference good? • A few high-scored documents and a lot of low-scored documents are regarded as equal • Proposal: sum only the first K similarities • Highly scored documents can be a bad indicator of usefulness • Most relevant documents have moderate scores • Highly scored documents can be non-relevant
Document-frequency-based measures • Don't use term frequencies (actual similarities) • Exploit document frequencies only • Exploit a global measure of term importance • Average IDF • ICF (inverse collection frequency), by analogy with IDF (see the formula below) • Main assumption: many documents with rare terms • mean more to the user • most likely contain the other query terms
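The ICF formula on the original slide did not survive extraction; the usual analogue of IDF at collection granularity (shown here as an assumption, using CF and |C| as defined on the next slide) is:

```latex
\mathrm{ICF}(t) \;=\; \log \frac{|C|}{\mathrm{CF}(t)}
```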
CORI: using TF-IDF-style normalization (formula sketched below) • DF : document frequency of the query term in the collection • DFMAX : maximum document frequency among all terms in the collection • CF : number of collections containing the query term • |C| : number of collections in the system
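The CORI scoring formula itself was lost in extraction. For reference, the commonly cited form of CORI's per-term belief (Callan et al.) is reproduced below; the slide's DFMAX suggests it may have used a max-normalized variant of T (e.g. T = DF / DFMAX), so the exact constants here are an assumption, not necessarily what was originally shown:

```latex
T = \frac{\mathrm{DF}}{\mathrm{DF} + 50 + 150 \cdot \frac{cw_i}{\overline{cw}}},
\qquad
I = \frac{\log\frac{|C| + 0.5}{\mathrm{CF}}}{\log\left(|C| + 1.0\right)},
\qquad
\mathrm{belief}(t \mid C_i) = 0.4 + 0.6 \cdot T \cdot I
```

Here cw_i is the size of collection C_i in words and \overline{cw} the average collection size; these two quantities are not listed on the slide.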
CORI Issues • Pure document frequencies make CORI better • The fewer statistics, the simpler the measure • Smaller variance • Better at estimating the ranking than the actual database contents • No use of document richness • To be normalized or not to be? • Small databases are not necessarily better • A collection may specialize well in several topics
Using usefulness measures • Example: query terms "Information" (CF = 120) and "Retrieval" (CF = 40), |C| = 1000 • For "Information": Peer1: DF = 20, avg_tf = 12, DFmax = 60; Peer2: DF = 60, avg_tf = 6, DFmax = 400; Peer3: DF = 20, avg_tf = 15, DFmax = 60 • For "Retrieval": Peer1: DF = 5, avg_tf = 8, DFmax = 60; Peer2: DF = 10, avg_tf = 4, DFmax = 400; Peer3: DF = 5, avg_tf = 10, DFmax = 60 • [Figure: the resulting peer rankings per term under the different measures]
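A small sketch of how such summaries could be scored, here with the simple DF·ICF measure mentioned later in the talk (the dictionary layout and the natural-log ICF are illustrative assumptions, not part of the slides):

```python
import math

# Database summaries for the example above: per peer, per term DF.
summaries = {
    "Peer1": {"information": 20, "retrieval": 5},
    "Peer2": {"information": 60, "retrieval": 10},
    "Peer3": {"information": 20, "retrieval": 5},
}
cf = {"information": 120, "retrieval": 40}   # collections containing the term
num_collections = 1000                        # |C|

def icf(term):
    # Inverse collection frequency, by analogy with IDF.
    return math.log(num_collections / cf[term])

def df_icf_score(peer, query):
    # Sum DF * ICF over the query terms.
    return sum(summaries[peer].get(t, 0) * icf(t) for t in query)

query = ["information", "retrieval"]
for peer in sorted(summaries, key=lambda p: df_icf_score(p, query), reverse=True):
    print(peer, round(df_icf_score(peer, query), 2))
```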
Analysis of experiments • CORI is the best, but • only when choosing more than 50 out of 236 databases • only 10% better when choosing more than 90 databases • Test collections are strange • documents separated chronologically or even randomly • no topic specificity • no actual Web data used • no overlap among collections • The experiments are unrealistic, so it is unclear • which method is better • whether any method is satisfactory
Possible solutions • Most of the measures can be unified in one framework covering both GlOSS and CORI (see the sketch below) • We can experiment with it and try • various normalization schemes • different notions of term importance (ICF, local IDF) • using statistics of the top documents only • changing the power of the factors • DF·ICF^4 is not worse than CORI • changing the form of the expression
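One way to picture such a framework (a sketch under my own parameterization; the knobs correspond to the bullets above, not to a formula from the talk):

```python
import math

def unified_score(query, summary, importance, normalize, power=1.0):
    """Generic database-usefulness score:
       sum over query terms of  normalize(per-term stats) * importance(term) ** power.

    summary    : per-term statistics of one database (e.g. DF, DFmax)
    importance : global term importance, e.g. ICF or averaged local IDF
    normalize  : how the per-term DF is normalized
    power      : exponent on the importance factor (power=4 gives DF * ICF^4)
    """
    return sum(
        normalize(summary[t]) * importance(t) ** power
        for t in query if t in summary
    )

# Two illustrative instantiations, roughly in the spirit of the measures discussed:
raw_df = lambda stats: stats["df"]                  # summed DF, GlOSS-like with l = 0
max_norm_df = lambda stats: stats["df"] / stats["dfmax"]   # one possible CORI-style T
icf = lambda t, cf={"information": 120}, C=1000: math.log(C / cf[t])

stats = {"information": {"df": 60, "dfmax": 400}}
print(unified_score(["information"], stats, icf, raw_df, power=1.0))
print(unified_score(["information"], stats, icf, max_norm_df, power=4.0))
```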
Conclusion What has been done: • The measures have been analytically evaluated • A sensible subset of measures has been chosen • The measures have been implemented What could be done next: • Carry out new, sensible experiments • Choose an appropriate usefulness measure • Experiment with database representatives • Build our own measure • Try to exploit collection metadata • bookmarks, authoritative documents, collection descriptions