Query Routing in Peer-to-Peer Web Search Engine
International Max Planck Research School for Computer Science
Speaker: Pavel Serdyukov
Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender
Talk Outline • Motivation • Proposed Search Engine architecture • Query routing and database selection • Similarity-based measures • Example: GlOSS • Document-frequency-based measures • Example: CORI • Evaluation of methods • Proposals • Conclusion
Problems of present Web Search Engines • Size of the indexable Web: • The Web is huge; it is difficult to cover it all • Timely re-crawls are required • Technical limits • The Deep Web • Monopoly of Google: • Controls 80% of web search requests • Paid sites are updated more frequently and get a higher ranking • Sites may be censored by the engine
Make use of Peer-to-Peer technology • Exploit previously unused CPU/memory/disk power • Provide up-to-date results for small portions of the Web • Conquer the Deep Web with personalized and specialized web crawlers • [Figure: each peer runs its own crawler; a global directory, distributed over a Chord ring, stores for each keyword (e.g. "computer", "elephant", "cancer") a ranking of peer usefulness (richness)] • The global directory must be shared among the peers!
Query routing • Goal: find peers with relevant documents • Known before as the Database Selection Problem • Not all techniques are applicable to P2P
Database Selection Problem • 1st inference: Is this document relevant? • It is a subjective user judgment; we model it • We use only representations of user needs and documents (keywords, inverted indices) • 2nd inference: A database has the potential to satisfy the query if it • has many documents (size-based naive approach) • has many documents containing all query words • has a high number of them above a given similarity • has a high summed similarity over them
Measuring usefulness • The number of documents with all query words is unknown • No full document representations are available, • only database summaries (representatives) • The 3rd inference (usefulness) is built on top of the previous two • Steps of database selection (sketched below) • Rely on sensible 1st and 2nd inferences • Choose database representatives for the 3rd inference • Calculate usefulness measures • Choose the most useful databases
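A compact way to read these steps as a routing procedure; this is a sketch under my own naming (select_databases, usefulness, db_summaries are illustrative, not from the talk):

```python
def select_databases(query, db_summaries, usefulness, top_k):
    """Rank databases by a usefulness measure computed from their summaries
    (representatives) and route the query to the top_k highest-ranked ones."""
    scored = [(usefulness(query, summary), db)
              for db, summary in db_summaries.items()]
    scored.sort(reverse=True)               # most useful databases first
    return [db for _, db in scored[:top_k]]
```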
Similarity-based measures • Definition: usefulness is the sum of the document similarities that exceed a threshold l (see the formula below) • Simplest case: the summed weight of the query terms across the collection • no assumptions about word co-occurrence • l = 0
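Written out, with sim(q, d) denoting the similarity of document d in collection C to query q and w(t, ·) a term weight (the notation is mine, not from the slides):

```latex
\mathrm{Usefulness}_l(q, C) \;=\; \sum_{d \in C,\; \mathrm{sim}(q,d) > l} \mathrm{sim}(q, d)
\qquad\text{and, for } l = 0:\qquad
\mathrm{Usefulness}_0(q, C) \;=\; \sum_{t \in q} w(t, q) \sum_{d \in C} w(t, d)
```

The second form is the "simplest case" above: when sim is an inner product of term weights and l = 0, the sum over documents collapses into a per-term sum over the collection, so no word co-occurrence statistics are needed.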
GlOSS • High-correlation assumption: • Sort all n query terms Ti in descending order of their DFs • DFn documents → contain Tn, Tn-1, …, T1 • DFn-1 – DFn documents → contain Tn-1, Tn-2, …, T1, …, • DF1 – DF2 documents → contain only T1 • Use averaged term weights to calculate document similarity (see the sketch below) • l > 0 • l is query dependent • l is collection dependent • usually because local IDFs differ • Proposal: use global term importance • Usually l is set to 0 in experiments
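A sketch of how the high-correlation estimate could be computed from a database summary, assuming per-term document frequencies df[t], average term weights avg_w[t], and query weights q_w[t] (all names are illustrative, not from the talk):

```python
def gloss_sum(query_terms, df, avg_w, q_w, l=0.0):
    """Estimate GlOSS usefulness Sum(l) under the high-correlation assumption.

    df[t]    : document frequency of term t in the collection
    avg_w[t] : average weight of t over the documents containing it
    q_w[t]   : weight of t in the query
    l        : similarity threshold
    """
    # Sort query terms by document frequency, most frequent first:
    # T1 has the largest DF, Tn the smallest.
    terms = sorted(query_terms, key=lambda t: df[t], reverse=True)
    n = len(terms)

    usefulness = 0.0
    # Bucket j (counted from the rare end): df[Tj] - df[Tj+1] documents
    # are assumed to contain exactly the terms T1..Tj.
    for j in range(n, 0, -1):
        count = df[terms[j - 1]] - (df[terms[j]] if j < n else 0)
        if count <= 0:
            continue
        # Estimated similarity of a document in this bucket: it contains
        # T1..Tj, each with its average weight.
        est_sim = sum(q_w[t] * avg_w[t] for t in terms[:j])
        if est_sim > l:
            usefulness += count * est_sim
    return usefulness
```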
Problems of similarity-based measures • Is this inference good? • A few high-scored documents and a lot of low-scored documents are regarded as equal • Proposal: sum only the first K similarities • Highly scored documents can be a bad indicator of usefulness • Most relevant documents have moderate scores • Highly scored documents can be non-relevant
Document-frequency-based measures • Don't use term frequencies (actual similarities) • Exploit document frequencies only • Exploit a global measure of term importance • Average IDF • ICF (inverse collection frequency), by analogy with IDF (see the formula below) • Main assumption: many documents with rare terms • mean more to the user • most likely contain the other query terms
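The ICF formula on the original slide did not survive extraction; the usual analogue of IDF at collection granularity (shown here as an assumption, using CF and |C| as defined on the next slide) is:

```latex
\mathrm{ICF}(t) \;=\; \log \frac{|C|}{\mathrm{CF}(t)}
```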
CORI: using TF-IDF-style normalization (formula sketched below) • DF : document frequency of the query term in the collection • DFMAX : maximum document frequency among all terms in the collection • CF : number of collections containing the query term • |C| : number of collections in the system
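The CORI scoring formula itself was lost in extraction. For reference, the commonly cited form of CORI's per-term belief (Callan et al.) is reproduced below; the slide's DFMAX suggests it may have used a max-normalized variant of T (e.g. T = DF / DFMAX), so the exact constants here are an assumption, not necessarily what was originally shown:

```latex
T = \frac{\mathrm{DF}}{\mathrm{DF} + 50 + 150 \cdot \frac{cw_i}{\overline{cw}}},
\qquad
I = \frac{\log\frac{|C| + 0.5}{\mathrm{CF}}}{\log\left(|C| + 1.0\right)},
\qquad
\mathrm{belief}(t \mid C_i) = 0.4 + 0.6 \cdot T \cdot I
```

Here cw_i is the size of collection C_i in words and \overline{cw} the average collection size; these two quantities are not listed on the slide.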
CORI Issues • Pure document frequencies make CORI better • The fewer statistics, the simpler the measure • Smaller variance • Better at estimating the ranking than the actual database contents • No use of document richness • To be normalized or not to be? • Small databases are not necessarily better • A collection may specialize well in several topics
Using usefulness measures • Example: query terms "Information" (CF = 120) and "Retrieval" (CF = 40), |C| = 1000 • For "Information": Peer1: DF = 20, avg_tf = 12, DFmax = 60; Peer2: DF = 60, avg_tf = 6, DFmax = 400; Peer3: DF = 20, avg_tf = 15, DFmax = 60 • For "Retrieval": Peer1: DF = 5, avg_tf = 8, DFmax = 60; Peer2: DF = 10, avg_tf = 4, DFmax = 400; Peer3: DF = 5, avg_tf = 10, DFmax = 60 • [Figure: the resulting peer rankings per term under the different measures]
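A small sketch of how such summaries could be scored, here with the simple DF·ICF measure mentioned later in the talk (the dictionary layout and the natural-log ICF are illustrative assumptions, not part of the slides):

```python
import math

# Database summaries for the example above: per peer, per term DF.
summaries = {
    "Peer1": {"information": 20, "retrieval": 5},
    "Peer2": {"information": 60, "retrieval": 10},
    "Peer3": {"information": 20, "retrieval": 5},
}
cf = {"information": 120, "retrieval": 40}   # collections containing the term
num_collections = 1000                        # |C|

def icf(term):
    # Inverse collection frequency, by analogy with IDF.
    return math.log(num_collections / cf[term])

def df_icf_score(peer, query):
    # Sum DF * ICF over the query terms.
    return sum(summaries[peer].get(t, 0) * icf(t) for t in query)

query = ["information", "retrieval"]
for peer in sorted(summaries, key=lambda p: df_icf_score(p, query), reverse=True):
    print(peer, round(df_icf_score(peer, query), 2))
```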
Analysis of experiments • CORI is the best, but • only when choosing more than 50 out of 236 databases • only 10% better when choosing more than 90 databases • Test collections are strange • documents separated chronologically or even randomly • no topic specificity • no actual Web data used • no overlap among collections • The experiments are unrealistic, so it is unclear • which method is better • whether any method is satisfactory
Possible solutions • Most of the measures can be unified in one framework covering both GlOSS and CORI (see the sketch below) • We can experiment with it and try • various normalization schemes • different notions of term importance (ICF, local IDF) • using statistics of the top documents only • changing the power of the factors • DF·ICF^4 is not worse than CORI • changing the form of the expression
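One way to picture such a framework (a sketch under my own parameterization; the knobs correspond to the bullets above, not to a formula from the talk):

```python
import math

def unified_score(query, summary, importance, normalize, power=1.0):
    """Generic database-usefulness score:
       sum over query terms of  normalize(per-term stats) * importance(term) ** power.

    summary    : per-term statistics of one database (e.g. DF, DFmax)
    importance : global term importance, e.g. ICF or averaged local IDF
    normalize  : how the per-term DF is normalized
    power      : exponent on the importance factor (power=4 gives DF * ICF^4)
    """
    return sum(
        normalize(summary[t]) * importance(t) ** power
        for t in query if t in summary
    )

# Two illustrative instantiations, roughly in the spirit of the measures discussed:
raw_df = lambda stats: stats["df"]                  # summed DF, GlOSS-like with l = 0
max_norm_df = lambda stats: stats["df"] / stats["dfmax"]   # one possible CORI-style T
icf = lambda t, cf={"information": 120}, C=1000: math.log(C / cf[t])

stats = {"information": {"df": 60, "dfmax": 400}}
print(unified_score(["information"], stats, icf, raw_df, power=1.0))
print(unified_score(["information"], stats, icf, max_norm_df, power=4.0))
```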
Conclusion What has been done: • The measures have been analytically evaluated • A sensible subset of measures has been chosen • The measures have been implemented What could be done next: • Carry out new, sensible experiments • Choose an appropriate usefulness measure • Experiment with database representatives • Build our own measure • Try to exploit collection metadata • bookmarks, authoritative documents, collection descriptions