P2P Content Search: Give the Web Back to the People
Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum (Max Planck Institute for Informatics, Saarbrücken, Germany)
Peter Triantafillou (University of Patras, Greece)
IPTPS 2006, the 5th International Workshop on Peer-to-Peer Systems
Santa Barbara, California, USA, February 27-28, 2006

Outline of the Talk
• Feasibility of P2P Web Search
• Problem Statement
• Learning from Queries
• Exploiting Correlation
• Experiments
P2P and Web Search: Marriage in Heaven
• Li, Loo, Hellerstein, Kaashoek, Karger, and Morris questioned the "Feasibility of Peer-to-Peer Web Indexing and Search" (IPTPS 2003)
• But: the authors assume that the full term-document index is distributed, which is non-scalable!
• Better: a light-weight approach with a distributed term-peer directory
• A variety of projects follow this line: PlanetP (Rutgers), Pepper (CMU), Galanx (Wisconsin), Odissea (Brooklyn), Minerva (MPII), and others
P2P Web search has potential advantages:
• Highly distributed data
• Better processing power
Architectural Model
• Peers are connected by an overlay network (e.g. DHT, random graph) and IP
• Each peer has a full-fledged local search engine (with crawler / importer, indexer, query processor)
• Each peer has autonomously compiled (e.g. crawled) its own content according to the user's thematic interests, which yields peer-specific collections
• When a query is issued by a peer, it is first executed locally and then possibly routed to carefully selected other peers
• Peers can post summaries / synopses / metadata / QoS info to a (distributed) network-wide directory with efficient per-key lookup
Minerva System Architecture
• Built on top of a scalable, churn-resilient DHT
• Conceptually global but physically distributed meta-data directory
• Query routing is driven by statistics on peer quality
[Figure: peers P1 to P8, each with a local index. The directory entries (peer lists with peer ranking and statistics) are spread over the peers: term a: P1,P4,P8; term b: P3,P5,P8; term c: P2,P4,P6; term d: P1,P3; term e: P1,P2,P5; term f: P2,P4,P6. A query for terms a, b, c is sent to the responsible directory peers.]
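The per-key lookup behind such a directory can be sketched as follows. This is only an illustration of how a DHT-style ring might assign each term to one responsible peer; the slides do not specify the hashing scheme, so `ring_position`, `Ring`, and the use of SHA-1 are assumptions, not Minerva's actual implementation.

```python
import hashlib
from bisect import bisect_left

def ring_position(key: str) -> int:
    """Map a string to a position on a 32-bit ring (assumed scheme: SHA-1 prefix)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class Ring:
    """Hypothetical DHT ring: a term belongs to the first peer at or after its position."""

    def __init__(self, peer_ids):
        self.positions = sorted((ring_position(p), p) for p in peer_ids)

    def responsible_peer(self, term: str) -> str:
        pos = ring_position(term)
        idx = bisect_left(self.positions, (pos, ""))
        # Wrap around to the first peer when the term hashes past the last one.
        return self.positions[idx % len(self.positions)][1]

ring = Ring([f"P{i}" for i in range(1, 9)])
for term in ("native", "american", "music"):
    print(term, "->", ring.responsible_peer(term))
```

Every peer that knows the ring membership computes the same responsible peer for a term, which is what makes the directory conceptually global while its entries stay physically distributed.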
Problem Statement
Example query q: "native american music"
• Ask the global directory for the three single-term PeerLists
• Combine them into a single PeerList for the complete query
• Ask the top peers for their best documents
• Combine all documents into a single result list
What can happen?
• Great results: the top peers for q are selected!
• Bad results: the selected peers are good for individual terms but mediocre for the complete query.
[Figure: peer Pq holds Doc1 ("american music") and Doc2 ("native american") and posts the terms native, american, and music. Directory peers Pi, Pj, Pk hold the PeerLists native: P27, P4, P8, P112, P36, ...; american: P1, P4, P18, P108, P25, ...; music: P13, P4, P88, P36, ...]
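The four routing steps above can be sketched in a few lines. The peer names follow the slide, but the scores and the rule for combining PeerLists (summing per-term scores) are assumptions made for illustration; the slides do not state how the lists are merged.

```python
from collections import defaultdict

# Hypothetical directory content: term -> [(peer, per-term quality score)], best first.
DIRECTORY = {
    "native":   [("P27", 0.9), ("P4", 0.8), ("P8", 0.7)],
    "american": [("P1", 0.9), ("P4", 0.7), ("P18", 0.6)],
    "music":    [("P13", 0.9), ("P4", 0.6), ("P88", 0.5)],
}

def route_query(terms, directory, k=2):
    """Merge the single-term PeerLists and return the k best peers.

    Assumed merge rule: a peer's score for the query is the sum of its
    per-term scores. Ties are broken by peer name for determinism.
    """
    combined = defaultdict(float)
    for term in terms:
        for peer, score in directory.get(term, []):
            combined[peer] += score
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [peer for peer, _ in ranked[:k]]

top_peers = route_query(["native", "american", "music"], DIRECTORY)
print(top_peers)
```

Here P4 ranks first because it appears in all three lists. Whether P4 actually holds documents that contain the three terms together is not visible from per-term statistics, which is exactly the weakness the slide points at.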
Problem: Term Correlations
Queries with correlated or specifically "associated" termsets:
• "Michael Jordan", "Lake Superior", "Bell Labs", "hurricane Katrina", "Native American Music", "PhD admission", "black magic", "ice hockey Honolulu", "Natalya Kournikova"
Architectural compromise:
• The best peers for $q = \{t_1, \ldots, t_{|q|}\}$ may not be in $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ and possibly not even in $\bigcup_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$
• Also possible: $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ is empty!
• Name and phrase recognition helps but is insufficient
• Lack of correlation-awareness is standard in IR, but it is more severe in P2P because of the peer-granularity directory
Consider correlated termsets for query routing!
The solution:
• Special handling of correlated termsets as termset posts in the directory, but...
• ... efficiency & scalability are critical!
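A toy example makes the top-k problem concrete. All peers and document counts below are invented for illustration: P9 is assumed to be the only peer whose documents contain both terms together, yet it is weak for each term on its own.

```python
def top_k(peer_list, k):
    """Peers in the first k entries of a PeerList that is sorted best first."""
    return {peer for peer, _ in peer_list[:k]}

# Hypothetical PeerLists: (peer, number of local documents containing the term).
peer_lists = {
    "michael": [("P1", 900), ("P2", 800), ("P9", 50)],
    "jordan":  [("P3", 700), ("P4", 600), ("P9", 40)],
}
best_peer_for_query = "P9"  # assumed ground truth for the query "michael jordan"

k = 2
per_term = [top_k(entries, k) for entries in peer_lists.values()]
intersection = set.intersection(*per_term)
union = set.union(*per_term)

print("intersection of top-k lists:", sorted(intersection))
print("union of top-k lists:", sorted(union))
print("best peer reachable:", best_peer_for_query in union)
```

With k = 2 the intersection is empty and the union misses P9, so no way of combining the truncated single-term lists can route the query to the best peer.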
Critical Issues...
... and what remains to be done?
• How to decide that a termset is correlated?
• How to store termset posts in the directory?
• How to exploit termset posts for queries?
Possible Approaches
Extraction of all possible term pairs from the documents
• Brute-force precomputation of termset posts
• But: quadratic explosion, and what about triples, quadruples, ...?
Possible sources of correlated termsets
• Names and phrases from dictionaries or thesauri: incomplete!
• Frequent itemset mining on the data: computationally expensive!
Impossible to predict all correlated termsets of interest!
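The size of the brute-force candidate space is easy to quantify: a vocabulary of n terms has n(n-1)/2 pairs, and the count grows further for triples and quadruples. The vocabulary size below is an assumed figure chosen only to show the order of magnitude.

```python
from math import comb

def count_termsets(vocabulary_size: int, termset_size: int) -> int:
    """Number of distinct termsets of the given size over the vocabulary."""
    return comb(vocabulary_size, termset_size)

vocabulary_size = 100_000  # assumed, not taken from the talk
for size in (2, 3, 4):
    print(f"termsets of size {size}: {count_termsets(vocabulary_size, size):,}")
```

Even the pair count alone is close to five billion for this vocabulary, which is why precomputing posts for all termsets is not an option.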
Our Approach...
...driven by "Give the Web back to the people"
Exploit query logs to learn correlated termsets
Advantages of query logs:
• They reflect the real behavior of millions of users
• Only termsets of interest need to be learned as correlated
• As we will see: integration into the existing architecture comes for free
Queries are a gold mine!
Looking at query logs...
• ... to validate that logs are useful for recognizing correlated termsets
• Excite search engine log (1999) with about 2 million real Web queries
Learning Correlated Termsets from Queries
• PeerList request: the complete query is piggybacked
• Directory peers remember the query as termsets
Learning is included in query routing
[Figure: a peer issues the query "native american music". Each PeerList request to the directory peers for native (P3,P5,P8), american (P1,P4,P8), and music (P2,P4,P6) carries the complete query, and the directory peers record termsets such as "american native", "american music", and "music native".]
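One way a directory peer could remember piggybacked queries is sketched below. The slides do not say which termsets are recorded or when a termset counts as correlated, so the enumeration of all sub-termsets containing the peer's own term and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

class DirectoryPeer:
    """Hypothetical directory peer responsible for a single term."""

    def __init__(self, term, min_support=2):
        self.term = term
        self.min_support = min_support      # assumed threshold for "correlated"
        self.termset_counts = Counter()

    def handle_peerlist_request(self, query_terms):
        """Count every termset of the piggybacked query that contains our term."""
        terms = sorted(set(query_terms))
        for size in range(2, len(terms) + 1):
            for termset in combinations(terms, size):
                if self.term in termset:
                    self.termset_counts[termset] += 1

    def correlated_termsets(self):
        """Termsets seen at least min_support times."""
        return sorted(ts for ts, n in self.termset_counts.items()
                      if n >= self.min_support)

peer = DirectoryPeer("native")
peer.handle_peerlist_request(["native", "american", "music"])
peer.handle_peerlist_request(["native", "american", "history"])
peer.handle_peerlist_request(["american", "native"])
print(peer.correlated_termsets())
```

Enumerating all sub-termsets is exponential in the query length, which is tolerable only because Web queries are short; a real system would likely bound the termset size.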
Collecting and Storing Termset Posts
• Directory peers manage termset posts
• The posting procedure is extended with termset posting
No extra communication protocol needed!
[Figure: a peer posts "american" and "native" and, in addition, the termset "american native". The directory then holds the termset post "american native: P8" next to native: P3,P5,P8 and american: P1,P4,P8.]
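The extended posting procedure might look like the sketch below. It assumes that the directory peer answers a single-term post with the termsets it has learned, and that the posting peer then reports how many of its local documents contain each termset. The class, the function names, and the use of document counts as statistics are hypothetical; the follow-up post is shown as a second call only for readability, whereas the slide states that no extra protocol is needed.

```python
def doc_count(local_docs, termset):
    """Number of local documents that contain every term of the termset."""
    return sum(1 for doc_terms in local_docs.values() if set(termset) <= doc_terms)

class DirectoryPeer:
    """Hypothetical directory peer for one term, with termsets learned from queries."""

    def __init__(self, term, learned_termsets):
        self.term = term
        self.learned_termsets = learned_termsets
        self.posts = {}  # key (tuple of terms) -> {peer_id: document count}

    def receive_post(self, peer_id, key, count):
        self.posts.setdefault(key, {})[peer_id] = count
        # The reply to a single-term post carries the termsets of interest.
        return list(self.learned_termsets) if key == (self.term,) else []

def publish(peer_id, local_docs, directory_peer):
    """Post the single term, then any requested termsets with local matches."""
    term = directory_peer.term
    requested = directory_peer.receive_post(
        peer_id, (term,), doc_count(local_docs, (term,)))
    for termset in requested:
        count = doc_count(local_docs, termset)
        if count > 0:
            directory_peer.receive_post(peer_id, termset, count)

p8_docs = {"Doc1": {"american", "music"}, "Doc2": {"native", "american"}}
native_dir = DirectoryPeer("native", [("american", "native")])
publish("P8", p8_docs, native_dir)
print(native_dir.posts)
```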
Exploiting Termset Postings
• Integrated into standard query execution
• Fallback option always possible
No additional communication round!
[Figure: the PeerList requests for native, american, and music return the single-term PeerLists (native: P3,P5,P8,P2; american: P1,P4,P8; music: P2,P4,P6,P8) together with matching termset PeerLists (e.g. "native music: P8,P4"), which are combined into the PeerList for the complete query.]
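The lookup with fallback can be sketched as follows. The single-term lists follow the slide, while the termset entry for the complete query and the fallback rule (rank peers by the number of single-term lists they appear in) are assumptions made for illustration.

```python
from collections import Counter

def select_peers(query_terms, directory, k=2):
    """Prefer a termset PeerList for the whole query; otherwise merge single-term lists."""
    key = tuple(sorted(set(query_terms)))
    if key in directory:
        return directory[key][:k]
    # Fallback (assumed rule): peers that occur in more single-term lists rank higher.
    votes = Counter()
    for term in key:
        votes.update(directory.get((term,), []))
    ranked = sorted(votes, key=lambda peer: (-votes[peer], peer))
    return ranked[:k]

# Keys are sorted tuples of terms; values are PeerLists, best peer first.
directory = {
    ("native",): ["P3", "P5", "P8", "P2"],
    ("american",): ["P1", "P4", "P8"],
    ("music",): ["P2", "P4", "P6", "P8"],
    ("american", "music", "native"): ["P8", "P4"],  # hypothetical termset post
}

with_termset = select_peers(["native", "american", "music"], directory)
fallback = select_peers(["american", "music"], directory)
print(with_termset, fallback)
```

Because the fallback uses only lists that are requested anyway, a missing termset post never makes routing worse than the single-term baseline.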
No Termset for Complete Query
• Especially for large queries
• Covering problem!
Integrated into query routing!
[Figure: query "a b c d e". The directory peers know only partial termsets such as "a b c", "a b d", "b c e", "c e", and "d e", so the query has to be covered by several termsets.]
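The slide names the covering problem but not a solution, so the following is a swapped-in standard technique rather than the authors' method: a greedy set cover that repeatedly picks the known termset covering the most still-uncovered query terms, with single terms as a fallback so that a cover always exists.

```python
def greedy_cover(query_terms, known_termsets):
    """Greedy set cover of the query by known termsets (single terms as fallback)."""
    query = set(query_terms)
    # Only termsets fully contained in the query are usable.
    candidates = [frozenset(ts) for ts in known_termsets if set(ts) <= query]
    candidates += [frozenset([term]) for term in sorted(query)]
    uncovered = set(query)
    cover = []
    while uncovered:
        # max() returns the first best candidate, so the result is deterministic.
        best = max(candidates, key=lambda ts: len(ts & uncovered))
        cover.append(tuple(sorted(best)))
        uncovered -= best
    return cover

known = [("a", "b", "c"), ("a", "b", "d"), ("b", "c", "e"), ("c", "e"), ("d", "e")]
cover = greedy_cover(["a", "b", "c", "d", "e"], known)
print(cover)
```

Greedy selection is not guaranteed to find the smallest cover in general, but it is cheap and needs no information beyond the termsets already returned by the directory.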
What about Networking Costs?
Big concern: too many messages and high bandwidth consumption?
Our approach is still scalable because all messages are piggybacked, at no extra cost:
• Learning correlated termsets is integrated into the query routing process
• Asking for termset posts is integrated into the posting process
• Exploiting correlated termsets is part of query processing and includes the fallback option, too
It's all free!
Experimental Evaluation
• Experiments: 750 peers with .GOV partitions (~1.2 million Web documents)
• Running 50 expanded queries from the TREC-2003 Web Track (examples: "robots research artificial", "shipwrecks accident")
Major gain in benefit/cost
Conclusion and Future Work
• Reconcile scalability with good search-result quality
• No extra networking costs and...
• ... greatly improved benefit/cost for query routing and processing
• Consider and benefit from user and community behavior
• Optimization of termset covers for queries with many terms
• Real-life testbed with real users!
Thank you for your attention!