Explore a groundbreaking self-organizing P2P web search engine with Google-level functionality. Enhance search quality, methods, and collaboration among peers to break information monopolies.
Introduction: Why Peer-to-Peer Web Search?
Vision: a self-organizing P2P Web search engine with Google-or-better functionality
• Proof of concept for scalable and self-organizing data structures and algorithms (e.g., DHTs, randomized overlay networks, epidemic spreading)
• Testbed for CS models, algorithms, and technologies, and an experimental platform
• Better search result quality (precision, recall, etc.)
• Powerful search methods for each peer (concept-based search, query expansion, personalization, etc.)
• Leverage intellectual input at each peer (bookmarks, feedback, query logs, click streams, evolving Web, etc.)
• Collaboration among peers (query routing, incentives, fairness, anonymity, etc.)
• Breaking information monopolies
Introduction: What Google Can't Do
Killer queries (disregarding NLP QA, multilingual, multimedia):
• drama with three women making a prophecy to a British nobleman that he will become king
Outline
• Vision
• Demo
• Efficient Top-k Search
• Ontology-based Query Expansion
• Exploiting User Behavior
• Isolating Selfish Peers
Efficient Top-k Search
TA: efficient and principled top-k query processing with monotonic score aggregation
TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):
    scan index lists; consider d at position pos_i in L_i;
    E(d) := E(d) ∪ {i}; high_i := s(t_i, d);
    worstscore(d) := aggr{s(t_ν, d) | ν ∈ E(d)};
    bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
    if worstscore(d) > min-k then
        add d to top-k;
        min-k := min{worstscore(d') | d' ∈ top-k};
    else if bestscore(d) > min-k then
        cand := cand ∪ {d};
    threshold := max{bestscore(d') | d' ∈ cand};
    if threshold ≤ min-k then exit;
Data items: d1, …, dn, each with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3), k = 1
Index lists (sorted by descending score):
    t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
    t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
    t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
[Figure: the index lists are scanned in parallel at scan depth 1, 2, 3, … until the stopping test fires (STOP!).]
Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index
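The NRA loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes sum as the aggregation function, scans all lists round-robin, and recomputes the bounds from scratch at every depth for readability. The function name `nra_top_k` and the data layout are hypothetical. The example lists are abridged from the slide, with the ambiguous fourth entry of t2 left out.

```python
from collections import defaultdict

def nra_top_k(index_lists, k):
    """Top-k with sorted access only (NRA-style), using sum as aggregation.

    index_lists: one list per query term of (doc, score) pairs, sorted by
    descending score. Returns (top-k as [(doc, worstscore)], scan depth).
    """
    m = len(index_lists)
    known = defaultdict(dict)      # doc -> {list index i: s(t_i, doc)}, i.e. E(d)
    high = [0.0] * m               # high_i: last score seen in list i
    max_depth = max((len(lst) for lst in index_lists), default=0)
    top, worst, depth = [], {}, 0
    while depth < max_depth:
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                known[doc][i] = score
                high[i] = score
            else:
                high[i] = 0.0      # list exhausted: nothing more to gain
        depth += 1
        worst = {d: sum(s.values()) for d, s in known.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in known.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        top, cand = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        # Assumption beyond the slide: documents not seen yet could still
        # score up to sum(high), so they take part in the threshold too.
        threshold = max([best[d] for d in cand] + [sum(high)])
        if threshold <= min_k:
            break
    return [(d, worst[d]) for d in top], depth

lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],   # t1
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],                # t2
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],  # t3
]
result, depth = nra_top_k(lists, k=1)
```

On these lists the sketch stops after scan depth 3 with d10 as the top-1 result, without reading the tails of the lists.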
Probabilistic Pruning of Top-k Candidates
The TA family of algorithms is based on the invariant (with sum as aggr):
    worstscore(d) ≤ score(d) ≤ bestscore(d)
• Add d to the top-k result if worstscore(d) > min-k
• Drop d only if bestscore(d) < min-k; otherwise keep it in the priority queue (PQ)
This is often overly conservative (deep scans, high memory for the PQ).
• Approximate top-k with probabilistic guarantees: discard candidate d from the queue if p(d) ≤ ε, which gives E[rel. precision@k] = 1 − ε
• The score predictor can use LSTs and Chernoff bounds, Poisson approximations, or histogram convolution
[Figure: worstscore(d) and bestscore(d) plotted over scan depth against min-k; d is dropped from the priority queue once bestscore(d) no longer exceeds min-k.]
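One way to picture the probabilistic test is with the histogram-convolution predictor mentioned above. The sketch below is an assumption-laden illustration: it treats the still-unknown scores of a candidate as independent draws from small per-list histograms, convolves them, and reads p(d) as the probability that their sum closes the gap min-k − worstscore(d). The helper names, the histogram values, and the pruning bound `epsilon` are all hypothetical.

```python
def convolve(h1, h2):
    """Distribution of the sum of two independent discrete score distributions."""
    out = {}
    for s1, p1 in h1.items():
        for s2, p2 in h2.items():
            key = round(s1 + s2, 6)   # merge buckets that differ only by float noise
            out[key] = out.get(key, 0.0) + p1 * p2
    return out

def exceed_probability(histograms, delta):
    """P[sum of the unseen scores > delta] under the independence assumption."""
    total = {0.0: 1.0}
    for h in histograms:
        total = convolve(total, h)
    return sum(p for s, p in total.items() if s > delta)

def should_drop(worstscore, min_k, unseen_histograms, epsilon):
    """Discard the candidate if it is unlikely to still reach the top-k."""
    p = exceed_probability(unseen_histograms, min_k - worstscore)
    return p <= epsilon

# Hypothetical score histogram (score bucket -> probability) for two unseen lists.
hist = {0.1: 0.7, 0.3: 0.2, 0.5: 0.1}
p_far = exceed_probability([hist, hist], delta=0.9)    # large gap to close
p_near = exceed_probability([hist, hist], delta=0.5)   # small gap to close
drop_far = should_drop(worstscore=0.4, min_k=1.3,
                       unseen_histograms=[hist, hist], epsilon=0.05)
```

A candidate with a large gap is pruned early even though its bestscore might still exceed min-k, which is where the shorter scans and smaller queues come from.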
Experiments with TREC-12 Web-Track Benchmark
On the .GOV corpus from the TREC-12 Web track: 1.25 million docs (html, pdf, etc.)
• 50 keyword queries, e.g.:
  • "Lewis Clark expedition"
  • "juvenile delinquency"
  • "legalization Marihuana"
  • "air bag safety reducing injuries death facts"

                        TA-sorted    Prob-sorted (smart)
#sorted accesses        2,263,652    527,980
elapsed time [s]        148.7        15.9
max queue size          10849        400
relative precision      1            0.87
rank distance           0            39.5
score error             0            0.031

Speedup by a factor of 10 at high precision/recall (relative to TA-sorted); aggressive queue management even yields a factor of 100 at 30-50% precision/recall.
Query Expansion
Threshold-based query expansion: substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) is at least a given threshold
Problem: choice of threshold
"Old hat" in IR; highly disputed because of the danger of topic dilution
• Approach to careful expansion:
  • determine phrases from the query or from the best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
  • if uniquely mapped to one concept, then expand with synonyms and weighted hyponyms
  • alternatively use statistical learning methods for word sense disambiguation
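The threshold-based substitution can be sketched as below. This is only an illustration: the function `expand_query`, the dictionary layout of the thesaurus, and the two threshold values are assumptions, while the similarity values for "performance" are the ones shown in the thesaurus figure of this talk.

```python
def expand_query(terms, thesaurus, threshold):
    """Replace each term w by itself plus all concepts c with sim(w, c) >= threshold.

    thesaurus: {term: {related concept: similarity in [0, 1]}} (hypothetical layout).
    Returns {term: {alternative: weight}}; the original term keeps weight 1.0.
    """
    expanded = {}
    for w in terms:
        alternatives = {w: 1.0}
        for concept, sim in thesaurus.get(w, {}).items():
            if sim >= threshold:
                alternatives[concept] = sim
        expanded[w] = alternatives
    return expanded

thesaurus = {
    "performance": {"response time": 0.7, "throughput": 0.6,
                    "queueing": 0.3, "delay": 0.25},
}
strict = expand_query(["algorithm", "performance"], thesaurus, threshold=0.5)
loose = expand_query(["algorithm", "performance"], thesaurus, threshold=0.2)
```

Comparing `strict` and `loose` shows the threshold problem in miniature: a small change in the threshold decides whether weakly related terms such as "delay" enter the query.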
Query Expansion Example
From TREC 2004 Robust Track:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}}
• 135530 sorted accesses in 11.073 s
• Results:
  • Interpol Chief on Fight Against Narcotics
  • Economic Counterintelligence Tasks Viewed
  • Dresden Conference Views Growth of Organized Crime in Europe
  • Report on Drug, Weapons Seizures in Southwest Border Region
  • SWITZERLAND CALLED SOFT ON CRIME
  • ...
Excerpts from result documents:
• ... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ...
• Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, ...
• A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies. ...
Top-k Query Processing with Query Expansion
Consider the expandable query "algorithm and ~performance" with
    score(q, d) = Σ_{i ∈ q} max_{j ∈ onto(i)} { sim(i,j) · s_j(d) }
Dynamic query expansion with incremental on-demand merging of additional index lists:
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
[Figure: a B+ tree index on terms holds the index lists for "algorithm" and "performance"; a thesaurus / meta-index expands "performance" into response time: 0.7, throughput: 0.6, queueing: 0.3, delay: 0.25, ...; the index lists of these expansion terms are merged in on demand.]
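The idea of merging expansion lists on demand can be sketched with a heap. The sketch assumes that per-term scores lie in [0, 1], so sim(i,j) is an upper bound for everything list j can contribute; a list is therefore opened only once that bound reaches the best entry currently at hand. Because the score takes the maximum over expansions, each document is emitted only at its first (highest) weighted occurrence. The function name, the `opened` log, and the posting lists are hypothetical; only the similarity values come from the thesaurus example.

```python
import heapq
from itertools import islice

def incremental_merge(expansions, index_lists, opened):
    """Yield (doc, sim * score) in descending order for one expandable term.

    expansions: {expansion term: sim}; index_lists: {term: [(doc, score), ...]}
    sorted by descending score, scores assumed in [0, 1]. Names of the lists
    that actually had to be opened are appended to `opened`.
    """
    pending = sorted(expansions.items(), key=lambda kv: kv[1], reverse=True)
    heap, seen, nxt = [], set(), 0
    while heap or nxt < len(pending):
        # Open another list only if its bound sim could beat the current head.
        while nxt < len(pending) and (not heap or pending[nxt][1] >= -heap[0][0]):
            term, sim = pending[nxt]
            nxt += 1
            postings = index_lists.get(term, [])
            if postings:
                opened.append(term)
                heapq.heappush(heap, (-sim * postings[0][1], term, 0))
        if not heap:
            break
        neg_weight, term, pos = heapq.heappop(heap)
        postings = index_lists[term]
        if pos + 1 < len(postings):
            heapq.heappush(heap, (-expansions[term] * postings[pos + 1][1], term, pos + 1))
        doc = postings[pos][0]
        if doc not in seen:        # max over expansions: first occurrence wins
            seen.add(doc)
            yield doc, -neg_weight

expansions = {"performance": 1.0, "response time": 0.7, "throughput": 0.6,
              "queueing": 0.3, "delay": 0.25}
index_lists = {                    # illustrative postings, not taken from the slide
    "performance": [("d57", 0.6), ("d44", 0.4), ("d52", 0.4)],
    "response time": [("d12", 0.9), ("d14", 0.8), ("d28", 0.6)],
    "throughput": [("d92", 0.9), ("d67", 0.9)],
    "queueing": [("d17", 0.55)],
    "delay": [("d61", 0.5)],
}
opened = []
first_three = [doc for doc, _ in islice(incremental_merge(expansions, index_lists, opened), 3)]
```

In this toy run the first three results are produced without ever opening the lists for "queueing" and "delay", which illustrates why no expansion threshold has to be tuned in advance.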
Experiments with TREC-13 Robust-Track Benchmark
On the AQUAINT corpus (news articles): 528 000 docs, 2 GB raw data, 8 GB for all indexes
50 most difficult queries, e.g.:
• "transportation tunnel disasters", potentially expanded into "earthquake, flood, wind, seismology, accident, car, auto, train, ..."
• "Hubble telescope achievements", potentially expanded into "astronomical, electromagnetic radiation, cosmic source, nebulae, ..."

                    no exp.      static exp.      static exp.      incr. merge
                    (=0.1)       (=0.3, =0.0)     (=0.3, =0.1)     (=0.1)
#sorted acc.        1,333,756    10,586,175       3,622,686        5,671,493
#random acc.        0            555,176          49,783           34,895
elapsed time [s]    9.3          156.6            79.6             43.8
max #terms          4            59               59               59
relative prec.      0.934        1.0              0.541            0.786
precision@10        0.248        0.286            0.238            0.298
MAP                 0.091        0.111            0.086            0.110

with Okapi BM25 probabilistic scoring model
Speedup by a factor of 4 at high precision/recall; no topic drift, no need for threshold tuning; also handles the TREC-13 Terabyte benchmark.
Exploiting User Behavior: Exploiting Query Logs and Click Streams
From PageRank: uniformly random choice of links + random jumps
    Authority(page q) = stationary probability of visiting q
To QRank:
+ query-doc transitions
+ query-query transitions
+ doc-doc transitions on implicit links (with thesaurus)
with probabilities estimated from log statistics
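The step from PageRank to QRank can be illustrated with a plain power iteration over a graph that contains both document and query nodes. This is a toy sketch under stated assumptions: the transition weights, the uniform random jump with probability 0.15, and the tiny graph are all made up for illustration, and the slides do not specify how QRank weights the different transition types.

```python
def stationary_distribution(transitions, jump=0.15, iterations=100):
    """Stationary visiting probabilities of a random walk with random jumps.

    transitions: {node: {successor: weight}}; weights are normalized per node.
    Nodes without successors spread their probability mass uniformly.
    """
    nodes = sorted(set(transitions) | {t for out in transitions.values() for t in out})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: jump / n for v in nodes}
        for v in nodes:
            out = transitions.get(v, {})
            total = sum(out.values())
            if total == 0:
                for t in nodes:
                    nxt[t] += (1 - jump) * rank[v] / n
            else:
                for t, w in out.items():
                    nxt[t] += (1 - jump) * rank[v] * w / total
        rank = nxt
    return rank

# Link graph only: page C has no in-links, so its authority stays minimal.
links = {"A": {"B": 1.0}, "B": {"A": 1.0}, "C": {"A": 1.0, "B": 1.0}}
pagerank = stationary_distribution(links)

# Adding log-derived transitions: query q1 was refined to q2, both led to clicks on C.
with_logs = {**links, "q1": {"q2": 1.0, "C": 1.0}, "q2": {"C": 1.0}}
qrank = stationary_distribution(with_logs)
```

In the extended graph, page C gains authority relative to A purely from the query-doc and query-query transitions, which is the effect the QRank slide describes.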
Exploiting User Behavior: Preliminary Experiments
Setup:
• 70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
• ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks
• ca. 15 000 implicit links based on doc-doc similarity
Results (assessment by blind-test users):
• QRank top-10 result preferred over PageRank in 81% of all cases
• QRank has 50.3% precision@10, PageRank has 33.9%
Untrained example query "philosophy":

     PageRank                    QRank
1.   Philosophy                  Philosophy
2.   GNU free doc. license       GNU free doc. license
3.   Free software foundation    Early modern philosophy
4.   Richard Stallman            Mysticism
5.   Debian                      Aristotle
Self-Organization for Isolating Selfish Peers: Collaborative P2P Search
[Figure: query peer P0 with local index X0 (e.g., term g: 13, 11, 45, ...) and bookmarks B0, next to the peer lists (directory) that map terms and URLs to peer IDs:
    term a: 17, 11, 92, ...
    term c: 13, 92, 45, ...
    term f: 43, 65, 92, ...
    term g: 13, 11, 45, ...
    url x: 37, 44, 12, ...
    url y: 75, 43, 12, ...
    url z: 54, 128, 7, ...]
Susceptible to misbehavior! How do we identify and penalize or isolate selfish/malicious peers?
Self-Organization for Isolating Selfish Peers
• Rationale:
  • mimic evolution in biological / social networks
  • tag selfish vs. altruistic peers and bias interactions towards similar peers
• Algorithm:
  • periodically do
    • each peer compares its "utility" with a random peer
    • if the other peer has higher utility, then copy that peer's strategy and links (reproduction)
    • mutate with small probability: change behavior, change links
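The periodic compare, copy, and mutate loop above can be sketched as follows. This is a simplified illustration, not the simulated protocol: the peer representation, the function `evolve_step`, and the way mutation rewires links are assumptions, and the toy utility is a fixed function of the strategy instead of being measured from actual query traffic.

```python
import random

def evolve_step(peers, utility, mutation_rate, rng):
    """One cycle: every peer compares itself with a random peer.

    peers: list of dicts {"strategy": float, "links": set of peer indices}.
    utility: function peer -> number (higher is better).
    """
    n = len(peers)
    for peer in peers:
        other = peers[rng.randrange(n)]
        if utility(other) > utility(peer):
            # Reproduction: copy the more successful peer's strategy and links.
            peer["strategy"] = other["strategy"]
            peer["links"] = set(other["links"])
        if rng.random() < mutation_rate:
            # Mutation (assumed form): new random behavior, one new random link.
            peer["strategy"] = rng.random()
            peer["links"] = {rng.randrange(n)}

rng = random.Random(0)
peers = [{"strategy": rng.random(), "links": {rng.randrange(50)}} for _ in range(50)]
toy_utility = lambda p: 1.0 - p["strategy"]   # stand-in for a measured utility
initial_strategies = {p["strategy"] for p in peers}
before = sum(map(toy_utility, peers)) / len(peers)
for _ in range(20):
    evolve_step(peers, toy_utility, mutation_rate=0.0, rng=rng)
after = sum(map(toy_utility, peers)) / len(peers)
```

With mutation switched off, successful strategies can only spread, so the average utility cannot decrease; a small positive `mutation_rate` keeps the population exploring new behaviors and neighborhoods.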
Self-Organization for Isolating Selfish Peers: Simulation Results for P2P File Sharing
• peers generate queries and answer queries based on P ∈ [0,1]
• with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
• peer utility = # hits (queries answered)
• mutation: change P randomly
[Figure: typical run for 10^4 peers; average per node (0 to 60) of queries generated and hits, plotted over cycles 0 to 100. Annotations: selfishness reduces; average performance increases.]
The End Thank you!