1 / 27

Web Text Retrieval with a P2P Query-Driven Index

Alvis. Web Text Retrieval with a P2P Query-Driven Index. Gleb Skobeltsyn EPFL, Lausanne Switzerland July 26, 2007. Joint work with: Toan Luu Ivana Podnar Žarko Martin Rajman Karl Aberer. Goal. Our goal is to achieve scalable full-text retrieval with structured P2P networks.

tareq
Download Presentation

Web Text Retrieval with a P2P Query-Driven Index

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Alvis Web Text Retrieval with a P2P Query-Driven Index Gleb Skobeltsyn EPFL, Lausanne Switzerland July 26, 2007 • Joint work with: • Toan Luu • Ivana Podnar Žarko • Martin Rajman • Karl Aberer G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  2. Goal • Our goalis to achieve scalable full-text retrieval with structured P2P networks P2P G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  3. Distributed P2P IR architecture • Each peer provides a local document collection for search • Each peer is responsible for a fraction of the global index G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  4. P2P IR basics • P2P network (Distributed Hash Table) with N peers • Each peer maintains connections with logN neighbors • The posting list associated with a given indexing key is stored at the peer responsible for that key • This peer can be located in logN overlay hops k=hash(indexing_key) get(k) k=hash(indexing_key) put(k,posting_list) posting_list k -> p_list …-> … G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  5. h(“gleb”)-{d2,d3} h(“epfl”)-{d1,d2} h(t’)-{d4,d5} {d1,d2} {d2} (Naïve) Single-term indexing approach Single term based partitioning strategy leads to unscalable bandwidth con-sumption at retrieval (frequent intersections of large posting lists) Query: “epfl & gleb” G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  6. Single-term vs. multi-term P2P indexing G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  7. Multi-term indexing: framework • Each peer is responsible for a set of indexing keys • Each indexing key= {term1, term2, .., termk}, k>0 • Keys are assigned to peers by the underlying DHT using the standard hashing mechanism • Each key is associated with a truncated posting list (TPL) that stores at most DFmax top-ranked document references • Distributed index contains {key,TPL} pairs G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  8. Single-term vs. multi-term P2P indexing voc. sizecould growexponentially! How to select keys to keep a satisfactory retrieval quality? G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  9. Indexing with HDK (Podnar et al. ICDE’07) • Document-Driven key generation: • Each time a new document is indexed, some posting lists for an indexin key k can reach the max size of DFmax • It triggersthe generation of new keys (k + additional frequent keys) • Use a number of filters to reduce the number of keys (proximity, redundancy and size filters) G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  10. Indexing with HDK • Pro’s: • ICDE’07 paper proves that the approach is scalable • Elegant key generation mechanism • Low bandwidth during retrieval (PL’s of limited size) • Con’s: • Practically the number of keys is still LARGE: • 113keys/doc (68M for 0.6M docs) • High bandwidth consumption at indexing • Problem: • Too many keys are superfluous (almost never used) • Let’s index what is queried! G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  11. Contents • Introduction • Single-term vs. multi term indexing • HDK approach for indexing • Query-driven approach for indexing/retrieval • Indexing structure • Example • Scalability • Evaluation • Conclusion G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  12. QDI: Query-Driven Index • Query-Driven Indexing strategy solves the “Too-Many-Keys” problem: • Avoids maintenance of superfluous keys • Generates only such keys that are requested by users on-the-fly • Utilizes query-log to discover such keys (monitors query frequency) G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  13. QDI: Challenges • Challenges • Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key • Conventional intersection like threshold algorithm, but less often • Incomplete index causes degradation of query results quality • Show that the degradation is low G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  14. QDI: Retrieval ?abc nothing • Single term index is generated • Process abc • Probe Pabc • Probe PabPbc and Pac • Probe PaPb and Pc • Obtain top-DFmax results for a,b and c(ranked w.r.t a,b and c respectively) • Contact peers in the list, re-rank the obtained results w.r.t abc • Output top-10 • Inc. the QF for ab, bc and ac • Activate (index) ac ?abc peer ?abc b ab ac bc a c abc +1 +1 +1 popular nothing nothing nothing DFmax G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  15. QDI: Retrieval 2 • Assume the frequency of b is below DFmax • Note, how the redundancy filter simplifies the lattice in such a case (grayed nodes do not have to be activated) abc ab bc ac a b c abc ab bc DFmax G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  16. QDI: Retrieval 3 ?abc nothing • Single term index is generated and ac is indexed • Process abc • Probe Pabc • Probe PabPbc and Pac – obtain the result for ac • Probe Pb and obtain the result for b • Contact all peers in the list to re-rank the obtained results w.r.tabc • Output top-10 • Inc. the QF for ab, bcand ac ?abc peer ?abc ab abc c a ac bc b +1 +1 +1 nothing nothing G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  17. QDI: Summary • Each single-term found in the document collection has to be indexed. • We call all single-term keys a basic single term index. • The posting lists are truncated at DFmax. • A multi-term key k is activated (indexed) iff: • k is popular: QF(k) ≥QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFminis a parameter for our model (popularity filter). • k contains from 2 to smax terms: 2≤|k|≤ smax, where smax is a parameter of our model (size filter). • all immediate sub-keys of k (of size |k-1|) are indexed and their associated postings lists are truncated (redundancy filter). G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  18. Scalability (see Skobeltsyn et al, Infoscale’07) • The retrievaltraffic is still scalable (depends on DFmax and a query size) • The indexingtraffic now depends on the number of activated keys • The number of keys does not depend on the document collection size but only on the size of the query log • We can use the QFminandDFmaxparameter to adjust the tradeoff: indexing traffic <—> retrieval quality G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  19. Contents • Introduction • Single-term vs. multi term indexing • HDK approach for indexing • Query-driven approach for indexing/retrieval • Indexing structure • Example • Scalability • Evaluation • Conclusion G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  20. AOL logs • 17M Queries from March, April, May 2006 (92 days) • 650K anonymous user sessions • Extracted all unique queries from each user session: … 2006-05-31 23:50:30 wearthbow.com native.cheyenne origin. 2006-05-31 23:50:30 l6 screensaver 2006-05-31 23:50:30 horses for sale in tn ky 2006-05-31 23:50:30 bank of america.com 2006-05-31 23:50:30 ask 2006-05-31 23:50:29 del rosa lanes 2006-05-31 23:50:28 www.spirit airlines.com 2006-05-31 23:50:28 find holy women of the bible 2006-05-31 23:50:27 trains 2006-05-31 23:50:27 todaysmiricles 2006-05-31 23:50:27 constition 2006-05-31 23:50:26 german grocceries in las vegas nv 2006-05-31 23:50:25 porn 2006-05-31 23:50:25 northwest indiana 2006-05-31 23:50:24 united.eprize.net 2006-05-31 23:50:24 jessica laguna … <-0.7Gb G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  21. Overlap experiment • Use the query-log to build the index (days 1..91) • Choose randomly 2K test queries from the day 92 • Answer each test query with Google and compare to the union of top-DFmax Google results for each of its combinations that areindexed according to the logs. • Mimics our P2PIR system if Google’s ranking is used. • Example: Non-superfluous (indexed) combinations Original query X X overlap@5=3/5=60% G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  22. Overlap example • Cut-n-paste from the simulation log: >id=481,q=“what did babe ruth do in the 1920” “1920 babe ruth”, qf=0---->Ov@100=100% “1920 babe”, qf=0--------->Ov@100= 9% +++“1920 ruth”, qf=1--------->Ov@100=33% +++“babe ruth”, qf=495 ------->Ov@100= 69% ---“1920”, qf=716 ------------>Ov@100= 1% ---“babe”, qf=3196 ----------->Ov@100= 2% ---“ruth”, qf=1653 ----------->Ov@100= 7% Size: 192, Keys used: 2, Overlap@100: 94% G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  23. Overlap experiment: impact of QFmin, DFmax Top20 overlap: impact of QFminwith DFmax=600 Top20 overlap: impact of DFmax with QFmin=1 G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  24. Overlap experiment: impact of the log size Top20 overlap: impact of the log size (Qfmin=1, DFmax=600) G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  25. TREC Experiment • WT10G collection (~1.69 M docs) • 100 TREC queries (from TREC Web Track 9 & 10) • Query statistics generated form 17M AOL queries • Using Okapi-BM25 weighting schema to compute ranking score • QFmin = 1, 3, 5, ∞ • DFmax= 100, 500 • smax=3 TREC: Precision at Top Ranked Pages (table) Precision is similar to centralized indexing G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  26. Conclusions • We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks: • Keeps only popularand non-redundant multi-term keys in the index • Associates keys with truncatedposting lists • And we showed that: • With real query-logs our approach achieves good retrieval quality comparable to a centralized engine • The QFmin parameter alows to adjust the traffic/quality tradeoff G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

  27. Last slide Thank you for your attention! Questions? AlvisP2P web site: http://globalcomputing.epfl.ch/alvis/ G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

More Related