270 likes | 385 Views
Alvis. Web Text Retrieval with a P2P Query-Driven Index. Gleb Skobeltsyn EPFL, Lausanne Switzerland July 26, 2007. Joint work with: Toan Luu Ivana Podnar Žarko Martin Rajman Karl Aberer. Goal. Our goal is to achieve scalable full-text retrieval with structured P2P networks.
E N D
Alvis Web Text Retrieval with a P2P Query-Driven Index Gleb Skobeltsyn EPFL, Lausanne Switzerland July 26, 2007 • Joint work with: • Toan Luu • Ivana Podnar Žarko • Martin Rajman • Karl Aberer G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Goal • Our goalis to achieve scalable full-text retrieval with structured P2P networks P2P G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Distributed P2P IR architecture • Each peer provides a local document collection for search • Each peer is responsible for a fraction of the global index G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
P2P IR basics • P2P network (Distributed Hash Table) with N peers • Each peer maintains connections with logN neighbors • The posting list associated with a given indexing key is stored at the peer responsible for that key • This peer can be located in logN overlay hops k=hash(indexing_key) get(k) k=hash(indexing_key) put(k,posting_list) posting_list k -> p_list …-> … G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
h(“gleb”)-{d2,d3} h(“epfl”)-{d1,d2} h(t’)-{d4,d5} {d1,d2} {d2} (Naïve) Single-term indexing approach Single term based partitioning strategy leads to unscalable bandwidth con-sumption at retrieval (frequent intersections of large posting lists) Query: “epfl & gleb” G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Single-term vs. multi-term P2P indexing G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Multi-term indexing: framework • Each peer is responsible for a set of indexing keys • Each indexing key= {term1, term2, .., termk}, k>0 • Keys are assigned to peers by the underlying DHT using the standard hashing mechanism • Each key is associated with a truncated posting list (TPL) that stores at most DFmax top-ranked document references • Distributed index contains {key,TPL} pairs G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Single-term vs. multi-term P2P indexing voc. sizecould growexponentially! How to select keys to keep a satisfactory retrieval quality? G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Indexing with HDK (Podnar et al. ICDE’07) • Document-Driven key generation: • Each time a new document is indexed, some posting lists for an indexin key k can reach the max size of DFmax • It triggersthe generation of new keys (k + additional frequent keys) • Use a number of filters to reduce the number of keys (proximity, redundancy and size filters) G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Indexing with HDK • Pro’s: • ICDE’07 paper proves that the approach is scalable • Elegant key generation mechanism • Low bandwidth during retrieval (PL’s of limited size) • Con’s: • Practically the number of keys is still LARGE: • 113keys/doc (68M for 0.6M docs) • High bandwidth consumption at indexing • Problem: • Too many keys are superfluous (almost never used) • Let’s index what is queried! G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Contents • Introduction • Single-term vs. multi term indexing • HDK approach for indexing • Query-driven approach for indexing/retrieval • Indexing structure • Example • Scalability • Evaluation • Conclusion G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
QDI: Query-Driven Index • Query-Driven Indexing strategy solves the “Too-Many-Keys” problem: • Avoids maintenance of superfluous keys • Generates only such keys that are requested by users on-the-fly • Utilizes query-log to discover such keys (monitors query frequency) G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
QDI: Challenges • Challenges • Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key • Conventional intersection like threshold algorithm, but less often • Incomplete index causes degradation of query results quality • Show that the degradation is low G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
QDI: Retrieval ?abc nothing • Single term index is generated • Process abc • Probe Pabc • Probe PabPbc and Pac • Probe PaPb and Pc • Obtain top-DFmax results for a,b and c(ranked w.r.t a,b and c respectively) • Contact peers in the list, re-rank the obtained results w.r.t abc • Output top-10 • Inc. the QF for ab, bc and ac • Activate (index) ac ?abc peer ?abc b ab ac bc a c abc +1 +1 +1 popular nothing nothing nothing DFmax G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
QDI: Retrieval 2 • Assume the frequency of b is below DFmax • Note, how the redundancy filter simplifies the lattice in such a case (grayed nodes do not have to be activated) abc ab bc ac a b c abc ab bc DFmax G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
QDI: Retrieval 3 ?abc nothing • Single term index is generated and ac is indexed • Process abc • Probe Pabc • Probe PabPbc and Pac – obtain the result for ac • Probe Pb and obtain the result for b • Contact all peers in the list to re-rank the obtained results w.r.tabc • Output top-10 • Inc. the QF for ab, bcand ac ?abc peer ?abc ab abc c a ac bc b +1 +1 +1 nothing nothing G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
QDI: Summary • Each single-term found in the document collection has to be indexed. • We call all single-term keys a basic single term index. • The posting lists are truncated at DFmax. • A multi-term key k is activated (indexed) iff: • k is popular: QF(k) ≥QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFminis a parameter for our model (popularity filter). • k contains from 2 to smax terms: 2≤|k|≤ smax, where smax is a parameter of our model (size filter). • all immediate sub-keys of k (of size |k-1|) are indexed and their associated postings lists are truncated (redundancy filter). G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Scalability (see Skobeltsyn et al, Infoscale’07) • The retrievaltraffic is still scalable (depends on DFmax and a query size) • The indexingtraffic now depends on the number of activated keys • The number of keys does not depend on the document collection size but only on the size of the query log • We can use the QFminandDFmaxparameter to adjust the tradeoff: indexing traffic <—> retrieval quality G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Contents • Introduction • Single-term vs. multi term indexing • HDK approach for indexing • Query-driven approach for indexing/retrieval • Indexing structure • Example • Scalability • Evaluation • Conclusion G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
AOL logs • 17M Queries from March, April, May 2006 (92 days) • 650K anonymous user sessions • Extracted all unique queries from each user session: … 2006-05-31 23:50:30 wearthbow.com native.cheyenne origin. 2006-05-31 23:50:30 l6 screensaver 2006-05-31 23:50:30 horses for sale in tn ky 2006-05-31 23:50:30 bank of america.com 2006-05-31 23:50:30 ask 2006-05-31 23:50:29 del rosa lanes 2006-05-31 23:50:28 www.spirit airlines.com 2006-05-31 23:50:28 find holy women of the bible 2006-05-31 23:50:27 trains 2006-05-31 23:50:27 todaysmiricles 2006-05-31 23:50:27 constition 2006-05-31 23:50:26 german grocceries in las vegas nv 2006-05-31 23:50:25 porn 2006-05-31 23:50:25 northwest indiana 2006-05-31 23:50:24 united.eprize.net 2006-05-31 23:50:24 jessica laguna … <-0.7Gb G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Overlap experiment • Use the query-log to build the index (days 1..91) • Choose randomly 2K test queries from the day 92 • Answer each test query with Google and compare to the union of top-DFmax Google results for each of its combinations that areindexed according to the logs. • Mimics our P2PIR system if Google’s ranking is used. • Example: Non-superfluous (indexed) combinations Original query X X overlap@5=3/5=60% G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Overlap example • Cut-n-paste from the simulation log: >id=481,q=“what did babe ruth do in the 1920” “1920 babe ruth”, qf=0---->Ov@100=100% “1920 babe”, qf=0--------->Ov@100= 9% +++“1920 ruth”, qf=1--------->Ov@100=33% +++“babe ruth”, qf=495 ------->Ov@100= 69% ---“1920”, qf=716 ------------>Ov@100= 1% ---“babe”, qf=3196 ----------->Ov@100= 2% ---“ruth”, qf=1653 ----------->Ov@100= 7% Size: 192, Keys used: 2, Overlap@100: 94% G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Overlap experiment: impact of QFmin, DFmax Top20 overlap: impact of QFminwith DFmax=600 Top20 overlap: impact of DFmax with QFmin=1 G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Overlap experiment: impact of the log size Top20 overlap: impact of the log size (Qfmin=1, DFmax=600) G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
TREC Experiment • WT10G collection (~1.69 M docs) • 100 TREC queries (from TREC Web Track 9 & 10) • Query statistics generated form 17M AOL queries • Using Okapi-BM25 weighting schema to compute ranking score • QFmin = 1, 3, 5, ∞ • DFmax= 100, 500 • smax=3 TREC: Precision at Top Ranked Pages (table) Precision is similar to centralized indexing G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Conclusions • We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks: • Keeps only popularand non-redundant multi-term keys in the index • Associates keys with truncatedposting lists • And we showed that: • With real query-logs our approach achieves good retrieval quality comparable to a centralized engine • The QFmin parameter alows to adjust the traffic/quality tradeoff G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index
Last slide Thank you for your attention! Questions? AlvisP2P web site: http://globalcomputing.epfl.ch/alvis/ G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index