Query-Driven Indexing for Scalable P2P Text Retrieval

Infoscale’07, June 6-8, 2007, Suzhou, China
Alvis
Query-Driven Indexing for Scalable P2P Text Retrieval
Gleb Skobeltsyn, EPFL, Switzerland
June 6, 2007
Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
G.Skobeltsyn | Query-Driven Indexing for Scalable P2P Text Retrieval

Goal
• Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)
• Each peer:
  • Provides resources (bandwidth, storage)
  • Searches the whole network
  • Publishes its own documents

Naïve (single-term) approach
• The naïve approach is to distribute the global inverted index in a DHT.
• Example index entries: h(“gleb”) → {d2, d3}, h(“epfl”) → {d1, d2}, h(t’) → {d4, d5}
• Query “epfl & gleb”: intersecting the posting lists {d1, d2} and {d2, d3} yields {d2}.
• This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
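The single-term scheme on this slide can be sketched in a few lines. This is a toy, in-memory illustration and not the system described in the talk: the peer count `NUM_PEERS`, the helper names `peer_for`, `publish` and `query_and`, and the use of SHA-1 modulo the peer count as a stand-in for a DHT lookup are all assumptions made for the example.

```python
import hashlib

NUM_PEERS = 4  # hypothetical network size


def peer_for(term: str) -> int:
    """Map a term to the peer responsible for it (stand-in for a DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PEERS


def publish(peers, doc_id, terms):
    """Add doc_id to the posting list of every term, at the responsible peer."""
    for term in set(terms):
        peers[peer_for(term)].setdefault(term, set()).add(doc_id)


def query_and(peers, terms):
    """Fetch the full posting list of each term and intersect them."""
    lists = [peers[peer_for(t)].get(t, set()) for t in terms]
    if not lists:
        return set()
    result = set(lists[0])
    for posting_list in lists[1:]:
        result &= posting_list
    return result


# Each peer holds a fragment of the global inverted index: term -> doc ids.
peers = [dict() for _ in range(NUM_PEERS)]
publish(peers, "d1", ["epfl"])
publish(peers, "d2", ["epfl", "gleb"])
publish(peers, "d3", ["gleb"])

answer = query_and(peers, ["epfl", "gleb"])
```

Note that `query_and` moves whole posting lists between peers before intersecting them, which is the cost the truncated posting lists of the later slides are meant to avoid.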

Indexing with Highly Discriminative Keys
[1] I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys, in ICDE’07, Istanbul, Turkey

Indexing with HDKs: main properties
• The distributed index contains {key, PL} pairs:
  • Each key corresponds to a term or a set of terms
  • Each key is assigned a posting list (PL)
  • Each posting list stores at most DFmax top-ranked document references
• Data-driven key generation:
  • When a new document is indexed, the posting list of a key k can reach the maximum size DFmax
  • This triggers the generation of new keys (k + other frequent keys)
• Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (as specified by a window size w)
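Two of these properties are easy to illustrate in isolation: truncating a posting list to its DFmax top-ranked entries, and the proximity filter for a two-term key. The sketch below is hypothetical. The function names, the `(doc_id, score)` representation and the exact window semantics (two occurrences fewer than `w` token positions apart) are assumptions; the slides only say that t1 must be "close to" t2 within a window of size w.

```python
import heapq


def truncate_posting_list(postings, df_max):
    """Keep the df_max highest-scored (doc_id, score) pairs, best first."""
    return heapq.nlargest(df_max, postings, key=lambda p: p[1])


def within_window(tokens, t1, t2, w):
    """Assumed proximity test: some occurrence of t1 and some occurrence
    of t2 are fewer than w token positions apart."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) < w for i in pos1 for j in pos2)


postings = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5)]
top = truncate_posting_list(postings, df_max=2)

tokens = "gleb works at epfl in lausanne".split()
close = within_window(tokens, "gleb", "epfl", w=4)
far = within_window(tokens, "gleb", "epfl", w=3)
```

Here "gleb" and "epfl" are three positions apart, so under the assumed semantics the document qualifies for the key gleb&epfl with w=4 but not with w=3.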

HDK: exhaustive data-driven indexing
• Pros:
  • The ICDE’07 paper [1] proves that the number of keys grows linearly with the number of peers
  • Elegant key generation mechanism
  • Low bandwidth consumption during query processing (posting lists of limited size)
• Cons:
  • In practice the number of keys is large: 68M keys for 0.6M documents
  • High bandwidth consumption at indexing
• Problem:
  • Too many keys are superfluous (almost never used)

Query-Driven Indexing
Let’s index only what is queried!

Contents
• Introduction
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
• Indexing structure
• Example
• ONM
• Scalability
• Evaluation
• Conclusion

Query-Driven Index (QDI)
• The query-driven indexing strategy solves the “too many keys” problem:
  • Avoids the maintenance of superfluous keys
  • Generates only keys that are requested by users
  • Uses the query log to discover such keys
• Problems:
  • Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
    • Solution: Opportunistic Notification Mechanism (smart broadcast)
  • An incomplete index degrades the quality of query results
    • We show that the degradation is low

Which keys to index?
• Each single term found in the document collection has to be indexed.
  • We call the set of all single-term keys the basic single-term index.
  • The posting lists are truncated at DFmax.
• A key k is non-superfluous and can be activated iff:
  • k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
  • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
  • all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
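The three filters combine into a single activation test, sketched below. The data layout is an assumption made for the example: keys are `frozenset`s of terms, `qf` maps a key to its query frequency, and `index` maps each indexed key to a boolean saying whether its posting list is truncated. The names `immediate_subkeys` and `can_activate` are hypothetical.

```python
from itertools import combinations


def immediate_subkeys(key):
    """All sub-keys obtained by removing exactly one term from key."""
    return [frozenset(c) for c in combinations(sorted(key), len(key) - 1)]


def can_activate(key, qf, index, qf_min, s_max):
    """Apply the size, popularity and redundancy filters to a candidate key.

    qf:    dict mapping key -> query frequency from the query log
    index: dict mapping indexed key -> True if its posting list is truncated
    """
    if not 2 <= len(key) <= s_max:        # size filter
        return False
    if qf.get(key, 0) < qf_min:           # popularity filter
        return False
    # Redundancy filter: every immediate sub-key must be indexed and truncated.
    return all(index.get(sub, False) for sub in immediate_subkeys(key))


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
index = {a: True, b: False, c: True}      # the posting list of b is not truncated
qf = {frozenset("ac"): 5, frozenset("ab"): 5, frozenset("abc"): 5}

ok_ac = can_activate(frozenset("ac"), qf, index, qf_min=3, s_max=3)
ok_ab = can_activate(frozenset("ab"), qf, index, qf_min=3, s_max=3)
ok_abc = can_activate(frozenset("abc"), qf, index, qf_min=3, s_max=3)
```

In this toy state only ac passes: ab fails because b is not truncated, and abc fails because none of its two-term sub-keys is indexed yet.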

QDI: Retrieval
• Initial state: only the single-term index is generated.
• Process the query abc:
  • Probe Pabc: nothing is found.
  • Probe Pab, Pbc and Pac: nothing is found.
  • Probe Pa, Pb and Pc: obtain the top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively).
  • Contact the peers in the list and re-rank the obtained results w.r.t. abc.
  • Output the top-10.
• Increment the QF for ab, bc and ac (+1 each).
• ac becomes popular: activate (index) ac.
(Figure: a peer issues the query abc against the key lattice abc; ab, ac, bc; a, b, c.)

QDI: Retrieval 2
• Assume the frequency of b is below DFmax, so its posting list is not truncated.
• Note how the redundancy filter simplifies the lattice in this case: the grayed nodes (abc, ab and bc) cannot be activated.
(Figure: key lattice abc; ab, ac, bc; a, b, c.)

QDI: Retrieval 3
• Initial state: the single-term index is generated and ac is indexed.
• Process the query abc:
  • Probe Pabc: nothing is found.
  • Probe Pab, Pbc and Pac: nothing is found for ab and bc; obtain the result for ac.
  • Probe Pb and obtain the result for b.
  • Contact all peers in the list to re-rank the obtained results w.r.t. abc.
  • Output the top-10.
• Increment the QF for ab, bc and ac (+1 each).
(Figure: a peer issues the query abc against the key lattice abc; ab, ac, bc; a, b, c.)
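The probing order in the retrieval walkthroughs can be sketched as a top-down traversal of the term lattice that skips any key already covered by a larger key that was found. This is a simplified, single-process illustration under assumptions: `index` is a plain dict standing in for the DHT, the function name `probe_lattice` is hypothetical, and the final step (contacting peers to re-rank the candidates w.r.t. the full query) is left out.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe term combinations from the largest to the smallest.

    Returns (found, probed): found maps each indexed key that was hit to its
    posting list; probed lists the keys that were looked up, in order.
    """
    terms = sorted(set(query_terms))
    found, probed = {}, []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # Skip keys that are proper subsets of a key already found.
            if any(key < hit for hit in found):
                continue
            probed.append(key)
            if key in index:
                found[key] = index[key]
    return found, probed


def K(s):
    """Shorthand: K('ac') is the key {a, c}."""
    return frozenset(s)


# First walkthrough: only single-term keys are indexed.
index = {K("a"): ["d1", "d2"], K("b"): ["d3"], K("c"): ["d2", "d4"]}
found1, probed1 = probe_lattice("abc", index)

# Third walkthrough: ac has been activated in the meantime.
index[K("ac")] = ["d2"]
found3, probed3 = probe_lattice("abc", index)

# Candidate documents are the union of the retrieved posting lists.
candidates3 = set().union(*found3.values())
```

With only single terms indexed, all seven lattice nodes are probed. Once ac is indexed, a and c are no longer probed and the answer is assembled from ac and b alone.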

Opportunistic Notification Mechanism
• ONM is used to activate a new multi-term key
• ONM is a “smart” broadcast with the following features:
  • It is based on the shower multicast [2]: each peer within a specified range is contacted only once
  • Notifications are small and low-priority, which allows piggybacking
  • The broadcast is split into several multicast sessions, each time pruning low-score documents
  • It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Trie-Structured Overlays, in P2P’05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS’07

Scalability
• The retrieval traffic is bounded by a constant because posting lists are truncated (the bound depends on DFmax and the query size).
• The indexing traffic depends on the number of keys to be activated.
• The number of keys in the HDK approach (an upper bound for QDI) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents.
• The number of QDI keys does not depend on the document collection size but only on the size of the query log.
• The QFmin parameter adjusts the tradeoff between indexing traffic and retrieval quality.

Overlap experiment
• Use the Wikipedia query log (9M queries, 9-10.2004) to build the index.
• Randomly choose 3K test queries.
• Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its term combinations that are indexed according to the log.
• This mimics our P2P IR system if Google’s ranking is used.
• Example (figure): the original query and its non-superfluous (indexed) combinations; overlap@5 = 3/5 = 60%.

Overlap example
• Cut-n-paste from the simulation log:
>id=481,q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0---->Ov@100=100%
“1920 babe”, qf=0--------->Ov@100= 9%
+++“1920 ruth”, qf=1--------->Ov@100=33%
+++“babe ruth”, qf=495 ------->Ov@100= 69%
---“1920”, qf=716 ------------>Ov@100= 1%
---“babe”, qf=3196 ----------->Ov@100= 2%
---“ruth”, qf=1653 ----------->Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%
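The overlap measure used in the experiment can be written down compactly. The sketch below assumes that overlap@k is the fraction of the reference top-k (the search engine's answer to the full query) that also appears in the union of the result lists of the indexed term combinations; the function names and the sample document ids are made up for illustration.

```python
def union_of_results(result_lists):
    """Union of the (already truncated) result lists of the indexed keys."""
    merged = set()
    for results in result_lists:
        merged.update(results)
    return merged


def overlap_at_k(reference, candidates, k):
    """Fraction of the reference top-k that is present among the candidates."""
    top = reference[:k]
    if not top:
        return 0.0
    pool = set(candidates)
    return sum(1 for doc in top if doc in pool) / len(top)


# Hypothetical data: reference ranking for the full query, and the results
# retrieved for two indexed term combinations.
reference = ["r1", "r2", "r3", "r4", "r5", "r6"]
per_key_results = [["r1", "x1", "r3"], ["r5", "x2"]]

pool = union_of_results(per_key_results)
score = overlap_at_k(reference, pool, k=5)
```

Three of the five reference documents (r1, r3, r5) are in the pool, which reproduces the 3/5 = 60% figure from the overlap@5 example.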

Overlap with Google

Overlap with Yahoo

Overlap with Google (no/partial/full overlap)

P2P Index Simulations
• The number of keys depends only on the query log size and QFmin.
• It does not depend on the collection size.
• The number of keys is much smaller than for the HDK approach (68M keys for 650K documents).

Real query logs?
• Wikipedia queries are unrealistic (too skewed), as users know what they want.
• Real web queries might perform worse?
• Large-scale experiments with real web queries and the TREC collection are reported in [4].
[4] G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer: Web Text Retrieval with a P2P Query-Driven Index. To appear in SIGIR’07

Conclusions
• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks. It:
  • Stores posting lists in a DHT for terms and term combinations
  • Stores at most DFmax top document references in a posting list
  • Efficiently collects the query statistics in a distributed fashion
  • Activates (indexes) only popular keys, based on these statistics
  • Computes the result of a multi-term query based only on the index entries available at the moment, with no costly intersections
• We also showed that:
  • With real query logs our approach achieves good retrieval quality
  • The QFmin parameter adjusts the traffic/quality tradeoff

Thank you for your attention! Questions?