Query-Driven Indexing for P2P Text Retrieval

The Future of Web Search, 19.07.2007, Bertinoro, Italy
Alvis
Query-Driven Indexing for P2P Text Retrieval
Gleb Skobeltsyn, EPFL, Switzerland
June 19, 2007
Joint work with:
• Toan Luu
• Ivana Podnar Žarko
• Martin Rajman
• Karl Aberer
G.Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval

Goal
• Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)
• Each peer:
  • Provides resources (bandwidth, storage)
  • Searches the whole network
  • Publishes its own documents
[Figure: DHT]

Naïve (single-term) approach
... is to distribute the global inverted index in a DHT using term partitioning:
[Figure: DHT peers with key/index (K/I) tables; h(“gleb”) → {d2,d3}, h(“epfl”) → {d1,d2}, h(t’) → {d4,d5}; query “epfl & gleb”: {d1,d2} → {d2}]
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
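The term-partitioned index on this slide can be sketched in a few lines. This is only an illustration: the peer count, the SHA-1 based `peer_for` mapping and the helper names are assumptions made for the example, and a real DHT routes messages between peers instead of reading a shared list.

```python
import hashlib

NUM_PEERS = 4  # hypothetical network size, chosen only for the example


def peer_for(term: str) -> int:
    """Map a term to the peer responsible for it (stand-in for DHT hashing)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PEERS


def publish(peers, doc_id, terms):
    """Add doc_id to the posting list of every term, on the responsible peer."""
    for term in set(terms):
        peers[peer_for(term)].setdefault(term, set()).add(doc_id)


def query_and(peers, terms):
    """Fetch the full posting list of each term and intersect them."""
    posting_lists = [peers[peer_for(t)].get(t, set()) for t in terms]
    if not posting_lists:
        return set()
    return set.intersection(*posting_lists)


# Each peer holds a local {term: posting list} table.
peers = [dict() for _ in range(NUM_PEERS)]
publish(peers, "d1", ["epfl"])
publish(peers, "d2", ["epfl", "gleb"])
publish(peers, "d3", ["gleb"])
result = query_and(peers, ["epfl", "gleb"])  # only d2 contains both terms
```

The cost that motivates the rest of the talk is visible in `query_and`: complete posting lists travel over the network before they are intersected.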

Single-term vs. multi-term P2P indexing
Voc. size could grow exponentially!
How to choose keys to keep a satisfactory retrieval quality?

Multi-term indexing: framework
• Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism
• Each key corresponds to a term or a set of terms
• Each key is assigned to a truncated posting list (TPL) that stores at most DFmax top-ranked document references
• Distributed index contains {key, TPL} pairs
• The indexing load is handled by an optimized DHT layer:
  • F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer, Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
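A truncated posting list can be sketched as a bounded min-heap. The class and field names below are hypothetical, and the `truncated` flag is an assumption of this sketch (it simply records that at least one posting did not fit); the slides only state that a TPL holds at most DFmax top-ranked references.

```python
import heapq


class TruncatedPostingList:
    """Keeps only the df_max best-scoring document references for one key."""

    def __init__(self, df_max):
        self.df_max = df_max
        self._heap = []          # min-heap of (score, doc_id); worst entry on top
        self.truncated = False   # True once a posting has been left out

    def add(self, doc_id, score):
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, (score, doc_id))
            return
        self.truncated = True
        if (score, doc_id) > self._heap[0]:
            # The new posting beats the current worst one: swap them.
            heapq.heapreplace(self._heap, (score, doc_id))

    def top(self):
        """Document references, best score first."""
        return [doc_id for _, doc_id in sorted(self._heap, reverse=True)]


tpl = TruncatedPostingList(df_max=2)
tpl.add("d1", 0.2)
tpl.add("d2", 0.9)
tpl.add("d3", 0.5)
```

After the third insert the list holds d2 and d3, and `tpl.truncated` is set because d1 was dropped.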


Multi-term indexing techniques
• Indexing with Highly Discriminative Keys (HDKs), based on:
  • Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07
  • Beyond term indexing: A P2P framework for Web information retrieval. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, Informatica, vol. 30, no. 2, 2006.
• Query-Driven Indexing (QDI), based on:
  • Web Text Retrieval with a P2P Query-Driven Index. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in SIGIR’07
  • Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in Infoscale’07

Indexing with HDK
• Data-driven key generation:
  • Each time a new document is indexed, the posting list for some key k can reach the maximum size DFmax
  • This triggers the generation of new keys (k + other frequent keys)
• Use a number of filters to reduce the number of keys, e.g.:
  • Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (specified by a window size w).
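The proximity filter could look roughly like the sketch below. The exact window semantics are an assumption here (two occurrences count as close when their token positions differ by at most w); the slides do not spell this out, and `qualifies` is a hypothetical helper name.

```python
def qualifies(tokens, t1, t2, w):
    """Return True if some occurrence of t1 lies within w tokens of t2.

    Assumption of this sketch: "close" means the two token positions
    differ by at most w.
    """
    positions_1 = [i for i, tok in enumerate(tokens) if tok == t1]
    positions_2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) <= w for i in positions_1 for j in positions_2)


doc = "query driven indexing for p2p text retrieval".split()
near = qualifies(doc, "query", "indexing", w=2)    # positions 0 and 2
far = qualifies(doc, "query", "retrieval", w=2)    # positions 0 and 6
```

With w=2 the document qualifies for the key query&indexing but not for query&retrieval, so fewer two-term keys are generated per document.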

Indexing with HDK
• Pros:
  • ICDE’07 paper proves that the number of keys grows linearly
  • Elegant key generation mechanism
  • Low bandwidth during query processing (PLs of limited size)
• Cons:
  • In practice the number of keys is LARGE: 68M for 0.6M docs
  • High bandwidth consumption at indexing
• Problem:
  • Too many keys are superfluous (almost never used)

Query-Driven Indexing
Let's index only what is queried!

Contents
• Introduction
• Single-term vs. multi-term indexing
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
• Indexing structure
• Example
• Scalability
• Evaluation
• Conclusion

Query-Driven Index (QDI)
• The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
  • Avoids maintenance of superfluous keys
  • Generates only those keys that are requested by users
  • Uses the query log to discover such keys
• Problems:
  • Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
    • Smart Broadcast (ONM), or
    • Conventional intersection like TA, but less frequent
  • An incomplete index degrades the quality of query results
    • Show that the degradation is low

Which keys to index?
• Each single term found in the document collection has to be indexed.
  • We call all single-term keys a basic single-term index.
  • The posting lists are truncated at DFmax.
• A key k is non-superfluous and can be activated iff:
  • k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
  • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
  • all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
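The three filters above can be combined into one predicate. The following is a minimal sketch under assumed data structures: `qf` maps a key (a frozenset of terms) to its query frequency, and `index` maps each indexed key to a flag saying whether its posting list is truncated. Both layouts and the function names are hypothetical.

```python
from itertools import combinations


def immediate_subkeys(key):
    """All sub-keys with exactly one term fewer than key."""
    return [frozenset(c) for c in combinations(sorted(key), len(key) - 1)]


def can_activate(key, qf, index, qf_min, s_max):
    """Apply the size, popularity and redundancy filters to a candidate key."""
    key = frozenset(key)
    if not 2 <= len(key) <= s_max:      # size filter
        return False
    if qf.get(key, 0) < qf_min:         # popularity filter
        return False
    # Redundancy filter: every immediate sub-key is indexed AND truncated.
    return all(index.get(sub) is True for sub in immediate_subkeys(key))


# index[key] is True when the key's posting list is truncated at DFmax.
index = {frozenset(["a"]): True, frozenset(["b"]): False, frozenset(["c"]): True}
qf = {frozenset(["a", "c"]): 3, frozenset(["a", "b"]): 3}

ac_ok = can_activate(["a", "c"], qf, index, qf_min=3, s_max=3)
ab_ok = can_activate(["a", "b"], qf, index, qf_min=3, s_max=3)
bc_ok = can_activate(["b", "c"], qf, index, qf_min=3, s_max=3)
```

Here only ac passes: ab fails the redundancy filter because the list for b is complete, and bc has not been queried often enough.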

QDI: Retrieval
• Single-term index is generated
• Process abc
• Probe Pabc
• Probe Pab, Pbc and Pac
• Probe Pa, Pb and Pc
• Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
• Contact peers in the list, re-rank the obtained results w.r.t. abc
• Output top-10
• Inc. the QF for ab, bc and ac
• Activate (index) ac
[Figure: a peer issues the query ?abc against the key lattice (abc; ab, ac, bc; a, b, c); labels: "nothing", "+1", "popular", "DFmax"]

QDI: Retrieval 2
• Assume the frequency of b is below DFmax
• Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated)
[Figure: key lattice (abc; ab, bc, ac; a, b, c) with the nodes abc, ab and bc grayed; label: "DFmax"]

QDI: Retrieval 3
• Single-term index is generated and ac is indexed
• Process abc
• Probe Pabc
• Probe Pab, Pbc and Pac: obtain the result for ac
• Probe Pb and obtain the result for b
• Contact all peers in the list to re-rank the obtained results w.r.t. abc
• Output top-10
• Inc. the QF for ab, bc and ac
[Figure: a peer issues the query ?abc against the key lattice (abc; ab, ac, bc; a, b, c); labels: "nothing", "+1"]
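The probing order in the retrieval walkthroughs can be sketched as a top-down pass over the key lattice. This is an illustration under assumptions: `index` is a plain dictionary from keys (frozensets of terms) to posting lists, `probe_lattice` is a hypothetical name, and re-ranking and QF updates are left out.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe term combinations from largest to smallest.

    A key is skipped when it is a proper subset of a key that was
    already found, since the larger key covers its terms.
    """
    terms = sorted(set(query_terms))
    found = {}
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < hit for hit in found):
                continue  # covered by a larger indexed key
            if key in index:
                found[key] = index[key]
    return found


index = {
    frozenset(["a"]): ["d1", "d2"],
    frozenset(["b"]): ["d2", "d3"],
    frozenset(["c"]): ["d4"],
    frozenset(["a", "c"]): ["d5"],
}
hits = probe_lattice(["a", "b", "c"], index)
```

With ac indexed, the probe returns the lists for ac and b only, matching the "Retrieval 3" slide; without ac it falls back to a, b and c as in the first walkthrough.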

Scalability
• The retrieval traffic is bounded by a constant due to truncated posting lists (depends on DFmax and the query size)
• The indexing traffic depends on the number of keys to be activated.
  • The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
  • The number of keys does not depend on the document collection size but only on the size of the query log
  • We can use the QFmin parameter to adjust the tradeoff: indexing traffic vs. retrieval quality
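To make the first bullet concrete, here is a loose back-of-the-envelope bound. It is not taken from the slides: it assumes that every term combination of size up to smax is probed and that each probe returns a full list of DFmax postings. The function names are hypothetical.

```python
from math import comb


def max_keys_probed(query_size, s_max):
    """Number of term combinations of size 1..s_max in a query."""
    largest = min(query_size, s_max)
    return sum(comb(query_size, size) for size in range(1, largest + 1))


def max_postings_transferred(query_size, s_max, df_max):
    """Loose upper bound: every probed key returns a full truncated list."""
    return max_keys_probed(query_size, s_max) * df_max


three_terms = max_postings_transferred(query_size=3, s_max=3, df_max=100)
five_terms = max_postings_transferred(query_size=5, s_max=3, df_max=500)
```

The bound depends only on the query size, smax and DFmax, not on the number of documents or peers, which is the point of the bullet above.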


AOL logs
• 17M queries from March, April, May 2006 (92 days)
• 650K anonymous user sessions
• Extracted all unique queries from each user session:
…
2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.
2006-05-31 23:50:30 l6 screensaver
2006-05-31 23:50:30 horses for sale in tn ky
2006-05-31 23:50:30 bank of america.com
2006-05-31 23:50:30 ask
2006-05-31 23:50:29 del rosa lanes
2006-05-31 23:50:28 www.spirit airlines.com
2006-05-31 23:50:28 find holy women of the bible
2006-05-31 23:50:27 trains
2006-05-31 23:50:27 todaysmiricles
2006-05-31 23:50:27 constition
2006-05-31 23:50:26 german grocceries in las vegas nv
2006-05-31 23:50:25 porn
2006-05-31 23:50:25 northwest indiana
2006-05-31 23:50:24 united.eprize.net
2006-05-31 23:50:24 jessica laguna
…
<- 0.7Gb

Distribution of combinations in the AOL logs

TREC Experiment
• WT10G collection (~1.69M docs)
• 100 TREC queries (from TREC Web Track 9 & 10)
• Query statistics generated from 17M AOL queries
• Using the Okapi BM25 weighting scheme to compute the ranking score
• QFmin = 1, 3, 5, ∞
• DFmax = 100, 500
• smax = 3
[Table: TREC: Precision at Top Ranked Pages]
Precision is similar to centralized indexing

Overlap experiment
• Use the query log to build the index (days 1..91)
• Choose randomly 2K test queries from day 92
• Answer each test query with Google and compare to the union of top-DFmax Google results for each of its combinations that are indexed according to the logs.
• Mimics our P2PIR system if Google’s ranking is used.
[Figure: example with the original query and its non-superfluous (indexed) combinations; overlap@5 = 3/5 = 60%]
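The overlap measure used in this experiment can be sketched as follows. The function name, argument layout and document identifiers are assumptions made for the example; `reference` stands for the ranked results of the full query and `candidates` for the ranked results of its indexed combinations.

```python
def overlap_at_k(reference, candidates, k, df_max):
    """Fraction of the top-k reference results that also appear in the
    union of the top-df_max results of the indexed combinations."""
    top_reference = reference[:k]
    if not top_reference:
        return 0.0
    pool = set().union(*(results[:df_max] for results in candidates))
    matched = sum(1 for doc in top_reference if doc in pool)
    return matched / len(top_reference)


reference = ["d1", "d2", "d3", "d4", "d5"]          # results for the full query
candidates = [["d1", "d9"], ["d3", "d5", "d7"]]     # results per indexed combination
score = overlap_at_k(reference, candidates, k=5, df_max=100)
```

In this toy input three of the five reference results (d1, d3, d5) are recovered, which reproduces the 3/5 = 60% figure on the slide.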

Overlap example
• Cut-n-paste from the simulation log:
>id=481,q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0---->Ov@100=100%
“1920 babe”, qf=0--------->Ov@100= 9%
+++“1920 ruth”, qf=1--------->Ov@100=33%
+++“babe ruth”, qf=495 ------->Ov@100= 69%
---“1920”, qf=716 ------------>Ov@100= 1%
---“babe”, qf=3196 ----------->Ov@100= 2%
---“ruth”, qf=1653 ----------->Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

Google experiment: impact of smax, DFmax
[Figure: impact of smax for all possible combinations (QFmin=0)]
[Figure: impact of DFmax with QFmin=1, smax=3]

Google experiment: impact of QFmin
[Figure: impact of QFmin (DFmax=600)]
[Figure: number of keys for different QFmin]
• Does not depend on the document collection size
• HDK approach would require ~65M keys for 650K documents
• >30% of badly performing queries are misspellings => real quality is higher

Google experiment: impact of the log size
[Figure: impact of the log size (QFmin=1, DFmax=600)]

Conclusions
• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
  • Stores posting lists in a DHT for terms and term combinations
  • Stores at most DFmax top document references in a posting list
  • Efficiently collects the query statistics in a distributed fashion
  • Based on these statistics, activates (indexes) only popular keys
  • Computes the result of a multi-term query based only on the index entries available at the moment, with no costly intersections
• We also showed that:
  • With real query logs our approach achieves good retrieval quality
  • The QFmin parameter adjusts the traffic/quality tradeoff

Last slide
Thank you for your attention! Questions?
AlvisP2P - to appear in July at http://globalcomputing.epfl.ch/alvis/
