On the feasibility of Peer-to-Peer Web Indexing and Search J. Li, B. Loo, J. Hellerstein, M. Kaashoek, D. Karger, R. Morris Presented by: Ranjit R.
Briefly • P2P full-text keyword search. • Two classes of keyword search: • Flooding. • Intersection of index lists in DHTs. • Feasibility analysis: • P2P networks cannot make naïve use of either technique due to resource constraints. • Paper presents: • Optimizations and compromises for P2P search on DHTs. • Performance results. • Concludes that these optimizations bring the problem to within an order of magnitude of feasibility, i.e., of an optimistic per-query cost budget.
Motivation for P2P Web search • Stress test for P2P infrastructures. • Resistance to censorship. • Robustness. • Infeasibility of existing P2P keyword-search systems: • Gnutella, KaZaA. • Both use flooding, which leads to performance problems (see [6]). • A DHT-based design [17] supports full-text keyword search over 10^5 documents, but the Web holds roughly 5.5 × 10^9 documents [5]. • Key question: • Will P2P Web search work?
Issues to ponder • Size of the problem: • How large is a Web index? • At what rate are Web search queries submitted? • Resource constraints: • Communication costs. • Available disk space on peers. • Goal of this paper: • Evaluate the fundamental costs of, and constraints on, P2P Web search.
Basics of Web search • Inverted index: • Two parts: • Index of terms: • Sorted list of the distinct terms in the document collection. • Posting list for each term: • List of the documents that contain the term. • Complexity: • Term lookup: O(log N), where N is the number of distinct terms.
Basics of Web search (Contd.) • Consider the following two documents: • D1: The GDP increased 2 percent this quarter. • D2: The spring economic slowdown continued to spring downwards this quarter. • An inverted index for these two documents is given below: • 2 [D1] • continued [D2] • downwards [D2] • economic [D2] • GDP [D1] • increased [D1] • percent [D1] • quarter [D1] [D2] • slowdown [D2] • spring [D2] • the [D1] [D2] • this [D1] [D2] • to [D2]
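To make the data structure concrete, here is a minimal sketch in Python that builds this index from the two example documents (my illustration; the lowercasing, tokenization, and function names are assumptions, not the paper's code):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,")].add(doc_id)   # drop trailing punctuation
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {
    "D1": "The GDP increased 2 percent this quarter.",
    "D2": "The spring economic slowdown continued to spring downwards this quarter.",
}
for term, postings in build_inverted_index(docs).items():
    print(term, postings)   # e.g. quarter ['D1', 'D2']
```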
Basics of Web search (Contd.) • Ranking factors: • Importance of the document. • Frequency of the search terms in the document. • How close together the terms occur within the document.
Constraints • Workload: • Google indexes ~3 billion Web documents and serves ~1,000 queries per second [2]. • ~1,000 words per document. • Keys (doc IDs) in the index = 3 × 10^12. • DHT scenario: • Each key = SHA-1(term content), which is 20 bytes. • Total inverted index size = 6 × 10^13 bytes.
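Spelled out (my arithmetic from the slide's figures, not quoted from the paper):

$$3 \times 10^{9}\ \text{docs} \times 10^{3}\ \tfrac{\text{postings}}{\text{doc}} = 3 \times 10^{12}\ \text{postings}, \qquad 3 \times 10^{12} \times 20\ \text{B} = 6 \times 10^{13}\ \text{B} \approx 60\ \text{TB}.$$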
Constraints (Contd.) • Fundamental constraints: • Storage: • How much storage per peer to hold part of the index? • At ~1 GB/peer → 60,000 PCs (with no compression of the index). • Communication: • What is an optimistic communication budget per query? • Total bandwidth consumed by queries ≤ the Internet's capacity. • ~100 gigabits/s in US Internet backbones in 1999 [7]. • 1,000 queries per second. • Assume Web search may consume 10% of Internet capacity. • From the above, budget per query ≈ 10 megabits ≈ 1 MB.
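The per-query budget follows directly (my arithmetic; 10 Mb is ~1.25 MB, which the slide rounds to ~1 MB):

$$\frac{0.10 \times 100\ \text{Gb/s}}{1{,}000\ \text{queries/s}} = 10\ \text{Mb/query} \approx 1\ \text{MB/query}.$$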
Naïve Approaches • Naïve implementations of P2P text search: • Partition-by-document. • Partition-by-keyword. • Partition-by-document: • E.g., Gnutella, KaZaA. • Divide documents among hosts. • Each peer maintains a local inverted index of the documents it is responsible for. • Query approach: flooding; each peer returns its highly ranked doc(s). • Cost: • 60,000 peers. • Flooding reaches every peer → 60,000 packets. • ~100 bytes per packet. • Total bandwidth consumed = 60,000 × 100 B = 6 MB per query.
Naïve Approaches (Contd.) • Partition-by-keyword: • Responsibility for words is divided among peers. • I.e., each peer stores the posting list(s) for the word(s) it is responsible for. • A query for two or more terms requires posting lists to be sent over the network. • Two-term queries: • The smaller posting list is sent to the holder of the larger one, which performs the intersection and returns the highly ranked doc(s); a sketch of this data flow follows.
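A minimal sketch of that two-term protocol (illustrative Python; the 20-byte doc-ID size comes from the slides, everything else is an assumption):

```python
def two_term_query(postings_a, postings_b):
    """Ship the smaller posting list to the peer holding the larger one,
    intersect there, and report how many bytes crossed the network."""
    small, large = sorted((postings_a, postings_b), key=len)
    bytes_sent = len(small) * 20          # 20-byte doc IDs on the wire
    result = sorted(set(small) & set(large))
    return result, bytes_sent

hits, cost = two_term_query(["D1"], ["D1", "D2"])
print(hits, cost)   # ['D1'] 20
```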
Naïve Approaches (Contd.) • Partition-by-keyword: • Cost: • Trace of 81,000 queries to a search engine at mit.edu. • 40% one-term, 35% two-term, 25% three-or-more-term queries. • mit.edu has 1.7 million web pages. • The average query moved 300,000 bytes of postings over the network. • Scaled to the size of the Web indexed by Google (3 billion pages): • 530 MB of postings moved per query.
Naïve Approaches (Contd.) • Which approach to improve upon? • Partition-by-document: • bandwidth/query = 6 MB, or • Partition-by-keyword: • bandwidth/query = 530 MB? • The authors chose: • Partition-by-keyword (530 MB). • Reason: • To capitalize on the large body of existing research on inverted-index intersection.
Optimizations for Partition-by-keyword • Scenario: • Query trace of 81,000 queries against a data set of 1.7 million web pages from mit.edu. • Caching and pre-computation: • Caching: • Avoids re-fetching postings for repeated queries. • Reduced communication costs by 38%. • Pre-computation: • Computing and storing intersections of posting lists in advance. • Not feasible to pre-compute intersections of all term pairs. • Instead, compute intersections of all pairs of popular query terms (term popularity is Zipf-distributed, so a small set of pairs covers many queries). • Savings: 50%.
Optimizations (Contd.) • Compression: • Reduce communication cost. • Approaches: • Bloom filters. • Gap compression. • Adaptive Set Intersection. • Clustering.
Optimizations (Contd.) • Bloom filters (intro): • A probabilistic data structure for quickly testing membership in a large set, hashing each element with multiple hash functions into a single bit array.
Optimizations (Contd.) • Bloom filters (intro): • See "Network Applications of Bloom Filters: A Survey", Broder et al.
Optimizations (Contd.) • Bloom filters: • See "Efficient Peer-to-Peer Keyword Searching", Reynolds et al.
Optimizations (Contd.) • Bloom filters: • Represent a set compactly. • Some probability of false positives. • Two-round Bloom intersection: • Compression ratio of 13. • Four-round Bloom intersection: • Compression ratio of 40. (More rounds mean fewer false positives, provided the intersection set is small.) • Compressed Bloom filters: • Compression ratio of 50.
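For concreteness, a minimal Bloom filter and the two-round intersection flow it enables (my sketch; the bit-array size, hash count, and function names are assumptions, not the parameters measured in the paper):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions into one bit array."""
    def __init__(self, m_bits=8192, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def two_round_bloom_intersect(postings_a, postings_b):
    """Round 1: peer A sends a filter of its list; B returns candidates
    (possibly with false positives). Round 2: A removes the false
    positives by checking candidates against its true list."""
    f = BloomFilter()
    for d in postings_a:
        f.add(d)
    candidates = [d for d in postings_b if d in f]       # crosses the network
    return [d for d in candidates if d in set(postings_a)]
```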
Optimizations (Contd.) • Gap compression (GC): • Periodically remap doc IDs from 160-bit hashes to integers from 1 to the number of documents, so consecutive IDs have small gaps. • E.g., D-Gap compression of bit blocks: • Bitmap: 0001000111001111 • 1-bits at (0-based) positions {3, 7, 8, 9, 12, 13, 14, 15} • Run lengths, starting with a 0-run: {[0], 3, 1, 3, 3, 2, 4} • Cumulative run-end positions: {[0], 2, 3, 6, 9, 11, 15}, i.e., GAP(N) = GAP(N-1) + Length(run N) • Gaps can be coded compactly (e.g., Fibonacci codes). • http://bmagic.sourceforge.net/dGap.html • Compression ratio of 30.
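The core idea in code form (a sketch of gap plus variable-byte coding; the paper's exact integer code may differ):

```python
def gap_encode(sorted_doc_ids):
    """Store each doc ID as a delta from the previous one,
    variable-byte coded: 7 data bits per byte, high bit = 'more'."""
    out, prev = bytearray(), 0
    for d in sorted_doc_ids:
        gap, prev = d - prev, d
        while gap >= 128:
            out.append(0x80 | (gap & 0x7F))
            gap >>= 7
        out.append(gap)
    return bytes(out)

def gap_decode(data):
    ids, total, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # continuation byte: more bits follow
        else:
            total += acc        # gap complete: rebuild the absolute ID
            ids.append(total)
            acc, shift = 0, 0
    return ids

assert gap_decode(gap_encode([3, 7, 300, 1000])) == [3, 7, 300, 1000]
```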
Optimizations (Contd.) • Adaptive set intersection (AS): • Avoid transferring whole posting lists by exploiting their structure. • E.g.: • A = {1, 3, 4, 7} and B = {8, 10, 20, 30}: since max(A) = 7 < 8 = min(B), A ∩ B = ∅ can be established almost for free. • A = {1, 4, 8, 20} and B = {3, 7, 10, 30}: the lists interleave, so essentially all of A must be transferred. • Compression ratio of 40 (with GC). • Clustering: • Statistical clustering techniques group similar documents so their IDs fall near each other. • Achieves a compression ratio of 75 with GC and AS.
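A simple stand-in for the skipping behavior (galloping via binary search; illustrative, not the paper's exact algorithm):

```python
from bisect import bisect_left

def adaptive_intersect(a, b):
    """Intersect two sorted posting lists, skipping ahead by binary
    search instead of scanning element by element."""
    if len(a) > len(b):
        a, b = b, a                        # iterate over the shorter list
    result, lo = [], 0
    for x in a:
        lo = bisect_left(b, x, lo)         # skip everything in b below x
        if lo == len(b):
            break                          # nothing left in b can match
        if b[lo] == x:
            result.append(x)
            lo += 1
    return result

print(adaptive_intersect([1, 3, 4, 7], [8, 10, 20, 30]))  # []
```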
Compromises • Maximum reduction in communication costs from the optimizations above: 75×. • A further ~7× comes from compromising result quality and P2P structure (75 × 7 ≈ the 530× reduction needed to fit the ~1 MB budget). • Compromising result quality: • Incremental intersection (figure from Reynolds et al.) plus ranking functions for results; a sketch follows.
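A sketch of incremental intersection, under the assumption that posting lists are ordered best-ranked-first (function names, chunk size, and stopping rule are my illustration):

```python
def incremental_intersect(remote_postings, local_postings, want=10, chunk=128):
    """Pull the remote posting list one chunk at a time and stop as soon
    as `want` results are found, trading completeness for bandwidth."""
    hits, local = [], set(local_postings)
    for i in range(0, len(remote_postings), chunk):
        for d in remote_postings[i:i + chunk]:   # one chunk crosses the network
            if d in local:
                hits.append(d)
                if len(hits) >= want:
                    return hits                  # early exit saves transfer
    return hits
```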
Compromises (Contd.) • Compromising P2P structure: • Exploit the Internet's aggregate bandwidth, e.g., by replicating the entire inverted index with one copy per ISP.
Conclusion • Feasibility analysis for P2P Web search. • Naïve search implementations are not feasible. • A combination of optimizations and compromises brings P2P Web search to within an order of magnitude of feasibility.