Slide 1:On the feasibility of Peer-to-Peer Web Indexing and Search
J. Li, B. Loo, J. Hellerstein, M. Kaashoek, D. Karger, R. Morris. Presented by: Ranjit R.
Slide 2:Briefly
P2P full-text keyword search. Two classes of keyword search: flooding, and intersection of index (posting) lists in DHTs. Feasibility analysis: P2P networks cannot make naïve use of either technique due to resource constraints. The paper presents optimizations and compromises for P2P search on DHTs and evaluates their performance. It concludes that these optimizations bring the problem to within an order of magnitude of feasibility, i.e. close to an optimistic per-query cost budget.
Slide 3:Motivation for P2P Web search
Stress test for P2P infrastructures. Resistance to censorship. Robustness. Infeasibility of existing P2P keyword-based search systems: Gnutella, KaZaA. Both use flooding → performance problems (see [6]). DHTs [17] propose full-text keyword search over 10^5 documents (but there are 5.5 × 10^9 documents on the web [5]). Key question: Will P2P Web search work?
Slide 4:Issues to ponder
Size of the problem: How large is a Web index? At what rate are Web search queries submitted? Resource constraints: communication costs; available disk space on peers. Goal of this paper: evaluate the fundamental costs of, and constraints on, P2P Web search.
Slide 5:Basics of Web search
Inverted index: two parts. Index of terms: the distinct terms in the document collection, kept in sorted order. Posting list for each term: the list of documents that contain the term. Complexity: term lookup is O(log N), where N is the number of distinct terms.
Slide 6:Basics of Web search (Contd.)
Consider the following two documents. D1: The GDP increased 2 percent this quarter. D2: The spring economic slowdown continued to spring downwards this quarter. An inverted index for these two documents is given below:
2 → [D1]
continued → [D2]
downwards → [D2]
economic → [D2]
GDP → [D1]
increased → [D1]
percent → [D1]
quarter → [D1, D2]
slowdown → [D2]
spring → [D2]
the → [D1, D2]
this → [D1, D2]
to → [D2]
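A minimal sketch of how such an inverted index could be built (illustrative only, not from the paper):

```python
# Illustrative sketch: build the inverted index for the two example documents.
from collections import defaultdict

docs = {
    "D1": "The GDP increased 2 percent this quarter",
    "D2": "The spring economic slowdown continued to spring downwards this quarter",
}

index = defaultdict(set)                 # term -> set of docIDs (posting list)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

for term in sorted(index):               # sorted term index
    print(term, "->", sorted(index[term]))
```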
Slide 7:Basics of Web search (Contd.)
Ranking considers: the importance of the documents; the frequency of the search terms in the document; how close together the terms occur within the document.
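As a toy illustration of one of these factors, a score based purely on term frequency might look like the following (hypothetical, not the paper's ranking function):

```python
# Toy ranking sketch: score a document by how often the query terms appear.
# Real engines also weigh document importance and term proximity.
def tf_score(doc_terms, query_terms):
    return sum(doc_terms.count(t) for t in query_terms)

print(tf_score("the spring economic slowdown continued this spring".split(),
               ["spring", "slowdown"]))   # 3
```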
Slide 8:Constraints
Workload: Google indexes ~3 billion Web docs and serves ~1,000 queries per second [2]; ~1,000 words per doc. Keys (docIDs) in the index = 3 × 10^12. DHT scenario: each key is a 20-byte SHA-1 hash. Total inverted index size = 6 × 10^13 bytes.
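The arithmetic behind these figures, using the slide's assumptions:

```python
# Back-of-the-envelope index size from the slide's numbers.
docs_indexed      = 3e9      # ~3 billion documents indexed by Google
words_per_doc     = 1000     # average words per document
bytes_per_posting = 20       # 20-byte SHA-1 docID per posting entry

postings    = docs_indexed * words_per_doc     # index entries
index_bytes = postings * bytes_per_posting
print(f"{postings:.0e} postings, {index_bytes:.0e} bytes")   # 3e+12 postings, 6e+13 bytes
```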
Slide 9:Constraints (Contd.)
Fundamental constraints. Storage constraint: how much storage per peer is needed to hold part of the index? At ~1 GB/peer → ~60,000 PCs (with no compression of the index). Communication constraint: what is an optimistic communication budget per query? Bound the total bandwidth consumed by queries by a fraction of the Internet's capacity: ~100 gigabits/s in US Internet backbones in 1999 [7]; 1,000 queries per second; assume Web search may consume 10% of Internet capacity. From the above, the budget is about 10 megabits ≈ 1 MB per query.
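The per-query budget follows directly from these assumptions:

```python
# Per-query communication budget implied by the slide's assumptions.
backbone_bits_per_s = 100e9   # ~100 Gbit/s of US Internet backbone capacity (1999)
search_share        = 0.10    # assume Web search may consume 10% of it
queries_per_s       = 1000

bits_per_query = backbone_bits_per_s * search_share / queries_per_s
print(bits_per_query / 1e6, "Mbit/query")     # 10.0 Mbit/query
print(bits_per_query / 8 / 1e6, "MB/query")   # 1.25 MB/query, i.e. roughly 1 MB
```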
Slide 10:Naïve Approaches
Naïve implementations of P2P text search: partition-by-document and partition-by-keyword. Partition-by-document (e.g. Gnutella, KaZaA): divide documents among hosts; each peer maintains a local inverted index of the documents it is responsible for. Query approach: flooding; return highly ranked doc(s). Cost: 60,000 peers; flooding to each peer → 60,000 packets; each packet ~100 bytes; total bandwidth consumed = 6 MB per query, already several times the ~1 MB budget.
Slide 11:Naïve Approaches (Contd.)
Partition-by-keyword: responsibility for words is divided among peers, i.e. each peer stores the posting list for the word(s) it is responsible for. A query for one or more terms requires posting lists to be sent over the network. Two-term queries: the smaller posting list is sent to the holder of the larger one, which performs the intersection and returns highly ranked doc(s).
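A sketch of the naïve two-term step (names are illustrative); the set that crosses the network is the smaller posting list:

```python
# Illustrative sketch of the naïve two-term query in partition-by-keyword:
# the peer holding the smaller posting list ships it to the peer holding the
# larger one, which performs the intersection locally.
def intersect_two_terms(postings_a, postings_b):
    smaller, larger = sorted((postings_a, postings_b), key=len)
    shipped = set(smaller)                 # this set crosses the network
    return [doc for doc in larger if doc in shipped]

print(intersect_two_terms(["D1", "D7", "D9"], ["D2", "D7", "D8", "D9", "D11"]))
# ['D7', 'D9']
```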
Slide 12:Naïve Approaches (Contd.)
Partition-by-keyword cost: a trace of 81,000 queries to the search engine at mit.edu (40% one term, 35% two terms, 25% three or more terms); mit.edu has 1.7 million web pages. The average query moved 300,000 bytes of postings over the network. Scaling to the size of the Web indexed by Google (3 billion pages) → 530 MB of postings moved per query.
Slide 13:Naïve Approaches (Contd.)
Which approach should be improved: partition-by-document (bandwidth/query = 6 MB) or partition-by-keyword (bandwidth/query = 530 MB)? The authors chose partition-by-keyword (530 MB), to capitalize on the vast body of research on inverted index intersection.
Slide 14:Optimizations for Partition-by-keyword
Scenario: a query trace of 81,000 queries on a data set of 1.7 million web pages from mit.edu. Caching and pre-computation. Caching: avoid re-fetching postings for repeated queries; reduced communication costs by 38%. Pre-computation: compute and store intersections of posting lists in advance. It is not feasible to compute intersections of all term pairs; instead, compute intersections of all pairs of popular query terms (query popularity is Zipf-distributed). Savings: 50%.
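A hedged sketch of how caching and pair pre-computation could be combined at a peer (names and structure are illustrative, not the paper's implementation):

```python
# Hedged sketch (names illustrative): cache fetched posting lists and keep
# precomputed intersections for popular term pairs (query popularity is Zipfian).
posting_cache = {}   # term -> posting list fetched from its home peer earlier
pair_cache    = {}   # (term1, term2) -> precomputed intersection

def two_term_result(t1, t2, fetch_postings):
    key = tuple(sorted((t1, t2)))
    if key in pair_cache:                      # precomputed popular pair: no transfer
        return pair_cache[key]
    lists = []
    for term in key:
        if term not in posting_cache:          # cache hit avoids re-fetching postings
            posting_cache[term] = fetch_postings(term)
        lists.append(posting_cache[term])
    return sorted(set(lists[0]) & set(lists[1]))
```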
Slide 15:Optimizations (Contd.)
Compression: Reduce communication cost. Approaches: Bloom filters. Gap compression. Adaptive Set Intersection. Clustering.
Slide 16:Optimizations (Contd.)
Bloom filters (Intro.): a probabilistic data structure for quickly testing membership in a large set, using multiple hash functions that map into a single array of bits.
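A minimal Bloom filter sketch to make the idea concrete (the parameters m and k are arbitrary here):

```python
# Minimal Bloom filter sketch: k hash functions set/test bits in one bit array.
import hashlib

class BloomFilter:
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):                       # k independent hash positions
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):                     # false positives possible,
        return all(self.bits[p // 8] & (1 << (p % 8)) # false negatives are not
                   for p in self._positions(item))
```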
Slide 17:Optimizations (Contd.)
Bloom filters (Intro.): see "Network Applications of Bloom Filters: A Survey," Broder et al.
Slide 18:Optimizations (Contd.)
Bloom filters: see "Efficient Peer-to-Peer Keyword Searching," Reynolds et al.
Slide 19:Optimizations (Contd.)
Bloom filters: represent a set compactly, at the cost of a probability of false positives. Two-round Bloom intersection: compression ratio of 13. Four-round Bloom intersection: compression ratio of 40. (The more rounds, the fewer the false positives, provided the intersection set is small.) Compressed Bloom filters: compression ratio of 50.
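A sketch of the two-round Bloom intersection, reusing the BloomFilter sketch above (illustrative; the real protocol runs between two peers):

```python
# Sketch of two-round Bloom intersection (after Reynolds et al.), reusing the
# BloomFilter sketch above. Peer A holds list_a, peer B holds list_b.
def bloom_intersect(list_a, list_b):
    # Round 1: A sends a compact Bloom filter of its postings instead of the raw list.
    bf_a = BloomFilter()
    for doc in list_a:
        bf_a.add(doc)
    candidates = [doc for doc in list_b if doc in bf_a]   # may include false positives
    # Round 2: B returns the (small) candidate set; A filters out false positives exactly.
    return [doc for doc in candidates if doc in set(list_a)]
```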
Slide 20:Optimizations (Contd.)
Gap compression (GC): periodically remap docIDs from 160-bit hashes to integers from 1 to the number of documents, so that posting lists can be stored as small gaps or run lengths. E.g. D-Gap compression of the bit block 0001000111001111 (set bits at positions {3, 7, 8, 9, 12, 13, 14, 15}): run-length form {[0], 3, 1, 3, 3, 2, 4} (leading bit value, then run lengths), or cumulative form {[0], 2, 3, 6, 9, 11, 15} (run end positions, where GAP[N] = GAP[N-1] + length of run N). Gaps can be further encoded, e.g. with Fibonacci codes. See http://bmagic.sourceforge.net/dGap.html. Compression ratio of 30.
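A small sketch of the D-Gap run-length encoding, reproducing the example above:

```python
# D-Gap sketch: encode a posting bitmap as its leading bit value plus run lengths.
def dgap_encode(bits):
    runs, current, count = [], bits[0], 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return [int(bits[0])] + runs

print(dgap_encode("0001000111001111"))   # [0, 3, 1, 3, 3, 2, 4]
```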
Slide 21:Optimizations (Contd.)
Adaptive set intersection (AS): avoid transferring whole posting lists by exploiting their structure. E.g. for A = {1, 3, 4, 7} and B = {8, 10, 20, 30}: since max(A) = 7 < 8 = min(B), A ∩ B = ∅ and almost nothing needs to be exchanged. For A = {1, 4, 8, 20} and B = {3, 7, 10, 30} the lists interleave, so the transfer of A is required. Compression ratio of 40 (with GC). Clustering: statistical clustering techniques group similar documents; achieves a compression ratio of 75x with GC and AS.
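A sketch of the adaptive short-cut on sorted posting lists (illustrative; the real algorithm adapts at a finer granularity than whole lists):

```python
# Sketch of the adaptive short-cut: if the ranges of two sorted posting lists
# do not overlap, the intersection is empty and almost nothing need be sent.
def adaptive_intersect(a, b):              # a, b: sorted posting lists
    if not a or not b or a[-1] < b[0] or b[-1] < a[0]:
        return []                          # disjoint ranges: exchange one element only
    return sorted(set(a) & set(b))         # interleaved lists: fall back to full transfer

print(adaptive_intersect([1, 3, 4, 7], [8, 10, 20, 30]))   # [] via the short-cut
print(adaptive_intersect([1, 4, 8, 20], [3, 7, 10, 30]))   # [] only after a full comparison
```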
Slide 22:Optimizations (Contd.)
Slide 23:Compromises
Maximum reduction in communication costs from the optimizations: 75x. Another ~7x improvement comes from compromising result quality and P2P structure (target = 530x cost reduction). Compromising result quality: incremental intersection (figure from Reynolds et al.) combined with ranking functions for results.
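A hedged sketch of incremental intersection (chunk size and result count are arbitrary): stream the rank-ordered posting list in chunks and stop once enough results are found, trading result completeness for bandwidth.

```python
# Hedged sketch of incremental intersection: walk the rank-ordered posting list
# in chunks and stop once enough intersecting results have been found.
def incremental_intersect(ranked_postings, other_postings, wanted=10, chunk=128):
    other = set(other_postings)
    results = []
    for start in range(0, len(ranked_postings), chunk):    # one chunk per round trip
        for doc in ranked_postings[start:start + chunk]:
            if doc in other:
                results.append(doc)
                if len(results) >= wanted:
                    return results                         # early exit saves transfers
    return results
```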
Slide 24:Compromises (Contd.)
Compromising P2P structure: exploit the Internet's aggregate bandwidth, e.g. by replicating the entire inverted index with one copy per ISP.
Slide 25:Conclusion
Feasibility analysis for P2P Web search. A naïve search implementation is not feasible. Feasibility is approached through a combination of optimizations and compromises.