350 likes | 369 Views
Seminar “Peer-to-peer Information Systems”. PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. Speaker: Sergey Lugin. January 2003. Outline. Introduction 2. Architecture
E N D
Seminar “Peer-to-peer Information Systems” PlanetP: Using Gossiping to Build Content Addressable Peer-to-PeerInformation Sharing Communities F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen Speaker: Sergey Lugin January 2003
Outline • Introduction • 2. Architecture • local and global index • gossiping algorithm • content search and ranking • 3. Performance • content search and ranking • gossiping algorithm • 4. Group extension • 5. Related works and summary
PlanetP features: • a gossiping layer used to globally replicate an extremely compact content index • a completely distributed content search and ranking algorithm that helps users find the most relevant information Introduction PlanetP is a content addressable publish/ subscribe service for unstructured peer-to-peer (P2P) communities: • no central management • resilience to rapid membership changes • content search and ranking • scaling up to several thousand peers
Outline • Introduction • 2. Architecture • local and global index • gossiping algorithm • content search and ranking • 3. Performance • content search and ranking • gossiping algorithm • 4. Group extension • 5. Related works and summary
Local index Local Files • Contents of peer files (video, music) are described in XML snippets that contain pointers to corresponded files • Each peer runs a Simple Web Server to support peer’s retrieval of these files (to be considered further) • Each peer summarizes the set of unique terms in its local index in a BLOOM FILTER pointers Local index Simple Web Server XML Snippets unique terms BLOOM FILTER: 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1
Terms: paper … cat hash functions 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 yes BLOOM FILTER: BLOOM FILTER: query term:cat hash functions BLOOM FILTER The Bloom filter is a bit-array that allows to quickly test a membership in a large term set using hash functions. • compact form for the representation of a large term set • quickness • only Boolean answer ( Yes / No ) • “false positive” problem
Global index • Global directory describes all peers and all available information • in the community Global directory (index) • Global directoryis replicated everywhere by using gossiping • Gossiping is used also to keep peers synchronized • joining or leaving of peers • new data
PlanetP’s gossiping algorithm: • rumoring algorithm • anti-entropy algorithm • partial anti-entropy algorithm Demer’s algorithm Gossiping algorithm PlanetP uses gossiping to replicate the global index across peers of a P2P community • robustness to the dynamic joining and leaving of peers • independence from any particular subset of peers being on-line
rumor • every Tg seconds, a peer Px pushes this change (a rumor) to a peer Py chosen randomly from the global index rumor Py rumor • if the rumor is new information for the peer Py, then it starts to push this rumor just like Px rumor Rumoring algorithm Purpose:The algorithm provides spreading of new information across a P2P community A peer has a change Px A peer has a change: • the peer Px stops pushing the rumor after it has contacted n consecutive peers that already heard the rumor
pull • every Tr seconds, a peer Pxattempts to pull information from a peer Py chosen randomly from the global index Py • the peer Py returns the summaryof its version of the global index global index summary Px Anti-entropy algorithm Purpose:The algorithm allows to avoid the possibility of rumors dying out before reaching everyone All peers: • then Px can ask Py for any new information that it does not have
rumor • a peer Px pushes a rumor to a peer Py • the peer Py piggybacks the identifiers of a small number of the most recent rumors Py identifiers of recent rumors Px Partial anti-entropy algorithm Purpose:The algorithmallows to reduce the time of new information spreading Extension of each push operation: • then Px can pull any recent rumor that did not reach it The process requires only one extra message exchange in the case that Pyknows something that Px does not
Joining of new peers: • gossiping NEW Leaving of present peers: • a peer discovers that another peer is OFF-LINE when an attempt to communicate with it fails: • the peer status is marked as OFF-LINE • information in the global index is not dropped • if a peer has been marked as OFF-LINE continuously for a time TDead, it is assumed that the peer has left the community permanently: • all information about the peer is dropped OFF-LINE Rejoining of peers: • gossiping (the peer status is marked as ON-LINE) ON-LINE Membership
Content search and ranking • PlanetP implements a Content Ranking Algorithm that • uses the Vector Space Ranking Model • Vector Space Ranking Model • each document and query is abstractly represented as a vector • each dimension is associated with a distinct term • the value of each component of a vector is the weight representing the importance of that term to the corresponding document or query • Relevance of the document and query • the cosine of the angle between the query vector and document vector Q is a query, D is a document, t is a term, wQ,T – the weight of the term t for the query Q, wD,t – the weight of the term t for the document D
Content Ranking Algorithm - Background • TFxIDF is a popular method for term weight assignments • TF is a Term Frequency • how often the term appears in the document • IDF is a Inverse Document Frequency • the inverse of how often this term appears in the entire collection • This technique allows to balance: • the fact that terms frequently used in a document are likely important to describe its meaning • terms that appear in many documents in a collection are not useful for differentiating between these documents
PlanetP’s proposed solution Approximation TFxIDF two sub-problems: 1. Ranking peers according to their likelihood of having relevant documents 2. Deciding on the number of peers to contact and ranking identified documents Approximation TFxIDF • Problems of the computation TFxIDF for a P2P community: • documents are distributed across a P2P community • the peer bandwidth is restricted to use How this technique can be used for a P2P community ?
Ranking of peers Querying Fred Fred Doc - A 0,93 Bob Bob Doc- E Query 0,79 Doc - D 0,77 Tan PlanetP’s Content Search Steps: • Ranking peers 2. Querying the most relevant peers Ranking results Global directory name rank
Ranking peers • 1. Ranking peers • new measure similar IDF Inverse Peer Frequency (IPF) • a term that is present in the bloom filter of every peer is not useful for differentiating between the peers for a particular query • IPF for a term t N – is the number of all peers Nt– is the number of peers having the term t Ranking peers for a query Q:
Quering peers 2. Querying the most relevant peers • Problem: As communities grow, it becomes infeasible to contact large subsets of peers for each query • Solution: for a query Q,the user specifies a limit K on the number of potential documents that should be presented • PlanetP sorts ranked peer lists and contacts peers (the most relevant peers) • Each contacted peer returns a set of document URLs together with their relevance
Ranking results ranked documents name rank Doc - A 0,93 K Doc- E 0,79 p 0,77 Doc - D stop • Simulation results showed that p should be a function of the community size N and K as follows: stopping heuristic: C0, C1, C2 – are constant values Quering peers 2. Querying the most relevant peers • PlanetP stops contacting peers when the documents identified bypconsecutive peers fail to contribute to the top K ranked documents Sorted peer list
Outline • Introduction • 2. Architecture • local and global index • gossiping algorithm • content search and ranking • 3. Performance • content search and ranking • gossiping algorithm • 4. Group extension • 5. Related works and summary
Performance Performance Study: • Content Search • Efficacy • Gossiping • Time • Bandwidth Usage • Performance Study was based on the developed simulator • The simulator was validated against measurements taken from prototype (up to several hundred peers)
Content search and ranking algorithm • Metrics: Recall (R), Precision (P) • Input data: The collection AP89 was extracted from the TREC collection (Associated Press) • No. Docs = 84678 • No. Unique Terms = 129603 • Different document-to-peer distribution: • Uniform (The worst case for a distributed search) • Weibull (7% of the users in the Gnutella community share more files than all the rest together)
Content search and ranking algorithm T.W is a search engine using TFxIDF (centralized implementation) P.W is the PlanteP’s search engine (Weibull distribution of documents) P.U is the PlanteP’s search engine (Uniform distribution of documents) T.W T.W P.W P.W P.U P.U Recall Precision No. documents requested No. documents requested
Content search and ranking algorithm T.W is a search engine using TFxIDF (centralized implementation) P.W is the PlanteP’s search engine (Weibull distribution of documents) P.U is the PlanteP’s search engine (Uniform distribution of documents) community of 400 peers stopping heuristic No. peers contacted Recall Number of peers No. documents requested
Observations • PlanetP tracks the performance of the centralized implementation closely • Performance is independent of how the shared documents are distributed. • PlanetP’s recall and precision is within 11% of TFxIDF’s implementation. 2. PlanetP scales well for communities of up to 1000 peers, maintaining a relatively constant recall and precision. 3. PlanetP’s stopping heuristic allows to maintain the close recall and precision independently of how the documents are distributed.
Gossiping algorithm Measured factor:propagation time Time required to propagate a single Bloom filter everywhere. LAN-AE: Peers use only push anti-entropy: each peer periodically push a summary of its data structure. The target requests all new information from this summary. LAN: Peers use PlanetP’s gossiping algorithm. Network:45 Mbps • LAN-AE • LAN (PlanetP) PlanetP’s parameters Time (sec) No. peers
Gossiping algorithm PlanetP’s gossiping algorithm Measured factor:propagation time, average bandwidth Network:512 Kbps Different gossiping intervals: 10 sec (DSL-10) 30 sec (DSL-30) 60 sec (DSL-60) • DSL- 60 • DSL- 30 • DSL- 10 • DSL- 60 • DSL- 30 • DSL- 10 Average Bandwidth (Bytes/s) Time (sec) No. peers No. peers
Observations 1. The algorithm significantly outperforms ones that use only push anti-entropy for both propagation time and network volume 2. Propagation time is a logarithmic function of community size 3. Total number of bytes sent is very modest, implying that gossiping is very scalable 4. We can easily trade off propagation time against gossiping bandwidth by increasing or decreasing the gossiping interval
Outline • Introduction • 2. Architecture • local and global index • gossiping algorithm • content search and ranking • 3. Performance • content search and ranking • gossiping algorithm • 4. Group extension • 5. Related works and summary
Group extension – Gossiping attenuated BF Group B Group C Group A • Community is divided into a number of groups • Peers within the same group operate as described above • Peers from different groups will gossip an attenuated Bloom filter that is a summary of the global index for their groups
query ranked list Name Peer rank PC3 0.84 PC2 0.77 • the peer PA1queries to a random peerPCi from the group C • the peer PCi returns a ranked list of peers in the group C Group extension – Content search Group C Group A PA1 PC2 PCi PC3 Search: • a peer PA1 (group A) try to find documents that are relevant to a query Q • the Bloom filter of the group C contains relevant terms to a query Q
Outline • Introduction • 2. Architecture • local and global index • gossiping algorithm • content search and ranking • 3. Performance • content search and ranking • gossiping algorithm • 4. Group extension • 5. Related works and summary
Related works • Tapestry, Pastry, Chord and CAN • use distributed hash tables (DHT): key – value • provide search mechanisms based on the key Problem:The high cost of publishing thousands of keys per file • Cori and Closs • address the problems of database selection and ranking fusion on distributed collections • use servers to keep a reduced index of the contents Problems:The need of centralized resources The possibility of a single point failure
Summary Content addressable publish/ subscribe service for unstructured P2P communities • Gossiping algorithm • provides the propagation of shared information everywhere • provides the robustness to the dynamic peer behavior • Content search and ranking algorithm • provides the search capabilities comparable with centralized resources • operates independently of how documents are distributed throughout the community The first work that supports content ranking.
PlanetP Questions ?