PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities

Seminar “Peer-to-peer Information Systems” PlanetP: Using Gossiping to Build Content Addressable Peer-to-PeerInformation Sharing Communities F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen Speaker: Sergey Lugin January 2003

Outline • Introduction • 2. Architecture • local and global index • gossiping algorithm • content search and ranking • 3. Performance • content search and ranking • gossiping algorithm • 4. Group extension • 5. Related works and summary

PlanetP features: • a gossiping layer used to globally replicate an extremely compact content index • a completely distributed content search and ranking algorithm that helps users find the most relevant information Introduction PlanetP is a content addressable publish/ subscribe service for unstructured peer-to-peer (P2P) communities: • no central management • resilience to rapid membership changes • content search and ranking • scaling up to several thousand peers

Local index Local Files • Contents of peer files (video, music) are described in XML snippets that contain pointers to corresponded files • Each peer runs a Simple Web Server to support peer’s retrieval of these files (to be considered further) • Each peer summarizes the set of unique terms in its local index in a BLOOM FILTER pointers Local index Simple Web Server XML Snippets unique terms BLOOM FILTER: 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1

Terms: paper … cat hash functions 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 1 yes BLOOM FILTER: BLOOM FILTER: query term:cat hash functions BLOOM FILTER The Bloom filter is a bit-array that allows to quickly test a membership in a large term set using hash functions. • compact form for the representation of a large term set • quickness • only Boolean answer ( Yes / No ) • “false positive” problem

Global index • Global directory describes all peers and all available information • in the community Global directory (index) • Global directoryis replicated everywhere by using gossiping • Gossiping is used also to keep peers synchronized • joining or leaving of peers • new data

PlanetP’s gossiping algorithm: • rumoring algorithm • anti-entropy algorithm • partial anti-entropy algorithm Demer’s algorithm Gossiping algorithm PlanetP uses gossiping to replicate the global index across peers of a P2P community • robustness to the dynamic joining and leaving of peers • independence from any particular subset of peers being on-line

rumor • every Tg seconds, a peer Px pushes this change (a rumor) to a peer Py chosen randomly from the global index rumor Py rumor • if the rumor is new information for the peer Py, then it starts to push this rumor just like Px rumor Rumoring algorithm Purpose:The algorithm provides spreading of new information across a P2P community A peer has a change Px A peer has a change: • the peer Px stops pushing the rumor after it has contacted n consecutive peers that already heard the rumor

pull • every Tr seconds, a peer Pxattempts to pull information from a peer Py chosen randomly from the global index Py • the peer Py returns the summaryof its version of the global index global index summary Px Anti-entropy algorithm Purpose:The algorithm allows to avoid the possibility of rumors dying out before reaching everyone All peers: • then Px can ask Py for any new information that it does not have

rumor • a peer Px pushes a rumor to a peer Py • the peer Py piggybacks the identifiers of a small number of the most recent rumors Py identifiers of recent rumors Px Partial anti-entropy algorithm Purpose:The algorithmallows to reduce the time of new information spreading Extension of each push operation: • then Px can pull any recent rumor that did not reach it The process requires only one extra message exchange in the case that Pyknows something that Px does not

Joining of new peers: • gossiping NEW Leaving of present peers: • a peer discovers that another peer is OFF-LINE when an attempt to communicate with it fails: • the peer status is marked as OFF-LINE • information in the global index is not dropped • if a peer has been marked as OFF-LINE continuously for a time TDead, it is assumed that the peer has left the community permanently: • all information about the peer is dropped OFF-LINE Rejoining of peers: • gossiping (the peer status is marked as ON-LINE) ON-LINE Membership

Content search and ranking • PlanetP implements a Content Ranking Algorithm that • uses the Vector Space Ranking Model • Vector Space Ranking Model • each document and query is abstractly represented as a vector • each dimension is associated with a distinct term • the value of each component of a vector is the weight representing the importance of that term to the corresponding document or query • Relevance of the document and query • the cosine of the angle between the query vector and document vector Q is a query, D is a document, t is a term, wQ,T – the weight of the term t for the query Q, wD,t – the weight of the term t for the document D

Content Ranking Algorithm - Background • TFxIDF is a popular method for term weight assignments • TF is a Term Frequency • how often the term appears in the document • IDF is a Inverse Document Frequency • the inverse of how often this term appears in the entire collection • This technique allows to balance: • the fact that terms frequently used in a document are likely important to describe its meaning • terms that appear in many documents in a collection are not useful for differentiating between these documents

PlanetP’s proposed solution Approximation TFxIDF two sub-problems: 1. Ranking peers according to their likelihood of having relevant documents 2. Deciding on the number of peers to contact and ranking identified documents Approximation TFxIDF • Problems of the computation TFxIDF for a P2P community: • documents are distributed across a P2P community • the peer bandwidth is restricted to use How this technique can be used for a P2P community ?

Ranking of peers Querying Fred Fred Doc - A 0,93 Bob Bob Doc- E Query 0,79 Doc - D 0,77 Tan PlanetP’s Content Search Steps: • Ranking peers 2. Querying the most relevant peers Ranking results Global directory name rank

Ranking peers • 1. Ranking peers • new measure similar IDF Inverse Peer Frequency (IPF) • a term that is present in the bloom filter of every peer is not useful for differentiating between the peers for a particular query • IPF for a term t N – is the number of all peers Nt– is the number of peers having the term t Ranking peers for a query Q:

Quering peers 2. Querying the most relevant peers • Problem: As communities grow, it becomes infeasible to contact large subsets of peers for each query • Solution: for a query Q,the user specifies a limit K on the number of potential documents that should be presented • PlanetP sorts ranked peer lists and contacts peers (the most relevant peers) • Each contacted peer returns a set of document URLs together with their relevance

Ranking results ranked documents name rank Doc - A 0,93 K Doc- E 0,79 p 0,77 Doc - D stop • Simulation results showed that p should be a function of the community size N and K as follows: stopping heuristic: C0, C1, C2 – are constant values Quering peers 2. Querying the most relevant peers • PlanetP stops contacting peers when the documents identified bypconsecutive peers fail to contribute to the top K ranked documents Sorted peer list

Performance Performance Study: • Content Search • Efficacy • Gossiping • Time • Bandwidth Usage • Performance Study was based on the developed simulator • The simulator was validated against measurements taken from prototype (up to several hundred peers)

Content search and ranking algorithm • Metrics: Recall (R), Precision (P) • Input data: The collection AP89 was extracted from the TREC collection (Associated Press) • No. Docs = 84678 • No. Unique Terms = 129603 • Different document-to-peer distribution: • Uniform (The worst case for a distributed search) • Weibull (7% of the users in the Gnutella community share more files than all the rest together)

Content search and ranking algorithm T.W is a search engine using TFxIDF (centralized implementation) P.W is the PlanteP’s search engine (Weibull distribution of documents) P.U is the PlanteP’s search engine (Uniform distribution of documents) T.W T.W P.W P.W P.U P.U Recall Precision No. documents requested No. documents requested

Content search and ranking algorithm T.W is a search engine using TFxIDF (centralized implementation) P.W is the PlanteP’s search engine (Weibull distribution of documents) P.U is the PlanteP’s search engine (Uniform distribution of documents) community of 400 peers stopping heuristic No. peers contacted Recall Number of peers No. documents requested

Observations • PlanetP tracks the performance of the centralized implementation closely • Performance is independent of how the shared documents are distributed. • PlanetP’s recall and precision is within 11% of TFxIDF’s implementation. 2. PlanetP scales well for communities of up to 1000 peers, maintaining a relatively constant recall and precision. 3. PlanetP’s stopping heuristic allows to maintain the close recall and precision independently of how the documents are distributed.

Gossiping algorithm Measured factor:propagation time Time required to propagate a single Bloom filter everywhere. LAN-AE: Peers use only push anti-entropy: each peer periodically push a summary of its data structure. The target requests all new information from this summary. LAN: Peers use PlanetP’s gossiping algorithm. Network:45 Mbps • LAN-AE • LAN (PlanetP) PlanetP’s parameters Time (sec) No. peers

Gossiping algorithm PlanetP’s gossiping algorithm Measured factor:propagation time, average bandwidth Network:512 Kbps Different gossiping intervals: 10 sec (DSL-10) 30 sec (DSL-30) 60 sec (DSL-60) • DSL- 60 • DSL- 30 • DSL- 10 • DSL- 60 • DSL- 30 • DSL- 10 Average Bandwidth (Bytes/s) Time (sec) No. peers No. peers

Observations 1. The algorithm significantly outperforms ones that use only push anti-entropy for both propagation time and network volume 2. Propagation time is a logarithmic function of community size 3. Total number of bytes sent is very modest, implying that gossiping is very scalable 4. We can easily trade off propagation time against gossiping bandwidth by increasing or decreasing the gossiping interval

Group extension – Gossiping attenuated BF Group B Group C Group A • Community is divided into a number of groups • Peers within the same group operate as described above • Peers from different groups will gossip an attenuated Bloom filter that is a summary of the global index for their groups

query ranked list Name Peer rank PC3 0.84 PC2 0.77 • the peer PA1queries to a random peerPCi from the group C • the peer PCi returns a ranked list of peers in the group C Group extension – Content search Group C Group A PA1 PC2 PCi PC3 Search: • a peer PA1 (group A) try to find documents that are relevant to a query Q • the Bloom filter of the group C contains relevant terms to a query Q

Related works • Tapestry, Pastry, Chord and CAN • use distributed hash tables (DHT): key – value • provide search mechanisms based on the key Problem:The high cost of publishing thousands of keys per file • Cori and Closs • address the problems of database selection and ranking fusion on distributed collections • use servers to keep a reduced index of the contents Problems:The need of centralized resources The possibility of a single point failure

Summary Content addressable publish/ subscribe service for unstructured P2P communities • Gossiping algorithm • provides the propagation of shared information everywhere • provides the robustness to the dynamic peer behavior • Content search and ranking algorithm • provides the search capabilities comparable with centralized resources • operates independently of how documents are distributed throughout the community The first work that supports content ranking.

PlanetP Questions ?

PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities

PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities

Presentation Transcript

Peer to peer

Peer-to-Peer Knowledge Sharing

Comparing Peer to Peer File Sharing Technologies

PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities

Peer to Peer File Sharing

Coordinating Peer-to-Peer information sources

Peer to peer and file sharing applications

File sharing in peer to peer Netwoks

Virtual communities, Peer-to-Peer Networks

Cross-Community: Peer-to-Peer sharing of healthcare information

Motivating Cooperation in Peer-to-Peer Communities

Can Peer-to-Peer File - sharing be of Help for Research Communities ?

Peer-to-Peer Information Systems

Peer to Peer Information Retrieval

Cross-Community: Peer-to-Peer sharing of healthcare information

Coordinating Peer-to-Peer information sources

Peer-to-Peer Knowledge Sharing

Secure Peer-to-Peer File Sharing

How to Custom-Build your Peer to Peer Car Sharing App?

How to Custom-Build your Peer to Peer Car Sharing App?