
PCP2P: Probabilistic Clustering for P2P networks



  1. Odysseas Papapetrou* Wolf Siberski* Norbert Fuhr# * L3S Research Center, University of Hannover, Germany # Universität Duisburg-Essen, Germany PCP2P: Probabilistic Clustering for P2P networks 32nd European Conference on Information Retrieval 28th-31st March 2010, Milton Keynes, UK

  2. Introduction • Why text clustering? • Find related documents • Browse documents by topic • Extract summaries • Build keyword clouds • … • Why text clustering in P2P? • An efficient and effective method for IR in P2P • New application area: social networking (find peers with related interests) • When files are distributed → too expensive to collect at a central server

  3. Preliminaries • Distributed Hash Tables (DHTs) • Functionality of a hash table: put(key, value) and get(key) • Peers are organized in a ring structure • DHT lookup: O(log n) messages, e.g., get(key) → hash(key) = 47 → route to the peer responsible for ring position 47
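To make the interface concrete, a minimal self-contained Python sketch of the put/get functionality PCP2P relies on; SimulatedDHT and ring_position are illustrative names, and a real DHT resolves each call with O(log n) routing messages instead of a local dictionary:

    import hashlib

    def ring_position(key, ring_size=2**16):
        # Map a key onto the ring by hashing it (simplified; real DHTs
        # use consistent hashing over the full SHA-1 space)
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % ring_size

    class SimulatedDHT:
        # Centralized stand-in for a DHT; a real DHT stores each entry at the
        # peer responsible for ring_position(key) and routes there in O(log n) hops
        def __init__(self):
            self.store = {}

        def put(self, key, value):
            self.store.setdefault(ring_position(key), []).append(value)

        def get(self, key):
            return self.store.get(ring_position(key), [])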

  4. Preliminaries • K-Means • Create k random clusters • Compare each document to all cluster vectors/centroids • Assign the document to the cluster with the highest similarity, e.g., cosine similarity

    allClusters ← initializeRandomClusters(k)
    repeat
      for document d in my documents do
        for cluster c in allClusters do
          sim ← cosineSimilarity(d, c)
        end for
        assign(d, cluster with max sim)
      end for
    until cluster centroids converge
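As a runnable counterpart to the pseudocode, a minimal Python sketch of the assignment step; representing documents and centroids as sparse {term: weight} dicts is an assumption, not prescribed by the slide:

    import math

    def cosine_similarity(d, c):
        # d, c: sparse term-weight vectors as {term: weight} dicts
        dot = sum(w * c.get(t, 0.0) for t, w in d.items())
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        norm_c = math.sqrt(sum(w * w for w in c.values()))
        return dot / (norm_d * norm_c) if norm_d and norm_c else 0.0

    def assign_to_best_cluster(d, centroids):
        # K-Means assignment step: pick the centroid with the highest similarity
        return max(range(len(centroids)),
                   key=lambda i: cosine_similarity(d, centroids[i]))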

  5. PCP2P • An unoptimized distributed K-Means • Assign maintenance of each cluster to one peer: cluster holders • Peer P wants to cluster its document d • Send d to all cluster holders • Cluster holders compute cosine(d, c) • P assigns d to the cluster with max. cosine and notifies the cluster holder • Problem • Each document is sent to all cluster holders • Network cost: O(|docs| × k) • Cluster holders get overloaded
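A sketch of this unoptimized protocol from peer P's side; send and notify stand for network calls and are hypothetical names:

    def cluster_document_naive(d, cluster_holders, send, notify):
        # send(holder, d) ships d to one cluster holder, which returns cosine(d, c);
        # with k holders this is k messages per document, O(|docs| * k) overall
        sims = {holder: send(holder, d) for holder in cluster_holders}
        best = max(sims, key=sims.get)
        notify(best, d)  # tell the winning cluster holder to add d to its cluster
        return best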

  6. PCP2P • Approximation to reduce the network cost… • Compare each document only with the most promising clusters • Observation: a cluster and a document about the same topic will share some of the most frequent topic terms, e.g., topic “Economy”: crisis, shares, financial, market, … • Use these most frequent terms as rendezvous terms between the documents and the clusters of each topic

  7. PCP2P • Approximation to reduce the network cost… • Cluster inverted index: frequent cluster terms → summaries • Cluster summary: <cluster holder IP address, frequent cluster terms, length> • E.g., <132.11.23.32, (politics, 157), (merkel, 149), 3211> • The summary of cluster1 is added to the DHT under “politics” and under “merkel”
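A sketch of how a cluster holder might build and publish such a summary, reusing the SimulatedDHT from above; the summary field names and f (the number of indexed top terms) are illustrative assumptions:

    def top_terms(vector, f):
        # The f most frequent terms serve as rendezvous terms
        return sorted(vector, key=vector.get, reverse=True)[:f]

    def publish_summary(dht, holder_ip, cluster_id, centroid, f=2):
        terms = top_terms(centroid, f)
        summary = {
            "cluster": cluster_id,
            "holder": holder_ip,                      # where to send documents
            "top": {t: centroid[t] for t in terms},   # frequent terms + frequencies
            "length": sum(centroid.values()),         # cluster length
        }
        for t in terms:
            dht.put(t, summary)  # e.g., add summary(cluster1) under "politics"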

  8. PCP2P • Approximation to reduce the network cost… • Cluster inverted index: frequent cluster terms → summaries • E.g., the summary of cluster2 is added to the DHT under “chicken”, “cream”, and “risotto”

  9. PCP2P • Approximation to reduce the network cost… • Pre-filtering step: efficiently locate the most promising centroids using the DHT and the rendezvous terms • Look up the most frequent terms only → candidate clusters • Send d only to these clusters for comparison • Assign d to the most similar cluster • Example: the lookups “Which clusters published ‘politics’?” and “Which clusters published ‘germany’?” return the summaries of cluster1, cluster7, and cluster4; these form the candidate clusters, and the full comparisons (e.g., Cos = 0.3, 0.2, 0.4) decide the assignment
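The pre-filtering step under the same assumptions: look up only the document's most frequent terms and union the returned summaries into the candidate set (reusing top_terms and the simulated DHT from the earlier sketches):

    def prefilter_candidates(dht, d, num_lookup_terms=2):
        # Look up only d's most frequent terms; each get() returns the
        # summaries of clusters that published that rendezvous term
        candidates = {}
        for term in top_terms(d, num_lookup_terms):
            for summary in dht.get(term):
                candidates[summary["cluster"]] = summary
        return candidates  # only these clusters receive d for full comparison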

  10. PCP2P • Approximation to reduce the network cost… • Probabilistic guarantees in the paper: • The optimal cluster is included in the candidate set with high probability • Desired correctness probability → # top indexed terms per cluster, # top lookup terms per document • The cost is the minimum that satisfies the desired correctness probability

  11. PCP2P • How to reduce comparisons even further… • Do not compare d with all clusters in the candidate set • Full comparison step filtering • Use the summaries collected from the DHT to estimate the cosine similarity for all clusters in the candidate set • Use the estimates to filter out unpromising clusters → send d only to the remaining clusters • Assign d to the cluster with the maximum cosine similarity

  12. PCP2P • Full comparison step filtering… • Estimate the cosine similarity ECos(d, c) for all c in the candidate set • Send d to the cluster with maximum ECos • Remove all clusters with ECos < Cos(d, best cluster so far) • Repeat until the candidate set is empty • Assign d to the best cluster • Example: with ECos(cluster1) = 0.4, ECos(cluster7) = 0.2, ECos(cluster4) = 0.5, d is first sent to cluster4 (Cos = 0.37); cluster7 is pruned (0.2 < 0.37), cluster1 is fully compared (Cos = 0.38) and wins
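Putting the loop together, a sketch of the filtering step; ecos stands for whichever estimator is in use (conservative or Zipf-based), and send ships d to a cluster holder for the exact cosine:

    def full_comparison_with_filtering(d, candidates, ecos, send):
        # candidates: {cluster_id: summary}; ecos(d, summary) estimates cosine
        remaining = dict(candidates)
        best_cluster, best_cos = None, -1.0
        while remaining:
            # Send d to the candidate with the highest estimated cosine
            cid = max(remaining, key=lambda c: ecos(d, remaining[c]))
            summary = remaining.pop(cid)
            cos = send(summary["holder"], d)  # exact cosine from the cluster holder
            if cos > best_cos:
                best_cluster, best_cos = cid, cos
            # Prune every candidate whose estimate cannot beat the best exact cosine
            remaining = {c: s for c, s in remaining.items()
                         if ecos(d, s) >= best_cos}
        return best_cluster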

  13. PCP2P • Full comparison step filtering… • Two filtering strategies • Conservative • Compute an upper bound for ECos → always correct • Zipf-based • Estimate ECos assuming that the cluster terms follow a Zipf distribution • Introduces a small number of errors • Clusters filtered out more aggressively → further cost reduction • Details and proofs in the paper… (a hedged sketch of the conservative idea follows below)
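For intuition only, a sketch in the spirit of the conservative strategy; the paper's actual bound differs in detail. Any cluster term not published in the summary can be no more frequent than the smallest published one, which upper-bounds the dot product, while the published terms alone lower-bound the cluster norm, so the quotient upper-bounds the cosine:

    import math

    def ecos_conservative(d, summary):
        # Upper bound on cosine(d, c) computable from the summary alone
        top = summary["top"]      # {term: frequency} of the top cluster terms
        min_top = min(top.values())
        dot_ub = sum(w * top.get(t, min_top) for t, w in d.items())
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        norm_c_lb = math.sqrt(sum(f * f for f in top.values()))
        return dot_ub / (norm_d * norm_c_lb) if norm_d and norm_c_lb else 0.0

Because the true cosine can never exceed this value, filtering with it never discards the optimal cluster; the Zipf-based variant replaces min_top with smaller Zipf-extrapolated frequencies, pruning more aggressively at the risk of a few errors.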

  14. Evaluation • Evaluation objectives • Clustering quality • Entropy and purity • Approximation quality (# of misclustered documents) • Cost and scalability • Number of messages, transfer volume • Number of comparisons • Control parameters • Number of peers, documents, clusters • Desired probabilistic guarantees • Document collections: • Reuters (100,000 documents) • Synthetic (up to 1 million documents), created using generative topic models • Baselines • LSP2P: state of the art in P2P clustering, based on gossiping • DKMeans: unoptimized distributed K-Means

  15. Evaluation – Clustering quality • [Plots: entropy (lower is better) and # of misclustered documents (lower is better)] • Both the conservative and the Zipf-based strategy closely approximate K-Means • Conservative is always better than Zipf-based • The correctness probability is always satisfied • High dimensionality + large networks → LSP2P is not suitable!

  16. Evaluation – Network cost • [Plots: network cost vs. correctness probability and vs. network size] • Both conservative and Zipf-based have substantially lower cost than DKMeans • Zipf-based filters out clusters more aggressively → more efficient than conservative • The cost of PCP2P scales logarithmically with network size

  17. Evaluation – Network cost/scalability • More results in the paper: • Quality • Independent of network and dataset size • Independent of the number of clusters • Independent of collection characteristics (Zipf exponent) • Cost • Similar results for transfer volume and # of document-cluster comparisons • Cost reduction even more substantial for a higher number of clusters • PCP2P cost decreases as the Zipf exponent of the document collection increases • Load balancing does not affect scalability

  18. Conclusions • Efficient and scalable text clustering for P2P networks with probabilistic guarantees • Pre-filtering strategy: rendezvous points on frequent terms • Two full-comparison filtering strategies • Conservative filtering • Zipf-based filtering • Outperforms current state of the art in P2P clustering • Approximates K-Means quality with a fraction of the cost • Current work • Apply the core ideas of PCP2P to different clustering algorithms, and to different application scenarios • e.g., more efficient centralized text clustering based on an inverted index

  19. Thank you… Questions?

  20. Load Balancing • Load at cluster holders • Maintaining the cluster centroids (computational) • Computing cosine similarities (networking + computational) • To avoid overloading, delegate the comparison task: • Helper cluster holders • Include their contact details in the summary • Each helper takes over some comparisons • Cluster size → # helpers
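A sketch of the delegation idea with hypothetical names: helpers are listed in the cluster summary, comparisons are spread across them, and the helper count grows with cluster size:

    import random

    def comparison_target(summary):
        # Route a cosine computation to a random helper if any are listed,
        # otherwise to the cluster holder itself
        helpers = summary.get("helpers", [])
        return random.choice(helpers) if helpers else summary["holder"]

    def num_helpers(cluster_size, docs_per_helper=1000):
        # Larger clusters recruit more helpers (cluster size -> # helpers)
        return cluster_size // docs_per_helper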

  21. Additional experiments

  22. Additional experiments

  23. Additional experiments • Experimental configuration • Reuters dataset • 10000 peers, 20% churn per iteration
