Odysseas Papapetrou*, Wolf Siberski*, Norbert Fuhr#. * L3S Research Center, University of Hannover, Germany. # Universität Duisburg-Essen, Germany. PCP2P: Probabilistic Clustering for P2P networks. 32nd European Conference on Information Retrieval
Odysseas Papapetrou*, Wolf Siberski*, Norbert Fuhr# • * L3S Research Center, University of Hannover, Germany • # Universität Duisburg-Essen, Germany • PCP2P: Probabilistic Clustering for P2P networks • 32nd European Conference on Information Retrieval, 28th-31st March 2010, Milton Keynes, UK
Introduction • Why text clustering? • Find related documents • Browse documents by topic • Extract summaries • Build keyword clouds • … • Why text clustering in P2P? • An efficient and effective method for IR in P2P • New application area: Social networking - find peers with related interests • When files are distributed, it is too expensive to collect them at a central server
Preliminaries • Distributed Hash Tables (DHTs) • Functionality of a hash table: put(key, value) and get(key) • Peers are organized in a ring structure • DHT Lookup: O(log n) messages • Figure: get(key) is routed along the ring to the peer responsible for hash(key), here 47
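The put/get interface above can be illustrated with a minimal single-process sketch. `ToyDHT`, its ring size, and the choice of SHA-1 are assumptions made for illustration only; a real DHT routes each request through O(log n) peers, whereas this stand-in resolves the responsible peer locally.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """Single-process stand-in for a ring-structured DHT (no real networking)."""

    def __init__(self, peer_ids, ring_size=2**16):
        self.ring_size = ring_size
        self.peers = sorted(peer_ids)             # peer positions on the ring
        self.store = {p: {} for p in self.peers}  # each peer's local hash table

    def _hash(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.ring_size

    def responsible_peer(self, key):
        # First peer clockwise from hash(key); wrap around past the last peer.
        idx = bisect_left(self.peers, self._hash(key))
        return self.peers[idx % len(self.peers)]

    def put(self, key, value):
        self.store[self.responsible_peer(key)].setdefault(key, []).append(value)

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key, [])


dht = ToyDHT([47, 1200, 30000, 51000])  # hypothetical peer positions
dht.put("politics", "cluster1")
print(dht.get("politics"))  # ['cluster1']
print(dht.get("missing"))   # []
```

Storing a list per key lets several values share one key, which is what an inverted index over terms needs.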
Preliminaries • K-Means • Create k random clusters • Compare each document to all cluster vectors/centroids • Assign the document to the cluster with the highest similarity, e.g., cosine similarity
allClusters ← initializeRandomClusters(k)
repeat
  for document d in my documents do
    for Cluster c in allClusters do
      sim ← cosineSimilarity(d, c)
    end for
    assign(d, cluster with max sim)
  end for
until cluster centroids converge
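A runnable sketch of the K-Means loop, under two assumptions that the slide does not specify: the initial centroids are k randomly sampled documents, and convergence is detected when no assignment changes. Documents are dense lists of term weights; the function names are illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def kmeans_cosine(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: seed the centroids with k randomly chosen documents.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iter):
        # Compare each document to all centroids and keep the most similar one.
        new_assignment = [
            max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs
        ]
        if new_assignment == assignment:  # nothing moved: treat as converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its member documents.
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = kmeans_cosine(docs, k=2)
print(assignment)
```

The inner comparison against all k centroids is the step that becomes expensive once centroids live on different peers.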
PCP2P • An unoptimized distributed K-Means • Assign maintenance of each cluster to one peer: Cluster holders • Peer P wants to cluster its document d • Send d to all cluster holders • Cluster holders compute cosine(d,c) • P assigns d to the cluster with max. cosine, and notifies the cluster holder • Problem • Each document is sent to all cluster holders • Network cost: O(|docs| · k) • Cluster holders get overloaded
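To make the O(|docs| · k) cost concrete, here is a small sketch of the unoptimized assignment step for one document. The message-counting convention (one message per document shipped to a holder, plus one notification, replies not counted) is an assumption for illustration, as are the function names and the sparse term-weight dictionaries.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_naive(doc, centroids):
    """Send doc to every cluster holder; return (best cluster id, messages sent)."""
    messages = 0
    best_id, best_sim = None, -1.0
    for cid, centroid in centroids.items():
        messages += 1                # doc shipped to this cluster holder
        sim = cosine(doc, centroid)  # computed at the holder in the real system
        if sim > best_sim:
            best_id, best_sim = cid, sim
    messages += 1                    # notify the winning cluster holder
    return best_id, messages


centroids = {
    "c1": {"politics": 3.0, "merkel": 2.0},
    "c2": {"chicken": 2.0, "cream": 1.0},
}
doc = {"politics": 1.0, "germany": 1.0}
print(assign_naive(doc, centroids))  # ('c1', 3)
```

Every document costs k + 1 messages under this convention, so the total grows with the product of collection size and number of clusters.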
PCP2P • Approximation to reduce the network cost… • Compare each document only with the most promising clusters • Observation: A cluster and a document about the same topic will share some of the most frequent topic terms, e.g., Topic “Economy”: crisis, shares, financial, market, … • Use these most frequent terms as rendezvous terms between the documents and the clusters of each topic
PCP2P • Approximation to reduce the network cost… • Cluster inverted index: frequent cluster terms → summaries • Cluster summary • <Cluster holder IP address, frequent cluster terms, length> • E.g. <132.11.23.32, (politics,157),(merkel,149), 3211> • Figure: summary(cluster1) is added to the index entries for “politics” and “merkel”
PCP2P • Approximation to reduce the network cost… • Cluster inverted index: frequent cluster terms → summaries • Figure: summary(cluster2) is added to the index entries for “chicken”, “cream”, and “risotto”
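The cluster inverted index can be sketched as follows. A plain dictionary stands in for the DHT, and `build_summary`, `publish_summary`, and `top_n` are hypothetical names. The slide does not define how the `length` field is computed, so it is passed through as an opaque value.

```python
from collections import Counter, defaultdict


def build_summary(holder_address, term_counts, length, top_n):
    """Return <holder address, top-n frequent terms with frequencies, length>."""
    top_terms = tuple(Counter(term_counts).most_common(top_n))
    return (holder_address, top_terms, length)


def publish_summary(index, summary):
    """Add the summary to the index entry of each of its frequent terms."""
    _, top_terms, _ = summary
    for term, _freq in top_terms:
        index[term].append(summary)  # stands in for one DHT put(term, summary)


index = defaultdict(list)
cluster1 = build_summary(
    "132.11.23.32", {"politics": 157, "merkel": 149, "the": 3}, 3211, top_n=2
)
publish_summary(index, cluster1)
print(sorted(index))  # ['merkel', 'politics']
```

Only the top terms are published, so the number of index entries per cluster stays fixed no matter how large the cluster vocabulary grows.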
PCP2P • Approximation to reduce the network cost… • Pre-filtering step: Efficiently locate the most promising centroids from the DHT and the rendezvous terms • Look up the most frequent terms only → candidate clusters • Send d to only these clusters for comparing • Assign d to the most similar cluster • Figure: DHT lookups “Which clusters published ‘politics’?” and “Which clusters published ‘germany’?” return the summaries of cluster1, cluster7, and cluster4, which form the candidate clusters; cosine values shown: 0.3, 0.2, 0.4
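A minimal sketch of the pre-filtering step, assuming the inverted index is available as a plain dictionary from term to a list of cluster summaries (in the real system each access would be a DHT lookup). The function name, `top_m`, and the example summaries are illustrative.

```python
def candidate_clusters(doc_term_counts, index, top_m):
    """Look up the document's top_m most frequent terms; return candidate summaries."""
    top_terms = sorted(doc_term_counts, key=doc_term_counts.get, reverse=True)[:top_m]
    candidates = {}
    for term in top_terms:
        for summary in index.get(term, []):   # one DHT get(term) per lookup term
            candidates[summary[0]] = summary  # keyed by holder address: no duplicates
    return candidates


# Hypothetical summaries: (holder address, frequent terms, length)
s1 = ("10.0.0.1", (("politics", 157), ("germany", 90)), 3211)
s4 = ("10.0.0.4", (("germany", 120), ("football", 80)), 2100)
s7 = ("10.0.0.7", (("politics", 60), ("usa", 55)), 1500)
s2 = ("10.0.0.2", (("chicken", 40), ("cream", 35)), 900)
index = {"politics": [s1, s7], "germany": [s1, s4], "chicken": [s2], "cream": [s2]}

doc = {"politics": 5, "germany": 4, "cream": 1, "tax": 1}
print(sorted(candidate_clusters(doc, index, top_m=2)))
# ['10.0.0.1', '10.0.0.4', '10.0.0.7']
```

The document is then compared only against these candidates, so the number of lookups depends on `top_m` rather than on the total number of clusters.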
PCP2P • Approximation to reduce the network cost… • Probabilistic guarantees in the paper: • The optimal cluster will be included in the candidate set with high probability • Desired correctness probability → # top indexed terms per cluster, # top lookup terms per document • The cost is the minimum that satisfies the desired correctness probability
PCP2P • How to reduce comparisons even further… • Do not compare with all clusters in the candidate set • Full comparison step filtering • Use the summaries collected from the DHT to estimate the cosine similarity for all clusters in the candidate set • Use estimations to filter out unpromising clusters → send d only to the remaining ones • Assign d to the cluster with the maximum cosine similarity
PCP2P • Full comparison step filtering… • Estimate cosine similarity ECos(d,c), for all c in the candidate set • Send d to the cluster c with maximum ECos and obtain the exact Cos(d,c) • Remove c and all clusters with ECos < Cos(d,c) from the candidate set • Repeat until the candidate set is empty • Assign d to the best cluster • Figure: candidate clusters with estimates cluster1: ECos 0.4, cluster7: ECos 0.2, cluster4: ECos 0.5; exact cosines shown: 0.38 and 0.37
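The filtering loop can be sketched as below. `exact_cosine` is a hypothetical callback standing for one round trip to a cluster holder. The example reuses the ECos values from the slide; pairing the exact cosines 0.38 and 0.37 with cluster1 and cluster4, and the value for cluster7, are assumptions. The pruning only preserves the best cluster if ECos never underestimates the true cosine.

```python
def assign_with_filtering(estimates, exact_cosine):
    """estimates: {cluster_id: ECos}. Returns (best cluster id, clusters contacted)."""
    remaining = dict(estimates)
    best_id, best_cos = None, -1.0
    contacted = []
    while remaining:
        cid = max(remaining, key=remaining.get)  # most promising estimate first
        del remaining[cid]
        cos = exact_cosine(cid)                  # full comparison at the holder
        contacted.append(cid)
        if cos > best_cos:
            best_id, best_cos = cid, cos
        # Drop clusters whose estimate is below the exact cosine just obtained.
        remaining = {c: e for c, e in remaining.items() if e >= cos}
    return best_id, contacted


estimates = {"cluster1": 0.4, "cluster7": 0.2, "cluster4": 0.5}
exact = {"cluster1": 0.38, "cluster4": 0.37, "cluster7": 0.10}  # assumed pairing
print(assign_with_filtering(estimates, exact.get))
# ('cluster1', ['cluster4', 'cluster1'])
```

In this run cluster7 is never contacted: its estimate of 0.2 falls below the exact 0.37 obtained from cluster4, which saves one full comparison out of three.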
PCP2P • Full comparison step filtering… • Two filtering strategies • Conservative • Compute an upper bound for ECos → always correct • Zipf-based • Estimate ECos assuming that the cluster terms follow a Zipf distribution • Introduces a small number of errors • Clusters filtered out more aggressively → further cost reduction • Details and proofs in the paper…
Evaluation • Evaluation objectives • Clustering quality • Entropy and Purity • Approximation quality (# of misclustered documents) • Cost and scalability • Number of messages, Transfer volume • Number of comparisons • Control parameters • Number of peers, documents, clusters • Desired probabilistic guarantees • Document collection: • Reuters (100 000 documents) • Synthetic (up to 1 million documents) created using generative topic models • Baselines • LSP2P: State-of-the-art in P2P clustering based on gossiping • DKMeans: Unoptimized distributed K-Means
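For reference, the two quality measures can be computed as in the sketch below. It uses the common textbook definitions (size-weighted cluster entropy with log base 2, no normalization); whether the paper normalizes entropy differently is not stated on the slide, so treat the exact scaling as an assumption.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against ground-truth class labels."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, class_labels):
        by_cluster[c].append(y)
    n = len(class_labels)
    purity = 0.0
    entropy = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        size = len(members)
        purity += max(counts.values()) / n  # share of the dominant class
        h = -sum((m / size) * math.log2(m / size) for m in counts.values())
        entropy += (size / n) * h           # weight by cluster size
    return purity, entropy


print(purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))  # perfect clustering
print(purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"]))  # one mixed cluster
```

Higher purity and lower entropy both indicate clusters that align more closely with the ground-truth classes.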
Evaluation – Clustering quality • Figure panels: Entropy (lower is better); # misclustered documents (lower is better) • Both conservative and Zipf-based strategies closely approximate K-Means • Conservative is always better than Zipf-based • Correctness probability is always satisfied • High dimensionality + large networks → LSP2P not suitable!
Evaluation – Network Cost • Figure axes: Correctness probability; Network size • Both conservative and Zipf-based have substantially lower cost than DKMeans • Zipf-based filters out the clusters more aggressively → more efficient than conservative • Cost of PCP2P scales logarithmically with network size
Evaluation – Network cost/scalability • More results in the paper: • Quality • Independent of network and dataset size • Independent of number of clusters • Independent of collection characteristics (Zipf exponent) • Cost • Similar results for transfer volume and # document-cluster comparisons • Cost reduction even more substantial for a higher number of clusters • PCP2P cost decreases with the collection's characteristic exponent (the Zipf exponent of the documents) • Load balancing does not affect scalability
Conclusions • Efficient and scalable text clustering for P2P networks with probabilistic guarantees • Pre-filtering strategy: rendezvous points on frequent terms • Two full-comparison filtering strategies • Conservative filtering • Zipf-based filtering • Outperforms the current state of the art in P2P clustering • Approximates K-Means quality at a fraction of the cost • Current work • Apply the core ideas of PCP2P to different clustering algorithms, and to different application scenarios • e.g., more efficient centralized text clustering based on an inverted index
Thank you… Questions?
Load Balancing • Load at Cluster Holders • Maintaining the cluster centroids (computational) • Compute cosine similarities (networking + computational) • To avoid overloading, delegate the comparison task: • Helper cluster holders • Include their contact details in the summary • Each helper takes over some comparisons • Cluster size → #helpers
Additional experiments • Experimental configuration • Reuters dataset • 10000 peers, 20% churn per iteration