This survey presents an overview of parallel clustering algorithms, including various approaches, challenges, and applications. It discusses the parallel implementation of different clustering algorithms and their practical uses.
Parallel Clustering Algorithms: Survey
Presented by Wooyoung Kim, 4/22/09
CSc 8530 Parallel Algorithms, Spring 2009, Dr. Sushil K. Prasad
Outline
• General Information of Clustering Analysis
• Various Approaches to Data Clustering and Applications
• Parallel Implementation of Clustering Algorithms
• Applications of Parallel Clustering Algorithms
• Applying Clustering Algorithms for Parallelization
• Discussion
General Information of Clustering Analysis
• Clustering: “Organization of a collection of patterns into clusters based on similarity”.
• Unsupervised learning with unlabeled data.
• Non-predictive learning task.
General Information of Clustering Analysis
General steps in clustering algorithms:
• Pattern representation
• Measurements appropriate to the data domain
• Clustering or grouping
• Data abstraction
• Assessment of output
General Information of Clustering Analysis
Desirable features in clustering algorithms:
• Scalability
• Robustness
• Order insensitivity
• Minimum user-specified input
• Mixed data types
• Arbitrary-shaped clusters
• Point proportion admissibility: duplicating the data set and re-clustering should not change the results.
General Information of Clustering Analysis
Challenges in clustering algorithms:
• Most clustering algorithms need a number of repetitions or trials.
• There is no universal guide to feature selection or extraction.
• There are no universal validation criteria for the quality of the results.
• No standard solution exists.
Various Approaches to Data Clustering and Applications
Cross-cutting aspects of various clustering algorithms:
• Agglomerative vs. divisive
• Monothetic vs. polythetic
• Hard vs. fuzzy
• Deterministic vs. stochastic: optimization techniques
• Incremental vs. non-incremental
Various Approaches to Data Clustering and Applications
Classification of clustering algorithms:
• Partitioning clustering algorithms
• Hierarchical clustering algorithms
• Evolutionary clustering algorithms
• Density-based clustering algorithms
• Model-based clustering algorithms
• Graph-based clustering algorithms
Parallel Implementation of Clustering Algorithms
• Three strategies for parallelism:
• Independent parallelism: each processor accesses the whole data set and operates individually, with no communication.
• Task parallelism: each processor applies a different function to the data (or to a partition of it).
• SPMD parallelism: each processor executes the same algorithm on its own block of data and exchanges results with the others.
• Parallel clustering: a combination of task and SPMD parallelism.
Parallel Implementation of Clustering Algorithms
• Partitioning Clustering Algorithms
Sequential K-means algorithm
• Randomly choose k cluster centers.
• Assign each data point to a cluster.
• Recalculate the centers.
• Repeat the assignment and recalculation steps.
• One iteration usually takes O(nmk) time.
Parallel K-means algorithm
• Master/slave message passing
  - Divide the data into P blocks
  - Randomly form k subsets to distribute
• SIMD with hypercube network
  - Divide the data for efficient communication
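The master/slave scheme above can be sketched roughly as follows. This is a minimal illustration, not the implementation surveyed in the talk: the function names (`block_stats`, `parallel_kmeans`) are hypothetical, threads stand in for message-passing processors, and the initial centers are passed in (rather than chosen randomly) so the run is reproducible. It also assumes the usual reading of O(nmk) as n points of m dimensions with k clusters.

```python
from concurrent.futures import ThreadPoolExecutor

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def block_stats(block, centers):
    """Worker step: per-cluster coordinate sums and counts for one block."""
    k, m = len(centers), len(centers[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for point in block:
        j = nearest(point, centers)
        counts[j] += 1
        for d in range(m):
            sums[j][d] += point[d]
    return sums, counts

def parallel_kmeans(data, centers, num_blocks=2, max_iter=100):
    """Master step: split the data into blocks, merge worker results, update centers."""
    centers = [list(c) for c in centers]
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    k, m = len(centers), len(centers[0])
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        for _ in range(max_iter):
            # Each worker assigns its own block; only sums and counts come back.
            results = list(pool.map(lambda b: block_stats(b, centers), blocks))
            new_centers = []
            for j in range(k):
                count = sum(r[1][j] for r in results)
                if count == 0:
                    new_centers.append(centers[j])  # keep the center of an empty cluster
                else:
                    new_centers.append(
                        [sum(r[0][j][d] for r in results) / count for d in range(m)])
            if new_centers == centers:  # no center moved: converged
                break
            centers = new_centers
    return centers
```

The point of the design is that workers return only k sums and k counts per block, so the data exchanged per iteration does not grow with the number of points.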
Parallel Implementation of Clustering Algorithms
• Hierarchical Clustering Algorithms
Sequential hierarchical algorithms
- BIRCH: CF-tree structure. O(n)
- Single/complete linkage: construct the MST first.
Parallel hierarchical algorithms
- PBIRCH: SPMD model with message passing. Each slave constructs its own CF-tree and exchanges the centers.
- MST construction with a hypercube network connection.
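To make the "construct the MST first" idea concrete for single linkage, here is a small sequential sketch. It relies on the standard fact that cutting the k-1 heaviest edges of a minimum spanning tree leaves the k single-linkage clusters. The function names are hypothetical, Prim's algorithm is one possible choice for building the MST, and nothing here models the hypercube variant.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, u, v) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def single_linkage(points, k):
    """Cut the k-1 heaviest MST edges; the remaining components are the clusters."""
    kept = sorted(mst_edges(points))[:len(points) - k]
    root = list(range(len(points)))   # union-find parent array

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path compression
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(len(points))]
```

Points that share a returned label belong to the same cluster; the label values themselves are arbitrary representatives.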
Parallel Implementation of Clustering Algorithms
• Evolutionary Clustering Algorithms
Evolutionary Strategy (ES)
• Choose a random population of solutions.
• Use the evolutionary operators to generate the next population.
• Repeat until the required solution is found.
• ES is used to optimize an objective function.
Parallel Evolutionary Strategy
• Master/slave model
• The master maintains the parent solutions and passes them to the slaves, which perform the evolutionary operators.
• The slaves return the new solutions to the master.
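One way the master/slave evolutionary loop could look is sketched below. Everything specific is an assumption made for illustration: a solution is a list of k cluster centers, the objective is the sum of squared distances to the nearest center, the only operator is Gaussian mutation, selection is (mu + lambda), and threads stand in for the slaves. The names `mutate_and_score` and `parallel_es` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def cost(centers, data):
    """Objective: sum of squared distances from each point to its nearest center."""
    return sum(min(sum((p - c) ** 2 for p, c in zip(x, ctr)) for ctr in centers)
               for x in data)

def mutate_and_score(task):
    """Slave: perturb one parent with Gaussian noise and evaluate the child."""
    parent, data, sigma, seed = task
    rng = random.Random(seed)
    child = [[c + rng.gauss(0.0, sigma) for c in ctr] for ctr in parent]
    return cost(child, data), child

def parallel_es(data, k, mu=4, lam=12, sigma=0.5, generations=40, seed=0):
    """Master: keep mu parents, farm out lam mutations per generation, select the best."""
    rng = random.Random(seed)
    parents = [[list(rng.choice(data)) for _ in range(k)] for _ in range(mu)]
    scored = sorted(((cost(p, data), p) for p in parents), key=lambda t: t[0])
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(generations):
            tasks = [(scored[i % mu][1], data, sigma, rng.randrange(2 ** 32))
                     for i in range(lam)]
            children = list(pool.map(mutate_and_score, tasks))
            # (mu + lambda) selection: parents compete with their offspring.
            scored = sorted(scored + children, key=lambda t: t[0])[:mu]
    return scored[0]   # (best cost, best centers)
```

Because parents survive into the selection pool, the best cost never gets worse from one generation to the next.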
Parallel Implementation of Clustering Algorithms
• Density-based Clustering Algorithms
DBSCAN
• A cluster is grown from one data point by including all of its neighbors.
• The threshold is user-specified.
PDBSCAN
• Master/slave model
• Data placement: distribute the input by dividing it into several blocks.
• Each slave performs DBSCAN.
• The local clusters are merged into global clusters.
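The local step that each slave would run can be sketched as a plain sequential DBSCAN. This is a textbook-style version with a brute-force neighbor search, not the PDBSCAN code, and it does not attempt the merging of local clusters. The parameter names `eps` and `min_pts` are the conventional ones for the user-specified thresholds and are an assumption here.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force eps-neighborhood (includes the point itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels
```

For example, `dbscan(points, eps=1.5, min_pts=3)` labels every point of a dense group with the same cluster index and marks isolated points with -1.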
Parallel Implementation of Clustering Algorithms
• Model-based Clustering Algorithms
AutoClass
• Each cluster is represented by a different probability distribution.
• The assignment of each data item to the clusters is expressed as probabilities.
P-AutoClass
• SPMD master/slave model
• Divide the input among the processors.
• Update the parameters of the classification.
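The idea of probabilistic assignment plus a block-wise parameter update can be illustrated with one EM step for a one-dimensional Gaussian mixture. This is a stand-in chosen for brevity and is not AutoClass or P-AutoClass itself; the function names are hypothetical, and the blocks are processed in an ordinary loop where a parallel version would give each block to its own processor and then combine the statistics.

```python
import math

def responsibilities(x, weights, means, variances):
    """Soft assignment: probability that x belongs to each cluster."""
    dens = [w * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
            for w, mu, var in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]

def block_statistics(block, weights, means, variances):
    """Per-processor step: sufficient statistics for one block of the input."""
    k = len(weights)
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    for x in block:
        for j, r in enumerate(responsibilities(x, weights, means, variances)):
            s0[j] += r            # soft count
            s1[j] += r * x        # weighted sum
            s2[j] += r * x * x    # weighted sum of squares
    return s0, s1, s2

def em_step(blocks, weights, means, variances):
    """Combine the block statistics (the reduction) and update the parameters."""
    k = len(weights)
    stats = [block_statistics(b, weights, means, variances) for b in blocks]
    s0 = [sum(s[0][j] for s in stats) for j in range(k)]
    s1 = [sum(s[1][j] for s in stats) for j in range(k)]
    s2 = [sum(s[2][j] for s in stats) for j in range(k)]
    n = sum(s0)
    new_means = [s1[j] / s0[j] for j in range(k)]
    # Floor the variance so a tight cluster cannot collapse to zero width.
    new_vars = [max(s2[j] / s0[j] - new_means[j] ** 2, 1e-6) for j in range(k)]
    return [s0[j] / n for j in range(k)], new_means, new_vars
```

Only three numbers per cluster leave each block, which is what makes the update cheap to combine across processors. The sketch assumes every cluster keeps a nonzero soft count.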
Parallel Implementation of Clustering Algorithms
• Graph-based Clustering Algorithms
Graph clustering algorithms
• Clique-based
• Center-based
Distributed clustering algorithm
• Master/slave model
• The master selects the cluster headers.
• Each node is assigned to a processor.
• The processors communicate with each other.
• Each node decides its role together with its neighbors.
Applications of Parallel Clustering Algorithms
• Image segmentation
• Partitioning circuits in VLSI: bottom-up clique algorithm
• Laying out the sensors in a wireless sensor network
• PaCE: parallel and fast clustering method for large-scale EST (Expressed Sequence Tags) data: bottom-up hierarchical clustering
• General laboratory use: paraKMeans
Applying Clustering Algorithms for Parallelization
• Clustering is useful for divide-and-conquer strategies.
• Partitioning patterns for efficient computation of nearest neighbors
• Task-to-processor mapping
• Finding protein motif sequences in large data sets
Discussion
• Most parallelizations are based on the master/slave model.
• A hierarchical algorithm using the MST structure can be implemented in parallel with hypercube network connections.
• Architectural issues
• Distributed vs. shared memory
• Interconnection topology of processors
• Optimal communication strategies
• Load balancing
• Memory usage and optimization
• I/O impact on algorithm performance