Learn about clustering and k-means clustering, with a focus on Python implementation. Explore various clustering methods and the limitations of k-means, such as handling clusters with differing sizes, densities, and non-globular shapes.
CSE 482: Big Data Analysis Lecture 12 (Cluster Analysis)
Outline • What is Clustering? • K-means Clustering • Cluster validation
Problem Definition • Given a collection of data instances • Each instance is characterized by an attribute set x • Task: • Partition the data so that instances in the same partition (cluster) are more similar to each other than to instances in other partitions (clusters)
Lots of Clustering Methods available • Partitional clustering • K-means and its variants (self-organized maps, bisecting k-means, fuzzy k-means, kernel k-means) • Spectral clustering • Support vector clustering • Density-based clustering • Hierarchical clustering • Agglomerative (Single-link, complete-link, group average, Ward’s method)
K-means Clustering Partitional clustering approach Each cluster is associated with a centroid (center) Number of clusters, K, must be specified
Inputs of K-means algorithm • Proximity measure • Euclidean distance • Cosine similarity • Correlation coefficient • Number of clusters, K • Input data to be clustered
Outputs of K-means algorithm • Cluster membership for each data instance • Centroid vectors • Each contains average values of the features associated with instances assigned to the cluster
Example: Movie Ratings • Users U1 to U6 partitioned into Cluster 1 and Cluster 2, each cluster summarized by its centroid of average ratings
Python K-means Example • Use the scikit-learn package: from sklearn import cluster • Steps • Load the input data • Create the cluster object: k_means = cluster.KMeans(n_clusters=K) • Apply k-means clustering to the data: k_means.fit(data) • Obtain the centroids and cluster labels: centroids = k_means.cluster_centers_ and labels = k_means.labels_
Python Example Movie Ratings
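The code on this slide is not reproduced, so below is a minimal sketch of the steps listed above applied to a made-up movie-ratings matrix; the user ratings for U1 to U6 and the choice of two clusters are illustrative assumptions, while the scikit-learn calls are the ones from the previous slide.

import numpy as np
from sklearn import cluster

# Made-up movie-ratings matrix: one row per user (U1..U6), one column per movie
data = np.array([[5, 4, 5, 1, 1],    # U1
                 [4, 5, 4, 2, 1],    # U2
                 [5, 5, 4, 1, 2],    # U3
                 [1, 2, 1, 5, 4],    # U4
                 [2, 1, 1, 4, 5],    # U5
                 [1, 1, 2, 5, 5]])   # U6

# Create the cluster object with the desired number of clusters
k_means = cluster.KMeans(n_clusters=2, random_state=0)

# Apply k-means clustering to the data
k_means.fit(data)

# Obtain the centroids and the cluster label of each user
centroids = k_means.cluster_centers_    # 2 x 5 array of average ratings per cluster
labels = k_means.labels_                # e.g. array([0, 0, 0, 1, 1, 1])
print(centroids)
print(labels)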
Python Example: Applying the fitted clustering model to new data points
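A short continuation of the sketch above: once k_means has been fit, scikit-learn's predict method assigns each new instance to the cluster with the nearest centroid. The two new users and their ratings are again made-up values.

import numpy as np

# Two new users rated on the same five movies (hypothetical ratings)
new_data = np.array([[5, 5, 4, 1, 2],
                     [2, 1, 2, 5, 4]])

# predict() returns the index of the closest centroid for each new instance
new_labels = k_means.predict(new_data)    # reuses the k_means object fit above
print(new_labels)                         # e.g. array([0, 1])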
K-means Clustering – Details • Designed to minimize the sum of squared errors (SSE) • SSE is the sum of the squared distances from every data point x_i to its cluster centroid c_j • SSE can be used to help determine the number of clusters
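Written out with the d(., .) distance notation used on the next slide, where C_j is the set of instances assigned to cluster j and c_j is its centroid:

SSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, c_j)^2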
Sum of Squared Error (SSE) • In the movie ratings example, each user's squared distance to its cluster centroid, e.g. d(U1, Cluster 1)², contributes to the total SSE
Sum of Squared Error (SSE) • Plot SSE against the number of clusters K and use the “elbow” of the curve to identify a suitable number of clusters, as in the sketch below
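A minimal sketch of producing such a curve with scikit-learn; the data matrix reuses the made-up ratings from earlier, the tested range of K is arbitrary, and inertia_ is the attribute holding the SSE of a fitted model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster

# Made-up ratings matrix (same illustrative data as before)
data = np.array([[5, 4, 5, 1, 1], [4, 5, 4, 2, 1], [5, 5, 4, 1, 2],
                 [1, 2, 1, 5, 4], [2, 1, 1, 4, 5], [1, 1, 2, 5, 5]])

# Fit k-means for several values of K and record the SSE of each solution
sse = []
for k in range(1, 6):
    km = cluster.KMeans(n_clusters=k, random_state=0).fit(data)
    sse.append(km.inertia_)      # inertia_ = sum of squared distances to centroids

# Plot SSE against K; the "elbow" suggests a reasonable number of clusters
plt.plot(range(1, 6), sse, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('SSE')
plt.show()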
K-means Clustering – Details Initial centroids are often chosen randomly. Clusters produced may vary from one run to another. Complexity is O( n * K * I * d ) n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
10 Clusters Example Initial Centroids
10 Clusters Example Final Clustering
10 Clusters Example K-means++ Initial Centroids
10 Clusters Example Final Clustering
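The figures for the two runs above are not reproduced, but the contrast between random and k-means++ initialization can be recreated with a sketch like the one below; the synthetic 10-cluster data from make_blobs and the single-initialization setting (n_init=1) are assumptions made so the effect of the initialization is visible.

from sklearn import cluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 10-clusters example: 10 well-separated blobs
data, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

# Purely random initial centroids, a single run: quality varies between runs
km_random = cluster.KMeans(n_clusters=10, init='random', n_init=1,
                           random_state=1).fit(data)

# k-means++ initial centroids: spread apart, usually a better final clustering
km_pp = cluster.KMeans(n_clusters=10, init='k-means++', n_init=1,
                       random_state=1).fit(data)

print('SSE with random init:   ', km_random.inertia_)
print('SSE with k-means++ init:', km_pp.inertia_)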
Limitations of K-means • K-means has problems when clusters have differing sizes, densities, or non-globular shapes • K-means also has problems when the data contains outliers
Limitations of K-means: Differing Sizes (original points vs. K-means with 3 clusters)
Limitations of K-means: Differing Density (original points vs. K-means with 3 clusters)
Limitations of K-means: Non-globular Shapes (original points vs. K-means with 2 clusters)
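As a concrete illustration of the non-globular case, a small sketch using scikit-learn's two-moons generator (a stand-in for the original figure, not the data used in the slides):

from sklearn import cluster
from sklearn.datasets import make_moons

# Two interleaving half-circles: clearly two groups, but not globular
data, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means with K=2 splits the points roughly down the middle instead of
# recovering the two crescent-shaped clusters
km_labels = cluster.KMeans(n_clusters=2, random_state=0).fit_predict(data)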
Python Example • Image segmentation • Partition an input image into clusters
Python Example img is 600 x 800 ndarray data is 480000 ndarray
Summary • Today’s lecture • Clustering • k-means clustering • Python implementation • Next lecture • Cluster validation • Hierarchical clustering