Learn about clustering and k-means clustering, with a focus on Python implementation. Explore various clustering methods and the limitations of k-means, such as handling clusters with differing sizes, densities, and non-globular shapes.
CSE 482: Big Data Analysis Lecture 12 (Cluster Analysis)
Outline • What is Clustering? • K-means Clustering • Cluster validation
Problem Definition • Given a collection of data instances • Each instance is characterized by an attribute set x • Task: • Partition the data so that instances in the same partition (cluster) are more similar to each other than to instances in other partitions (clusters)
Lots of Clustering Methods available • Partitional clustering • K-means and its variants (self-organized maps, bisecting k-means, fuzzy k-means, kernel k-means) • Spectral clustering • Support vector clustering • Density-based clustering • Hierarchical clustering • Agglomerative (Single-link, complete-link, group average, Ward’s method)
K-means Clustering Partitional clustering approach Each cluster is associated with a centroid (center) Number of clusters, K, must be specified
Inputs of K-means algorithm • Proximity measure • Euclidean distance • Cosine similarity • Correlation coefficient • Number of clusters, K • Input data to be clustered
Outputs of K-means algorithm • Cluster membership for each data instance • Centroid vectors • Each contains average values of the features associated with instances assigned to the cluster
Example: Movie Ratings • Users U1 to U6 partitioned into Cluster 1 and Cluster 2, each cluster summarized by its centroid of average ratings
Python K-means Example • Use the scikit-learn package: from sklearn import cluster • Steps • Load the input data • Create the cluster object: k_means = cluster.KMeans(n_clusters=K) • Apply k-means clustering to the data: k_means.fit(data) • Obtain the centroids and cluster labels: centroids = k_means.cluster_centers_ and labels = k_means.labels_
Python Example Movie Ratings
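The code on this slide is not reproduced, so below is a minimal sketch of the steps listed above applied to a made-up movie-ratings matrix; the user ratings for U1 to U6 and the choice of two clusters are illustrative assumptions, while the scikit-learn calls are the ones from the previous slide.

import numpy as np
from sklearn import cluster

# Made-up movie-ratings matrix: one row per user (U1..U6), one column per movie
data = np.array([[5, 4, 5, 1, 1],    # U1
                 [4, 5, 4, 2, 1],    # U2
                 [5, 5, 4, 1, 2],    # U3
                 [1, 2, 1, 5, 4],    # U4
                 [2, 1, 1, 4, 5],    # U5
                 [1, 1, 2, 5, 5]])   # U6

# Create the cluster object with the desired number of clusters
k_means = cluster.KMeans(n_clusters=2, random_state=0)

# Apply k-means clustering to the data
k_means.fit(data)

# Obtain the centroids and the cluster label of each user
centroids = k_means.cluster_centers_    # 2 x 5 array of average ratings per cluster
labels = k_means.labels_                # e.g. array([0, 0, 0, 1, 1, 1])
print(centroids)
print(labels)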
Python Example: Applying the fitted clustering model to new data points
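A short continuation of the sketch above: once k_means has been fit, scikit-learn's predict method assigns each new instance to the cluster with the nearest centroid. The two new users and their ratings are again made-up values.

import numpy as np

# Two new users rated on the same five movies (hypothetical ratings)
new_data = np.array([[5, 5, 4, 1, 2],
                     [2, 1, 2, 5, 4]])

# predict() returns the index of the closest centroid for each new instance
new_labels = k_means.predict(new_data)    # reuses the k_means object fit above
print(new_labels)                         # e.g. array([0, 1])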
K-means Clustering – Details • Designed to minimize the sum of squared errors (SSE) • SSE is the sum of the squared distances from every data point x_i to its cluster centroid c_j • SSE can be used to help determine the number of clusters
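Written out with the d(., .) distance notation used on the next slide, where C_j is the set of instances assigned to cluster j and c_j is its centroid:

SSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, c_j)^2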
Sum of Squared Error (SSE) • In the movie ratings example, each user's squared distance to its cluster centroid, e.g. d(U1, Cluster 1)², contributes to the total SSE
Sum of Squared Error (SSE) • Plot SSE against the number of clusters K and use the “elbow” of the curve to identify a suitable number of clusters, as in the sketch below
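A minimal sketch of producing such a curve with scikit-learn; the data matrix reuses the made-up ratings from earlier, the tested range of K is arbitrary, and inertia_ is the attribute holding the SSE of a fitted model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster

# Made-up ratings matrix (same illustrative data as before)
data = np.array([[5, 4, 5, 1, 1], [4, 5, 4, 2, 1], [5, 5, 4, 1, 2],
                 [1, 2, 1, 5, 4], [2, 1, 1, 4, 5], [1, 1, 2, 5, 5]])

# Fit k-means for several values of K and record the SSE of each solution
sse = []
for k in range(1, 6):
    km = cluster.KMeans(n_clusters=k, random_state=0).fit(data)
    sse.append(km.inertia_)      # inertia_ = sum of squared distances to centroids

# Plot SSE against K; the "elbow" suggests a reasonable number of clusters
plt.plot(range(1, 6), sse, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('SSE')
plt.show()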
K-means Clustering – Details Initial centroids are often chosen randomly. Clusters produced may vary from one run to another. Complexity is O( n * K * I * d ) n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
10 Clusters Example Initial Centroids
10 Clusters Example Final Clustering
10 Clusters Example K-means++ Initial Centroids
10 Clusters Example Final Clustering
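The figures for the two runs above are not reproduced, but the contrast between random and k-means++ initialization can be recreated with a sketch like the one below; the synthetic 10-cluster data from make_blobs and the single-initialization setting (n_init=1) are assumptions made so the effect of the initialization is visible.

from sklearn import cluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 10-clusters example: 10 well-separated blobs
data, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

# Purely random initial centroids, a single run: quality varies between runs
km_random = cluster.KMeans(n_clusters=10, init='random', n_init=1,
                           random_state=1).fit(data)

# k-means++ initial centroids: spread apart, usually a better final clustering
km_pp = cluster.KMeans(n_clusters=10, init='k-means++', n_init=1,
                       random_state=1).fit(data)

print('SSE with random init:   ', km_random.inertia_)
print('SSE with k-means++ init:', km_pp.inertia_)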
Limitations of K-means • K-means has problems when clusters have differing sizes, densities, or non-globular shapes • K-means also has problems when the data contains outliers
Limitations of K-means: Differing Sizes (original points vs. K-means with 3 clusters)
Limitations of K-means: Differing Density (original points vs. K-means with 3 clusters)
Limitations of K-means: Non-globular Shapes (original points vs. K-means with 2 clusters)
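As a concrete illustration of the non-globular case, a small sketch using scikit-learn's two-moons generator (a stand-in for the original figure, not the data used in the slides):

from sklearn import cluster
from sklearn.datasets import make_moons

# Two interleaving half-circles: clearly two groups, but not globular
data, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means with K=2 splits the points roughly down the middle instead of
# recovering the two crescent-shaped clusters
km_labels = cluster.KMeans(n_clusters=2, random_state=0).fit_predict(data)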
Python Example • Image segmentation • Partition an input image into clusters
Python Example img is 600 x 800 ndarray data is 480000 ndarray
Summary • Today’s lecture • Clustering • k-means clustering • Python implementation • Next lecture • Cluster validation • Hierarchical clustering