
Clustering Methods: K-means and Python Implementation

An introduction to clustering and k-means clustering, with a focus on Python implementation. Surveys the main families of clustering methods and the limitations of k-means when clusters differ in size or density or have non-globular shapes.


Presentation Transcript


  1. CSE 482: Big Data Analysis Lecture 12 (Cluster Analysis)

  2. Outline • What is Clustering? • K-means Clustering • Cluster validation

  3. Problem Definition • Given a collection of data instances • Each instance is characterized by an attribute set x • Task: • Partition the data so that instances in the same partition (cluster) are more similar to each other than to instances in other partitions (clusters)

  4. Examples of Clustering Task

  5. Lots of Clustering Methods Available • Partitional clustering • K-means and its variants (self-organizing maps, bisecting k-means, fuzzy k-means, kernel k-means) • Spectral clustering • Support vector clustering • Density-based clustering • Hierarchical clustering • Agglomerative (single-link, complete-link, group average, Ward’s method)

  6. K-means Clustering • Partitional clustering approach • Each cluster is associated with a centroid (center) • Number of clusters, K, must be specified

  7. Illustrating K-means

  8. Inputs of K-means algorithm • Proximity measure • Euclidean distance • Cosine similarity • Correlation coefficient • Number of clusters, K • Input data to be clustered
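
The three proximity measures listed on this slide can be sketched with NumPy as follows. The function names and the two rating vectors are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between x and y; smaller means more similar."""
    return float(np.linalg.norm(x - y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 1 means the same direction."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation(x, y):
    """Pearson correlation coefficient between the two attribute vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Two made-up rating vectors: u2 points in the same direction as u1
# but has twice the magnitude.
u1 = np.array([5.0, 4.0, 1.0, 1.0])
u2 = np.array([10.0, 8.0, 2.0, 2.0])

print(euclidean_distance(u1, u2))  # nonzero: the vectors differ in magnitude
print(cosine_similarity(u1, u2))   # close to 1: same direction
print(correlation(u1, u2))         # close to 1: perfectly linearly related
```

The example shows why the choice matters: the two vectors are far apart under Euclidean distance yet maximally similar under cosine similarity and correlation. Note that scikit-learn's `KMeans` works with Euclidean distance only.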

  9. Outputs of K-means algorithm • Cluster membership for each data instance • Centroid vectors • Each contains average values of the features associated with instances assigned to the cluster

  10. Example: Movie Ratings [Figure: users U1 to U6 grouped into Cluster 1 and Cluster 2, with the cluster centroids]

  11. Python K-means Example • Use the scikit-learn package: from sklearn import cluster • Steps • Load the input data • Create the cluster object: k_means = cluster.KMeans(n_clusters=K), where K is the number of clusters • Apply k-means clustering to the data: k_means.fit(data) • Obtain the centroids and cluster labels: centroids = k_means.cluster_centers_ and labels = k_means.labels_
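
The steps on this slide can be put together into a small runnable sketch. The ratings matrix below is invented for illustration (the lecture's actual data is not in the transcript), and `n_init` and `random_state` are set explicitly only so that repeated runs give the same result.

```python
import numpy as np
from sklearn import cluster

# Hypothetical ratings matrix: 6 users x 4 movies (values made up).
data = np.array([
    [5, 5, 1, 1],
    [4, 5, 2, 1],
    [5, 4, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
    [1, 2, 5, 5],
], dtype=float)

# Create the cluster object and fit it to the data.
k_means = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(data)

centroids = k_means.cluster_centers_  # shape (2, 4): one mean vector per cluster
labels = k_means.labels_              # shape (6,): cluster index of each user

print(labels)
print(centroids)
```

Each row of `centroids` is the average rating vector of the users assigned to that cluster. Which cluster receives index 0 and which receives index 1 is arbitrary.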

  12. Python Example Movie Ratings

  13. Python Example

  14. Python Example: Applying the clustering to new data points
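
The code from this slide is not preserved in the transcript. One plausible way to assign new points to existing clusters is the `predict` method of a fitted `KMeans` object, sketched here on made-up data; the manual nearest-centroid computation is included only to show what the call does.

```python
import numpy as np
from sklearn import cluster

# Hypothetical training ratings (4 users x 4 movies, values made up).
train = np.array([
    [5, 5, 1, 1],
    [4, 5, 2, 1],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
], dtype=float)
k_means = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# Two new users who were not part of the training data.
new_users = np.array([
    [5, 4, 1, 1],
    [1, 2, 5, 5],
], dtype=float)
new_labels = k_means.predict(new_users)

# The same assignment done by hand: pick the centroid with the
# smallest Euclidean distance to each new point.
dists = np.linalg.norm(
    new_users[:, None, :] - k_means.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

print(new_labels, nearest)
```

The centroids are not updated by `predict`; the new points are only labeled.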

  15. K-means Clustering – Details • Designed to minimize the sum of squared error (SSE) • SSE is the sum of squared distances between each data point x_i and its cluster centroid c_j • SSE can be used to determine the number of clusters
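
As a small worked example of the SSE definition, the sketch below computes it directly with NumPy. The function name and the four data points are hypothetical.

```python
import numpy as np

def sum_of_squared_error(data, labels, centroids):
    """Sum over all points of the squared distance to the assigned centroid."""
    diffs = data - centroids[labels]  # each point minus its own centroid
    return float(np.sum(diffs ** 2))

# Four made-up 2-D points in two clusters.
data = np.array([[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of its members: (1, 2) and (9, 8).
centroids = np.array([data[labels == j].mean(axis=0) for j in range(2)])

sse = sum_of_squared_error(data, labels, centroids)
print(sse)  # each point is at distance 1 from its centroid, so SSE = 4.0
```

In scikit-learn, the same quantity is available on a fitted `KMeans` object as the `inertia_` attribute.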

  16. Sum of Squared Error (SSE) [Figure: movie ratings and centroids, showing the squared distance d(U1, Cluster 1)^2]

  17. Sum of Squared Error (SSE) • Use the “elbow” of the SSE-versus-K curve to identify the number of clusters
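
A minimal sketch of the elbow approach, assuming synthetic data with three well-separated groups (the blob centers, noise level, and range of K are all illustrative choices, not the lecture's):

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)

# Synthetic data: three well-separated 2-D blobs of 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several values of K and record the SSE of each fit.
sse_by_k = {}
for k in range(1, 7):
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse_by_k[k] = model.inertia_  # scikit-learn's name for the SSE

for k, sse in sse_by_k.items():
    print(k, round(sse, 1))
```

On data like this, the SSE drops sharply up to K = 3 and flattens afterwards, and that bend is the "elbow". On real data the elbow is often less clear-cut, so it is best treated as a heuristic.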

  18. K-means Clustering – Details • Initial centroids are often chosen randomly • Clusters produced may vary from one run to another • Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
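
To make the random initialization and the O(n * K * I * d) cost concrete, here is a from-scratch sketch of the basic k-means loop in NumPy. It is an illustrative implementation, not scikit-learn's; the function name, stopping rule, and test data are assumptions.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Basic k-means with random initial centroids (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for iteration in range(1, max_iter + 1):
        # Assignment step: n * k distances, each over d attributes.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, iteration

# Made-up data: two 2-D blobs of 30 points each.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

labels, centroids, n_iter = kmeans(data, k=2, seed=0)
print(n_iter)
print(centroids)
```

Each pass through the loop costs roughly n * K * d operations, and the loop runs I times, which gives the complexity quoted on the slide. Changing `seed` changes the starting centroids and can therefore change the final clustering.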

  19. Importance of Choosing Initial Centroids …

  20. 10 Clusters Example Initial Centroids

  21. 10 Clusters Example Final Clustering

  22. K-means++
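
The content of this slide is not preserved in the transcript. As a sketch of the standard k-means++ seeding idea, each new initial centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which tends to spread the initial centroids out. The function name and data below are hypothetical.

```python
import numpy as np

def kmeans_plus_plus_init(data, k, seed=0):
    """Choose k initial centroids with k-means++ style seeding (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # First centroid: a data point chosen uniformly at random.
    centroids = [data[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = data[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are more likely to be picked.
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(n, p=probs)])
    return np.array(centroids)

# Made-up data: three locations, each repeated five times.
data = np.array([[0.0, 0.0]] * 5 + [[10.0, 0.0]] * 5 + [[0.0, 10.0]] * 5)
init = kmeans_plus_plus_init(data, k=3)
print(init)
```

In this toy data set, points at an already-chosen location have zero distance and therefore zero probability of being picked again, so the three initial centroids always land on three different locations. scikit-learn's `KMeans` uses `init="k-means++"` by default.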

  23. 10 Clusters Example K-means++ Initial Centroids

  24. 10 Clusters Example Final Clustering

  25. Limitations of K-means • K-means has problems when clusters have • Differing sizes • Differing densities • Non-globular shapes • K-means also has problems when the data contains outliers

  26. Limitations of K-means: Differing Sizes [Figure panels: Original Points; K-means (3 Clusters)]

  27. Limitations of K-means: Differing Density [Figure panels: Original Points; K-means (3 Clusters)]

  28. Limitations of K-means: Non-globular Shapes [Figure panels: Original Points; K-means (2 Clusters)]
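
The non-globular case can be reproduced with a small experiment. The sketch below uses two synthetic concentric rings, which are an assumption standing in for the figure's data, and measures how well 2-cluster k-means recovers them.

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)
n = 200  # points per ring

# Two concentric rings: an inner ring of radius 1 and an outer ring of radius 5.
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([np.full(n, 1.0), np.full(n, 5.0)])
radii = radii + rng.normal(scale=0.1, size=2 * n)
data = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
true_ring = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

labels = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Fraction of points whose cluster matches their ring, allowing for the
# fact that cluster indices 0 and 1 may be swapped.
agreement = max(np.mean(labels == true_ring), np.mean(labels != true_ring))
print(agreement)
```

With two centroids, k-means splits the plane along a straight line, so it cannot place the inner ring in one cluster and the outer ring in the other; the agreement stays far below 1.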

  29. Python Example • Image segmentation • Partition an input image into clusters

  30. Python Example • img is a 600 x 800 ndarray • data is a 480000 ndarray
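
The segmentation code itself is not preserved in the transcript. A minimal sketch of the idea, assuming a color image with 3 channels and using a small synthetic image in place of a loaded file, is to flatten the image so that each pixel becomes one row, cluster the rows, and reshape the labels back into the image grid.

```python
import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)

# Synthetic stand-in for a loaded image: left half red, right half blue,
# plus a little noise. A real image would be read from a file instead.
height, width = 60, 80
img = np.zeros((height, width, 3), dtype=float)
img[:, : width // 2] = [1.0, 0.0, 0.0]
img[:, width // 2 :] = [0.0, 0.0, 1.0]
img += rng.normal(scale=0.02, size=img.shape)

# Flatten to one row per pixel: shape (height * width, 3).
data = img.reshape(-1, 3)

k_means = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Reshape the per-pixel labels back into the image grid.
segments = k_means.labels_.reshape(height, width)

# Replace every pixel by the centroid color of its cluster.
segmented_img = k_means.cluster_centers_[segments]

print(data.shape, segments.shape, segmented_img.shape)
```

The same reshaping applied to a 600 x 800 image yields 480000 rows, which matches the sizes quoted on the slide.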

  31. Python Example

  32. Python Example

  33. Python Example

  34. Summary • Today’s lecture • Clustering • k-means clustering • Python implementation • Next lecture • Cluster validation • Hierarchical clustering
