
K-means algorithm


Presentation Transcript


  1. Speech and Image Processing Unit, School of Computing, University of Eastern Finland. K-means algorithm. Clustering Methods: Part 2a. Pasi Fränti

  2. K-means overview
     • Well-known clustering algorithm
     • Number of clusters must be chosen in advance
     • Strengths:
       • Vectors can flexibly change clusters during the process.
       • Always converges to a local optimum.
       • Quite fast for most applications.
     • Weaknesses:
       • Quality of the output depends on the initial codebook.
       • Global optimum solution not guaranteed.

  3. K-means pseudo code
     X = {x1, …, xN}: a set of N data vectors (the data set)
     CI: k initial cluster centroids (e.g. chosen at random)
     C: the cluster centroids of the resulting k-clustering
     P = {p(i) | i = 1, …, N}: the cluster label of each vector in X

     KMEANS(X, CI) → (C, P)
       C ← CI
       REPEAT
         Cprevious ← C
         FOR all i ∈ [1, N] DO                  // generate new optimal partitions
           p(i) ← arg min 1 ≤ j ≤ k d(xi, cj)
         FOR all j ∈ [1, k] DO                  // generate optimal centroids
           cj ← average of all xi for which p(i) = j
       UNTIL C = Cprevious
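  The pseudocode translates almost line for line into Python. The sketch below assumes NumPy and adds one guard the pseudocode leaves unspecified (what to do when a cluster loses all of its vectors); it is a minimal illustration, not the authors' implementation.

     import numpy as np

     def kmeans(X, CI):
         # K-means as in the pseudocode: X is an (N, D) array of data
         # vectors, CI a (k, D) array of initial centroids.
         # Returns (C, P): final centroids and the cluster label of each vector.
         C = np.asarray(CI, dtype=float).copy()
         while True:
             C_previous = C.copy()
             # Generate new optimal partitions: p(i) = arg min_j d(xi, cj).
             dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
             P = dist.argmin(axis=1)
             # Generate optimal centroids: cj = average of xi with p(i) = j.
             for j in range(len(C)):
                 members = X[P == j]
                 if len(members) > 0:   # guard: keep old centroid if cluster emptied
                     C[j] = members.mean(axis=0)
             if np.array_equal(C, C_previous):   # UNTIL C = Cprevious
                 return C, P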

  4. K-means example (1/4)
     [Figure: six data vectors A–F on a 2-D grid, with the three initial centroids marked]
     Data set X: a set of N data vectors, N = 6 (points A–F)
     Number of clusters: k = 3
     CI: k initial cluster centroids, chosen at random
     Initial codebook: c1 = C, c2 = D, c3 = E

  5. K-means example (2/4)
     [Figure: distance matrix (Euclidean distances from A–F to c1, c2, c3) and the resulting partition]
     Generate optimal partitions, then generate optimal centroids.
     After 1st iteration: MSE = 9.0

  6. K-means example (3/4)
     [Figure: updated distance matrix (Euclidean distances from A–F to c1, c2, c3) and partition]
     Generate optimal partitions, then generate optimal centroids.
     After 2nd iteration: MSE = 1.78

  7. K-means example (4/4)
     [Figure: final partition and distance matrix (Euclidean distances from A–F to c1, c2, c3)]
     Generate optimal partitions: no object moves, so the algorithm stops.
     After 3rd iteration: MSE = 0.31
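  The coordinates in the slide figures do not survive the transcript, so the points below are invented stand-ins; the run only reproduces the flavor of the example above. Here MSE is taken to be the mean squared distance from each vector to its assigned centroid, and kmeans() is the sketch after slide 3.

     def mse(X, C, P):
         # Mean squared error of a clustering: average squared
         # distance from each vector to its assigned centroid.
         return float(np.mean(np.sum((X - C[P]) ** 2, axis=1)))

     # Hypothetical stand-ins for the six points A..F.
     X = np.array([[1.0, 1.0], [2.0, 1.0],   # A, B
                   [4.0, 5.0], [5.0, 5.0],   # C, D
                   [5.0, 6.0], [8.0, 6.0]])  # E, F
     CI = X[[2, 3, 4]]                       # initial codebook: c1 = C, c2 = D, c3 = E
     C, P = kmeans(X, CI)
     print(P, mse(X, C, P))                  # converged partition and its MSE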

  8. Counter example
     [Figure: three panels (1–3) showing successive iterations from a poor initial codebook]
     Initial codebook: c1 = A, c2 = B, c3 = C
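  With the same stand-in data, the counter example's point is easy to reproduce: start all three centroids on one side of the data and k-means converges to a worse local optimum than the run above. A sketch, reusing kmeans() and mse():

     CI_bad = X[[0, 1, 2]]          # poor initial codebook: c1 = A, c2 = B, c3 = C
     C_bad, P_bad = kmeans(X, CI_bad)
     print(mse(X, C_bad, P_bad))    # higher MSE than the good start above, for this data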

  9. Two ways to improve k-means
     • Repeated k-means (sketched below)
       • Try several random initializations and take the best.
       • Multiplies processing time.
       • Works for easier data sets.
     • Better initialization
       • Use some better heuristic to allocate the initial distribution of the code vectors.
       • Designing a good initialization is no easier than designing a good clustering algorithm in the first place!
       • K-means can (and should) anyway be applied to fine-tune the result of another method.
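  A sketch of the repeated variant from the first bullet, reusing kmeans() and mse() from above. The function name, its signature, and the choice of random data vectors as initial centroids are illustrative assumptions, not from the slides.

     def repeated_kmeans(X, k, repeats=10, seed=None):
         # Run k-means from several random initial codebooks and
         # keep the result with the lowest MSE.
         rng = np.random.default_rng(seed)
         best = None
         for _ in range(repeats):
             CI = X[rng.choice(len(X), size=k, replace=False)]  # random data vectors as centroids
             C, P = kmeans(X, CI)
             err = mse(X, C, P)
             if best is None or err < best[0]:
                 best = (err, C, P)
         return best   # (best MSE, centroids, labels)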

  10. References
     • Forgy, E.W. (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21, 768–769.
     • MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (eds L.M. Le Cam & J. Neyman), Vol. 1, pp. 281–297. Berkeley, CA: University of California Press.
     • Hartigan, J.A. and Wong, M.A. (1979) A k-means clustering algorithm. Applied Statistics 28, 100–108.
     • Xu, M. (2005) K-Means Based Clustering and Context Quantization. Academic dissertation, University of Joensuu, Computer Science.
