Speech and Image Processing Unit, School of Computing, University of Eastern Finland
K-means algorithm
Clustering Methods: Part 2a
Pasi Fränti
K-means overview
• Well-known clustering algorithm
• Number of clusters must be chosen in advance
• Strengths:
  • Vectors can flexibly change clusters during the process.
  • Always converges to a local optimum.
  • Quite fast for most applications.
• Weaknesses:
  • Quality of the output depends on the initial codebook.
  • Global optimum solution not guaranteed.
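As a quick illustration of the first two points (k must be fixed up front, and the result depends on the initial codebook), here is a minimal sketch using scikit-learn, assuming it is available; the data and the seeds are made up for illustration and are not from the slides:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(1).random((100, 2))   # made-up 2-D data set

    for seed in (0, 1, 2):
        # k is given in advance; one random initialization per run,
        # so different seeds may converge to different local optima.
        km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
        print(seed, km.inertia_)                     # SSE of the resulting codebook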
K-means pseudo code

X: a set of N data vectors (the data set)
CI: k initialized cluster centroids (random initial centroids; k = number of clusters)
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster labels of X

KMEANS(X, CI) → (C, P)
  C ← CI;
  REPEAT
    Cprevious ← C;
    FOR all i ∈ [1, N] DO                      // generate new optimal partitions
      p(i) ← arg min 1 ≤ j ≤ k d(xi, cj);
    FOR all j ∈ [1, k] DO                      // generate optimal centroids
      cj ← average of xi whose p(i) = j;
  UNTIL C = Cprevious
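A minimal NumPy sketch of this pseudocode follows; it is not the authors' code, and the function and variable names are mine. The handling of empty clusters is an assumption, since the pseudocode leaves that case open.

    import numpy as np

    def kmeans(X, CI):
        # K-means following the pseudocode above.
        # X  : (N, d) array of data vectors
        # CI : (k, d) array of initial centroids
        # Returns (C, P): final centroids and the cluster label of each vector.
        C = np.asarray(CI, dtype=float).copy()
        while True:
            C_previous = C.copy()
            # Generate new optimal partitions: nearest centroid by squared Euclidean distance.
            distances = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            P = distances.argmin(axis=1)
            # Generate optimal centroids: average of the vectors assigned to each cluster.
            # (Assumption: an empty cluster keeps its previous centroid.)
            C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                          for j in range(len(C))])
            if np.allclose(C, C_previous):   # UNTIL C = Cprevious
                return C, P

For instance, with the six vectors A–F of the example below stored as rows 0–5 of X, calling kmeans(X, X[[2, 3, 4]]) would start from the initial codebook c1 = C, c2 = D, c3 = E.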
(1/4) K-means example
Data set X: a set of N = 6 data vectors (points A–F in the figure).
Number of clusters k = 3.
Random initial centroids (initial codebook CI): c1 = C, c2 = D, c3 = E.
[Figure: the six points A–F plotted in two dimensions with the three initial centroids c1, c2, c3.]
(2/4) K-means example
Generate optimal partitions: assign each of A–F to its nearest centroid using the distance matrix (Euclidean distance) against c1, c2, c3.
Generate optimal centroids.
After 1st iteration: MSE = 9.0.
[Figure: partition and updated centroids after the first iteration.]
(3/4) K-means example
Generate optimal partitions: recompute the distance matrix (Euclidean distance) between A–F and the updated centroids c1, c2, c3.
Generate optimal centroids.
After 2nd iteration: MSE = 1.78.
[Figure: partition and updated centroids after the second iteration.]
(4/4) K-means example
Generate optimal partitions: distance matrix (Euclidean distance) between A–F and c1, c2, c3.
Generate optimal centroids.
No object moves: stop.
After 3rd iteration: MSE = 0.31.
[Figure: final partition and centroids after the third iteration.]
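As a rough illustration of how MSE figures like those above are computed, here is a short sketch. The coordinates and the partition below are made up, since the actual point positions are only shown in the figures, and the MSE is taken here as the mean squared Euclidean distance of each vector to its assigned centroid; the exact normalisation used in the slides may differ.

    import numpy as np

    def mse(X, C, P):
        # Mean squared Euclidean distance of each vector to its assigned centroid.
        return float(((X - C[P]) ** 2).sum(axis=1).mean())

    # Hypothetical 6-point data set (A..F) and a 3-cluster partition,
    # NOT the actual coordinates used in the slides.
    X = np.array([[1.0, 1.0], [2.0, 1.0],    # A, B
                  [4.0, 5.0], [5.0, 5.0],    # C, D
                  [8.0, 6.0], [8.0, 5.0]])   # E, F
    P = np.array([0, 0, 1, 1, 2, 2])
    C = np.array([X[P == j].mean(axis=0) for j in range(3)])
    print(mse(X, C, P))                       # 0.25 for these made-up points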
Counter example (iterations 1–3)
Initial codebook: c1 = A, c2 = B, c3 = C.
[Figure: three iterations of K-means from this initial codebook, illustrating convergence to a poor local optimum.]
Two ways to improve k-means
• Repeated k-means
  • Try several random initializations and take the best (a sketch follows after this list).
  • Multiplies the processing time.
  • Works for easier data sets.
• Better initialization
  • Use a better heuristic to allocate the initial distribution of code vectors.
  • Designing a good initialization is not any easier than designing a good clustering algorithm in the first place!
  • K-means can (and should) anyway be applied to fine-tune the result of another method.
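A sketch of repeated k-means, reusing the kmeans and mse sketches given earlier; the function name, the number of repeats, the seed, and the choice of random data vectors as initial centroids are all my assumptions, not the slides':

    import numpy as np

    def repeated_kmeans(X, k, repeats=10, seed=0):
        # Run k-means from several random initial codebooks and keep the best result.
        rng = np.random.default_rng(seed)
        best = None
        for _ in range(repeats):
            CI = X[rng.choice(len(X), size=k, replace=False)]   # k random data vectors as initial centroids
            C, P = kmeans(X, CI)       # kmeans() from the sketch after the pseudocode
            err = mse(X, C, P)         # mse() from the sketch after the example
            if best is None or err < best[0]:
                best = (err, C, P)
        return best

scikit-learn's KMeans offers the same idea through its n_init parameter, which runs several initializations and keeps the codebook with the lowest SSE.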