790 likes | 810 Views
Explore the efficiency of a random swap algorithm for adjusting centroids in the K-means clustering method. The algorithm iterates through random swaps, reallocates vectors, and creates new clusters to improve clustering. We analyze time complexity, iterations, and swap types, using examples from image quantization and theoretical estimations.
E N D
Pasi Fränti Random Swap algorithm 24.4.2018 P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018.
Definitions and data Set of N data points: X={x1, x2, …, xN} Partition of the data: P={p1, p2, …, pk}, Set of k cluster prototypes (centroids): C={c1, c2, …, ck},
Clustering problem Objective function: Optimality of partition: Optimality of centroid:
K-means algorithm X = Data set C = Cluster centroids P = Partition K-Means(X, C) → (C, P) REPEAT Cprev←C; FOR i=1 TO N DOpi← FindNearest(xi, C); FOR j=1 TO k DO cj← Average of xipi = j; UNTIL C = Cprev Optimal partition Optimal centoids
Pigeon hole principle CI = Centroid index: P. Fränti, M. Rezaei and Q. Zhao "Centroid index: cluster level similarity measure” Pattern Recognition, 47 (9), 3034-3045, September 2014, 2014.
Steps of the swap 1. Random swap: 2. Re-allocate vectors from old cluster: 3. Create new cluster:
Local re-partition Re-allocate vectors Create new cluster
Iterate by k-means 1st iteration
Iterate by k-means 2nd iteration
Iterate by k-means 3rd iteration
Iterate by k-means 16th iteration
Iterate by k-means 17th iteration
Iterate by k-means 18th iteration
Iterate by k-means 19th iteration
Final result 25 iterations
Data sets visualizedImages House Bridge Miss America Europe RGB color 4x4 blocks 4x4 blocks frame differential Differential coordinates 16-d 16-d 3-d 2-d
Data sets visualizedArtificial G2-2-30 G2-2-50 G2-2-70 A1 A2 A3
Efficiency of the random swap Total time to find correct clustering: Time per iteration Number of iterations Time complexity of single iteration: Swap: O(1) Remove cluster: 2kN/k = O(N) Add cluster: 2N = O(N) Centroids: 2N/k + 2N/k + 2 = O(N/k) K-means: IkN = O(IkN) Bottleneck!
Efficiency of the random swap Total time to find correct clustering: Time per iteration Number of iterations Time complexity of single iteration: Swap: O(1) Remove cluster: 2kN/k = O(N) Add cluster: 2N = O(N) Centroids: 2N/k + 2N/k + 2 = O(N/k) (Fast) K-means: 4N = O(N) 2 iterations only! T. Kaukoranta, P. Fränti and O. Nevalainen "A fast exact GLA based on code vector activity detection" IEEE Trans. on Image Processing, 9 (8), 1337-1342, August 2000.
Estimated and observed steps Bridge N=4096, k=256, N/k=16, 8
Effect of K-means iterations 1 5 2 3 4 Bridge
Effect of K-means iterations 1 3 2 5 4 Birch2
Three types of swaps • Trial swap • Accepted swap • Successful swap MSE improves CI improves Before swap Accepted Successful CI=2 CI=2 CI=1 20.12 20.09 15.87
Number of swaps neededExample with 35 clusters A3 CI=4 CI=9
Statistical observations N=5000, k=15, d=2, N/k=333, 4.1
Statistical observations N=5000, k=15, d=2, N/k=333, 4.8
Probability of good swap • Select a proper prototype to remove: • There are k clusters in total: premoval=1/k • Select a proper new location: • There are N choices: padd=1/N • Only k are significantly different: padd=1/k • Both happens same time: • k2significantly different swaps. • Probability of each different swap is pswap=1/k2 • Open question: how many of these are good? p = (/k)(/k) = O(/k)2
Probability of not finding good swap: Expected number of iterations • Estimated number of iterations:
Observed probability (%) of fail N=5000, k=15, d=2, N/k=333, 4.5
Observed probability (%) of fail N=1024, k=16, d=32-128, N/k=64, 1.1
Bounds for the iterations Upper limit: Lower limit similarly; resulting in: