Pasi Fränti: Genetic Algorithms for clustering problem, 7.4.2016
General structure
Genetic Algorithm:
 Generate S initial solutions
 REPEAT Z iterations
  Select best solutions
  Create new solutions by crossover
  Mutate solutions
 END-REPEAT
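The loop above can be sketched in Python; the callback names (init, evaluate, select, crossover, mutate) are placeholders for the components discussed on the following slides, not names from the original material:

```python
import random

def genetic_algorithm(data, S, Z, init, evaluate, select, crossover, mutate):
    # Generate S initial solutions
    population = [init(data) for _ in range(S)]
    for _ in range(Z):                          # REPEAT Z iterations
        population.sort(key=evaluate)           # lowest distortion first
        parents = select(population)            # select best solutions
        children = [mutate(crossover(a, b)) for a, b in parents]
        population.extend(children)
        population.sort(key=evaluate)
        population = population[:S]             # survivors for next generation
    return population[0]                        # best solution found
```

Because the previous generation competes with its children for survival, the best solution can never get worse from one generation to the next.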
Components of GA • Representation of solution • Selection method • Crossover method (most critical!) • Mutation
Representation of solution • Partition (P): • Optimal centroids can be calculated from P • Only local changes can be made • Codebook (C): • Optimal partition can be calculated from C • Calculation of P takes O(NM) time (slow) • Combined (C, P): • Both data structures are needed anyway • Computationally more efficient
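As a rough sketch of the two conversions (function names are illustrative): computing the optimal centroids from a partition is a single O(N) pass, while computing the optimal partition from a codebook needs a nearest-centroid search, the O(NM) step noted above:

```python
def centroids_from_partition(X, P, M):
    # Optimal centroids: the mean of the points assigned to each cluster, O(N)
    dim = len(X[0])
    sums = [[0.0] * dim for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for d in range(dim):
            sums[p][d] += x[d]
    return [[s / counts[j] for s in sums[j]] if counts[j] else sums[j]
            for j in range(M)]

def partition_from_codebook(X, C):
    # Optimal partition: nearest centroid for every point, O(N*M) -- the slow step
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(C)), key=lambda j: dist2(x, C[j])) for x in X]
```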
Selection method • To select which solutions will be used in crossover for generating new solutions • Main principle: good solutions should be used rather than weak solutions • Two main strategies: • Roulette wheel selection • Elitist selection • Exact implementation not so important
Roulette wheel selection • Select two candidate solutions for the crossover randomly. • The probability of a solution being selected is weighted according to its distortion: the lower the distortion, the higher the probability.
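One common way to implement this weighting, assuming that lower distortion should mean a proportionally higher selection probability (the exact formula is not shown in the source), is inverse-distortion roulette:

```python
import random

def roulette_select(population, distortions, rng=random):
    # Weight each solution by inverse distortion: lower distortion -> larger slice
    weights = [1.0 / d for d in distortions]
    total = sum(weights)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for sol, w in zip(population, weights):
        acc += w
        if r <= acc:
            return sol
    return population[-1]

def roulette_pair(population, distortions, rng=random):
    # Select two (possibly identical) candidate solutions for crossover
    return (roulette_select(population, distortions, rng),
            roulette_select(population, distortions, rng))
```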
Elitist selection • Main principle: select all possible pairs among the best candidates. Elitist approach using zigzag scanning among the best solutions
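A minimal sketch of pairing every combination of the B best solutions; enumerating the pairs along anti-diagonals reproduces the zigzag idea of pairing the very best solutions first (B and the function name are illustrative):

```python
def elitist_pairs(population, distortions, B):
    # Rank solutions by distortion and take all pairs among the B best.
    ranked = [s for _, s in sorted(zip(distortions, population))]
    best = ranked[:B]
    # j is the index of the later partner; pairs among the very best come first
    return [(best[i], best[j]) for j in range(1, B) for i in range(j)]
```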
Crossover methods Different variants for crossover: • Random crossover • Centroid distance • Pairwise crossover • Largest partitions • PNN Local fine-tuning: • All methods give new allocation of the centroids. • Local fine-tuning must be made by K-means. • Two iterations of K-means is enough.
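The local fine-tuning step can be sketched as a fixed, small number of k-means iterations (two by default, as the slide suggests):

```python
def kmeans_tune(X, C, iterations=2):
    # A couple of k-means iterations to fine-tune centroids after crossover
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        P = [min(range(len(C)),
                 key=lambda j: sum((x[d] - C[j][d]) ** 2 for d in range(len(x))))
             for x in X]
        # Move each centroid to the mean of its assigned points
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C[j] = [sum(col) / len(members) for col in zip(*members)]
    return C
```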
Random crossover Select M/2 centroids randomly from each of the two parents. [Figure: two parent solutions combined (+) into a new solution]
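A minimal sketch, assuming each parent is a list of M centroids and M is even:

```python
import random

def random_crossover(A, B, rng=random):
    # Pick M/2 centroids at random from each parent codebook
    M = len(A)
    return rng.sample(A, M // 2) + rng.sample(B, M - M // 2)
```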
Random crossover example (M = 4)
[Figure: parent solutions A and B, centroids c1-c4 plotted over the data points]
How to create a new solution? Pick M/2 randomly chosen cluster centroids from each of the two parents in turn.
How many solutions are there? 36 possibilities for creating a new solution.
Probability to select a good one? Not high: some are good, but K-means is still needed; most are bad. Rough statistics: Optimal: 1, Good: 7, Bad: 28.
[Figure: parent solutions A and B, and three example child solutions: bad, good, and optimal]
Centroid distance crossover [Pan, McInnes, Jack, 1995: Electronics Letters] [Scheunders, 1996: Pattern Recognition Letters] • For each centroid, calculate its distance to the center point of the entire data set. • Sort the centroids according to this distance. • Divide them into two sets: central vectors (M/2 closest) and distant vectors (M/2 furthest). • Take the central vectors from one codebook and the distant vectors from the other.
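The four steps can be sketched as follows, shown here for variant (a), central vectors from parent A and distant vectors from parent B (the function name is illustrative):

```python
def centroid_distance_crossover(A, B, X):
    # Center point of the entire data set
    center = [sum(col) / len(X) for col in zip(*X)]
    def d2(c):
        return sum((ci - gi) ** 2 for ci, gi in zip(c, center))
    M = len(A)
    A_sorted = sorted(A, key=d2)   # central vectors first, distant vectors last
    B_sorted = sorted(B, key=d2)
    # Variant (a): central vectors from A, distant vectors from B
    return A_sorted[:M // 2] + B_sorted[M // 2:]
```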
Centroid distance crossover example (M = 4)
[Figure: parent solutions A and B; Ced marks the centroid of the entire data set]
1) Distances d(ci, Ced):
 A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
 B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)
2) Sort the centroids according to the distance:
 A: c4, c2, c1, c3
 B: c1, c3, c2, c4
3) Divide into two sets (M = 4):
 A: central vectors c4, c2; distant vectors c1, c3
 B: central vectors c1, c3; distant vectors c2, c4
New solution:
 Variant (a): take the central vectors from parent solution A and the distant vectors from parent solution B, OR
 Variant (b): take the distant vectors from parent solution A and the central vectors from parent solution B.
[Figure: the resulting child solutions for variant (a) and variant (b)]
Pairwise crossover [Fränti et al, 1997: Computer Journal] Greedy approach: • For each centroid, find its nearest centroid in the other parent solution that is not yet used. • From each pair, select one of the two centroids randomly. Small improvement: • There is no reason to consider the parents as separate solutions. • Take the union of all centroids. • Make the pairing independent of the parent.
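A sketch of the greedy pairing, assuming both parents have M centroids; from each pair, one centroid is kept at random:

```python
import random

def pairwise_crossover(A, B, rng=random):
    # Greedy pairing: each centroid of A is paired with its nearest
    # still-unused centroid of B; one centroid of each pair is kept.
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    unused = list(range(len(B)))
    child = []
    for a in A:
        j = min(unused, key=lambda k: d2(a, B[k]))
        unused.remove(j)
        child.append(a if rng.random() < 0.5 else B[j])
    return child
```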
Pairwise crossover example Initial parent solutions: MSE = 11.92 × 10⁹ and MSE = 8.79 × 10⁹
Pairwise crossover example Pairing between parent solutions: MSE = 7.34 × 10⁹
Pairwise crossover example Pairing without restrictions: MSE = 4.76 × 10⁹
Largest partitions [Fränti et al, 1997: Computer Journal] Crossover algorithm: • Each cluster in solutions A and B is assigned a number, its cluster size S, indicating how many data objects belong to it. • In each phase, pick the centroid of the largest remaining cluster. • Assume cluster i was chosen from A. Its centroid ci is removed from A to avoid reselection. • For the same reason, the cluster sizes of B are updated by removing the effect of those data objects in B that were assigned to the chosen cluster i in A.
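A possible reading of this procedure in code; PA and PB are assumed to map each data object to its cluster index in the respective parent (the function name and data layout are illustrative):

```python
def largest_partitions_crossover(A, B, PA, PB, M):
    # PA[i], PB[i]: cluster index of data object i in parents A and B.
    # Repeatedly pick the centroid of the currently largest cluster, then
    # discount the chosen cluster's objects from the other parent's sizes.
    sizes = {('A', j): PA.count(j) for j in range(M)}
    sizes.update({('B', j): PB.count(j) for j in range(M)})
    covered = set()
    child = []
    while len(child) < M and sizes:
        parent, j = max(sizes, key=sizes.get)
        del sizes[(parent, j)]                 # avoid reselection
        child.append((A if parent == 'A' else B)[j])
        P_own, P_other, other = ((PA, PB, 'B') if parent == 'A'
                                 else (PB, PA, 'A'))
        for i, p in enumerate(P_own):
            if p == j and i not in covered:
                covered.add(i)
                key = (other, P_other[i])
                if key in sizes:
                    sizes[key] -= 1            # remove object's effect
    return child
```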
[Figure: parent solutions A and B with cluster sizes S = 100, 50, 30, 20]
PNN crossover for GA [Fränti et al, 1997: The Computer Journal] [Figure: the two initial solutions are combined by taking the union of their centroids; the PNN method then merges the combined solution back down to M clusters]
The PNN crossover method (1)[Fränti, 2000: Pattern Recognition Letters]
Importance of K-means (random crossover) [Figure: best and worst solutions over the generations on the Bridge data set]
Effect of crossover method (with k-means iterations) [Figure: results for the binary data set and Bridge2]
Mutations • The purpose is to make small random changes to the solutions. • Happens with a small probability. • Sensible approach: change the location of one centroid by a random swap! • The role of mutations is to simulate local search. • If mutations are needed, the crossover method is not very good.
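A random-swap mutation can be sketched as replacing one randomly chosen centroid with a randomly chosen data point:

```python
import random

def mutate_random_swap(C, X, rng=random):
    # Swap: relocate one randomly chosen centroid to a random data point
    C = list(C)                                # keep the original codebook intact
    C[rng.randrange(len(C))] = list(rng.choice(X))
    return C
```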
Effect of k-means and mutations K-means improves the result, but is less vital. Mutations alone work better than random crossover!
Agglomerative clustering PNN: Pairwise Nearest Neighbor method • Merges two clusters • Preserves the hierarchy of clusters IS: Iterative shrinking method • Removes one cluster • Repartitions the data vectors of the removed cluster
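The PNN merge at the heart of the PNN crossover can be sketched as follows; as a simplification, each centroid in the union is treated as a singleton cluster of unit weight, whereas the actual method uses the true cluster sizes:

```python
def pnn_crossover(A, B, M):
    # Union of the parents' centroids (size 2M), then repeatedly merge the
    # cheapest pair of clusters until M centroids remain (PNN principle).
    C = [list(c) for c in A + B]
    n = [1] * len(C)                  # cluster sizes (here: unit weights)
    def merge_cost(i, j):
        # PNN merge cost: n_i*n_j/(n_i+n_j) * ||c_i - c_j||^2
        w = n[i] * n[j] / (n[i] + n[j])
        return w * sum((a - b) ** 2 for a, b in zip(C[i], C[j]))
    while len(C) > M:
        i, j = min(((i, j) for i in range(len(C)) for j in range(i + 1, len(C))),
                   key=lambda p: merge_cost(*p))
        # Merged centroid is the weighted mean of the pair
        C[i] = [(n[i] * a + n[j] * b) / (n[i] + n[j])
                for a, b in zip(C[i], C[j])]
        n[i] += n[j]
        del C[j], n[j]
    return C
```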
Local optimization of IS Finding secondary cluster: Removal cost of single vector:
Pseudo code of GAIS[Virmajoki & Fränti, 2006: Pattern Recognition]
PNN vs. IS crossovers Further improvement of about 1%
Optimized GAIS variants GAIS short (optimized for speed): • Create new generations only as long as the best solution keeps improving (T=*). • Use a small population size (Z=10). • Apply two iterations of k-means (G=2). GAIS long (optimized for quality): • Create a large number of generations (T=100). • Use a large population size (Z=100). • Iterate k-means relatively long (G=10).
Comparison with image data [Table: comparison of methods; annotations in the source: "Popular", "Simplest of the good ones", "Previous GA", "BEST!"]
What does it cost? (Bridge)
Random: ~0 s
K-means: 8 s
SOM: 6 minutes
GA-PNN: 13 minutes
GAIS short: ~1 hour
GAIS long: ~3 days
Conclusions • Best clustering obtained by GA • Crossover method most important • Mutations not needed
References • P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006. • P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000. • P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997. • J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003. • J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995. • P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.