Genetic algorithms (GA) for clustering
Clustering Methods: Part 2e
Pasi Fränti
Speech and Image Processing Unit, School of Computing, University of Eastern Finland
General structure
Genetic Algorithm:
  Generate S initial solutions
  REPEAT Z iterations
    Select best solutions
    Create new solutions by crossover
    Mutate solutions
  END-REPEAT
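The loop above can be sketched in Python as follows. This is only a minimal skeleton under stated assumptions: the callback names (`init`, `fitness`, `crossover`, `mutate`) are hypothetical, lower fitness means lower distortion, and "select best solutions" is interpreted here as keeping the better half of the population. The toy usage optimizes a single number rather than a clustering, purely to show the control flow.

```python
import random

def genetic_algorithm(init, fitness, crossover, mutate, S=8, Z=20, rng=None):
    """Generic GA loop (assumes S >= 2); lower fitness is better."""
    if rng is None:
        rng = random.Random(0)
    population = [init(rng) for _ in range(S)]       # generate S initial solutions
    for _ in range(Z):                               # repeat Z iterations
        population.sort(key=fitness)
        best = population[: max(2, S // 2)]          # select best solutions (assumed: better half)
        children = []
        while len(best) + len(children) < S:
            a, b = rng.sample(best, 2)
            child = crossover(a, b, rng)             # create a new solution by crossover
            children.append(mutate(child, rng))      # mutate the new solution
        population = best + children                 # survivors are kept unchanged
    return min(population, key=fitness)

# Toy usage: minimise (x - 3)^2 over real numbers.
toy_best = genetic_algorithm(
    init=lambda rng: rng.uniform(-10.0, 10.0),
    fitness=lambda x: (x - 3.0) ** 2,
    crossover=lambda a, b, rng: (a + b) / 2.0,
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.1),
)
```

Because the selected solutions survive unchanged into the next generation, the best solution found so far is never lost in this sketch.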
Components of GA
• Representation of solution
• Selection method
• Crossover method (most critical!)
• Mutation
Representation of solution
• Partition (P):
  • Optimal centroids can be calculated from P.
  • Only local changes can be made.
• Codebook (C):
  • Optimal partition can be calculated from C.
  • Calculating P takes O(NM) time, which is slow.
• Combined (C, P):
  • Both data structures are needed anyway.
  • Computationally more efficient.
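The two conversions mentioned above (partition from codebook, centroids from partition) can be sketched as follows. This is a minimal illustration, assuming data points are tuples of floats and squared Euclidean distance is used; the function names are hypothetical, and keeping the old centroid for an empty cluster is one possible convention rather than something the slides specify.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def optimal_partition(data, codebook):
    """Map each of the N points to its nearest of the M centroids: O(NM)."""
    return [min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))
            for x in data]

def optimal_codebook(data, partition, old_codebook):
    """Centroid (mean) of each cluster; an empty cluster keeps its old centroid."""
    M, dim = len(old_codebook), len(data[0])
    sums = [[0.0] * dim for _ in range(M)]
    counts = [0] * M
    for x, j in zip(data, partition):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return [tuple(s / counts[j] for s in sums[j]) if counts[j]
            else tuple(old_codebook[j])
            for j in range(M)]

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(1.0, 1.0), (9.0, 1.0)]
P = optimal_partition(data, C)        # partition derived from the codebook
C = optimal_codebook(data, P, C)      # codebook derived from the partition
```

Alternating these two steps is exactly one K-means iteration, which is why storing both C and P is convenient.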
Selection method
• Selects which solutions will be used in crossover for generating new solutions.
• Main principle: good solutions should be used rather than weak solutions.
• Two main strategies:
  • Roulette wheel selection
  • Elitist selection
• The exact implementation is not so important.
Roulette wheel selection
• Select two candidate solutions for the crossover randomly.
• The probability of a solution being selected is weighted according to its distortion, so that solutions with lower distortion are more likely to be chosen.
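A possible sketch of roulette wheel selection is given below. The exact weighting formula is not preserved in this transcript, so the weight 1 / (1 + distortion) used here is an assumption chosen only because it favours low distortion; the function name is hypothetical as well.

```python
import random

def roulette_select_pair(distortions, rng):
    """Pick two distinct solution indices; lower distortion gives higher weight."""
    # Assumed weighting: w_i = 1 / (1 + d_i). Any decreasing function would do.
    weights = [1.0 / (1.0 + d) for d in distortions]
    indices = list(range(len(weights)))
    first = rng.choices(indices, weights=weights, k=1)[0]
    # Draw the second parent from the remaining solutions.
    rest = [i for i in indices if i != first]
    second = rng.choices(rest, weights=[weights[i] for i in rest], k=1)[0]
    return first, second

pair = roulette_select_pair([0.0, 9.0, 9.0], random.Random(0))
```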
Elitist selection
• Main principle: select all possible pairs among the best candidates.
Figure: Elitist approach using zigzag scanning among the best solutions.
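Enumerating all pairs among the best candidates might look like the sketch below. The function name and the `n_best` parameter are hypothetical, and the pairs are produced in plain lexicographic order, which is not necessarily the zigzag scanning order shown in the figure.

```python
from itertools import combinations

def elitist_pairs(distortions, n_best):
    """All index pairs among the n_best solutions with the lowest distortion."""
    ranked = sorted(range(len(distortions)), key=lambda i: distortions[i])
    best = ranked[:n_best]
    # Order of pairs is an assumption; the slide's zigzag order is not reproduced.
    return list(combinations(best, 2))

pairs = elitist_pairs([5.0, 1.0, 3.0, 9.0], n_best=3)
```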
Crossover methods
Different variants for crossover:
• Random crossover
• Centroid distance
• Pairwise crossover
• Largest partitions
• PNN
Local fine-tuning:
• All methods give a new allocation of the centroids.
• Local fine-tuning must be made by K-means.
• Two iterations of K-means are enough.
Random crossover
Select M/2 centroids randomly from each of the two parents.
Figure: Solution 1 + Solution 2.
Legend: data point; centroid; M = number of clusters. Panels: Parent solution A, Parent solution B.
• How to create a new solution? Pick M/2 randomly chosen cluster centroids from each of the two parents in turn.
• How many solutions are there? 36 possibilities for creating a new solution (M = 4).
• What is the probability of selecting a good one? Not high: some are good, but K-means is needed; most are bad. See statistics.
Rough statistics: Optimal: 1, Good: 7, Bad: 28.
Some possibilities:
Figure: Parent solution A, Parent solution B, child solution (optimal), child solution (good), child solution (bad).
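Random crossover and the count of 36 possible children for M = 4 can be sketched as follows. The function names are hypothetical; centroids are assumed to be stored as list elements (for example tuples), and the child is expected to be fine-tuned by K-means afterwards.

```python
import random
from math import comb

def random_crossover(parent_a, parent_b, rng):
    """Take M/2 randomly chosen centroids from each parent."""
    M = len(parent_a)
    half = M // 2
    return rng.sample(parent_a, half) + rng.sample(parent_b, M - half)

def count_children(M):
    """Number of distinct ways to choose M/2 centroids from each parent."""
    return comb(M, M // 2) * comb(M, M - M // 2)

A = [(0.0, float(i)) for i in range(4)]
B = [(1.0, float(i)) for i in range(4)]
child = random_crossover(A, B, random.Random(0))
n_children = count_children(4)   # C(4,2) * C(4,2) = 6 * 6 = 36
```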
Centroid distance crossover [Pan, McInnes, Jack, 1995: Electronics Letters] [Scheunders, 1996: Pattern Recognition Letters]
• For each centroid, calculate its distance to the center point of the entire data set.
• Sort the centroids according to the distance.
• Divide into two sets: central vectors (M/2 closest) and distant vectors (M/2 furthest).
• Take central vectors from one codebook and distant vectors from the other.
Legend: data point; centroid; Ced = centroid of the entire data set; M = number of clusters. Panels: Parent solution A, Parent solution B.
1) Distances d(ci, Ced):
   A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
   B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)
2) Sort centroids according to the distance:
   A: c4, c2, c1, c3
   B: c1, c3, c2, c4
3) Divide into two sets (M = 4):
   A: central vectors: c4, c2; distant vectors: c1, c3
   B: central vectors: c1, c3; distant vectors: c2, c4
New solution:
Variant (a): take central vectors from parent solution A and distant vectors from parent solution B,
OR
Variant (b): take distant vectors from parent solution A and central vectors from parent solution B.
Figure: Child, variant (a); Child, variant (b).
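The steps of the centroid distance crossover can be sketched as below. This is an illustration under assumptions: points and centroids are tuples of floats, squared Euclidean distance is used for sorting (which gives the same order as Euclidean distance), and the function name and the `variant` parameter are hypothetical.

```python
def centroid_distance_crossover(parent_a, parent_b, data, variant="a"):
    """Combine central vectors of one parent with distant vectors of the other."""
    dim = len(data[0])
    # Center point (mean) of the entire data set.
    center = tuple(sum(x[d] for x in data) / len(data) for d in range(dim))

    def dist2(c):
        return sum((ci - mi) ** 2 for ci, mi in zip(c, center))

    half = len(parent_a) // 2
    a_sorted = sorted(parent_a, key=dist2)   # closest to the center first
    b_sorted = sorted(parent_b, key=dist2)
    if variant == "a":
        # Variant (a): central vectors from A, distant vectors from B.
        return a_sorted[:half] + b_sorted[half:]
    # Variant (b): distant vectors from A, central vectors from B.
    return b_sorted[:half] + a_sorted[half:]

data = [(0.0,), (10.0,)]
A = [(4.0,), (6.0,), (0.0,), (10.0,)]
B = [(5.0,), (3.0,), (9.0,), (1.0,)]
child_a = centroid_distance_crossover(A, B, data, "a")
child_b = centroid_distance_crossover(A, B, data, "b")
```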
Pairwise crossover [Fränti et al., 1997: The Computer Journal]
Greedy approach:
• For each centroid, find its nearest centroid in the other parent solution that is not yet used.
• Among all pairs, select one of the two randomly.
Small improvement:
• No reason to consider the parents as separate solutions.
• Take union of all centroids.
• Make the pairing independent of parent.
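The greedy pairing between two parents can be sketched as follows. The function name is hypothetical, centroids are assumed to be tuples of floats, and only the basic variant is shown; the improved variant that pairs centroids within the union of both parents is not covered here.

```python
import random

def pairwise_crossover(parent_a, parent_b, rng):
    """Pair each centroid of A with its nearest unused centroid of B,
    then keep one member of each pair at random."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    unused = list(range(len(parent_b)))
    child = []
    for ca in parent_a:
        # Nearest centroid in the other parent that is not yet used.
        j = min(unused, key=lambda k: sq_dist(ca, parent_b[k]))
        unused.remove(j)
        child.append(rng.choice((ca, parent_b[j])))
    return child

A = [(0.0, 0.0), (10.0, 10.0)]
B = [(9.0, 9.0), (1.0, 1.0)]
child = pairwise_crossover(A, B, random.Random(0))
```

Being greedy, the result depends on the order in which the centroids of the first parent are visited.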
Pairwise crossover example: initial parent solutions (MSE = 11.92 × 10^9 and MSE = 8.79 × 10^9)
Pairwise crossover example: pairing between parent solutions (MSE = 7.34 × 10^9)
Pairwise crossover example: pairing without restrictions (MSE = 4.76 × 10^9)
Largest partitions [Fränti et al., 1997: The Computer Journal]
• Select the centroids that represent the largest clusters.
• Selection is done in a greedy manner.
• (illustration to appear later)
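A much simplified sketch of the idea is shown below. The slides give no details, so everything beyond "pick the centroids of the largest clusters" is an assumption: the function name, the use of each parent's own cluster sizes, and the absence of any size update after a centroid has been picked.

```python
def largest_partitions_crossover(parent_a, sizes_a, parent_b, sizes_b):
    """Pick the M centroids with the largest clusters from the union of parents.

    Simplifying assumption: sizes are taken as given and are not updated
    after each pick, so two nearby centroids may both be selected.
    """
    M = len(parent_a)
    pool = list(zip(sizes_a, parent_a)) + list(zip(sizes_b, parent_b))
    pool.sort(key=lambda item: item[0], reverse=True)   # largest cluster first
    return [centroid for _, centroid in pool[:M]]

child = largest_partitions_crossover(
    [(0.0, 0.0), (5.0, 5.0)], [10, 1],
    [(1.0, 1.0), (9.0, 9.0)], [4, 7],
)
```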
PNN crossover for GA [Fränti et al., 1997: The Computer Journal]
Figure panels: Initial 1, Initial 2, Union, Combined, After PNN.
The PNN crossover method (1) [Fränti, 2000: Pattern Recognition Letters]
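The idea of taking the union of both parents and reducing it back to M centroids with the pairwise nearest neighbor (PNN) method can be sketched as below. This is a simplified illustration and not the published pseudo code: it assumes each parent's own (positive) cluster sizes can serve as merge weights, whereas a full implementation would first repartition the data for the combined codebook. The merge cost n_a n_b / (n_a + n_b) · ||c_a − c_b||² is the standard PNN cost; the function names are hypothetical.

```python
def merge_cost(ca, na, cb, nb):
    """Increase in distortion when clusters a and b are merged."""
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * d2

def pnn_crossover(parent_a, sizes_a, parent_b, sizes_b):
    """Union of both codebooks, merged pairwise until M centroids remain."""
    M = len(parent_a)
    cents = [tuple(float(v) for v in c) for c in list(parent_a) + list(parent_b)]
    sizes = list(sizes_a) + list(sizes_b)
    while len(cents) > M:
        # Exhaustive search for the cheapest pair to merge.
        best = None
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                cost = merge_cost(cents[i], sizes[i], cents[j], sizes[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        n = sizes[i] + sizes[j]
        # The merged centroid is the size-weighted mean of the two.
        cents[i] = tuple((sizes[i] * x + sizes[j] * y) / n
                         for x, y in zip(cents[i], cents[j]))
        sizes[i] = n
        del cents[j]
        del sizes[j]
    return cents, sizes

child, child_sizes = pnn_crossover(
    [(0.0, 0.0), (10.0, 10.0)], [5, 5],
    [(0.0, 2.0), (10.0, 8.0)], [5, 5],
)
```

Unlike random crossover, this procedure involves no random choices once the two parents are fixed.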
Importance of K-means (random crossover)
Figure: Bridge data set; curves labelled Worst and Best.
Effect of crossover method (with K-means iterations)
Figure: Binary data (Bridge2).
Mutations
• The purpose is to make small random changes to the solutions.
• A mutation happens with a small probability.
• Sensible approach: change the location of one centroid by a random swap!
• The role of mutations is to simulate local search.
• If mutations are needed, the crossover method is not very good.
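A random swap mutation could be sketched as follows. The assumptions here are that a swap means replacing one randomly chosen centroid with a randomly chosen data point, and that the default probability of 0.01 is only a placeholder; the function name is hypothetical.

```python
import random

def random_swap_mutation(codebook, data, rng, probability=0.01):
    """With a small probability, move one centroid to a random data point."""
    mutated = list(codebook)                 # leave the input codebook untouched
    if rng.random() < probability:
        j = rng.randrange(len(mutated))      # centroid to relocate
        mutated[j] = rng.choice(data)        # new location: a random data point
    return mutated

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
codebook = [(0.5, 0.5), (2.5, 2.5)]
mutated = random_swap_mutation(codebook, data, random.Random(1), probability=1.0)
```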
Effect of K-means and mutations
• K-means improves the result but is not vital.
• Mutations alone are better than random crossover!
Pseudo code of GAIS [Virmajoki & Fränti, 2006: Pattern Recognition]
PNN vs. IS crossovers
Further improvement of about 1%.
Optimized GAIS variants
GAIS short (optimized for speed):
• Create new generations only as long as the best solution keeps improving (T=*).
• Use a small population size (Z=10).
• Apply two iterations of K-means (G=2).
GAIS long (optimized for quality):
• Create a large number of generations (T=100).
• Use a large population size (Z=100).
• Iterate K-means relatively long (G=10).
Conclusions
• The best clustering is obtained by GA.
• The crossover method is the most important component.
• Mutations are not needed.
References
• P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
• P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
• P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
• J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003.
• J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.
• P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.