Genetic Algorithm Using Iterative Shrinking for Solving Clustering Problems
Pasi Fränti and Olli Virmajoki
University of Joensuu, Department of Computer Science, Finland
To be presented at: Data Mining 2003
Problem setup
• Given N data vectors X = {x1, x2, …, xN}, partition the data set into M clusters.
1. Clustering: find the location of the clusters.
2. Vector quantization: approximate the original data by a set of code vectors.
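The setup above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and dividing the total squared error by N only (rather than by N times the dimension) is an assumption about how the MSE is normalized.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def nearest_partition(data, codebook):
    """Map each data vector to the index of its nearest code vector."""
    return [
        min(range(len(codebook)), key=lambda j: squared_distance(x, codebook[j]))
        for x in data
    ]


def mse(data, codebook, partition):
    """Mean squared error of the partition (ASSUMPTION: normalized by N only)."""
    total = sum(squared_distance(x, codebook[p]) for x, p in zip(data, partition))
    return total / len(data)


# Toy data: N = 4 vectors, M = 2 code vectors.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
codebook = [(0.0, 1.0), (10.0, 1.0)]
partition = nearest_partition(data, codebook)
error = mse(data, codebook, partition)
```

Both views of the problem share this objective: clustering reads the code vectors as cluster locations, vector quantization reads them as the approximation of the data.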
Agglomerative clustering
PNN: Pairwise Nearest Neighbor method
• Merges two clusters.
• Preserves the hierarchy of clusters.
IS: Iterative shrinking method
• Removes one cluster.
• Repartitions the data vectors of the removed cluster.
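A single PNN merge step might look like the sketch below. It uses the standard merge cost for this kind of agglomerative clustering (the increase in total squared error when two clusters are joined); the slide does not state the formula, so treat it and the function names as assumptions.

```python
def merge_cost(n_a, c_a, n_b, c_b):
    """Increase in total squared error if clusters a and b are merged."""
    dist = sum((p - q) ** 2 for p, q in zip(c_a, c_b))
    return n_a * n_b / (n_a + n_b) * dist


def pnn_merge_once(sizes, centroids):
    """Merge the cheapest pair of clusters; return the new (sizes, centroids)."""
    m = len(centroids)
    a, b = min(
        ((i, j) for i in range(m) for j in range(i + 1, m)),
        key=lambda ij: merge_cost(sizes[ij[0]], centroids[ij[0]],
                                  sizes[ij[1]], centroids[ij[1]]),
    )
    n = sizes[a] + sizes[b]
    # The merged centroid is the size-weighted mean of the two centroids.
    merged = tuple((sizes[a] * p + sizes[b] * q) / n
                   for p, q in zip(centroids[a], centroids[b]))
    new_sizes = [s for i, s in enumerate(sizes) if i not in (a, b)] + [n]
    new_centroids = [c for i, c in enumerate(centroids) if i not in (a, b)] + [merged]
    return new_sizes, new_centroids


# Toy example: the two clusters near the origin are the cheapest pair.
sizes, centroids = pnn_merge_once(
    [2, 2, 1], [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
)
```

IS differs in that the vectors of the removed cluster are not forced into one neighbour; each vector may go to a different cluster.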
Local optimization of the IS
• Finding secondary cluster: [formula not recovered]
• Removal cost of single vector: [formula not recovered]
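The two quantities named on this slide can be sketched as follows. The formulas did not survive in the slide text, so the forms used here are assumptions: the secondary cluster of a vector is taken to be the other cluster that is cheapest to join, and the removal cost is the error added there minus the error the vector had in its own cluster. All names are hypothetical.

```python
def sqdist(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, c))


def secondary_cluster(x, a, sizes, centroids):
    """ASSUMED form: cluster j != a minimizing n_j / (n_j + 1) * ||x - c_j||^2."""
    candidates = [j for j in range(len(centroids)) if j != a]
    return min(candidates,
               key=lambda j: sizes[j] / (sizes[j] + 1) * sqdist(x, centroids[j]))


def vector_removal_cost(x, a, sizes, centroids):
    """ASSUMED form: error added in the secondary cluster minus error left in a."""
    q = secondary_cluster(x, a, sizes, centroids)
    added = sizes[q] / (sizes[q] + 1) * sqdist(x, centroids[q])
    return added - sqdist(x, centroids[a])


def cluster_removal_cost(members, a, sizes, centroids):
    """Total cost of removing cluster a, summed over its member vectors."""
    return sum(vector_removal_cost(x, a, sizes, centroids) for x in members)


# Toy example: cost of removing cluster 0, whose members are listed explicitly.
sizes = [2, 3, 1]
centroids = [(0.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
members_of_0 = [(-1.0, 0.0), (1.0, 0.0)]
cost = cluster_removal_cost(members_of_0, 0, sizes, centroids)
```

An IS step would then remove the cluster with the smallest such cost and move each of its vectors to that vector's secondary cluster.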
Generalization to the case of unknown number of clusters
• Measure the variance-ratio F-test for every intermediate clustering, M = 1..N.
• Select the clustering with minimum F-ratio as the final clustering.
• No additional computing, except the calculation of the F-ratio.
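The selection rule above can be sketched as follows. The slide does not give the exact F-ratio, so the form below (within-cluster variance over between-cluster variance, each divided by its degrees of freedom) is an assumption chosen so that smaller is better; the input numbers are made-up toy values, not results from the slides.

```python
def f_ratio(ssw, ssb, n, m):
    """ASSUMED form: (SSW / (N - M)) / (SSB / (M - 1)); smaller is better."""
    return (ssw / (n - m)) / (ssb / (m - 1))


def select_number_of_clusters(candidates, n):
    """Pick M with the minimum F-ratio.

    candidates maps M -> (SSW, SSB) for each intermediate clustering.
    M = 1 and M = N are skipped because this assumed ratio is undefined there.
    """
    scores = {
        m: f_ratio(ssw, ssb, n, m)
        for m, (ssw, ssb) in candidates.items()
        if 1 < m < n
    }
    return min(scores, key=scores.get), scores


# Toy values only: within- and between-cluster sums of squares for N = 100.
toy = {2: (400.0, 600.0), 3: (100.0, 900.0), 4: (90.0, 910.0)}
best_m, scores = select_number_of_clusters(toy, n=100)
```

Because PNN and IS pass through every intermediate M on the way down, the sums of squares are already available and only the ratio itself has to be computed.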
Genetic algorithm
Generate S initial solutions.
REPEAT T times
• Select best solutions to survive.
• Generate new solutions by crossover.
• Fine-tune solutions.
END-REPEAT
Output the best solution found.
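The loop above can be written as a small generic skeleton. This is a sketch under stated assumptions: the operators are passed in as functions, the `survivors` parameter and all names are hypothetical, and the toy operators in the demo (averaging two numbers, stepping toward a target) merely stand in for the clustering-specific crossover and fine-tuning, which the slide does not spell out.

```python
import random


def genetic_algorithm(init, crossover, fine_tune, fitness, S, T, survivors, rng):
    """Generic GA loop; lower fitness is better. Requires survivors >= 2."""
    population = [init(rng) for _ in range(S)]       # generate S initial solutions
    for _ in range(T):                               # REPEAT T times
        population.sort(key=fitness)
        elite = population[:survivors]               # select best solutions to survive
        children = []
        while len(elite) + len(children) < S:
            a, b = rng.sample(elite, 2)              # pick two distinct parents
            child = crossover(a, b, rng)             # generate new solution by crossover
            children.append(fine_tune(child))        # fine-tune it
        population = elite + children
    return min(population, key=fitness)              # output the best solution found


# Toy demo: minimize (x - 3)^2 over real numbers.
rng = random.Random(0)
best = genetic_algorithm(
    init=lambda r: r.uniform(-10.0, 10.0),
    crossover=lambda a, b, r: (a + b) / 2,
    fine_tune=lambda x: x + 0.5 * (3.0 - x),
    fitness=lambda x: (x - 3.0) ** 2,
    S=8, T=20, survivors=3, rng=rng,
)
```

Keeping the operators as parameters separates the loop from the problem: for clustering, a solution would be a codebook with its partition, and the fitness would be the MSE.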
Illustration of crossover
[Figure not recovered: two solutions combined (+) into a new solution (=) by crossover.]
Image datasets
• Bridge (256×256): d = 16, N = 4096, M = 256
• Miss America (360×288): d = 16, N = 6480, M = 256
• House (256×256): d = 3, N = 34112*, M = 256
Synthetic data sets
• Data set S1: d = 2, N = 5000, M = 15
• Data set S2: d = 2, N = 5000, M = 15
• Data set S3: d = 2, N = 5000, M = 15
• Data set S4: d = 2, N = 5000, M = 15
Comparison with image data
[Results figure not recovered. Annotations: popular methods; simplest of the good ones; previous GA; new method.]
Comparison with synthetic data
[Results figure not recovered. Annotations: most separable clusters; most overlap between clusters.]
What does it cost?
Running times on Bridge:
• Random: ~0 s
• K-means: 8 s
• SOM: 6 minutes
• GA-PNN: 13 minutes
• GAIS – short: ~1 hour
• GAIS – long: ~3 days
Conclusions
• Slower but better clustering algorithm.
• BEST known clustering algorithm in minimizing MSE.
Thank you!