A Genetic Algorithm Approach to K-Means Clustering Craig Stanek CS401 November 17, 2004
What Is Clustering? • “partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: • Each cluster has instances that are very similar (or “near”) to each other, and • The instances in each cluster are very different (or “far away”) from the instances in the other clusters” • -- Alex A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms
Why Cluster? Segmentation and Differentiation
Why Cluster? Outlier Detection
Why Cluster? Classification
K-Means Clustering (1) Specify K clusters (2) Randomly initialize K “centroids” (3) Assign each data instance to the cluster whose centroid is closest (4) Recalculate each cluster's centroid (5) Repeat steps (3) and (4) until no data instance moves to a different cluster
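The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the slides; `dist2` (squared Euclidean distance) and the function names are choices made here for clarity.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of coordinate tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))      # step 2: random initial centroids
    labels = None
    for _ in range(max_iter):
        # step 3: assign each instance to its nearest centroid
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:                 # step 5: stop when nothing moves
            break
        labels = new_labels
        # step 4: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, labels
```

Because the initial centroids are drawn at random (step 2), different seeds can converge to different partitions, which is exactly the local-optimum drawback discussed next.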
Drawbacks of K-Means Algorithm • Converges to a local rather than a global optimum • Sensitive to the initial choice of centroids • K must be chosen a priori • Minimizes intra-cluster distance but does not consider inter-cluster distance
Problem Statement • Can a Genetic Algorithm approach do better than standard K-means Algorithm? • Is there an alternative fitness measure that can take into account both intra-cluster similarity and inter-cluster differentiation? • Can a GA be used to find the optimum number of clusters for a given data set?
Representation of Individuals • Randomly generated number of clusters • Medoid-based integer string (each gene is a distinct data instance) • Example: 58 244 23 162 113
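A sketch of this encoding, assuming the slides' variable-length scheme: each individual is a list of distinct data-instance indices, one per cluster medoid, with the cluster count itself drawn at random. The function name and the `k_min`/`k_max` bounds are assumptions for illustration.

```python
import random

def random_individual(n_instances, k_min=2, k_max=10, rng=random):
    """One candidate solution: a variable-length string of distinct
    data-instance indices, where each index is one cluster's medoid."""
    k = rng.randint(k_min, min(k_max, n_instances))   # cluster count chosen at random
    return rng.sample(range(n_instances), k)          # distinct genes, e.g. [58, 244, 23, 162, 113]
```

Using instance indices rather than coordinate vectors keeps every gene tied to a real data point, so a decoded individual is always a valid set of cluster exemplars.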
Genetic Algorithm Approach Why Medoids?
Recombination • Parent #1: 36 108 82 • Parent #2: 6 5 80 147 82 108 • Child #1: 5 36 80 108 • Child #2: 82 147 82 6
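The slide does not fully specify the crossover operator, so the sketch below shows one plausible version: cut each parent at its own random point, swap the tails (which lets children differ in length from their parents), and drop any duplicate medoid so every gene stays a distinct data instance. Function names are assumptions made here.

```python
import random

def dedup(genes):
    """Drop repeated medoid indices, keeping first occurrences in order."""
    seen, out = set(), []
    for g in genes:
        if g not in seen:
            seen.add(g)
            out.append(g)
    return out

def crossover(p1, p2, rng=random):
    """One-point crossover on two medoid strings of possibly different
    lengths; returns two children with distinct genes each."""
    i = rng.randrange(1, len(p1))   # cut point in parent 1
    j = rng.randrange(1, len(p2))   # independent cut point in parent 2
    return dedup(p1[:i] + p2[j:]), dedup(p2[:j] + p1[i:])
```

Independent cut points are what allow the GA to explore different cluster counts during recombination, not just different medoid choices.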
Fitness Function • Let r_ij represent the jth data instance of the ith cluster and M_i be the medoid of the ith cluster • Let X = Σ_i Σ_j d(r_ij, M_i), the total intra-cluster distance • Let Y = Σ_i Σ_k>i d(M_i, M_k), the total distance between medoid pairs • Fitness = Y / X (maximized: rewards tight clusters and well-separated medoids)
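The original X and Y formulas were slide images; under the reconstruction above (X = total instance-to-medoid distance, Y = total medoid-to-medoid distance), the Y/X measure can be sketched as:

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fitness(points, medoids):
    """Y / X: inter-medoid separation over intra-cluster spread.
    Larger is fitter -- tight clusters AND well-separated medoids."""
    # X: each instance contributes its distance to its nearest medoid
    X = sum(min(dist(p, m) for m in medoids) for p in points)
    # Y: separation summed over every unordered pair of medoids
    Y = sum(dist(medoids[i], medoids[k])
            for i in range(len(medoids))
            for k in range(i + 1, len(medoids)))
    return Y / X if X else float("inf")
```

Unlike the plain K-means objective, which only minimizes X, this ratio also penalizes solutions whose medoids crowd together, addressing the inter-cluster drawback noted earlier.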
Experimental Setup Iris Plant Data (UCI Repository) • 150 data instances • 4 dimensions • Known classifications • 3 classes • 50 instances of each
Experimental Setup Iris Data Set
Experimental Setup Iris Data Set
Standard K-Means Clustering Iris Data Set
Medoid-Based EA Iris Data Set
Variable Number of Clusters EA Iris Data Set
Conclusions • The GA is better than standard K-means at reaching a globally optimal clustering • The proposed Y/X fitness function shows promise • Letting the GA determine the “correct” number of clusters on its own remains difficult
Future Work • Other data sets • Alternative fitness function • Scalability • GA comparison to simulated annealing