A Genetic Algorithm Approach to K-Means Clustering Craig Stanek CS401 November 17, 2004
What Is Clustering? • “partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: • Each cluster has instances that are very similar (or “near”) to each other, and • The instances in each cluster are very different (or “far away”) from the instances in the other clusters” • -- Alex A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms
Why Cluster? Segmentation and Differentiation
Why Cluster? Outlier Detection
Why Cluster? Classification
K-Means Clustering (1) Specify K clusters (2) Randomly initialize K “centroids” (3) Assign each data instance to the cluster whose centroid is closest (4) Recalculate each cluster's centroid (5) Repeat steps (3) and (4) until no data instance moves to a different cluster
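The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the slides; `dist2` (squared Euclidean distance) and the function names are choices made here for clarity.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of coordinate tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))      # step 2: random initial centroids
    labels = None
    for _ in range(max_iter):
        # step 3: assign each instance to its nearest centroid
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:                 # step 5: stop when nothing moves
            break
        labels = new_labels
        # step 4: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, labels
```

Because the initial centroids are drawn at random (step 2), different seeds can converge to different partitions, which is exactly the local-optimum drawback discussed next.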
Drawbacks of K-Means Algorithm • Converges to a local rather than a global optimum • Sensitive to the initial choice of centroids • K must be chosen a priori • Minimizes intra-cluster distance but does not consider inter-cluster distance
Problem Statement • Can a Genetic Algorithm approach do better than standard K-means Algorithm? • Is there an alternative fitness measure that can take into account both intra-cluster similarity and inter-cluster differentiation? • Can a GA be used to find the optimum number of clusters for a given data set?
Representation of Individuals • Randomly generated number of clusters • Medoid-based integer string (each gene is a distinct data instance) • Example: 58 244 23 162 113
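A sketch of this encoding, assuming the slides' variable-length scheme: each individual is a list of distinct data-instance indices, one per cluster medoid, with the cluster count itself drawn at random. The function name and the `k_min`/`k_max` bounds are assumptions for illustration.

```python
import random

def random_individual(n_instances, k_min=2, k_max=10, rng=random):
    """One candidate solution: a variable-length string of distinct
    data-instance indices, where each index is one cluster's medoid."""
    k = rng.randint(k_min, min(k_max, n_instances))   # cluster count chosen at random
    return rng.sample(range(n_instances), k)          # distinct genes, e.g. [58, 244, 23, 162, 113]
```

Using instance indices rather than coordinate vectors keeps every gene tied to a real data point, so a decoded individual is always a valid set of cluster exemplars.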
Genetic Algorithm Approach Why Medoids?
Recombination • Parent #1: 36 108 82 • Parent #2: 6 5 80 147 82 108 • Child #1: 5 36 80 108 • Child #2: 82 147 82 6
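The slide does not fully specify the crossover operator, so the sketch below shows one plausible version: cut each parent at its own random point, swap the tails (which lets children differ in length from their parents), and drop any duplicate medoid so every gene stays a distinct data instance. Function names are assumptions made here.

```python
import random

def dedup(genes):
    """Drop repeated medoid indices, keeping first occurrences in order."""
    seen, out = set(), []
    for g in genes:
        if g not in seen:
            seen.add(g)
            out.append(g)
    return out

def crossover(p1, p2, rng=random):
    """One-point crossover on two medoid strings of possibly different
    lengths; returns two children with distinct genes each."""
    i = rng.randrange(1, len(p1))   # cut point in parent 1
    j = rng.randrange(1, len(p2))   # independent cut point in parent 2
    return dedup(p1[:i] + p2[j:]), dedup(p2[:j] + p1[i:])
```

Independent cut points are what allow the GA to explore different cluster counts during recombination, not just different medoid choices.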
Fitness Function • Let r_ij represent the jth data instance of the ith cluster and M_i be the medoid of the ith cluster • Let X = Σ_i Σ_j d(r_ij, M_i), the total intra-cluster distance • Let Y = Σ_i Σ_k>i d(M_i, M_k), the total distance between medoid pairs • Fitness = Y / X (maximized: rewards tight clusters and well-separated medoids)
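The original X and Y formulas were slide images; under the reconstruction above (X = total instance-to-medoid distance, Y = total medoid-to-medoid distance), the Y/X measure can be sketched as:

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fitness(points, medoids):
    """Y / X: inter-medoid separation over intra-cluster spread.
    Larger is fitter -- tight clusters AND well-separated medoids."""
    # X: each instance contributes its distance to its nearest medoid
    X = sum(min(dist(p, m) for m in medoids) for p in points)
    # Y: separation summed over every unordered pair of medoids
    Y = sum(dist(medoids[i], medoids[k])
            for i in range(len(medoids))
            for k in range(i + 1, len(medoids)))
    return Y / X if X else float("inf")
```

Unlike the plain K-means objective, which only minimizes X, this ratio also penalizes solutions whose medoids crowd together, addressing the inter-cluster drawback noted earlier.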
Experimental Setup Iris Plant Data (UCI Repository) • 150 data instances • 4 dimensions • Known classifications • 3 classes • 50 instances of each
Experimental Setup Iris Data Set
Experimental Setup Iris Data Set
Standard K-Means Clustering Iris Data Set
Medoid-Based EA Iris Data Set
Variable Number of Clusters EA Iris Data Set
Conclusions • The GA is better than standard K-means at reaching a globally optimal clustering • The proposed Y/X fitness function shows promise • Letting the GA determine the “correct” number of clusters on its own remains difficult
Future Work • Other data sets • Alternative fitness function • Scalability • GA comparison to simulated annealing