Get into pairs, please! • Person A explain to Person B what datasets are and what clustering is about • Person B explain to Person A how the k-means algorithm works
CS26110 AI Toolbox • Clustering 2
Clustering lectures overview • Datasets, data points, dimensionality, distance • What is clustering? • Partitional clustering • k-means algorithm • Extensions (fuzzy) • Hierarchical clustering • Agglomerative/Divisive • Single-link, complete-link, average-link
Today • Investigate the parameters of k-means clustering • Think about the limitations of k-means • Think about how to decide if a clustering is good or not
Exercise • Given the following 1D data: {6, 8, 18, 28, 12, 32, 24}, choose your own initial centroids and perform k-means • (stay within the range 6 to 32) Iterate until converged: • Compute the distance from every data point to each of the k centroids • For each data point, assign it to the cluster whose current centroid is nearest • For each cluster, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
Final clustering • Initial centroids: 18, 20 • [Figure: data points 6, 8, 12, 18, 24, 28, 32 on a number line]
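The exercise can be checked with a few lines of code. The following is a minimal sketch of the four steps for 1D data, started from the initial centroids 18 and 20 given above. The function name `kmeans_1d`, the `max_iter` cap, and the rule that an empty cluster keeps its old centroid are assumptions made for this illustration, not part of the lecture.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Run k-means on 1D data; return (centroids, clusters)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        # (ties go to the centroid listed first).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        # Assumption for this sketch: an empty cluster keeps its old centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


data = [6, 8, 18, 28, 12, 32, 24]
centroids, clusters = kmeans_1d(data, [18, 20])
print(centroids)  # [11.0, 28.0]
print(clusters)   # [[6, 8, 18, 12], [28, 32, 24]]
```

From these starting centroids the first pass already produces the final assignment; the second pass only confirms that the centroids no longer move.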
Previous clustering • Initial centroids: 11, 20 • [Figure: data points 6, 8, 13, 18, 24, 26, 32 on a number line]
Initial seed choice • Results can vary depending on which initial seeds are selected • Some seeds can result in slow convergence, or convergence to sub-optimal clusterings • Select good seeds using a heuristic • Try out multiple starting points • Initialize with the results of another method • In the figure (six points labelled A to F): if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}
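The effect of the seeds can be reproduced in code. The sketch below is illustrative only: the original figure is not available here, so the coordinates for A to F are an assumption (two rows of three points, with C and F set further to the right), chosen so that the two seed choices behave as described. It also shows the "try out multiple starting points" idea by keeping the run with the lowest sum of squared errors (SSE), which is one common way to compare runs.

```python
# Hypothetical layout: these coordinates are an assumption, not the
# lecture's actual figure.
POINTS = {"A": (0, 1), "B": (1, 1), "C": (3, 1),
          "D": (0, 0), "E": (1, 0), "F": (3, 0)}


def dist2(p, c):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2


def kmeans(points, seeds, max_iter=100):
    """Run k-means starting from the given seed labels; return (clusters, SSE)."""
    centroids = [tuple(map(float, points[s])) for s in seeds]
    for _ in range(max_iter):
        # Assignment step: nearest centroid wins.
        clusters = [[] for _ in centroids]
        for label, p in points.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(label)
        # Update step: centroid = mean of the cluster's points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:  # sketch choice: an empty cluster keeps its centroid
                new_centroids.append(centroids[i])
                continue
            xs = [points[m][0] for m in members]
            ys = [points[m][1] for m in members]
            new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(dist2(points[m], centroids[i])
              for i, members in enumerate(clusters) for m in members)
    return clusters, sse


# Run once per seed choice and keep the run with the lowest SSE.
runs = {seeds: kmeans(POINTS, seeds) for seeds in [("B", "E"), ("D", "F")]}
for seeds, (clusters, sse) in runs.items():
    print(seeds, clusters, round(sse, 2))

best_seeds = min(runs, key=lambda s: runs[s][1])
print("lowest SSE from seeds", best_seeds)
```

With this assumed layout, seeds B and E split the points by row, while seeds D and F split them left versus right and give a much lower SSE.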
Exercise • Given the following 1D data: {6, 8, 18, 28, 7, 32, 22}, choose your own centroids and perform k-means • Choose a value of k: 2, 3, or 4 Iterate until converged: • Compute the distance from every data point to each of the k centroids • For each data point, assign it to the cluster whose current centroid is nearest • For each cluster, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
What this looks like... • [Figure: data points 6, 7, 8, 18, 22, 28, 32 on a number line]
How many clusters? • Number of clusters k is required at the start • Finding the “right” number of clusters is part of the problem • Given data, partition into an “appropriate” number of subsets • Trade-off between having more clusters (better focus within each cluster) and having too many clusters (in the extreme, every data point is its own cluster, which tells us nothing)
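One way to see this trade-off is to compute the best achievable within-cluster error for each k on the exercise data. The sketch below does not run k-means; as a swapped-in technique it exhaustively tries every way of cutting the sorted data into k contiguous groups, which is feasible for seven points and sufficient in 1D, where the partition minimising the sum of squared errors (SSE) is always made of contiguous runs. The helper names `sse` and `best_sse` are assumptions for this illustration.

```python
from itertools import combinations


def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)


def best_sse(data, k):
    """Lowest total SSE over all splits of the sorted data into k contiguous groups."""
    xs = sorted(data)
    best = None
    # Choose k - 1 cut positions between neighbouring sorted points.
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        total = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best is None or total < best:
            best = total
    return best


data = [6, 8, 18, 28, 7, 32, 22]
results = {k: best_sse(data, k) for k in range(1, len(data) + 1)}
for k, value in results.items():
    print(k, round(value, 2))
```

The error can only shrink as k grows and reaches zero when every point has its own cluster, so low error alone cannot choose k. What can be read off is where the improvement levels out: here the drop from k = 2 to k = 3 is large, and from k = 3 to k = 4 it is small.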
Time complexity • Computing the distance between n data points and one centroid is O(nm) • Where m is the dimensionality of the data points • Reassigning clusters • For each of the k centroids, do the above: O(knm) in total • Computing centroids • Each point gets assigned to one centroid: O(nm) • Assume these steps are each performed once per iteration, for I iterations: O(Iknm)
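Written out as a short derivation, using the same symbols as the slide (n points, m dimensions, k centroids, I iterations); the names of the T terms are introduced here for illustration only:

```latex
\begin{align*}
T_{\text{assign}} &= k \cdot O(nm) = O(knm) \\
T_{\text{update}} &= O(nm) \\
T_{\text{total}}  &= I \cdot \bigl( O(knm) + O(nm) \bigr) = O(Iknm)
\end{align*}
```

The update term is absorbed because nm is never larger than knm. As a rough feel for the numbers: with n = 7 one-dimensional points (m = 1), k = 2 and I = 2 iterations, the assignment steps make 2 × 7 × 1 × 2 = 28 distance computations in total.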
Limitations • Must choose parameter k in advance, or try many values • This is a particular problem for k-means as often the optimal number of clusters is not known • Data must be numerical and must be compared via a suitable distance measure
Limitations • The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found • The algorithm is sensitive to outliers/points which do not belong in any cluster • These can distort the centroid positions and ruin the clustering
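The outlier problem is easy to demonstrate on the 1D exercise data. The sketch below runs the same k-means steps twice from the same starting centroids (6 and 32), once on the original data and once with a single extra point added. The outlier value 200, the starting centroids, and the helper name `kmeans_1d` are assumptions chosen for this illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Run k-means on 1D data; return (centroids, clusters)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: nearest centroid wins (ties go to the first one).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step; an empty cluster keeps its old centroid (sketch choice).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


data = [6, 7, 8, 18, 22, 28, 32]
_, clean = kmeans_1d(data, [6, 32])
_, with_outlier = kmeans_1d(data + [200], [6, 32])  # 200 is a made-up outlier

print(clean)         # [[6, 7, 8, 18], [22, 28, 32]]
print(with_outlier)  # [[6, 7, 8, 18, 22, 28, 32], [200]]
```

In the second run the outlier drags one centroid so far away that it ends up with a cluster to itself, and all of the genuine data is lumped into the other cluster.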
Cluster validity • [Figure: the data points 6, 7, 8, 18, 22, 28, 32 shown on two number lines]
Cluster validity: what we want! • High inter-cluster distances • Large distance between clusters • Otherwise known as good separability • Low intra-cluster distances • Distances between data points within a cluster should be relatively low • Otherwise known as good compactness • Many cluster validity measures have been developed
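These two properties can be turned into numbers. The sketch below uses one simple pair of measures, chosen for illustration out of the many that exist: compactness as the mean distance from each point to its own cluster centroid, and separation as the smallest distance between two centroids. The two candidate clusterings of the exercise data are also assumptions made for this example.

```python
def centroid(cluster):
    """Mean of a 1D cluster."""
    return sum(cluster) / len(cluster)


def compactness(clusters):
    """Mean distance from each point to its own centroid (lower is better)."""
    total = sum(abs(x - centroid(c)) for c in clusters for x in c)
    count = sum(len(c) for c in clusters)
    return total / count


def separation(clusters):
    """Smallest distance between any two cluster centroids (higher is better)."""
    cs = [centroid(c) for c in clusters]
    return min(abs(a - b) for i, a in enumerate(cs) for b in cs[i + 1:])


# Two hypothetical clusterings of the same data, differing only in where 18 goes.
clustering_a = [[6, 7, 8], [18, 22, 28, 32]]
clustering_b = [[6, 7, 8, 18], [22, 28, 32]]

for name, clustering in [("A", clustering_a), ("B", clustering_b)]:
    print(name,
          "compactness:", round(compactness(clustering), 2),
          "separation:", round(separation(clustering), 2))
```

Clustering A comes out better on both measures: its points sit closer to their centroids, and its centroids are further apart.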
To think about... • Can GAs be used for partitional clustering? • What does a ‘solution’ to the clustering problem look like? • How would you encode this? • What fitness function would you use?
What to take away • Be able to apply k-means clustering • Understand the issues involved in k-means clustering • Parameters, limitations • Analyse simple clusters for validity • Inter-cluster distance vs intra-cluster distance