CS26110 AI Toolbox: Clustering 1
What we’ll look at • Datasets, data points, features, distance • What is clustering? • Partitional clustering • K-means algorithm • Extensions (fuzzy) • Hierarchical clustering • Agglomerative/Divisive • Single-link, complete-link, average-link
Datasets - labelled • Each row is an object/data point • Each column is a feature/symptom/measurement = a dimension • One column is the class feature = the decision/diagnosis
What is clustering? • Clustering: the process of grouping a set of objects into classes of similar objects • Most common form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification (label) for each example is given
Clustering applications • Image segmentation • Social network analysis • Bioinformatics (gene expression) • Medicine • Data mining • Recommender systems • Evolutionary algorithms! (detect niches)
Clustering algorithms • Partitional algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k-means clustering • Model based clustering • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive
Clustering considerations • What does it mean for objects to be similar? • What algorithm and approach do we take? • Partitional • Hierarchical • Do we need a hierarchical arrangement of clusters? • How many clusters? • Can we label or name the clusters? • How do we make it efficient and scalable?
k-means algorithm(s) • Terminology: centroid = a point that is considered to be the center of a cluster • Start by picking k, the number of clusters (and hence centroids) • Initialize clusters by picking one point per cluster (seeds) • E.g., pick data points at random • Could also generate these randomly
Populating clusters Iterate until converged • Compute distance from all data points to all k centroids • For each data point, assign it to the cluster whose current centroid it is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
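Putting the loop together, here is a minimal Python sketch of the procedure (an illustrative implementation, not lecture code; the dist helper and all names are my own):

```python
import random

def dist(a, b):
    # Euclidean distance between two points given as equal-length tuples
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iters=100):
    # Seed: pick one data point per cluster at random
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Assign each point to the cluster whose centroid is nearest
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Replace each centroid with the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids unchanged: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Example: two obvious groups of 2D points
data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 8)]
print(kmeans(data, k=2))
```

Note that the convergence test here is "centroid positions don't change", one of the termination conditions discussed below.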
k-means example (k = 2) [figure: animation of the algorithm on a 2D point set, stepping through pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged, with x marking each centroid]
Measuring distance • Euclidean distance is most often used to determine how far apart points are (although alternatives exist) • For points $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ it is defined as: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
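A quick worked instance in two dimensions: $d((1, 2), (4, 6)) = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5$. In one dimension the formula reduces to the absolute difference $|x - y|$, which is what the exercise below uses.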
Termination conditions • Several possibilities, e.g., • A fixed number of iterations • Centroid positions don’t change (can be proven to converge) • Clusters look reasonable
Exercise • Given the following 1D data: {6, 8, 18, 26, 13, 32, 24} and initial centroids 11 and 20, perform k-means (k = 2) • Iterate until converged: • Compute distance from all data points to all k centroids • For each data point, assign it to the cluster whose current centroid it is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
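The distances in Step 1 below are just absolute differences in 1D; a quick sanity check in Python (variable names are my own):

```python
points = [6, 8, 18, 26, 13, 32, 24]
centroids = [11, 20]
for p in points:
    # distance from point p to each centroid (1D Euclidean = absolute difference)
    print(p, [abs(p - c) for c in centroids])
```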
Step 1 (iteration 1) • Compute distance from all data points to all k centroids:

Point | dist to 11 | dist to 20
6     | 5          | 14
8     | 3          | 12
18    | 7          | 2
26    | 15         | 6
13    | 2          | 7
32    | 21         | 12
24    | 13         | 4
Step 2 (iteration 1) • For each data point, assign it to the cluster whose current centroid it is nearest:

Point | dist to 11 | dist to 20 | assigned to
6     | 5          | 14         | 11
8     | 3          | 12         | 11
18    | 7          | 2          | 20
26    | 15         | 6          | 20
13    | 2          | 7          | 11
32    | 21         | 12         | 20
24    | 13         | 4          | 20
Step 3 (iteration 1) • For each centroid, compute the average (mean) of all points assigned to it: • Centroid 11: points {6, 8, 13}, mean = (6 + 8 + 13) / 3 = 9 • Centroid 20: points {18, 26, 32, 24}, mean = (18 + 26 + 32 + 24) / 4 = 25
Step 4 (iteration 1) • Replace the k centroids with the new averages: 11 becomes 9, 20 becomes 25 • Have we converged? No: the centroids have moved, so we iterate again
Step 1 (iteration 2) • Compute distance from all data points to all k centroids:

Point | dist to 9 | dist to 25
6     | 3         | 19
8     | 1         | 17
18    | 9         | 7
26    | 17        | 1
13    | 4         | 12
32    | 23        | 7
24    | 15        | 1
Step 2 (iteration 2) • For each data point, assign it to the cluster whose current centroid it is nearest:

Point | dist to 9 | dist to 25 | assigned to
6     | 3         | 19         | 9
8     | 1         | 17         | 9
18    | 9         | 7          | 25
26    | 17        | 1          | 25
13    | 4         | 12         | 9
32    | 23        | 7          | 25
24    | 15        | 1          | 25
Step 3 (iteration 2) • For each centroid, compute the average (mean) of all points assigned to it: • Centroid 9: points {6, 8, 13}, mean = 9 • Centroid 25: points {18, 26, 32, 24}, mean = 25
Step 4 (iteration 2) • Replace the k centroids with the new averages: 9 becomes 9, 25 becomes 25 • Have we converged? Yes: the centroids are unchanged, so the algorithm terminates
Final clustering • Cluster 1: {6, 8, 13} • Cluster 2: {18, 24, 26, 32}
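As a sanity check, the whole exercise can be reproduced with a short standalone script (a sketch in plain Python; all names are my own, and the fixed seeds 11 and 20 replace random initialization):

```python
points = [6, 8, 18, 26, 13, 32, 24]
centroids = [11, 20]  # the exercise's initial centroids

while True:
    # Assign each point to the nearest centroid (1D distance = absolute difference)
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Replace each centroid with the mean of its assigned points
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # unchanged: converged
        break
    centroids = new_centroids

print(centroids)  # [9.0, 25.0]
print(clusters)   # [[6, 8, 13], [18, 26, 32, 24]]
```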
Issues • Was k = 2 a good choice for this data? • Were the initial centroids a good choice? [figure: the data points 6, 8, 13, 18, 24, 26, 32 plotted on a number line]
Convergence • Why should the k-means algorithm ever reach a fixed point? • (A state in which clusters don’t change) • k-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm • EM is known to converge • Theoretically, the number of iterations could be large • Typically converges quickly
What to take away • What is meant by datasets, objects, features • Be able to apply k-means clustering!