CS26110 AI Toolbox: Clustering 1
What we’ll look at • Datasets, data points, features, distance • What is clustering? • Partitional clustering • K-means algorithm • Extensions (fuzzy) • Hierarchical clustering • Agglomerative/Divisive • Single-link, complete-link, average-link
Datasets - labelled • Each row is an object/data point • Each column is a feature/symptom/measurement = a dimension • One column is the class feature = the decision/diagnosis
What is clustering? • Clustering: the process of grouping a set of objects into classes of similar objects • Most common form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification (label) for each example is given
Clustering applications • Image segmentation • Social network analysis • Bioinformatics (gene expression) • Medicine • Data mining • Recommender systems • Evolutionary algorithms! (detect niches)
Clustering algorithms • Partitional algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k-means clustering • Model based clustering • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive
Clustering considerations • What does it mean for objects to be similar? • What algorithm and approach do we take? • Partitional • Hierarchical • Do we need a hierarchical arrangement of clusters? • How many clusters? • Can we label or name the clusters? • How do we make it efficient and scalable?
k-means algorithm(s) • Terminology: centroid = a point that is considered to be the center of a cluster • Start by picking k, the number of clusters (and hence centroids) • Initialize clusters by picking one point per cluster (seeds) • E.g., pick data points at random • Could also generate these randomly
Populating clusters Iterate until converged • Compute distance from all data points to all k centroids • For each data point, assign it to the cluster whose current centroid it is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
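Putting the loop together, here is a minimal Python sketch of the procedure (an illustrative implementation, not lecture code; the dist helper and all names are my own):

```python
import random

def dist(a, b):
    # Euclidean distance between two points given as equal-length tuples
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iters=100):
    # Seed: pick one data point per cluster at random
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Assign each point to the cluster whose centroid is nearest
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Replace each centroid with the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids unchanged: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Example: two obvious groups of 2D points
data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 8)]
print(kmeans(data, k=2))
```

Note that the convergence test here is "centroid positions don't change", one of the termination conditions discussed below.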
k-means example (k = 2) [figure: animation of the algorithm on a 2D point set, stepping through pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged, with x marking each centroid]
Measuring distance • Euclidean distance is most often used to determine how far apart points are (although alternatives exist) • For points $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ it is defined as: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
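A quick worked instance in two dimensions: $d((1, 2), (4, 6)) = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5$. In one dimension the formula reduces to the absolute difference $|x - y|$, which is what the exercise below uses.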
Termination conditions • Several possibilities, e.g., • A fixed number of iterations • Centroid positions don’t change (can be proven to converge) • Clusters look reasonable
Exercise • Given the following 1D data: {6, 8, 18, 26, 13, 32, 24} and initial centroids 11 and 20, perform k-means (k = 2) • Iterate until converged: • Compute distance from all data points to all k centroids • For each data point, assign it to the cluster whose current centroid it is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
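The distances in Step 1 below are just absolute differences in 1D; a quick sanity check in Python (variable names are my own):

```python
points = [6, 8, 18, 26, 13, 32, 24]
centroids = [11, 20]
for p in points:
    # distance from point p to each centroid (1D Euclidean = absolute difference)
    print(p, [abs(p - c) for c in centroids])
```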
Step 1 (iteration 1) • Compute distance from all data points to all k centroids:

Point | dist to 11 | dist to 20
6     | 5          | 14
8     | 3          | 12
18    | 7          | 2
26    | 15         | 6
13    | 2          | 7
32    | 21         | 12
24    | 13         | 4
Step 2 (iteration 1) • For each data point, assign it to the cluster whose current centroid it is nearest:

Point | dist to 11 | dist to 20 | assigned to
6     | 5          | 14         | 11
8     | 3          | 12         | 11
18    | 7          | 2          | 20
26    | 15         | 6          | 20
13    | 2          | 7          | 11
32    | 21         | 12         | 20
24    | 13         | 4          | 20
Step 3 (iteration 1) • For each centroid, compute the average (mean) of all points assigned to it: • Centroid 11: points {6, 8, 13}, mean = (6 + 8 + 13) / 3 = 9 • Centroid 20: points {18, 26, 32, 24}, mean = (18 + 26 + 32 + 24) / 4 = 25
Step 4 (iteration 1) • Replace the k centroids with the new averages: 11 becomes 9, 20 becomes 25 • Have we converged? No: the centroids have moved, so we iterate again
Step 1 (iteration 2) • Compute distance from all data points to all k centroids:

Point | dist to 9 | dist to 25
6     | 3         | 19
8     | 1         | 17
18    | 9         | 7
26    | 17        | 1
13    | 4         | 12
32    | 23        | 7
24    | 15        | 1
Step 2 (iteration 2) • For each data point, assign it to the cluster whose current centroid it is nearest:

Point | dist to 9 | dist to 25 | assigned to
6     | 3         | 19         | 9
8     | 1         | 17         | 9
18    | 9         | 7          | 25
26    | 17        | 1          | 25
13    | 4         | 12         | 9
32    | 23        | 7          | 25
24    | 15        | 1          | 25
Step 3 (iteration 2) • For each centroid, compute the average (mean) of all points assigned to it: • Centroid 9: points {6, 8, 13}, mean = 9 • Centroid 25: points {18, 26, 32, 24}, mean = 25
Step 4 (iteration 2) • Replace the k centroids with the new averages: 9 becomes 9, 25 becomes 25 • Have we converged? Yes: the centroids are unchanged, so the algorithm terminates
Final clustering • Cluster 1: {6, 8, 13} • Cluster 2: {18, 24, 26, 32}
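As a sanity check, the whole exercise can be reproduced with a short standalone script (a sketch in plain Python; all names are my own, and the fixed seeds 11 and 20 replace random initialization):

```python
points = [6, 8, 18, 26, 13, 32, 24]
centroids = [11, 20]  # the exercise's initial centroids

while True:
    # Assign each point to the nearest centroid (1D distance = absolute difference)
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Replace each centroid with the mean of its assigned points
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # unchanged: converged
        break
    centroids = new_centroids

print(centroids)  # [9.0, 25.0]
print(clusters)   # [[6, 8, 13], [18, 26, 32, 24]]
```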
Issues • Was k = 2 a good choice for this data? • Were the initial centroids a good choice? [figure: the data points 6, 8, 13, 18, 24, 26, 32 plotted on a number line]
Convergence • Why should the k-means algorithm ever reach a fixed point? • (A state in which clusters don’t change) • k-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm • EM is known to converge • Theoretically, the number of iterations could be large • Typically converges quickly
What to take away • What is meant by datasets, objects, features • Be able to apply k-means clustering!