CS26110 AI Toolbox


Presentation Transcript


  1. CS26110 AI Toolbox Clustering 1

  2. What we’ll look at • Datasets, data points, features, distance • What is clustering? • Partitional clustering • K-means algorithm • Extensions (fuzzy) • Hierarchical clustering • Agglomerative/Divisive • Single-link, complete-link, average-link

  3. Datasets - labelled • Each row is an object/data point • Each column is a feature/symptom/measurement (= dimensions) • A labelled dataset also has a class feature (= decision/diagnosis)
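For instance, a labelled dataset in the style of the Iris data used on the following slides (illustrative values, not taken from the slides):

    Sepal length | Sepal width | Petal length | Petal width | Species (class)
        5.1      |     3.5     |      1.4     |     0.2     | setosa
        7.0      |     3.2     |      4.7     |     1.4     | versicolor
        6.3      |     3.3     |      6.0     |     2.5     | virginica

Each row is one flower (a data point), the first four columns are its features (dimensions), and Species is the class feature.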

  4. Datasets - unlabelled

  5. Importance of visualisation

  6. Example: Iris dataset

  7. Example: Iris dataset

  8. Example: Iris dataset

  9. Example: Iris dataset

  10. Distance between data points

  11. What is clustering? • Clustering: the process of grouping a set of objects into classes of similar objects • Most common form of unsupervised learning • Unsupervised learning = learning from raw, unlabelled data, as opposed to supervised learning, where a classification of the examples is given

  12. Clustering

  13. Clustering

  14. Clustering applications • Image segmentation • Social network analysis • Bioinformatics (gene expression) • Medicine • Data mining • Recommender systems • Evolutionary algorithms! (detect niches)

  15. Clustering algorithms • Partitional algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k-means clustering • Model based clustering • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive

  16. Clustering considerations • What does it mean for objects to be similar? • What algorithm and approach do we take? • Partitional • Hierarchical • Do we need a hierarchical arrangement of clusters? • How many clusters? • Can we label or name the clusters? • How do we make it efficient and scalable?

  17. k-means algorithm(s) • Terminology: centroid = a point that is considered to be the center of a cluster • Start by picking k, the number of clusters (centroids) • Initialize clusters by picking one point per cluster (seeds) • E.g., pick data points at random • Could also generate these randomly

  18. Populating clusters • Iterate until converged: • Compute distance from all data points to all k centroids • For each data point, assign it to the cluster to whose current centroid it is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages
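A minimal sketch of this loop in Python (the function kmeans and all names in it are illustrative, not from the slides; points are assumed to be tuples, and math.dist computes the Euclidean distance covered on slide 20):

    import math
    import random

    def kmeans(points, k, seeds=None, max_iter=100):
        """Basic k-means over a list of points given as tuples, e.g. (x, y)."""
        # Initialise centroids from the given seeds, or pick k data points at random
        centroids = list(seeds) if seeds else random.sample(points, k)
        for _ in range(max_iter):
            # Assign each point to the cluster with the nearest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Recompute each centroid as the mean of its assigned points
            # (an empty cluster keeps its old centroid)
            new_centroids = [
                tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:  # converged: no centroid moved
                break
            centroids = new_centroids
        return centroids, clusters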

  19. k-means example (k = 2) [figure: 2D scatter of points showing the iteration: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]

  20. Measuring distance • Euclidean distance is most often used to determine how far points are from each other (although alternatives exist) • For n-dimensional points x and y it is defined as: d(x, y) = √((x₁ − y₁)² + … + (xₙ − yₙ)²)
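Written out directly (equivalent to the math.dist call used in the sketch above):

    def euclidean(x, y):
        # square root of the sum of squared coordinate-wise differences
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5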

  21. Distance between data points

  22. Termination conditions • Several possibilities, e.g., • A fixed number of iterations • Centroid positions don’t change (can be proven to converge) • Clusters look reasonable
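In code, the "centroid positions don't change" condition is often relaxed to "no centroid moved more than a small tolerance"; a hypothetical variant of the exact-equality test in the sketch above:

    def converged(old, new, tol=1e-9):
        # True when every centroid has shifted by less than tol
        return all(math.dist(o, n) < tol for o, n in zip(old, new))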

  23. Exercise • Given the following 1D data: {6, 8, 18, 26, 13, 32, 24} and initial centroids 11 and 20, perform k-means (k = 2) • Iterate until converged: • Compute distance from all data points to all k centroids • For each data point, assign it to the cluster to whose current centroid it is nearest • For each centroid, compute the average (mean) of all points assigned to it • Replace the k centroids with the new averages

  24. Step 1 Compute distance from all data points to all k centroids:

      Point | Distance to 11 | Distance to 20
        6   |       5        |       14
        8   |       3        |       12
       18   |       7        |        2
       26   |      15        |        6
       13   |       2        |        7
       32   |      21        |       12
       24   |      13        |        4

  25. Step 2 For each data point, assign it to the cluster to whose current centroid it is nearest:

      Point | Distance to 11 | Distance to 20 | Assigned to
        6   |       5        |       14       |     11
        8   |       3        |       12       |     11
       18   |       7        |        2       |     20
       26   |      15        |        6       |     20
       13   |       2        |        7       |     11
       32   |      21        |       12       |     20
       24   |      13        |        4       |     20

  26. Step 3 For each centroid, compute the average (mean) of all points assigned to it:

      Centroid 11: mean(6, 8, 13) = 27/3 = 9
      Centroid 20: mean(18, 26, 32, 24) = 100/4 = 25

  27. Step 4 Replace the k centroids with the new averages: 11 becomes 9, 20 becomes 25. Have we converged?

  28. Step 1 Compute distance from all data points to all k centroids:

      Point | Distance to 9 | Distance to 25
        6   |       3       |       19
        8   |       1       |       17
       18   |       9       |        7
       26   |      17       |        1
       13   |       4       |       12
       32   |      23       |        7
       24   |      15       |        1

  29. Step 2 For each data point, assign it to the cluster to whose current centroid it is nearest:

      Point | Distance to 9 | Distance to 25 | Assigned to
        6   |       3       |       19       |      9
        8   |       1       |       17       |      9
       18   |       9       |        7       |     25
       26   |      17       |        1       |     25
       13   |       4       |       12       |      9
       32   |      23       |        7       |     25
       24   |      15       |        1       |     25

  30. Step 3 For each centroid, compute the average (mean) of all points assigned to it:

      Centroid 9: mean(6, 8, 13) = 9
      Centroid 25: mean(18, 26, 32, 24) = 25

  31. Step 4 Replace the k centroids with the new averages: 9 becomes 9, 25 becomes 25. Have we converged?

  32. Final clustering: {6, 8, 13} around centroid 9 and {18, 24, 26, 32} around centroid 25 [figure: the seven points on a number line, grouped into the two clusters]
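As a quick check, the kmeans sketch from slide 18 reproduces this clustering when seeded with the exercise's initial centroids (1D points written as 1-tuples):

    data = [(6,), (8,), (18,), (26,), (13,), (32,), (24,)]
    centroids, clusters = kmeans(data, k=2, seeds=[(11,), (20,)])
    print(centroids)  # [(9.0,), (25.0,)]
    print(clusters)   # [[(6,), (8,), (13,)], [(18,), (26,), (32,), (24,)]]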

  33. Issues • Was k = 2 a good choice for this data? • Were the initial centroids a good choice? (Data: 6, 8, 13, 18, 24, 26, 32)

  34. Convergence • Why should the k-means algorithm ever reach a fixed point? • (A state in which the clusters don't change) • k-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm • EM is known to converge • Theoretically, the number of iterations could be large • In practice, it typically converges quickly

  35. What to take away • What is meant by datasets, objects, features • Be able to apply k-means clustering!
