
Topic 6 Clustering and Unsupervised Learning


Presentation Transcript


  1. Topic 6: Clustering and Unsupervised Learning. Credits: Padhraic Smyth lecture notes; Hand et al., Chapter 9; David Madigan lecture notes. Data Mining - Volinsky - 2011 - Columbia University

  2. Clustering Outline • Introduction to Clustering • Distance measures • k-means clustering • hierarchical clustering • probabilistic clustering

  3. Clustering • “automated detection of group structure in data” • Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups • descriptive technique (contrast with predictive) • Identify “natural” groups of data objects - qualitatively describe groups of the data • often useful, if a bit reductionist • for real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space • Also called unsupervised learning

  4. Clustering

  5. What is Cluster Analysis? • A good cluster analysis produces clusters whose objects are • similar (close) to one another within the same cluster • dissimilar (far) from the objects in other clusters • In other words • high intra-cluster similarity (low intra-cluster variance) • low inter-cluster similarity (high inter-cluster variance)

  6. Example

  7. Example

  8. Why is Clustering useful? • “Discovery” of new knowledge from data • Contrast with supervised classification (where labels are known) • Can be very useful for summarizing large data sets • For large n and/or high dimensionality • Applications of clustering • WWW • Clustering of documents produced by a search engine (Google News) • Customer segmentation • Spatial data analysis • geographical clusters of events: cancer rates, sales, etc. • Clustering of genes with similar expression profiles • many more

  9. General Issues in Clustering • No golden truth! • answer is often subjective • Cluster representation: • What types or “shapes” of clusters are we looking for? What defines a cluster? • Other issues • The distance function D[x(i),x(j)] is a critical aspect of clustering, both • distance of individual pairs of objects • distance of individual objects from clusters • How is K selected?

  10. Clustering Outline • Introduction to Clustering • Distance measures • k-means clustering • hierarchical clustering • probabilistic clustering

  11. Distance Measures • In order to cluster, we need some kind of “distance” between points. • Sometimes distances are not obvious, but we can create them

  12. Some Distances • Euclidean distance (L2): d(x,y) = sqrt( Σi (xi - yi)² ) • The most common notion of “distance.” • Manhattan distance (L1): d(x,y) = Σi |xi - yi| • distance if you had to travel along coordinates only.

  13. Examples of Euclidean Distances • x = (5,5), y = (9,8) • L2: dist(x,y) = sqrt(4² + 3²) = 5 • L1: dist(x,y) = 4 + 3 = 7
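
A minimal Python sketch (assuming numpy; not part of the original slides) reproducing the worked example above:

    import numpy as np

    x = np.array([5, 5])
    y = np.array([9, 8])

    l2 = np.sqrt(np.sum((x - y) ** 2))   # Euclidean: sqrt(4^2 + 3^2) = 5.0
    l1 = np.sum(np.abs(x - y))           # Manhattan: 4 + 3 = 7
    print(l2, l1)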

  14. Non-Euclidean Distances • Some observations are not appropriate for Euclidean distance: • Binary vectors: 10011 vs. 11000 • Strings: “statistics” vs. “sadistic” • Ordinal variables: “M.S.” vs. “B.A.” • Categorical: blue vs. green • How to calculate distances for variables like these?

  15. Distances for Binary Vectors • A=101110; B=100111 • Hamming distance: # of changes to get from A to B • Hamming(A,B) = 2 • Can be normalized by length of string: 2/6 • Jaccard similarity: intersection over union • Intersection: # of 1s in common = 3 • Union: # of positions with at least one 1 = 5 • Jaccard similarity = 3/5 • Jaccard distance = 1 - 3/5 = 2/5 • Both of these are metrics => satisfy the triangle inequality
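
A small sketch (plain Python, no libraries assumed) reproducing the binary-vector example A=101110, B=100111:

    A = "101110"
    B = "100111"

    # Hamming distance: count positions where the strings differ
    hamming = sum(a != b for a, b in zip(A, B))          # 2
    normalized_hamming = hamming / len(A)                # 2/6

    # Jaccard: intersection over union of the positions holding a 1
    ones_A = {i for i, a in enumerate(A) if a == "1"}
    ones_B = {i for i, b in enumerate(B) if b == "1"}
    jaccard_similarity = len(ones_A & ones_B) / len(ones_A | ones_B)   # 3/5
    jaccard_distance = 1 - jaccard_similarity                          # 2/5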

  16. Cosine Distance (similarity) • Think of a point as a vector from the origin (0,0,…,0) to its location. • Two points’ vectors make an angle; the cosine of this angle is a measure of similarity • Recall cos(0°) = 1; cos(90°) = 0 • Also: the cosine is the normalized dot product of the vectors: cos(θ) = p1·p2 / (||p1|| ||p2||) • Example: p1 = 00111; p2 = 10011 • cos(θ) = 2/3; θ is about 48 degrees
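
A minimal sketch (assuming numpy) of the cosine example p1 = 00111, p2 = 10011:

    import numpy as np

    p1 = np.array([0, 0, 1, 1, 1])
    p2 = np.array([1, 0, 0, 1, 1])

    # normalized dot product
    cos_sim = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))   # 2/3
    theta = np.degrees(np.arccos(cos_sim))                          # about 48 degrees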

  17. Cosine-Measure Diagram • [figure: vectors p1 and p2 separated by angle θ; the projection of p1 onto p2 has length p1·p2 / ||p2||]

  18. Edit Distance for strings • Edit distance for strings: the number of inserts and deletes of characters needed to turn one string into the other. • Equivalently: d(x,y) = |x| + |y| - 2|LCS(x,y)| • LCS = longest common subsequence = longest string obtained both by deleting from x and by deleting from y.

  19. Example • x = statistics; y = sadistic. • Turn x into y by deleting t, deleting t, then inserting d, and deleting s. • Edit-distance = 4. • Or, LCS(x,y) = saistic. • |x| + |y| - 2|LCS(x,y)| = 10+8-14=4.
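
A short sketch (plain Python dynamic programming, not from the slides) checking the identity d(x,y) = |x| + |y| - 2|LCS(x,y)| on this example:

    def lcs_length(x, y):
        # dp[i][j] = length of the longest common subsequence of x[:i] and y[:j]
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i, a in enumerate(x, 1):
            for j, b in enumerate(y, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(x)][len(y)]

    x, y = "statistics", "sadistic"
    print(lcs_length(x, y))                        # 7, i.e. "saistic"
    print(len(x) + len(y) - 2 * lcs_length(x, y))  # 4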

  20. Categorical Variables • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green • Method 1: Simple matching • m: # of matches, p: total # of variables • distance = (p - m)/p • Distance(Alice, Craig) = (4 - 3)/4 = 1/4
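
A minimal sketch of the simple matching distance; the Alice/Craig records below are hypothetical stand-ins for the slide's table, which is not reproduced here:

    def simple_matching_distance(a, b):
        p = len(a)                                  # total number of variables
        m = sum(ai == bi for ai, bi in zip(a, b))   # number of matching variables
        return (p - m) / p

    alice = ("red", "M.S.", "tennis", "NY")   # hypothetical values
    craig = ("red", "M.S.", "tennis", "LA")   # hypothetical values
    print(simple_matching_distance(alice, craig))   # (4 - 3) / 4 = 0.25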

  21. Ordinal Variables • An ordinal variable can be discrete or continuous • order is important, e.g., rank • Pretend they are interval-scaled • replace x_if by its rank r_if in {1, …, M_f} • map the range of each variable onto [0, 1]: z_if = (r_if - 1) / (M_f - 1) • compute the dissimilarity using Euclidean or another distance
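
A small sketch (assuming the ordinal levels are listed from lowest to highest) of the rank-then-rescale step z_if = (r_if - 1) / (M_f - 1):

    levels = ["B.A.", "M.S.", "Ph.D."]                     # assumed ordering
    rank = {level: r for r, level in enumerate(levels, 1)}

    def scale(value):
        M = len(levels)
        return (rank[value] - 1) / (M - 1)                 # maps onto [0, 1]

    print(scale("B.A."), scale("M.S."), scale("Ph.D."))    # 0.0 0.5 1.0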

  22. Clustering Methods • Enough about distances! • Now we have a matrix (n x n) of distances. • Two major types of clustering algorithms: • partitioning • partitions the set into clusters with defined boundaries • place each point in its nearest cluster • hierarchical • agglomerative: each point starts in its own cluster; iteratively combine • divisive: all data start in one cluster; iteratively dissect

  23. Clustering Outline • Introduction to Clustering • Distance measures • k-means clustering • Hierarchical clustering • Probabilistic clustering

  24. k-Means Algorithm(s) • Assumes Euclidean space. • Start by picking k, the number of clusters. • Initialize clusters by picking one point per cluster. • typically, k random points

  25. K-Means Algorithm • Arbitrarily select K objects from the data (e.g., K customers) to be the cluster centers • For each of the remaining objects: assign each object to the cluster whose center it is closest to

  26. K-Means Algorithm • Then repeat the following 3 steps until the clusters converge (no change in clusters): • 1. Compute the new center of each current cluster.

  27. K-Means Algorithm • 2. Assign each object to the cluster whose center it is closest to. • 3. Go back to Step 1, or stop if the centers do not change.

  28. The K-Means Clustering Method • Example

  29. K-means • Decide on the number of clusters (e.g., K=5) • (Example is courtesy of Andrew Moore, CMU)

  30. K-means • Decide on the number of clusters (e.g., K=5) • Randomly guess K cluster Center locations

  31. K-means • Decide on the number of clusters (e.g., K=5) • Randomly guess K cluster Center locations • Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

  32. K-means • Decide on the number of clusters (e.g., K=5) • Randomly guess K cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns

  33. K-means • Decide on the number of clusters (e.g., K=5) • Randomly guess K cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns • New Centers => new boundaries • Repeat until no change
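
The slides above walk through the algorithm in pictures; below is a minimal numpy sketch of the same loop (assign each point to its nearest center, recompute centroids, stop when nothing changes). It is an illustration of the idea, not the code behind the figures, and it does not handle empty clusters.

    import numpy as np

    def kmeans(X, centers, max_iter=100):
        """X is an (n, p) array; centers is the (k, p) array of initial guesses."""
        for _ in range(max_iter):
            # each datapoint finds the center it is closest to
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # each center moves to the centroid of the points it owns
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    for j in range(len(centers))])
            # stop if the centers do not change
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers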

  34. K-Means Example • Given: {2,4,10,12,3,20,30,11,25}, k=2 • Randomly assign means: m1=3, m2=4 • Solve for the rest …

  35. K-Means Example • Given: {2,4,10,12,3,20,30,11,25}, k=2 • Randomly assign means: m1=3, m2=4 • K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16 • K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18 • K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6 • K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25 • Stop, as the clusters with these means are the same.
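
Running the sketch above on this one-dimensional example with the slide's starting means reproduces the final clusters and means (m1=7, m2=25):

    X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)
    init = np.array([[3.0], [4.0]])        # the slide's starting means m1, m2
    labels, centers = kmeans(X, init)
    print(centers.ravel())                 # [ 7. 25.]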

  36. Getting k Right • [plot: average distance to centroid vs. k; the best value of k is where the curve levels off] • Hard! Often done subjectively (by feel) • Try different k, looking at the change in the average distance to centroid as k increases. • Looking for a balance between within-cluster variance and between-cluster variance. • Calinski index = [B(k)/(k-1)] / [W(k)/(n-k)], where B(k) and W(k) are the between- and within-cluster sums of squares for k clusters • The average falls rapidly until the right k, then changes little.
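
A minimal sketch (assuming scikit-learn; synthetic data, not the slide's) of scanning k and looking at both the average (RMS) distance to centroid and the Calinski index:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

    for k in range(2, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        rms_dist = np.sqrt(km.inertia_ / len(X))        # falls rapidly until the "right" k
        ch = calinski_harabasz_score(X, km.labels_)     # larger is better
        print(k, round(rms_dist, 3), round(ch, 1))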

  37. Comments on the K-Means Method • Strength • Relatively efficient: Easy to implement - often comes up with good, if not best, solutions • intuitive • Weakness • Need to specify k, the number of clusters, in advance • Unable to handle noisy data and outliers • Not suitable to discover clusters with non-convex shapes • Quite sensitive to initial starting points - will find a local optimum. Do it several times and see how much the results change

  38. Variations on k-means • Make it more robust by using k-modes or k-medoids • K-medoids: medoids are the most centrally located objects in a cluster. • Make the initialization better • Take a small random sample and cluster it to find a starting point • Pick k points on a grid • Do several runs with different starting points

  39. Clustering Outline • Introduction to Clustering • Distance measures • k-means clustering • Hierarchical clustering • Probabilistic clustering

  40. Simple example of hierarchical clustering

  41. Hierarchical Clustering • [diagram: agglomerative clustering merges points a, b, c, d, e step by step (steps 0–4) into a single cluster; divisive clustering runs the same steps in reverse] • Does not require the number of clusters k as an input. • Two extremes • All data in one cluster • Each data point in its own cluster

  42. Hierarchical Clustering • Representation: tree of nested clusters • Greedy algorithm • Find two most similar points • Join them • Repeat • Can also run backwards • divisive • Effective visualization via “dendrograms” • shows nesting structure • merges or splits = tree nodes • Algorithm requires a distance metric for distance between clusters, or between a point and a cluster

  43. Distances Between Clusters • Single Link: • smallest distance between points • nearest neighbor • can be outlier sensitive • Complete Link: • largest distance between points • enforces “compactness” • Average Link: • mean - gets average behavior • centroid - more robust • Ward’s measure • Merge clusters that minimize the increase in within-cluster distances: • D(i,j) = SS(Ci∪j) - SS(Ci) - SS(Cj)
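
A minimal sketch (assuming scipy; synthetic two-cluster data) of agglomerative clustering with the different linkage options listed above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    for method in ("single", "complete", "average", "ward"):
        Z = linkage(X, method=method)        # Z encodes the sequence of merges
        print(method, "height of final merge:", round(Z[-1, 2], 2))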

  44. Dendrograms • Cut the dendrogram at the desired level; each connected component then forms a cluster. • [dendrogram over points A–I; one cut yields the clusters {A,B,C}, {D,E,F}, {G}, {H,I}]
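
A small sketch (assuming scipy) of cutting the tree at a chosen height; fcluster returns the connected components as flat cluster labels:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0, 0], [0, 1], [1, 0],    # a tight group
                  [5, 5], [5, 6],            # a second group
                  [10, 0]], dtype=float)     # a lone point
    Z = linkage(X, method="single")
    labels = fcluster(Z, t=2.0, criterion="distance")   # cut the dendrogram at height 2
    print(labels)                                        # e.g. [1 1 1 2 2 3]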

  45. Old Faithful data

  46. Plots can be drawn so that the height of each cross-bar shows the change in within-cluster SS

  47. Dendrogram Using Single-Link Method • Old Faithful eruption duration vs. wait data • Notice how single-link tends to “chain”. • Dendrogram y-axis = crossbar’s distance score

  48. Dendrogram Using Ward’s SSE Distance • More balanced than single-link. • Old Faithful eruption duration vs. wait data

  49. Hierarchical Clustering • Pros • don’t have to specify k beforehand • visual representation of various cluster characteristics from the dendrogram • Cons • different linkage options give very different results

  50. Clustering Outline • Introduction to Clustering • Distance measures • k-means clustering • Hierarchical clustering • Probabilistic clustering
