Cluster Validation Methods in Big Data Analysis

Explore cluster validation methods in big data analysis, including Sum of Squared Error (SSE), statistical frameworks, and external measures of cluster validity. Learn about Hierarchical Clustering and its input/output, comparing partitional and hierarchical clustering techniques.

Presentation Transcript

  1. CSE 482: Big Data Analysis Lecture 13: Clustering II

  2. Outline • Previous lecture • What is clustering? • K-means clustering • Today’s lecture • Cluster validation • Hierarchical clustering

  3. Cluster Validity • For supervised classification we have a variety of measures to evaluate how good our model is • Accuracy, True Positive Rate, etc. • For cluster analysis, how do we evaluate the “goodness” of the resulting clusters? • Challenging because the “clusters are in the eye of the beholder”!

  4. Notion of a Cluster can be Ambiguous • How many clusters? [Figure: the same points shown as Two Clusters, Four Clusters, and Six Clusters]

  5. Clusters can be found even in Random Data [Figure panels: Random Data, K-means, DBSCAN, Complete Link]

  6. Issues in Cluster Validation • Issues • How many clusters are there in the data? • Are the clusters real or are they nothing more than some “accidental” groupings of the data? • What we need • A measure of cluster quality • A statistical approach for testing validity of the clusters

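  7. Sum of Squared Error (SSE) [Figure: Movie Ratings data with cluster Centroids; each instance contributes its squared distance to its centroid, e.g. d(U1, Cluster1)^2]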
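The SSE on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and centroids taken as cluster means; the function name `sse` and the toy data are made up for the example and are not from the lecture.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = X - centroids[labels]   # row i: point i minus the centroid of its cluster
    return float(np.sum(diffs ** 2))

# Toy data (hypothetical): two clusters of two points each.
X = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster is the mean of its members.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
total = sse(X, labels, centroids)
print(total)
```

Every point here sits at distance 1 from its centroid, so each contributes 1 to the total.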

  8. Sum of Squared Error (SSE) Use the “elbow” of the curve to identify the number of clusters
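One way to produce the SSE curve whose “elbow” is inspected: run k-means for several values of k and record the SSE for each. The sketch below is an assumption-laden illustration, not the lecture's code: it uses a small hand-written Lloyd's k-means (`kmeans_sse`, a hypothetical helper) with a few random restarts, on synthetic data with three blobs.

```python
import numpy as np

def kmeans_sse(X, k, n_init=5, n_iter=50, seed=0):
    """Best SSE over n_init random restarts of a basic Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        # Start from k distinct data points (fancy indexing returns a copy).
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):          # leave an empty cluster's centroid unchanged
                    centroids[j] = X[labels == j].mean(axis=0)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, float(d2.min(axis=1).sum()))
    return best

# Synthetic data (assumption): three well-separated blobs of 30 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])])

sse_by_k = {k: kmeans_sse(X, k) for k in range(1, 7)}
for k, value in sse_by_k.items():
    print(k, round(value, 2))
```

Plotting `sse_by_k` against k and looking for the point where the curve flattens gives the elbow; for data like this, the drop is expected to level off around the true number of blobs.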

  9. Sum of Squared Error (SSE) • Another example

  10. SSE • SSE curve for a more complicated data set (harder to identify number of clusters) [Figure: SSE of clusters found using K-means]

  11. Framework for Cluster Validity • Need a statistical framework to interpret a measure • For example, if the measure (SSE or description length) gives a value of 10, does that imply the clusters are good, fair, or poor? • What would be the expected value of the measure if we apply clustering on random data?

  12. Statistical Framework for Cluster Validity • Example • Consider a 2-dimensional data set that contains three well-separated clusters • Total number of data instances is 100 • SSE obtained using k-means is 0.005

  13. Statistical Framework for Cluster Validity • Example • Compare the SSE of 0.005 against the SSE of three clusters found in random data • The histogram shows the SSE for 500 sets of random data points of size 100, with x and y values distributed over the range 0.2 to 0.8 • The results show that an SSE of 0.005 is highly unlikely for randomly generated data, which means the clusters are likely to be real (not just spurious clusters found by k-means)
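The comparison against random data can be sketched as a small Monte Carlo experiment. The code below is an illustration under assumptions: it reuses the numbers stated on the slides (500 random sets, 100 points, range 0.2 to 0.8, three clusters, observed SSE of 0.005), but the k-means helper `kmeans_sse` is hypothetical, and the slides do not say how their SSE was scaled, so the absolute values here need not match the histogram.

```python
import numpy as np

def kmeans_sse(X, k, rng, n_iter=20):
    """SSE of one run of a basic Lloyd's k-means started from k random data points."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):              # keep the old centroid if a cluster is empty
                centroids[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

rng = np.random.default_rng(0)
observed_sse = 0.005                             # value quoted on the slide

# SSE of three k-means clusters on 500 sets of 100 uniformly random points.
random_sse = np.array([
    kmeans_sse(rng.uniform(0.2, 0.8, size=(100, 2)), 3, rng)
    for _ in range(500)
])

# Fraction of random data sets whose SSE is at least as small as the observed one.
p_value = float(np.mean(random_sse <= observed_sse))
print(random_sse.min(), random_sse.max(), p_value)
```

A small `p_value` says that random data rarely clusters as tightly as the observed data, which is the argument the slide makes with its histogram.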

  14. External Measures of Cluster Validity • SSE and description length are examples of internal measures of cluster validity • Ground truth of the clusters is unknown • Sometimes, you may have ground truth clustering information available for some of the instances • So, we can evaluate how well the clustering algorithm performs by comparing the clusters found against the ground truth • Measures that rely on the ground truth to evaluate quality of clustering are called external measures

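  15. External Measure: Rand Index • Ground truth classes: {Y1, Y2, Y3, …, Yc} • Clustering solution: {C1, C2, C3, …, Ck} • Count every pair of instances by whether the two are in the same class or different classes, and in the same cluster or different clusters • Rand index = (# pairs in the same class and the same cluster + # pairs in different classes and different clusters) / (total # pairs)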
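A direct pair-counting sketch of the Rand index follows. It is a minimal illustration (quadratic in the number of instances), and the function name `rand_index` and the label lists are invented for the example.

```python
from itertools import combinations

def rand_index(classes, clusters):
    """Fraction of instance pairs on which the ground truth and the clustering agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        # Agreement: same class and same cluster, or different class and different cluster.
        agree += same_class == same_cluster
        total += 1
    return agree / total

# Same partition with the cluster ids swapped: every pair agrees.
ri_perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Clustering that splits both classes: only 2 of the 6 pairs agree.
ri_mixed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ri_perfect, ri_mixed)
```

Because only pair relationships are compared, the index does not depend on how cluster ids are numbered.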

  16. External Measure: Adjusted Rand Index • Adjust for chance occurrence • Adjusted Rand index = (Rand index - expected Rand index) / (maximum Rand index - expected Rand index)
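The chance adjustment can be sketched with the standard contingency-table form of the adjusted Rand index (index minus its expected value, divided by maximum minus expected value). This is an illustrative implementation, not the lecture's code; the function name is hypothetical, and returning 1.0 in the degenerate case where the denominator is zero is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(classes, clusters):
    """Adjusted Rand index computed from pair counts of the contingency table."""
    n = len(classes)
    # Pairs that share both class and cluster (one term per contingency-table cell).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(classes, clusters)).values())
    # Pairs that share a class, and pairs that share a cluster.
    sum_classes = sum(comb(c, 2) for c in Counter(classes).values())
    sum_clusters = sum(comb(c, 2) for c in Counter(clusters).values())

    expected = sum_classes * sum_clusters / comb(n, 2)   # expected value under chance
    maximum = 0.5 * (sum_classes + sum_clusters)         # largest attainable value
    if maximum == expected:                              # degenerate case (assumed convention)
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_mixed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_perfect, ari_mixed)
```

Unlike the plain Rand index, the adjusted value can be negative when the clustering agrees with the ground truth less often than chance would predict, as in the second call.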

  17. Example

  18. Hierarchical Clustering • Produces nested clusters organized as a tree • Does not assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • The tree may correspond to meaningful taxonomies

  19. Partitional vs Hierarchical Clustering [Figure: Original Points (instances Dog, Human, Cat, Monkey with their Features/Attributes); A Partitional Clustering of the instances; a Hierarchical Clustering of the instances shown as a Dendrogram]

  20. Agglomerative Hierarchical Clustering • Clusters are generated in a bottom-up fashion • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • There are many ways to define proximity between clusters
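The bottom-up loop on this slide can be sketched naively as follows. This is an illustration, not an efficient implementation: instead of updating a cluster-level proximity matrix after each merge, it recomputes cluster proximities from the point-level distance matrix, and it supports only MIN and MAX linkage. The function name `agglomerative` and the 1-D toy data are assumptions.

```python
import numpy as np

def agglomerative(X, linkage="min"):
    """Naive bottom-up clustering; returns merges as (members_a, members_b, distance)."""
    # Point-level proximity matrix (Euclidean distances).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = [[i] for i in range(len(X))]      # let each data point be a cluster
    reduce = np.min if linkage == "min" else np.max
    merges = []
    while len(clusters) > 1:                     # until only a single cluster remains
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce(D[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]       # merge the two closest clusters
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

X = np.array([[0.0], [1.0], [3.0], [7.0]])
merges = agglomerative(X, linkage="min")
for members_a, members_b, dist in merges:
    print(members_a, members_b, dist)
```

With N points there are always N - 1 merges, and the recorded merge distances are what a dendrogram draws as heights.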

  21. Input/Output of Hierarchical Clustering [Figure: instances (Dog, Cat, Human, Monkey) → Compute proximity → Proximity matrix over the instances → Hierarchical Clustering → output tree over Dog, Human, Monkey, Cat]

  22. How to Define Inter-Cluster Proximity • MIN • MAX • Ward’s Method [Figure: Proximity Matrix over points p1, p2, p3, p4, p5, ...]

  23. How to Define Inter-Cluster Proximity • MIN • MAX • Ward’s Method • MIN (single-link): distance between 2 clusters A and B is given by the shortest distance between a data point in A and a data point in B [Figure: Proximity Matrix over points p1, p2, p3, p4, p5, ...]

  24. How to Define Inter-Cluster Proximity • MIN • MAX • Ward’s Method • MAX (complete-link): distance between 2 clusters A and B is given by the largest distance between a data point in A and a data point in B [Figure: Proximity Matrix over points p1, p2, p3, p4, p5, ...]

  25. How to Define Inter-Cluster Proximity • MIN • MAX • Ward’s Method • Ward’s method: distance between 2 clusters A and B is given by the change in the SSE after merging the two clusters, i.e., Distance = SSE(A ∪ B) - [SSE(A) + SSE(B)] [Figure: Proximity Matrix over points p1, p2, p3, p4, p5, ...]
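The three inter-cluster proximity definitions can be written side by side. The sketch below assumes Euclidean distance and clusters given as NumPy arrays of points; the function names (`d_min`, `d_max`, `d_ward`) and the toy clusters are invented for illustration.

```python
import numpy as np

def pairwise(A, B):
    """Matrix of Euclidean distances between every point in A and every point in B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def sse(points):
    """Sum of squared distances of the points to their own centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def d_min(A, B):
    """MIN (single-link): shortest distance between a point in A and a point in B."""
    return float(pairwise(A, B).min())

def d_max(A, B):
    """MAX (complete-link): largest distance between a point in A and a point in B."""
    return float(pairwise(A, B).max())

def d_ward(A, B):
    """Ward's method: increase in SSE caused by merging A and B."""
    return sse(np.vstack([A, B])) - (sse(A) + sse(B))

A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[4.0, 0.0], [4.0, 2.0]])
print(d_min(A, B), d_max(A, B), d_ward(A, B))
```

For these two clusters the merged SSE is 20 and each cluster's own SSE is 2, so Ward's distance is 20 - (2 + 2) = 16.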

  26. MIN or Single Link • Proximity of two clusters is based on the two closest points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph • Example [Figure: Distance Matrix]

  27. Hierarchical Clustering: MIN [Figure: Nested Clusters and Dendrogram for six points]

  28. Strength of MIN • Can handle non-elliptical shapes [Figure: Original Points; Two Clusters]

  29. Limitations of MIN • Sensitive to noise [Figure: Original Points; Two Clusters]

  30. MAX or Complete Linkage • Proximity of two clusters is based on the two most distant points in the different clusters • Determined by all pairs of points in the two clusters [Figure: Distance Matrix]

  31. Hierarchical Clustering: MAX [Figure: Nested Clusters and Dendrogram for six points]

  32. Strength of MAX • Less susceptible to noise [Figure: Original Points; Two Clusters]

  33. Limitations of MAX • Tends to break large clusters [Figure: Original Points; Two Clusters]

  34. Hierarchical Clustering: Ward’s Method • Proximity of two clusters is based on the increase in sum of squared error when two clusters are merged • Less susceptible to noise • Biased towards globular clusters

  35. Hierarchical Clustering: Ward’s Method [Figure: Nested Clusters and Dendrogram for six points]

  36. Hierarchical Clustering: Comparison [Figure: nested clusters for the same six points under MIN, MAX, and Ward’s Method]

  37. Hierarchical Clustering – Details • Hierarchical clustering generates a nested set of clusters from 1 to N clusters (N: number of data instances) • To generate k clusters, you need to “cut” the dendrogram at level k • Computation and storage requirements are more expensive compared to k-means • O(N^3) time and O(N^2) space
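Cutting the hierarchy to obtain exactly k clusters can be sketched with scipy, which offers the `maxclust` criterion in `fcluster` for this purpose. The data below is synthetic (three blobs, an assumption made for the example) and Ward linkage is chosen arbitrarily; any of the linkage methods discussed above could be substituted.

```python
import numpy as np
from scipy.cluster import hierarchy

# Synthetic data (assumption): three well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in ([0, 0], [4, 0], [2, 4])])

# Z has one row per merge: N - 1 rows for N data instances.
Z = hierarchy.linkage(X, method="ward")

# Ask for at most k = 3 flat clusters instead of choosing a threshold by hand.
k = 3
labels = hierarchy.fcluster(Z, t=k, criterion="maxclust")
print(Z.shape, sorted(set(labels)))
```

The linkage matrix is computed once, so flat clusterings for different values of k can be read off the same `Z` without re-running the clustering.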

  38. Python Example

  39. Python Example • Use scipy implementation for hierarchical clustering

  40. Python Example • You can create “flat” clusters from the hierarchy • Need to define a threshold on the dendrogram • The larger the threshold, the fewer the number of clusters obtained • You need to vary the threshold until you get the desired number of clusters
      from scipy.cluster import hierarchy
      Z = hierarchy.linkage(X.to_numpy(), 'complete')  # X is a pandas DataFrame; as_matrix() was removed in pandas 1.0
      threshold = 1.1
      labels = hierarchy.fcluster(Z, threshold)
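The effect of the threshold on the number of flat clusters can be sketched as follows. Two caveats, stated as facts about scipy rather than about the lecture: `fcluster` uses `criterion='inconsistent'` by default, so this sketch passes `criterion='distance'` to make the threshold a height on the dendrogram. The data and the threshold values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy

# Synthetic data (assumption): three well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in ([0, 0], [4, 0], [2, 4])])

Z = hierarchy.linkage(X, method="complete")

# Number of flat clusters obtained when cutting the dendrogram at each height.
counts = {}
for threshold in (0.5, 2.0, 10.0):
    labels = hierarchy.fcluster(Z, t=threshold, criterion="distance")
    counts[threshold] = len(set(labels))
    print(threshold, counts[threshold])
```

Raising the threshold never increases the number of clusters here, which matches the slide's advice to vary the threshold until the desired number is reached.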

  41. Python Example • Check to see the number of clusters given the threshold • Adjusted Rand Index improves with a larger number of clusters
