CSE 482: Big Data Analysis Lecture 13: Clustering II
Outline • Previous lecture • What is clustering? • K-means clustering • Today’s lecture • Cluster validation • Hierarchical clustering
Cluster Validity • For supervised classification, we have a variety of measures to evaluate how good our model is • Accuracy, True Positive Rate, etc. • For cluster analysis, how do we evaluate the “goodness” of the resulting clusters? • Challenging because “clusters are in the eye of the beholder”!
Notion of a Cluster can be Ambiguous • How many clusters? The same set of points can reasonably be grouped into two, four, or six clusters
Clusters can be found even in Random Data • K-means, DBSCAN, and complete-link hierarchical clustering all report clusters when applied to random data
Issues in Cluster Validation • Issues • How many clusters are there in the data? • Are the clusters real or are they nothing more than some “accidental” groupings of the data? • What we need • A measure of cluster quality • A statistical approach for testing validity of the clusters
Sum of Squared Error (SSE) • SSE = Σi Σx∈Ci d(x, ci)², where ci is the centroid of cluster Ci • Example: in the movie-ratings data, a user U1 assigned to Cluster 1 contributes d(U1, Cluster 1)² to the SSE
Sum of Squared Error (SSE) • Plot SSE against the number of clusters and use the “elbow” of the curve to identify the number of clusters
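A minimal sketch of the elbow analysis, assuming scikit-learn and matplotlib are available; the data matrix X below is random placeholder data (not the lecture's movie-ratings example), and KMeans.inertia_ is used as the SSE.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 2)            # placeholder data (stand-in for the real data set)

    sse = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
        sse.append(km.inertia_)           # inertia_ is the SSE of the fitted clustering

    plt.plot(range(1, 11), sse, marker='o')   # look for the "elbow" in this curve
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('SSE')
    plt.show()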
Sum of Squared Error (SSE) • Another example
SSE • SSE curve (for clusters found using K-means) on a more complicated data set, where the number of clusters is harder to identify
Framework for Cluster Validity • Need a statistical framework to interpret a measure • For example, if the measure (SSE or description length) gives a value of 10, does that imply the clusters are good, fair, or poor? • What would be the expected value of the measure if we apply clustering on random data?
Statistical Framework for Cluster Validity • Example • Consider a 2-dimensional data set that contains three well-separated clusters • Total number of data instances is 100 • SSE obtained using k-means is 0.005
Statistical Framework for Cluster Validity • Example • Compare the SSE of 0.005 against the SSE of three clusters found in random data • Histogram of the SSE for 500 random data sets, each with 100 points whose x and y values are uniformly distributed over the range 0.2 – 0.8 • The results show it is highly unlikely that an SSE as low as 0.005 would come from randomly generated data, so the clusters are likely to be real (not just spurious clusters found by k-means)
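A sketch of this comparison under the assumptions stated on the slide (100 points, x and y uniform over 0.2–0.8, 500 random data sets, k = 3); scikit-learn's KMeans.inertia_ is again used as the SSE.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    observed_sse = 0.005                   # SSE obtained on the real data (from the slide)

    random_sses = []
    for _ in range(500):
        R = rng.uniform(0.2, 0.8, size=(100, 2))        # 100 random points in [0.2, 0.8]^2
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(R)
        random_sses.append(km.inertia_)

    # Fraction of random data sets whose SSE is as low as the observed SSE
    p_value = np.mean(np.array(random_sses) <= observed_sse)
    print(p_value)                         # near 0 => the clusters are unlikely to be spurious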
External Measures of Cluster Validity • SSE and description length are examples of internal measures of cluster validity • The ground truth of the clusters is unknown • Sometimes, ground-truth clustering information is available for some of the instances • We can then evaluate how well the clustering algorithm performs by comparing the clusters found against the ground truth • Measures that rely on the ground truth to evaluate the quality of a clustering are called external measures
External Measure: Rand Index • Ground truth classes: {Y1, Y2, Y3, …, Yc}; Clustering solution: {C1, C2, C3, …, Ck} • For every pair of instances, check whether the two instances fall in the same class (ground truth) and whether they fall in the same cluster • Rand index = (# pairs in the same class and same cluster + # pairs in different classes and different clusters) / (total # pairs)
External Measure: Adjusted Rand Index • Adjusts the Rand index for chance agreement • Adjusted Rand index = (Rand index − Expected Rand index) / (Maximum Rand index − Expected Rand index)
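A short sketch using scikit-learn's adjusted_rand_score; the label vectors below are made-up examples, not the lecture's data.

    from sklearn.metrics import adjusted_rand_score

    y_true    = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # ground-truth classes (made-up)
    y_cluster = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # labels produced by a clustering algorithm

    print(adjusted_rand_score(y_true, y_cluster))
    # 1.0 = perfect agreement; values near 0 = agreement expected by chance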
Hierarchical Clustering • Produces nested clusters organized as a tree • Does not assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • The tree may correspond to meaningful taxonomies
Partitional vs Hierarchical Clustering Instance Features/Attributes Dog Dog Human Cat Cat Monkey Human Monkey Original Points A Partitional Clustering Dog Human Cat Monkey Dog Monkey Human Cat Dendrogram Hierarchical Clustering
Agglomerative Hierarchical Clustering • Clusters are generated in a bottom-up fashion • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • There are many ways to define proximity between clusters
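A minimal sketch of this bottom-up loop, assuming MIN (single-link) proximity and a small hand-made distance matrix; the function name and data are illustrative only, not the lecture's implementation.

    import numpy as np

    def agglomerative_single_link(D):
        # D: symmetric (n x n) distance matrix; returns the sequence of merges
        clusters = [{i} for i in range(len(D))]
        merges = []
        while len(clusters) > 1:
            # Find the two closest clusters (shortest distance between any member pair)
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            merges.append((clusters[a], clusters[b], d))
            clusters[a] = clusters[a] | clusters[b]   # merge the two closest clusters
            del clusters[b]                           # proximities are recomputed on the next pass
        return merges

    D = np.array([[0.0, 0.2, 0.9, 1.0],
                  [0.2, 0.0, 0.8, 0.7],
                  [0.9, 0.8, 0.0, 0.3],
                  [1.0, 0.7, 0.3, 0.0]])
    print(agglomerative_single_link(D))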
Input/Output of Hierarchical Clustering • Input: data instances (e.g., Dog, Cat, Human, Monkey), from which a proximity matrix is computed • Output: a hierarchy of nested clusters, shown as a dendrogram
How to Define Inter-Cluster Proximity • Given the proximity matrix over data points p1, p2, p3, …, how do we measure the proximity between two clusters? • Options: MIN, MAX, Ward’s Method
How to Define Inter-Cluster Proximity • MIN (single-link): the distance between two clusters A and B is the shortest distance between a data point in A and a data point in B
How to Define Inter-Cluster Proximity • MAX (complete-link): the distance between two clusters A and B is the largest distance between a data point in A and a data point in B
How to Define Inter-Cluster Proximity • Ward’s method: the distance between two clusters A and B is the increase in SSE after merging the two clusters, i.e., Distance = SSE(A ∪ B) − [SSE(A) + SSE(B)]
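A small sketch computing all three proximities between two hand-made clusters A and B; the point coordinates are arbitrary illustrations, not data from the lecture.

    import numpy as np
    from itertools import product

    A = np.array([[0.0, 0.0], [0.1, 0.1]])      # cluster A (made-up points)
    B = np.array([[1.0, 1.0], [1.2, 0.9]])      # cluster B (made-up points)

    pair_dists = [np.linalg.norm(a - b) for a, b in product(A, B)]
    d_min = min(pair_dists)                     # MIN (single-link) proximity
    d_max = max(pair_dists)                     # MAX (complete-link) proximity

    def sse(points):
        centroid = points.mean(axis=0)
        return ((points - centroid) ** 2).sum()

    d_ward = sse(np.vstack([A, B])) - (sse(A) + sse(B))   # Ward: increase in SSE after merging
    print(d_min, d_max, d_ward)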
MIN or Single Link • Proximity of two clusters is based on the two closest points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph
Hierarchical Clustering: MIN • Nested clusters and the corresponding dendrogram
Strength of MIN • Can handle non-elliptical shapes
Limitations of MIN • Sensitive to noise
MAX or Complete Linkage • Proximity of two clusters is based on the two most distant points in the different clusters • Determined by all pairs of points in the two clusters
Hierarchical Clustering: MAX • Nested clusters and the corresponding dendrogram
Strength of MAX • Less susceptible to noise
Limitations of MAX • Tends to break large clusters
Hierarchical Clustering: Ward’s Method • Proximity of two clusters is based on the increase in sum of squared error when two clusters are merged • Less susceptible to noise • Biased towards globular clusters
Hierarchical Clustering: Ward’s Method • Nested clusters and the corresponding dendrogram
Hierarchical Clustering: Comparison • Nested clusters produced by MIN, MAX, and Ward’s Method on the same data
Hierarchical Clustering – Details • Hierarchical clustering generates a nested set of clusters, ranging from 1 to N clusters (N: number of data instances) • To generate k clusters, you need to “cut” the dendrogram at level k • Computation and storage requirements are more expensive compared to k-means • O(N³) time and O(N²) space
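A sketch of cutting the hierarchy into exactly k flat clusters with scipy's fcluster and the 'maxclust' criterion (this anticipates the scipy example on the next slides); the data here is random placeholder data.

    import numpy as np
    from scipy.cluster import hierarchy

    X = np.random.rand(50, 2)                     # placeholder data
    Z = hierarchy.linkage(X, 'complete')          # build the full hierarchy once

    k = 4
    labels = hierarchy.fcluster(Z, t=k, criterion='maxclust')   # cut into exactly k clusters
    print(np.unique(labels))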
Python Example • Use the scipy implementation of hierarchical clustering
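The slide's code is not reproduced here, so this is only a hedged sketch of typical scipy usage (build the linkage matrix, then plot the dendrogram); the data matrix X is a placeholder.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster import hierarchy

    X = np.random.rand(20, 2)                 # placeholder for the lecture's data matrix

    Z = hierarchy.linkage(X, 'complete')      # other linkage options: 'single', 'ward', ...

    hierarchy.dendrogram(Z)                   # vertical axis shows the merge distances
    plt.show()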
Python Example • You can create “flat” clusters from the hierarchy • You need to define a threshold on the dendrogram: the larger the threshold, the fewer clusters you obtain • Vary the threshold until you get the desired number of clusters

    from scipy.cluster import hierarchy

    Z = hierarchy.linkage(X.to_numpy(), 'complete')   # X is assumed to be a pandas DataFrame
    threshold = 1.1
    labels = hierarchy.fcluster(Z, threshold, criterion='distance')   # cut the dendrogram at this distance
Python Example • Check the number of clusters obtained for the chosen threshold • In this example, the Adjusted Rand Index improves as the number of clusters increases
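A sketch of that check, assuming random placeholder data and placeholder ground-truth labels; with the real data you would substitute the lecture's data matrix and its known classes.

    import numpy as np
    from scipy.cluster import hierarchy
    from sklearn.metrics import adjusted_rand_score

    X = np.random.rand(30, 2)                     # placeholder data
    y_true = np.random.randint(0, 3, size=30)     # placeholder ground-truth labels

    Z = hierarchy.linkage(X, 'complete')
    for threshold in [1.5, 1.1, 0.7, 0.3]:
        labels = hierarchy.fcluster(Z, threshold, criterion='distance')
        k = len(np.unique(labels))                # number of clusters at this threshold
        print(threshold, k, adjusted_rand_score(y_true, labels))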