Estimating the number of data clusters via the Gap statistic

Robert Tibshirani, Guenther Walther and Trevor Hastie
J. R. Statist. Soc. B (2001), 63, pp. 411--423

Cluster Analysis
• Finding groups in data
• No training data needed – Unsupervised
• Major challenge – estimation of the optimal number of clusters

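Wk -- Measure of compactness of clusters
• Suppose we have clustered the data into k clusters, with Cr denoting the indices of observations in cluster r, and nr = |Cr|
• Let $D_r = \sum_{i, i' \in C_r} d_{ii'}$ be the sum of the pairwise distances within cluster r, and $W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$
• If dist is the squared Euclidean distance, then Wk is the pooled within-cluster sum of squares around the cluster means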
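The definition of Wk can be checked numerically. The sketch below assumes the paper's definition, Wk = sum over clusters of D_r / (2 n_r), with D_r the sum of squared Euclidean distances over all ordered pairs in cluster r, and compares it with the within-cluster sum of squares around the cluster means. The function names and the toy data are illustrative only.

```python
import numpy as np

def within_dispersion(X, labels):
    """W_k = sum_r D_r / (2 n_r), with d = squared Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        # D_r: squared distances summed over all ordered pairs (i, i') in C_r
        diffs = pts[:, None, :] - pts[None, :, :]
        d_r = np.sum(diffs ** 2)
        total += d_r / (2.0 * len(pts))
    return total

def within_ss(X, labels):
    """Pooled within-cluster sum of squares around the cluster means."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return sum(
        np.sum((X[labels == r] - X[labels == r].mean(axis=0)) ** 2)
        for r in np.unique(labels)
    )

# Toy data (hypothetical): two clusters in 2-D
X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [7.0, 5.0], [6.0, 8.0]])
labels = np.array([0, 0, 1, 1, 1])
print(within_dispersion(X, labels), within_ss(X, labels))
```

Both functions should return the same value on any labelling, which is why Wk can be computed from cluster means rather than from all pairwise distances.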

Using Wk to determine # clusters

Elbow
• Wk decreases monotonically as the number of clusters k increases
• But from some k on, the decrease flattens markedly
• Such an “elbow” indicates the appropriate number of clusters

The Gap statistic
• Standardize the graph of log(Wk) by comparing it to its expectation under an appropriate null reference distribution of the data: $\mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\} - \log(W_k)$
• E*n denotes expectation under a sample of size n from the reference distribution

Reference distribution
• Adopt a null model of a single component and reject it in favor of a k component model (k>1)
• Two choices for the reference distribution
• (a) Generate each reference feature uniformly in the range of the observed values for that feature
• (b) Generate the reference features from a uniform distribution over a box aligned with the principal components of the data
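A minimal sketch of the two reference distributions follows. The principal-component version assumes the usual recipe: center the data, rotate it with the right singular vectors from an SVD, draw uniformly over the range of each rotated column, then rotate back. Function names and the example data are hypothetical.

```python
import numpy as np

def reference_feature_box(X, rng):
    """Uniform over the observed range of each feature."""
    X = np.asarray(X, dtype=float)
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def reference_pc_box(X, rng):
    """Uniform over a box aligned with the principal components of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the columns
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xp = Xc @ Vt.T                                 # coordinates along principal axes
    Zp = rng.uniform(Xp.min(axis=0), Xp.max(axis=0), size=Xp.shape)
    return Zp @ Vt + mean                          # rotate back and restore the mean

# Example data (hypothetical): a correlated 2-D Gaussian cloud
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
Z_a = reference_feature_box(X, rng)
Z_b = reference_pc_box(X, rng)
```

For strongly correlated features, the axis-aligned box covers large empty corners that the rotated box avoids, which is the motivation for the second choice.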

Align with feature axes
[Figure: observations and Monte Carlo simulations inside a bounding box aligned with the feature axes]

Align with principal axes
[Figure: observations and Monte Carlo simulations inside a bounding box aligned with the principal axes]

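Computation of the Gap statistic
• Cluster the observed data, varying the total number of clusters from k = 1, 2, …, K, giving within-cluster dispersion measures Wk, k = 1, 2, …, K
• Generate B reference datasets, using one of the uniform prescriptions, and cluster each one, giving W*kb, b = 1, 2, …, B, k = 1, 2, …, K. Compute the (estimated) Gap statistic: $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b} \log(W^*_{kb}) - \log(W_k)$
• Let $\bar{l} = \frac{1}{B} \sum_{b} \log(W^*_{kb})$, compute the standard deviation $sd_k = \left[ \frac{1}{B} \sum_{b} \{\log(W^*_{kb}) - \bar{l}\}^2 \right]^{1/2}$, and define $s_k = sd_k \sqrt{1 + 1/B}$. Finally find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$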
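The procedure can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' code: the clusterer is a basic Lloyd's k-means written inline, the reference data use the simpler uniform-over-feature-ranges prescription, the selection rule is "smallest k with Gap(k) >= Gap(k+1) - s_{k+1}", and returning K when no k satisfies the rule is a convention chosen here. All function names and defaults are hypothetical.

```python
import numpy as np

def kmeans_wk(X, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares found by a basic Lloyd's k-means."""
    best = np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute means; an empty cluster keeps its previous center
            new = np.array([X[labels == r].mean(axis=0) if np.any(labels == r)
                            else centers[r] for r in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        # Squared distance of every point to its nearest final center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(X, K, B=20, seed=0):
    """Return (k_hat, gap, s) for k = 1..K using B uniform reference datasets."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_w = np.array([np.log(kmeans_wk(X, k, rng)) for k in range(1, K + 1)])
    log_w_ref = np.empty((B, K))
    for b in range(B):
        Z = rng.uniform(lo, hi, size=X.shape)      # reference dataset b
        log_w_ref[b] = [np.log(kmeans_wk(Z, k, rng)) for k in range(1, K + 1)]
    gap = log_w_ref.mean(axis=0) - log_w
    # np.std divides by B by default, matching sd_k; inflate by sqrt(1 + 1/B)
    s = log_w_ref.std(axis=0) * np.sqrt(1.0 + 1.0 / B)
    for k in range(1, K):
        if gap[k - 1] >= gap[k] - s[k]:            # Gap(k) >= Gap(k+1) - s_{k+1}
            return k, gap, s
    return K, gap, s                               # assumed fallback: no k qualified

# Example (hypothetical data): two tight, well-separated clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(0.0, 0.3, size=(40, 2)) + 6.0])
k_hat, gap, s = gap_statistic(X, K=4, B=10)
print(k_hat, np.round(gap, 2))
```

For data like this one would expect a sharp rise in the gap from k = 1 to k = 2; the exact estimate depends on the random reference draws and on the k-means restarts.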

2-Cluster Example

No-Cluster Example

Example on cDNA microarray data -- hierarchical clustering
• 6834 genes
• 64 human tumours

Other Approaches
• Calinski and Harabasz ’74
• Krzanowski and Lai ’85
• Hartigan ’75
• Kaufman and Rousseeuw ’90 (silhouette)

Simulation (50 times)
• 1 cluster: 200 points in 10-D, uniformly distributed
• 3 clusters: each with 25 or 50 points in 2-D, normally distributed, with centers (0,0), (0,5) and (5,-3)
• 4 clusters: each with 25 or 50 points in 3-D, normally distributed, with centers randomly chosen from N(0, 5I) (any simulation with clusters having a minimum distance of less than 1.0 between them was discarded)
• 4 clusters: each with 25 or 50 points in 10-D, normally distributed, with centers randomly chosen from N(0, 1.9I) (any simulation with clusters having a minimum distance of less than 1.0 between them was discarded)
• 2 clusters: each cluster contains 100 points in 3-D, elongated shape, well-separated

Overlapping Clusters
• 50 observations from each of two bivariate normal populations with means (0,0) and (Δ,0), and covariance I
• 10 values of Δ in [0, 5]
• 10 simulations for each Δ
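The data for this experiment can be generated along the following lines. The separation between the two means is called delta here, and the evenly spaced grid of 10 values on [0, 5] is an assumption, since the slide does not say how the values were chosen. The function name is hypothetical.

```python
import numpy as np

def two_overlapping_normals(delta, n_per=50, seed=0):
    """Two bivariate normal samples with identity covariance, means (0,0) and (delta,0)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_per, 2))
    b = rng.normal(size=(n_per, 2)) + np.array([delta, 0.0])
    return np.vstack([a, b]), np.repeat([0, 1], n_per)

# Assumed grid: 10 evenly spaced separations between 0 and 5
deltas = np.linspace(0.0, 5.0, 10)
X, y = two_overlapping_normals(delta=2.5)
```

At delta = 0 the two populations coincide, so a method that reports one cluster there is behaving correctly; the interesting question is how quickly each method switches to two clusters as delta grows.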

Conclusion
• Focus on well-separated clusters
• Outperforms other approaches when used with a uniform reference distribution in the principal component orientation
• The simpler uniform reference over the range of the data works well except when the data are concentrated on a subspace
