Estimating the number of data clusters via the Gap statistic Robert Tibshirani, Guenther Walther and Trevor Hastie J.R. Statist. Soc. B (2001), 63, pp. 411--423
Cluster Analysis • Finding groups in data • No training data needed – Unsupervised • Major challenge – estimation of the optimal number of clusters
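Wk -- Measure of compactness of clusters • Suppose we have clustered the data into k clusters, with $C_r$ denoting the indices of observations in cluster r, and $n_r = |C_r|$ • Let $D_r = \sum_{i, i' \in C_r} d_{ii'}$ be the sum of the pairwise distances within cluster r, and $W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$ • If the distance d is the squared Euclidean distance, $W_k$ is the pooled within-cluster sum of squares around the cluster means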
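The definition of Wk can be checked numerically. The sketch below is illustrative only: the function names `within_dispersion` and `pooled_within_ss` and the toy data are assumptions, not part of the slides. It computes Wk from pairwise squared Euclidean distances, summing over all ordered pairs in each cluster, and compares it with the within-cluster sum of squares around the cluster means.

```python
import numpy as np

def within_dispersion(X, labels):
    """W_k = sum_r D_r / (2 n_r), where D_r sums the squared Euclidean
    distances over all ordered pairs (i, i') in cluster r."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        diff = pts[:, None, :] - pts[None, :, :]   # all pairwise differences
        D_r = np.sum(diff ** 2)
        total += D_r / (2 * len(pts))
    return float(total)

def pooled_within_ss(X, labels):
    """Sum of squared deviations of each point from its cluster mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        total += np.sum((pts - pts.mean(axis=0)) ** 2)
    return float(total)

# Hypothetical toy data: two clusters of two points each.
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels_toy = np.array([0, 0, 1, 1])
W_pairwise = within_dispersion(X_toy, labels_toy)
W_means = pooled_within_ss(X_toy, labels_toy)
```

For this toy data both quantities equal 10: cluster 0 contributes 2 and cluster 1 contributes 8.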
elbow • Wk decreases monotonically as the number of clusters k increases • But from some k on, the decrease flattens markedly • Such an “elbow” indicates the appropriate number of clusters
The Gap statistic • Standardize the graph of log(Wk) by comparing it to its expectation under an appropriate null reference distribution of the data: $\mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\} - \log(W_k)$ • $E^*_n$ denotes expectation under a sample of size n from the reference distribution
Reference distribution • Adopt a null model of a single component and reject it in favor of a k-component model (k > 1) • Two choices for the reference distribution: (a) generate each reference feature uniformly over the range of the observed values for that feature; (b) generate the reference features from a uniform distribution over a box aligned with the principal components of the data
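The two reference distributions can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the principal components are obtained from an SVD of the column-centred data, and the mean is added back after rotating (which does not affect within-cluster dispersion).

```python
import numpy as np

def reference_uniform_range(X, rng):
    """Each feature drawn uniformly over its observed range."""
    X = np.asarray(X, dtype=float)
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def reference_uniform_pca(X, rng):
    """Uniform over a box aligned with the principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # centre the columns
    # Rows of Vt are the principal directions of the centred data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xp = Xc @ Vt.T                                 # rotate into principal axes
    Zp = rng.uniform(Xp.min(axis=0), Xp.max(axis=0), size=Xp.shape)
    # ASSUMPTION: rotate back and restore the mean so Z sits over the data.
    return Zp @ Vt + mean
```

The second version follows the shape of elongated or tilted data more closely, at the cost of one SVD per dataset.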
Align with feature axes • [Figure: observations and Monte Carlo simulations inside a bounding box aligned with the feature axes]
Align with principal axes • [Figure: observations and Monte Carlo simulations inside a bounding box aligned with the principal axes]
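Computation of the Gap statistic • Cluster the observed data, varying the total number of clusters from k = 1, 2, …, K, giving within-cluster dispersion measures $W_k$, k = 1, 2, …, K • Generate B reference datasets, using one of the uniform prescriptions, and cluster each one, giving $W^*_{kb}$, b = 1, 2, …, B, k = 1, 2, …, K. Compute the (estimated) Gap statistic: $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W^*_{kb}) - \log(W_k)$ • Let $\bar{l} = \frac{1}{B} \sum_{b=1}^{B} \log(W^*_{kb})$, compute the standard deviation $\mathrm{sd}_k = \left[ \frac{1}{B} \sum_{b=1}^{B} \{\log(W^*_{kb}) - \bar{l}\}^2 \right]^{1/2}$, and define $s_k = \mathrm{sd}_k \sqrt{1 + 1/B}$. Finally, find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$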
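The steps above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation. Several pieces are assumptions: the clustering routine `kmeans_W` (plain Lloyd's k-means with random restarts), the function name `gap_statistic`, the defaults for K and B, the use of the simpler range-uniform reference, and the fallback of returning K when no k satisfies the selection rule.

```python
import numpy as np

def kmeans_W(X, k, rng, n_init=10, n_iter=100):
    """Smallest W_k found over n_init runs of Lloyd's k-means (an assumed clusterer)."""
    best = np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Keep the old centre if a cluster becomes empty.
            new_centers = np.array([
                X[labels == r].mean(axis=0) if np.any(labels == r) else centers[r]
                for r in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Pooled within-cluster sum of squares around the cluster means.
        w = sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
                for r in np.unique(labels))
        best = min(best, w)
    return best

def gap_statistic(X, K=6, B=20, seed=0):
    """Return (k_hat, gap, s); gap[k-1] is Gap(k) and s[k-1] is s_k."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ks = range(1, K + 1)
    log_W = np.array([np.log(kmeans_W(X, k, rng)) for k in ks])
    log_W_ref = np.empty((B, K))
    for b in range(B):
        # ASSUMPTION: reference features uniform over the observed ranges.
        Z = rng.uniform(lo, hi, size=X.shape)
        log_W_ref[b] = [np.log(kmeans_W(Z, k, rng)) for k in ks]
    l_bar = log_W_ref.mean(axis=0)
    gap = l_bar - log_W
    sd = log_W_ref.std(axis=0)            # divides by B, as in the formula
    s = sd * np.sqrt(1.0 + 1.0 / B)
    for k in range(1, K):
        if gap[k - 1] >= gap[k] - s[k]:   # Gap(k) >= Gap(k+1) - s_{k+1}
            return k, gap, s
    return K, gap, s                      # ASSUMPTION: fall back to K
```

Calling `gap_statistic(X, K=5, B=10)` on an (n, p) array returns the selected k together with the Gap curve and the s_k values, which can be plotted with error bars to inspect the choice.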
Example on cDNA microarray data: hierarchical clustering • 6834 genes • 64 human tumours
Other Approaches • Calinski and Harabasz ’74 • Krzanowski and Lai ’85 • Hartigan ’75 • Kaufman and Rousseeuw ’90 (silhouette)
Simulation (50 times) • 1 cluster: 200 points in 10-D, uniformly distributed • 3 clusters: each with 25 or 50 points in 2-D, normally distributed, w/ centers (0,0), (0,5) and (5,-3) • 4 clusters: each with 25 or 50 points in 3-D, normally distributed, w/ centers randomly chosen from N(0,5I) (simulation w/ clusters having min distance less than 1.0 was discarded.) • 4 clusters: each w/ 25 or 50 points in 10-D, normally distributed, w/ centers randomly chosen from N(0,1.9I) (simulation w/ clusters having min distance less than 1.0 was discarded.) • 2 clusters: each cluster contains 100 points in 3-D, elongated shape, well-separated
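Two of these scenarios can be generated with a few lines of NumPy. The sketch is illustrative only and rests on assumptions the slides do not state: the uniform points are drawn over the unit hypercube, the normal clusters have identity covariance, and each cluster size is chosen as 25 or 50 with equal probability. The function names are hypothetical.

```python
import numpy as np

def one_cluster_sample(rng, n=200, dim=10):
    """Null scenario: n points uniform over the unit hypercube (assumed range)."""
    return rng.uniform(0.0, 1.0, size=(n, dim))

def three_cluster_sample(rng):
    """Three 2-D normal clusters centred at (0,0), (0,5) and (5,-3)."""
    centers = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, -3.0]])
    # ASSUMPTION: each size is 25 or 50 with equal probability.
    sizes = rng.choice([25, 50], size=3)
    # ASSUMPTION: identity covariance within each cluster.
    X = np.vstack([c + rng.standard_normal((n, 2)) for c, n in zip(centers, sizes)])
    labels = np.repeat(np.arange(3), sizes)
    return X, labels
```

Repeating each generator 50 times with a fresh draw gives one simulation study per scenario.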
Overlapping Clusters • 50 observations from each of two bivariate normal populations with means (0,0) and ($\Delta$,0), and covariance I • $\Delta$ takes 10 values in [0, 5], with 10 simulations for each value
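A sketch of how such overlapping data could be generated follows. The offset of the second mean along the first axis is called `delta` here. The function name and the equally spaced grid of offsets are assumptions, since the slides do not list the values used.

```python
import numpy as np

def overlapping_sample(delta, rng, n_per_cluster=50):
    """Two bivariate normal clusters, identity covariance, means (0,0) and (delta,0)."""
    a = rng.standard_normal((n_per_cluster, 2))
    b = rng.standard_normal((n_per_cluster, 2)) + np.array([delta, 0.0])
    return np.vstack([a, b])

rng = np.random.default_rng(0)
# ASSUMPTION: ten equally spaced offsets in [0, 5].
deltas = np.linspace(0.0, 5.0, 10)
# Ten simulated datasets for each offset.
datasets = {float(d): [overlapping_sample(d, rng) for _ in range(10)] for d in deltas}
```

At `delta = 0` the two populations coincide, so a single cluster is the correct answer; as `delta` grows the populations separate and two clusters become identifiable.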
Conclusion • Focus on well-separated clusters • Outperforms the other approaches when used with a uniform reference distribution in the principal component orientation • The simpler uniform reference over the range of the data works well except when the data are concentrated on a subspace