Hierarchical Stability Based Model Selection for Data Clustering Bing Yin Advisor: Greg Hamerly
Roadmap
● What is clustering?
● What is model selection for clustering algorithms?
● Stability Based Model Selection: Proposals and Problems
● Hierarchical Stability Based Model Selection: Algorithm, Unimodality Test, Experiments
● Future work
Main Contributions
● Extended the concept of stability to hierarchical stability.
● Solved the symmetric data set problem.
● Made stability a competitive tool for model selection.
What is clustering?
Given:
● a data set of “objects”
● some relations between those objects: similarities, distances, neighborhoods, connections, …
Goal: find meaningful groups of objects such that
● objects in the same group are “similar”
● objects in different groups are “dissimilar”
Clustering is:
● a form of unsupervised learning
● a method of data exploration
What is clustering? An Example
Image Segmentation
Microarray Analysis: Serum Stimulation of Human Fibroblasts (Eisen, Spellman, et al., PNAS, 1998)
● 9800 spots representing 8600 genes
● 12 samples taken over a 24-hour period
● Clusters can be roughly categorized as genes involved in
A: cholesterol biosynthesis
B: the cell cycle
C: the immediate-early response
D: signaling and angiogenesis
E: wound healing and tissue remodeling
Other applications: document clustering, post-search grouping, data mining, social network analysis, gene family grouping, …
What is clustering? An Algorithm
K-means algorithm (Lloyd, 1957)
Given: data points $X_1,\dots,X_n \in \mathbb{R}^d$, number $K$ of clusters to find.
1. Randomly initialize the centers $m_1^{(0)},\dots,m_K^{(0)}$.
2. Iterate until convergence:
2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters $C_1^{(i+1)},\dots,C_K^{(i+1)}$ by $X_s \in C_k^{(i+1)} \iff \|X_s - m_k^{(i)}\|^2 \le \|X_s - m_l^{(i)}\|^2$ for all $l = 1,\dots,K$.
2.2 Compute the new cluster centers: $m_k^{(i+1)} = \frac{1}{|C_k^{(i+1)}|}\sum_{X_s \in C_k^{(i+1)}} X_s$.
What is optimized? Minimizing the within-cluster sum of squared distances: $\sum_{k=1}^{K}\sum_{X_s \in C_k}\|X_s - m_k\|^2$.
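As an illustration, here is a minimal NumPy sketch of Lloyd's iteration as described above; the initialization scheme, convergence check, and parameter names are illustrative choices, not the talk's exact implementation.

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0), max_iter=100):
    n, d = X.shape
    # Step 1: initialize centers as K distinct random data points.
    centers = X[rng.choice(n, size=K, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Step 2.1: assign each point to its closest center (squared Euclidean).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Step 2.2: recompute each center as the mean of its cluster.
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```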
What is model selection?
Clustering algorithms need to know the number of clusters K before running, but the correct K for a given data set is usually unknown. We therefore need a principled way to find K, along with the positions of the K centers. This is called model selection for clustering algorithms.
Existing model selection methods:
● Bayesian Information Criterion
● Gap statistic
● Projection test
● …
● Stability based approach
Stability Based Model Selection
The basic idea: scientific truth should be reproducible in experiments. Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clusterings:
● If K is the correct model, the clusterings should be similar to each other.
● If K is a wrong model, the clusterings may be quite different from each other.
This property is referred to as the stability of K (von Luxburg, 2007).
Stability Based Model Selection (2)
Example on a toy data set: if we can mathematically define a stability score for K, then stability can be used to find the correct model for the given data.
Define the Stability: Variation of Information (VI)
● Clustering $C_1 = \{X_1,\dots,X_k\}$ and clustering $C_2 = \{X'_1,\dots,X'_k\}$ on data $X$ with $n$ points.
● The probability that a point belongs to cluster $X_i$ is $P(i) = |X_i|/n$.
● The entropy of $C_1$: $H(C_1) = -\sum_i P(i)\log P(i)$.
● The joint probability that a point is in both $X_i$ and $X'_j$ is $P(i,j) = |X_i \cap X'_j|/n$, with joint entropy $H(C_1,C_2) = -\sum_{i,j} P(i,j)\log P(i,j)$.
● The VI is defined as $VI(C_1,C_2) = 2H(C_1,C_2) - H(C_1) - H(C_2)$, equivalently $H(C_1) + H(C_2) - 2I(C_1,C_2)$ where $I$ is the mutual information.
VI is a metric (distance) between two clusterings.
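A sketch of this VI distance between two label vectors, following the definitions above; the function and argument names are illustrative.

```python
import numpy as np

def variation_of_information(a, b):
    # a, b: cluster labels (0..k-1) of the same n points.
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Joint probability P(i, j) from a contingency table.
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= n
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    eps = np.finfo(float).eps  # avoids log(0); zero cells contribute 0
    h_a = -np.sum(pa * np.log(pa + eps))         # H(C1)
    h_b = -np.sum(pb * np.log(pb + eps))         # H(C2)
    h_ab = -np.sum(joint * np.log(joint + eps))  # H(C1, C2)
    return 2.0 * h_ab - h_a - h_b                # VI = 2H(C1,C2) - H(C1) - H(C2)
```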
Define the Stability (2)
Calculate the VI score for a single K:
● Cluster the data using K-means into K clusters; run M times.
● Calculate the pairwise VI of these M clusterings.
● Average the VIs and use the mean as the VI score for K.
The calculated VI score indicates the instability of K. Repeat this over different K; the K with the lowest VI score (instability) is chosen as the correct model.
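A sketch of this instability score, reusing the kmeans() and variation_of_information() sketches above; M, the seeding scheme, and the candidate range of K are illustrative choices.

```python
from itertools import combinations
import numpy as np

def instability(X, K, M=10):
    # Run k-means M times from different random initializations.
    runs = [kmeans(X, K, rng=np.random.default_rng(seed))[1]
            for seed in range(M)]
    # Average the pairwise VI over all M*(M-1)/2 pairs of clusterings.
    return np.mean([variation_of_information(a, b)
                    for a, b in combinations(runs, 2)])

# Pick the K with the lowest averaged VI (instability), e.g.:
# best_K = min(range(2, 11), key=lambda K: instability(X, K))
```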
Define the Stability (3)
A good example of stability. A bad example of stability: symmetric data. Why? Because clustering this data into 9 clusters admits many more equally good groupings than clustering it into 3, so the 9-cluster solutions disagree with each other even when 9 is the right answer.
Hierarchical Stability
Problems with the concept of stability introduced above:
● symmetric data sets
● only local optimization – stability favors the smaller K
Proposed solution:
● analyze the stability in a hierarchical manner
● perform a unimodality test to detect when the recursion should terminate
Hierarchical Stability
Given: data set X. HS-means (sketched in code below):
1. Test whether X is a unimodal cluster.
2. If it is, X is a single cluster: return. Otherwise, find the optimal K for X by analyzing stability.
3. Partition X into K subsets.
4. For each subset, recursively perform this algorithm from step 1.
5. Merge the answers from the subsets as the answer for the current data.
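The sketch below puts the five steps together, reusing the kmeans() and instability() sketches from earlier slides; is_unimodal stands in for the unimodality tests on the next two slides, and the candidate K range is an illustrative choice.

```python
def hs_means(X, is_unimodal, k_range=range(2, 6)):
    # Steps 1-2: if X passes the unimodality test, it is a single cluster.
    if is_unimodal(X):
        return [X]
    # Step 2: otherwise choose K by minimizing instability.
    K = min(k_range, key=lambda k: instability(X, k))
    # Step 3: partition X into K subsets.
    _, labels = kmeans(X, K)
    # Steps 4-5: recurse on each subset and merge the answers.
    clusters = []
    for k in range(K):
        clusters.extend(hs_means(X[labels == k], is_unimodal, k_range))
    return clusters

# e.g. clusters = hs_means(X, looks_unimodal_chi2)  # test defined on the next slide
```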
Unimodality Test – χ²
Fact: a sum of squared standard Gaussians follows a χ² distribution.
● If $x_1,\dots,x_d$ are $d$ independent standard Gaussian variables, then $S = x_1^2 + \dots + x_d^2$ follows a χ² distribution with $d$ degrees of freedom.
For a given (standardized) data set X, calculate $S_i = X_{i1}^2 + \dots + X_{id}^2$ for each point:
● If X is a single Gaussian, then the $S_i$ follow a χ² distribution with $d$ degrees of freedom.
● Otherwise, the $S_i$ do not follow a χ² distribution.
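A sketch of this check using a Kolmogorov-Smirnov goodness-of-fit test from SciPy; it assumes the coordinates are standardized and roughly independent, and the significance level is an illustrative choice.

```python
import numpy as np
from scipy import stats

def looks_unimodal_chi2(X, alpha=0.05):
    # Standardize each coordinate (the chi-square fact needs standard Gaussians).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # S_i = sum of squared standardized coordinates of point i.
    S = (Z ** 2).sum(axis=1)
    # Compare the S_i against a chi-square distribution with d degrees of freedom.
    d = X.shape[1]
    _, p_value = stats.kstest(S, 'chi2', args=(d,))
    return p_value > alpha  # large p: consistent with a single Gaussian
```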
Unimodality Test – Gap Test
Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani et al., 2001).
Given: data set X, candidate k:
● cluster X into k clusters and get the within-cluster dispersion $W_k$
● generate uniform reference data sets, cluster each into k clusters, and average to get $W^*_k$
● $\mathrm{gap}(k) = \log W^*_k - \log W_k$
● select the smallest k such that $\mathrm{gap}(k) \ge \mathrm{gap}(k+1)$
● we use it differently: we only ask whether k = 1
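A sketch of the gap computation restricted to this k = 1 question, reusing the kmeans() sketch above; the number of reference sets B and the uniform bounding box are illustrative simplifications of Tibshirani et al.'s procedure.

```python
import numpy as np

def dispersion(X, K):
    # Within-cluster sum of squared distances after a k-means run.
    _, labels = kmeans(X, K)
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(K))

def gap(X, K, B=10, rng=np.random.default_rng(0)):
    # Reference dispersion: uniform data drawn over the bounding box of X.
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(dispersion(rng.uniform(lo, hi, size=X.shape), K))
           for _ in range(B)]
    return np.mean(ref) - np.log(dispersion(X, K))

# "Just ask k = 1?": treat X as a single cluster when splitting it
# does not increase the gap, e.g. gap(X, 1) >= gap(X, 2).
```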
Experiments
Synthetic data
● both Gaussian and uniform distributions
● dimensions from 2 up to 20
● c-separation between each cluster center and its nearest neighbor is 4
● 200 points in each cluster, 10 clusters in total
Handwritten digits
● U.S. Postal Service handwritten digits
● 9298 instances in 256 dimensions
● 10 true clusters (maybe!)
KDD Control Curves
● 600 instances in 60 dimensions
● 6 true clusters, each with 100 instances
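For concreteness, a hedged sketch of generating such synthetic Gaussian data; it uses Dasgupta's notion of c-separation for unit-variance spherical Gaussians ($\|\mu_i - \mu_j\| \ge c\sqrt{d}$) and rejection-samples the centers, which may not match the exact generator used in the experiments.

```python
import numpy as np

def make_gaussian_data(d=2, n_clusters=10, n_per=200, c=4.0,
                       rng=np.random.default_rng(0)):
    min_dist = c * np.sqrt(d)  # c-separation for unit-variance spherical Gaussians
    centers = [rng.uniform(-50, 50, size=d)]
    while len(centers) < n_clusters:
        cand = rng.uniform(-50, 50, size=d)
        # Keep a candidate center only if it is c-separated from all existing ones.
        if min(np.linalg.norm(cand - m) for m in centers) >= min_dist:
            centers.append(cand)
    # n_per unit-variance spherical Gaussian points around each center.
    return np.vstack([m + rng.standard_normal((n_per, d)) for m in centers])
```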
Experiments – symmetric data (figures: HS-means vs. Lange's stability method)
Future Work
● A better unimodality testing approach.
● More detailed performance comparisons with existing methods such as within-cluster distance, the VI metric, and so on.
● Improve the speed of the algorithm.
Questions and Comments Thank you!