
Hierarchical Stability Based Model Selection for Data Clustering




Presentation Transcript


  1. Hierarchical Stability Based Model Selection for Data Clustering Bing Yin Advisor: Greg Hamerly

  2. Roadmap What is clustering? What is model selection for clustering algorithms? Stability Based Model Selection: Proposals and Problems Hierarchical Stability Based Model Selection ● Algorithm ● Unimodality Test ● Experiments Future Work Main Contributions ● Extended the concept of stability to hierarchical stability. ● Solved the symmetric data set problem. ● Made stability a competitive tool for model selection.

  3. What is clustering? Given: ● data set of “objects” ● some relations between those objects: similarities, distances, neighborhoods, connections,… Goal: Find meaningful groups of objects s. t. ● objects in the same group are “similar” ● objects in different groups are “dissimilar” Clustering is: ● a form of unsupervised learning ● a method of data exploration

  4. What is clustering? An Example Image Segmentation. Microarray Analysis: Serum Stimulation of Human Fibroblasts (Eisen, Spellman, et al., PNAS, 1998) ● 9800 spots representing 8600 genes ● 12 samples taken over a 24-hour period ● Clusters can be roughly categorized as genes involved in A: cholesterol biosynthesis B: the cell cycle C: the immediate-early response D: signaling and angiogenesis E: wound healing and tissue remodeling. Other applications: Document Clustering, Post-search Grouping, Data Mining, Social Network Analysis, Gene Family Grouping, …

  5. What is clustering? An Algorithm K-means algorithm (Lloyd, 1957) Given: data points $x_1,\dots,x_n \in \mathbb{R}^d$, number of clusters K to find. 1. Randomly initialize the centers $m_1^0,\dots,m_K^0$. 2. Iterate until convergence: 2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters $C_1^{i+1},\dots,C_K^{i+1}$ by $x_s \in C_k^{i+1}$ iff $\|x_s - m_k^i\|^2 \le \|x_s - m_l^i\|^2$ for all $l = 1,\dots,K$. 2.2 Compute the new cluster centers: $m_k^{i+1} = \sum_{x_s \in C_k^{i+1}} x_s \,/\, |C_k^{i+1}|$. What is optimized? Minimizing the within-cluster distances: $\sum_{k=1}^{K} \sum_{x_s \in C_k} \|x_s - m_k\|^2$.
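
As a concrete reference, here is a minimal sketch of the Lloyd iteration above in Python/numpy (initializing centers from sampled data points and the convergence check via np.allclose are our choices, not part of the slide):

```python
# Minimal sketch of Lloyd's K-means; X is an (n, d) numpy array.
import numpy as np

def kmeans(X, K, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the K centers by sampling data points.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # 2.1 Assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2.2 Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break  # 2. Iterate until convergence.
        centers = new_centers
    return labels, centers
```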

  6. What is model selection? Clustering algorithms need to know K before running, but the correct K for a given data set is unknown. So we need a principled way to find this K, along with the positions of the K centers. This is what we mean by model selection for clustering algorithms. Existing model selection methods: ● Bayesian Information Criterion ● Gap statistic ● Projection test ● … ● Stability based approach

  7. Stability Based Model Selection The basic idea: scientific truth should be reproducible in experiments. Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clusterings: ● If K is the correct model, the clusterings should be similar to each other ● If K is a wrong model, the clusterings may be quite different from each other. This is referred to as the stability of K (Ulrike von Luxburg, 2007).

  8. Stability Based Model Selection (2) Example on a toy data set (figure on slide). If we can mathematically define such a stability score for each K, then stability can be used to find the correct model for the given data.

  9. Define the Stability Variation of Information (VI) ● Clustering $C_1$: $X_1,\dots,X_K$ and clustering $C_2$: $X'_1,\dots,X'_K$ on data X of n points ● The probability that a point p belongs to $X_i$ is: $P(i) = |X_i|/n$ ● The entropy of $C_1$: $H(C_1) = -\sum_i P(i)\log P(i)$ ● The joint probability that p is in $X_i$ and $X'_j$ is $P(i,j)$, with joint entropy $H(C_1,C_2) = -\sum_{i,j} P(i,j)\log P(i,j)$ ● The VI is defined as: $VI(C_1,C_2) = 2H(C_1,C_2) - H(C_1) - H(C_2)$. VI indicates a distance between two clusterings.
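
A sketch of the VI computation under these definitions (representing each clustering as an integer label array is our assumption about the input format):

```python
# Variation of Information between two clusterings, given as integer
# label arrays c1, c2 (labels 0..K-1) over the same n points.
import numpy as np

def variation_of_information(c1, c2):
    n = len(c1)
    # Joint probability P(i, j): fraction of points in X_i of C1 and X'_j of C2.
    joint = np.zeros((c1.max() + 1, c2.max() + 1))
    np.add.at(joint, (c1, c2), 1.0)
    joint /= n
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)   # P(i), P(j)
    eps = np.finfo(float).tiny                      # so that 0 * log(0) -> 0
    h1 = -np.sum(p1 * np.log(p1 + eps))             # H(C1)
    h2 = -np.sum(p2 * np.log(p2 + eps))             # H(C2)
    h12 = -np.sum(joint * np.log(joint + eps))      # joint entropy H(C1, C2)
    return 2.0 * h12 - h1 - h2                      # VI = 2 H(C1,C2) - H(C1) - H(C2)
```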

  10. Define the Stability (2) Calculate the VI score for a single K ● Cluster the data into K clusters using K-means, run M times ● Calculate the pairwise VI of these M clusterings ● Average the VIs and use the mean as the VI score for K. The calculated VI score for K indicates the instability of K. Try this over different K; the K with the lowest VI score/instability is chosen as the correct model.
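
Putting the previous two sketches together, the instability score for one K might look like this (M = 10 and the candidate range in the comment are assumptions):

```python
# Instability score for K: average pairwise VI over M K-means runs.
import itertools
import numpy as np

def instability(X, K, M=10, seed=0):
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, K, seed=rng.integers(2**32))[0] for _ in range(M)]
    return np.mean([variation_of_information(runs[i], runs[j])
                    for i, j in itertools.combinations(range(M), 2)])

# Choose the K with the lowest instability over a candidate range:
# best_K = min(range(2, 11), key=lambda k: instability(X, k))
```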

  11. Define the Stability (3) A good example of stability. A bad example of stability: symmetric data. Why? Because clustering the symmetric data into 9 clusters admits many more equally good grouping choices than clustering it into 3, so repeated runs disagree and the larger (correct) K looks unstable.

  12. Hierarchical Stability Problems with the concept of stability introduced above: ● Symmetric data sets ● Only local optimization – a bias toward the smaller K. Proposed solution: ● Analyze the stability in a hierarchical manner ● Perform a unimodality test to detect the termination of the recursion.

  13. Hierarchical Stability Given: data set X. HS-means (see the sketch below): ● 1. Test whether X is a unimodal cluster ● 2. If X is unimodal, it is a single cluster; return. Otherwise, find the optimal K for X by analyzing stability ● 3. Partition X into K subsets ● 4. For each subset, recursively perform this algorithm from step 1 ● 5. Merge the answers from the subsets as the answer for the current data.
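
A direct transcription of these five steps; the helper names unimodal and best_k_by_stability are placeholders for the unimodality test and the stability analysis described on the surrounding slides:

```python
# Recursive HS-means sketch; returns a flat list of clusters (arrays of points).
def hs_means(X):
    if unimodal(X):                       # steps 1-2: X is a single cluster
        return [X]
    K = best_k_by_stability(X)            # step 2: optimal K via stability
    labels, _ = kmeans(X, K)              # step 3: partition X into K subsets
    clusters = []
    for k in range(K):                    # step 4: recurse on each subset
        clusters.extend(hs_means(X[labels == k]))
    return clusters                       # step 5: merge the subset answers
```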

  14. Unimodality Test – the χ² Test Fact: a sum of squared standard Gaussian variables follows a χ² distribution. ● If $x_1,\dots,x_d$ are d independent standard Gaussian variables, then $S = x_1^2 + \dots + x_d^2$ follows a χ² distribution with d degrees of freedom. For a given data set X (standardized per dimension), calculate $S_i = X_{i1}^2 + \dots + X_{id}^2$ ● If X is a single Gaussian, then S follows χ² with d degrees of freedom ● Otherwise, S does not follow a χ² distribution.
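
One way this test could be realized (the per-dimension standardization, the Kolmogorov–Smirnov comparison against the χ² CDF, and the 0.05 threshold are our assumptions; the thesis may use a different goodness-of-fit procedure):

```python
# Chi-square unimodality test sketch: if X is a single Gaussian, the squared
# norms of the standardized points follow chi^2 with d degrees of freedom.
import numpy as np
from scipy import stats

def chi2_unimodal(X, alpha=0.05):
    d = X.shape[1]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each dimension
    S = np.sum(Z ** 2, axis=1)                 # S_i = Z_i1^2 + ... + Z_id^2
    # Goodness-of-fit of S against chi^2(d) via a Kolmogorov-Smirnov test.
    _, p_value = stats.kstest(S, stats.chi2(df=d).cdf)
    return p_value > alpha                     # fail to reject => unimodal
```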

  15. Unimodality Test – Gap Test Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani et al., 2000). Given: data set X, candidate k ● Cluster X into k clusters and get the within-cluster dispersion $W_k$ ● Generate uniform reference data sets, cluster each into k clusters, and calculate the averaged $W^*_k$ ● $\text{gap}(k) = \log W^*_k - \log W_k$ ● Select the smallest k s.t. gap(k) > gap(k+1) ● We use it in another way: only to ask, is k = 1?
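
A sketch of that "is k = 1?" usage (B = 10 reference sets and the simplified decision rule gap(1) ≥ gap(2) are assumptions; Tibshirani's original rule also subtracts a standard-error term):

```python
# Gap statistic sketch, used only to decide whether the data is one cluster.
import numpy as np

def within_dispersion(X, k):
    labels, centers = kmeans(X, k)
    return sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))

def gap(X, k, B=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Average log-dispersion over B uniform reference sets on X's bounding box.
    ref = np.mean([np.log(within_dispersion(rng.uniform(lo, hi, X.shape), k))
                   for _ in range(B)])
    return ref - np.log(within_dispersion(X, k))

# is_unimodal = gap(X, 1) >= gap(X, 2)   # splitting does not improve the gap
```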

  16. Experiments Synthetic data (a generation sketch follows below) ● Both Gaussian and uniform distributions ● In dimensions from 2 up to 20 ● c-separation between each cluster center and its nearest neighbor is 4 ● 200 points in each cluster, 10 clusters in total. Handwritten Digits ● U.S. Postal Service handwritten digits ● 9298 instances in 256 dimensions ● 10 true clusters (maybe!) KDD Control Curves ● 600 instances in 60 dimensions ● 6 true clusters, each with 100 instances.
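
A sketch of how such Gaussian data could be generated; unit-variance clusters and rejection sampling of the centers are our assumptions (for unit variance, c-separation 4 means centers at least $4\sqrt{d}$ apart):

```python
# Synthetic data sketch: 10 unit-variance Gaussian clusters of 200 points
# in d dimensions, centers rejection-sampled to be c-separated (c = 4).
import numpy as np

def make_gaussian_data(d, n_clusters=10, n_points=200, c=4.0, seed=0):
    rng = np.random.default_rng(seed)
    min_dist = c * np.sqrt(d)   # c-separation for unit-variance clusters
    centers = []
    while len(centers) < n_clusters:
        cand = rng.uniform(-5 * min_dist, 5 * min_dist, size=d)
        if all(np.linalg.norm(cand - m) >= min_dist for m in centers):
            centers.append(cand)
    X = np.vstack([m + rng.standard_normal((n_points, d)) for m in centers])
    y = np.repeat(np.arange(n_clusters), n_points)
    return X, y
```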

  17. Experiments – symmetric data (figure: clustering results on the symmetric data set, comparing HS-means against Lange's stability method).

  18. Future Work ● A better unimodality testing approach. ● A more detailed performance comparison with existing methods such as within-cluster distance, the VI metric, and so on. ● Improve the speed of the algorithm.

  19. Questions and Comments Thank you!
