
More on Choosing #Clusters in General (not just k-means (fusion plot etc in chapter))


naida

Presentation Transcript


  1. More on Choosing #Clusters in General (not just k-means (fusion plot etc in chapter))
     • Some researchers do their cluster analysis and then, to demonstrate that the resulting clusters are “significantly” different, they run a (one-way) ANOVA and voilà, show that the F is large.
     • Well, duh! The cluster analysis’s objective was to find groups that were maximally separable, so a large F on the clustering variables is no surprise.
     • Take a look at Milligan & Cooper (1985). They compared some 30 methods for determining the proper #clusters. They found 3 criteria that produced good results: a pseudo F (Calinski & Harabasz 1974), a J statistic (Duda & Hart 1973), and CCC, the cubic clustering criterion. The 1st and 3rd of these are displayed in SAS (Proc Cluster).
     • For example, the pseudo F:
         pseudo F = [trace(B)/(C-1)] / [trace(W)/(N-C)]
         where B and W are the between-cluster and within-cluster sum-of-squares matrices
       • N = #observations (sample size)
       • C = #clusters (at a particular level of the clustering hierarchy)
       • Look at the eqn: it’s basically MSbetween/MSwithin
       • So larger is better, but keep in mind that separation should improve anyway with larger C
       • If the data are multivariate normal, it is distributed as F on p(C-1) and p(N-C) df (where p = #vars)
       • And you can compare F across #C’s to find the optimal C
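The MSbetween/MSwithin idea above can be sketched in a few lines of plain Python. This is only an illustration, not the SAS Proc Cluster computation: the function name `pseudo_f`, the toy data, and the three candidate partitions (imagined as cuts of one hierarchy at C = 2, 3, 4) are all assumptions made up for the example.

```python
from collections import defaultdict


def pseudo_f(points, labels):
    """Pseudo F = [SSB / (C - 1)] / [SSW / (N - C)] for one partition.

    SSB and SSW are the between- and within-cluster sums of squares,
    summed over all variables.
    """
    n = len(points)
    p = len(points[0])
    groups = defaultdict(list)
    for x, g in zip(points, labels):
        groups[g].append(x)
    c = len(groups)
    if c < 2 or c >= n:
        raise ValueError("need 2 <= C < N")

    grand = [sum(x[j] for x in points) / n for j in range(p)]
    ssb = 0.0
    ssw = 0.0
    for members in groups.values():
        m = len(members)
        centroid = [sum(x[j] for x in members) / m for j in range(p)]
        ssb += m * sum((centroid[j] - grand[j]) ** 2 for j in range(p))
        ssw += sum((x[j] - centroid[j]) ** 2 for x in members for j in range(p))
    if ssw == 0:
        return float("inf")  # perfectly tight clusters
    return (ssb / (c - 1)) / (ssw / (n - c))


# Hypothetical toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]

# Hypothetical partitions, as if one hierarchy were cut at C = 2, 3, 4.
candidates = {
    2: [0] * 8 + [1] * 4,            # first two groups merged
    3: [0] * 4 + [1] * 4 + [2] * 4,  # the three groups as built
    4: [0, 0, 1, 1] + [2] * 4 + [3] * 4,  # first group split in two
}

scores = {c: pseudo_f(points, labels) for c, labels in candidates.items()}
best_c = max(scores, key=scores.get)

for c in sorted(scores):
    print(f"C={c}: pseudo F = {scores[c]:.2f}")
print("largest pseudo F at C =", best_c)
```

On this toy data the pseudo F peaks at C = 3, the number of groups the data were built with. Note that the within sum of squares keeps shrinking from C = 3 to C = 4, yet the pseudo F drops, which is the adjustment the C-1 and N-C divisors provide.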

  2. More on Choosing #Clusters in General
     • References
       • Breckenridge, James N. (2000), “Validating Cluster Analysis: Consistent Replication and Symmetry,” Multivariate Behavioral Research, 35 (2), 261-285.
       • Calinski, T. and J. Harabasz (1974), “A Dendrite Method for Cluster Analysis,” Communications in Statistics, 3, 1-27.
       • Krolak-Schwerdt, Sabine and Thomas Eckes (1992), “A Graph Theoretic Criterion for Determining the Number of Clusters in a Data Set,” Multivariate Behavioral Research, 27 (4), 541-565.
       • Milligan, Glenn W. and Martha C. Cooper (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set,” Psychometrika, 50, 159-179.
       • Steinley, Douglas and Michael J. Brusco (2011), “Choosing the Number of Clusters in K-Means Clustering,” Psychological Methods, 16 (3), 285-297.
