Cluster Analysis Hal Whitehead BIOL4062/5062
What is cluster analysis? • Non-hierarchical cluster analysis • K-means • Hierarchical divisive cluster analysis • Hierarchical agglomerative cluster analysis • Linkage: single, complete, average, … • Cophenetic correlation coefficient • Additive trees • Problems with cluster analyses
Cluster Analysis • “Classification”: maximize within-cluster homogeneity (similar individuals within a cluster) • “The search for discontinuities”: discontinuities are places to put divisions between clusters
Discontinuities • Generally present in: taxonomy, social organization • Community ecology??
Types of cluster analysis: • Uses: data matrix, dissimilarity matrix, or similarity matrix • Non-hierarchical • K-means • Hierarchical • Hierarchical divisive (repeated K-means, network methods) • Hierarchical agglomerative • single linkage, average linkage, ... • Additive trees
Non-hierarchical Clustering Techniques: K-Means • Uses a data matrix with Euclidean distances • Maximizes between-cluster variance for a given number of clusters • i.e., choose clusters to maximize the F-ratio in a one-way MANOVA
K-Means works iteratively (a minimal sketch follows below):
1. Choose the number of clusters
2. Assign points to clusters, randomly or using some other clustering technique
3. Move each point to each other cluster in turn: does between-cluster variance increase?
4. Repeat step 3 until no improvement is possible
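A minimal K-means sketch in Python/NumPy. It implements the common reassign-and-recompute (Lloyd's) variant rather than the point-by-point moves of step 3; the names `kmeans` and `X` (an n × p data matrix of floats) are illustrative assumptions, not from the lecture.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initial assignment, here via randomly chosen starting centroids
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Step 3: (re)assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):                 # guard against empty clusters
                new_centroids[k] = members.mean(axis=0)
        # Step 4: stop when reassignment no longer changes the solution
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```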
K-means with three clusters

Variable     Between SS  df  Within SS  df  F-ratio
X            0.536       2   0.007      7   256.163
Y            0.541       2   0.050      7   37.566
** TOTAL **  1.078       4   0.058      14
K-means with three clusters

Cluster 1 of 3 contains 4 cases
Members            Statistics
Case     Distance  Variable  Minimum  Mean  Maximum  St.Dev.
Case 1   0.02      X         0.41     0.45  0.49     0.04
Case 2   0.11      Y         0.03     0.19  0.27     0.11
Case 3   0.06
Case 4   0.05

Cluster 2 of 3 contains 4 cases
Members            Statistics
Case     Distance  Variable  Minimum  Mean  Maximum  St.Dev.
Case 7   0.06      X         0.11     0.15  0.19     0.03
Case 8   0.03      Y         0.61     0.70  0.77     0.07
Case 9   0.02
Case 10  0.06

Cluster 3 of 3 contains 2 cases
Members            Statistics
Case     Distance  Variable  Minimum  Mean  Maximum  St.Dev.
Case 5   0.01      X         0.77     0.77  0.78     0.01
Case 6   0.01      Y         0.33     0.35  0.36     0.02
Disadvantages of K-means • Reaches an optimum, but not necessarily the global one • The number of clusters must be chosen before the analysis • How many clusters? (one common heuristic is sketched below)
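One common, if still arbitrary, answer to "how many clusters?" is the elbow method: compute the within-cluster sum of squares for a range of k and look for the point where it stops dropping sharply. A sketch assuming scikit-learn is available and `X` is the data matrix from the earlier sketch:

```python
from sklearn.cluster import KMeans

for k in range(1, 8):
    # inertia_ is the within-cluster sum of squares for the fitted solution
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wss, 3))   # look for the "elbow" in wss as k grows
```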
Example: Sperm whale codas • A patterned series of clicks: | | | | | with inter-click intervals ic1 ic2 ic3 ic4 • For 5-click codas: a 681 x 4 data set
5-click codas (inter-click intervals ic1–ic4): 93% of the variance in 2 PCs [plot not reproduced]
Hierarchical Cluster Analysis • Usually represented by: • Dendrogram or tree-diagram
Hierarchical Cluster Analysis • Hierarchical Divisive Cluster Analysis • Hierarchical Agglomerative Cluster Analysis
Hierarchical Divisive Cluster Analysis • Starts with all units in one cluster and successively splits them • Successive use of K-Means (or some other divisive technique) with n=2 • Either: each time split the cluster with the greatest sum of squared distances • Or: split every cluster each time • Hierarchical divisive methods are good techniques but, outside network analysis, rarely used (a sketch of the first option follows)
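A minimal sketch of the first option, repeatedly splitting the cluster with the greatest sum of squared distances using 2-means. It reuses the illustrative `kmeans` function sketched earlier; names are assumptions, not the lecture's own code.

```python
import numpy as np

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]          # start: all units in one cluster
    while len(clusters) < n_clusters:
        # pick the cluster with the greatest within-cluster sum of squares
        wss = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        worst = clusters.pop(int(np.argmax(wss)))
        # split it with K-means, n=2 (assumes each split is non-trivial)
        labels, _ = kmeans(X[worst], 2)
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters                         # one index array per cluster
```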
Hierarchical Agglomerative Cluster Analysis • Start with each unit occupying its own cluster • Clusters are then gradually merged until just one is left • The most common cluster analyses
Hierarchical Agglomerative Cluster Analysis
Works on a dissimilarity matrix (or a negated similarity matrix); the dissimilarities may be Euclidean, Penrose, … distances
At each step:
1. There is a symmetric matrix of dissimilarities between clusters
2. The two clusters with the least dissimilarity are merged
3. The dissimilarity between the new (merged) cluster and all other clusters is calculated
Different techniques perform step 3 in different ways:
Hierarchical Agglomerative Cluster Analysis
First link A and D (smallest dissimilarity, 0.11). How to calculate the new dissimilarities?

     A     B     C     D     E
A    0
B    0.35  0
C    0.45  0.67  0
D    0.11  0.45  0.57  0
E    0.22  0.56  0.78  0.19  0

After merging A and D:

     AD    B     C     E
AD   0
B    ?     0
C    ?     0.67  0
E    ?     0.56  0.78  0
Hierarchical Agglomerative Cluster Analysis: Single Linkage
d(AD,B) = min{d(A,B), d(D,B)} = min{0.35, 0.45} = 0.35
Applying this to the example above:

     AD    B     C     E
AD   0
B    0.35  0
C    ?     0.67  0
E    ?     0.56  0.78  0
Hierarchical Agglomerative Cluster Analysis: Complete Linkage
d(AD,B) = max{d(A,B), d(D,B)} = max{0.35, 0.45} = 0.45

     AD    B     C     E
AD   0
B    0.45  0
C    ?     0.67  0
E    ?     0.56  0.78  0
Hierarchical Agglomerative Cluster Analysis: Average Linkage
d(AD,B) = mean{d(A,B), d(D,B)} = mean{0.35, 0.45} = 0.40

     AD    B     C     E
AD   0
B    0.40  0
C    ?     0.67  0
E    ?     0.56  0.78  0
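These three update rules can be checked with SciPy's hierarchical clustering routines; a short sketch (assuming SciPy is available) applied to the example matrix for A–E:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[0.00, 0.35, 0.45, 0.11, 0.22],
              [0.35, 0.00, 0.67, 0.45, 0.56],
              [0.45, 0.67, 0.00, 0.57, 0.78],
              [0.11, 0.45, 0.57, 0.00, 0.19],
              [0.22, 0.56, 0.78, 0.19, 0.00]])
condensed = squareform(D)                  # condensed upper-triangle form
for method in ("single", "complete", "average"):
    # all three merge A and D first (0.11); they differ only in step 3
    Z = linkage(condensed, method=method)
    print(method)
    print(Z)                               # rows: cluster i, cluster j, distance, size
```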
Hierarchical Agglomerative Cluster Analysis: Centroid Clustering (uses the data matrix, or a true distance matrix)
V1(AD) = mean{V1(A), V1(D)}

     V1    V2    V3
A    0.11  0.75  0.33
B    0.35  0.99  0.41
C    0.45  0.67  0.22
D    0.11  0.71  0.37
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21

After merging A and D:

     V1    V2    V3
AD   0.11  0.73  0.35
B    0.35  0.99  0.41
C    0.45  0.67  0.22
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21
Hierarchical Agglomerative Cluster Analysis: Ward’s Method • Minimizes the within-cluster sum of squares • Similar to centroid clustering
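SciPy implements these two methods as well; a minimal continuation of the sketch above. Note that both assume the underlying dissimilarities are Euclidean distances between observations, consistent with the data-matrix requirement noted above.

```python
from scipy.cluster.hierarchy import linkage

# `condensed` is the condensed dissimilarity vector from the earlier sketch
Z_centroid = linkage(condensed, method="centroid")
Z_ward = linkage(condensed, method="ward")
```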
Example similarity matrix (11 individuals; diagonal = 1.00):

      1     2     4     5     9     11    12    14    15    19    20
1   1.00
2   0.00  1.00
4   0.53  0.00  1.00
5   0.18  0.05  0.00  1.00
9   0.22  0.09  0.13  0.25  1.00
11  0.36  0.00  0.17  0.40  0.33  1.00
12  0.00  0.37  0.18  0.00  0.13  0.00  1.00
14  0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
15  0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
19  0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
20  0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00
Hierarchical Agglomerative Clustering Techniques • Single Linkage • Produces “straggly” clusters • Not recommended if there is much experimental error • Used in taxonomy • Invariant to monotonic transformations of the dissimilarities • Complete Linkage • Produces “tight” clusters • Not recommended if there is much experimental error • Invariant to monotonic transformations of the dissimilarities • Average Linkage, Centroid, Ward’s • Most likely to mimic the input clusters • Not invariant to transformations of the dissimilarity measure
Cophenetic Correlation Coefficient (CCC) • Correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis • CCC >~ 0.8 indicates a good match • CCC <~ 0.8: the dendrogram is not a good representation and probably should not be displayed • Use the CCC to choose the best linkage method (the highest coefficient); a sketch follows
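A sketch of computing the CCC with SciPy and using it to compare linkage methods, continuing from the `condensed` example dissimilarities defined earlier:

```python
from scipy.cluster.hierarchy import linkage, cophenet

for method in ("single", "complete", "average"):
    # cophenet returns (correlation, cophenetic distances)
    ccc, _ = cophenet(linkage(condensed, method=method), condensed)
    print(method, round(ccc, 2))   # keep the dendrogram only if CCC >~ 0.8
```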
[Four dendrograms compared: CCC = 0.77, 0.83, 0.80, 0.75]
Additive trees • Dendrogram in which path lengths represent dissimilarities • Computation quite complex (a cross between agglomerative techniques and multidimensional scaling) • Good when data are measured as dissimilarities • Often used in taxonomy and genetics

     A    B    C    D    E
A    .
B    14   .
C    6    12   .
D    81   7    13   .
E    17   1    6    16   .
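One widely used way to fit an additive tree is neighbour joining; a hedged sketch using the scikit-bio package (an assumption: the lecture names no specific algorithm or software) applied to the A–E matrix above:

```python
from skbio import DistanceMatrix
from skbio.tree import nj

dm = DistanceMatrix([[0, 14,  6, 81, 17],
                     [14, 0, 12,  7,  1],
                     [6, 12,  0, 13,  6],
                     [81, 7, 13,  0, 16],
                     [17, 1,  6, 16,  0]],
                    ids=list("ABCDE"))
tree = nj(dm)                     # neighbour-joining additive tree
print(tree.ascii_art())           # branch lengths approximate the dissimilarities
```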
Problems with Cluster Analysis • Are there really biologically-meaningful clusters in the data? • Does the dendrogram represent biological reality (web-of-life versus tree-of-life)? • How many clusters to use? • stopping rules are arbitrary • Which method to use? • best technique is data-dependent • Dendrograms become messy with many units
Clustering Techniques

Type                        Technique                           Use
Non-hierarchical            K-Means                             Dividing data sets
Hierarchical divisive       Repeated K-means                    Good technique on small data sets
                            Network methods                     ...
Hierarchical agglomerative  Single linkage                      Taxonomy
                            Complete linkage                    Tighter clusters
                            Average linkage, Centroid, Ward's   Usually preferred
Hierarchical                Additive trees                      Excellent for displaying dissimilarity; taxonomy, genetics