1. Clustering
2. Clustering Summarizes community data by placing sampling units (SUs) into clusters (groups)
SUs in the same cluster have relatively low dissimilarities
If possible, SUs in different clusters should have larger dissimilarities.
3. Hierarchical vs Nonhierarchical Hierarchical methods produce nested sets of clusters represented by a dendrogram (tree diagram)
Nonhierarchical methods optimize division into a specified number of clusters.
4. Agglomerative vs Divisive Hierarchical methods can be either
agglomerative: each SU starts off in its own cluster, and clusters are formed by successive fusions
divisive: all SUs start off in one large cluster, which is then repeatedly split in two.
5. Monothetic vs Polythetic Clustering Monothetic methods base each split (or fusion) on only one species (variable)
most monothetic techniques are divisive
Polythetic methods use multiple species (variables) to decide on each fusion or split.
6. Hierarchical, polythetic, agglomerative clustering Many well-known and widely used techniques are of this type
Most major statistical packages (e.g. SAS, SPSS, SYSTAT) include these methods
Early applications to community data include Williams et al. (1966).
7. Basic algorithm for hierarchical, polythetic, agglomerative clustering
Step 1: Compute the dissimilarity matrix among all n SUs
Step 2: Put each SU into a cluster on its own
Step 3: Fuse the pair of clusters, p and q, with the smallest dissimilarity
Step 4: Set cluster q to empty and move all its SUs into cluster p
Step 5: Calculate the dissimilarity between the updated cluster p formed by this fusion and each other cluster (how this is done varies among methods)
Step 6: Repeat steps 3-5 n-1 times, by which time all SUs will have been fused into a single cluster.
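The steps above can be sketched in Python. This is a minimal illustration, not a production implementation: single linkage is assumed as the step-5 update rule, the fusion dissimilarity is recomputed from the original matrix for clarity (the noncombinatorial route), and the toy matrix in the test is invented.

```python
def cluster_diss(D, members_a, members_b):
    # Step 5, single-linkage version: minimum over all between-cluster pairs
    return min(D[i][j] for i in members_a for j in members_b)

def agglomerate(D):
    """D: symmetric dissimilarity matrix (list of lists).
    Returns the sequence of fusions as (members_p, members_q, dissimilarity)."""
    n = len(D)
    # Step 2: each SU starts off in a cluster on its own
    clusters = {i: [i] for i in range(n)}
    fusions = []
    for _ in range(n - 1):  # Step 6: n-1 fusion cycles in total
        ids = list(clusters)
        # Step 3: find the pair of clusters p, q with the smallest dissimilarity
        p, q = min(
            ((a, b) for i, a in enumerate(ids) for b in ids[i + 1:]),
            key=lambda pq: cluster_diss(D, clusters[pq[0]], clusters[pq[1]]),
        )
        d = cluster_diss(D, clusters[p], clusters[q])
        fusions.append((clusters[p][:], clusters[q][:], d))
        # Step 4: move all of q's SUs into p and empty (drop) q
        clusters[p].extend(clusters.pop(q))
    return fusions
```

On a 3-SU toy matrix, SUs 0 and 1 fuse first (dissimilarity 2), then that cluster fuses with SU 2 at the single-linkage dissimilarity min(6, 5) = 5.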
8. Pros & Cons of hierarchical, polythetic, agglomerative clustering Later fusions depend on earlier fusions (can lead to misclassification)
Hierarchical constraint – solution not necessarily optimal for all groups
Appealing for multilevel classifications
Don’t have to decide on # of groups
Let data structure guide you
Can be used to partition complex datasets into manageable groups
9. Combinatorial Strategies Lance & Williams (1967) recognized two types of fusion strategies, depending on how dissimilarities between a newly formed cluster and each other cluster are computed (step 5)
Combinatorial strategies only need the values in the dissimilarity matrix
Noncombinatorial
new dissimilarities cannot be calculated from the previous ones
requires access to the original data matrix
after each step the whole dissimilarity matrix must be recalculated (requiring more computing time and memory).
10. Basic Combinatorial Equation If clusters p and q have the smallest dissimilarity, Dpq, they are fused to form cluster r
The dissimilarity, Dir, between new cluster r and each other cluster i is calculated as
Dir = αp Dip + αq Diq + β Dpq + γ |Dip - Diq|
where αp, αq, β and γ are coefficients that define the different clustering strategies.
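The combinatorial update can be written as one small function; the coefficient values below are the standard Lance-Williams choices for single, complete, and average linkage (np and nq are the sizes of the fused clusters, used only by average linkage):

```python
def lance_williams(D_ip, D_iq, D_pq, n_p, n_q, method="single"):
    """Combinatorial update: dissimilarity between cluster i and the new
    cluster r formed by fusing p and q, from existing dissimilarities only."""
    if method == "single":      # nearest neighbor
        ap = aq = 0.5
        beta, gamma = 0.0, -0.5
    elif method == "complete":  # furthest neighbor
        ap = aq = 0.5
        beta, gamma = 0.0, 0.5
    elif method == "average":   # group average (UPGMA)
        ap = n_p / (n_p + n_q)
        aq = n_q / (n_p + n_q)
        beta, gamma = 0.0, 0.0
    else:
        raise ValueError(method)
    return ap * D_ip + aq * D_iq + beta * D_pq + gamma * abs(D_ip - D_iq)
```

With gamma = -0.5 the formula collapses to min(Dip, Diq), and with gamma = +0.5 to max(Dip, Diq), which is why single and complete linkage need only the dissimilarity matrix.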
11. Single Linkage (Nearest Neighbor) Clustering Strategy Dissimilarity between two clusters is the minimum of all the dissimilarities between pairs of SUs that include a member of each cluster.
12. Initial Dissimilarity Matrix At the start, each SU is in its own cluster, of size 1
13. Clustering cycle 1 Fuse SU7 & SU8, creating cluster 7
Compute dissimilarity of cluster 7 to each other SU or cluster
14. Clustering cycle 1 Dissimilarity matrix after cycle 1
15. Clustering cycle 2 Fuse cluster 7 with SU 3, creating cluster 3
Compute dissimilarity of cluster 3 to each other SU or cluster
16. Clustering cycle 2 Dissimilarity matrix after cycle 2
17. Clustering cycle 3 Fuse SU 4 with cluster 3
Compute dissimilarity of updated cluster 3 to each other SU or cluster
18. Clustering cycle 3 Dissimilarity matrix after cycle 3
19. Clustering cycle 4 Fuse SU 6 with SU 5, creating cluster 5
Compute dissimilarity of cluster 5 to each other SU or cluster
20. Clustering cycle 4 Dissimilarity matrix after cycle 4
21. Clustering cycle 5 Fuse cluster 5 with SU 2, creating cluster 2
Compute dissimilarity of cluster 2 to each other SU or cluster
22. Clustering cycle 5 Dissimilarity matrix after cycle 5
23. Clustering cycle 6 Fuse cluster 2 with SU 1, creating cluster 1
Compute dissimilarity of cluster 1 to each other SU or cluster
24. Clustering cycle 6 Dissimilarity matrix after cycle 6
Finally, cluster 3 fuses with cluster 1
25. Plotting the dendrogram Simplest method is to use the dissimilarity at which fusions occur as the height scale
Orientation of dichotomies is arbitrary: they can each be freely pivoted, like a child’s mobile.
26. Scaling Dendrograms Distance function
at each fusion point, a distance between the groups is given
Wishart’s Objective Function (1969)
measures information lost at each step as groups are fused
information is lost at each fusion
27. Single Linkage Dendrogram
28. Chaining Chaining is the addition of single items (SUs) to existing clusters
Unless the data contain discrete clusters, single-linkage dendrograms are usually highly chained
Resulting classification is not very useful
29. Space Contraction In single linkage, clusters become LESS dissimilar to remaining ungrouped SUs as they grow: the space around clusters appears to contract
SUs are more likely to join an existing cluster rather than act as the nucleus of a new cluster
This is why chaining occurs.
30. Complete Linkage (Furthest Neighbor) Clustering Strategy Dissimilarity between two clusters is the maximum of all the dissimilarities between pairs of SUs that include a member of each cluster.
31. Complete Linkage Dendrogram
32. Space Dilation In complete linkage, clusters become MORE dissimilar to remaining ungrouped SUs as they grow: the space around clusters appears to expand
SUs are less likely to join an existing cluster and more likely to act as the nucleus of a new cluster
Dendrograms produced by space dilating strategies have clear clusters with long stems and not much chaining.
33. Ward’s Method Fuses clusters that minimize the increase in within-group sum-of-squares (squared distances of SUs from the centroid of their cluster)
Only defined for Euclidean dissimilarities (squared Euclidean distances).
34. Ward’s Method Dendrogram (Squared Chord Distance)
35. Ward’s Method Is space-conserving
Tends to produce clusters with equal numbers of SUs
Squared chord distance is the recommended dissimilarity measure if using Ward’s on community data.
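The increase in within-group sum-of-squares that Ward's method minimizes has a closed form in terms of cluster sizes and centroids: (na·nb)/(na+nb) times the squared Euclidean distance between the centroids. A minimal sketch under that Euclidean assumption, with invented toy points:

```python
def ward_increase(cluster_a, cluster_b):
    """Increase in within-group sum-of-squares caused by fusing two clusters
    of points: (na*nb)/(na+nb) * ||centroid_a - centroid_b||^2."""
    na, nb = len(cluster_a), len(cluster_b)
    dim = len(cluster_a[0])
    ca = [sum(p[k] for p in cluster_a) / na for k in range(dim)]
    cb = [sum(p[k] for p in cluster_b) / nb for k in range(dim)]
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq
```

At each cycle, Ward's method fuses the pair of clusters for which this quantity is smallest.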
36. Average Linkage (Group Average, UPGMA) Clustering Strategy Dissimilarity between two clusters is the average of all the dissimilarities between pairs of SUs that include a member of each cluster.
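Single, complete, and average linkage differ only in how the set of between-cluster pairwise dissimilarities is summarized (minimum, maximum, or mean). A small illustration of the three definitions side by side, on an invented toy matrix:

```python
def linkage_diss(D, A, B, method):
    """Dissimilarity between clusters A and B (lists of SU indices) under
    single (min), complete (max), or average (mean) linkage."""
    pairs = [D[i][j] for i in A for j in B]
    summarize = {"single": min,
                 "complete": max,
                 "average": lambda v: sum(v) / len(v)}[method]
    return summarize(pairs)
```

For clusters {0, 1} and {2, 3} the between-cluster pairs below are 4, 5, 3, 6, so the three strategies report 3, 6, and 4.5 respectively.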
37. Average Linkage Dendrogram
38. Space-conserving Methods In average linkage, clusters do not become more or less dissimilar to remaining ungrouped SUs as they grow: the space around clusters is preserved
Dendrograms produced by space conserving strategies tend to be intermediate in structure
not as much chaining as single linkage
for continuous data, Ward’s method produces more equitable clusters than average linkage
39. Lance-Williams Flexible Beta Clustering β can be chosen between -1 and +1
values between 0 and +1 produce weak, space-contracting behavior
β ≈ -0.25 is space-conserving (β = -0.25 behaves like Ward’s)
values less than -0.25 produce strong, space-dilating behavior
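In the flexible strategy the Lance-Williams coefficients are tied to β: αp = αq = (1 - β)/2 and γ = 0, so that αp + αq + β = 1. A sketch with invented dissimilarities showing how β shifts the updated distance:

```python
def flexible_beta(D_ip, D_iq, D_pq, beta):
    """Lance-Williams flexible strategy:
    alpha_p = alpha_q = (1 - beta)/2, gamma = 0."""
    alpha = (1 - beta) / 2
    return alpha * (D_ip + D_iq) + beta * D_pq
```

With Dip = 3, Diq = 7, Dpq = 2, a positive β pulls the updated dissimilarity below the plain average of 5 (space contraction), while β = -0.5 pushes it above (space dilation).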
40. Flexible, Beta=-0.5 Dendrogram
42. Two-way Indicator Species Analysis (TWINSPAN) Popular computer program TWINSPAN
Divisive, hierarchical clustering
Each dichotomy is determined by splitting the first axis of a CA ordination
Indicator species are defined for each dichotomy
Uses presence data only: abundance data are recoded into a series of “pseudospecies”.
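The pseudospecies recoding can be sketched in a few lines: one abundance value becomes a presence/absence vector, with one pseudospecies "present" for each cut level the abundance exceeds. The cut levels below are illustrative example values, not necessarily TWINSPAN's defaults:

```python
def to_pseudospecies(abundance, cut_levels=(0, 2, 5, 10, 20)):
    """Recode an abundance value as presence/absence of pseudospecies.
    The pseudospecies at cut level c is 'present' when abundance > c.
    cut_levels here are example values for illustration only."""
    return [int(abundance > c) for c in cut_levels]
```

An abundance of 7 thus becomes three present pseudospecies (cuts 0, 2, 5) and two absent ones (cuts 10, 20).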
43. Two-way Indicator Species Analysis (TWINSPAN) Once the SU clustering is complete, a matrix of species × clusters is produced
an ordered 2-way table is produced that shows the SU and species clusters
seeks groups in species data and reports indicator species for those groups
44. Two-way Indicator Species Analysis (TWINSPAN) The method is extremely popular, probably because of the extra “goodies” provided by the TWINSPAN program rather than the quality of its SU clustering relative to other methods
Should not be used by ecologists UNLESS a 2-way ordered table is needed for a data set with a simple 1-D structure
45. Which method? Most community data are relatively continuous, without “natural” clusters separated by discontinuities
Strongly clustering (space dilating) methods are best to divide up the variation into equitable chunks
Flexible with beta = -0.25 or -0.5
Ward’s method (using squared Chord distance)
But, if there are some discontinuities in the data, space dilating methods may not indicate them
46. How many clusters? Once you have the complete dendrogram, branches can be “pruned” at different levels
There is no simple statistical test for the number of clusters required
Decision is a compromise between interpretability (small number of clusters) and within-cluster variability
Plotting cluster membership on an ordination helps to visualize how well they divide up the total variability.
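Pruning has a simple arithmetic form: starting from n SUs, each fusion whose height lies at or below the cut reduces the cluster count by one. A minimal sketch, assuming the dendrogram is summarized by its list of fusion heights:

```python
def n_clusters_at(fusion_heights, n_leaves, cut):
    """Number of clusters left after pruning a dendrogram at a given
    height: each fusion at or below the cut merges two clusters into one."""
    return n_leaves - sum(1 for h in fusion_heights if h <= cut)
```

Trying several cut heights and inspecting the resulting group sizes is one practical way to explore the interpretability/variability trade-off.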
47. Interpreting Cluster Results Examine differences among clusters in
the community data used to produce the clusters
mean abundances and frequencies of occurrence of species within each cluster
indicator species analysis
environmental variables (or other possible explanatory variables)
summary statistics (means, SEs, box plots) in each cluster
one-way ANOVAs and multiple-means comparisons
MANOVA, Discriminant Analysis
48. Nonhierarchical Methods You specify the number of groups
items are placed into those groups, attempting to optimize a statistical characteristic of the groups
“K-means”
Most commonly used
Useful for large datasets
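K-means iterates two steps: assign each SU to its nearest centroid, then move each centroid to the mean of its assigned SUs. A bare-bones sketch of this (Lloyd's algorithm), with invented toy points; real analyses would use a library implementation with multiple random starts:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on tuples of coordinates.
    Returns the final centroids and the grouped points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # Update step: move each centroid to the mean of its group
        centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids, groups
```

On two well-separated pairs of points the centroids settle at the two group means regardless of which points are sampled as starting centroids.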
49. Indicator Species Analysis Dufrêne & Legendre (1997)
Identifies species that are good “indicators” of particular clusters
Based on combination of two properties
Fidelity: degree to which a species is confined to the cluster
Constancy: percentage of SUs in the cluster in which the species is present
An ideal indicator has both high fidelity and high constancy.
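The two properties combine into a single indicator value per species and cluster, IndVal = fidelity × constancy × 100, which is 100 only for the ideal indicator. A sketch following that combination rule, taking fidelity as the cluster's share of the species' mean abundance and constancy as its within-cluster frequency of presence (toy abundances invented):

```python
def indval(abundances_by_cluster):
    """Indicator value of one species for each cluster, combining
    fidelity and constancy as IndVal = fidelity * constancy * 100.
    abundances_by_cluster: one list of SU abundances per cluster."""
    means = [sum(a) / len(a) for a in abundances_by_cluster]
    total = sum(means)
    out = []
    for a, m in zip(abundances_by_cluster, means):
        fidelity = m / total if total else 0.0       # share of mean abundance
        constancy = sum(x > 0 for x in a) / len(a)   # frequency of presence
        out.append(fidelity * constancy * 100)
    return out
```

A species confined to one cluster and present in all of its SUs scores 100 there and 0 elsewhere; partial fidelity or constancy pulls the score down proportionally.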