
Clustering

Presentation Transcript


    1. Clustering

    2. Clustering summarizes community data by placing sampling units (SUs) into clusters (groups). SUs in the same cluster have relatively low dissimilarities; if possible, SUs in different clusters should have larger dissimilarities.

    3. Hierarchical vs Nonhierarchical. Hierarchical methods produce nested sets of clusters, represented by a dendrogram (tree diagram). Nonhierarchical methods optimize the division of SUs into a specified number of clusters.

    4. Agglomerative vs Divisive. Hierarchical methods can be either agglomerative (each SU starts off in its own cluster, and clusters are formed by successive fusions) or divisive (all SUs start off in one large cluster, which is then repeatedly split in two).

    5. Monothetic vs Polythetic Clustering. Monothetic methods base each split (or fusion) on only one species (variable); most monothetic techniques are divisive. Polythetic methods use multiple species (variables) to decide on each fusion or split.

    6. Hierarchical, polythetic, agglomerative clustering. Many well-known and widely used techniques are of this type, and most major statistical packages (e.g. SAS, SPSS, SYSTAT) include them. Early applications to community data include Williams et al. (1966).

    7. Basic algorithm for hierarchical, polythetic, agglomerative clustering. (1) Compute the dissimilarity matrix among all n SUs. (2) Put each SU into a cluster of its own. (3) Fuse the pair of clusters, p and q, with the smallest dissimilarity. (4) Set cluster q to empty and move all its SUs into cluster p. (5) Calculate the dissimilarity between the updated cluster p and each other cluster (how this is done varies among methods). (6) Repeat steps 3-5 n-1 times, by which time all SUs will have been fused into a single cluster.
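
    To make the algorithm concrete, here is a minimal Python sketch of the six steps above, using single linkage as the fusion rule in step 5. The function name and the small example matrix are illustrative, not part of the original slides.

        import numpy as np

        def agglomerate_single_linkage(D):
            """Naive hierarchical, polythetic, agglomerative clustering.

            D is an n x n symmetric dissimilarity matrix. Returns a list of
            (cluster_p, cluster_q, dissimilarity) tuples recording each fusion.
            """
            D = D.astype(float).copy()
            n = D.shape[0]
            active = list(range(n))            # step 2: each SU is its own cluster
            fusions = []
            for _ in range(n - 1):             # step 6: n-1 fusion cycles
                # step 3: find the pair of active clusters with the smallest dissimilarity
                p, q = min(((i, j) for i in active for j in active if i < j),
                           key=lambda ij: D[ij])
                fusions.append((p, q, D[p, q]))
                # step 5: single-linkage update, i.e. keep the smaller of the two old values
                for i in active:
                    if i not in (p, q):
                        D[p, i] = D[i, p] = min(D[p, i], D[q, i])
                active.remove(q)               # step 4: cluster q is emptied into cluster p
            return fusions

        # Hypothetical 4-SU dissimilarity matrix, just to show the call
        D = np.array([[0.0, 0.3, 0.8, 0.9],
                      [0.3, 0.0, 0.7, 0.8],
                      [0.8, 0.7, 0.0, 0.2],
                      [0.9, 0.8, 0.2, 0.0]])
        print(agglomerate_single_linkage(D))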

    8. Pros & Cons of hierarchical, polythetic, agglomerative clustering Later fusions depend on earlier fusions (can lead to misclassification) Hierarchical constraint – solution not necessarily optimal for all groups Appealing for multilevel classifications Don’t have to decide on # of groups Let data structure guide you Can be used to partition complex datasets into manageable groups

    9. Combinatorial Strategies. Lance & Williams (1967) recognized two types of fusion strategies, depending on how dissimilarities between a newly formed cluster and each other cluster are computed (step 5). Combinatorial strategies need only the values already in the dissimilarity matrix. With noncombinatorial strategies, the new dissimilarities cannot be calculated from the previous ones: access to the original data matrix is required, and after each step the whole dissimilarity matrix must be recalculated (which takes more computing time and memory).

    10. Basic Combinatorial Equation. If clusters p and q have the smallest dissimilarity, Dpq, they are fused to form cluster r. The dissimilarity, Dir, between the new cluster r and each other cluster i is calculated as Dir = αp Dip + αq Diq + β Dpq + γ |Dip − Diq|, where αp, αq, β and γ are coefficients that define the different clustering strategies.
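
    The coefficient values that define the strategies discussed on the following slides are standard and can be plugged straight into this equation. The helper function below is a hypothetical illustration (the name and interface are mine); n_i, n_p and n_q denote the sizes of cluster i and of the two fused clusters, which are needed where the weights depend on cluster size.

        def lance_williams_update(d_ip, d_iq, d_pq, n_i, n_p, n_q,
                                  method="single", beta=-0.25):
            """Dissimilarity between cluster i and the cluster r formed by fusing
            p and q, via Dir = ap*Dip + aq*Diq + b*Dpq + g*|Dip - Diq|."""
            if method == "single":            # nearest neighbour (space-contracting)
                a_p, a_q, b, g = 0.5, 0.5, 0.0, -0.5
            elif method == "complete":        # furthest neighbour (space-dilating)
                a_p, a_q, b, g = 0.5, 0.5, 0.0, 0.5
            elif method == "average":         # group average / UPGMA (space-conserving)
                a_p, a_q, b, g = n_p / (n_p + n_q), n_q / (n_p + n_q), 0.0, 0.0
            elif method == "ward":            # Ward's minimum-variance method
                tot = n_i + n_p + n_q
                a_p, a_q, b, g = (n_i + n_p) / tot, (n_i + n_q) / tot, -n_i / tot, 0.0
            elif method == "flexible":        # Lance-Williams flexible beta
                a_p = a_q = (1.0 - beta) / 2.0
                b, g = beta, 0.0
            else:
                raise ValueError(f"unknown method: {method}")
            return a_p * d_ip + a_q * d_iq + b * d_pq + g * abs(d_ip - d_iq)

        # Single linkage reduces to min(Dip, Diq):
        print(lance_williams_update(0.7, 0.4, 0.2, 1, 1, 1, method="single"))  # 0.4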

    11. Single Linkage (Nearest Neighbor) Clustering Strategy. The dissimilarity between two clusters is the minimum of all the dissimilarities between pairs of SUs that include a member of each cluster.

    12. Initial Dissimilarity Matrix. At the start, each SU is in its own cluster of size 1.

    13. Clustering cycle 1. Fuse SU 7 & SU 8, creating cluster 7. Compute the dissimilarity of cluster 7 to each other SU or cluster.

    14. Clustering cycle 1: dissimilarity matrix after cycle 1.

    15. Clustering cycle 2. Fuse cluster 7 with SU 3, creating cluster 3. Compute the dissimilarity of cluster 3 to each other SU or cluster.

    16. Clustering cycle 2: dissimilarity matrix after cycle 2.

    17. Clustering cycle 3. Fuse SU 4 with cluster 3. Compute the dissimilarity of the updated cluster 3 to each other SU or cluster.

    18. Clustering cycle 3: dissimilarity matrix after cycle 3.

    19. Clustering cycle 4. Fuse SU 6 with SU 5, creating cluster 5. Compute the dissimilarity of cluster 5 to each other SU or cluster.

    20. Clustering cycle 4: dissimilarity matrix after cycle 4.

    21. Clustering cycle 5. Fuse cluster 5 with SU 2, creating cluster 2. Compute the dissimilarity of cluster 2 to each other SU or cluster.

    22. Clustering cycle 5: dissimilarity matrix after cycle 5.

    23. Clustering cycle 6. Fuse cluster 2 with SU 1, creating cluster 1. Compute the dissimilarity of cluster 1 to each other SU or cluster.

    24. Clustering cycle 6: dissimilarity matrix after cycle 6. Finally, cluster 3 fuses with cluster 1.

    25. Plotting the dendrogram. The simplest method is to use the dissimilarity at which each fusion occurs as the height scale. The orientation of the dichotomies is arbitrary: each can be freely pivoted, like a child’s mobile.
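
    For readers working in Python, a minimal sketch of producing such a dendrogram with SciPy follows; the SU-by-species matrix is hypothetical, and single linkage is used to match the worked example.

        import numpy as np
        import matplotlib.pyplot as plt
        from scipy.cluster.hierarchy import linkage, dendrogram
        from scipy.spatial.distance import pdist

        # Hypothetical SU-by-species abundance matrix (8 SUs, 5 species)
        rng = np.random.default_rng(1)
        X = rng.poisson(3, size=(8, 5)).astype(float)

        D = pdist(X, metric="euclidean")      # condensed dissimilarity matrix
        Z = linkage(D, method="single")       # single-linkage fusion sequence

        # The dissimilarity at which each fusion occurs is used as the height
        # scale; the left-right orientation of each dichotomy is arbitrary.
        dendrogram(Z, labels=[f"SU{i + 1}" for i in range(X.shape[0])])
        plt.ylabel("Dissimilarity at fusion")
        plt.show()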

    26. Scaling Dendrograms. Two height scales are commonly used: the distance scale, where each fusion point is plotted at the dissimilarity between the groups being fused, and Wishart’s objective function (1969), which measures the information lost at each step as groups are fused (some information is lost at every fusion).
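
    Assuming the usual error-sum-of-squares form of Wishart’s objective function (the slides do not give the formula, so this is an assumption on my part), the quantity tracked as fusions proceed can be sketched as:

        import numpy as np

        def wisharts_objective(X, groups):
            """Total within-group sum of squares (assumed form of Wishart's
            objective function): for each cluster, sum the squared deviations
            of its SUs from the cluster centroid, then add over clusters."""
            return sum(((X[groups == g] - X[groups == g].mean(axis=0)) ** 2).sum()
                       for g in np.unique(groups))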

    27. Single Linkage Dendrogram

    28. Chaining. Chaining is the addition of single items (SUs) to existing clusters. Unless the data contain discrete clusters, single-linkage dendrograms are usually highly chained, and the resulting classification is not very useful.

    29. Space Contraction. In single linkage, clusters become LESS dissimilar to the remaining ungrouped SUs as they grow: the space around clusters appears to contract. SUs are therefore more likely to join an existing cluster than to act as the nucleus of a new one. This is why chaining occurs.

    30. Complete Linkage (Furthest Neighbor) Clustering Strategy. The dissimilarity between two clusters is the maximum of all the dissimilarities between pairs of SUs that include a member of each cluster.

    31. Complete Linkage Dendrogram

    32. Space Dilation. In complete linkage, clusters become MORE dissimilar to the remaining ungrouped SUs as they grow: the space around clusters appears to expand. SUs are less likely to join an existing cluster and more likely to act as the nucleus of a new one. Dendrograms produced by space-dilating strategies have clear clusters with long stems and little chaining.

    33. Ward’s Method. Fuses the pair of clusters that minimizes the increase in the within-group sum-of-squares (the squared distances of SUs from the centroid of their cluster). Only defined for Euclidean dissimilarities (squared Euclidean distances).

    34. Ward’s Method Dendrogram (Squared Chord Distance)

    35. Ward’s Method is space-conserving and tends to produce clusters with equal numbers of SUs. Squared chord distance is the recommended dissimilarity measure when using Ward’s method on community data.
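
    One practical way to follow this recommendation in Python is sketched below. It relies on the fact that chord distance equals the Euclidean distance between row-normalized SU vectors, so Ward’s sum-of-squares criterion then operates on squared chord distances; the data are hypothetical.

        import numpy as np
        from scipy.cluster.hierarchy import linkage
        from scipy.spatial.distance import pdist

        # Hypothetical SU-by-species abundance matrix
        rng = np.random.default_rng(0)
        X = rng.poisson(2, size=(10, 6)).astype(float)

        # Normalize each SU (row) to unit length: Euclidean distances between
        # the normalized rows are chord distances.
        X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

        Z = linkage(pdist(X_norm, metric="euclidean"), method="ward")
        print(Z[:3])   # first fusions: [cluster, cluster, fusion height, cluster size]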

    36. Average Linkage (Group Average, UPGMA) Clustering Strategy. The dissimilarity between two clusters is the average of all the dissimilarities between pairs of SUs that include a member of each cluster.

    37. Average Linkage Dendrogram

    38. Space-conserving Methods. In average linkage, clusters become neither more nor less dissimilar to the remaining ungrouped SUs as they grow: the space around clusters is preserved. Dendrograms produced by space-conserving strategies tend to be intermediate in structure, with less chaining than single linkage. For continuous data, Ward’s method produces more equitable clusters than average linkage.

    39. Lance-Williams Flexible Beta Clustering. β can be chosen between −1 and +1. Values between 0 and 1 produce weak, space-contracting behavior; β ≈ −0.25 is space-conserving (−0.25 behaves much like Ward’s method); values below −0.25 produce strong, space-dilating behavior.
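
    A flexible-beta fusion can be expressed directly through the combinatorial equation of slide 10, with αp = αq = (1 − β)/2 and γ = 0. The arithmetic below is a small illustration for β = −0.5 with made-up dissimilarities, showing the space-dilating effect.

        # Flexible beta in Lance-Williams terms: a_p = a_q = (1 - beta) / 2, g = 0
        beta = -0.5                      # a strongly space-dilating choice
        a_p = a_q = (1.0 - beta) / 2.0   # = 0.75 for beta = -0.5
        d_ip, d_iq, d_pq = 0.7, 0.4, 0.2
        d_ir = a_p * d_ip + a_q * d_iq + beta * d_pq
        print(d_ir)  # 0.725: larger than the group-average value of 0.55, so the
                     # new cluster looks MORE dissimilar to the remaining SUs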

    40. Flexible, Beta=-0.5 Dendrogram

    42. Two-way Indicator Species Analysis (TWINSPAN). A popular computer program performing divisive, hierarchical clustering. Each dichotomy is determined by splitting the first axis of a correspondence analysis (CA) ordination, and indicator species are defined for each dichotomy. Uses presence data only: abundance data are recoded into a series of “pseudospecies”.

    43. Two-way Indicator Species Analysis (TWINSPAN). Once the SU clustering is complete, a species × clusters matrix is produced and an ordered two-way table is constructed showing both the SU and species clusters. The method thus seeks groups in the species data and reports indicator species for those groups.

    44. Two-way Indicator Species Analysis (TWINSPAN). The method is extremely popular, probably because of the extra “goodies” provided by the TWINSPAN program rather than the quality of its SU clustering relative to other methods. It should not be used by ecologists UNLESS a two-way ordered table is needed for a data set with a simple one-dimensional structure.

    45. Which method? Most community data are relatively continuous, without “natural” clusters separated by discontinuities. Strongly clustering (space-dilating) methods are best for dividing the variation into equitable chunks: flexible beta with β = −0.25 or −0.5, or Ward’s method (using squared chord distance). But if there are discontinuities in the data, space-dilating methods may not reveal them.

    46. How many clusters? Once you have the complete dendrogram, its branches can be “pruned” at different levels. There is no simple statistical test for the number of clusters required; the decision is a compromise between interpretability (a small number of clusters) and within-cluster variability. Plotting cluster membership on an ordination helps to visualize how well the clusters divide up the total variability.
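
    As an illustration of pruning in Python, the desired number of clusters (or a cut height) can be passed to SciPy’s fcluster; the data and cut values here are hypothetical.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster
        from scipy.spatial.distance import pdist

        # Hypothetical SU-by-species data, clustered with Ward's method
        rng = np.random.default_rng(1)
        X = rng.poisson(3, size=(8, 5)).astype(float)
        Z = linkage(pdist(X), method="ward")

        # Prune the dendrogram to exactly 4 clusters...
        groups_k = fcluster(Z, t=4, criterion="maxclust")
        # ...or cut it at a chosen dissimilarity level.
        groups_h = fcluster(Z, t=3.0, criterion="distance")
        print(groups_k)   # cluster membership of each SU, e.g. for plotting on an ordination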

    47. Interpreting Cluster Results. Examine differences among clusters in the community data used to produce them: mean abundances and frequencies of occurrence of species within each cluster, and indicator species analysis. Also examine environmental variables (or other possible explanatory variables): summary statistics (means, SEs, box plots) within each cluster, one-way ANOVAs with multiple-means comparisons, MANOVA, and Discriminant Analysis.

    48. Nonhierarchical Methods. You specify the number of groups, and items are placed into those groups so as to optimize a statistical characteristic of the grouping. “K-means” is the most commonly used method and is useful for large datasets.
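
    A minimal k-means sketch on a hypothetical SU-by-species matrix, using scikit-learn (the parameter choices are illustrative):

        import numpy as np
        from sklearn.cluster import KMeans

        # Hypothetical SU-by-species abundance matrix
        rng = np.random.default_rng(2)
        X = rng.poisson(3, size=(30, 6)).astype(float)

        # You specify the number of groups (k); the algorithm then minimizes the
        # within-cluster sum of squared Euclidean distances to the centroids.
        km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
        print(km.labels_)    # cluster membership of each SU
        print(km.inertia_)   # optimized within-cluster sum of squares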

    49. Indicator Species Analysis (Dufrêne & Legendre 1997). Identifies species that are good “indicators” of particular clusters, based on a combination of two properties: fidelity, the degree to which a species is confined to the cluster, and constancy, the percentage of SUs in the cluster in which the species is present. An ideal indicator has both high fidelity and high constancy.
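
    A minimal sketch of the indicator-value idea, as a simplified reading of Dufrêne & Legendre’s IndVal in which relative abundance across clusters stands in for fidelity and relative frequency within a cluster for constancy; the data and function are hypothetical.

        import numpy as np

        def indicator_values(X, groups):
            """IndVal-style scores: rows of X are SUs, columns are species, and
            groups gives each SU's cluster. Returns a (clusters x species) array
            of indicator values in percent."""
            labels = np.unique(groups)
            mean_abund = np.array([X[groups == g].mean(axis=0) for g in labels])
            # Fidelity: a species' mean abundance in the cluster relative to its
            # mean abundance summed over all clusters
            A = mean_abund / mean_abund.sum(axis=0, keepdims=True)
            # Constancy: proportion of SUs in the cluster where the species occurs
            B = np.array([(X[groups == g] > 0).mean(axis=0) for g in labels])
            return 100.0 * A * B   # ideal indicator: high fidelity AND high constancy

        # Hypothetical abundances and cluster membership
        rng = np.random.default_rng(3)
        X = rng.poisson(2, size=(12, 4)).astype(float)
        groups = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])
        print(indicator_values(X, groups).round(1))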
