An Overview of Clustering Methods

This overview introduces clustering: what it is, how it works mechanically, how parameter choices affect the result, and published examples. It covers gene clustering and sample clustering, and compares several clustering methods and their effects.

Presentation Transcript


  1. An Overview of Clustering Methods. Michael D. Kane, Ph.D.

  2. Topics • What is clustering? • Clustering mechanics (how the computer does it). • Parameter choices and their effect. • Examples.

  3. What is clustering? Grouping by similarity.

  4. Gene clustering: similar genes. Group genes that have similar expression profiles when observed over multiple samples. (Figure: expression matrix with axes labeled Samples and Genes.)

  5. Sample clustering: similar samples. Group samples that are similar when observed over multiple genes. (Figure: expression matrix with axes labeled Samples and Genes.)

  6. Why cluster? • Similar gene expression implies common biology. The function of uncharacterized genes may be deduced from co-expression with known genes. • Associate expression patterns with: response to environmental change; disease pathology/progression.

  7. Clustering mechanics. For gene clustering, we must measure similarity between genes. (Figure: genes a to f, each with expression values in samples E1 and E2, plotted as points on E1/E2 axes running from - to +.)

  8. Distance (similarity) measure: Euclidean distance. (Figure: genes a to f on E1/E2 axes; the distance d_be is marked between the points (1.0, 1.7) and (4.6, 0.5).)
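
To make the Euclidean measure concrete, here is a minimal Python sketch that computes the distance between the two points labelled on the slide. The function name `euclidean_distance` is an illustrative choice, not something from the presentation, and the coordinates are assumed to be (E1, E2) pairs.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two expression profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same samples")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The two points labelled on the slide, assumed to be (E1, E2) coordinates.
point_1 = (1.0, 1.7)
point_2 = (4.6, 0.5)

d = euclidean_distance(point_1, point_2)
print(round(d, 2))  # 3.79, i.e. sqrt(3.6**2 + 1.2**2)
```

The same function works unchanged for profiles measured over more than two samples, since it sums over however many coordinates the profiles have.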

  9. Distance measure: Pearson correlation. The similarity S ranges from -1 to +1. Used in “Eisen” clustering.
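
As a rough illustration of why correlation behaves differently from Euclidean distance, the sketch below computes the standard centered Pearson coefficient in plain Python. The gene profiles are invented for the example, and the function name `pearson` is hypothetical; whether a particular clustering program uses exactly this centered form is not stated in the slides.

```python
import math

def pearson(x, y):
    """Centered Pearson correlation between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a flat profile")
    return cov / (spread_x * spread_y)

# Invented profiles over four samples (not data from the presentation).
gene_up = [1.0, 2.0, 3.0, 4.0]
gene_up_scaled = [2.0, 4.0, 6.0, 8.0]   # same shape, larger magnitude
gene_down = [4.0, 3.0, 2.0, 1.0]        # opposite shape

print(pearson(gene_up, gene_up_scaled))  # close to +1
print(pearson(gene_up, gene_down))       # close to -1
```

The two rising profiles are far apart in Euclidean terms but have a correlation of about +1, so correlation groups genes by the shape of their profile rather than by its magnitude.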

  10. Hierarchical clustering. (Figure: genes a to f on E1/E2 axes, with the resulting grouping drawn over the leaves a b c d e f.)

  11. Measuring distance between clusters • Single linkage: the minimum distance between any two members, one from each cluster. May form loose clusters. Produces “chained” clusters. • Complete linkage: the maximum distance between any two members, one from each cluster. Tends to form compact clusters.

  12. Methods for joining clusters • UPGMA, the unweighted pair group method with arithmetic mean (average linkage): the average distance between clusters. • Weighted pair group method: same as UPGMA but the distance is weighted by cluster size. Use when clusters are expected to be significantly uneven in size!
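
The single, complete, and average (UPGMA) rules above differ only in how they summarize the member-to-member distances between two clusters. The following is a minimal, unoptimized sketch of bottom-up clustering under that reading. The gene coordinates are invented, the function names are hypothetical, and the weighted pair group variant is not implemented.

```python
import math
from itertools import combinations

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(dists)               # closest pair of members
    if linkage == "complete":
        return max(dists)               # farthest pair of members
    if linkage == "average":
        return sum(dists) / len(dists)  # UPGMA: mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                clusters[ab[0]], clusters[ab[1]], points, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Invented (E1, E2) coordinates for genes a to f, chosen to be well separated.
names = ["a", "b", "c", "d", "e", "f"]
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 5.1), (9.0, 1.0), (9.1, 1.3)]

for linkage in ("single", "complete", "average"):
    result = agglomerate(points, 3, linkage)
    print(linkage, [[names[i] for i in c] for c in result])
    # each linkage prints [['a', 'b'], ['c', 'd'], ['e', 'f']] on this toy data
```

On clean, well-separated data the three rules agree. They diverge on elongated or noisy data, where single linkage tends to chain points together and complete linkage favors compact groups.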

  13. Effect of distance measure. (Figure panels: Euclidean distance with single linkage; Euclidean distance with complete linkage.)

  14. Effect of distance measure. (Figure panels: Euclidean distance with UPGMA; Euclidean distance with Ward’s method.)

  15. Alternatives to hierarchical clustering: k-means • Number of clusters specified by user. • Good when prior knowledge available.

  16. k-means clustering. 1. Number of clusters specified by user. 2. Genes randomly assigned to clusters. 3. Assess inter- and intra-cluster similarity. 4. Move genes to an alternative cluster if distance is reduced. (Figure: genes a to f on E1/E2 axes.)
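
The four steps above can be sketched in a few lines of Python. This is one common reading of them (recompute each cluster's centroid, then move every gene to its nearest centroid), not necessarily the exact variant the slide has in mind. The function name, the `seed` and `max_iter` parameters, the handling of empty clusters, and the gene coordinates are all assumptions made for the example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    # Step 2: assign every gene to a random cluster.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Step 3: summarize each cluster by the centroid of its members.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                # Assumption: re-seed an empty cluster from a random gene.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # Step 4: move each gene to the cluster whose centroid is nearest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no gene moved, so the result is stable
            break
        labels = new_labels
    return labels

# Invented (E1, E2) coordinates for genes a to f.
genes = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 5.1), (9.0, 1.0), (9.1, 1.3)]
print(k_means(genes, k=3))
```

Because the starting assignment is random, different seeds can give different final clusters. Running the procedure several times and comparing the results is a common precaution.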

  17. Alternatives to hierarchical clustering: SOM (self-organizing maps) • Number of clusters specified by user. • Good when prior knowledge available.

  18. SOM • User-specified number of clusters. • Each cluster is initially given a random expression representation. • Iteratively train the cluster representations (“training”): for a gene, find the most similar cluster representation, then increase the similarity by adjusting the cluster representation. • After training, assign each gene to the most similar cluster. (Figure: E1/E2 expression profiles for genes a to f and for the representations of clusters 1 to 3 at successive training steps.)
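
The training loop described above can be sketched as a small one-dimensional SOM in Python. The slide does not give a learning rate, a neighborhood function, or a stopping rule, so the linearly decaying learning rate, the Gaussian neighborhood over node positions, the fixed number of epochs, and all names below are assumptions chosen for illustration. The gene profiles are invented.

```python
import math
import random

def best_node(profile, nodes):
    """Index of the cluster representation most similar to the profile."""
    return min(range(len(nodes)), key=lambda i: math.dist(profile, nodes[i]))

def train_som(profiles, n_nodes, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(profiles[0])
    lo = [min(p[d] for p in profiles) for d in range(dim)]
    hi = [max(p[d] for p in profiles) for d in range(dim)]
    # Each cluster starts with a random expression representation.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
             for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # shrinks from 1 toward 0
        lr = lr0 * frac                      # assumed learning-rate schedule
        radius = max(radius0 * frac, 1e-3)   # assumed neighborhood width
        order = list(profiles)
        rng.shuffle(order)
        for profile in order:
            winner = best_node(profile, nodes)
            for i, node in enumerate(nodes):
                # Nodes near the winner on the 1-D map are pulled along too.
                influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    node[d] += lr * influence * (profile[d] - node[d])
    return nodes

def assign(profiles, nodes):
    """After training, give each gene the label of its most similar node."""
    return [best_node(p, nodes) for p in profiles]

# Invented (E1, E2) profiles for genes a to f.
genes = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 5.1), (9.0, 1.0), (9.1, 1.3)]
nodes = train_som(genes, n_nodes=3)
print(assign(genes, nodes))
```

The neighborhood update is what separates a SOM from k-means: adjusting a winning representation also nudges its neighbors on the map, so adjacent clusters end up with related profiles.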

  19. Gene clustering. Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS v95, 14863-14868, 1998. 24 hour time course after re-introduction of serum to serum-deprived human fibroblasts. Pearson correlation, average linkage. (Figure cluster labels: cholesterol biosynthesis; cell cycle; immediate-early response; signaling; wound healing.)

  20. Sample clustering. Ross et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics v24, 227-235, 2000. 64 cancer cell lines clustered. 8,000 genes. Clustering performed with 2 different subsets of genes, with similar results. Pearson correlation, average linkage. Note the breast cancer cell lines derived from the same patient.

  21. Summary • Different methods often provide different clusters. • No overall “best” clustering method. • Clustering applied to unrelated data will still provide clusters. • Use biological insight in method selection and interpretation.

  22. Clustering. (Figure: genes a to f on E1/E2 axes, with the resulting grouping drawn over the leaves a b c d e f.)

  23. SOM. After training, assign each gene to the most similar cluster. (Figure: E1/E2 expression profiles for genes a to f and for the representations of clusters 1 to 3.)
