Clustering


Presentation Transcript


  1. Clustering Jarno Tuimala

  2. Clustering
     • Aim
       • Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related to each other than to objects in other clusters
     • Exploratory data analysis
       • View all data simultaneously
       • Identify clusters and patterns in data
     • Uses:
       • Time series analysis
       • Visualization of known classes

  3. Unsupervised vs. Supervised

  4. Clustering methods
     • Hierarchical clustering
       • Single, average (UPGMA) and complete linkage
     • Non-hierarchical clustering
       • K-means

  5. Hierarchical clustering
     • Two phases
       • Pick a distance method
         • Euclidean
         • Pearson / Spearman correlation
       • Pick the dendrogram drawing (linkage) method
         • Single linkage
         • Average linkage
         • Complete linkage

  6. Distances
     • Euclidean
       • Overall difference in values between gene or chip expression profiles
       • Similar values are clustered together
     • Correlation
       • Difference in trends
       • Similar trends are clustered together
       • Typically: Pearson or Spearman correlation
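
The contrast between the two distance types can be sketched with a few lines of Python. This is a minimal illustration, not the presenter's code: the three profiles are made-up values, and using 1 minus the Pearson correlation as the distance is an assumption (a common convention, but not stated on the slide).

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression profiles over four chips.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]  # same trend as gene_a, shifted up by 10
gene_c = [4.0, 3.0, 2.0, 1.0]      # values close to gene_a, opposite trend

print("a vs b:", euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print("a vs c:", euclidean(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

With Euclidean distance, gene_a is closer to gene_c (similar values); with the correlation-based distance, gene_a is closer to gene_b (similar trend).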

  7. Dendrogram drawing: single, average, and complete linkage

  8. UPGMA example
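
The figure for this slide is not part of the transcript. As a stand-in, here is a small sketch of agglomerative clustering with the three linkage rules named above. The function name `agglomerate` and the distance matrix are hypothetical, chosen only for illustration. Average linkage (UPGMA) is implemented as the mean of all pairwise distances between the members of two clusters.

```python
def agglomerate(dist, linkage="average"):
    """Merge clusters bottom-up; return a list of (cluster_a, cluster_b, height)."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def between(c1, c2):
        pairs = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":
            return min(pairs)            # closest pair of members
        if linkage == "complete":
            return max(pairs)            # farthest pair of members
        return sum(pairs) / len(pairs)   # average linkage (UPGMA)

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        height, a, b = min(
            (between(c1, c2), c1, c2)
            for idx, c1 in enumerate(clusters)
            for c2 in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges

# Hypothetical symmetric distance matrix for four objects (0..3).
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

for method in ("single", "average", "complete"):
    print(method, agglomerate(dist, method))
```

On this matrix all three rules join objects 0 and 1 first and 2 and 3 second; they differ in the height at which the two pairs are finally joined, which is what changes the shape of the dendrogram.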

  9. Hierarchical Clustering Silicon Genetics, 2003

  10. Heatmap

  11. K-means clustering
     • Partitioning method
     • The dataset is divided into K clusters
     • The user needs to decide on K before the run
     • K-means is a heuristic algorithm, so different runs can give different results
     • Make several runs, and select the one giving the minimum sum of within-cluster variances
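
The advice to make several runs and keep the best one can be sketched as follows. This is a minimal sketch assuming the standard Lloyd iteration with random starting centroids; the helper names, the sample points, and the number of runs are illustrative assumptions. The selection criterion used here is the within-cluster sum of squared distances.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run starting from k randomly chosen points."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = [mean_point(g) if g else centroids[c]
                   for c, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    wss = sum(sq_dist(p, centroids[c])
              for c, g in enumerate(groups) for p in g)
    return wss, centroids, groups

def best_of_runs(points, k, runs=10, seed=0):
    """Repeat K-means and keep the run with the smallest within-cluster sum of squares."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)),
               key=lambda run: run[0])

# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]

best_wss, best_centroids, best_groups = best_of_runs(points, k=2)
print(best_wss)
print(best_groups)
```

The fixed seed only makes the sketch reproducible; in practice each run would start from a different random initialization.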

  12. K-means Clustering Silicon Genetics, 2003

  13. K-means Clustering Silicon Genetics, 2003

  14. K-means Clustering Silicon Genetics, 2003

  15. K-means Clustering Silicon Genetics, 2003

  16. Visualization

  17. Gene selection
     • Genes are usually filtered before clustering.
     • This decreases calculation time.
     • Typically a few hundred genes with the highest variance (or standard deviation) are selected.
     • If you have, e.g., two types of cancer, do not use a t-test to select genes. The genes are then chosen precisely because they differ between the cancer types, so the clusters will always separate the cancer types.
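
A variance filter of the kind described above might look like this. The gene names and values are made up, and `top_variance_genes` is a hypothetical helper, not part of any package mentioned in the presentation. Note that the filter never looks at class labels, which is why it does not bias the clustering the way a t-test selection would.

```python
import statistics

def top_variance_genes(expression, n_keep):
    """Return the names of the n_keep genes with the highest variance across chips."""
    ranked = sorted(
        expression,
        key=lambda gene: statistics.variance(expression[gene]),
        reverse=True,
    )
    return ranked[:n_keep]

# Hypothetical expression values: one list per gene, one value per chip.
expression = {
    "gene_flat": [5.0, 5.1, 4.9, 5.0],
    "gene_up": [1.0, 3.0, 6.0, 9.0],
    "gene_noisy": [2.0, 4.0, 2.5, 4.5],
    "gene_down": [9.0, 7.0, 3.0, 1.0],
}

selected = top_variance_genes(expression, n_keep=2)
print(selected)
```

With real data, `n_keep` would be a few hundred, as suggested on the slide.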
