Clustering

Clustering Jarno Tuimala

Clustering
• Aim
  • Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related to each other than to objects in other clusters
• Exploratory data analysis
  • View all data simultaneously
  • Identify clusters and patterns in data
• Uses:
  • Time series analysis
  • Visualization of known classes

Unsupervised vs. Supervised

Clustering methods
• Hierarchical clustering
  • Single, average (UPGMA), and complete linkage
• Non-hierarchical clustering
  • K-means

Hierarchical clustering
• Two phases
  • Pick a distance method
    • Euclidean
    • Pearson / Spearman correlation
  • Pick the dendrogram drawing (linkage) method
    • Single linkage
    • Average linkage
    • Complete linkage

Distances
• Euclidean
  • Overall difference between gene or chip expression profiles
  • Profiles with similar values are clustered together
• Correlation
  • Difference in trends
  • Profiles with similar trends are clustered together
  • Typically: Pearson or Spearman correlation
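To make the contrast between the two distance families concrete, here is a minimal sketch using only the Python standard library. The function names and the expression values are made up for illustration; the correlation-based distance is taken as 1 minus the Pearson correlation, which is one common convention and an assumption here.

```python
import math

def euclidean(a, b):
    """Euclidean distance: sensitive to differences in absolute level."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: sensitive to differences in trend only."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression profiles across four chips (made-up numbers).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]  # same trend as gene_a, shifted up by 10
gene_c = [4.0, 3.0, 2.0, 1.0]      # same range as gene_a, opposite trend

for name, other in [("gene_b", gene_b), ("gene_c", gene_c)]:
    print(name, euclidean(gene_a, other), pearson_distance(gene_a, other))
```

With these values, gene_b is far from gene_a in Euclidean terms but at zero correlation distance, while gene_c is close in Euclidean terms but maximally distant by correlation.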

Dendrogram drawing: single, average, and complete linkage
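The three linkage rules differ only in how the distance between two clusters is derived from the pairwise distances of their members. The following is a naive sketch of agglomerative clustering in plain Python, written for readability rather than speed; the function names and the one-dimensional toy points are assumptions for illustration, and in practice a clustering library would normally be used.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# How the distance between two clusters is derived from the pairwise
# distances between their members.
LINKAGE = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs (UPGMA)
}

def agglomerate(points, linkage="average"):
    """Naive agglomerative clustering; returns the merges in order.

    Each merge is (members_1, members_2, distance), where the members
    are tuples of indices into `points`.
    """
    combine = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            ds = [euclidean(points[i], points[j]) for i in c1 for j in c2]
            d = combine(ds)
            if best is None or d < best[2]:
                best = (c1, c2, d)
        c1, c2, _ = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append(best)
    return merges

# Made-up one-dimensional data: four objects at positions 0, 1, 3 and 7.
points = [[0.0], [1.0], [3.0], [7.0]]
for name in LINKAGE:
    heights = [round(d, 3) for _, _, d in agglomerate(points, name)]
    print(name, heights)
```

The merge distances are the heights at which branches join in the dendrogram. On this toy data the merge order is the same for all three rules, but the heights differ, with average linkage falling between single and complete.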

UPGMA example

Hierarchical Clustering Silicon Genetics, 2003

Heatmap

K-means clustering
• Partitioning method
  • The dataset is divided into K clusters
  • The user needs to decide on K before the run
• K-means is a heuristic algorithm, so different runs can give dissimilar results
  • Make several runs, and select the one giving the minimum sum of within-cluster variances
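The advice to make several runs and keep the best one can be sketched as follows. This is a minimal plain-Python version of the standard (Lloyd's) K-means iteration with random starting centers, not necessarily the variant used in any particular tool. The function names, the toy data, and the use of the within-cluster sum of squared distances as the selection criterion are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start.

    Returns (labels, within_cluster_sum_of_squares).
    """
    centers = rng.sample(points, k)  # k distinct objects as initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = centroid(members)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss

def kmeans(points, k, n_runs=10, seed=0):
    """Repeat K-means n_runs times and keep the run with the lowest WSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[1])

# Made-up data: two well-separated groups of three objects each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, wss = kmeans(data, k=2)
print(labels, wss)
```

Cluster numbering is arbitrary and can differ between runs, so results should be compared by which objects share a cluster rather than by label value.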

K-means Clustering Silicon Genetics, 2003

Visualization

Gene selection
• Genes are usually filtered before clustering.
  • This decreases calculation time.
  • Typically a few hundred genes with the highest variance (or standard deviation) are selected.
• If you have, e.g., two types of cancer, do not use a t-test for selecting genes. You will always get a result where the cancer types separate into different clusters, because the genes were chosen to differ between them.
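A variance filter of this kind can be sketched in a few lines of plain Python. The function name, gene names, and expression values below are hypothetical. The point of the sketch is that the filter never looks at sample labels, so it cannot build the known classes into the clustering result.

```python
from statistics import variance

def top_variance_genes(expression, n_genes):
    """Return the names of the n_genes genes with the highest sample variance.

    `expression` maps a gene name to its expression values across chips.
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]),
                    reverse=True)
    return ranked[:n_genes]

# Made-up expression values for three genes across four chips.
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.0],  # nearly constant
    "geneB": [1.0, 8.0, 2.0, 9.0],  # highly variable
    "geneC": [3.0, 4.0, 3.0, 4.0],  # moderately variable
}
selected = top_variance_genes(expression, 2)
print(selected)
```

In a real analysis `n_genes` would be in the range of a few hundred, as noted above, rather than 2.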
