Clustering Jarno Tuimala
Clustering • Aim • Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related to each other than to objects in other clusters • Exploratory data analysis • View all data simultaneously • Identify clusters and patterns in data • Uses: • Time series analysis • Visualization of known classes
Clustering methods • Hierarchical clustering • single, average (UPGMA) and complete linkage • Non-hierarchical clustering • K-means
Hierarchical clustering • Two phases • Pick a distance method • Euclidean • Pearson / Spearman correlation • Pick the dendrogram drawing method • Single linkage • Average linkage • Complete linkage
Distances • Euclidean • Overall difference in expression level between gene or chip expression profiles • Profiles with similar values are clustered together • Correlation • Difference in trends • Profiles with similar trends are clustered together • Typically: Pearson or Spearman correlation
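The difference between the two distance families can be sketched with a small example. The profiles below are made-up values, not data from the slides, and the correlation distance is taken as 1 minus the Pearson correlation, which is one common convention rather than the only one.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression profiles over four time points.
gene_low = [1.0, 2.0, 3.0, 4.0]
gene_high = [11.0, 12.0, 13.0, 14.0]  # same trend as gene_low, shifted up by 10
gene_flat = [2.0, 3.0, 2.0, 3.0]      # similar level to gene_low, different trend

d_euc_shift = euclidean_distance(gene_low, gene_high)
d_cor_shift = pearson_distance(gene_low, gene_high)
d_euc_flat = euclidean_distance(gene_low, gene_flat)
d_cor_flat = pearson_distance(gene_low, gene_flat)

print("Euclidean:  ", d_euc_shift, d_euc_flat)
print("Correlation:", d_cor_shift, d_cor_flat)
```

With Euclidean distance, gene_low is closer to gene_flat (similar values); with correlation distance, it is closer to gene_high (similar trend).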
Dendrogram drawing Single, average, and complete linkage
Hierarchical Clustering Silicon Genetics, 2003
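The two phases (distance, then linkage) can be illustrated with a minimal agglomerative sketch in plain Python. It is not the implementation behind the figures; the function names and the toy one-dimensional profiles are assumptions chosen for illustration, and a real analysis would normally use a library routine.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters, each given as a list of item indices."""
    pairwise = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all pairs (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(profiles, linkage="average", metric=euclidean):
    """Return the merge history as (members_a, members_b, height) tuples."""
    n = len(profiles)
    # Phase 1: pairwise distances between all objects.
    dist = [[metric(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]
    # Phase 2: repeatedly merge the two closest clusters.
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical one-dimensional profiles, for illustration only.
profiles = [[0.0], [1.0], [2.5], [6.0]]
heights = {
    name: [round(h, 3) for _, _, h in agglomerate(profiles, linkage=name)]
    for name in ("single", "average", "complete")
}
print(heights)
```

The merge order is the same for all three linkages on this toy input, but the merge heights differ, which is what changes the shape of the dendrogram.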
K-means clustering • Partitioning method • The dataset is divided into K clusters • User needs to decide on K before the run • K-means is a heuristic algorithm, so different runs can give dissimilar results • Make several runs, and select the one giving the minimum sum of within-cluster variance
K-means Clustering Silicon Genetics, 2003
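The advice to make several runs and keep the best one can be sketched as follows. This is a minimal version of the standard Lloyd iteration in plain Python, not the software used for the figures; the names `kmeans_once`, `kmeans`, `n_runs`, and `seed` are illustrative, and the selection criterion is the within-cluster sum of squared distances, used here as a stand-in for the within-cluster variance.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run starting from k randomly chosen points as centroids."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments did not change
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_profile(members)
    wss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss

def kmeans(points, k, n_runs=10, seed=0):
    """Repeat K-means n_runs times; keep the run with the smallest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])

# Hypothetical two-dimensional profiles forming two well-separated groups.
points = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids, wss = kmeans(points, k=2)
print(labels, round(wss, 4))
```

Fixing the seed makes the sketch reproducible; in practice the number of runs is a trade-off between computing time and the risk of keeping a poor local solution.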
Gene selection • Genes are usually filtered before clustering. • This decreases calculation time. • Typically a few hundred genes with the highest variance (or standard deviation) are selected. • If you have, e.g., two types of cancer, do not use a t-test for selecting genes. Because the genes were chosen for differing between the cancer types, you will always get a result where the clusters separate the cancer types.
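A variance filter of this kind might look like the sketch below. The gene names, expression values, and the helper `top_variance_genes` are hypothetical; the point is only that the filter ranks genes without looking at any class labels, unlike a t-test between known groups.

```python
import statistics

def top_variance_genes(expression, n_keep):
    """Return the names of the n_keep genes with the highest variance across chips."""
    variances = {
        gene: statistics.variance(values)  # sample variance across chips
        for gene, values in expression.items()
    }
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_keep]

# Hypothetical expression matrix: gene name -> values on four chips.
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.0],
    "geneB": [1.0, 8.0, 2.0, 9.0],
    "geneC": [3.0, 3.5, 2.5, 3.0],
    "geneD": [0.5, 6.0, 0.4, 5.5],
}
selected = top_variance_genes(expression, n_keep=2)
print(selected)
```

For a real dataset, `n_keep` would be a few hundred rather than two, as noted above.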