Soft clustering of gene expression data

Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany Soft clustering of gene expression data

Clustering methodology • Hierachical clustering • can be divisive or agglomerative producing nested clusters. • Results are usually visualised by tree structures dendrogram. • Clustering depends on the linkage procedure used: single, complete, average,... • Partitional clustering • divides data into a (pre-)chosen number of classes. • Examples: k-means, SOMs, fuzzy c-means, simulated annealing, model-based clustering, HMMs,... • Setting the number of clusters is problematic • Cluster validity: • Most cluster algorithms always detect clusters, even in random data. • Cluster validation approaches address the number of existing clusters. • Approaches are based on objective functions, figures of merits, resampling, adding noise ....

Hard clustering: Based on classical set theory Assigns a gene to exactly one cluster No differentiation how well gene is represented by cluster centroid Examples: hierachical clustering, k-means, SOMs, ... Soft clustering: Can assign a gene to several cluster Differentiate grade of representation (cluster membership) Example: Fuzzy c-means, HMMs, ... Hard clustering vs. soft clustering

Example data set: Yeast cell cylce data by Cho et al. Hard clustering is sensitive to noise Standard deviation of expression Standard procedure is pre-filtering of genes based on variation due to noise sensitvity of hard clustering. However, no obvious threshold exists! (Heyer et al.: ca. 4000 genes, Tavazoe et al.: 3000 genes, Tamayo et al.: 823 genes) => Risk of essential losing information => Need of noise robust clustering method

Soft clustering is more noise robust Hard clustering always detects clusters, even in random data Soft clustering differentiates cluster strength and, thus, can avoid detection of 'random' clusters Genes with high membership values cluster together inspite of added noise

Differentiation in cluster membership allows profiling of cluster cores • A gene can be assigned to several clusters • Each gene is assigned to a cluster with a membership value between 0 and 1 • The membership values of a gene add up to one • Genes with lower membership values are not well represented by the cluster centroid • Expression of genes with high membership values are close to cluster centroid => Clusters have internal structures Membership value > 0.5 Hard clustering Membership value > 0.7

Varitation in cluster parameter reveals cluster stability m=1.3 m=1.1 Variation of fuzzification parameter m determines 'hardness' of clustering: m → 1: Fuzzy c-means clustering becomes equivalent to k-means m → ∞: All genes are equivally assigned to all clusters. Strong clusters maintain their core for increasing m By variation of m clusters can be distinguished by their stability. Weak cluster lose their core

Periodic and aperiodic clusters Periodic clusters of yeast cell cycle: Aperiodic clusters: => Aperiodic clusters were generally weaker than periodic clusters

Global clustering structure Non-linear 2D-projection by Sammon's Mapping c-means clustering allows definition of overlap of clusters i.e. how many genes are shared by two clusters. This enables to define a similarity measure between clusters. Global clustering structures can be visualised by graphs i.e. edges representing overlap. Increasing number of clusters => Sub-clustering reveals sub-structures M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data (JBCB, Vol. 3, No. 4, 965-988, 2005)

Soft clustering of gene expression data

Soft clustering of gene expression data

Presentation Transcript

Clustering analysis of microarray gene expression data

Lecture 9: Gene expression analysis/Clustering

Basic Gene Expression Data Analysis--Clustering

Evaluation and optimization of clustering in gene expression data analysis

Analysis of Gene Expression Data

Clustering Gene Expression Data

Clustering by soft-constraint affinity propagation: applications to gene-expression data

Discrimination and clustering with microarray gene expression data

Context-Specific Bayesian Clustering for Gene Expression Data

Applications of Visualization and Data Clustering to 3D Gene Expression Data

Principal Component Analysis (PCA) for Clustering Gene Expression Data

Clustering short time series gene expression data

Probabilistic Techniques for the Clustering of Gene Expression Data

Principal Component Analysis (PCA) for Clustering Gene Expression Data

Clustering Gene Expression Data

Gene Expression Data

Clustering Short Gene Expression Profiles

Clustering analysis of microarray gene expression data

Clustering Gene Expression Data

Clustering Gene Expression Data