This study compares different clustering algorithms for gene expression data and introduces methods for generating robust and consensus clusters. The algorithms are tested on two datasets, and the approach offers specific advantages for array-based gene expression analysis.
Generating Robust and Consensus Clusters from Gene Expression Data • Allan Tucker (a), Stephen Swift (a), Xiaohui Liu (a), Nigel Martin (b), Christine Orengo (c), Paul Kellam (c)
Introduction • Many different clustering algorithms used for gene expression analysis • Little work on inter-method consistency or cross-comparison • Important due to differing results (each algorithm implicitly forces a structure on data) • Obtaining a consensus across methods should improve confidence
The Talk • Compare a number of existing methods for clustering gene expression data • Algorithms for generating robust clusters and consensus clusters • Tested on a set of Amersham Scorecard data with known structure and experimentally obtained virus B-Cell data • Provides specific advantages in the analysis of array based gene expression data
Clustering Methods • Hierarchical Clustering (R) • PAM (R) • CAST (C++) • Simulated Annealing (C++)
Datasets • Amersham Scorecard • 597 genes, 24 blocks with 32 columns and 12 rows under 30 experimental conditions • Repeated experiments which we assume should cluster together • B Cell Data • 1987 genes
Robust Clustering • Takes agreement matrix as input • Place all genes into robust clusters that have full agreement • Deterministic algorithm • Should give higher degree of confidence in clusters • Not all genes will be assigned
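The full-agreement rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the agreement matrix holds, for each pair of genes, the number of clustering methods that placed the pair in the same cluster, and that "full agreement" means this count equals the number of methods. The function name `robust_clusters`, the parameter `n_methods`, and the toy matrix are hypothetical.

```python
def robust_clusters(agreement, n_methods):
    """Group items whose pairwise agreement equals n_methods (full agreement).

    Assumption: agreement[i][j] counts the methods that put items i and j
    in the same cluster. Items with no full-agreement partner stay unassigned.
    """
    n = len(agreement)
    assigned = [False] * n
    clusters = []
    for i in range(n):
        if assigned[i]:
            continue
        # Items that every method placed together with item i.
        members = [i] + [
            j for j in range(i + 1, n)
            if not assigned[j] and agreement[i][j] == n_methods
        ]
        if len(members) >= 2:  # singletons are not reported as clusters
            for j in members:
                assigned[j] = True
            clusters.append(members)
    return clusters


# Toy agreement matrix for 6 items and 3 methods (illustrative values only).
toy_agreement = [
    [3, 3, 1, 0, 0, 0],
    [3, 3, 1, 0, 0, 0],
    [1, 1, 3, 2, 0, 0],
    [0, 0, 2, 3, 0, 0],
    [0, 0, 0, 0, 3, 3],
    [0, 0, 0, 0, 3, 3],
]
toy_robust = robust_clusters(toy_agreement, n_methods=3)
print(toy_robust)
```

Because "clustered together by every method" is transitive, a single pass in index order is enough, and the result does not depend on any random choice, which matches the deterministic behaviour described above. Items 2 and 3 in the toy matrix agree in only two of three methods, so they remain unassigned.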
Robust Clustering
Dataset                       ASC     B-cell
No. of robust clusters        24      154
% of variables assigned       79%     25%
Max. robust cluster size      44      14
Min. robust cluster size      2       2
Mean robust cluster size      10.2    3.2
Consensus Clustering • “Full agreement” requirement for robust clusters can be too restrictive • Algorithm for generating consensus clusters given minimum agreement parameter • Approximate stochastic algorithm
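One way to picture an approximate stochastic search driven by a minimum agreement parameter is sketched below. The slides do not give the authors' algorithm or scoring function, so everything here is an assumption: the score (sum of agreement minus the threshold over pairs sharing a cluster), the use of simple stochastic hill climbing, and the names `score`, `consensus_clusters`, `min_agreement`, `n_iter`, and `seed` are all hypothetical.

```python
import random


def score(labels, agreement, min_agreement):
    """Sum of (agreement - min_agreement) over pairs sharing a cluster.

    Assumed scoring rule: pairs above the threshold are rewarded,
    pairs below it are penalised.
    """
    total = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                total += agreement[i][j] - min_agreement
    return total


def consensus_clusters(agreement, min_agreement, n_iter=2000, seed=0):
    """Stochastic hill climbing over partitions (a stand-in technique)."""
    rng = random.Random(seed)
    n = len(agreement)
    labels = list(range(n))  # start with every item alone, so the score is 0
    best = score(labels, agreement, min_agreement)
    for _ in range(n_iter):
        i = rng.randrange(n)
        old = labels[i]
        labels[i] = rng.randrange(n)  # move item i to a random cluster label
        new = score(labels, agreement, min_agreement)
        if new >= best:
            best = new  # keep moves that do not lower the score
        else:
            labels[i] = old  # undo moves that lower the score
    groups = {}
    for item, label in enumerate(labels):
        groups.setdefault(label, []).append(item)
    return sorted(groups.values()), best


# Toy agreement matrix for 6 items and 3 methods (illustrative values only).
toy_matrix = [
    [3, 3, 1, 0, 0, 0],
    [3, 3, 1, 0, 0, 0],
    [1, 1, 3, 2, 0, 0],
    [0, 0, 2, 3, 0, 0],
    [0, 0, 0, 0, 3, 3],
    [0, 0, 0, 0, 3, 3],
]
toy_consensus, toy_score = consensus_clusters(toy_matrix, min_agreement=1.5)
print(toy_consensus, toy_score)
```

Unlike the full-agreement rule, every item ends up in some cluster here, and lowering `min_agreement` lets pairs with only partial agreement be grouped. Being a stochastic search, the sketch may return different partitions for different seeds and is not guaranteed to find the best-scoring one.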
Consensus Clustering • [Figure: input cluster results -> agreement matrix -> consensus clusters]
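The first step of this pipeline, turning several clustering results into an agreement matrix, might look like the sketch below. It assumes each input result is a list of cluster labels, one per gene, in the same gene order, and that the matrix entry for a pair counts the methods that placed the pair in the same cluster. The function name `agreement_matrix` and the toy labelings are hypothetical.

```python
def agreement_matrix(labelings):
    """Count, for each pair of items, how many labelings co-cluster them.

    Assumption: labelings is a list of equal-length label lists, one list
    per clustering method, with items in the same order in every list.
    """
    n = len(labelings[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all labelings must cover the same items")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1
    return matrix


# Three hypothetical clustering results for the same 6 items.
toy_labelings = [
    [0, 0, 0, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],
]
toy_counts = agreement_matrix(toy_labelings)
for row in toy_counts:
    print(row)
```

Label values only need to be consistent within one method, since the matrix records co-membership rather than the labels themselves. This is what lets results from different algorithms be combined without matching cluster names across methods.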
Consensus Clustering • [Figure: consensus clustering results; panels: B-Cell dataset, ASC dataset]
Summary • Clustering biological data is very useful • Biases in clustering algorithms mean that success in identifying patterns can vary • Consensus algorithms are used in protein secondary structure prediction • We apply a similar strategy with robust and consensus clustering
Conclusions • Robust clusters good for identifying common transcriptional modules • Also for identifying genes with a common functional pathway • Useful for creating clusters of genes with high confidence • Can be restrictive in discarding genes that do not have full agreement
Conclusions • Consensus clustering relaxes full agreement requirement • Resembles defined clusters in synthetic data very well • Reliably picks out features in the virus gene expression data • Fulfils desire not to rely on one clustering algorithm during gene expression analysis
Acknowledgements • The Biotechnology and Biological Sciences Research Council (BBSRC), UK • The Engineering and Physical Sciences Research Council (EPSRC), UK