Model-based clustering of gene expression data Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2, and Walter L. Ruzzo1 1Department of Computer Science & Engineering, University of Washington 2Department of Statistics, University of Washington 3Insightful Corporation CAIMS 2001 • UW CSE Computational Biology Group
Model-based clustering of gene expression data Walter L. Ruzzo University of Washington Computer Science and Engineering CAIMS 2001 • UW CSE Computational Biology Group
Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions
A gene expression data set • An n genes × p experiments matrix; entry Xij is the expression level of gene i in experiment j • Each chip is one experiment: • time course • drug response • tissue samples (normal/cancer) • … • Snapshot of the activities of all genes in the cell
Analysis of expression data? • Hard problem • High dimensionality • Noise & artifacts • Goals loosely defined, etc. • Common exploratory analysis technique: Clustering
Why cluster? • Find biologically related genes • Tissue classification How to cluster? • Many algorithms • Most heuristic-based • No clear winner
Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions
Model-based clustering • Mixture model: • Each mixture component is generated by a specified underlying probability distribution • Gaussian mixture model: • Each component k is modeled by a multivariate normal distribution with parameters: • Mean vector: μk • Covariance matrix: Σk • EM algorithm: used to estimate the parameters
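As a concrete illustration (not the MBC software used in the talk), the Gaussian-mixture-plus-EM setup above can be sketched with scikit-learn's GaussianMixture, which estimates the mean vectors μk and covariance matrices Σk by EM internally. The data here are toy values, not the ovary or yeast data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy "expression matrix": 200 genes x 5 experiments drawn from
# two synthetic components (illustrative values only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),   # component 1
    rng.normal(4.0, 1.0, size=(100, 5)),   # component 2
])

# Fit a 2-component Gaussian mixture; EM runs inside fit().
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)         # hard cluster assignments
print(np.bincount(labels))      # cluster sizes
print(gmm.means_.round(1))      # one mean vector per component
```

Each gene is assigned to the component with the highest posterior probability; `predict_proba` would give the soft memberships computed in the E-step.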
Covariance models (Banfield & Raftery 1993) • Eigenvalue decomposition: Σk = λkDkAkDkᵀ (λk: volume, Dk: orientation, Ak: shape) • Equal-volume spherical model (EI): Σk = λI (~ k-means) • Unequal-volume spherical model (VI): Σk = λkI • Diagonal model: Σk = λkBk, where Bk is diagonal with |Bk| = 1 • Elliptical model (EEE): Σk = λDADᵀ • Unconstrained model (VVV): Σk = λkDkAkDkᵀ • More flexible and more descriptive models, but with more parameters
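scikit-learn's `covariance_type` options roughly parallel this family, though the mapping is approximate and not one-to-one (for instance, the equal-volume spherical EI model has no direct counterpart). A sketch comparing the constraints on toy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (150, 4)),
               rng.normal(3.0, 1.0, (150, 4))])

# Approximate correspondence (an assumption of this sketch):
#   'spherical' ~ VI   (one variance per component)
#   'diag'      ~ diagonal model
#   'tied'      ~ EEE  (one shared full covariance)
#   'full'      ~ VVV  (unconstrained)
# Note: sklearn's bic() is -2 log L + (#params) log n, so LOWER is
# better -- the opposite sign convention from these slides.
bics = {}
for cov in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov,
                          random_state=0).fit(X)
    bics[cov] = gmm.bic(X)
    print(f"{cov:9s}  BIC = {bics[cov]:.0f}")
```

The less constrained models fit at least as well in likelihood, but the BIC penalty for their extra parameters can outweigh the gain.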
Advantages of model-based clustering • Higher-quality clusters • Flexible models • Model selection: choose the right model and the right number of clusters • Bayesian Information Criterion (BIC): • Approximates the Bayes factor: the posterior odds for one model against another • A large BIC score indicates strong evidence for the corresponding model
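The model-selection idea can be sketched as a loop over candidate numbers of clusters, keeping the BIC-best fit (again with scikit-learn as a stand-in; its bic() is lower-is-better, the reverse of the slides' convention).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Three well-separated synthetic components (toy data).
X = np.vstack([rng.normal(m, 0.5, (80, 3)) for m in (0.0, 4.0, 8.0)])

# Fit mixtures with 1..6 components and keep the BIC-best one.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)   # min: sklearn's BIC is lower-is-better
print(best_k)                      # expect 3 for this toy data
```

With real expression data the BIC curve is rarely this clean; as the later result slides note, a bend in the curve can be as informative as the global optimum.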
Overview • Overview of gene expression data • Model-based clustering • Data sets • Gene expression data sets • Synthetic data sets • Results • Summary and Conclusions
Our Goal • To show that the model-based approach has superior performance with respect to: • Quality of clusters • Number of clusters and model chosen (BIC)
Our Approach • Validate the results of model-based clustering using data sets with external criteria (BIC scores do not require external criteria) • To compare clusters against an external criterion: • Adjusted Rand index (Hubert and Arabie 1985) • Adjusted Rand index = 1 for perfect agreement • Two random partitions have an expected index of 0 • Compare the quality of clusters against a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
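The two properties of the adjusted Rand index stated above can be checked directly with scikit-learn's implementation (a convenience here; the original study predates this library):

```python
from sklearn.metrics import adjusted_rand_score

true_classes = [0, 0, 0, 1, 1, 1]

# Identical partition (label names may differ): index = 1.
print(adjusted_rand_score(true_classes, [1, 1, 1, 0, 0, 0]))  # 1.0

# An unrelated partition scores near 0 (it can be slightly negative).
print(adjusted_rand_score(true_classes, [0, 1, 0, 1, 0, 1]))
```

Because the index is chance-corrected, it is comparable across different numbers of clusters, which is what makes it usable for the model comparisons in the result slides.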
Gene expression data sets • Ovarian cancer data set (Michèl Schummer, Institute of Systems Biology) • Subset of data: 235 clones × 24 experiments (cancer/normal tissue samples) • The 235 clones correspond to 4 genes • Yeast cell cycle data (Cho et al. 1998) • 17 time points • Subset of 384 genes associated with 5 phases of the cell cycle
Synthetic data sets (both based on the ovary data) • Gaussian mixture • Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data • Randomly resampled ovary data • For each class, randomly resample the expression levels in each experiment, independently • Gives a near-diagonal covariance matrix
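The two generation schemes can be sketched in numpy for one hypothetical class (the array `cls` below is made-up stand-in data, not the real ovary measurements):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for one class of the ovary data: 40 clones x 6 experiments
# (hypothetical values; the real data are not reproduced here).
cls = rng.normal(0.0, 1.0, (40, 6)) + rng.normal(0.0, 0.3, 6)

# (1) Gaussian-mixture data: sample from a multivariate normal with
# the class's sample mean vector and sample covariance matrix.
mean = cls.mean(axis=0)
cov = np.cov(cls, rowvar=False)
gauss_synth = rng.multivariate_normal(mean, cov, size=40)

# (2) Randomly resampled data: within the class, resample each
# experiment (column) independently, destroying cross-experiment
# correlation -> near-diagonal covariance.
resampled = np.column_stack(
    [rng.choice(cls[:, j], size=40, replace=True) for j in range(6)])

print(gauss_synth.shape, resampled.shape)  # (40, 6) (40, 6)
```

Scheme (1) preserves each class's full covariance structure (favoring VVV/EEE), while scheme (2) deliberately matches the diagonal model's assumptions, which is why the diagonal model wins on the resampled data.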
Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Synthetic data sets • Expression data sets • Summary and Conclusions
Results: Gaussian mixture based on ovary data (2350 genes) • At 4 clusters, VVV, diagonal, and CAST achieve high adjusted Rand indices • BIC selects VVV at 4 clusters
Results: randomly resampled ovary data • The diagonal model achieves the maximum adjusted Rand index (higher than CAST's) and the maximum BIC score • BIC peaks at 4 clusters • Confirms the expected result
Results: square-root ovary data • Maximum adjusted Rand index: EEE at 4 clusters (> CAST) • BIC selects VI at 8 clusters • The BIC curve for EEE shows a bend at 4 clusters
Results: log yeast cell cycle data • CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE). • BIC scores of EEE much higher than the other models.
Results: standardized yeast cell cycle data • EI achieves slightly higher adjusted Rand indices than CAST at 5 clusters • BIC selects EEE at 5 clusters
Overview • Overview of gene expression data • Model-based clustering • Data sets • Results • Summary and Conclusions
Summary and Conclusions • Synthetic data sets: • With the correct model, model-based clustering performs better than a leading heuristic clustering algorithm • BIC selects the right model and the right number of clusters • Real expression data sets: • Adjusted Rand indices comparable to CAST's • BIC gives a hint to the number of clusters • Time course data: • Standardization makes the classes more spherical
Future Work • Custom refinements to the model-based implementation: • Design models that incorporate specific information about the experiments, e.g. a block-diagonal covariance matrix • Handling of missing data • Handling of outliers & noise
Acknowledgements • Ka Yee Yeung1, Chris Fraley2, Alejandro Murua3, Adrian E. Raftery2 1Department of Computer Science & Engineering, University of Washington 2Department of Statistics, University of Washington 3Insightful Corporation • Michèl Schummer (Institute of Systems Biology) • the ovary data set • Jeremy Tantrum • help with MBC software (diagonal model)
More Info http://www.cs.washington.edu/homes/ruzzo CAIMS 2001 • UW CSE Computational Biology Group
Model-based clustering • Mixture model: • Assume each component is generated by an underlying probability distribution • Likelihood for the mixture model: L = ∏i Σk τk fk(xi | θk), where τk are the mixing proportions and fk the component densities • Gaussian mixture model: • Each component k is modeled by a multivariate normal distribution with parameters: • Mean vector: μk • Covariance matrix: Σk
Eigenvalue decomposition of the covariance matrix • The covariance matrix Σk determines the geometric features of component k • Banfield and Raftery (1993): Σk = λkDkAkDkᵀ • Dk: orientation of component k (orthogonal matrix) • Ak: shape of component k (diagonal matrix) • λk: volume of component k (scalar) • Allowing some but not all of the parameters to vary across components yields many models within this framework
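The decomposition can be made concrete in two dimensions by building a covariance matrix from chosen volume, orientation, and shape factors (the numbers below are illustrative, not from the paper):

```python
import numpy as np

# Sigma = lambda * D A D^T  with d = 2.
lam = 2.0                                  # volume factor
theta = np.pi / 6                          # rotate axes by 30 degrees
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orientation (orthogonal)
A = np.diag([2.0, 0.5])                    # shape (diagonal, det(A) = 1)
Sigma = lam * D @ A @ D.T

# The eigendecomposition recovers the pieces: eigenvectors give D,
# eigenvalues give lam * diag(A).
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(np.sort(eigvals))        # lam * [0.5, 2.0] = [1.0, 4.0]
print(np.prod(eigvals))        # det(Sigma) = lam**2 * det(A) = 4.0
```

Constraining which of λ, D, A are shared across components is exactly how the EI/VI/EEE/VVV models on the earlier slide are obtained.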
EM algorithm • A general approach to maximum-likelihood estimation • Iterate between E and M steps: • E-step: compute the probability that each observation belongs to each cluster, using the current parameter estimates • M-step: estimate the model parameters using the current group-membership probabilities
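A minimal hand-rolled sketch of these two steps, for a 1-D Gaussian mixture (a toy illustration under simplifying assumptions, not the implementation used in the paper):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch)."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread-out initial means
    var = np.full(k, x.var())                      # initial variances
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = (pi / np.sqrt(2 * np.pi * var)
                * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
mu, var, pi = em_gmm_1d(x)
print(np.sort(mu))   # means converge near the true values 0 and 6
```

The multivariate case replaces the scalar variance update with the covariance update appropriate to the chosen model (spherical, diagonal, tied, or unconstrained).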
Definition of the BIC score • The integrated likelihood p(D | Mk) is hard to evaluate, where D is the data and Mk is the model • BIC approximates twice its logarithm: BIC ≡ 2 log p(D | θ̂k, Mk) − uk log n ≈ 2 log p(D | Mk) • uk: number of parameters to be estimated in model Mk • n: number of observations
Randomly resampled synthetic data set • [Figure: genes × experiments matrices for the ovary data and the resampled synthetic data, with rows grouped by class]
Results: mixture of normal distributions based on ovary data (235 genes) • VVV and CAST: comparable adjusted Rand indices at 4 clusters • With only 235 genes, no BIC scores are available for VVV • With the larger data set (2350 genes), BIC selects VVV at 4 clusters
Results: log ovary data • Maximum adjusted Rand index: VVV at 4 clusters, EEE at 4-6 clusters (higher than CAST) • BIC selects VI at 9 clusters • The BIC curve for the diagonal model shows a bend at 4 clusters
Results: standardized ovary data • Comparable adjusted Rand indices for CAST, EI and EEE at 4 clusters. • BIC selects EEE at 7 clusters.
Key advantage of the model-based approach: choosing the right model and the right number of clusters • Bayesian Information Criterion (BIC): • Approximates the Bayes factor: the posterior odds for one model against another • A large BIC score indicates strong evidence for the corresponding model