
Shrinkage-based similarity metric for cluster analysis of microarray data



  1. Shrinkage-based similarity metric for cluster analysis of microarray data
  V. Cherepinski, J. Feng, M. Rejali, B. Mishra
  Courant Institute of Mathematical Sciences, New York; Cold Spring Harbor Laboratory
  PNAS 2003

  2. Outline
  • Reminder
    • Normal mean estimation
    • James-Stein estimator & shrinkage
  • Cluster analysis of gene expression data
    • Classical approach
    • Robustness issues
    • Shrinkage-based similarity metric proposal
    • Results

  3. Normal mean estimation
  • Let $X \sim \mathcal{N}(\theta, \sigma^2 I_n)$ be a sample from a multivariate normal distribution with mean $\theta \in \mathbb{R}^n$ and known variance $\sigma^2$
  • Given a sample $X$, how can the mean $\theta$ be estimated?
  • Classical estimation theory
    • Maximum likelihood estimator: $\hat\theta_{ML} = X$
  • Nice properties
    • Unbiased: $E[\hat\theta_{ML}] = \theta$
    • Risk: $E\|\hat\theta_{ML} - \theta\|^2 = n\sigma^2$
    • Asymptotically best unbiased estimator
    • Scales easily with $m$ samples: $\hat\theta_{ML} = \bar X$ (sample mean), with risk $n\sigma^2 / m$
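The constant risk of the ML estimator can be checked by simulation; a minimal sketch (function and parameter names are illustrative, not from the paper):

```python
import random

def ml_risk(theta, sigma=1.0, trials=2000, seed=0):
    """Monte-Carlo estimate of the MLE risk E||X - theta||^2
    for a single observation X ~ N(theta, sigma^2 I_n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, sigma) for t in theta]
        total += sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
    return total / trials

# The risk is n * sigma^2 no matter where theta sits (here n = 10, sigma = 1)
print(ml_risk([0.0] * 10))
print(ml_risk([5.0] * 10))
```

Both calls print a value close to 10, illustrating that the MLE's risk does not depend on $\theta$.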

  4. James-Stein estimator
  • James & Stein introduced in 1961 the estimator
    $\hat\theta_{JS} = \left(1 - \frac{(n-2)\sigma^2}{\|X\|^2}\right) X$
    (the shrinkage factor shifts $\hat\theta_{JS}$ towards 0)
  • For $n \geq 3$, the JS estimator always beats the ML estimator: $E\|\hat\theta_{JS} - \theta\|^2 < n\sigma^2$
  • If $\|\theta\|$ is small, the JS risk is $\approx 2\sigma^2$ (much better than $n\sigma^2$!)
  • But
    • Biased: $E[\hat\theta_{JS}] \neq \theta$
    • Multivariate normal assumption with known noise
    • Risk savings decrease with a large number of samples
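The dominance claim is easy to see empirically. A sketch comparing the two risks by simulation (names are mine, not the paper's):

```python
import random

def js_estimate(x, sigma=1.0):
    """James-Stein estimator: shrink the observation toward 0
    by the factor 1 - (n - 2) * sigma^2 / ||x||^2."""
    n = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (n - 2) * sigma * sigma / norm2
    return [shrink * v for v in x]

def risk(estimator, theta, sigma=1.0, trials=3000, seed=1):
    """Monte-Carlo risk E||estimator(X) - theta||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, sigma) for t in theta]
        total += sum((e - t) ** 2 for e, t in zip(estimator(x), theta))
    return total / trials

theta = [0.1] * 10                     # true mean close to 0: JS wins big
r_ml = risk(lambda x: x, theta)        # about n = 10
r_js = risk(js_estimate, theta)        # close to 2 when ||theta|| is small
print(r_ml, r_js)
```

With $\theta$ far from 0 the gap narrows, but the JS risk never exceeds the ML risk.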

  5. James-Stein estimator
  • Justification
    • The norm of the MLE tends to be too large: $E\|X\|^2 = \|\theta\|^2 + n\sigma^2 > \|\theta\|^2$
    • The shrinkage factor shifts the JSE towards 0
  • The JS estimator is a surprising result
  • Not widely used in applications
  • Is there any useful application in bioinformatics?

  6. Cluster analysis of gene expression
  • Goal
    • Given the expression of $n$ genes over $m$ conditions, how can the genes be clustered?
  • Classical approach
    • Let $e_{ik}$ be the expression of gene $i$ in condition $k$
    • Coexpression of genes $i$ and $j$ measured by the Pearson correlation
      $s(i,j) = \frac{1}{m}\sum_{k=1}^{m} \frac{(e_{ik} - \bar e_i)(e_{jk} - \bar e_j)}{\sigma_i \, \sigma_j}$
      with $\bar e_i = \frac{1}{m}\sum_k e_{ik}$ and $\sigma_i^2 = \frac{1}{m}\sum_k (e_{ik} - \bar e_i)^2$
    • Hierarchical clustering: average-linkage (greedy grouping of most correlated genes, subject to a threshold)
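The classical similarity above is just the Pearson correlation of two expression profiles; a minimal sketch (illustrative names):

```python
import math

def pearson(e_i, e_j):
    """Pearson correlation between two expression profiles:
    each profile is centered by its own sample mean."""
    m = len(e_i)
    mi, mj = sum(e_i) / m, sum(e_j) / m
    num = sum((a - mi) * (b - mj) for a, b in zip(e_i, e_j))
    den = math.sqrt(sum((a - mi) ** 2 for a in e_i)
                    * sum((b - mj) ** 2 for b in e_j))
    return num / den

print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))  # 1.0
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))            # -1.0
```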

  7. Issues
  • Cancelling systematic shifts in experiments [Eisen, PNAS 1998]
  • Modification of the correlation measure:
    $s(i,j) = \frac{1}{m}\sum_{k=1}^{m} \frac{e_{ik}\, e_{jk}}{\tilde\sigma_i \, \tilde\sigma_j}$ with $\tilde\sigma_i^2 = \frac{1}{m}\sum_k e_{ik}^2$
  • Amounts to replacing the sample mean $\bar e_i$ by 0, i.e. treating the $e_{ik}$ as if their average were zero
  • Quite a drastic approach!

  8. Shrinkage-based similarity metric
  • Shrinkage-based similarity metric:
    $s_\lambda(i,j) = \frac{1}{m}\sum_{k=1}^{m} \frac{(e_{ik} - (1-\lambda)\bar e_i)(e_{jk} - (1-\lambda)\bar e_j)}{\sigma_i(\lambda)\,\sigma_j(\lambda)}$
    with $\sigma_i^2(\lambda) = \frac{1}{m}\sum_k (e_{ik} - (1-\lambda)\bar e_i)^2$
  • The shrinkage factor $\lambda$ shifts the mean estimate towards 0
  • Pearson: $\lambda = 0$
  • Eisen: $\lambda = 1$
  • What is the best $\lambda$ from a Bayesian point of view?
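The whole family can be sketched as one function parameterized by the shrinkage factor: $\lambda = 0$ recovers Pearson and $\lambda = 1$ recovers Eisen's uncentered correlation (a sketch under the interpolation above; function names are mine):

```python
import math

def shrinkage_corr(e_i, e_j, lam):
    """Correlation with each gene's mean shrunk toward 0.
    The offset is (1 - lam) * sample mean, so lam = 0 is Pearson
    and lam = 1 is Eisen's uncentered correlation."""
    oi = (1.0 - lam) * sum(e_i) / len(e_i)
    oj = (1.0 - lam) * sum(e_j) / len(e_j)
    num = sum((a - oi) * (b - oj) for a, b in zip(e_i, e_j))
    den = math.sqrt(sum((a - oi) ** 2 for a in e_i)
                    * sum((b - oj) ** 2 for b in e_j))
    return num / den

x, y = [0.5, 1.0, 1.5], [1.2, 2.0, 2.8]
print(shrinkage_corr(x, y, 0.0))  # Pearson
print(shrinkage_corr(x, y, 1.0))  # Eisen
print(shrinkage_corr(x, y, 0.5))  # halfway between the two offsets
```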

  9. Model
  • Denote by $e_{ik}$ the expression of gene $i$ at condition $k$
  • Assumptions
    • For each gene $i$, expression is normally distributed over conditions: $e_{ik} \sim \mathcal{N}(\mu_i, \sigma^2)$
    • The mean gene expression is also normally distributed: $\mu_i \sim \mathcal{N}(0, \tau^2)$
  • Estimation of $\mu_i$ using a Bayesian framework: posterior mean $\hat\mu_i = (1-\lambda)\bar e_i$
  • Correlation metrics between genes $i$ and $j$
    • Clairvoyant: uses the true mean $\mu_i$
    • Pearson: $\hat\mu_i = \bar e_i$ ($\lambda = 0$)
    • Eisen: $\hat\mu_i = 0$ ($\lambda = 1$)
    • Shrinkage: $\hat\mu_i = (1-\lambda)\bar e_i$ with $\lambda$ estimated from the data
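Under a normal-normal model like the one above, the posterior mean of $\mu_i$ is a shrunk sample mean, which pins down $\lambda$ in closed form. A sketch of that standard conjugacy result (the variable names and this exact parameterization are mine, not necessarily the paper's):

```python
def bayes_lambda(sigma2, tau2, m):
    """Shrinkage factor under e_ik ~ N(mu_i, sigma2), mu_i ~ N(0, tau2),
    with m conditions per gene: the posterior mean of mu_i is
    (1 - lam) * sample_mean, where lam = (sigma2/m) / (tau2 + sigma2/m)."""
    return (sigma2 / m) / (tau2 + sigma2 / m)

# Tight prior around 0 (small tau2): lam -> 1, Eisen-like behavior.
print(bayes_lambda(1.0, 0.01, 10))   # about 0.91
# Diffuse prior (large tau2): lam -> 0, Pearson-like behavior.
print(bayes_lambda(1.0, 100.0, 10))  # about 0.001
```

So Pearson and Eisen sit at the two prior extremes, and the data-driven $\lambda$ interpolates between them.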

  10. Results on synthetic data
  • Data: 2 genes, 100 samples
  • Results (figure in original slides)

  11. Biological data
  • Cell-cycle transcription factors in yeast
  • Data: cell-cycle microarray time-course data from Eisen [PNAS 1998]
  • Gold standard: ChIP data from Simon [Cell 2001]

  12. Clustering results
  • Framework
    • Using the Eisen, Pearson, and Shrinkage correlation measures
    • Hierarchical clustering with different thresholds
    • Clusters are scored by false positives / false negatives against the gold standard

  13. Conclusion
  • Modern robust statistical estimation
  • Clean framework
  • Improvements (if any) are minimal on this example
  • Lacks comparison with non-parametric robust correlation metrics: Spearman, Kendall
  • ChIP data may not be the best reference
