240 likes | 374 Views
Develop mathematical, statistical, and computational methods to analyse biologically or technologically novel experiments in order to understand disease-relevant regulatory and genetic interaction networks. What we do.
E N D
Develop mathematical, statistical, and computational methods to analyse biologically or technologically novel experiments in order to understand disease-relevant regulatory and genetic interaction networks
What we do High-density marrays for RNA transcription and protein-DNA binding: Regulatory networks in heart development Jörn Tödling (with Silke Sperling, MPI Molecular Genetics) Fundamentals of transcription and genetics in yeast Matt Ritchie (with Lars Steinmetz, EMBL) High-throughput RNAi assays, high-content automated microscopy, genetic interaction networks Oleg Sklyar, Ligia Bras, Thomas Horn (with Michael Boutros, DKFZ, Robert Gentleman, FHCRC Seattle, Amy Kiger, UCSD) Bioconductor
log-ratio which genes are differentially transcribed? same-same tumor-normal
Statistics 101: biasaccuracy precision variance
Basic dogma of data analysis: Can always increase sensitivity on the cost of specificity, or vice versa, the art is to find the best trade-off. X X X X X X X X X
3000 3000 x3 ? 1500 200 1000 0 ? x1.5 A A B B C C But what if the gene is “off” (below detection limit) in one condition? ratios and fold changes Fold changes are useful to describe continuous changes in expression
ratios and fold changes The idea of the log-ratio (base 2) 0: no change +1: up by factor of 21 = 2 +2: up by factor of 22 = 4 -1: down by factor of 2-1 = 1/2 -2: down by factor of 2-2 = ¼ A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000. What about a change from 0 to 500? - conceptually - noise, measurement precision
ratio compression Yue et al., (Incyte Genomics) NAR (2001) 29 e41
Systematic Stochastic o similar effect on many measurements o corrections can be estimated from data o too random to be ex-plicitely accounted for o remain as “noise” Calibration Error model Sources of variation amount of RNA in the biopsy efficiencies of -RNA extraction -reverse transcription -labeling -fluorescent detection probe purity and length distribution spotting efficiency, spot size cross-/unspecific hybridization stray signal
bi per-sample normalization factor bk sequence-wise probe efficiency hik ~ N(0,s22) “multiplicative noise” ai per-sample offset eik ~ N(0, bi2s12) “additive noise” modeling ansatz measured intensity = offset + gain true abundance
“multiplicative” noise “additive” noise The two-component model raw scale log scale B. Durbin, D. Rocke, JCB 2001
variance stabilizing transformations Xu a family of random variables with EXu=u, VarXu=v(u). Define var f(Xu ) independent of u derivation: linear approximation
the “glog” transformation - - - f(x) = log(x) ———hs(x) = asinh(x/s) P. Munson, 2001 D. Rocke & B. Durbin, ISMB 2002
generalized log-ratio difference log-ratio variance: constant part proportional part glog raw scale log glog
“usual” log-ratio 'glog' (generalized log-ratio) c1, c2are experiment specific parameters (~level of background noise)
Variance Bias Trade-Off Estimated log-fold-change log glog Signal intensity
Variance-bias trade-off and shrinkage estimators Shrinkage estimators: pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced. Particularly useful if you have few replicates. Generalized log-ratio: = a shrinkage estimator for fold change There are many possible choices, one is stabilization of variance: + interpretabality even in cases where gene is off in some conditions + can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) with less worries about uneven variances
evaluation: a benchmark for Affymetrix genechip expression measures o Data: Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex background Dilution series: from GeneLogic 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts o Benchmark: 15 quality measures regarding -reproducibility -sensitivity -specificity Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu
good affycomp results (28 Sep 2003) bad
Availability oimplementation in R oopen source package vsn on www.bioconductor.org othe next step: sequence-specific norma-lization (till here, just array-specific) oideas of shrinkage and variance stabilization are now used in mainstream preprocessing programs for Affymetrix data (PLIER, GCRMA)
Acknowledgements MPI Molekulare Genetik Anja von Heydebreck Martin Vingron Uni Heidelberg Günther Sawitzki DFCI Harvard Robert Gentleman UMC Leiden Judith Boer RZPD Anke Schroth Bernd Korn ...and many more! DKFZ Heidelberg Molecular Genome Analysis Annemarie Poustka Holger Sültmann Andreas Buneß Markus Ruschhaupt Katharina Finis Jörg Schneider Klaus Steiner Stefan Wiemann Dorit Arlt