Applications of microarrays

o Measuring transcript abundance (result of production versus degradation)
o Mapping transcript structure (alternative splicing, TSSs, or degradation; UTRs)
o Genotyping
o Estimating DNA copy number (CGH)
o DNA-protein interactions
o …

Preprocessing, error models, quality assessment

abundance vs transcription rate In principle, they are independent. If only passive degradation:

Response curve. Lockhart et al., Nature Biotechnology 14 (1996)

compression. Yue et al. (Incyte Genomics), NAR 29 (2001) e41

log-ratio: Which genes are differentially transcribed?
(Figure panels: same-same, tumor-normal)

Statistics 101: biasaccuracy  precision variance

Basic dogma of data analysis: One can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the sweet spot. (It can also be possible to increase both by a better choice of method / model.)
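The trade-off can be made concrete with a small simulation. The sketch below is only an illustration: the data are simulated, and the names `sensitivity_specificity`, `is_changed` and `scores` are hypothetical. It sweeps a decision threshold and reports both quantities.

```python
import numpy as np

def sensitivity_specificity(scores, is_changed, threshold):
    """Call a gene 'changed' when its score exceeds the threshold."""
    called = scores > threshold
    sensitivity = np.sum(called & is_changed) / np.sum(is_changed)
    specificity = np.sum(~called & ~is_changed) / np.sum(~is_changed)
    return sensitivity, specificity

# Hypothetical data: 100 truly changed genes among 1000, with noisy scores.
rng = np.random.default_rng(0)
is_changed = np.arange(1000) < 100
scores = rng.normal(loc=np.where(is_changed, 2.0, 0.0), scale=1.0)

thresholds = (0.5, 1.0, 1.5, 2.0)
results = [sensitivity_specificity(scores, is_changed, t) for t in thresholds]
for t, (sens, spec) in zip(thresholds, results):
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Raising the threshold can only lower sensitivity and raise specificity for a fixed score; a better score (method / model) is what moves both at once.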

ratios and fold changes
Fold changes are useful to describe continuous changes in expression.
(Figure: expression levels of genes A, B and C in two conditions, with values 3000, 1500, 1000, 200 and 0, and fold changes x3, x1.5 and "?".)
But what if the gene is "off" (below detection limit) in one condition?

fold change estimation and background correction
• Many interesting genes will be off in some of the conditions of interest.
• Due to unspecific hybridization and optical noise, measured values are always > 0.
1. If you want the expression measure to be an unbiased estimator of abundance:
   • strong background correction, get many values ≤ 0
   • need something other than the (log-)ratio
2. If you let the expression measure be biased (always > 0):
   • weak background correction, then can keep ratios
   • how do you choose the bias?

Raw data are not mRNA concentrations.
The problem is less that these steps are 'not perfect'; it is that they may vary from gene to gene, array to array, experiment to experiment.

Sources of variation
o amount of RNA in the sample
o efficiencies of
  - RNA extraction
  - reverse transcription
  - labeling
  - fluorescent detection
o probe purity and length distribution
o cross-/unspecific hybridization
o stray signal
Systematic (→ Calibration)
o similar effect on many measurements
o corrections can be estimated from data
Stochastic (→ Error model)
o too random to be explicitly accounted for
o remain as "noise"

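modeling ansatz
measured intensity = offset + gain × true abundance
$y_{ik} = a_{ik} + b_{ik} x_{ik}$, with $a_{ik} = a_i + \varepsilon_{ik}$ and $b_{ik} = b_i b_k \exp(\eta_{ik})$
o $a_i$: per-sample offset
o $\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: "additive noise"
o $b_i$: per-sample normalization factor
o $b_k$: sequence-wise probe efficiency
o $\eta_{ik} \sim N(0, s_2^2)$: "multiplicative noise"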
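To get a feel for the ansatz, one can simulate from it. The following is a minimal sketch for a single sample with the probe efficiency set to 1; the function name `simulate_intensities` and all parameter values are assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

def simulate_intensities(x, a=50.0, b=1.0, s1=20.0, s2=0.2, rng=None):
    """Simulate measured intensity = offset + gain * true abundance.

    offset = a + eps, eps ~ N(0, (b*s1)^2)    ("additive noise")
    gain   = b * exp(eta), eta ~ N(0, s2^2)   ("multiplicative noise")
    Parameter values are hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    eps = rng.normal(0.0, b * s1, size=x.shape)
    eta = rng.normal(0.0, s2, size=x.shape)
    return a + eps + b * np.exp(eta) * x

rng = np.random.default_rng(1)
low = simulate_intensities(np.zeros(20000), rng=rng)            # gene "off"
high = simulate_intensities(np.full(20000, 10000.0), rng=rng)   # highly expressed

print("sd at zero abundance:", low.std())
print("sd at high abundance:", high.std())
```

At zero abundance only the additive term contributes, so the spread stays near `b * s1`; at high abundance the multiplicative term dominates and the spread grows roughly in proportion to the signal.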

The two-component model: "additive" noise and "multiplicative" noise
(Figure panels: raw scale, log scale)
B. Durbin, D. Rocke, JCB 2001

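variance stabilizing transformations
Let $X_u$ be a family of random variables with $E[X_u] = u$ and $\operatorname{Var}(X_u) = v(u)$.
Define $f(x) = \int^x \frac{du}{\sqrt{v(u)}}$. Then $\operatorname{Var} f(X_u)$ is approximately independent of $u$.
Derivation: linear approximation.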
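The linear-approximation step can be written out as follows. This is the standard delta-method argument, stated here under the assumption that $f$ is smooth and that $X_u$ stays close to its mean $u$.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u) \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \operatorname{Var}(X_u) = f'(u)^2\, v(u) \\
\text{require } f'(u)^2\, v(u) = 1
  &\;\Longrightarrow\; f'(u) = \frac{1}{\sqrt{v(u)}}
  \;\Longrightarrow\; f(x) = \int^x \frac{du}{\sqrt{v(u)}}
\end{align*}
```

Any other constant on the right-hand side only rescales $f$, so the transformation is determined up to an affine map.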

variance stabilization
(Figure: f(x) plotted against x)

variance stabilizing transformations
1.) constant variance ('additive')
2.) constant CV ('multiplicative')
3.) offset
4.) additive and multiplicative
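For the four cases, the integral $f(x) = \int^x du / \sqrt{v(u)}$ can be evaluated in closed form. The variance functions below are assumed parametrisations of each case (with $s$, $u_0$ as generic constants); the resulting transformations are standard integrals, determined up to affine rescaling.

```latex
\begin{align*}
\text{1.)}\quad & v(u) = s^2                      && \Rightarrow\; f(u) \propto u \\
\text{2.)}\quad & v(u) \propto u^2                && \Rightarrow\; f(u) \propto \log u \\
\text{3.)}\quad & v(u) \propto (u + u_0)^2        && \Rightarrow\; f(u) \propto \log(u + u_0) \\
\text{4.)}\quad & v(u) \propto (u + u_0)^2 + s^2  && \Rightarrow\; f(u) \propto \operatorname{arsinh}\!\left(\frac{u + u_0}{s}\right)
\end{align*}
```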

the "glog" transformation
(Figure legend: dashed line f(x) = log(x); solid line h_s(x) = asinh(x/s))
P. Munson, 2001; D. Rocke & B. Durbin, ISMB 2002
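A short numerical comparison shows how h_s(x) = asinh(x/s) relates to the ordinary logarithm. In this sketch the function name `glog` and the value s = 100 are assumptions for illustration.

```python
import numpy as np

def glog(x, s=1.0):
    """h_s(x) = asinh(x / s); defined for all real x, including x <= 0."""
    return np.arcsinh(np.asarray(x, dtype=float) / s)

s = 100.0  # hypothetical scale parameter
x = np.array([-50.0, 0.0, 50.0, 5000.0, 50000.0])

h = glog(x, s)
# For x >> s, asinh(x/s) approaches log(2x/s) = log(x) + log(2/s).
log_approx = np.log(2.0 * x[x > 0] / s)

print("glog:          ", h)
print("log(2x/s), x>0:", log_approx)
```

For large intensities the two curves differ only by a constant shift, so differences of glog values behave like log-ratios; near zero and below, glog stays finite and roughly linear.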

variance stabilization
(Figure: difference, log-ratio and generalized log-ratio, shown on the raw, log and glog scales; variance: constant part, proportional part)
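The stabilizing effect can be checked by simulation. This sketch assumes an additive-multiplicative error model with known, hypothetical parameters `a`, `b`, `s1`, `s2`; in practice these would have to be estimated from the data. The helper `glog_ratio` is likewise a hypothetical name for the difference of two glog-transformed values.

```python
import numpy as np

# Hypothetical model parameters: offset, gain, additive sd, multiplicative sd.
a, b, s1, s2 = 50.0, 1.0, 20.0, 0.2
c = b * s1 / s2  # scale at which additive and multiplicative parts balance

def simulate(x, n, rng):
    eps = rng.normal(0.0, b * s1, size=n)   # additive noise
    eta = rng.normal(0.0, s2, size=n)       # multiplicative noise
    return a + eps + b * x * np.exp(eta)

def glog_ratio(y1, y2):
    """Generalized log-ratio: difference of glog-transformed intensities."""
    return np.arcsinh((y1 - a) / c) - np.arcsinh((y2 - a) / c)

rng = np.random.default_rng(2)
abundances = (0.0, 100.0, 1000.0, 10000.0)
raw_sd, glog_sd, frac_nonpositive = [], [], []
for x in abundances:
    y = simulate(x, 20000, rng)
    raw_sd.append(y.std())
    glog_sd.append(np.arcsinh((y - a) / c).std())
    frac_nonpositive.append(np.mean(y - a <= 0))  # log undefined after subtraction

for row in zip(abundances, raw_sd, glog_sd, frac_nonpositive):
    print("x=%7.0f  raw sd=%8.1f  glog sd=%.3f  fraction <= 0: %.2f" % row)
```

On the raw scale the standard deviation grows with abundance, while on the glog scale it stays close to `s2` throughout; the last column shows how many background-subtracted values could not be log-transformed at all.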

parameter estimation
o maximum likelihood estimator: straightforward, but sensitive to deviations from normality
o model holds for genes that are unchanged; differentially transcribed genes act as outliers
o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression
o works well as long as <50% of genes are differentially transcribed (and may still work otherwise)

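Least trimmed sum of squares regression (P. Rousseeuw, 1980s)
minimize $\sum_{i=1}^{h} r_{(i)}^2$, the sum of the h smallest squared residuals, where $r_{(1)}^2 \le \dots \le r_{(n)}^2$ are the ordered squared residuals and h < n
(Figure legend: least sum of squares; least trimmed sum of squares)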
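The idea can be tried on a toy straight-line fit. The sketch below is not the estimator used in vsn; it is a simple LTS heuristic (random two-point starts followed by concentration steps, in the spirit of FAST-LTS), and the function name `lts_line`, its defaults and the test data are assumptions.

```python
import numpy as np

def lts_line(x, y, h=None, n_starts=50, n_csteps=20, rng=None):
    """Least trimmed squares fit of y = intercept + slope * x (heuristic)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    h = n // 2 + 1 if h is None else h          # number of residuals kept
    X = np.column_stack([np.ones(n), x])
    best_coef, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=2, replace=False)   # random start
        for _ in range(n_csteps):
            # Fit on the current subset, then keep the h best-fitting points.
            coef = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
            r2 = (y - X @ coef) ** 2
            subset = np.argsort(r2)[:h]
        obj = np.sort(r2)[:h].sum()             # trimmed sum of squares
        if obj < best_obj:
            best_coef, best_obj = coef, obj
    return best_coef

# Hypothetical data: a line with 30% of the points shifted (the "outliers").
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=100)
y[:30] += 20.0

ols = np.linalg.lstsq(np.column_stack([np.ones(100), x]), y, rcond=None)[0]
lts = lts_line(x, y)
print("least squares  (intercept, slope):", ols)
print("least trimmed  (intercept, slope):", lts)
```

The ordinary least-squares line is pulled towards the shifted points, whereas the trimmed fit follows the majority, which mirrors the role of unchanged versus differentially transcribed genes.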

evaluation: effects of different data transformations
(Figure axes: difference red-green versus rank(average))

glog

For Affymetrix data, it turns out that the weak background correction method of RMA and the glog(-ratio) of vsn give very similar results.
vsn is also useful for other array platforms (e.g. spotted two-color).
Don't be afraid of the "glog": it is equivalent to weak (= biased) background correction and normal log!
vsn package (see vignette). Ref.: Huber, von Heydebreck et al., Bioinformatics 2002

evaluation: sensitivity / specificity in detecting differential abundance
o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides with 4000 genes each.
o 6 different strategies for normalization and quantification of differential abundance.
o Calculate for each gene and each method: t-statistics, permutation p-value.
o For threshold α, compare the number of genes the different methods find, #{p_i | p_i ≤ α}.
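For paired data, a per-gene t-statistic and permutation p-value could be computed along the following lines. This is a sketch under assumptions: it uses a two-sided sign-flip permutation test on the paired differences, which may differ from the procedure behind the slides, and the names `paired_t` and `sign_flip_pvalue` and the example data are hypothetical.

```python
import numpy as np

def paired_t(d):
    """One-sample t-statistic of the paired differences d."""
    d = np.asarray(d, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

def sign_flip_pvalue(d, n_perm=2000, rng=None):
    """Two-sided permutation p-value obtained by randomly flipping signs."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = np.asarray(d, dtype=float)
    t_obs = abs(paired_t(d))
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))
    t_perm = np.abs([paired_t(s * d) for s in signs])
    return (1 + np.sum(t_perm >= t_obs)) / (1 + n_perm)

# Hypothetical log-ratios for one gene across 19 tumor/normal pairs.
rng = np.random.default_rng(4)
d_changed = rng.normal(1.0, 0.3, size=19)
d_null = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])

p_changed = sign_flip_pvalue(d_changed)
p_null = sign_flip_pvalue(d_null)
print("p (shifted gene):  ", p_changed)
print("p (symmetric gene):", p_null)
```

Given one p-value per gene and method, the comparison in the last bullet amounts to counting `np.sum(p <= alpha)` for each method.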

evaluation: comparison of methods
(Figure panels: one-sided test for up; one-sided test for down)
more accurate quantification of differential expression ⇒ higher sensitivity / specificity

evaluation: a benchmark for Affymetrix genechip expression measures
o Data:
  Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
o Benchmark: 15 quality measures regarding
  - reproducibility
  - sensitivity
  - specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu

ROC curves

affycomp results
(Figure legend: good, bad)

Probe Set Summarization

Probe set summarization: data and notation
$PM_{ijg}$, $MM_{ijg}$ = intensities for perfect match and mismatch probe j for gene g in chip i
o i = 1,…, n: one to hundreds of chips
o j = 1,…, J: usually 11 or 16 probe pairs
o g = 1,…, G: 6…30,000 probe sets
Tasks:
o calibrate (normalize) the measurements from different chips (samples)
o summarize for each probe set the probe level data, i.e., 16 PM and MM pairs, into a single expression measure
o compare between chips (samples) for detecting differential expression

expression measures: MAS 4.0
Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:
$AvDiff = \frac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$
o sort $d_j = PM_j - MM_j$
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average
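The three steps can be sketched in a few lines. This follows one reading of the steps (mean and standard deviation are computed after dropping the extreme values, then all pairs within 3 standard deviations are averaged); it is an assumption about the details, not a reimplementation of the MAS 4.0 software, and the function name `avdiff` is hypothetical.

```python
import numpy as np

def avdiff(pm, mm):
    """Trimmed mean of PM - MM differences for one probe set (sketch)."""
    d = np.sort(np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float))
    core = d[1:-1]                                # exclude highest and lowest value
    mean, sd = core.mean(), core.std(ddof=1)
    in_J = np.abs(d - mean) <= 3.0 * sd           # pairs within 3 SD of the average
    return d[in_J].mean()

# Hypothetical probe set: 15 well-behaved pairs and one aberrant pair.
mm = np.full(16, 50.0)
pm = mm + np.array([10.0] * 15 + [1000.0])
print(avdiff(pm, mm))
```

The single aberrant pair is excluded from J, so it does not affect the summary.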

Expression measures: MAS 5.0
Instead of MM, use a "repaired" version CT:
CT = MM if MM < PM
CT = PM / "typical log-ratio" if MM >= PM
"Signal" = Tukey.Biweight(log(PM - CT)) (…median)
Tukey Biweight: B(x) = (1 - (x/c)^2)^2 if |x| < c, 0 otherwise
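The weight function B can be turned into a robust average as follows. The sketch uses a one-step estimate centred at the median with the cut-off set to a multiple of the median absolute deviation; the tuning constant 5, the small `eps` and the function names are assumptions here, and the computation of CT is left out.

```python
import numpy as np

def biweight(x, c):
    """B(x) = (1 - (x/c)^2)^2 if |x| < c, 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < c, (1.0 - (x / c) ** 2) ** 2, 0.0)

def tukey_biweight_location(values, tune=5.0, eps=1e-4):
    """One-step Tukey biweight average around the median (sketch)."""
    values = np.asarray(values, dtype=float)
    centre = np.median(values)
    mad = np.median(np.abs(values - centre))      # robust scale
    w = biweight(values - centre, tune * mad + eps)
    return np.sum(w * values) / np.sum(w)

# Hypothetical log(PM - CT) values with one aberrant probe.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
robust = tukey_biweight_location(values)
print("mean:", values.mean(), " biweight:", robust)
```

Points far from the median receive weight 0, so the aberrant probe has no influence, unlike in the plain mean.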

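Expression measures: Li & Wong
dChip fits a model for each gene,
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma^2)$
where
• $\theta_i$: expression index of the gene in chip i
• $\phi_j$: probe sensitivity
The maximum likelihood estimate of the MBEI (model-based expression index) is used as the expression measure of the gene in chip i. Needs at least 10 or 20 chips. Current version works with PMs only.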
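A multiplicative chip-by-probe model of this kind can be fitted by least squares, which coincides with maximum likelihood under independent normal errors. The sketch below swaps in a plain rank-one SVD fit; it is not dChip's iterative procedure with outlier handling. The normalisation sum(phi^2) = J, the function name `fit_multiplicative_model` and the example values are assumptions.

```python
import numpy as np

def fit_multiplicative_model(y):
    """Least-squares fit of y[i, j] ~ theta[i] * phi[j] (chips x probes)."""
    y = np.asarray(y, dtype=float)
    n_probes = y.shape[1]
    u, sv, vt = np.linalg.svd(y, full_matrices=False)
    phi = vt[0] * np.sqrt(n_probes)               # scaled so sum(phi^2) = J
    theta = u[:, 0] * sv[0] / np.sqrt(n_probes)
    if phi.sum() < 0:                             # resolve the sign ambiguity
        theta, phi = -theta, -phi
    return theta, phi

# Hypothetical noise-free example: 3 chips, 4 probes.
theta_true = np.array([100.0, 200.0, 400.0])
phi_true = np.array([0.5, 1.0, 1.5, 1.0])
phi_true *= np.sqrt(len(phi_true) / np.sum(phi_true ** 2))
y = np.outer(theta_true, phi_true)

theta_hat, phi_hat = fit_multiplicative_model(y)
print("theta:", theta_hat)
print("phi:  ", phi_hat)
```

Because the probe sensitivities are shared across chips, the fit needs enough chips to estimate them reliably.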

Robust expression measures: RMA, Irizarry et al. (2002)
o AvDiff-like: an average over A, a set of "suitable" pairs.
o Li-Wong-like: additive model $\log_2(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$
Estimate RMA = $a_i$ for chip i using a robust method, median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
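Median polish, as described in parentheses above, can be sketched as follows for one probe set. The function name `median_polish`, the iteration limit and the example matrix are assumptions; the input is taken to be a chips-by-probes matrix of log2 background-corrected PM values.

```python
import numpy as np

def median_polish(z, max_iter=10, tol=1e-8):
    """Decompose z[i, j] = overall + row[i] + col[j] + resid[i, j]."""
    resid = np.array(z, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        row_med = np.median(resid, axis=1)        # remove row medians
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)                    # accumulate into overall term
        col -= shift
        overall += shift
        col_med = np.median(resid, axis=0)        # remove column medians
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)
        row -= shift
        overall += shift
        if np.all(np.abs(row_med) < tol) and np.all(np.abs(col_med) < tol):
            break
    return overall, row, col, resid

# Hypothetical probe set: 3 chips (rows) x 5 probes (columns), exactly additive.
chip_effect = np.array([8.0, 9.0, 10.0])
probe_effect = np.array([-1.0, 0.0, 1.0, 0.5, -0.5])
z = chip_effect[:, None] + probe_effect[None, :]

overall, row, col, resid = median_polish(z)
expression = overall + row                        # one summary value per chip
print(expression)
```

The per-chip expression value is the overall term plus the chip (row) effect; the probe (column) effects absorb differences in probe affinity.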

Expression measures: RMA, Irizarry et al. (2002)
o Estimate one global background value b = mode(MM). No probe-specific background!
o Assume: PM = $s_{true}$ + b. Estimate s ≥ 0 from PM and b as a conditional expectation E[$s_{true}$ | PM, b].
o Use log2(s).
o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
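The last step, quantile normalization, can be illustrated with a small sketch. It gives every chip the same empirical distribution, namely the mean of the sorted columns; ties are broken arbitrarily here, which is a simplification, and the function name `quantile_normalize` and the example matrix are assumptions.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (chips) of a probes x chips matrix."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # rank of each value per chip
    reference = np.sort(x, axis=0).mean(axis=1)         # mean of sorted columns
    return reference[ranks]

# Hypothetical intensities: 4 probes (rows) on 3 chips (columns).
x = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
xn = quantile_normalize(x)
print(xn)
```

After normalization each column contains the same set of values, only in a different order, so chip-wide differences in distribution are removed while the within-chip ranking of probes is kept.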

Affymetrix: $I_{PM} = I_{MM} + I_{specific}$ ?
(Figure: log(PM/MM). From: R. Irizarry et al., Biostatistics 2002)
