Alex Lewin (Imperial College, Dept of Epidemiology) Sylvia Richardson ( IC Epidemiology)

Bayesian modelling of gene expression data Alex Lewin (Imperial College, Dept of Epidemiology) Sylvia Richardson (IC Epidemiology) Tim Aitman (IC Microarray Centre) In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology) Helen Causton (IC Microarray Centre) Peter Green (Bristol)

Contents • Introduction to microarrays • Bayesian hierarchical model for differential gene expression • Simultaneous estimation of normalization and differential expression • Predictive model checks on prior for gene variances

Introduction to microarrays

Gene Expression DNA -> mRNA -> protein Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html

DNA TGCT RNA ACGA Hybridization • Known sequences of single-stranded DNA immobilised on microarray • Tissue sample (with unknown concentration of RNA) fluorescently labelled • Sample hybridised to array • Excess sample washed off array • Array scanned to measure amount of RNA present for each sequence on array

The Principle of Hybridisation * * * * * Zoom Image of Hybridised Array Hybridised Spot Single stranded, labeled RNA sample Oligonucleotide element 20µm Millions of copies of a specific oligonucleotide sequence element Expressed genes Approx. ½ million different complementary oligonucleotides Non-expressedgenes 1.28cm Image of Hybridised Array Slide courtesy of Affymetrix

Output of Microarray • Each gene is represented by several different DNA sequences (probes) • Obtain intensity for each probe • Different tissue samples on different arrays so compare gene expression for different experimental conditions

Low-level Model (how gene expression is estimated from signal) Normalisation (to make arrays comparable) Clustering Partition Model Differential Expression Microarray analysis is amulti-step process We aim to integrate all the steps in a common statistical framework

Bayesian Hierarchical Model for Differential Expression

Data Set and Biological question Previous Work (Tim Aitman, Anne Marie Glazier) The spontaneously hypertensive rat (SHR): A model of human insulin resistance syndromes. Deficiency in gene Cd36 found to be associated with insulin resistance in SHR (spontaneously hypertensive rat) Differential Expression 3 of 18

Data Set and Biological question Microarray Data 3 SHR compared with 3 transgenic rats 3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out  12000 genes on each array Biological Question Find genes which are expressed differently in wildtype and knockout mice. Differential Expression 4 of 18

Condition 1 (3 replicates) Condition 2 (3 replicates) Data Needs ‘normalisation’ Spline curves shown Differential Expression 5 of 18

Model for Differential Expression • Expression-level-dependent normalisation • Only 3 replicates per gene, so share information between genes to estimate gene variances • To select interesting genes, use posterior distribution of log fold change Differential Expression 6 of 18

Bayesian hierarchical model for differential expression Data: ygsr = log gene expression for gene g, replicate r g = gene effect δg = differential effect for gene g between 2 conditions r(g)s = array effect (expression-level dependent) gs2 = gene variance • 1st level yg1r  N(g – ½ δg + r(g)1 , g12), yg2r  N(g + ½ δg + r(g)2 , g22), Σrr(g)s = 0, r(g)s = function of g , parameters {a} and {b} • 2nd level Priors for gδg, coefficients {a} and {b} gs2  lognormal (μs, τs)

Details of array effects (Normalization) Piecewise polynomial with unknown break points: r(g)s = quadratic in g for ars(k-1)≤ g ≤ ars(k) with coeff (brsk(1),brsk(2) ), k =1, … #breakpoints Locations of break points not fixed Must do sensitivity checks on # break points Cubic fits well for the data we are interested in

cubic loess Non linear fit of array effect as a function of gene effect

Wildtype Knockout Effect of normalisation on density Before (ygr) ^ After (ygr- r(g) )

Smoothing of the gene specific variances • Variances are estimated using information from all G x R measurements (~12000 x 3) rather than just 3 • Variances are stabilised and shrunk towards average variance

Simultaneous estimation of normalization and differential expression

Expression level dependent normalization Many gene expression data sets need normalization which depends on expression level. Usually normalization is performed in a pre-processing step before the model for differential expression is used. These analyses ignore the fact that the expression level is measured with variability. Ignoring this variability leads to bias in the function used for normalization. We show this on a simulated data set.

Mouse Data 3 replicate arrays (wildtype mouse data) Model: posterior means E(r(g)s | data) v. E(g | data) Data:ygsr - E(g | data)

Simulated Data • 1000 genes with 3 replicates under 2 conditions • Expression levels g between 0 and 10 (log scale) • g12  log Normal (-1.8,1), g22  log Normal (-2.2,1) • 900 genes: δg= 0 • 50 genes: δg N( log(3), 0.12) • 50 genes: δg N( -log(3), 0.12) • Array effects r(g)s cubic functions of g

Array Effects and Variability for Simulated Data

Two-step method (loess) • Use loess smoothing to obtain array effects loessr(g)s • Subtract loess array effects from data: yloessgsr = ygsr - loessr(g)s • Run our model on yloessgsrwith no array effects

Decision rules for selecting differentially expressed genes If P(δg > δcut | data) > pcut then gene g is called differentially expressed. We used δcut= log(3) – corresponds to null hypothesis. Various pcut – choose this according to acceptable error rate (e.g. False Discovery Rate).

Full model v. two-step method Plot observed False Discovery Rate against pcut (averaged over 5 simulations) Solid line for full model Dashed line for pre-normalized method

Two-step method (2 versions) • yloessgsr = ygsr - loessr(g)s • ymodelgsr = ygsr - E(r(g)s | data) • Results from 2 different two-step methods are closer to each other than to full model results.

Predictive model checks on prior for gene variances

Cross-validation p-values Data yi, i=1,…,n. For each observation yi, run model on rest of data y-i, predict new data yinew from model. Compare predicted data with observed validation data, via a Bayesian p-value. Bayesian p-value pi = Prob(yinew> yi | data) Distribution of p-values {pi,i=1,…,n} is approximately Uniform if model is adequate.

Predictive p-values For large n, notpossible to run model n times. Run model on all data.For each observation yi, predict new data yinew from model. Bayesian p-value pi = Prob(yinew> yi | data) Since data is used twice, the p-values are conservative (not quite Uniform).

Check priors on gene variances • Compare equal and exchangeable variance models • Compare different exchangeable priors Want to compare data for each gene, not gene and replicate, so use sample variance Sg2 (suppress index s here) Bayesian p-value Prob( Sg2 new > Sg2 obs | data) Use mixed predictive p-values: less conservative than posterior predictive p-values.

new new Sg2 ygr Posterior predictive Prior parameters Mean parameters g2 ygr r = 1:R Sg2 g = 1:G

new new new Sg2 ygr g2 Mixed predictive Prior parameters Mean parameters g2 ygr r = 1:R Sg2 g = 1:G

Four models for gene variances Equal variance model: Model 1: 2  log Normal (0, 10000) Exchangeable variance models: Model 2: g-2  Gamma (2, β) Model 3: g-2  Gamma (α, β) Model 4: g2  log Normal (μ, τ) (α, β, μ, τ all parameters)

Bayesian predictive p-values

Summary/Discussion • More false positives if normalization carried out in a pre-processing step. • Larger slope of array effects – larger difference between full and pre-normalized models • Predictive checks show exchangeable prior good for gene variances • Lewin, A., Richardson, S., Marshall C., Glazier A. and Aitman T. (2004) Bayesian Modelling of Differential Gene Expression, available at http ://www.bgx.org.uk/

Alex Lewin (Imperial College, Dept of Epidemiology) Sylvia Richardson ( IC Epidemiology)