Sylvia Richardson, with Alex Lewin Department of Epidemiology and Public Health, Imperial College

Bayesian modelling of differential gene expression data Sylvia Richardson, with Alex Lewin Department of Epidemiology and Public Health, Imperial College In collaboration with Anne Mette Hein and Clare Marshall (Imperial) Philippe Broët (INSERM) and Peter Green (Bristol) Helen Causton and Tim Aitman (Hammersmith)

Outline • Introduction: what is gene expression data ? • Array effects (normalisation) • Exchangeable model for the gene specific variances and Bayesian model checks • Differential expression • Rank statistics • Mixture models for differential expression

What is gene expression data ? • Protein-encoding genes are transcribed into mRNA (messenger), and the mRNA is translated to make proteins, the building blocks of living cells • DNA Microarrays can be used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell type, under particular conditions • The fundamental principle used to measure the expression is that of hybridisation Introduction 1

The Principle of Hybridisation * * * * * Zoom Image of Hybridised Array Hybridised Spot Single stranded, labeled RNA sample Oligonucleotide element 20µm Millions of copies of a specific oligonucleotide sequence element Expressed genes Approx. ½ million different complementary oligonucleotides Non-expressed genes 1.28cm Image of Hybridised Array Slide courtesy of Affymetrix Introduction 2

Hybridisation indicates that the corresponding gene is present because : • The binding is highly specific • The sequences represented on the array are designed to be unique in the genome • The expression level of thousands of genes are measured on a single microarray • gene expression profile • There are different types of arrays : • we consider the case of oligonucleotide arrays where one extract per slide is hybridised Introduction 3

Variation and uncertainty Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability • Treatment • Response to genetic and environmental conditions • Biological heterogeneity • Sample preparation • Array manufacture • Hybridisation process • Imaging signal Different components of noise Introduction 4

Variation and uncertainty Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability • Treatment • Response to genetic and environmental conditions • Biological heterogeneity inherent • Sample preparation • Array manufacture • Hybridisation process • Imaging signal Good practice minimizes these sources of variation

Gene expression analysis is a multi-step process Low-level Model (how is the measured expression related to the signal) We aim to integrate all the steps in a common statistical framework Normalisation (to make samples comparable) Clustering Partition Model Differential Expression Introduction 5

Bayesian hierarchical model framework • Ability to model various sources of variability: e.g. detailed modelling of experimental variability: within array, between array, estimation of gene specific variability … • Building of all these features into a common model uncertainty is propagated • Ability to borrow / share information in appropriate ways to get better estimates Introduction 6

Data Set and Biological question Previous Work (Tim Aitman, Anne Marie Glazier) Deficiency in gene Cd36 found to be associated with insulin resistance in SHR (spontaneously hypertensive rat) Good animal model to tease out genes implicated in this syndrome Microarray Study • 3 SHR compared with 3 transgenic rats • 3 wildtype mice compared with 3 knockout mice • Two tissues: fat and heart • Affymetrix chips U34A-C and U74A-C ( 12000 genes)

I Additive (log scale) model for expression Notation • ygr = gene expression measurements (log scale) for gene g, g=1, …, N, replicate r, r = 1,…, R Additive model: ygr = g + r(g) + gr Here: -- g is the expression level of the gth gene, -- r(g)is the array effect (normalisation term) possibly dependent on g through the expression level g , constraints needed to ensure identifiability: Σrr(g)= 0 -- gran error term, mean 0, Var(gr ) = σg2 Model 1

Exploratory analysis of array effect ri, r =1:3, i=1:6 6 equal size groups of genes defined by mean expression level Each array corresponds to a different colour Estimate a constant array effect ri for each group i, i =1:6 ygr = g + ri + gr Wildtype mouse fat data on 3 arrays Model 2

Flexible model of the array effect The exploratory analysis suggests to model the array effect as a (smooth) function of g Piecewise polynomial with unknown break points: r(g) = quadratic in g for ark-1≤ g ≤ ark with coeff (brk(1),brk(2) ), k =1, … # knots Location of break points are not fixed All coefficientsbrk(j)are given centred normal priors, Σrr(g) = 0 constraint Model 3

Non linear fit of array effect as a function of level g Model 4

Effect of normalisation on density Wildtype Knockout Before (ygr) After (ygr- r(g) ) Model 5

Hierarchical structure for gene variability in each condition • 2nd level of the model : Exchangeable hierarchical prior: g2  lognormal (μ, τ), g = 1, … N The hyper parameters μ and τ can be influential • 3rd level of the model μ  N( c, d) τ  lognormal (e, f) where the constants c, d, e and f are chosen so that the third level priors are vague Model 6

Graphical representation of the 3-level model ar ar br br r(g) g2 g μ, τ ygr r = 1:R g = 1:G Model 7

Smoothing of the gene specific variances • Variances are estimated using information from all G x R measurements (~12000 x 3) rather than just 3 • Variances are stabilised and shrunk towards average variance The implementation of the model uses WinBugs Model 8

Bayesian Model Checking • Check different possible assumptions on gene variances, e.g. equal : gr  N(0, 2) or exchangeable variances ? • Predict sample variance Sg2 new(our checking function)from the model specification • Compare predicted Sg2 newwith observed Sg2 obs Bayesian p-value Prob( Sg2 new > Sg2 obs ) • Distribution of p-values Uniform if model is ‘true’ • Easily implemented in MCMC algorithm Model 9

Bayesian model checking ar ar br br r(g) g2 ygr g μ, τ new g2 new ygr r = 1:R g = 1:G Model 10

Bayesian predictive p-values Equal variance model has too little variability for the data Exchangeable variance model is supported by the data

Samples Genes Cond 1 yg1r Cond 2 yg2r II-- Analysing gene expression data Gene expression data matrix • Gene expression data can be used in several types of analysis: • -- Comparison of gene expression under different experimental conditions, • -- Classification of gene expression profiles • -- Exploration of patterns in gene expression matrices • -- Association of gene expression with factors, e.g. prognosis

Differential expression model The quantity of interest is the difference between conditions for each gene: dg , g = 1, …,N Joint model for the 2 conditions : yg1r = g - ½ dg + 1r(g) + g1r , r = 1, … R1 yg2r = g + ½ dg + 2r(g) + g2r , r = 1, … R2 (1) • g is now the overall gene effect over the conditions • Same assumptions for the distribution of σ2gs and the modelling of sr(g) as before, s = 1, 2 • All hyper parameters are indexed by the condition Differential 2

The differences dg (≈ log fold change) or the standardised differences dg* =dg / (σ2g1 /R1 + σ2g2 /R2 )½ We obtain the joint distribution of all {dg } or {dg* } Process the output to have the distributions of the ranks { r (dg ), g = 1, …., N } Model the distribution of dgflexibly to allow a mixture of (small) subgroups of genes with ‘extreme’ dg(H1) and a (large) group of genes with dgaround 0 (H0) Possible statistics for differential expression Differential 3

Ranks of modelled log fold change dg 2.5% - 97.5% rank intervals for each gene 150 genes with lowest rank Even genes with median rank less than 100 can have large uncertainty 3 wildtype mice compared to 3 knockout mice Differential 4

Posterior probabilities for each gene to be ranked In the bottom 100 In the top 100 log fold change dg Differential 5

Mixture modelling the distribution of dg • Mixture model framework but with an ‘unknown’ number of components • Fully Bayesian hierarchical framework (the number of components is a random variable) • Based on Green (1995) and Richardson and Green (1997) papers • MCMC algorithms • Applied in an experimental context concerning bladder cancer • (Broët, Richardson and Radvanyi; Journal of Computational Biology, 9, 671-683, 2002) Mixture 1

Bayesian mixture model Normal mixtures with an unknown number of states A gene can be in different states: down regulated, …, unaffected, …, up-regulated Bayesian estimation of posterior distribution: • for number of states (besides the unaffected one) • for the allocation of genes to the states classification of states in the ‘extreme components’ or in the central one, based on their posterior probability the central state corresponding to dg≈ 0 Mixture 2

Mixture model specification dg ~ w0N(0, λ2η02) + j=1:kwjN(μj, ηj2) Prior setting λ2 > 1 expresses that the central component is expected to have a larger variance μj+ > 0 , ordered, uniform on upper range μj- < 0 , ordered, uniform on lower range {η02 , ηj2 } exchangeable, Gamma distributed Weights {w0 , wj, j = 1:k} ~ Dirichlet, uniform or other alternatives k, unknown number of components μj Mixture 3

Biological Context • Data from the Curie Institute/Research Section • (F. Radvanyi team, CNRS/Curie) • The cell line: T24 bladder cell lines • The transfection: FGFR2 cDNA (fibroblast growth factor receptor 2) • Comparison of 2 cell lines: Unmodified and Modified (transfected by FGFR2 cDNA) • Material: Nylon microarray, 4608 gene expression from the same batch, 33P labelling, 4 replicates, same experimenter, same day Aim: to study transcriptional changes induced by a defined DNA transfection in a cell line Mixture 4

Comparison of gene expression in two bladder cancer cell lines (unmodified versus transfected) Distribution of dgand QQ plot dg: Differential expression for gene g after taking into account array and cell line main effects. Negative values of dgcorrespond to up-regulated genes.

Posterior distribution of the number of mixture components Components left of central one P(0) = 0.00 P(1) = 0.11 P(2) = 0.73 P(3) = 0.14P(4) = 0.01 P(5) = 0.00 Components right of central one P(0) = 0.00 P(1) = 0.46 P(2) = 0.49 P(3) = 0.04 P(4) = 0.052 P(5) = 0.00 Support for a model with 5 components, 2 on the left and 2 on the right of the central one Mixture 5

Estimate of the posterior probability for each gene to be in each of the 5 components

Classification • Classification based on maximum a posteriori rule: • Group 1 2 3 4 5 • Nb 9 26 4540 29 4 • Mean -0.82 -0.51 0 0.51 0.86 • SD 0.16 0.08 0.14 0.10 0.17 • Weight 0.2% 0.8% 98% 0.8% 0.2% • This analysis highlights 9 up-regulated genes (expressed after transfection of the receptor) and 4 down-regulated ones, with an indication of 2 further subgroups showing some evidence of differential expression

Summary • Model different sources of variability into a single model • Borrow information from all genes to estimate gene specific variances, non linear array effect • Exploit the joint distribution of the differential expression measure through ranks or mixture models • Further work is under way • on modelling of the low level Affy probe data • on more general clustering algorithms

Sylvia Richardson, with Alex Lewin Department of Epidemiology and Public Health, Imperial College