Proposed Bayesian model incorporates array effects in differential expression analysis, overcoming bias in normalization. Simulation shows fewer false positives compared to traditional methods.
Simultaneous Normalization and Differential Expression
Alex Lewin, Sylvia Richardson (IC Epidemiology)
Tim Aitman (IC Microarray Centre)
In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology), Helen Causton (IC Microarray Centre), Peter Green and Graeme Ambler (Bristol)

Expression level dependent normalization
Many gene expression data sets need normalization which depends on expression level. Usually normalization is performed in a pre-processing step before the model for differential expression is used. These analyses ignore the fact that the expression level is measured with variability. Ignoring this variability leads to bias in the function used for normalization.

Simultaneous normalization and differential expression
We propose a Bayesian model which includes array effects (normalization) in the differential expression model. We show (on simulated data) that ignoring the variability in the expression level leads to a greater number of false positives.

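Bayesian hierarchical model for differential expression
Data: $y_{gsr}$ = log gene expression for gene g, condition s, replicate r
$\alpha_g$ = gene effect
$\delta_g$ = differential effect for gene g between 2 conditions
$\beta_{r(g)s}$ = array effect (expression-level dependent)
$\sigma_{gs}^2$ = gene variance
• 1st level
  $y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1},\ \sigma_{g1}^2)$,
  $y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2},\ \sigma_{g2}^2)$,
  $\sum_r \beta_{r(g)s} = 0$,
  $\beta_{r(g)s}$ = function of $\alpha_g$, parameters {a} and {b}
• 2nd level
  Priors for $\alpha_g$, $\delta_g$, coefficients {a} and {b}
  $\sigma_{gs}^2 \sim \mathrm{lognormal}(\mu_s, \tau_s)$
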
Details of array effects (Normalization)
Piecewise polynomial with unknown break points:
$\beta_{r(g)s}$ = quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$, with coefficients $(b_{rsk}^{(1)}, b_{rsk}^{(2)})$, $k = 1, \dots,$ #breakpoints
• Locations of break points not fixed
• Must do sensitivity checks on # break points
• Cubic fits well for the data we are interested in

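As an illustration of the piecewise form described above, here is a minimal Python sketch of an array effect that is quadratic in the expression level on each interval between break points. The slide only names two coefficients per interval, so the exact parameterisation is an assumption: each segment is written as b1*(level - left break) + b2*(level - left break)^2, the segments are joined continuously, and levels outside the break range are clamped. The function names and the `offset` argument are hypothetical. The second helper applies the sum-to-zero constraint across replicate arrays.

```python
from bisect import bisect_right


def array_effect(level, breaks, coeffs, offset=0.0):
    """Piecewise quadratic array effect evaluated at one expression level.

    breaks: increasing break points [a_0, ..., a_K]
    coeffs: K pairs (b1, b2), one per interval [a_(k-1), a_k]
    offset: value of the curve at a_0 (hypothetical extra parameter)
    Assumed form on interval k: b1*(x - a_(k-1)) + b2*(x - a_(k-1))**2,
    shifted so that consecutive segments join continuously.
    """
    if len(coeffs) != len(breaks) - 1:
        raise ValueError("need exactly one (b1, b2) pair per interval")
    # Clamp to the range covered by the break points (a design choice).
    x = min(max(level, breaks[0]), breaks[-1])
    # Index of the interval containing x.
    k = min(bisect_right(breaks, x) - 1, len(coeffs) - 1)
    value = offset
    # Add the total rise of every complete interval to the left of x.
    for j in range(k):
        width = breaks[j + 1] - breaks[j]
        b1, b2 = coeffs[j]
        value += b1 * width + b2 * width ** 2
    b1, b2 = coeffs[k]
    d = x - breaks[k]
    return value + b1 * d + b2 * d * d


def centre_across_replicates(effects):
    """Subtract the mean so the replicate effects sum to zero."""
    mean = sum(effects) / len(effects)
    return [e - mean for e in effects]
```

In the model itself the break points and coefficients are unknown and estimated together with everything else; this sketch only evaluates the curve for given values.
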
Mouse Data
3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out
[Figure: 3 replicate arrays (wildtype mouse data). Model: posterior means $E(\beta_{r(g)s} \mid \text{data})$ v. $E(\alpha_g \mid \text{data})$. Data: $y_{gsr} - E(\alpha_g \mid \text{data})$.]

Simulated Data
• 1000 genes with 3 replicates under 2 conditions
• Expression levels $\alpha_g$ between 0 and 10 (log scale)
• $\sigma_{g1}^2 \sim \mathrm{logNormal}(-1.8, 1)$, $\sigma_{g2}^2 \sim \mathrm{logNormal}(-2.2, 1)$
• 900 genes: $\delta_g = 0$
• 50 genes: $\delta_g \sim N(\log(3), 0.1^2)$
• 50 genes: $\delta_g \sim N(-\log(3), 0.1^2)$
• Array effects $\beta_{r(g)s}$ cubic functions of $\alpha_g$

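The simulation set-up above can be sketched in plain Python as follows. Several details are assumptions, not taken from the slides: expression levels are drawn uniformly on [0, 10], the second argument of each log-normal is read as a standard deviation of 1 on the log scale, the differential effects are given standard deviation 0.1, and the cubic coefficients of the array effects are arbitrary small random values (the slides do not give them). The function names `cubic` and `simulate` are hypothetical.

```python
import math
import random


def cubic(level, coef):
    """Cubic in the expression level with no constant term."""
    c1, c2, c3 = coef
    return c1 * level + c2 * level ** 2 + c3 * level ** 3


def simulate(n_genes=1000, n_reps=3, n_up=50, n_down=50, seed=0):
    """Return (alpha, delta, y) with y[g][s][r] the simulated log expression."""
    rng = random.Random(seed)
    # Hypothetical cubic coefficients, one triple per condition and array.
    coefs = [[(rng.uniform(-0.1, 0.1),
               rng.uniform(-0.02, 0.02),
               rng.uniform(-0.002, 0.002)) for _ in range(n_reps)]
             for _ in range(2)]
    log_var_mean = (-1.8, -2.2)  # log-scale means of the gene variances
    alpha, delta, y = [], [], []
    for g in range(n_genes):
        a = rng.uniform(0.0, 10.0)  # assumed uniform on [0, 10]
        if g < n_up:
            d = rng.gauss(math.log(3), 0.1)
        elif g < n_up + n_down:
            d = rng.gauss(-math.log(3), 0.1)
        else:
            d = 0.0  # not differentially expressed
        per_condition = []
        for s in range(2):
            sd = math.sqrt(rng.lognormvariate(log_var_mean[s], 1.0))
            raw = [cubic(a, c) for c in coefs[s]]
            centre = sum(raw) / n_reps  # enforce sum-to-zero over arrays
            mean = a + (s - 0.5) * d    # a - d/2 for s=0, a + d/2 for s=1
            per_condition.append(
                [rng.gauss(mean + b - centre, sd) for b in raw])
        alpha.append(a)
        delta.append(d)
        y.append(per_condition)
    return alpha, delta, y
```

Calling `simulate()` with the defaults gives 1000 genes, of which the first 100 are truly differentially expressed, which makes it easy to score any selection rule against the truth.
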
Two-step method
• Use loess smoothing to obtain array effects $\beta^{loess}_{r(g)s}$
• Subtract loess array effects from data: $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
• Run our model on $y^{loess}_{gsr}$ with no array effects

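The pre-normalization step above can be sketched as follows. The slides do not say which loess implementation or smoothing variable was used, so this is a stand-in under stated assumptions: a simple local-linear smoother with tricube weights (no robustness iterations), applied within one condition to each array's deviation from the per-gene mean, smoothed against that per-gene mean. The names `loess_fit` and `prenormalize` and the default `span` are hypothetical.

```python
import math


def loess_fit(x, y, span=0.5):
    """Local linear fit with tricube weights, evaluated at every x[i]."""
    n = len(x)
    q = max(2, int(math.ceil(span * n)))  # number of neighbours in the window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[q - 1] or 1e-12  # bandwidth: distance to q-th neighbour
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            u = abs(xi - x0) / h
            if u >= 1.0:
                continue  # outside the window, weight zero
            w = (1.0 - u ** 3) ** 3
            dx = xi - x0
            sw += w
            swx += w * dx
            swy += w * yi
            swxx += w * dx * dx
            swxy += w * dx * yi
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # fall back to a weighted mean
        else:
            # Intercept of the weighted least-squares line at x0.
            fitted.append((swxx * swy - swx * swxy) / denom)
    return fitted


def prenormalize(y, span=0.5):
    """y[g][r]: log expression for one condition. Returns a corrected copy."""
    means = [sum(row) / len(row) for row in y]
    out = [list(row) for row in y]
    for r in range(len(y[0])):
        resid = [row[r] - m for row, m in zip(y, means)]
        effect = loess_fit(means, resid, span)  # estimated array effect
        for g, e in enumerate(effect):
            out[g][r] -= e
    return out
```

The smoother is quadratic in the number of genes, which is acceptable for a sketch but slow for full-size arrays. Note that it treats the per-gene mean as if it were the true expression level, which is exactly the simplification the full model avoids.
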
Two-step method
• $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
• $y^{model}_{gsr} = y_{gsr} - E(\beta_{r(g)s} \mid \text{data})$
• Results from 2 different two-step methods are much closer to each other than to full model results.

Decision rules for selecting differentially expressed genes
If $P(\delta_g > \delta_{cut} \mid \text{data}) > p_{cut}$ then gene g is called differentially expressed.
• We used $\delta_{cut} = \log(3)$ (corresponds to null hypothesis).
• Various $p_{cut}$: choose this according to acceptable error rate (e.g. False Discovery Rate).

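A minimal sketch of this decision rule, assuming the posterior for each gene's differential effect is available as a list of MCMC draws (the slides do not describe the sampler output). The posterior probability is estimated as the fraction of draws above the cut. The function names and the default `p_cut=0.8` are hypothetical.

```python
import math


def posterior_prob_exceeds(draws, delta_cut):
    """Monte Carlo estimate of P(delta_g > delta_cut | data)."""
    return sum(1 for d in draws if d > delta_cut) / len(draws)


def call_differential(draws_by_gene, delta_cut=math.log(3), p_cut=0.8):
    """Map each gene to True if it is called differentially expressed."""
    return {gene: posterior_prob_exceeds(draws, delta_cut) > p_cut
            for gene, draws in draws_by_gene.items()}
```

For example, `call_differential({"g1": [2.0, 2.0, 2.0, 0.5]}, delta_cut=1.0, p_cut=0.5)` calls `g1`, since three of its four draws exceed the cut.
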
Full model v. two-step method
[Figure: observed False Discovery Rate plotted against $p_{cut}$ (averaged over 5 simulations). Solid line: full model. Dashed line: pre-normalized method.]

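The quantity plotted here can be computed on simulated data, where the truly differentially expressed genes are known. The sketch below assumes each simulation supplies a list of posterior probabilities and a matching list of truth flags. Reporting an FDR of 0 when no gene is called is a convention chosen for this sketch, and the function names are hypothetical.

```python
def observed_fdr(post_prob, truly_de, p_cut):
    """Fraction of called genes that are not truly differentially expressed."""
    called = [g for g, p in enumerate(post_prob) if p > p_cut]
    if not called:
        return 0.0  # convention: no calls means no false discoveries
    false_pos = sum(1 for g in called if not truly_de[g])
    return false_pos / len(called)


def mean_fdr_curve(simulations, p_cuts):
    """Average the observed FDR over simulations at each threshold.

    simulations: list of (post_prob, truly_de) pairs, one per simulated data set
    """
    curves = [[observed_fdr(pp, de, p) for p in p_cuts]
              for pp, de in simulations]
    return [sum(col) / len(col) for col in zip(*curves)]
```

Running `mean_fdr_curve` once on posterior probabilities from the full model and once on those from the pre-normalized method gives the two curves being compared.
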
Discussion
• More false positives if normalization is carried out in a pre-processing step.
• The larger the slope of the array effects, the larger the difference between the full and pre-normalized models.
Lewin, A., Richardson, S., Marshall, C., Glazier, A. and Aitman, T. (2004) Bayesian Modelling of Differential Gene Expression. (under revision), available at http://www.bgx.org.uk/