390 likes | 406 Views
Bayesian Analysis of Differential Gene Expression in Insulin Resistance. Alex Lewin (Imperial College) Sylvia Richardson ( IC Epidemiology) Tim Aitman (IC Microarray Centre) In collaboration with Anne-Mette Hein, Natalia Bochkina ( IC Epidemiology) Helen Causton (IC Microarray Centre)
E N D
Bayesian Analysis of Differential Gene Expression in Insulin Resistance Alex Lewin (Imperial College) Sylvia Richardson (IC Epidemiology) Tim Aitman (IC Microarray Centre) In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology) Helen Causton (IC Microarray Centre) Peter Green (Bristol)
Insulin Resistance • Bayesian Hierarchical Model for Differential Expression • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes
Insulin Resistance Syndrome Obesity Hypertension Hyperlipidemia Type II Diabetes Insulin Resistance Syndrome Insulin: signals to cells to absorb glucose, maintains level of glucose in the blood. Evidence for genetic basis. Complex disorder, mechanism unknown.
Insulin-resistance genes Spontaneously hypertensive rat (SHR): a model of human insulin resistance syndromes. Quantitative Trait Loci mapping of - glucose metabolism - fatty acid metabolism - hypertriglyceridaemia - hypertension single region on chromosome 4
Insulin-resistance gene Cd36 cDNA microarray: hybridisation signal for SHR much lower than for Brown Norway and SHR.4 control strains Aitman et al 1999, Nature Genet 21:76-83
Larger microarray experiment: look for other genes associated with Cd36 Microarray Data 3 SHR compared with 3 transgenic rats (with Cd36) 3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out 12000 genes on each array Biological Question Find genes which are expressed differently between animals with and without Cd36.
Insulin Resistance • Bayesian Hierarchical Model for Differential Expression • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes
Low-level Model (how gene expression is estimated from signal) Normalisation (to make arrays comparable) Clustering, Partition Model Differential Expression Microarray analysis is amulti-step process We aim to integrate all the steps in a common statistical framework
Bayesian Modelling Framework • Model different sources of variability simultaneously, within array, between array … • Uncertainty propagated from data to parameter estimates (so not over-optimistic in conclusions). • Share information in appropriate ways to get robust estimates. • Clear principle for inference.
Bayesian hierarchical model for differential expression Data: ygsr = log expression for gene g, condition s, replicate r g = gene effect δg = differential effect for gene g between 2 conditions r(g)s = array effect (expression-level dependent) gs2 = gene variance • 1st level yg1r | g, δg, g1 N(g – ½ δg + r(g)1 , g12), yg2r | g, δg, g2 N(g + ½ δg + r(g)2 , g22), Σrr(g)s = 0 r(g)s = function of g , parameters {a} and {b}
Bayesian hierarchical model for differential expression • 2nd level Priors for g, δg, coefficients {a} and {b} gs2 | μs, τs lognormal (μs, τs) Hyper-parameters μs and τs can be influential. In a full Bayesian analysis, these arenot fixed • 3rd level μs N( c, d) τs lognormal (e, f)
ARRAY EFFECTS 11 21 12 22 13 23 DIFFERENTIAL EFFECTS GENE EFFECTS 1 1 2 2 y111 y121 y211 y221 y122 y212 y222 y112 y213 y113 y123 y223 DATA 11 12 21 22 VARIANCES 1 2
Details of array effects (Normalization) Piecewise polynomial with unknown break points: r(g)s = quadratic in g for ars(k-1)≤ g ≤ ars(k) with coeff (brsk(1),brsk(2) ), k =1, … #breakpoints Locations of break points not fixed Must do sensitivity checks on # break points Cubic fits well for the data we are interested in
cubic loess Non linear fit of array effect as a function of gene effect
Wildtype Knockout Effect of normalisation on density Before (ygr) ^ After (ygr- r(g) )
Gene specific variances are stabilised • Variances estimated using information from all G x R measurements (~12000 x 3) rather than just 3 • Variances stabilised and shrunk towards average variance
Decision rules for selecting differentially expressed genes If P( |δg| > δcut | data) > pcut then gene g is called differentially expressed. δcut chosen according to biological hypothesis of interest. pcut corresponds to the error rate (e.g. False Discovery Rate or Mis-classification Penalty) considered acceptable.
Posterior Probability of Differential Expression x pg > 0.83 (FDR=10%) Δ t-statistic > 2.78 (95% CI)
Insulin Resistance • Bayesian Hierarchical Model for Differential Expression • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes
Expression level dependent normalization Many gene expression data sets need normalization which depends on expression level. Usually normalization is performed in a pre-processing step before the model for differential expression is used. These analyses ignore the fact that the expression level is measured with variability. Ignoring this variability leads to bias in the function used for normalization.
Mouse Data 3 replicate arrays (wildtype mouse data) Model: posterior means E(r(g)s | data) v. E(g | data) Data:ygsr - E(g | data)
Simulated Data • 1000 genes with 3 replicates under 2 conditions • Expression levels g between 0 and 10 (log scale) • g12 log Normal (-1.8,1), g22 log Normal (-2.2,1) • 900 genes: δg= 0 • 50 genes: δg N( log(3), 0.12) • 50 genes: δg N( -log(3), 0.12) • Array effects r(g)s cubic functions of g
Two-step method • Use loess smoothing to obtain array effects loessr(g)s • Subtract loess array effects from data: yloessgsr = ygsr - loessr(g)s • Run our model on yloessgsrwith no array effects
Two-step method • yloessgsr = ygsr - loessr(g)s • ymodelgsr = ygsr - E(r(g)s | data) • Results from 2 different two-step methods are much closer to each other than to full model results.
Decision rules for selecting differentially expressed genes If P( |δg| > δcut | data) > pcut then gene g is called differentially expressed. δcut chosen according to biological hypothesis of interest (here we use log(3)). pcut corresponds to the error rate (e.g. False Discovery Rate or Mis-classification Penalty) considered acceptable.
Full model v. two-step method Plot observed False Discovery Rate against pcut (averaged over 5 simulations) Solid line for full model Dashed line for pre-normalized method
Insulin Resistance • Bayesian Hierarchical Model for Differential Expression • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes
Gene Ontology (GO) Database of biological terms Arranged in graph connecting related terms Directed Acyclic Graph: links indicate more specific terms ~16,000 terms from QuickGO website (EBI)
Gene Ontology (GO) from QuickGO website (EBI)
Gene Annotations • Genes/proteins annotated to relevant GO terms • Gene may be annotated to several GO terms • GO term may have 1000s of genes annotated to it (or none) • Gene annotated to term A annotated to all ancestors of A
GO annotations of genes associated with the insulin-resistance gene Cd36 Compare GO annotations of genes most and least differentially expressed Most differentially expressed ↔ pg > 0.5 (280 genes) Least differentially expressed ↔ pg < 0.2 (11171 genes)
genes not annot. to GO term genes annot. to GO term A B genes most diff. exp. C D genes least diff. exp. GO annotations of genes associated with the insulin-resistance gene Cd36 For each GO term, Fishers test on proportion of differentially expressed genes with annotations v. proportion of non-differentially expressed genes with annotations observed O = A expected E = C*(A+B)/(C+D) FatiGO website http://fatigo.bioinfo.cnio.es/
GO annotations of genes associated with the insulin-resistance gene Cd36 O = observed no. differentially expressed genes E = expected no. differentially expressed genes
Biological process Physiological process Response to stimulus Organismal movement Response to external stimulus (O=12, E=4.7) Response to biotic stimulus (O=14, E=6.9) Response to stress (O=12, E=5.9) Response to wounding (O=6, E=1.8) Response to external biotic stimulus * Defense response (O=11, E=5.8) Response to pest, pathogen or parasite (O=8, E=2.6) Immune response (O=9, E=4.5) Inflammatory response (O=4, E=1.2) All GO ancestors of Inflammatory response * This term was not accessed by FatiGO Relations between GO terms were found using QuickGO: http://www.ebi.ac.uk/ego/
Further Work to do on GO Account for dependencies • Between GO terms • Between genes Multiple testing corrections Multiple probesets for the same gene Uncertainty in annotation
Summary • Bayesian hierarchical model can estimate variances robustly • Simultaneous estimation of normalization with differential expression results in fewer false positives than with a pre-processing step • Several relevant GO terms are over-represented in the genes which are most associated with the insulin-resistance gene Cd36
Lewin, A., Richardson, S., Marshall C., Glazier A. and Aitman T. (2005) Bayesian Modelling of Differential Gene Expression, Biometrics (in press). available at http ://www.bgx.org.uk/ One Day Meeting on Data Fusion in Genomics, 7 September 2005 Institute for Mathematical Sciences, Imperial College