190 likes | 412 Views
Empirical Bayes Analysis of Variance Component Models for Microarray Data. S. Feng, 1 R.Wolfinger, 2 T.Chu, 2 G.Gibson, 3 L.McGraw 4 1. Department of Statistics, North Carolina State University, 2. SAS institute, Cary, NC 27513;
E N D
Empirical Bayes Analysis of Variance Component Modelsfor Microarray Data S. Feng,1 R.Wolfinger,2 T.Chu,2 G.Gibson,3 L.McGraw4 1. Department of Statistics, North Carolina State University, 2. SAS institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University, 4. Department of Genetics and Development, Cornell University.
Microarray:Genome-wide gene expression 1. Originally just a term in Engineering: millions of small electrodes arrayed on a slide (Silicon) 2. Introduced to Genetics/Genomics in 1996: ...Thousands of DNA sequences arrayed on a glass slide – the genome-wide gene expression profiles can be investigated simultaneously. In JSM 2004, - More than 30 sections - More than 100 stat. Papers/posters
Two major types of Microarray cDNA Array Oligonucleotide Array Naturecell biology Aug. 2001 v3 (8) More refs: NatureReview of Genetics, May. 2004 v5 (5)
Oligonucleotide Array PM MM Lipshutz et al; 1999; Nature Genetics, 21(1): 20-24. 16-20 Probe pairs / Probe Set 1Probe Set / Gene This is “per gene”. The PM/MM effects are considered as fixed effects. Chu, T., Weir, B., and Wolfinger, R. (02, 04).
Statistical problems in Microarray? Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html • - Multiple testing: P-values. • - Variable selection. • - Discrimination. • - Clustering • of samples. • of genes. • Time course experiments. • Clinical trails • Merck, GSK … • Gene networks - pathway - Planning of experiments: design. sample sizes. - Quality Control: Var in RNA samples. Var among array. - Background Subtraction. - Normalization - Significance Analysis Supervised vs. Non-supervised Statistical Computing
Oligonucleotide array (supervised) Trt 1 Trt 2 . Trt k Gene 1 . Gene 2 . Gene 3 . . . . . . . . . . . . . . . . Gene n An example: common problem in genome-wide studies: The “Large p, small n” problems. Small n: number of replications – low statistical power Large p: number of features (genes, probes, bio-markers…) – multiple-testing problems ? Computation …? Significance Analysis (challenge) Question: For the contrasts of interest (i.e., Trt1 vs. Trt2), what genes are differently (significantly) expressed?
line1 line2 line3 line4 line5 Male 5 Trt X X X X X X Female Female flies killed, mRNA prepared (random effect 1) (1) (2) (3) (4) (9) (10) … … … … … … (Random effect 2) … … … … … … Chip19 Chip20 Chip1 Chip2 PM MM (fixed effect) ~15000 genes, for each gene: (…Random effect 3) Data from McGraw Lab., Cornell Univ. Research Interest: To investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies
ygijkl Gene 1 Gene 2 Indices i: Trt1 Trt2 Trt3 Trt4 Trt5 . . . ……σgij Indices j: Prep1 Prep2 Gene g Indices k: Chip1 Chip2 ………..σgijk . . . . . . . Indices l: … ……………..σgijkl 1, 2, 3,…, 19, 20 Gene 15000 Total: 5x2x2x20=400 points for each gene g Data from McGraw Lab., Cornell Linear Mixed Model: (for each gene g) Ygijkl = Gg + (G*trt)gi + (G*Probe)gl + (G*trt*prep)gij + (G*trt*prep*chip)gijk + γgijkl.
Significant Expressed Genes: by SGA Possible Reasons: 1. … lower level analysis … 2.Large p: Multiple Testing problems (15000x10 tests) FWR vs. FDR? (not addressed in this study.) 3.Small n: Low power in each single test: - poorly estimated VC; - small d.f. in testing … In this study, trying to improve power in each single test…
Black: 15000 estimated VC1 Red: 15000 estimated VC2 Blue: 15000 estimated VC3 Our Idea:Taking advantage from “large p” Perform SGA (by mixed model), obtain VC estimates: Gene 1 (VC1, VC2, VC3); Gene 2 (VC1, VC2, VC3); … … Gene 15000 (VC1, VC2, VC3); Note: Not the “density” of each VC … This plot does contain useful “global” information on each VC (range, “HDR”…). The “global” infor. is taken as the “prior”. (SGA – pilot analysis)
Our Empirical Bayes Approach A 7-step algorithm: 1. Apply SGA to get 15000 VC estimates; 2. Transform to the “ANOVA Components (AC)” ; 3. Apply Jeffrey’s prior (non-informative) ; 4. Fit Inverted Gamma (IG) to each AC (prior density); ----------------------------------------------------------------------------- 5. Derive the posterior density (and the posterior estimate) of each AC; 6. Transform the posterior estimate of AC back to VC (reverse step 2); 7. Mixed model analysis: fix the VC value to be the posterior estimates of VC (the EB estimator of VC), and approx. by standard normal dist.
Real Data Example: Cornell Data Significance Test:
Design – structure mimic the true data: Parameters are set to be the estimated value from the true data set. For the 3 VC, σgij=0.01, σgijk=0.015, σgijkl=0.072 5500 genes are simulated, among which: 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt mean: Simulation Studies
Simulation Results (1) EB estimator vs. REML estimator: Bias, Variance and MSE: The bias, variance and MSE of EB are only fractions of those of SGA
Simulation Results (2): The null distribution of the test t statistics: 1. SGA (red, expected to be t distribution with df=5); 2. EB with df=30 (blue); 3. EB with df=1000 (green); 4. Truth (black, expected to be standard normal distribution ).
Simulation Results (3): Test Size and Power Calculation: Mean: 0.0498 0.0490 0.0572 0.0497 75.91% 94.72% 99.53% 100%
However, the Prior is estimated from data: “large p” prior likely be good! “small p” prior may not be good … … for “small p”, EB estimator can be biased! Gain in MSE not guaranteed! Discussion Why EB estimator “beats” REML estimator? - Prior density contains “truth” information. Q: How to control (large p vs. small p)? (Controlling system for EB method: determine the shrinkage process to get maximum gain in MSE)
Applications Especially desined for Microarray: – Microarray (cDNA, Oligonucleotide); – Proteomics Extension to general data sets (Mixed model), if controlling system built (in the near future).