1 / 18

Empirical Bayes Analysis of Variance Component Models for Microarray Data

Empirical Bayes Analysis of Variance Component Models for Microarray Data. S. Feng, 1 R.Wolfinger, 2 T.Chu, 2 G.Gibson, 3 L.McGraw 4 1. Department of Statistics, North Carolina State University, 2. SAS institute, Cary, NC 27513;

Jims
Download Presentation

Empirical Bayes Analysis of Variance Component Models for Microarray Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Empirical Bayes Analysis of Variance Component Modelsfor Microarray Data S. Feng,1 R.Wolfinger,2 T.Chu,2 G.Gibson,3 L.McGraw4 1. Department of Statistics, North Carolina State University, 2. SAS institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University, 4. Department of Genetics and Development, Cornell University.

  2. Microarray:Genome-wide gene expression 1. Originally just a term in Engineering: millions of small electrodes arrayed on a slide (Silicon) 2. Introduced to Genetics/Genomics in 1996: ...Thousands of DNA sequences arrayed on a glass slide – the genome-wide gene expression profiles can be investigated simultaneously. In JSM 2004, - More than 30 sections - More than 100 stat. Papers/posters

  3. Two major types of Microarray cDNA Array Oligonucleotide Array Naturecell biology   Aug. 2001 v3 (8) More refs: NatureReview of Genetics, May. 2004 v5 (5)

  4. Oligonucleotide Array PM MM Lipshutz et al; 1999; Nature Genetics, 21(1): 20-24. 16-20 Probe pairs / Probe Set 1Probe Set / Gene This is “per gene”. The PM/MM effects are considered as fixed effects. Chu, T., Weir, B., and Wolfinger, R. (02, 04).

  5. Statistical problems in Microarray? Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html • - Multiple testing: P-values. • - Variable selection. • - Discrimination. • - Clustering • of samples. • of genes. • Time course experiments. • Clinical trails • Merck, GSK … • Gene networks - pathway - Planning of experiments: design. sample sizes. - Quality Control: Var in RNA samples. Var among array. - Background Subtraction. - Normalization - Significance Analysis Supervised vs. Non-supervised Statistical Computing

  6. Oligonucleotide array (supervised) Trt 1 Trt 2 . Trt k Gene 1 . Gene 2 . Gene 3 . . . . . . . . . . . . . . . . Gene n An example: common problem in genome-wide studies: The “Large p, small n” problems. Small n: number of replications – low statistical power Large p: number of features (genes, probes, bio-markers…) – multiple-testing problems ? Computation …? Significance Analysis (challenge) Question: For the contrasts of interest (i.e., Trt1 vs. Trt2), what genes are differently (significantly) expressed?

  7. line1 line2 line3 line4 line5 Male 5 Trt X X X X X X Female Female flies killed, mRNA prepared (random effect 1) (1) (2) (3) (4) (9) (10) … … … … … … (Random effect 2) … … … … … … Chip19 Chip20 Chip1 Chip2 PM MM (fixed effect) ~15000 genes, for each gene: (…Random effect 3) Data from McGraw Lab., Cornell Univ. Research Interest: To investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies

  8. ygijkl Gene 1 Gene 2 Indices i: Trt1 Trt2 Trt3 Trt4 Trt5 . . . ……σgij Indices j: Prep1 Prep2 Gene g Indices k: Chip1 Chip2 ………..σgijk . . . . . . . Indices l: … ……………..σgijkl 1, 2, 3,…, 19, 20 Gene 15000 Total: 5x2x2x20=400 points for each gene g Data from McGraw Lab., Cornell Linear Mixed Model: (for each gene g) Ygijkl = Gg + (G*trt)gi + (G*Probe)gl + (G*trt*prep)gij + (G*trt*prep*chip)gijk + γgijkl.

  9. Significant Expressed Genes: by SGA Possible Reasons: 1. … lower level analysis … 2.Large p: Multiple Testing problems (15000x10 tests) FWR vs. FDR? (not addressed in this study.) 3.Small n: Low power in each single test: - poorly estimated VC; - small d.f. in testing … In this study, trying to improve power in each single test…

  10. Black: 15000 estimated VC1 Red: 15000 estimated VC2 Blue: 15000 estimated VC3 Our Idea:Taking advantage from “large p” Perform SGA (by mixed model), obtain VC estimates: Gene 1 (VC1, VC2, VC3); Gene 2 (VC1, VC2, VC3); … … Gene 15000 (VC1, VC2, VC3); Note: Not the “density” of each VC … This plot does contain useful “global” information on each VC (range, “HDR”…). The “global” infor. is taken as the “prior”. (SGA – pilot analysis)

  11. Our Empirical Bayes Approach A 7-step algorithm: 1. Apply SGA to get 15000 VC estimates; 2. Transform to the “ANOVA Components (AC)” ; 3. Apply Jeffrey’s prior (non-informative) ; 4. Fit Inverted Gamma (IG) to each AC (prior density); ----------------------------------------------------------------------------- 5. Derive the posterior density (and the posterior estimate) of each AC; 6. Transform the posterior estimate of AC back to VC (reverse step 2); 7. Mixed model analysis: fix the VC value to be the posterior estimates of VC (the EB estimator of VC), and approx. by standard normal dist.

  12. Real Data Example: Cornell Data Significance Test:

  13. Design – structure mimic the true data: Parameters are set to be the estimated value from the true data set. For the 3 VC, σgij=0.01, σgijk=0.015, σgijkl=0.072 5500 genes are simulated, among which: 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt mean: Simulation Studies

  14. Simulation Results (1) EB estimator vs. REML estimator: Bias, Variance and MSE: The bias, variance and MSE of EB are only fractions of those of SGA

  15. Simulation Results (2): The null distribution of the test t statistics: 1. SGA (red, expected to be t distribution with df=5); 2. EB with df=30 (blue); 3. EB with df=1000 (green); 4. Truth (black, expected to be standard normal distribution ).

  16. Simulation Results (3): Test Size and Power Calculation: Mean: 0.0498 0.0490 0.0572 0.0497 75.91% 94.72% 99.53% 100%

  17. However, the Prior is estimated from data: “large p” prior likely be good! “small p” prior may not be good … … for “small p”, EB estimator can be biased! Gain in MSE not guaranteed! Discussion Why EB estimator “beats” REML estimator? - Prior density contains “truth” information. Q: How to control (large p vs. small p)? (Controlling system for EB method: determine the shrinkage process to get maximum gain in MSE)

  18. Applications Especially desined for Microarray: – Microarray (cDNA, Oligonucleotide); – Proteomics Extension to general data sets (Mixed model), if controlling system built (in the near future).

More Related