Differential Expression and Multiple Hypothesis Testing


  1. Differential Expression and Multiple Hypothesis Testing • STAT115 Spring 2012

  2. Outline • Differential gene expression • Parametric test: t and Welch-t test • Non-parametric test: permutation t and Mann-Whitney • Multiple hypothesis testing • Family-wise error rate and FDR • Affy detection (present/absent calls) Tongji 2009

  3. Normalized & Summarized Data • 5 Normal and 9 Myeloma (MM) samples • [Figure: expression matrix with axes labeled Samples and Genes]

  4. Identify Differentially Expressed Genes • Understand the difference between two conditions / samples • Disease pathways • Find disease markers for diagnosis • Diagnosis chips • Interested in genes with: • Statistical significance: observed differential expression is unlikely to be due to chance • Biological significance: observed differential expression is large enough to be biologically relevant

  5. Identification of Diagnostic Genes • Classical study of cancer subtypes: Golub et al. (1999)

  6. Identify Differentially Expressed Genes • Fold change • Parametric test (assumes expression values follow a normal distribution) • t test and Welch-t test • Non-parametric test (no assumption about the expression distribution) • Permutation t-test and Mann-Whitney U (Wilcoxon rank sum) test • Non-parametric tests work well only if you have plenty of samples to permute • Experiments with 3 treatments and 3 controls are better off with the regular t or Welch-t statistic

  7. Fold Change • Naïve method • Avg(X) / Avg(Y) • May not be a good measure of differential expression, especially for less abundant transcripts • Note on scale: • Natural scale: MAS4, MAS5, dChip • Log scale: RMA (log2); exponentiate (2^x) before calculating fold change
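
The note on scale can be made concrete with a minimal sketch. The intensity values below are hypothetical, and the function names are illustrative only; the sketch assumes log-scale values are log2, as RMA reports them.

```python
import math
from statistics import mean

def fold_change_natural(x, y):
    """Avg(X) / Avg(Y) for intensities already on the natural scale."""
    return mean(x) / mean(y)

def fold_change_from_log2(log_x, log_y):
    """Back-transform log2 values to the natural scale, then take Avg(X) / Avg(Y)."""
    nat_x = [2 ** v for v in log_x]
    nat_y = [2 ** v for v in log_y]
    return mean(nat_x) / mean(nat_y)

# Hypothetical intensities for a single probeset (not real data)
mm_samples = [400.0, 800.0, 1600.0]
normal_samples = [100.0, 200.0, 400.0]

fc = fold_change_natural(mm_samples, normal_samples)
fc_log = fold_change_from_log2(
    [math.log2(v) for v in mm_samples],
    [math.log2(v) for v in normal_samples],
)
print(fc, fc_log)
```

Dividing the averages of the log values directly would give a different number, which is why the back-transformation comes first.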

  8. Two Sample t-test • Statistical significance in the two sample problem • Group 1: X1, X2, … Xn1 • Group 2: Y1, Y2, … Yn2 • If Xi ~ Normal(μ1, σ²), Yi ~ Normal(μ2, σ²) • Null hypothesis: μ1 = μ2
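
For reference, the standard equal-variance (pooled) two-sample statistic for this setting can be written as follows. The symbols \bar{X}, \bar{Y}, s_X^2, s_Y^2 (sample means and sample variances) and s_p (pooled standard deviation) are notation introduced here, not taken from the slide.

```latex
\[
t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_X^2 + (n_2 - 1)\, s_Y^2}{n_1 + n_2 - 2}
\]
```

Under the null hypothesis μ1 = μ2, this t follows a t distribution with n1 + n2 − 2 degrees of freedom.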

  9. Two Sample t-test • Statistical significance in the two sample problem • Group 1: X1, X2, … Xn1 • Group 2: Y1, Y2, … Yn2 • If Xi ~ Normal(μ1, σ1²), Yi ~ Normal(μ2, σ2²) • Null hypothesis: μ1 = μ2 • Use Welch-t statistic • Check t table for p-val • A gene with small p-val (very big or very small t) • Reject null • Significant difference between normal and MM
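
A minimal sketch of the Welch-t statistic for the unequal-variance case, using only the Python standard library. The degrees of freedom use the usual Welch-Satterthwaite approximation; the p-value lookup (the "t table" step) is left out, and the sample values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for the Welch two-sample t-test.

    t  = (mean(x) - mean(y)) / sqrt(s_x^2/n1 + s_y^2/n2)
    df = Welch-Satterthwaite approximation
    """
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # squared standard error of mean(x)
    v2 = variance(y) / n2  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical expression values for one gene
t_stat, df = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(t_stat, df)
```

The resulting t would then be compared against a t distribution with df degrees of freedom to obtain the p-value.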

  10. Permutation Test • Non-parametric method for p-val calculation • Does not assume a normal expression distribution • Does not assume the two groups have equal variance • Randomly permute the sample labels and calculate t to form the empirical null t distribution • For the MM study, (14 choose 5) = 2002 different t values from permutation • If the observed t is extremely high/low → differential expression with statistical significance
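
The permutation procedure can be sketched as below for a single gene, enumerating all (14 choose 5) = 2002 label assignments rather than sampling them. The expression values are hypothetical, the Welch-t is used as the statistic, and the two-sided p-value (fraction of permutations with |t| at least as large as observed) is one common convention rather than something the slide specifies.

```python
import math
from itertools import combinations
from statistics import mean, variance

def welch_t(x, y):
    """Welch t statistic (no equal-variance assumption)."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return (mean(x) - mean(y)) / math.sqrt(se2)

def permutation_test(group1, group2):
    """Return (observed t, two-sided permutation p-value, number of permutations)."""
    pooled = group1 + group2
    n1 = len(group1)
    t_obs = welch_t(group1, group2)
    indices = range(len(pooled))
    n_perm = 0
    n_extreme = 0
    # Every way of relabeling n1 of the samples as "group 1"
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in indices if i not in chosen_set]
        n_perm += 1
        # Small tolerance so the observed labeling always counts itself
        if abs(welch_t(g1, g2)) >= abs(t_obs) - 1e-12:
            n_extreme += 1
    return t_obs, n_extreme / n_perm, n_perm

# Hypothetical log-expression values: 5 normal and 9 MM samples
normal = [6.9, 7.1, 7.0, 7.3, 6.8]
myeloma = [8.1, 8.4, 7.9, 8.6, 8.2, 8.8, 8.0, 8.5, 8.3]
t_obs, p_perm, n_perm = permutation_test(normal, myeloma)
print(t_obs, p_perm, n_perm)
```

With only 2002 possible relabelings, the smallest attainable p-value is 1/2002, which is one reason the approach needs enough samples to be useful.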

  11. Permutation Technique • Compute T0 from the original labels • Compute T1, T2, T3, … from permuted labels • Compare T0 to the set of T*

  12. Wilcoxon Rank Sum Test • Rank all data in a row, then sum the ranks of the treatment group (T_T) or the control group (T_C) • Significance calculated from permutation as well • E.g. 10 normal and 10 cancer • Min(T) = 55 • Max(T) = 155 • Significance(T = 150) • Check U table (transformation of T) for statistical significance • Intuition similar to permutation t-test
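
The 10-versus-10 example can be checked with a short sketch. It computes a rank sum (with average ranks for ties, an assumption not discussed on the slide) and counts, by dynamic programming, how many of the (20 choose 10) label assignments give each possible rank sum, which yields an exact one-sided significance for T = 150. Function names are illustrative.

```python
from math import comb

def rank_sum(group, other):
    """Sum of the ranks of `group` within the pooled data (average ranks for ties)."""
    pooled = sorted(group + other)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return sum(rank_of[v] for v in group)

def rank_sum_counts(n_total, n_group):
    """counts[s] = number of ways to pick n_group ranks from 1..n_total summing to s."""
    max_sum = sum(range(n_total - n_group + 1, n_total + 1))
    # ways[k][s]: number of size-k subsets with sum s among the ranks processed so far
    ways = [[0] * (max_sum + 1) for _ in range(n_group + 1)]
    ways[0][0] = 1
    for r in range(1, n_total + 1):
        for k in range(min(r, n_group), 0, -1):  # descending k: each rank used once
            for s in range(r, max_sum + 1):
                ways[k][s] += ways[k - 1][s - r]
    return ways[n_group]

counts = rank_sum_counts(20, 10)
possible = [s for s, c in enumerate(counts) if c > 0]
min_t, max_t = possible[0], possible[-1]
n_at_least_150 = sum(counts[150:])
p_upper = n_at_least_150 / comb(20, 10)  # one-sided P(T >= 150)
print(min_t, max_t, n_at_least_150, p_upper)
```

The smallest and largest attainable sums correspond to one group holding ranks 1..10 or ranks 11..20 respectively.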

  13. Multiple Hypotheses Testing • We test differential expression for every gene at a p-value cutoff, e.g. 0.01 • With ~15K genes on the array, potentially 0.01 × 15K = 150 genes are wrongly called • H0: no diff expr; H1: diff expr • Reject H0: call a gene differentially expressed • Should control family-wise error rate or false discovery rate • Use Affy's present/absent calls

  14. Family-Wise Error Rate • P(at least one false rejection) < α, i.e. P(no false rejection) > 1 − α • Bonferroni correction: to control the family-wise error rate for testing m hypotheses at level α, we need to control the false rejection rate for each individual test at α/m • If α is 0.05, for 15K genes the p-value cutoff is 0.05 / 15K = 3.33E-6 • Too conservative for differentially expressed gene selection
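
The arithmetic behind these numbers can be reproduced in a few lines. The family-wise error rates below additionally assume the m tests are independent, which is an illustrative simplification and not a claim from the slides.

```python
m = 15000      # number of genes tested
alpha = 0.05   # desired family-wise error rate

# Expected number of false calls if every null gene is tested at p < 0.01
expected_false_calls = 0.01 * m

# Bonferroni per-test cutoff
bonferroni_cutoff = alpha / m

# Family-wise error rate under independence (assumption for illustration)
fwer_uncorrected = 1 - (1 - alpha) ** m
fwer_bonferroni = 1 - (1 - bonferroni_cutoff) ** m

print(expected_false_calls, bonferroni_cutoff, fwer_uncorrected, fwer_bonferroni)
```

Without correction the chance of at least one false rejection is essentially 1, while the Bonferroni cutoff keeps it just under α.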

  15. False Discovery Rate • V: type I errors, false positives • T: type II errors, false negatives • R: all rejected hypotheses (all genes called) • FDR = expected value of V / R, i.e. false positives / all called

  16. False Discovery Rate • Less conservative than family-wise error rate • Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α* • Draw all m genes, ranked by p-val • Draw the line y = x α* / m, x = 1…m • Find the largest rank whose p-val falls below the line, and call all genes up to that rank
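
A minimal sketch of the Benjamini-Hochberg step-up rule described above, in plain Python. The p-values are hypothetical, and `alpha` plays the role of the FDR level.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Point (rank, p) lies on or below the line y = rank * alpha / m
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject everything up to the largest rank that falls below the line
    return sorted(order[:k_max])

# Hypothetical p-values for 5 genes
p_vals = [0.01, 0.039, 0.035, 0.005, 0.5]
called = benjamini_hochberg(p_vals, alpha=0.05)
print(called)
```

In this example the gene with p = 0.035 sits above the line at rank 3 but is still called, because the gene at rank 4 (p = 0.039) falls below it; this is the step-up behavior of the procedure.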

  17. FDR Threshold • [Figure: genes ranked by p-val, with the line y = x α* / m]

  18. SAM for FDR Control • Significance Analysis of Microarrays (SAM), Tusher et al. PNAS 2001 • With a small number of samples, there could be small σ and very big t by chance • SAM: modified t*, increase σ based on σ of other genes on the array (i.e. lowest 5th percentile of σ) • Proceeds with regular FDR
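
The idea of inflating the denominator can be illustrated with a rough sketch. It follows the slide's description by taking the added constant s0 as the 5th percentile of the per-gene standard errors; the published SAM method chooses s0 differently, so this is a simplification under that stated assumption, not the reference implementation. Data and function names are hypothetical.

```python
import math
from statistics import mean, quantiles

def pooled_se(x, y):
    """Pooled standard error of the difference in means for one gene."""
    n1, n2 = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    return math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))

def modified_t(genes):
    """genes: list of (x, y) pairs. Returns (ordinary t, modified t*, s0)."""
    s = [pooled_se(x, y) for x, y in genes]
    # Assumption from the slide: s0 = 5th percentile of s across genes
    s0 = quantiles(s, n=20, method="inclusive")[0]
    diffs = [mean(x) - mean(y) for x, y in genes]
    t = [d / si for d, si in zip(diffs, s)]
    t_star = [d / (si + s0) for d, si in zip(diffs, s)]
    return t, t_star, s0

# Hypothetical data: 4 genes, 3 samples per group
genes = [
    ([7.0, 7.1, 6.9], [7.3, 7.4, 7.2]),    # tiny spread, modest difference
    ([5.0, 6.0, 7.0], [8.0, 9.5, 10.0]),
    ([4.0, 4.5, 5.5], [4.2, 5.8, 5.0]),
    ([9.0, 8.0, 10.0], [6.0, 7.5, 6.5]),
]
t, t_star, s0 = modified_t(genes)
print(t, t_star, s0)
```

Genes whose large t comes mostly from an unusually small standard error are shrunk the most, since s0 is a larger share of their denominator.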

  19. Q-value • Storey & Tibshirani, PNAS, 2003 • Empirically derived q-value • Every p-value has its corresponding q-value (FDR) • FDR’s academic vs practical values
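
To show how each p-value gets a corresponding q-value, here is a simplified sketch. It estimates the proportion of true nulls π0 at a single fixed λ = 0.5, whereas the published method smooths the estimate over a range of λ; the parameter `lam`, the function name, and the p-values are all assumptions made for illustration.

```python
def q_values(p_values, lam=0.5):
    """Return (q-values in the original order, estimated pi0).

    pi0 is estimated as #{p > lam} / (m * (1 - lam)), capped at 1.
    Note: if no p-value exceeds lam, pi0 is 0 and all q-values are 0 (degenerate).
    """
    m = len(p_values)
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q, pi0

# Hypothetical p-values
p_vals = [0.001, 0.01, 0.02, 0.2, 0.6, 0.9]
q, pi0 = q_values(p_vals)
print(q, pi0)
```

Calling every gene with q below a chosen level then gives a list whose estimated FDR is at most that level; with π0 fixed at 1 the q-values reduce to Benjamini-Hochberg adjusted p-values.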

  20. Affymetrix Detection • [Figure: PM vs MM probe intensities for a Present (P) call and an Absent (A) call] • MAS 5.0 makes an absent/marginal/present call for each probeset • Define R = (PM − MM) / (PM + MM) • R near 1 means PM >> MM, abundant transcript • R near or below 0 means PM ≤ MM • R should exceed a cutoff (τ) to be considered present

  21. Affymetrix Detection • τ (default 0.015) empirically set by Affy • Detection p-value from Wilcoxon signed rank test • Rank probes by (PM − MM) / (PM + MM) − τ • T+: 25, T−: −20, n = 9 • Check T+ against the Wilcoxon table (n) for p-value
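
The detection p-value calculation can be sketched as follows: compute the discrimination score for each probe pair, subtract τ, form the signed-rank statistic T+, and look up an exact one-sided p-value by enumerating the null distribution. The sketch assumes no zero differences and no tied absolute values, and the function names are illustrative rather than part of any Affymetrix software.

```python
def discrimination_scores(pm, mm, tau=0.015):
    """R - tau for each probe pair, with R = (PM - MM) / (PM + MM)."""
    return [(p - m) / (p + m) - tau for p, m in zip(pm, mm)]

def signed_rank_t_plus(diffs):
    """T+ = sum of the ranks of |d| over positive d (assumes no zeros or ties)."""
    ranked = sorted(diffs, key=abs)
    return sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)

def exact_upper_p(t_plus, n):
    """P(T+ >= t_plus) when each rank 1..n is positive with probability 1/2."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)  # counts[s]: sign patterns with T+ == s
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[t_plus:]) / 2 ** n

# The slide's example: T+ = 25 with n = 9 probe pairs
p_detect = exact_upper_p(25, 9)
print(p_detect)
```

A small detection p-value indicates that the probe pairs' scores sit consistently above τ, which is the evidence used for a present call.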

  22. Affymetrix Detection • [Figure: detection p-value of a probe set: Present below α1, Marginal between α1 and α2, Absent above α2; defaults α1 = 0.04, α2 = 0.06] • α1 and α2 are user-defined values but have optimized defaults in MAS5 • Since the expression index for low-abundance transcripts is unreliable, it is better to find differentially expressed genes only among present-call genes • Increasing τ can reduce FDR, but true present calls could be lost
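
The mapping from detection p-value to call can be written as a small function. The defaults come from the slide; how the boundary cases are treated (strict versus non-strict inequality) is an assumption of this sketch.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to a Present / Marginal / Absent call."""
    if p_value < alpha1:
        return "P"  # Present
    if p_value < alpha2:
        return "M"  # Marginal
    return "A"      # Absent

calls = [detection_call(p) for p in (0.001, 0.05, 0.3)]
print(calls)
```

Restricting the differential expression analysis to probesets called "P" follows the recommendation above.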

  23. Outline • Differential gene expression • Parametric test: t and Welch-t test • Non-parametric test: permutation t and Mann-Whitney • Multiple hypothesis testing • Family-wise error rate and FDR • Find diff expr genes only on Affy present calls

  24. Acknowledgment • Kevin Coombes & Keith Baggerly • Mark Craven • Georg Gerber • Gabriel Eichler • Ying Xie • Terry Speed & Group • Larry Hunter • Wing Wong & Cheng Li • Mark Reimers • Jenia Semyonov
