Differential Expression and Multiple Hypothesis Testing
STAT115 Spring 2012

Outline
• Differential gene expression
  • Parametric test: t and Welch-t test
  • Non-parametric test: permutation t and Mann-Whitney
• Multiple hypothesis testing
  • Family-wise error rate and FDR
• Affy detection (present/absent calls)
Tongji 2009

Normalized & Summarized Data
5 Normal and 9 Myeloma (MM) Samples
[Figure: expression data matrix; axis labels: Samples, Genes]

Identify Differentially Expressed Genes
• Understand the difference between two conditions / samples
  • Disease pathways
• Find disease markers for diagnosis
  • Diagnosis chips
• Interested in genes with:
  • Statistical significance: observed differential expression is unlikely to be due to chance
  • Biological significance: observed differential expression is large enough to be biologically relevant

Identification of Diagnostic Genes
Classical study of cancer subtypes
Golub et al. (1999)

Identify Differentially Expressed Genes
• Fold change
• Parametric test (assumes expression values follow a normal distribution)
  • t test and Welch-t test
• Non-parametric test (no assumption about the expression distribution)
  • Permutation t-test and Mann-Whitney U (Wilcoxon rank sum) test
• Non-parametric tests are good only if you have plenty of samples to choose from
  • Experiments with 3 treatments and 3 controls are better off with the regular t or Welch-t statistic (only 6 choose 3 = 20 label assignments exist, so a permutation p-value cannot go below 1/20)

Fold Change
• Naïve method
  • Avg(X) / Avg(Y)
• May not be a good measure of differential expression, especially for less abundant transcripts
• Note on scale:
  • Natural scale: MAS4, MAS5, dChip
  • Log scale: RMA (log2); back-transform with 2^x before calculating fold change
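
The scale note can be made concrete with a minimal sketch. It assumes the log-scale values are log2 (as RMA reports them); the function names and the intensities are made up for illustration.

```python
import math
from statistics import mean

def fold_change_natural(x, y):
    """Avg(X) / Avg(Y) for intensities already on the natural scale."""
    return mean(x) / mean(y)

def fold_change_from_log2(x_log2, y_log2):
    """Back-transform log2 values to the natural scale, then take Avg(X) / Avg(Y)."""
    x = [2.0 ** v for v in x_log2]
    y = [2.0 ** v for v in y_log2]
    return mean(x) / mean(y)

# Made-up intensities for one gene.
mm_samples = [800.0, 1200.0, 1000.0]
normal_samples = [400.0, 600.0, 500.0]

fc = fold_change_natural(mm_samples, normal_samples)
fc_log = fold_change_from_log2(
    [math.log2(v) for v in mm_samples],
    [math.log2(v) for v in normal_samples],
)
print(fc, round(fc_log, 6))
```

Both calls describe the same gene, so they agree up to floating-point rounding. Averaging the log values first and then exponentiating would give a ratio of geometric means instead, which is a different quantity.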

Two Sample t-test
• Statistical significance in the two sample problem
  Group 1: X1, X2, … Xn1
  Group 2: Y1, Y2, … Yn2
• If Xi ~ Normal(μ1, σ²), Yi ~ Normal(μ2, σ²)
• Null hypothesis of μ1 = μ2
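
For reference, the standard statistic for this equal-variance setup is the pooled two-sample t. The symbols \bar{X}, \bar{Y}, s_X, s_Y and s_p (sample means, sample standard deviations, pooled standard deviation) are notation introduced here, not taken from the slide.

```latex
t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_X^2 + (n_2 - 1)\, s_Y^2}{n_1 + n_2 - 2}
```

Under the null hypothesis μ1 = μ2, this t follows a t distribution with n1 + n2 − 2 degrees of freedom.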

Two Sample t-test
• Statistical significance in the two sample problem
  Group 1: X1, X2, … Xn1
  Group 2: Y1, Y2, … Yn2
• If Xi ~ Normal(μ1, σ1²), Yi ~ Normal(μ2, σ2²)
• Null hypothesis of μ1 = μ2
• Use Welch-t statistic
• Check t table for p-val
• A gene with small p-val (very large positive or negative t)
  • Reject null
  • Significant difference between normal and MM
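
A minimal sketch of the Welch-t statistic for one gene, using only the Python standard library. The degrees of freedom use the usual Welch-Satterthwaite approximation, which the slide does not spell out. The expression values are made up, and the p-value lookup in a t table is left out.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for the unequal-variance two-sample t-test."""
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # squared standard error of mean(x)
    v2 = variance(y) / n2  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Made-up log-scale expression values: 5 normal and 9 MM samples.
normal = [7.1, 6.8, 7.4, 7.0, 6.9]
myeloma = [8.9, 9.4, 8.1, 9.9, 8.6, 9.2, 8.8, 9.5, 9.0]

t, df = welch_t(myeloma, normal)
print(round(t, 2), round(df, 1))
```

The approximate df always lies between min(n1, n2) − 1 and n1 + n2 − 2, so the Welch test is never more liberal than the pooled test on the same data.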

Permutation Test
• Non-parametric method for p-val calculation
  • Does not assume a normal expression distribution
  • Does not assume the two groups have equal variance
• Randomly permute the sample labels and calculate t to form the empirical null t distribution
  • For the MM study, (14 choose 5) = 2002 different t values from permutation
• If the observed t is extremely high or low relative to this null distribution, the differential expression is statistically significant
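
The procedure can be sketched for a single gene as follows. With 5 vs. 9 samples all 2002 label assignments can be enumerated instead of sampled at random. The data are made up, and the two-sided definition of "extreme" is an assumption of this sketch.

```python
import math
from itertools import combinations
from statistics import mean, variance

def welch_t(x, y):
    """Welch t statistic (unequal variances allowed)."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return (mean(x) - mean(y)) / math.sqrt(se2)

def permutation_test(group1, group2):
    """Enumerate every relabeling; return (t_obs, n_labelings, two-sided p)."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    t_obs = welch_t(group1, group2)
    null_t = []
    for chosen in combinations(range(len(pooled)), n1):
        chosen_set = set(chosen)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        null_t.append(welch_t(g1, g2))
    # Fraction of labelings whose |t| is at least as large as the observed |t|.
    extreme = sum(1 for t in null_t if abs(t) >= abs(t_obs))
    return t_obs, len(null_t), extreme / len(null_t)

# Made-up expression values for one gene: 5 normal and 9 MM samples.
normal = [7.1, 6.8, 7.4, 7.0, 6.9]
myeloma = [8.9, 9.4, 8.1, 9.9, 8.6, 9.2, 8.8, 9.5, 9.0]

t_obs, n_labelings, p_value = permutation_test(normal, myeloma)
print(n_labelings)  # 2002
```

Because the observed labeling is itself one of the 2002 assignments, the p-value can never be smaller than 1/2002. This is why permutation tests need enough samples to be useful.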

Permutation Technique
[Figure: compute T0 from the original labels; compute T1, T2, T3, … from permuted labels; compare T0 to the set of T* values]

Wilcoxon Rank Sum Test
• Rank all data in a row, then sum the ranks within one group: T_T or T_C
• Significance calculated from permutation as well
• E.g. 10 normal and 10 cancer
  • Min(T) = 55
  • Max(T) = 155
  • Significance(T = 150)
• Check U table (transformation of T) for statistical significance
• Intuition similar to permutation t-test
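
The 10 vs. 10 example can be checked by brute force. This sketch assumes no ties when computing the exact tail probability and enumerates all (20 choose 10) rank assignments. The function names are illustrative.

```python
from itertools import combinations

def rank_sum(group, other):
    """Sum of the ranks of `group` within the pooled data (ties get average ranks)."""
    pooled = sorted(list(group) + list(other))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    return sum(rank_of[v] for v in group)

def upper_tail(t_obs, n_group, n_total):
    """Count rank subsets with sum >= t_obs; return (count, number of subsets)."""
    count = total = 0
    for ranks in combinations(range(1, n_total + 1), n_group):
        total += 1
        if sum(ranks) >= t_obs:
            count += 1
    return count, total

t_min = sum(range(1, 11))   # group holds the 10 lowest ranks
t_max = sum(range(11, 21))  # group holds the 10 highest ranks
count, total = upper_tail(150, 10, 20)
print(t_min, t_max, count, total)  # 55 155 19 184756
```

So a rank sum of 150 or more has a one-sided null probability of 19 / 184756, roughly 1e-4.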

Multiple Hypotheses Testing
• We test differential expression for every gene with a p-value cutoff, e.g. 0.01
• If there are ~15K genes on the array, potentially 0.01 x 15K = 150 genes wrongly called
• H0: no diff expr; H1: diff expr
  • Reject H0: call a gene differentially expressed
• Should control family-wise error rate or false discovery rate
• Use Affy's present/absent calls
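
A small simulation illustrates the 150-gene figure. It assumes the extreme case in which no gene is differentially expressed, so every p-value is uniform on [0, 1]. The seed is an arbitrary choice.

```python
import random

m = 15000      # genes on the array (from the slide)
cutoff = 0.01  # per-gene p-value cutoff (from the slide)
expected_false_calls = m * cutoff

# Simulate null p-values: with H0 true for every gene, p ~ Uniform(0, 1).
rng = random.Random(0)
false_calls = sum(1 for _ in range(m) if rng.random() < cutoff)
print(expected_false_calls, false_calls)
```

Every gene called in this simulation is a false positive, and the count lands near the expected 150 purely by chance.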

Family-Wise Error Rate
• P(at least one false rejection) < α, equivalently P(no false rejection) > 1 − α
• Bonferroni correction: to control the family-wise error rate for testing m hypotheses at level α, control the false rejection rate for each individual test at α/m
• If α is 0.05, for 15K genes, the p-value cutoff is 0.05/15K = 3.33E-6
• Too conservative for differentially expressed gene selection
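
The Bonferroni cutoff can be sketched in a few lines. The p-values below are made up, and `bonferroni` is an illustrative name rather than a library function.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return (per-test cutoff, indices of genes called) under Bonferroni."""
    cutoff = alpha / len(pvalues)
    called = [i for i, p in enumerate(pvalues) if p <= cutoff]
    return cutoff, called

# Made-up p-values: two very small ones among 15,000 tests.
pvalues = [1e-8, 2e-6] + [0.5] * 14998
cutoff, called = bonferroni(pvalues)
print(f"{cutoff:.2e}", called)  # 3.33e-06 [0, 1]
```

A gene with p = 1e-4 would be convincing on its own but is not called here, which is the sense in which the correction is conservative.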

False Discovery Rate
• V: type I errors, false positives
• T: type II errors, false negatives
• R: all genes called
• FDR = V / R (false positives / all called)

False Discovery Rate
• Less conservative than family-wise error rate
• Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α*
  • Draw all m genes, ranked by p-val
  • Draw the line y = x α* / m, x = 1…m
  • Call all genes up to the last one whose p-val falls below the line

FDR Threshold
[Figure: genes ranked by p-val, plotted against the line y = x α* / m]
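
The Benjamini-Hochberg rule can be sketched as follows. The `fdr` argument plays the role of the target FDR level, and the p-values are made up.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return the indices of genes called at the given FDR level (step-up rule)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # gene indices by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / m:  # at or below the line y = rank * fdr / m
            k = rank                      # remember the largest such rank
    return sorted(order[:k])

# Made-up p-values; with m = 4 the line is at 0.0125, 0.025, 0.0375, 0.05.
pvalues = [0.9, 0.021, 0.02, 0.022]
print(benjamini_hochberg(pvalues))  # [1, 2, 3]
```

The smallest p-value (0.02) sits above the line at rank 1, yet it is still called because a later rank falls below the line. The rule calls everything up to the last crossing, not only the points that are individually below the line.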

SAM for FDR Control
• Significance Analysis of Microarrays (SAM), Tusher et al. PNAS 2001
• With a small number of samples, there could be a small σ and a very big t by chance
• SAM: modified t*, increase σ based on σ of other genes on the array (i.e. lowest 5 percentile of σ)
• Proceeds with regular FDR
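
The idea of inflating σ can be sketched as below. This follows the slide's description: a constant s0, taken as the 5th percentile of the per-gene standard errors, is added to every denominator. The SAM software's own rule for choosing this constant may differ, and all names and data here are illustrative.

```python
import math
from statistics import mean, variance

def pooled_se(x, y):
    """Standard error of mean(x) - mean(y) with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    s = sorted(values)
    k = max(1, math.ceil(q / 100.0 * len(s)))
    return s[k - 1]

def moderated_stats(genes, q=5):
    """genes: list of (x, y) pairs. Return (s0, ordinary t list, moderated d list)."""
    ses = [pooled_se(x, y) for x, y in genes]
    s0 = percentile(ses, q)  # assumption: s0 is the q-th percentile of the SEs
    diffs = [mean(x) - mean(y) for x, y in genes]
    t = [d / se for d, se in zip(diffs, ses)]
    d_mod = [d / (se + s0) for d, se in zip(diffs, ses)]
    return s0, t, d_mod

# Made-up data: the first gene has a tiny variance and therefore a huge t.
genes = [
    ([5.00, 5.01, 5.02], [5.10, 5.11, 5.12]),
    ([6.0, 7.0, 8.0], [9.0, 10.5, 11.0]),
    ([4.0, 4.5, 5.5], [4.2, 4.8, 5.1]),
]
s0, t, d_mod = moderated_stats(genes)
print([round(v, 2) for v in t], [round(v, 2) for v in d_mod])
```

The low-variance gene loses the most: its statistic is halved here, while genes with larger σ change much less.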

Q-value
• Storey & Tibshirani, PNAS, 2003
• Empirically derived q-value
• Every p-value has its corresponding q-value (FDR)
• FDR's academic vs practical values
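
A simplified sketch of attaching an FDR value to every p-value. It assumes the proportion of true null genes is 1, in which case the result is the Benjamini-Hochberg adjusted p-value. Storey & Tibshirani's q-value additionally estimates that proportion from the data, which is not done here.

```python
def bh_adjusted(pvalues):
    """Smallest FDR level at which each gene would be called (assumes pi0 = 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum of
    # p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

pvalues = [0.9, 0.021, 0.02, 0.022]  # made-up p-values
q = bh_adjusted(pvalues)
print([round(v, 4) for v in q])
```

Calling every gene with q ≤ 0.05 gives the same gene list as the step-up line procedure at FDR 0.05.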

Affymetrix Detection
[Figure: PM and MM probe intensities for a Present (P) and an Absent (A) probeset]
• MAS 5.0 makes an absent/marginal/present call for each probeset
• Define R = (PM − MM) / (PM + MM)
  • R near 1 means PM >> MM, abundant transcript
  • R near or below 0 means PM <= MM
• R should exceed a cutoff (τ) to be considered present

Affymetrix Detection
• τ (default 0.015) empirically set by Affy
• Detection p-value from Wilcoxon signed rank test
  • Rank probes by (PM − MM) / (PM + MM) − τ
  • T+: 25, T-: -20, n = 9
  • Check T+ against Wilcoxon table (n) for p-value
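
The table lookup can be replaced by an exact calculation for small n. This sketch ranks probe pairs by the absolute value of R − τ, sums the ranks of the positive ones to get T+, and enumerates all 2^n sign patterns. It ignores ties and other details of the actual MAS 5.0 implementation, and the probe intensities are made up.

```python
from itertools import product

def signed_rank_upper_tail(t_plus, n):
    """Exact P(T+ >= t_plus) when each of the 2^n sign patterns is equally likely."""
    hits = 0
    for signs in product((0, 1), repeat=n):
        total = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if total >= t_plus:
            hits += 1
    return hits / 2 ** n

def detection_p_value(pm, mm, tau=0.015):
    """Return (T+, one-sided p-value) for one probeset."""
    shifted = [(p - m) / (p + m) - tau for p, m in zip(pm, mm)]
    shifted = [d for d in shifted if d != 0]  # drop exact zeros
    # Rank by absolute value (smallest = rank 1); ties are not averaged here.
    order = sorted(range(len(shifted)), key=lambda i: abs(shifted[i]))
    t_plus = sum(rank for rank, i in enumerate(order, start=1) if shifted[i] > 0)
    return t_plus, signed_rank_upper_tail(t_plus, len(shifted))

# Made-up probeset with 9 probe pairs where PM is well above MM everywhere.
pm = [1200, 1500, 900, 1100, 1300, 1000, 1400, 950, 1250]
mm = [200, 300, 250, 150, 400, 350, 100, 450, 500]
t_plus, p = detection_p_value(pm, mm)
print(t_plus, p)  # 45 0.001953125

# The slide's example: T+ = 25 with n = 9.
p_slide = signed_rank_upper_tail(25, 9)
```

With n = 9 the ranks sum to 45, which matches the slide's T+ = 25 and T- = -20. A T+ of 25 is only slightly above its null mean of 22.5, so its p-value is large and the probeset would not be called present.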

Affymetrix Detection
[Figure: p-value of a probe set on an axis split by α1 and α2 into Present, Marginal, and Absent; defaults α1 = 0.04, α2 = 0.06]
• α1 and α2 are user-defined values but have optimized defaults in MAS5
• Since the expression index for low-abundance transcripts is unreliable, it is better to find differentially expressed genes only among genes with present calls
• Increasing τ can reduce FDR, but true present calls could be lost
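
Turning detection p-values into calls and filtering on them can be sketched as below. The handling of p-values exactly equal to α1 or α2 is an assumption of this sketch, and the probeset names and p-values are made up.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to Present (P), Marginal (M), or Absent (A)."""
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"

# Made-up detection p-values for four probesets.
detection_p = {"gene_a": 0.002, "gene_b": 0.05, "gene_c": 0.30, "gene_d": 0.01}
calls = {gene: detection_call(p) for gene, p in detection_p.items()}

# Keep only present-call genes for the differential expression tests.
present_genes = sorted(g for g, c in calls.items() if c == "P")
print(calls, present_genes)
```

Filtering this way also reduces the number of hypotheses tested, which eases the multiple-testing correction applied afterwards.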

Outline
• Differential gene expression
  • Parametric test: t and Welch-t test
  • Non-parametric test: permutation t and Mann-Whitney
• Multiple hypothesis testing
  • Family-wise error rate and FDR
• Find diff expr genes only on Affy present calls

Acknowledgment
• Kevin Coombes & Keith Baggerly
• Mark Craven
• Georg Gerber
• Gabriel Eichler
• Ying Xie
• Terry Speed & Group
• Larry Hunter
• Wing Wong & Cheng Li
• Mark Reimers
• Jenia Semyonov