False Discovery Rate Guy Yehuda
Outline • Short introduction to statistics • The problem of multiplicity • FDR vs. FWE • FDR control procedures and resampling • Results of using FDR control procedures • Discussion
Short introduction to statistics • The (arithmetic) mean is the sum of all the scores divided by the number of scores: μ = ΣX/N, where μ is the population mean and N is the number of scores.
The variance is a measure of how spread out a distribution is: σ² = Σ(X − μ)²/N, where μ is the mean and N is the number of scores. • When the variance is computed in a sample, the statistic S² = Σ(X − M)²/N (where M is the mean of the sample) can be used. S² is a biased estimate of σ², however.
By far the most common formula for computing variance in a sample is s² = Σ(X − M)²/(N − 1), which gives an unbiased estimate of σ². • A standard deviation is simply the square root of the variance. • In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.
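A quick NumPy illustration of the biased vs. unbiased variance estimators above (the data values are arbitrary, chosen only to make the arithmetic easy to check by hand):

```python
import numpy as np

scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = scores.sum() / len(scores)    # M = ΣX / N  -> 5.0
biased = np.var(scores)              # S² = Σ(X − M)² / N       -> 4.0   (ddof=0, biased)
unbiased = np.var(scores, ddof=1)    # s² = Σ(X − M)² / (N − 1) -> ~4.57 (unbiased)
sd = np.sqrt(unbiased)               # standard deviation = square root of variance

print(mean, biased, unbiased, sd)
```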
Usually, if you are given a large enough group of subjects, you will find that the distribution of just about any given trait follows a fairly bell-shaped (normal) curve.
T-test A t-test is used to evaluate whether two samples differ statistically: • Gather your data. • Calculate the mean and then the variance of each sample. • Calculate the t-statistic.
The null hypothesis (H0) states that the two samples come from the same population. • The alternative hypothesis (H1) states that the samples are statistically different. • A result is significant if it is unlikely to have occurred by chance. • The significance level of a test is the maximum probability of accidentally rejecting a true null hypothesis (a decision known as a Type I error). The significance of a result is also called its p-value.
We use the t-statistic and the degrees of freedom to find the p-value in a t-distribution table. • If the p-value is 0.03 and the significance level is 0.05, then the result is significant, i.e. the null hypothesis is rejected. • The power of a statistical test is the probability that the test will reject a false null hypothesis. The higher the power, the greater the chance of obtaining a statistically significant result when the null hypothesis is false.
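A minimal sketch of this whole workflow using SciPy; the group sizes (8 per group, matching the mouse data used later) and the effect size are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control   = rng.normal(loc=0.0, scale=1.0, size=8)   # sample 1
treatment = rng.normal(loc=1.5, scale=1.0, size=8)   # sample 2, shifted mean

# Two-sample t-test: returns the t-statistic and the two-sided p-value
# (equal-variance version; df = n1 + n2 - 2 = 14)
t_stat, p_value = stats.ttest_ind(treatment, control)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the samples differ significantly")
else:
    print("Fail to reject H0")
```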
Identifying Differentially Expressed Genes • Individual hypothesis testing: for each gene, use a significance test at the 0.05 level. H0: the gene is similarly expressed. H1: the gene is differentially expressed. A t-statistic is calculated for comparing the gene expression mean between the control and treatment groups.
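A sketch of how this per-gene testing might look in code, on a genes-by-samples matrix; the matrix shape and the simulated values are illustrative, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes = 1000
control   = rng.normal(size=(n_genes, 8))   # expression matrix: genes x control samples
treatment = rng.normal(size=(n_genes, 8))   # expression matrix: genes x treatment samples

# One two-sample t-test per gene: axis=1 runs an independent test on each row
t_stats, p_values = stats.ttest_ind(treatment, control, axis=1)

# Naive per-gene testing at the 0.05 level, ignoring multiplicity
print("genes called significant:", np.sum(p_values < 0.05))
```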
Thousands of genes in an experiment → thousands of hypotheses are tested → the probability of committing a type I error (a false discovery) increases sharply. Multiplicity problem! Are you sure all the tests are independent?
Are you sure all the tests are independent? • Genes tend to subgroup into highly correlated expression levels for reasons such as co-regulation. • Gene expression measurement errors may be dependent due to factors related to the RNA source, the normalization process, etc. We need to account for the dependency structure between the test statistics.
Let’s pretend there is no (multiplicity) problem! If we deal with 10,000 genes, for example, working at the individual 0.05 level would identify, on average, 10,000·0.05 = 500 differentially expressed genes, even if really no such gene exists! Do we have extra time and money to waste exploring that many genes in vain at later stages? What shall we do?
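The 10,000·0.05 = 500 arithmetic is easy to verify by simulation; in this small sketch every gene is truly null, so every "discovery" is false:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m = 10_000   # genes, all truly null: no differential expression anywhere

control   = rng.normal(size=(m, 8))
treatment = rng.normal(size=(m, 8))   # drawn from the same distribution as control

_, p = stats.ttest_ind(treatment, control, axis=1)
false_discoveries = np.sum(p < 0.05)
print(false_discoveries)   # close to 10,000 * 0.05 = 500 purely spurious "discoveries"
```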
What shall we do? Let’s panic!!! Control the probability of making even one error (or more): the FamilyWise (type I) Error rate (FWE). • The Westfall and Young step-down algorithm (WY, '89) • The Holm procedure ('79) • The Bonferroni procedure (the most conservative) Aren’t we missing many genes we need to explore more carefully later?
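For concreteness, here is a minimal sketch of the Bonferroni and Holm procedures (the function names are ours, not from any particular library):

```python
import numpy as np

def bonferroni_reject(p, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m. Controls FWE under any dependency."""
    m = len(p)
    return np.asarray(p) <= alpha / m

def holm_reject(p, alpha=0.05):
    """Holm ('79) step-down: compare p_(i) to alpha / (m - i + 1);
    stop at the first p-value that fails, rejecting everything before it."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):      # rank = 0, 1, ... over sorted p-values
        if p[idx] <= alpha / (m - rank):    # alpha/m, alpha/(m-1), ...
            reject[idx] = True
        else:
            break                           # the step-down procedure stops here
    return reject
```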
Is there an in-between approach? Don’t worry, everything is under (FDR) control!
Benjamini and Hochberg ('95): R = # rejected hypotheses = # discoveries; V of these may be in error = # false discoveries. The error (type I) in the entire study is measured by the False Discovery Rate (FDR) criterion, i.e. the proportion of false discoveries among the discoveries: Q = V/R when R > 0 (and 0 if none were found), FDR = E(Q). Does it make sense?
Inspecting 100 features: • 3 false ones among 60 discovered - bearable • 3 false ones among 4 discovered - unbearable. So this error rate is adaptive. The same argument holds when inspecting 10,000, so this error rate is scalable. If nothing is “real”, controlling the FDR at level q guarantees FWE = Prob(V ≥ 1) = E(V/R) = FDR ≤ q. But otherwise FWE = Prob(V ≥ 1) ≥ FDR, so there is room for improving detection power. Does it make sense?
How do we control FDR? m = # simultaneously tested null hypotheses, m0 of them true. For each hypothesis Hi a test statistic is calculated, along with the corresponding p-value Pi.
Linear step-up procedure (BH procedure) • Order the p-values P(1) ≤ P(2) ≤ … ≤ P(m). For a desired FDR level q: • Let k = max{ i : P(i) ≤ (i/m)·q } • Reject H(1), …, H(k) • If no such k exists, reject none.
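A short sketch of the linear step-up procedure exactly as defined above (the function name is ours):

```python
import numpy as np

def bh_procedure(p, q=0.05):
    """Benjamini-Hochberg linear step-up: find the largest k with
    p_(k) <= q * k / m and reject the k smallest p-values."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m      # q*k/m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1      # largest such k (1-based)
        reject[order[:k]] = True                  # reject H_(1), ..., H_(k)
    return reject
```

For reference, the same procedure is available off the shelf as statsmodels.stats.multitest.multipletests(p, alpha=q, method='fdr_bh').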
FDR control of the BH procedure • Independent and continuous test statistics: FDR = q·m0/m ≤ q (Benjamini & Hochberg '95) • Independent or positively dependent: FDR ≤ q·m0/m (Benjamini & Yekutieli '01) • General dependency: FDR ≤ q·(m0/m)·(1 + 1/2 + 1/3 + … + 1/m) (Benjamini & Yekutieli '01)
Imaginary Example Let’s explore the expression levels of 6 genes. A simple research question: find the differentially expressed genes. Solution: for each gene calculate a test statistic and get the corresponding p-value Pi. The values: 0.042, 0.007, 0.035, 0.12, 0.03, 0.00005. Test: BH procedure at q = 0.05. 1. Sort the 6 p-values: p(1) ≤ … ≤ p(k) ≤ … ≤ p(6). 2. Rejection threshold: the maximal p(k) such that p(k) ≤ 0.05·k/6. Results: k = 2 => H(1), H(2) rejected => we have found those 2 differentially expressed genes (p(2) = 0.007 ≤ 0.05·2/6, while p(3) = 0.03 > 0.05·3/6 = 0.025 and every later threshold also fails).
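Running the bh_procedure sketch from above on these six p-values reproduces the result:

```python
p = [0.042, 0.007, 0.035, 0.12, 0.03, 0.00005]
print(bh_procedure(p, q=0.05))
# -> only the hypotheses with p = 0.00005 and p = 0.007 are rejected (k = 2)
```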
Adaptive procedures • The BH procedure is too conservative: FDR ≤ q·m0/m. • If m0 is known: use q' = q·m/m0 to control FDR ≤ q and gain power. Benjamini & Hochberg '00, Storey '01, Mosig et al. '02, Turkheimer et al. '02
Is it always that good? • Adaptive methods offer better performance only by utilizing the difference between m/m0 and 1. • If that difference is small, i.e. the potential proportion of differentially expressed genes is small, they offer little advantage in power, while their properties aren’t well established (no proven FDR control).
Multiplicity adjusted p-Values • The results of a multiple testing procedure can be reported as multiplicity adjusted p-values: each adjusted p-value is compared to the desired significance level, and if it is smaller, the hypothesis is rejected.
FWE adjusted p-Values For an FWE controlling procedure, the adjusted p-value of an individual hypothesis is the lowest level α for which the hypothesis is rejected while FWE ≤ α. For instance, for the Bonferroni procedure, the adjusted p-value is simply Pi·m.
FDR adjusted p-Values • For an FDR controlling procedure, the adjusted p-value of an individual hypothesis is the lowest FDR level for which the hypothesis is first included in the set of rejected hypotheses. • Thus the adjusted p-value of P(k) using BH: define pBH(k) = min{ p(i)·m/i : i ≥ k }. Then pBH(k) ≤ q <=> for some i ≥ k, p(i) ≤ q·i/m <=> H(k) is rejected at FDR level q.
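The min-over-tails definition translates directly into a reverse cumulative minimum over the sorted p-values; a sketch:

```python
import numpy as np

def bh_adjusted(p):
    """BH adjusted p-values: p_BH(k) = min over i >= k of p_(i) * m / i."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    scaled = p[order] * m / ranks                        # p_(i) * m / i
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]  # min over the tail i >= k
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)   # cap at 1, restore input order
    return adjusted
```

By the equivalence above, bh_adjusted(p) <= q flags exactly the hypotheses rejected by the step-up procedure at level q.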
Resampling Drawing repeated samples from the given data. In place of the formidable formulas and mysterious tables of parametric and non-parametric tests based on complicated mathematics and arcane approximations, the basic resampling tools are simulations, created especially for the task.
Resampling • If the data contains high inter-correlations (as in our case), multiple comparison procedures designed for general dependency may be over-conservative for the specific dependency structure at hand. • Resampling-based multiple testing procedures (such as Westfall and Young '89) utilize the empirical dependency structure of the data to construct more powerful procedures.
Resampling • In p-value resampling, the data is repeatedly resampled under the complete null hypothesis, and a vector of resample-based p-values is computed (one for each hypothesis). • V*(p) = # resampling-based p-values less than p. • V*(p) is an estimated upper bound on the expected number of p-values corresponding to true null hypotheses that are less than p. • In simple words: we try to estimate the distribution of the p-values.
Resampling FWE adjustments • Resample the data N times. The WY procedure estimates the FWE-adjusted p-value of H(k) as the proportion of resamples in which the minimal resampled p-value falls at or below p(k): p̃(k) = (1/N)·#{ n : min_i p*(n,i) ≤ p(k) }. • The testing procedure is then simply: reject H(k) if p̃(k) ≤ α.
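A sketch of the single-step minP variant of this adjustment (the full Westfall-Young step-down additionally restricts the minimum to the hypotheses not yet rejected); the resampled p-value matrix is assumed to have been generated under the complete null, e.g. by permuting group labels:

```python
import numpy as np

def minp_fwe_adjusted(p_obs, p_resampled):
    """Single-step minP adjustment in the spirit of Westfall & Young:
    p_adj(k) = fraction of resamples whose *minimum* p-value is <= p_obs(k).
    p_resampled has shape (N, m): one p-value per hypothesis per resample."""
    min_per_resample = p_resampled.min(axis=1)          # min_i p*(n, i) for each resample n
    # For each observed p-value, how often does the null minimum fall below it?
    return np.array([(min_per_resample <= p).mean() for p in p_obs])

# Reject H_k at FWE level alpha when minp_fwe_adjusted(...)[k] <= alpha.
```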
Resampling FDR adjustments • But FDR is also a function of the number of false null hypotheses being rejected: FDRest(p) = E{ V*(p) / (V*(p) + s(p)) }, where s(p) = # false null hypotheses with p-values less than p. Do we know s(p)? No. But we can try and estimate it!
Estimation methods The next two methods use resampling to estimate the joint distribution of the p-values. • The FDR local (point) estimator is conservative on the mean. • The FDR upper limit bounds the FDR with probability 95%. It is stricter than the first method, and therefore less powerful; on the other hand it controls the empirical FDR with probability 0.95, hence supplies more protection.
Alternative Estimation method • Let’s use the BH procedure, but rather than using the raw p-values corresponding to the test statistics, estimate the p-values by resampling from the marginal distribution and collapsing over all hypotheses. In simple words: average over all the genes, then find the adjusted p-values and compare them to q. The gain: this overcomes the discrete nature of the permutation distribution of a test statistic based on few observations (it supplies more possible values for the p-values).
If you insist, then formally: for the k-th gene, with an observed test statistic tk, where N = # iterations and I = # genes, the estimated p-value is p̂k = #{ (n,i) : |t*(n,i)| ≥ |tk| } / (N·I). Then obtain the BH point estimate for the k-th gene by plugging the estimated p-values into the BH adjustment.
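A sketch of this pooled (marginal) p-value estimate; t_null is assumed to hold the N x I matrix of resampled t-statistics for all genes:

```python
import numpy as np

def pooled_permutation_pvalues(t_obs, t_null):
    """Estimate p-values by pooling null statistics over all genes:
    p_k = #{ |t*(n,i)| >= |t_k| } / (N * I)."""
    pooled = np.abs(t_null).ravel()     # collapse over iterations AND genes
    pooled.sort()
    # Number of pooled null values >= |t_k|, found by binary search
    counts = len(pooled) - np.searchsorted(pooled, np.abs(t_obs), side="left")
    return counts / len(pooled)

# The BH point estimate for each gene is then:
#   bh_adjusted(pooled_permutation_pvalues(t_obs, t_null))
# using the bh_adjusted sketch from earlier.
```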
FDR controlling procedures • Use the p-values from t-tests in the BH procedure • Use the (marginal) p-values from resampling in the BH procedure • The FDR resampling “point estimate” procedure • The FDR resampling “upper bound” procedure; the last two estimate by resampling the distribution of FDRest(p) = E{ V*(p) / (V*(p) + s(p)) }. Now that we know the procedures, let’s test them.
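For intuition only, here is a rough sketch of how a resampling-based point estimate of FDRest(p) could be assembled, estimating s(p) by max(r(p) − V*(p), 0) where r(p) is the observed number of p-values at or below p; this is one plausible instantiation, not the exact procedure from the paper:

```python
import numpy as np

def fdr_point_estimate(p_obs, p_resampled):
    """Rough FDR point estimate at each observed p-value threshold:
    V*(p) = average count of null (resampled) p-values <= p,
    r(p)  = count of observed p-values <= p,
    FDR(p) ~ V*(p) / (V*(p) + max(r(p) - V*(p), 0)) = min(V*(p)/r(p), 1)."""
    p_sorted = np.sort(np.asarray(p_obs))
    est = np.empty(len(p_sorted))
    for j, p in enumerate(p_sorted):
        v_star = (p_resampled <= p).sum(axis=1).mean()  # E[# null p-values <= p]
        r = j + 1                                       # observed p-values <= p
        est[j] = min(v_star / r, 1.0)
    return est
```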
The data Dudoit, Yang, Callow, Speed (2000): Statistical analysis of a lipid metabolism study in mice. • Treatment: 8 knockout mice with low HDL levels • Control: 8 “normal” C57BL/6 mice • Purpose: identification of single differentially expressed genes in the livers of mice with very low HDL cholesterol levels (treatment group) compared to inbred control mice
The micro-array data consists of 6,359 genes. • The data were suitably standardized using a log transformation and a lowess smoother. • Compare gene expression means between the control and treatment groups.
Results of multiple testing on the original data • The p-values were obtained directly from the “raw” t-statistics with 14 degrees of freedom. Ignoring multiplicity, the actual number of raw p-values smaller than 0.05 is 568 (out of 6,359). Let’s examine the FWE control in this case
Using the Bonferroni procedure always controls the FWE • To address multiplicity with Bonferroni at 0.05, conduct each test at level 0.05/6359. • 8 genes are identified. Bonferroni adjusted p-values pBON(i) = p(i)·m are compared to 0.05 *The enlarged table is in the appendix
Applying the FDR controlling BH procedure to the raw p-values, we come up with the same 8 rejected null hypotheses obtained in the original analysis. Surprising? Not really. Take a look at the table and observe the large gap between the p-values of the genes identified by the FWE controlling procedure and the other p-values. Also, don’t forget that the actual distribution of the test statistics is not quite the same t-distribution underlying the derivation of the p-values. So what’s the next improvement? You know it.
That’s right: resampling • Let’s estimate the distribution of the t-statistics over 1,000 resampling iterations. • Adjusted p-values were calculated using the WY algorithm (FWE & resampling) and the 3 resampling-based procedures we learned. • The ten highest absolute t-values (except the largest one, 20.6, which falls too far outside the plotted x-axis) are marked on the plot. • At the 0.05 significance level, we may still reject the same 8 hypotheses by all procedures. What about 0.1? *The enlarged graph is in the appendix
FDR and FWE adjusted p-values (original data)
Summary of the Results • The results of FDR control are very close to the results of the FWE control, and both are very different from the unadjusted results. • The reduced conservativeness of FDR controlling procedures does not trigger the discovery of artifacts. But what if there is no clear distinction between differentially expressed genes and similarly expressed ones?
Simulated Data Configuration • We set the number of differentially expressed genes to be fixed at 70 (~1%). • Differential expression was modeled using the weak lp-ball model described in Abramovich et al. (2000), by which a sparse signal pattern was generated: r · n^(1/p) · i^(−1/p), i = 1, …, n, where p is a decay rate parameter, r is the maximum of the decay function and n is the number of values. We used p = {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}.
Remove from the original data the differences in expression levels between the experiment and control groups that are not attributable to noise. • On each simulation repetition, shuffle the experiment and control groups; however, don’t shuffle the genes (why?). • Add the sparse signal of differential expression values to 70 randomly selected genes. • Do 100 resampling iterations. • Apply the procedures described earlier. • Do 400 simulation repetitions and calculate the average FDR and power over the simulations.
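A condensed sketch of this simulation loop, under stated assumptions: pure-noise data stand in for the signal-removed original data, BH on the raw t-test p-values stands in for the full battery of procedures, the value of r and the repetition count (100 rather than 400, for speed) are illustrative, and bh_adjusted is reused from the earlier sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_genes, n_mice = 6359, 8
n_signal, q = 70, 0.05
decay_p, r = 2.0, 3.0                                  # decay rate p and scale r (illustrative)

# Sparse signal from the weak lp-ball model: r * n^(1/p) * i^(-1/p), i = 1..n
i = np.arange(1, n_signal + 1)
signal = r * n_signal ** (1 / decay_p) * i ** (-1 / decay_p)

data = rng.normal(size=(n_genes, 2 * n_mice))          # stand-in for the signal-removed data
fdr_sum, power_sum, n_reps = 0.0, 0.0, 100

for _ in range(n_reps):
    labels = rng.permutation(2 * n_mice)               # shuffle the mice, never the genes
    treat = data[:, labels[:n_mice]].copy()
    ctrl = data[:, labels[n_mice:]]
    hot = rng.choice(n_genes, n_signal, replace=False)  # 70 randomly selected "true" genes
    treat[hot] += signal[:, None]                       # inject the differential expression
    _, p = stats.ttest_ind(treat, ctrl, axis=1)
    reject = bh_adjusted(p) <= q                        # BH at FDR level q
    v = np.sum(reject & ~np.isin(np.arange(n_genes), hot))   # false discoveries
    fdr_sum += v / max(reject.sum(), 1)
    power_sum += reject[hot].mean()                     # fraction of true genes found

print(f"average FDR {fdr_sum / n_reps:.3f}, average power {power_sum / n_reps:.3f}")
```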
Simulation Study Results • The next figure presents the mean curves of the adjusted p-values vs. the test statistics, for decay rate p = 2 and adjusted FDR level less than 0.25. • The maximal standard error of the estimated FDR was less than 0.003. • True FDR = the realized FDR (we planted 70 differentially expressed genes, so if we identify 80 we know at least 10 are there by chance). What can we learn from the “true FDR”? *The enlarged figure is in the appendix
What can we learn from the “true FDR”? • For all FDR controlling procedures, the adjusted p-values are larger than the corresponding true FDR, indicating that using them guarantees control of the FDR. • The FDR procedures produce FDR adjusted p-values much closer to the true FDR than the FWE adjusted p-values obtained by the WY algorithm.