Controlling the Actual Number of False Discoveries at a Given Confidence Level. Joe Maisog, BIST-530 Final Project, December 3, 2008.
False Discovery Rate • FDR (FPR) = proportion of positive tests which are actually false positives • FDR methods control the FDR in the sense that E{FDR} ≤ q, where q ∈ [0,1] is the desired level of control (Benjamini and Hochberg, 1995)
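As a concrete illustration of this kind of control in R (not code from this project), the Benjamini-Hochberg adjustment in base R can be applied to a vector of univariate p-values; the placeholder p-values below are for demonstration only:

## Minimal sketch of BH-FDR control at q = 0.05 in base R.
## 'pvals' is an illustrative placeholder, not data from this project.
set.seed(1)
q <- 0.05
pvals <- runif(1000)                          # placeholder uniform p-values
rejected <- p.adjust(pvals, method = "BH") <= q
sum(rejected)                                 # number of discoveries at FDR level q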
Korn’s Variants • Korn EL et al., Journal of Statistical Planning and Inference 124(2): 379-398 (2004).
Follow-Up Paper by Lusa et al. • Lusa L, Korn EL, McShane LM, A class comparison method with filtering-enhanced variable selection for high-dimensional data sets, Stat Med. 2008 Dec 10;27(28):5834-49. • C code (R package)
A Problem “Procedures targeting control of the expected number or proportion of false discoveries rather than the actual number or proportion can give a false sense of security. … Even with no correlation the results here [using “regular” FDR with simulated data] are troubling: 10% of the time the false discovery proportion will be 0.29 or more.” [emphasis mine]
Analogy: Accuracy vs. Precision • FDR control is analogous to high accuracy with low precision: correct on average, but individual results can scatter widely (contrast: high precision with low accuracy). [Figure: target diagrams from http://en.wikipedia.org/wiki/Accuracy]
Two Jokes: Controlling Expectation Without a Confidence Level • Three statisticians went out hunting, and came across a large deer. The first statistician fired, but missed, by a meter to the left. The second statistician fired, but also missed, by a meter to the right. • The third statistician didn't fire, but shouted in triumph, "On the average we got it!" • With one foot in a bucket of ice water, and one foot in a bucket of boiling water, you are, on the average, comfortable. http://www.workjoke.com/statisticians-jokes.html
Korn’s Solution “[Procedures targeting control of the actual number or proportion of false discoveries] will allow statements such as ‘with 95% confidence, the number of false discoveries does not exceed 2’ or ‘with approximate 95% confidence, the proportion of false discoveries does not exceed 0.01.’ ” [emphasis mine]
Two Goals • Confirm Korn's warning that, when using "regular" FDR, a fairly large fraction of observed false positive proportions exceed the nominal (expected) rate. • Implement Korn's method in R to control the actual number of false positives at a given confidence level, using the computationally efficient version.
Definition • k variables (e.g., genes) • P(1) < P(2) < . . . < P(k) are the ordered p-values from the univariate tests • H(1), H(2), . . . , H(k) are the corresponding null hypotheses • T = { t1, t2, . . . , tj } is any subset of K = { 1, 2, . . . , k } • Pr00 is the multivariate permutation distribution of p-values
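To make the last definition concrete: one way to generate the multivariate permutation distribution of p-values in R is to repeatedly permute the group labels and recompute all k univariate p-values. The sketch below is only an illustration under assumed names (X, grp, B); it is not Korn's full procedure and not the project's code:

## Hedged sketch: multivariate permutation distribution of p-values.
## Assumes X is a k x n data matrix (rows = variables) and grp is a
## two-level factor of length n; names are illustrative only.
perm_pvalue_dist <- function(X, grp, B = 1000) {
  k <- nrow(X)
  P <- matrix(NA_real_, nrow = B, ncol = k)
  for (b in seq_len(B)) {
    g <- sample(grp)                 # random relabeling of the samples
    P[b, ] <- apply(X, 1, function(x)
      t.test(x[g == levels(g)[1]], x[g == levels(g)[2]])$p.value)
  }
  P                                  # row b holds the k p-values from permutation b
}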
1000 Simulations in R • 50 controls, 50 treatments, 1000 genes • Noise ~ N(0,1), no cross-gene correlations • 100 genes "activated" in treatments with increase = 0.3969 (p = 0.05) • "Regular" FDR method to control E{FDR} at q = 0.05 • Korn's method to control the number of actual FPs at u = 50, with 95% confidence
Simulated Data Matrix • k = 1000 genes: G1 = 100 activated, G2 = 900 null • Ntot = 100 subjects: N1 = 50 controls, N2 = 50 treatments • One univariate p-value per gene (k = 1000 p-values). [Figure: schematic of the simulated data matrix]
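For concreteness, here is a hedged R sketch of one replicate of the "regular"-FDR arm of this simulation, under the design just described (N(0,1) noise, 100 genes shifted by 0.3969 in the treatment group, BH control at q = 0.05). Names and structure are illustrative; this is not the project's original code, and the Korn-procedure arm is not reproduced here:

## Hedged sketch of one simulation replicate (illustrative names only).
set.seed(1)
k <- 1000; n1 <- 50; n2 <- 50      # genes; control and treatment sample sizes
g1 <- 100                          # number of "activated" genes
delta <- 0.3969                    # mean increase in the treatment group
q <- 0.05

X <- matrix(rnorm(k * (n1 + n2)), nrow = k)            # N(0,1) noise, no correlation
X[1:g1, (n1 + 1):(n1 + n2)] <- X[1:g1, (n1 + 1):(n1 + n2)] + delta
grp <- factor(rep(c("control", "treatment"), c(n1, n2)))

pvals <- apply(X, 1, function(x) t.test(x ~ grp)$p.value)

rejected  <- p.adjust(pvals, method = "BH") <= q
false_pos <- sum(rejected[(g1 + 1):k])                 # rejections among the 900 null genes
fdp       <- if (any(rejected)) false_pos / sum(rejected) else 0

## Repeating this 1000 times and recording fdp and false_pos for each
## replicate yields summaries comparable to the results that follow,
## e.g. mean(fdp) and mean(fdp > q).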
Results: "Regular" FDR • Mean FPR = 0.0394, so E{FDR} is indeed controlled at q = 0.05 • But 17.5% of the time, the observed FPR exceeded 0.05
Results: Korn's Method • 98.9% of the time, the actual number of false positives was ≤ 50 • Controlled at u = 50 with 95% confidence
Conclusions • 17.5% of the time, FPR > q = 0.05 with “regular” FDR • Korn’s method controlled actual number of false positives at u = 50 with 95% confidence (actually slightly conservative) • Disadvantage: computationally intensive • Examining someone else’s computer program can be difficult but very rewarding!
Future Directions • Try different parameters (e.g., signal size; number of subjects, variables, or permutations), or with correlated variables • Try the method on real data • Try Korn’s “Procedure B”, which controls the actual FDR at a given confidence level • Try Lusa’s R package for feature selection
References • Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300 (1995). • Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference 124(2): 379-398 (2004). • Lusa L, Korn EL, McShane LM. A class comparison method with filtering-enhanced variable selection for high-dimensional data sets. Statistics in Medicine 27(28): 5834-5849 (2008). R package available at: http://linus.nci.nih.gov/Data/LusaL/bioinfo/ • Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple Comparisons and Multiple Tests. Cary, NC: SAS Institute, Inc., 1999. • A copy of the R code developed for this project can be found here: http://bist.pbwiki.com/f/bist530FinalProject.r