Controlling the Actual Number of False Discoveries at a Given Confidence Level. Joe Maisog, BIST-530 Final Project, December 3, 2008.
False Discovery Rate • FDR (FPR) = proportion of positive tests which are actually false positives • FDR methods control the FDR in the sense that E{FDR} ≤ q, where q ∈ [0,1] is the desired level of control (Benjamini and Hochberg, 1995)
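As a concrete illustration of this kind of control in R (not code from this project), the Benjamini-Hochberg adjustment in base R can be applied to a vector of univariate p-values; the placeholder p-values below are for demonstration only:

## Minimal sketch of BH-FDR control at q = 0.05 in base R.
## 'pvals' is an illustrative placeholder, not data from this project.
set.seed(1)
q <- 0.05
pvals <- runif(1000)                          # placeholder uniform p-values
rejected <- p.adjust(pvals, method = "BH") <= q
sum(rejected)                                 # number of discoveries at FDR level q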
Korn’s Variants • Korn EL et al., Journal of Statistical Planning and Inference 124(2): 379-398 (2004).
Follow-Up Paper by Lusa et al. • Lusa L, Korn EL, McShane LM, A class comparison method with filtering-enhanced variable selection for high-dimensional data sets, Stat Med. 2008 Dec 10;27(28):5834-49. • C code (R package)
A Problem “Procedures targeting control of the expected number or proportion of false discoveries rather than the actual number or proportion can give a false sense of security. … Even with no correlation the results here [using “regular” FDR with simulated data] are troubling: 10% of the time the false discovery proportion will be 0.29 or more.” [emphasis mine]
Analogy: Accuracy vs. Precision • FDR control is analogous to high accuracy with low precision: correct on average, but individual results can scatter widely (contrast: high precision with low accuracy). [Figure: target diagrams from http://en.wikipedia.org/wiki/Accuracy]
Two Jokes: Controlling Expectation Without a Confidence Level • Three statisticians went out hunting, and came across a large deer. The first statistician fired, but missed, by a meter to the left. The second statistician fired, but also missed, by a meter to the right. • The third statistician didn't fire, but shouted in triumph, "On the average we got it!" • With one foot in a bucket of ice water, and one foot in a bucket of boiling water, you are, on the average, comfortable. http://www.workjoke.com/statisticians-jokes.html
Korn’s Solution “[Procedures targeting control of the actual number or proportion of false discoveries] will allow statements such as ‘with 95% confidence, the number of false discoveries does not exceed 2’ or ‘with approximate 95% confidence, the proportion of false discoveries does not exceed 0.01.’ ” [emphasis mine]
Two Goals • Confirm Korn's warning that, when using "regular" FDR, a fairly large fraction of observed false positive proportions exceed the nominal (expected) rate. • Implement Korn's method in R to control the actual number of false positives at a given confidence level, using the computationally efficient version.
Definition • k variables (e.g., genes) • P(1) < P(2) < . . . < P(k) are the ordered p-values from the univariate tests • H(1), H(2), . . . , H(k) are the corresponding null hypotheses • T = { t1, t2, . . . , tj } is any subset of K = { 1, 2, . . . , k } • Pr00 is the multivariate permutation distribution of p-values
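To make the last definition concrete: one way to generate the multivariate permutation distribution of p-values in R is to repeatedly permute the group labels and recompute all k univariate p-values. The sketch below is only an illustration under assumed names (X, grp, B); it is not Korn's full procedure and not the project's code:

## Hedged sketch: multivariate permutation distribution of p-values.
## Assumes X is a k x n data matrix (rows = variables) and grp is a
## two-level factor of length n; names are illustrative only.
perm_pvalue_dist <- function(X, grp, B = 1000) {
  k <- nrow(X)
  P <- matrix(NA_real_, nrow = B, ncol = k)
  for (b in seq_len(B)) {
    g <- sample(grp)                 # random relabeling of the samples
    P[b, ] <- apply(X, 1, function(x)
      t.test(x[g == levels(g)[1]], x[g == levels(g)[2]])$p.value)
  }
  P                                  # row b holds the k p-values from permutation b
}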
1000 Simulations in R • 50 controls, 50 treatments, 1000 genes • Noise ~ N(0,1), no cross-gene correlations • 100 genes "activated" in treatments with increase = 0.3969 (p = 0.05) • "Regular" FDR method to control E{FDR} at q = 0.05 • Korn's method to control the number of actual FPs at u = 50, with 95% confidence
Simulated Data Matrix • k = 1000 genes: G1 = 100 activated, G2 = 900 null • Ntot = 100 subjects: N1 = 50 controls, N2 = 50 treatments • One univariate p-value per gene (k = 1000 p-values). [Figure: schematic of the simulated data matrix]
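For concreteness, here is a hedged R sketch of one replicate of the "regular"-FDR arm of this simulation, under the design just described (N(0,1) noise, 100 genes shifted by 0.3969 in the treatment group, BH control at q = 0.05). Names and structure are illustrative; this is not the project's original code, and the Korn-procedure arm is not reproduced here:

## Hedged sketch of one simulation replicate (illustrative names only).
set.seed(1)
k <- 1000; n1 <- 50; n2 <- 50      # genes; control and treatment sample sizes
g1 <- 100                          # number of "activated" genes
delta <- 0.3969                    # mean increase in the treatment group
q <- 0.05

X <- matrix(rnorm(k * (n1 + n2)), nrow = k)            # N(0,1) noise, no correlation
X[1:g1, (n1 + 1):(n1 + n2)] <- X[1:g1, (n1 + 1):(n1 + n2)] + delta
grp <- factor(rep(c("control", "treatment"), c(n1, n2)))

pvals <- apply(X, 1, function(x) t.test(x ~ grp)$p.value)

rejected  <- p.adjust(pvals, method = "BH") <= q
false_pos <- sum(rejected[(g1 + 1):k])                 # rejections among the 900 null genes
fdp       <- if (any(rejected)) false_pos / sum(rejected) else 0

## Repeating this 1000 times and recording fdp and false_pos for each
## replicate yields summaries comparable to the results that follow,
## e.g. mean(fdp) and mean(fdp > q).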
Results: "Regular" FDR • Mean FPR = 0.0394, so E{FDR} is indeed controlled at q = 0.05 • But 17.5% of the time, the observed FPR exceeded 0.05
Results: Korn's Method • 98.9% of the time, the actual number of false positives was ≤ 50 • Controlled at u = 50 with 95% confidence
Conclusions • 17.5% of the time, FPR > q = 0.05 with “regular” FDR • Korn’s method controlled actual number of false positives at u = 50 with 95% confidence (actually slightly conservative) • Disadvantage: computationally intensive • Examining someone else’s computer program can be difficult but very rewarding!
Future Directions • Try different parameters (e.g., signal size; number of subjects, variables, or permutations), or with correlated variables • Try the method on real data • Try Korn’s “Procedure B”, which controls the actual FDR at a given confidence level • Try Lusa’s R package for feature selection
References • Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300 (1995). • Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference 124(2): 379-398 (2004). • Lusa L, Korn EL, McShane LM. A class comparison method with filtering-enhanced variable selection for high-dimensional data sets. Statistics in Medicine 27(28): 5834-5849 (2008). R package available at: http://linus.nci.nih.gov/Data/LusaL/bioinfo/ • Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple Comparisons and Multiple Tests. Cary, NC: SAS Institute, Inc., 1999. • A copy of the R code developed for this project can be found here: http://bist.pbwiki.com/f/bist530FinalProject.r