Needles in Haystacks: Are There Any? How Many Are There? Where Are They?

John Rice, University of California, Berkeley

Outline
• Classical testing: significance levels, p-values, power
• Testing many hypotheses: issues and recent developments (false discovery rate)
• “Higher Criticism”: are any null hypotheses false?
• Motivation: the Taiwanese-American Occultation Survey
• Estimating the proportion of false nulls

Classical Testing
H0: null hypothesis vs HA: alternative hypothesis
T: test statistic. Reject H0 for large values of T, say T > t0 (threshold)
Type I error: reject H0 when it holds
Significance level: α = Prob(Type I error) = P(T > t0 | H0)
Fix α and find t0 by considering only the null distribution of T
P-value: if we observe T = t, P-value = Prob(T > t | H0). Under H0, with continuity, the distribution of the P-value is uniform on [0,1]
Type II error: fail to reject H0 when HA holds
Power: 1 - Prob(Type II error)
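
As an illustration of the threshold and P-value definitions above, here is a minimal sketch. It assumes, purely for the example, that T is standard normal under H0; the function names are hypothetical and nothing here comes from the talk itself.

```python
import random
from statistics import NormalDist

# Assumption for this sketch only: T is standard normal under H0.
STD_NORMAL = NormalDist()

def threshold(alpha: float) -> float:
    """Return t0 such that P(T > t0 | H0) = alpha, using only the null distribution."""
    return STD_NORMAL.inv_cdf(1.0 - alpha)

def p_value(t: float) -> float:
    """Return P(T > t | H0) for an observed statistic t."""
    return 1.0 - STD_NORMAL.cdf(t)

def null_p_values(n: int, seed: int = 0) -> list:
    """Simulate n P-values in a setting where H0 is true for every test."""
    rng = random.Random(seed)
    return [p_value(rng.gauss(0.0, 1.0)) for _ in range(n)]

if __name__ == "__main__":
    alpha = 0.05
    print("t0 =", round(threshold(alpha), 3))
    ps = null_p_values(20000)
    # Under H0 the P-values are uniform on [0,1], so roughly a fraction
    # alpha of them should fall below alpha.
    print("fraction below alpha:", sum(p < alpha for p in ps) / len(ps))
```

Rejecting when T > t0 is the same as rejecting when the P-value is below α, which is why the later slides can work with P-values alone.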

Multiple Testing Many null and alternative hypotheses; e.g. source detection --- each pixel is either background or source Collection of test statistics and P-values, one for each hypothesis. May or may not be independent random variables Possible questions: Are any null hypotheses false? How many are false? Which ones are false? Or probabilities of such.

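Analogues of Type I Error
Notation: m = number of hypotheses tested, V = number of true null hypotheses rejected (false rejections)
Per-Comparison Error Rate (PCER): E(V/m). Ignores multiplicity and uses significance level α for each test
Per-Family Error Rate (PFER): E(V)
Family-Wise Error Rate (FWER): P(V > 0)
The latter two can be controlled by Bonferroni, i.e. by testing each hypothesis at level α/m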
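
A small sketch of the contrast between per-comparison testing and Bonferroni. The simulation assumes that all m nulls are true and that the P-values are independent uniforms; the function names are hypothetical.

```python
import random

def bonferroni_reject(p_values, alpha):
    """Reject hypothesis i when its P-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def estimate_fwer(m, per_test_level, n_sim=2000, seed=0):
    """Monte Carlo estimate of FWER = P(V > 0) when all m nulls are true.

    Assumes independent uniform P-values; each test rejects when its
    P-value is at most per_test_level.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        if any(rng.random() <= per_test_level for _ in range(m)):
            hits += 1
    return hits / n_sim

if __name__ == "__main__":
    m, alpha = 100, 0.05
    print("FWER, each test at level alpha:  ", estimate_fwer(m, alpha))
    print("FWER, each test at level alpha/m:", estimate_fwer(m, alpha / m))
```

With each test run at level α the chance of at least one false rejection is close to 1 for m = 100, while running each test at α/m keeps it near α.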

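Recent Analogues
R = total number of rejections, V = number of false rejections
False Discovery Proportion: FDP = V/R (taken as 0 when R = 0)
False Discovery Rate: FDR = E(FDP)
Positive FDR: p-FDR = E(FDP | R > 0)
Exceedance Control: P(FDP > c)
k-FWER: the probability of k or more false rejections, P(V ≥ k), which can be nonzero only when at least k null hypotheses are true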

False Discovery Rate
Determination of the FDR threshold for a desired level α of FDR = E(V/R):
Order the P-values: P(1) ≤ P(2) ≤ … ≤ P(m)
Find d = max { j : P(j) ≤ jα/m }
Reject all hypotheses with P-value ≤ P(d)
The quantity controlled by FDR can be more meaningful than that controlled by PCER, which treats 10 false detections out of 20 detections the same as 10 out of 2000.
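
The thresholding rule on this slide is the Benjamini and Hochberg (1995) step-up procedure. A minimal sketch follows; the function name and the example P-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: with the P-values sorted increasingly, find the largest
    rank d such that the d-th smallest P-value is at most d * alpha / m,
    then reject the d hypotheses with the smallest P-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    d = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            d = rank  # keep the largest rank that satisfies the condition
    rejected = [False] * m
    for idx in order[:d]:
        rejected[idx] = True
    return rejected

if __name__ == "__main__":
    # Hypothetical P-values, not data from the talk.
    p = [0.9, 0.03, 0.02, 0.021]
    print(benjamini_hochberg(p, alpha=0.05))
```

In this example the smallest P-value (0.02) exceeds its own cutoff α/m = 0.0125, so Bonferroni would reject nothing, yet it is rejected because larger ranks satisfy the condition. This is the adaptive behaviour of the FDR threshold.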

[Figure: empirical distribution of P-values, compared with the uniform distribution and the FDR line t(p) = p/α]
Note that the FDR threshold is chosen adaptively, in contrast to the threshold for PCER, which controls E(V/m) by, say, a fixed kσ threshold. For example, the FDR threshold adapts to the distribution of source intensity relative to background intensity.

Hopkins et al.

Higher Criticism
Are there any false nulls, any sources? Are there any needles in the haystack?
The test statistic is based on comparing the distribution of the P-values to a uniform distribution: are there too many small ones?
Under the null, expect about i of the n P-values to be ≤ i/n
Donoho & Jin
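
A sketch of one common form of the higher criticism statistic from Donoho & Jin (2004): the largest standardized excess of the observed count of P-values at or below p(i) over its expectation under uniformity. The cutoff fraction `alpha0` and the simulated data are assumptions of this example.

```python
import math
import random

def higher_criticism(p_values, alpha0=0.5):
    """Max over i <= alpha0 * n of sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))."""
    n = len(p_values)
    p_sorted = sorted(p_values)
    best = float("-inf")
    for i in range(1, int(alpha0 * n) + 1):
        p = p_sorted[i - 1]
        if 0.0 < p < 1.0:  # skip degenerate values that would divide by zero
            z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
            best = max(best, z)
    return best

if __name__ == "__main__":
    rng = random.Random(1)
    n = 10000
    null_only = [rng.random() for _ in range(n)]
    # Hypothetical sparse alternative: 1% of the P-values are pushed toward 0.
    with_signal = [rng.random() ** 4 if i < n // 100 else rng.random()
                   for i in range(n)]
    print("HC, all nulls true: ", higher_criticism(null_only))
    print("HC, sparse signals: ", higher_criticism(with_signal))
```

Large values of the statistic indicate too many small P-values. The statistic only says that some nulls are false; it does not say which ones.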

Consider a large number of tests for a rare but moderately strong signal. There are scenarios in which it can be determined that signals are present, but not which tests correspond to signals. In such scenarios the smallest few P-values will not correspond to signals.
[Figure: axes labelled sparsity and strength]

Estimating the Proportion Seemingly harder question: what is the proportion of needles in the haystack? Motivation: The Taiwanese-American Occultation Survey (TAOS) will search for Kuiper Belt Objects (KBOs), by monitoring star fields for occultations.

Occultations

[Figure: time series of flux, showing an occultation by an asteroid on two cameras]
Thousands of stars will be simultaneously monitored every night, searching for rare events lasting about 1/5 second. In the course of a year, the survey will try to detect 10-1000(?) occultations among 10^10 to 10^12 measurements!
[Figure: simulated occultation]

Proposed Detection Scheme
Consider basing the test on the flux from a single hold. Consider a particular star.
Initial data: f_kh = flux from the star on telescope k, hold h = 1,…,n; these will be used for calibrating subsequent test statistics.
New observation to be tested for a possible occultation: Y_k
R_k = rank of Y_k among the f_kh
Test statistic: the product of the R_k

Construction is based on the following fact: if Y1,…,Yn are iid (with a continuous distribution) and Y is independent of them with the same distribution, then the rank of Y among Y1,…,Yn,Y is uniformly distributed: P(rank = r) = 1/(n+1) for r = 1,…,n+1. Thus, the null distribution of the product of the ranks can be calculated explicitly. Or an approximation to the log of the product can be made by treating the ranks as independent uniform random variables.
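
To make the explicit calculation concrete, here is a sketch that enumerates the exact null distribution of the rank product. It assumes the ranks on the different telescopes are independent under the null, and that an occultation lowers the flux so that small products are the evidence against the null; the talk states neither the rejection direction nor the values of n and the number of telescopes, so those are choices of this example.

```python
from collections import Counter
from fractions import Fraction

def rank(y, calibration):
    """Rank of y among the calibration fluxes plus y itself (1 = smallest)."""
    return 1 + sum(f < y for f in calibration)

def rank_product_null(n, k):
    """Exact null pmf of the product of k independent ranks,
    each uniform on 1, ..., n + 1."""
    dist = Counter({1: Fraction(1)})
    w = Fraction(1, n + 1)
    for _ in range(k):
        new = Counter()
        for prod, prob in dist.items():
            for r in range(1, n + 2):
                new[prod * r] += prob * w
        dist = new
    return dict(dist)

def lower_tail_p_value(z, pmf):
    """P(product <= z) under the null (assumed rejection direction)."""
    return sum(prob for prod, prob in pmf.items() if prod <= z)

if __name__ == "__main__":
    # Hypothetical sizes: 10 calibration holds, 3 telescopes.
    pmf = rank_product_null(n=10, k=3)
    # The new flux is the smallest value seen on every telescope: product = 1.
    print("P-value for product 1:", lower_tail_p_value(1, pmf))
```

Because the null distribution depends only on ranks, no assumption about the shape of the flux distribution is needed, which is the point of the construction.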

Retrospective Estimation of Occultation Rate Suppose have a year of data. What can we say about the occultation rate (and thus the abundance of KBOs)? Note distinction between this question and identifying individual occultations in real time.

The problem:
• Given a very large number of independent hypothesis tests, where in the vast majority of cases the null hypothesis is true, estimate the proportion of false null hypotheses.
• The power of the test is unknown and varies from test to test.
• The distribution of the test statistic under the alternative is not known.
• We would like to be able to state at a specified level of confidence that there are at least a specified number of false null hypotheses.

Suppose a proportion of the tests correspond to false null hypotheses. Then the distribution of the p-values is Lower bound: Empirical version of numerator:

Motivation for construction: want to bound the contributions from the true nulls to Suppose there exists such that Since the proportion of p-values greater than can be attributed to false nulls. Thus a (biased) estimate of the proportion of false nulls:

Lower confidence bound: Thus can state, for example, that with 90% confidence there were at least 777 occultations. Note that there is no meaningful upper bound, because occultations could be arbitrarily shallow. Analysis shows that there are scenarios in which the proportion of false nulls can be consistently estimated but in which one cannot identify which nulls were false.
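
A minimal sketch of a lower confidence bound of this kind. It is not the construction from the talk or from the `howmany` package: as a simplifying assumption it bounds the contribution of the true nulls by a constant, taken from the one-sided Dvoretzky-Kiefer-Wolfowitz inequality, so that an excess of the empirical distribution over the uniform by more than sqrt(log(1/α) / (2m)) has probability at most α (for α ≤ 1/2). The function name and the example data are hypothetical.

```python
import math

def lower_bound_false_nulls(p_values, alpha=0.10):
    """Lower (1 - alpha) confidence bound for the proportion of false nulls.

    Computes sup_t (F_m(t) - t - beta) / (1 - t), floored at 0, where F_m is
    the empirical distribution function of the P-values and beta is the
    constant DKW bound (an assumption of this sketch).
    """
    m = len(p_values)
    beta = math.sqrt(math.log(1.0 / alpha) / (2.0 * m))
    best = 0.0
    # On each interval between order statistics the ratio decreases in t,
    # so the supremum is attained at one of the observed P-values.
    for i, t in enumerate(sorted(p_values), start=1):
        if t < 1.0:
            best = max(best, (i / m - t - beta) / (1.0 - t))
    return best

if __name__ == "__main__":
    m = 1000
    # Hypothetical data: 100 very small P-values among 900 evenly spread ones.
    p = [1e-9] * 100 + [(j - 0.5) / 900 for j in range(1, 901)]
    bound = lower_bound_false_nulls(p, alpha=0.10)
    print("with 90% confidence, at least", int(bound * m), "false nulls")
```

Multiplying the bound by m turns the proportion into a statement of the form "at least this many false nulls". The bound is conservative by design, which matches the remark above that only a lower bound is meaningful.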

Surprise! You would think that estimating the proportion of false nulls is harder than testing whether any nulls were false, but for the normal model presented earlier, when you can do one, you can do the other. Cai, Jin & Low

References
Y. Benjamini and Y. Hochberg (1995). Controlling the false discovery rate. J. Royal Stat. Soc. B 57, 289.
T. Cai, J. Jin, and M. Low (2005). Estimation and confidence sets for sparse normal mixtures. www.stat.purdue.edu/~jinj/Research/ESTEPS.pdf
D. Donoho and J. Jin (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32, 962.
C. Genovese and L. Wasserman (2005). Exceedance control of the false discovery proportion. http://www.stat.cmu.edu/~genovese/papers/exceedance.pdf
A. Hopkins et al. (2002). A new source detection algorithm using the false discovery rate. Astr. J. 123, 1086.
N. Meinshausen and J. Rice (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Annals of Statistics 34(1), in press. Software (R): cran.r-project.org/doc/packages/howmany.pdf
C. Miller et al. (2001). Controlling the false discovery rate in astrophysical data analysis. Astr. J. 122, 3492.
J. Shaffer (2005). Recent developments towards optimality in multiple hypothesis testing. Contact shaffer@stat.berkeley.edu
J. Storey (2002). A direct approach to false discovery rates. J. Royal Stat. Soc. B 64, 479.
M. van der Laan, S. Dudoit, and K. Pollard (2004). Augmentation procedures for control of the generalized familywise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology 3, Article 15.
There are many additional relevant references and the literature is rapidly evolving. Those given above are for starters and contain further references.
