Multiple Testing in Impact Evaluations: Discussant Comments

IES Research Conference, June 11, 2008
Larry L. Orr

The Guidelines
• Just a couple of comments on Peter’s presentation:
  • Sensible advice, masterfully presented – I urge you to adopt them
  • On the open issue of adjusting exploratory tests, I come down on the side of adjusting (with lower significance threshold)
• Focus remainder of remarks on an issue on which Peter was (appropriately) agnostic – which adjustment and what is its effect on power?
Disclaimer: I was a member of the working group that developed the guidelines Peter presented. My remarks today represent my own views, not those of the working group.

Different adjustments deal with different issues
• Many (e.g., Bonferroni, Holm, Tukey-Kramer) control the Family-wise Error Rate (FWER) – i.e., the probability of even one false positive across the whole family of tests. That’s not usually what concerns us
• Typical situation: we have some set of estimates that are significant by conventional standards; we want to be assured that most of them reflect real effects – i.e., we’re concerned with the False Discovery Rate
• Benjamini-Hochberg attempts to control the false discovery rate

The False Discovery Rate (FDR)
• FDR = proportion of significant estimates that are false positives (Type I errors)
• Example: Suppose we have:
  • 20 statistically significant estimates
  • 8 are true nonzero impacts
  • 12 are false positives
  • FDR = .6 (= 12/20)
• Low FDR is good

An Example
• Suppose we estimate impacts on 4 outcomes for each of the following subgroups:
  • Gender
  • Ethnicity (4 groups)
  • Region (4 groups)
  • School size (2 groups)
  • Central city/Suburban/Rural
  • SES (3 groups)
  • Number of siblings (4 groups)
  • Pretest score (3 groups)
• 100 estimates (25 subgroups × 4 outcomes) – not atypical for an education study

Example (cont’d)
• Suppose 10 estimates are significant at the .05 level
• That might reflect:
  • 10 true nonzero impacts
  • 9 true nonzero impacts and 1 false positive
  • 8 true nonzero impacts and 2 false positives
  • …
• Expected mix = 5 true nonzero impacts, 5 false positives; this would imply FDR = 50%
But you can never know what the actual mix is, and you cannot know which is which

Expected FDR as a function of the proportion of true nonzero impacts (assumes no multiple-comparisons (MC) adjustment; significance level = .05; power = .80)

Implications
• When only 5% of all true impacts are nonzero, FDR = .5 – i.e., half of the significant estimates are likely to be Type I errors (but you cannot know which ones they are!)
• FDR is quite high until the proportion of true impacts that are nonzero rises above 25%
• Only when > 50% of true impacts are nonzero is the FDR relatively low (< .06)
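The figures on this slide are consistent with a simple expected-value calculation, sketched below under the stated assumptions (no adjustment, significance level .05, power .80). The function name `expected_fdr` and the closed-form ratio are illustrative assumptions, not taken from the presentation: the ratio of expected false positives to expected significant estimates is only an approximation to the FDR.

```python
def expected_fdr(share_nonzero, alpha=0.05, power=0.80):
    """Approximate FDR as expected false positives / expected significant results.

    share_nonzero: proportion of true impacts that are nonzero (0..1).
    alpha, power:  values assumed on the slide (.05 and .80).
    """
    false_pos = alpha * (1 - share_nonzero)   # true zeros that test significant
    true_pos = power * share_nonzero          # true nonzeros that test significant
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    for share in (0.05, 0.25, 0.50):
        print(f"{share:.0%} nonzero -> expected FDR = {expected_fdr(share):.2f}")
```

Under these assumptions the ratio is roughly .54 when 5% of impacts are nonzero, about .16 at 25%, and just under .06 at 50%, in line with the bullets above.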

Simulations
• Real education data from the ECLS-K Demo
• 4 outcomes: reading, math, attendance, peers
• 25 subgroups (see earlier list)
• Imputed zero or nonzero (ES = .2) impacts for varying proportions of subgroups
• Measured FDR with and w/o B-H correction
• 500 replications of 100 estimates
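For readers unfamiliar with the B-H correction used in these simulations, here is a minimal sketch of the standard Benjamini-Hochberg step-up rule. The function name, the target rate `q`, and the example p-values are hypothetical and are not drawn from the ECLS-K analysis.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is significant under B-H.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= k * q / m, and declare the k smallest p-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank                      # keep the largest rank that passes
    significant = [False] * m
    for idx in order[:k]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values for 10 estimates
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    flags = benjamini_hochberg(p)
    print(sum(pv <= 0.05 for pv in p), "significant unadjusted;",
          sum(flags), "significant after B-H")
```

With these made-up p-values, five estimates pass the conventional .05 threshold but only the two smallest survive the B-H rule, which previews the power cost discussed later.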

Simulation results: FDR as a function of true zero impact rate, unadjusted vs. B-H adjusted
Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration

Implications
• B-H does indeed control the FDR in real-world education data (at least, in these real-world education data)
• Even at very low nonzero impact rates, FDR is well below 5%
• This comes at a price, however…

The effect of the B-H adjustment on Type II errors, adjusted vs. unadjusted
Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration

The cost of adjusting for multiple comparisons within a fixed sample
• For a given sample, reducing the chance of Type I errors (false positives) increases the chance of Type II errors (missing true effects)
• In this case, for very low nonzero impact rates, Type II error rate for a typical subgroup (probability of missing a true effect when there is one) went from .28 to .70 (i.e., power fell from .74 to .30!)
• For high nonzero impact rates, the power loss is much smaller – when nonzero impact rate is 95%, adjustment increases Type II error rate only from .27 to .33 (i.e., power falls from .73 to .67)
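The trade-off described here can be illustrated with a small synthetic simulation. This is only a sketch under loud assumptions: it draws independent normal test statistics rather than using the ECLS-K data, it sets the mean of the nonzero effects so that unadjusted power is about .80, and the function name `simulate` and its parameters are hypothetical. It will not reproduce the specific error rates quoted on the slide.

```python
import random
from statistics import NormalDist


def simulate(share_nonzero, n_tests=100, n_reps=500, alpha=0.05,
             power=0.80, seed=0):
    """Return {"raw": (FDR, power), "bh": (FDR, power)} averaged over n_reps.

    Assumption: each estimate yields an independent z statistic, N(0, 1) for
    true zeros and N(shift, 1) for true nonzero impacts.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    # Mean z that gives roughly `power` for an unadjusted two-sided test
    shift = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    n_true = round(share_nonzero * n_tests)
    totals = {"raw": [0.0, 0.0], "bh": [0.0, 0.0]}   # [sum of FDP, sum of power]
    for _ in range(n_reps):
        tests = []
        for i in range(n_tests):
            real = i < n_true
            z = rng.gauss(shift if real else 0.0, 1.0)
            p = 2 * (1 - norm.cdf(abs(z)))           # two-sided p-value
            tests.append((p, real))
        tests.sort()                                  # smallest p-value first
        n_raw = sum(1 for p, _ in tests if p <= alpha)
        n_bh = 0
        for rank, (p, _) in enumerate(tests, start=1):
            if p <= rank * alpha / n_tests:           # B-H step-up threshold
                n_bh = rank
        for name, k in (("raw", n_raw), ("bh", n_bh)):
            hits = sum(real for _, real in tests[:k])  # true effects detected
            totals[name][0] += (k - hits) / k if k else 0.0
            totals[name][1] += hits / n_true if n_true else 0.0
    return {name: (fdp / n_reps, pw / n_reps)
            for name, (fdp, pw) in totals.items()}


if __name__ == "__main__":
    for name, (fdr, pw) in simulate(0.05).items():
        print(f"{name}: FDR = {fdr:.2f}, power = {pw:.2f}")
```

In this synthetic setting, with 5% of impacts nonzero, the B-H rule should pull the FDR far below the unadjusted rate while also lowering power, which is the qualitative pattern reported above.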

Does this mean we must sacrifice power to deal with multiple comparisons?
• Yes – if you have already designed your sample ignoring the MC problem.
• BUT… if you take the adjustment into account at the design stage, you can build the loss of power associated with MC adjustments into the sample size, to maintain power
• This means, of course, larger samples and more expensive studies (sorry about that, Phoebe)
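A rough sense of how much larger the sample must be can be had from the usual normal-approximation power formula, in which the required sample size for a fixed effect size is proportional to (z for the significance threshold + z for power) squared. The sketch below is an illustration, not a method from the presentation: the function name and the planning threshold `alpha_adjusted` are hypothetical, and the effective threshold under B-H depends on the data, so any value plugged in is a planning assumption.

```python
from statistics import NormalDist


def sample_size_inflation(alpha_adjusted, alpha=0.05, power=0.80):
    """Ratio of sample sizes needed to keep `power` at a stricter two-sided threshold.

    Uses n proportional to (z_{1 - alpha/2} + z_{power})**2 for a fixed effect size.
    """
    norm = NormalDist()
    z_power = norm.inv_cdf(power)
    base = norm.inv_cdf(1 - alpha / 2) + z_power             # conventional design
    strict = norm.inv_cdf(1 - alpha_adjusted / 2) + z_power  # adjusted design
    return (strict / base) ** 2


if __name__ == "__main__":
    # Hypothetical planning value: an effective per-test threshold of .005
    print(f"sample size multiplier: {sample_size_inflation(0.005):.2f}")
```

Under this approximation, planning for an effective threshold of .005 instead of .05 calls for a sample roughly 1.7 times as large to hold power at .80.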

For a copy of this presentation…
Send an e-mail to: Larry.Orr@comcast.net
