Multiple Testing in Impact Evaluations: Discussant Comments
IES Research Conference, June 11, 2008
Larry L. Orr
The Guidelines
• Just a couple of comments on Peter's presentation:
  • Sensible advice, masterfully presented – I urge you to adopt them
  • On the open issue of adjusting exploratory tests, I come down on the side of adjusting (with a lower significance threshold)
  • Focus of my remaining remarks: an issue on which Peter was (appropriately) agnostic – which adjustment should we use, and what is its effect on power?
• Disclaimer: I was a member of the working group that developed the guidelines Peter presented. My remarks today represent my own views, not those of the working group.
Different adjustments deal with different issues
• Many procedures (e.g., Bonferroni, Holm, Tukey-Kramer) control the family-wise error rate (FWER) – the probability of even one false positive across the whole family of tests. That's not usually what concerns us
• Typical situation: we have some set of estimates that are significant by conventional standards; we want to be assured that most of them reflect real effects – i.e., we're concerned with the false discovery rate
• Benjamini-Hochberg attempts to control the false discovery rate
The False Discovery Rate (FDR)
• FDR = proportion of significant estimates that are false positives (Type I errors)
• Example: suppose we have:
  • 20 statistically significant estimates
  • 8 true nonzero impacts
  • 12 false positives
• FDR = .6 (= 12/20)
• A low FDR is good
An Example
• Suppose we estimate impacts on 4 outcomes for each of the following subgroups:
  • Gender (2 groups)
  • Ethnicity (4 groups)
  • Region (4 groups)
  • School size (2 groups)
  • Central city/suburban/rural (3 groups)
  • SES (3 groups)
  • Number of siblings (4 groups)
  • Pretest score (3 groups)
• 25 subgroups × 4 outcomes = 100 estimates – not atypical for an education study
Example (cont'd)
• Suppose 10 estimates are significant at the .05 level
• That might reflect:
  • 10 true nonzero impacts
  • 9 true nonzero impacts and 1 false positive
  • 8 true nonzero impacts and 2 false positives
  • …
• Expected mix: 5 true nonzero impacts, 5 false positives – which would imply FDR = 50%
• But you can never know what the actual mix is, and you cannot know which estimate is which
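The expected-mix arithmetic can be checked directly. A minimal sketch in Python (my own illustration; the scenario takes the worst case in which essentially all 100 true impacts are zero, so chance alone at the .05 level yields about five significant results):

```python
# Expected mix among 10 significant results out of 100 tests at the .05 level,
# assuming (worst case) that essentially all 100 true impacts are zero.
n_tests = 100
alpha = 0.05
n_significant = 10

expected_false_positives = n_tests * alpha                        # ~5 by chance alone
expected_true_positives = n_significant - expected_false_positives
expected_fdr = expected_false_positives / n_significant           # 5/10 = .5

print(expected_false_positives, expected_true_positives, expected_fdr)
```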
Expected FDR as a function of the proportion of true impacts that are nonzero
(assumes no MC adjustment; significance level = .05; power = .80)
[Chart]
Implications
• When only 5% of all true impacts are nonzero, FDR = .5 – i.e., half of the significant estimates are likely to be Type I errors (and you cannot know which ones they are!)
• FDR remains quite high until the proportion of true impacts that are nonzero rises above 25%
• Only when > 50% of true impacts are nonzero is the FDR relatively low (< .06)
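These numbers follow from a simple expected-count argument. A sketch of my own reconstruction (not code from the talk; `pi_nonzero`, `alpha`, and `power` are my names): across many tests, the expected share of false positives is alpha × (1 − pi) and of true positives is power × pi, so the expected FDR is the ratio of the first to their sum.

```python
def expected_fdr(pi_nonzero, alpha=0.05, power=0.80):
    """Expected FDR when a fraction pi_nonzero of true impacts are nonzero.

    Per test: expected false positives = alpha * (1 - pi_nonzero);
              expected true positives  = power * pi_nonzero.
    """
    false_pos = alpha * (1 - pi_nonzero)
    true_pos = power * pi_nonzero
    return false_pos / (false_pos + true_pos)

print(round(expected_fdr(0.05), 2))   # 0.54 -- about half are false discoveries
print(round(expected_fdr(0.25), 2))   # still sizable
print(round(expected_fdr(0.50), 3))   # below .06
```

With alpha = .05 and power = .80, this reproduces the slide's figures: FDR about .5 when 5% of impacts are nonzero, and under .06 once more than half are.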
Simulations
• Real education data from the ECLS-K Demo
• 4 outcomes: reading, math, attendance, peers
• 25 subgroups (see earlier list)
• Imputed zero or nonzero (ES = .2) impacts for varying proportions of subgroups
• Measured FDR with and without the B-H correction
• 500 replications of 100 estimates
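For reference, the Benjamini-Hochberg correction applied in these simulations is a step-up procedure: sort the m p-values, find the largest rank k with p_(k) ≤ (k/m)·alpha, and reject the k smallest. A minimal sketch (my own illustration, not the simulation code):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans: True where the null is rejected under B-H."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    # Step up: find the largest rank k with p_(k) <= (k/m) * alpha.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k = rank
    # Reject the k smallest p-values.
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

print(benjamini_hochberg([0.01, 0.02, 0.03, 0.50]))  # [True, True, True, False]
```

Note the step-up logic: a p-value above its own threshold can still be rejected if some larger p-value clears its (more generous) threshold, which is what distinguishes B-H from a simple per-test cutoff.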
Simulation results: FDR as a function of the true zero-impact rate, unadjusted vs. B-H adjusted
[Chart] Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration
Implications
• B-H does indeed control the FDR in real-world education data (at least, in these data)
• Even at very low nonzero impact rates, the FDR is well below 5%
• This comes at a price, however…
The effect of the B-H adjustment on Type II errors
[Chart: adjusted vs. unadjusted Type II error rates] Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration
The cost of adjusting for multiple comparisons within a fixed sample
• For a given sample, reducing the chance of Type I errors (false positives) increases the chance of Type II errors (missing true effects)
• In this case, at very low nonzero impact rates, the Type II error rate for a typical subgroup (the probability of missing a true effect when there is one) went from .28 to .70 (i.e., power fell from .72 to .30!)
• At high nonzero impact rates, the power loss is much smaller – when the nonzero impact rate is 95%, the adjustment increases the Type II error rate only from .27 to .33 (i.e., power falls from .73 to .67)
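The qualitative pattern – a large power loss under B-H when few impacts are nonzero, a small one when most are – can be reproduced with a toy Monte Carlo. This is a sketch under my own assumptions (independent two-sided z-tests with a standardized mean shift of 2.8, which gives roughly .80 unadjusted power at the .05 level), not the ECLS-K simulation:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(pi_nonzero, delta=2.8, m=100, reps=500, alpha=0.05, seed=1):
    """Compare power of unadjusted .05 tests vs. B-H on m independent z-tests,
    a fraction pi_nonzero of which have a true standardized shift delta."""
    rng = random.Random(seed)
    n_nonzero = round(m * pi_nonzero)
    hits_unadj = hits_bh = 0
    for _ in range(reps):
        truth = [True] * n_nonzero + [False] * (m - n_nonzero)
        pvals = [two_sided_p(rng.gauss(delta if t else 0.0, 1.0)) for t in truth]
        # Unadjusted: reject whenever p <= alpha.
        hits_unadj += sum(t and p <= alpha for t, p in zip(truth, pvals))
        # B-H step-up: reject the k smallest p-values, where k is the largest
        # rank with p_(k) <= (k/m) * alpha.
        order = sorted(range(m), key=lambda i: pvals[i])
        k = 0
        for rank, idx in enumerate(order, start=1):
            if pvals[idx] <= rank / m * alpha:
                k = rank
        rejected = set(order[:k])
        hits_bh += sum(1 for i in range(m) if truth[i] and i in rejected)
    total_true = reps * n_nonzero
    return hits_unadj / total_true, hits_bh / total_true

power_unadj, power_bh = simulate(0.05)   # only 5% of impacts nonzero
print(power_unadj, power_bh)             # B-H power is far below unadjusted power
```

Because every B-H threshold is at most alpha, the adjusted procedure can never reject more tests than the unadjusted one, so its power is never higher; the gap is largest when true effects are rare.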
Does this mean we must sacrifice power to deal with multiple comparisons?
• Yes – if you have already designed your sample ignoring the MC problem
• BUT… if you take the adjustment into account at the design stage, you can build the power loss associated with MC adjustments into the sample-size calculation and maintain power
• This means, of course, larger samples and more expensive studies (sorry about that, Phoebe)
For a copy of this presentation…
Send an e-mail to: Larry.Orr@comcast.net