Guidelines for Addressing the Multiple Comparisons Problem in Impact Evaluations: Examples and Applications

The Multiple Comparisons Problem in IES Impact Evaluations: Guidelines and Applications • Peter Z. Schochet and John Deke • June 2009, IES Research Conference

What Is the Problem? • Multiple hypothesis tests are often conducted in impact studies • Outcomes • Subgroups • Treatment groups • Standard testing methods could yield: • Spurious significant impacts • Incorrect policy conclusions

Overview of Presentation • Background • Testing guidelines adopted by IES • Examples of their use by the RELs • New guidance on statistical methods for “between-domain” analyses

Background

Assume a Classical Hypothesis Testing Framework • Test H0j: Impact_j = 0 • Reject H0j if p-value of t-test < α = .05 • Chance of finding a spurious impact is 5 percent for each test alone

But If Tests Are Considered Together and No True Impacts…
Number of Tests(a)    Probability ≥1 t-test Is Statistically Significant
1                     .05
5                     .23
10                    .40
20                    .64
50                    .92
(a) Assumes independent tests
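The probabilities on the slide above follow from 1 – (1 – .05)^n for n independent tests when no true impacts exist. A minimal Python sketch, assuming independence as in the slide's footnote (the function name is illustrative, not from the presentation):

```python
def prob_any_spurious(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is significant
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 10, 20, 50):
    print(n, round(prob_any_spurious(n), 2))
```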

Impact Findings Can Be Misrepresented • Publication bias • A focus on “stars”

Adjustment Procedures Lower α Levels for Individual Tests • Methods control the “combined” error rate • Many available methods: • Bonferroni: Compare p-values to (.05 / # of tests) • Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953) • Resampling methods (Westfall and Young 1993) • Benjamini-Hochberg (1995)
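As an illustration of two of the listed methods, the sketch below computes Bonferroni and Holm adjusted p-values in Python. Comparing a Bonferroni-adjusted p-value to .05 is equivalent to comparing the raw p-value to (.05 / # of tests). The function names and the example p-values are hypothetical, not taken from the presentation.

```python
def bonferroni_adjust(pvals):
    """Bonferroni: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm_adjust(pvals):
    """Holm (1979) step-down: the k-th smallest p-value is multiplied by
    (m - k + 1), and adjusted values are forced to be non-decreasing."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


pvals = [0.01, 0.04, 0.03]  # hypothetical unadjusted p-values
print(bonferroni_adjust(pvals))
print(holm_adjust(pvals))
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why it is often described as uniformly more powerful while still controlling the combined error rate.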

These Methods Reduce Statistical Power: The Chances of Finding Real Effects
Simulated Statistical Power(a)
Number of Tests    Unadjusted    Bonferroni
5                  .80           .59
10                 .80           .50
20                 .80           .41
50                 .80           .31
(a) Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests
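The loss of power can be approximated analytically. The sketch below does not reproduce the simulation behind the slide; it assumes a two-sided test, a normal approximation to the t-distribution (reasonable with 1,000 treatments and 1,000 controls), and a true impact calibrated so that one unadjusted test has power .80. Under those assumptions the results come out close to the Bonferroni column above.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_sided(delta, alpha):
    """Power of a two-sided z-test when impact / standard error = delta."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(delta - crit) + Z.cdf(-delta - crit)


# Assumption: standardized impact chosen so that a single test at
# alpha = .05 has power of about .80.
DELTA = Z.inv_cdf(0.975) + Z.inv_cdf(0.80)


def bonferroni_power(n_tests, alpha=0.05):
    """Power of one test after the Bonferroni adjustment alpha / n_tests."""
    return power_two_sided(DELTA, alpha / n_tests)


for n in (5, 10, 20, 50):
    print(n, round(power_two_sided(DELTA, 0.05), 2), round(bonferroni_power(n), 2))
```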

Basic Testing Guidelines: Balance Type I and II Errors

Problem Should Be Addressed by First Structuring the Data • Structure will depend on the research questions, previous evidence, and theory • Adjustments should not be conducted blindly across all contrasts

The Plan Must Be Specified Up Front • To avoid “fishing” for findings • Study protocols should specify: • Data structure • Confirmatory analyses • Exploratory analyses • Testing strategy

Delineate Separate Outcome Domains • Based on a conceptual framework • Represent key clusters of constructs • Domain “items” are likely to measure the same underlying trait (have high correlations) • Test scores • Teacher practices • Student behavior

Testing Strategy: Both Confirmatory and Exploratory Components • Confirmatory component • Addresses central study hypotheses • Used to make overall decisions about program • Must adjust for multiple comparisons • Exploratory component • Identify impacts or relationships for future study • Findings should be regarded as preliminary

Focus of Confirmatory Analysis Is on Experimental Impacts • Focus is on key child outcomes, such as test scores • Targeted subgroups: e.g., ELL students • Some experimental impacts could be exploratory • Subgroups • Secondary child and teacher outcomes

Confirmatory Analysis Has Two Potential Parts • Domain-specific analysis • Between-domain analysis

Domain-Specific Analysis: Test Impacts for Outcomes as a Group • Create a composite domain outcome • Weighted average of standardized outcomes • Equal weights • Expert judgment • Predictive validity weights • Factor analysis weights • MANOVA not recommended • Conduct a t-test on the composite
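A minimal Python sketch of the composite approach follows. It makes several assumptions that are not in the slides: outcomes are standardized with the control group's mean and standard deviation, equal weights are the default, clustering is ignored, and the p-value uses a normal approximation instead of the t-distribution. The data and function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def composite_scores(rows, ref_rows, weights=None):
    """Weighted average of outcomes standardized against ref_rows.
    rows and ref_rows are lists of tuples, one value per outcome."""
    n_out = len(rows[0])
    weights = weights or [1 / n_out] * n_out  # equal weights by default
    mu = [mean([r[k] for r in ref_rows]) for k in range(n_out)]
    sd = [stdev([r[k] for r in ref_rows]) for k in range(n_out)]
    return [sum(w * (r[k] - mu[k]) / sd[k] for k, w in enumerate(weights))
            for r in rows]


def t_test(treat, ctrl):
    """Unequal-variance t-statistic with a normal-approximation p-value."""
    se = (stdev(treat) ** 2 / len(treat) + stdev(ctrl) ** 2 / len(ctrl)) ** 0.5
    t = (mean(treat) - mean(ctrl)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical data: two outcomes in one domain for four units per group.
treat = [(12, 55), (14, 60), (13, 58), (15, 63)]
ctrl = [(10, 50), (11, 52), (9, 49), (10, 53)]
comp_t = composite_scores(treat, ctrl)
comp_c = composite_scores(ctrl, ctrl)
print(t_test(comp_t, comp_c))
```

Testing one composite per domain replaces several correlated tests with a single test, so no within-domain adjustment is needed.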

Between-Domain Analysis: Test Impacts for Composites Across Domains • Are impacts significant in all domains? • No adjustments are needed • Are impacts significant in any domain? • Adjustments are needed • Discussed later

Application of Guidelines by the Regional Educational Labs

Basic Features of the REL Studies • 25 Randomized Control Trials • Single treatment and control groups • Testing diverse interventions • Typically grades K-8 • Fall-spring data collection, some longer • Collecting data on teachers and students

Each RCT Provided a Detailed Analysis Plan to IES. Each Plan Included Information on: • Confirmatory research questions • Confirmatory domains and outcomes • Within- and between-domain testing strategy • Study samples • Statistical power levels

Key Features of Confirmatory Domains • Student academic achievement domains are specified in all RCTs • Some domains pertain to: • Behavioral outcomes • A specific time period for longitudinal studies • Subgroups: ELL students

Most RCTs Have Specified Structured Research Questions • Most have fewer than 3 domains • Some have only 1 • Most domains have a small number of outcomes • Main between-domain question: “Are there positive impacts in any domain?”

Adjustment Methods for Between-Domain Confirmatory Analyses

Focus on Methods to Control the Familywise Error Rate (FWER) • FWER = Prob (find ≥1 significant impact given that no impacts truly exist) • Preferred over the false discovery rate developed by Benjamini-Hochberg (BH) • BH is a preponderance-of-evidence method • BH does not control the FDR for all forms of dependencies across test statistics

Consider Four FWER Adjustment Methods • Sidak: Exact adjustment when tests are independent • Bonferroni: Approximate adjustment when tests are independent • Generalized Tukey: Adjusts for correlated tests that follow a multivariate t-distribution • Resampling: Robust adjustment for correlated tests for general distributions

Main Research Questions • How do these four methods work? • Are the more complex methods likely to provide more powerful tests for between-domain analyses? • There are no single-routine statistical packages for the complex methods under clustered designs

Basic Setup for the Between-Domain Analysis • Assume N domain composites • Test whether any domain composite is statistically significant • Aim to control the FWER at α = .05 • All methods reduce the α level for individual tests: α* = .05/fact

Sidak • Uses the relation that the FWER = [1 – Pr(correctly failing to reject all N null hypotheses)] • For independent tests, FWER = 1 – (1 – α*)^N • Sidak picks α* so that FWER = 0.05 • For example, if N = 3: • α* ≈ 0.017 • fact = 0.05/α* = 2.949 (using the unrounded α*)

The Bonferroni Method Tends to Be More Conservative • α* = (.05 / N); fact = N • [Table: The Value of fact for the Sidak and Bonferroni]
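The values of fact for these two methods can be computed directly from the formulas on the two slides above. The Python sketch below assumes independent tests and FWER = .05; the function names are illustrative, and the values of N printed here are chosen for illustration and are not taken from the slide's table.

```python
def sidak_alpha(n_tests, fwer=0.05):
    """Per-test level that solves FWER = 1 - (1 - alpha*)^N."""
    return 1 - (1 - fwer) ** (1 / n_tests)


def bonferroni_alpha(n_tests, fwer=0.05):
    """Per-test level under Bonferroni: FWER / N."""
    return fwer / n_tests


def fact(alpha_star, fwer=0.05):
    """Factor by which the per-test level is reduced: FWER / alpha*."""
    return fwer / alpha_star


for n in (1, 2, 3, 5, 10):
    print(n,
          round(fact(sidak_alpha(n)), 3),
          round(fact(bonferroni_alpha(n)), 3))
```

Because the Sidak value of fact is always at or below N, Bonferroni uses a slightly stricter per-test level for every N above 1.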

Sidak and Bonferroni Are Likely To Be Conservative with Correlated Tests • Correlated tests can occur if: • Domain composites are correlated • Treatment effects are heterogeneous • Yields tests with lower power

Generalized Tukey and Resampling Methods Adjust for Correlated Tests • Let p_i be the p-value from test i • Both methods use the relation: FWER = Pr(min(p1, p2, p3, …, pN) ≤ .05 | H0 is true) • Both methods calculate FWER using the distribution of min(p1, p2, p3, …, pN) or max(t1, t2, t3, …, tN)
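To illustrate why the distribution of the maximum test statistic matters, the sketch below simulates max |z| across N correlated tests under the null and backs out the implied value of fact. This is not the generalized Tukey procedure itself: it assumes standard normal test statistics instead of a multivariate t-distribution, a single common correlation rho between every pair of tests, and Monte Carlo error from a finite number of draws. All names are hypothetical.

```python
import random
from statistics import NormalDist


def simulated_fact(n_tests, rho, alpha=0.05, n_draws=40_000, seed=1):
    """Estimate fact = alpha / alpha* from the simulated null
    distribution of max |z| across equally correlated tests."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    max_abs = []
    for _ in range(n_draws):
        common = rng.gauss(0, 1)  # shared component creates correlation rho
        zs = [a * common + b * rng.gauss(0, 1) for _ in range(n_tests)]
        max_abs.append(max(abs(z) for z in zs))
    max_abs.sort()
    # Critical value: approximate (1 - alpha) quantile of max |z|.
    crit = max_abs[int((1 - alpha) * n_draws)]
    alpha_star = 2 * (1 - NormalDist().cdf(crit))  # implied per-test level
    return alpha / alpha_star


for rho in (0.0, 0.2, 0.5, 0.8):
    print(rho, round(simulated_fact(3, rho), 2))
```

With rho = 0 the estimate should be near the Sidak value for N = 3, and it should shrink as the correlation grows, which is the sense in which Sidak and Bonferroni become conservative for correlated tests.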

Generalized Tukey • Assumes test statistics have multivariate t-distributions with known correlations • The MULTCOMP package in R can implement this adjustment (Hothorn, Bretz, Westfall 2008) • Multi-stage procedure that requires user inputs

Using the MULTCOMP Package • Inputs are a vector of impact estimates and the corresponding variance-covariance matrix • Challenge is to get cross-equation covariances of the impact estimates • One option: use the suest command in STATA, then copy resulting covariance matrix to R • Uses GEE rather than HLM to adjust for clustering

Resampling/Bootstrapping • The distribution of the maximum t-statistic can be estimated through resampling (Westfall and Young 1993) • Allows for general forms of correlations and outcome distributions • Resampling must be performed “under the null hypothesis”

Homoskedastic Bootstrap Algorithm • Calculate impacts and tstats using the original data • Define Y* as the residuals from these regressions • Repeat the following at least 10,000 times: • Randomly sample schools, with replacement, from Y* • Randomly assign sampled schools to treatment and control groups in the same proportion as in the original data • Calculate impacts and save the maximum absolute tstat • Adjusted p-values = proportion of maximum tstats that lie above the absolute value of the original tstats
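The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the authors' program: data are already aggregated to the school level, impacts are simple treatment-control differences in means with no covariates (so the residuals are deviations from each group's mean), and the t-statistic uses a pooled variance. Function names and the example data are hypothetical, and the demo uses fewer replications than the slide recommends to keep it fast.

```python
import random
from statistics import mean, variance


def t_stat(treat, ctrl):
    """Pooled-variance t-statistic for a treatment-control difference."""
    n_t, n_c = len(treat), len(ctrl)
    pooled = ((n_t - 1) * variance(treat)
              + (n_c - 1) * variance(ctrl)) / (n_t + n_c - 2)
    return (mean(treat) - mean(ctrl)) / (pooled * (1 / n_t + 1 / n_c)) ** 0.5


def column(rows, k):
    return [row[k] for row in rows]


def max_t_adjusted_pvalues(treat, ctrl, n_boot=10_000, seed=0):
    """treat, ctrl: lists of school-level tuples, one value per composite."""
    rng = random.Random(seed)
    n_out, n_t = len(treat[0]), len(treat)
    # Step 1: t-stats from the original data.
    t_orig = [t_stat(column(treat, k), column(ctrl, k)) for k in range(n_out)]
    # Step 2: residuals Y*, kept together by school so that correlations
    # across outcomes are preserved.
    resid = []
    for group in (treat, ctrl):
        means = [mean(column(group, k)) for k in range(n_out)]
        resid += [tuple(row[k] - means[k] for k in range(n_out))
                  for row in group]
    # Step 3: resample schools under the null; save the maximum |t|.
    max_ts = []
    for _ in range(n_boot):
        sample = rng.choices(resid, k=len(resid))  # with replacement
        # Draws are i.i.d., so splitting by position is a random
        # assignment in the original treatment-control proportion.
        fake_t, fake_c = sample[:n_t], sample[n_t:]
        max_ts.append(max(abs(t_stat(column(fake_t, k), column(fake_c, k)))
                          for k in range(n_out)))
    # Step 4: share of maximum |t| values above each original |t|.
    return [sum(m > abs(t) for m in max_ts) / n_boot for t in t_orig]


# Hypothetical school-level data: two composites, 20 schools per group.
data_rng = random.Random(42)
ctrl = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
treat = [(data_rng.gauss(0, 1), data_rng.gauss(1.5, 1)) for _ in range(20)]
adj = max_t_adjusted_pvalues(treat, ctrl, n_boot=2_000)
print(adj)
```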

Example of Resampling Method • Original tstats are 0.793 and 3.247 • Adjusted p-values are 0.89 and 0.00 • [Figure: distribution of the maximum tstat; region 1 = Max tstat > 0.793; region 2 = Max tstat > 3.247]

Implementation of Resampling • The MULTTEST procedure in SAS implements resampling, but only for non-clustered data • Simple approach: Aggregate data to the school level, and use MULTTEST • More complex approach: Write a program to implement the algorithm with clustering

Comparing Methods • Assume 3 composite domain outcomes with correlations of 0.20, 0.50, and 0.80 • Outcomes are normally distributed or heavily skewed normals (focus on skewed) • Four types of comparisons: • FWER • Values of fact • Minimum Detectable Effect Size (MDES) • “Goal Line” scenario

FWER Values Are Similar by Method Except With Large Correlations • [Table: FWER Values, by Method and Test Correlations]

Values of fact Are Similar by Method Except With Large Correlations • [Table: Values of fact, by Method and Test Correlations]

All Methods Yield Similar MDES • [Table: MDE Values, by Method and Test Correlations(a)] • (a) Assumes 60 schools, 60 students per school, R2 = 0.50, ICC = 0.15
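For the design parameters in the footnote above, an MDES can be approximated with the standard formula for a two-level design with school-level random assignment. The slides do not state the formula used, so the sketch below rests on assumptions: half the schools are assigned to treatment, R2 applies at both the school and student levels, target power is .80, the test is two-sided, and normal quantiles stand in for t quantiles. It is meant only to show how a Bonferroni adjustment enters the calculation through the per-test α level.

```python
from statistics import NormalDist


def mdes(n_schools=60, n_students=60, r2=0.50, icc=0.15,
         alpha=0.05, power=0.80, p_treat=0.5):
    """Approximate minimum detectable effect size in standard deviation
    units (assumptions: cluster random assignment, normal quantiles)."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    denom = p_treat * (1 - p_treat) * n_schools
    var = (icc * (1 - r2) / denom
           + (1 - icc) * (1 - r2) / (denom * n_students))
    return multiplier * var ** 0.5


n_domains = 3  # number of domain composites, as in the comparison slides
print(round(mdes(), 3))                        # no adjustment
print(round(mdes(alpha=0.05 / n_domains), 3))  # Bonferroni-adjusted level
```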

“Goal Line” Scenario: The Method Could Matter for Marginally Significant Impacts • [Table: Adjusted p-values, by Method and Test Correlations(a)] • (a) Assumes 60 schools, 60 students per school, R2 = 0.50, ICC = 0.15

Summary and Conclusions • Multiple comparisons guidelines: • Specify confirmatory analyses in study protocols • Delineate outcome domains • Conduct hypothesis tests on domain composites • RELs have implemented guidelines

Summary and Conclusions • Adjustments are needed for between-domain analyses • For calculating MDEs in the design stage, using the Bonferroni is sufficient • For estimating impacts, the more complex methods may be preferred in “goal-line situations” when test correlations are large

References and Contact Information • Guidelines for Multiple Testing in Impact Evaluations (Schochet 2008) • ies.ed.gov/ncee/pubs/20084018.asp • Resampling-Based Multiple Testing (Westfall and Young 1993; John Wiley and Sons) • pschochet@mathematica-mpr.com • jdeke@mathematica-mpr.com
