Comparing Results from RCTs and Quasi-Experiments that Share the Same Intervention Group
Thomas D. Cook, Northwestern University
Why RCTs are to be preferred
• Statistical theory regarding expectations
• Relative advantage over other bias-free methods, e.g., regression discontinuity (RDD) and instrumental variables (IV)
• Ad hoc theory and research on implementation
• Privileged credibility in science and policy
• The claim that non-experimental alternatives routinely fail to produce similar causal estimates
Dissimilar Estimates
• Come from empirical studies comparing experimental and non-experimental results on the same topic
• The strongest are within-study comparisons
• These take an experiment, discard the control group, and substitute a non-equivalent comparison group
• Because the intervention group is held constant, this is a test of the different control groups
The Within-Study Comparison Literature
• 20 studies, mostly in job training. For the 14 in job training, reviews contend:
• (1) No study produces a clearly similar causal estimate, including Dehejia & Wahba
• (2) Some design and analysis features are associated with less bias, but bias remains
• (3) The average of the experiments is not different from the average of the non-experiments, but be careful here and note that the variance of the effect sizes differs by design type
Brief History of the Literature on Within-Study Comparisons
• LaLonde; Fraker & Maynard
• 12 subsequent studies in job training
• Extension to examples in education in the USA and social welfare in Mexico, not yet reviewed
Policy Consequences
• Department of Labor, as early as 1985
• Health and Human Services, job training and beyond
• National Academy of Sciences
• Institute of Education Sciences
• Do within-study comparisons deserve all this?
We will:
• Deconstruct "non-experiment" and compare experimental estimates to
• 1. Regression-discontinuity estimates
• 2. Estimates from a difference-in-differences (fixed effects) design
• Ask: Is the general conclusion about the inadequacy of non-experiments true across at least these different kinds of non-experiment?
Criteria of a Good Within-Study Comparison Design
1. Variation in mode of assignment: random or not
2. No third variables correlated with both assignment and outcome, e.g., measurement
3. Randomized experiment properly executed
4. Quasi-experiment is a good instance of its "type"
5. Both design types estimate the same causal entity, e.g., LATE in regression-discontinuity
6. Acceptable criteria of correspondence between design types: effect sizes seem similar; do not formally differ; patterns of statistical significance do not differ, etc.
Three Known Within-Study Comparisons of Experiments and R-D
• Aiken, West et al. (1998): R-D study; experiment; LATE; analysis; results
• Buddelmeyer & Skoufias (2003): R-D study; experiment; LATE; analysis; results
• Black, Galdo & Smith (2005): R-D study; experiment; LATE; analysis; results
Comments on R-D vs. Experiments
• Cumulative correspondence demonstrated over three cases
• Is this theoretically trivial, though?
• Is it pragmatically significant, given variation in implementation in both the experiment and the R-D?
• As an "existence proof", it belies the over-generalized argument that non-experiments do not work
• As a practical issue, does it mean we should support RDD when treatments are assigned by need or merit?
• Emboldens us to deconstruct the non-experiment further
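To make the estimand concrete, here is a minimal sketch, not drawn from any of the three studies, of how an RDD LATE at the cutoff can be estimated with a local linear regression. The simulated data, variable names, and bandwidth are illustrative assumptions.

```python
# Minimal RDD sketch (illustrative only): local linear estimate of the LATE at
# the cutoff. The data, variable names, and bandwidth are assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
score = rng.uniform(-1, 1, n)            # assignment variable, cutoff at 0
treated = (score >= 0).astype(float)     # treatment follows the cutoff rule
outcome = 0.5 * score + 0.3 * treated + rng.normal(0, 0.2, n)

h = 0.25                                 # illustrative bandwidth around the cutoff
mask = np.abs(score) <= h
X = sm.add_constant(np.column_stack([
    treated[mask],                       # jump at the cutoff = LATE estimate
    score[mask],                         # linear trend in the assignment variable
    treated[mask] * score[mask],         # allow different slopes on each side
]))
fit = sm.OLS(outcome[mask], X).fit(cov_type="HC1")
print("Estimated LATE at the cutoff:", round(fit.params[1], 3))
```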
Experiment vs. Difference-in-Differences
• The most frequent non-experimental design by far across many fields of study
• Also modal in within-study comparisons in job training, and so it provides the major basis for the past opinion that non-experiments are routinely biased
• We review: 3 studies with comparable estimates
• 14 job training studies with dissimilar estimates
• 2 education examples with dissimilar estimates
Bloom et al.
• Bloom et al. (2002; 2005): job training is the topic
• Experiment: 11 sites, 8 pre-intervention earnings waves, 20 post
• Non-experiment: 5 within-state comparisons, 4 within-city; all comparison subjects enrolled in welfare
• We present only the control/comparison contrast because the treatment time series is a constant
The issue is:
• Is there an overall difference between control groups that were randomly or non-randomly formed?
• If yes, can statistical controls (OLS, IV including Heckman models, propensity scores, random growth models) eliminate this difference?
• Tested 10 modes of adjustment, but only one was longitudinal
• Why we treat this as difference-in-differences rather than as an interrupted time series (ITS)
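As context for the design label, here is a minimal two-period difference-in-differences sketch. It is an illustration under assumed data, column names, and a built-in true effect of 0.4, not Bloom et al.'s analysis.

```python
# Minimal difference-in-differences sketch (illustrative, not Bloom et al.'s
# analysis). The data, column names, and true effect of 0.4 are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),   # 1 = intervention group, 0 = comparison
    "post": np.tile([0, 1], n // 2),    # 0 = pre period, 1 = post period
})
# Outcome with a group difference, a common time trend, and a true effect of 0.4
df["earnings"] = (2.0 + 0.5 * df["treated"] + 0.3 * df["post"]
                  + 0.4 * df["treated"] * df["post"] + rng.normal(0, 1, n))

# The coefficient on the treated x post interaction is the d-in-d estimate
fit = smf.ols("earnings ~ treated * post", data=df).fit(cov_type="HC1")
print("D-in-D estimate:", round(fit.params["treated:post"], 3))
```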
Implications of Bloom et al.
• Averaging across the 4 within-city sites showed no difference; this also holds if the 5th, between-city, site is added
• Selecting the comparison groups in this way obviated the need for statistical adjustments for non-equivalence; design alone did it
• Bloom et al. tested the differential effects of statistical adjustments in between-state comparisons where there were large differences
• None worked, and none did better than OLS
Aiken et al. (1998) Revisited
• The experiment: remember that the sample was selected on a narrow range of test score values
• Quasi-experiment: sample selection limited to students who registered late or could not be found in the summer but who scored in the same range as the experiment
• No differences between experiment and non-experiment on test scores or pretest writing measures
• Measurement identical in experiment and non-experiment
Results for Aiken et al.
• Standardized writing test: effect sizes of .59 and .57, both significant
• Rated essay: .06 and .16, both non-significant
• High degree of comparability in statistical test results and effect size estimates
Implications of Aiken et al.
• Like Bloom et al., careful selection of the sample achieves close correspondence on important observables
• Little need for statistical adjustment; non-equivalence is limited only to unobservables
• Statistical adjustment is minor compared to the use of sampling design to construct initial correspondence
What happens if there is an initial selection difference? • Shadish, Luellen & Clark (2006)
Figure 1: Design of Shadish et al. (2006)
• N = 445 undergraduate psychology students: pretests, then random assignment to one of two designs
• Randomized experiment (n = 235): randomly assigned to mathematics training (n = 119) or vocabulary training (n = 116)
• Nonrandomized experiment (n = 210): self-selected into mathematics training (n = 79) or vocabulary training (n = 131)
• All participants measured on both mathematics and vocabulary outcomes
What's special in Shadish et al.
• Variation in mode of assignment
• Most other factors held constant through the first random assignment: population, measures, activity patterns
• Good experiment? Pretests; short time frame and attrition; no chance of contamination
• Good quasi-experiment? Selection process; quality of measurement; analysis and the role of Rosenbaum
Implications of Shadish et al.
• Here the sampling design produced non-equivalent groups on observables, unlike Bloom
• Here the statistical adjustments worked when computed as propensity scores
• However, there was a large overlap in experimental and non-experimental scores due to the first-stage random assignment, making the propensity scores more valid
• Extensive, unusually valid measurement of a relatively simple, though not homogeneous, selection process
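For readers unfamiliar with the mechanics, here is a minimal sketch of propensity score adjustment. It uses inverse-probability weighting on simulated data with hypothetical covariate names; it is not Shadish et al.'s actual procedure.

```python
# Minimal propensity score sketch (hypothetical data and covariates; not the
# Shadish et al. analysis): model selection into training, then adjust the
# treatment comparison with inverse-probability weights.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "math_pretest": rng.normal(0, 1, n),
    "vocab_pretest": rng.normal(0, 1, n),
})
# Self-selection into mathematics training depends on the observed pretests
p_select = 1 / (1 + np.exp(-(0.8 * df["math_pretest"] - 0.5 * df["vocab_pretest"])))
df["math_training"] = rng.binomial(1, p_select)
df["math_outcome"] = (1.0 * df["math_training"] + 0.7 * df["math_pretest"]
                      + rng.normal(0, 1, n))

# Step 1: propensity score model on the observed selection covariates
ps = smf.logit("math_training ~ math_pretest + vocab_pretest", data=df).fit(disp=0)
df["pscore"] = ps.predict(df)

# Step 2: inverse-probability weights, then a weighted outcome regression
df["ipw"] = np.where(df["math_training"] == 1, 1 / df["pscore"], 1 / (1 - df["pscore"]))
adj = smf.wls("math_outcome ~ math_training", data=df, weights=df["ipw"]).fit(cov_type="HC1")
print("IPW-adjusted treatment effect:", round(adj.params["math_training"], 3))
```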
Limitations of Shadish et al.
• What about more complex settings?
• What about more complex selection processes?
• What about OLS and other analyses?
• This is not a unique test of propensity scores!
Examining Within-Study Comparison Studies with Different Results
• The bulk of the job training comparisons
• Two examples from education
Earliest Job Training Studies: Adding to the Smith/Todd Critique
• Mode of assignment clearly varied
• We assume the RCT was implemented reasonably well
• But third-variable irrelevancies were not controlled, especially location and measurement, given the dependence on matching from extant data sets
• Large initial differences between randomly and non-randomly formed comparison groups
• Reliance on statistical adjustment, rather than initial design, to reduce selection bias
Agodini & Dynarski (2004)
• Drop-out prevention experiment in 16 middle and high schools
• Individual students, likely dropouts, were randomly assigned within schools: 16 replicates
• Quasi-experiment: students matched from 2 quite different sources, middle school controls in another study and national NELS data
• Matching on individual and school demographic factors
• 4 outcomes examined, and so also in the non-experiment
• 128 propensity scores (16 x 4 x 2), computed basically from demographic background variables
Results
• Balanced matches were obtained in only 29 of 128 cases
• Why was quality matching so rare? In the non-experiment, the groups hardly overlap: the treatment group spans middle and high schools, but the comparisons are middle schools only or come from a very non-local national data set
• Mixed pattern of outcome correspondence in the 29 cases with computable propensity scores. Not good
• OLS did as well as propensity scores
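Balance and overlap failures like these can be made concrete with a standard diagnostic. The sketch below, on simulated data with hypothetical covariate names and an assumed 0.1 rule of thumb, shows the kind of check involved; it is not the authors' code.

```python
# Minimal balance-check sketch (hypothetical data, covariates, and threshold;
# not Agodini & Dynarski's code): standardized mean differences between the
# treatment and comparison groups as a covariate balance diagnostic.
import numpy as np
import pandas as pd

def standardized_difference(x_treat: pd.Series, x_comp: pd.Series) -> float:
    """Standardized mean difference, a common covariate balance diagnostic."""
    pooled_sd = np.sqrt((x_treat.var() + x_comp.var()) / 2)
    return (x_treat.mean() - x_comp.mean()) / pooled_sd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "treated": np.repeat([1, 0], [200, 300]),
    "pct_free_lunch": np.concatenate([rng.normal(60, 10, 200), rng.normal(45, 10, 300)]),
    "prior_achievement": np.concatenate([rng.normal(-0.3, 1, 200), rng.normal(0.2, 1, 300)]),
})

for cov in ["pct_free_lunch", "prior_achievement"]:
    d = standardized_difference(df.loc[df.treated == 1, cov], df.loc[df.treated == 0, cov])
    # Assumed rule of thumb: |d| < 0.1 indicates acceptable balance
    label = "balanced" if abs(d) < 0.1 else "imbalanced"
    print(f"{cov}: standardized difference = {d:.2f} ({label})")
```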
Critique
• Who would design a quasi-experiment this way? Is a mediocre non-experiment being compared to a good experiment?
• Alternative designs might have been:
• 1. Regression-discontinuity
• 2. Local comparison schools, with the same selection mechanism used to select similar comparison students
• 3. Use of multi-year prior achievement data
Wilde & Hollister (2005)
• The experiment: reducing class size in 11 sites; no pretest used at the individual level
• Quasi-experimental design: individuals in reduced classes matched to individual cases from the other 10 sites
• Propensity scores, mostly demographic
• The analysis treats each site as a separate experiment
• And so 11 replicates comparing an experimental and a non-experimental effect size
Results
• Low level of correspondence between experimental and non-experimental effect sizes across the 11 sites
• So for each site it makes a causal difference whether the experiment or the quasi-experiment is used
• When aggregated across sites, the results are closer: experimental = .68; non-experimental = 1.07
• But they still reliably differ
Critique
• Who would design a quasi-experiment on this topic without a pretest on the same scale as the outcome?
• Who would design it with these controls?
• Instead, select controls from one or more schools matched on prior achievement history
• Again, a good experiment is being compared to a bad quasi-experiment
• Who would treat this as 11 separate experiments rather than as a more stable pooled experiment? Even in the authors' own analysis, the pooled results are much more congruent
The hypothesis is that...
• The job training and educational examples that produce different conclusions from the experiment are examples of poor quasi-experimental design
• To compare a good experiment to a poor quasi-experiment is to confound a design type with the quality of its implementation, a logical fallacy
• But I reach this conclusion ex post facto and knowing the randomized experimental results in advance
Big Conclusions:
• R-D has given results not much different from the experiment in three of three cases
• Simpler quasi-experiments tend to give the same results as the experiment if: (a) there is population matching in the sampling design, as in the Bloom and Aiken studies, or (b) the selection model is carefully conceptualized and measured, as in Shadish et al.
What I am not Concluding:
• That a well-designed quasi-experiment is as good as an experiment. They differ in:
• Number and transparency of assumptions
• Statistical power
• Knowledge of implementation
• Social and political acceptance
• If you have the option, do an experiment, because you can rarely put right by statistics what you have messed up by design
What I am suggesting you consider:
• Whether this should be a unit on RCTs or on quality causal studies
• Whether you want to do RDD studies in cases where an experiment is not possible because resources are distributed by other means
• Whether you want to do quasi-experiments when group matching on the pretest is possible, as in many school-level interventions
More contentiously, consider quasi-experiments if:
• The selection process can be conceptualized, observed, and measured very well
• An abbreviated ITS analysis is possible, as in Bloom et al.
• The instinct to avoid quasi-experiments is sound, but it reduces the scope of the causal issues that can be examined
Results: Aiken et al.
• Pretest values on SAT/CAT and 2 writing measures
• Measurement framework the same
• Pretest ACTs and writing: no significant differences between experiment and non-experiment
• OLS tests
• Results for the writing test = .59 and .57, significant
• Results for the essay = .06 and .16, non-significant
Bloom et al. Revisited
• Analysis at the individual level
• Within-city, within welfare-to-work center, same measurement design
• Absolute bias: yes
• Average bias: none across the 5 within-state sites, even without statistical tests
• Average bias limited to a small site and the non-within-city site (Detroit vs. Grand Rapids)
Correspondence Criteria
• Given random error, exact agreement is not expected
• Shared pattern of statistical significance relative to zero: 68%
• The two effect sizes are not statistically different
• "Comparable" magnitude estimates
• One as a percentage of the other
• Indulgence, common sense, and a mix of criteria
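One of these criteria, whether two effect sizes are statistically different, can be checked with a simple z-test on their difference. The sketch below is a generic illustration with hypothetical numbers, not the procedure used in any of the reviewed studies.

```python
# Minimal sketch of one correspondence criterion (generic illustration with
# hypothetical inputs): a two-sided z-test on the difference between an
# experimental and a non-experimental standardized effect size.
from math import sqrt
from scipy.stats import norm

def effect_sizes_differ(es_exp, se_exp, es_nonexp, se_nonexp, alpha=0.05):
    """Return (z, p, differ) for two independent effect size estimates."""
    diff = es_exp - es_nonexp
    se_diff = sqrt(se_exp ** 2 + se_nonexp ** 2)   # SE of the difference
    z = diff / se_diff
    p = 2 * (1 - norm.cdf(abs(z)))                 # two-sided p-value
    return z, p, p < alpha

# Hypothetical example: ES .59 (SE .08) in the experiment vs. .57 (SE .10)
z, p, differ = effect_sizes_differ(0.59, 0.08, 0.57, 0.10)
print(f"z = {z:.2f}, p = {p:.3f}, statistically different: {differ}")
```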
Our Research Issues
• Deconstructing "non-experiment": do experimental and non-experimental effect sizes correspond differently for R-D, for ITS, and for simple non-equivalent designs?
• How far can we generalize results about the invalidity of non-experiments beyond job training?
• Do these within-study comparison studies bear the weight ascribed to them in evaluation policy at DoL and IES?