Beyond p-Values: Characterizing Education Intervention Effects in Meaningful Ways

Mark Lipsey, Kelly Puzio, Cathy Yun, Michael Hebert, Kasia Steinka-Fry, Mikel Cole, Megan Roberts, Karen Anthony, Matthew Busick (Vanderbilt University), with Howard Bloom, Carolyn Hill, & Alison Black

IES Research Conference, Washington, DC, June 2010
Intervention research model

Compare a treatment (T) sample with a control (C) sample on an education outcome measure. The intervention effect that results from this comparison is described by:
• Means on the outcome measure for the T and C samples, and the difference between those means
• A p-value for the statistical significance of the difference between the means
Problem to be addressed

The native statistical findings that represent the effect of an intervention on an education outcome often provide little insight into the nature, magnitude, or practical significance of the effect. Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful.
Example

• Intervention: vocabulary-building program
• Samples: fifth graders receiving (T) and not receiving (C) the program
• Outcome: CAT5 reading achievement test
• Mean score for T: 718
• Mean score for C: 703
• Difference between T and C means: 15 points
• p-value: <.05 [Note: not an indicator of the magnitude of the effect!]

Questions: Is this a big effect or a trivial one? Do the students read a lot better now, or just a little better? If they were poor readers before, is this a big enough effect to now make them proficient readers? If they were behind their peers, have they now caught up? Someone intimately familiar with CAT5 scoring may be able to look at the means and answer such questions, but most of us haven't a clue.
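To see why the p-value says nothing about magnitude, consider a minimal simulation sketch in Python. The standard deviation of 40 and the sample sizes are hypothetical, not from the study; the point is that the same 15-point difference can be nonsignificant or highly significant depending only on how many students were tested.

```python
# Minimal sketch: the same 15-point T-C difference produces very different
# p-values as sample size grows, so p is not a measure of effect magnitude.
# The SD of 40 and the sample sizes are hypothetical assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (10, 100, 1000):
    t_scores = rng.normal(loc=718, scale=40, size=n)  # treatment sample
    c_scores = rng.normal(loc=703, scale=40, size=n)  # control sample
    _, p = stats.ttest_ind(t_scores, c_scores)
    print(f"n per group = {n:4d}: "
          f"observed difference = {t_scores.mean() - c_scores.mean():5.1f}, "
          f"p = {p:.4f}")
```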
Two approaches to review here

• Descriptive representations of intervention effects: translations of the native statistical results into forms that are more readily understood
• Practical significance: assessing the magnitude of intervention effects in relation to criteria that have recognized value in the context of application
Representation in terms of the original metric

Often inherently meaningful, e.g.:
• proportion of days a student was absent
• number of suspensions or expulsions
• proportion of assignments completed

Also consider covariate-adjusted means (to account for baseline differences and attrition) and pretest baselines with differential pre-post change (example on the next slide).
Fuller picture with pretest baseline

Middle school students in a conflict resolution intervention; surveys at the beginning and end of the school year measured self-reported interpersonal aggression.

[Figure: Pre-post change differentials that result in the same posttest difference]
Effect size

Typically the standardized mean difference: ES_d = Δ/σ, where Δ is the difference between the T and C means and σ is the standard deviation on which the difference is standardized.
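As a concrete illustration, here is a minimal Python sketch of the computation, standardizing on the pooled within-group standard deviation (one of several possible choices of σ, as the notes below discuss):

```python
import numpy as np

def effect_size_d(t_scores, c_scores):
    """Standardized mean difference: ES_d = (mean_T - mean_C) / pooled SD."""
    t = np.asarray(t_scores, dtype=float)
    c = np.asarray(c_scores, dtype=float)
    # Pool the within-group variances, weighting by degrees of freedom.
    pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) \
                 / (len(t) + len(c) - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)
```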
Utility of effect size

• Useful for comparing effects across studies where the 'same' outcome is measured differently
• Somewhat meaningful to researchers
• But not very intuitive; provides little insight into the nature and magnitude of an effect, especially for nonresearchers
• Often reported in relation to Cohen's guidelines for 'small,' 'medium,' and 'large': a BAD IDEA (see the benchmark slide later)
Notes and quirks about effect sizes

• Better computed with covariate-adjusted means
• Don't adjust the variance/SD: that would undermine the concept of standardization
• Issue of the variance on which to standardize
• Effect sizes standardized on a variance/SD other than that between individuals
• Effect sizes from multilevel analysis results
Proportions of T and C samples above or below a threshold score
Cohen's U3 overlap index

[Figure: overlapping T and C distributions separated by an effect size of .73σ; 50% of the C sample scores above the C mean, while 77% of the T sample scores above the C mean. Adapted from Redfield & Rousseau, 1981]
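Under the normality assumption the figure depicts, U3 is simply the normal CDF evaluated at the effect size; a one-line sketch:

```python
# Sketch: Cohen's U3 under normal, equal-variance assumptions.
from scipy.stats import norm

def cohen_u3(d):
    """Proportion of the T distribution scoring above the C mean."""
    return norm.cdf(d)

print(cohen_u3(0.73))  # ~0.77, matching the 77% shown in the figure
```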
Rosenthal & Rubin's binomial effect size display (BESD)

[Figure: BESD for d = .80]
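The BESD recasts an effect size as a difference in 'success' rates split around 50%. A sketch of the standard conversion (d to the point-biserial r, then r to rates):

```python
# Sketch: Rosenthal & Rubin's binomial effect size display (BESD).
import math

def besd(d):
    """Convert effect size d to hypothetical C and T 'success' rates."""
    r = d / math.sqrt(d**2 + 4)     # point-biserial r for equal-size groups
    return 0.5 - r / 2, 0.5 + r / 2

print(besd(0.80))  # ~ (0.31, 0.69): 31% vs. 69% 'success'
```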
Options for threshold values

• Mean of the control sample (U3)
• Grand mean of the combined T and C samples (BESD)
• Predefined performance threshold (e.g., NAEP)

Other possibilities:
• Mean of a norming sample, e.g., a standard score of 100 on the PPVT
• Mean of a reference group with a 'gap,' e.g., students who don't qualify for FRPL, majority students
• Study-determined threshold, e.g., the score at which teachers see behavior as problematic
• Target value, e.g., the achievement gain needed for AYP
• Any other identifiable score on the measure that has interpretable meaning within the context of the intervention study
Conversion to grade equivalent (and age equivalent) scores

[Figure: Mean reading grade equivalent (GE) scores of Success for All (SFA) and control samples, from Slavin et al., 1996]
Characteristics and quirks of grade equivalent scores

• Provided (or not) by the test developer [Note: could be developed by the researcher for the context of an intervention study]
• Vary from X.0 to X.9 over the 9-month school year
• Not criterion-referenced; estimated from an empirical norming sample
• Imputed where norming data are thin, especially for students outside the grade range
• Nonlinear relationship to test scores: a given GE difference in the early grades corresponds to a larger score difference than in later grades, but there is greater within-grade variation in the later grades
Practical Significance: Criterion Frameworks for Assessing the Magnitude of Intervention Effects
Practical significance must be judged in reference to some external standard relevant to the intervention context. E.g., compare the effect found in a study with:
• Effects others have found on similar measures with similar interventions
• Normative expectations for change
• Policy-relevant performance gaps
• Intervention costs (not discussed here)
Cohen's rules of thumb for interpreting effect size: normative but overly broad

Cohen: small = 0.20σ, medium = 0.50σ, large = 0.80σ
[Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Hillsdale, NJ: Lawrence Erlbaum.]

Lipsey: small = 0.15σ, medium = 0.45σ, large = 0.90σ
[Lipsey, Mark W. (1990). Design Sensitivity: Statistical Power for Experimental Research. Newbury Park, CA: Sage Publications.]
Effect sizes for achievement from random assignment studies of education interventions
• 124 random assignment studies
• 181 independent subject samples
• 831 effect size estimates
[Figure: Achievement effect sizes by grade level and type of achievement test]
Normative expectations for change: estimating annual gains in effect size from national norming samples for standardized tests
• Up to seven tests were used for reading, math, science, and social science
• The mean and standard deviation of scale scores for each grade were obtained from test manuals
• The standardized mean difference across succeeding grades was computed (a sketch of this computation follows)
• These results were averaged across tests and weighted according to Hedges (1982)
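The grade-to-grade computation referenced above can be sketched as follows; the scale scores shown are hypothetical, not taken from any of the test manuals:

```python
# Sketch: effect-size gain between succeeding grades from norming data.
import math

def grade_transition_es(mean_lo, sd_lo, mean_hi, sd_hi):
    """Standardized mean difference between adjacent grades' scale scores."""
    pooled_sd = math.sqrt((sd_lo**2 + sd_hi**2) / 2)  # simple pooling
    return (mean_hi - mean_lo) / pooled_sd

# Hypothetical scale scores for two adjacent grades on one test:
print(grade_transition_es(mean_lo=620, sd_lo=40, mean_hi=636, sd_hi=42))
# The study then averages such estimates across tests, weighted per
# Hedges (1982); that step is omitted from this sketch.
```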
Annual reading growth

Grade transition     Growth effect size
----------------------------------------
K - 1                1.52
1 - 2                0.97
2 - 3                0.60
3 - 4                0.36
4 - 5                0.40
5 - 6                0.32
6 - 7                0.23
7 - 8                0.26
8 - 9                0.24
9 - 10               0.19
10 - 11              0.19
11 - 12              0.06

Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie, MAT8, Terra Nova CAT, and SAT10.
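One way to use this table is to express an intervention effect as a fraction of a typical year's growth for the relevant grade. A small sketch (the intervention effect size of 0.15 is hypothetical):

```python
# Sketch: benchmark a (hypothetical) intervention effect against the
# normative annual reading growth from the table above.
annual_growth_es = {"4-5": 0.40, "7-8": 0.26}  # effect-size units, from table

intervention_es = 0.15                          # hypothetical study result
share = intervention_es / annual_growth_es["4-5"]
print(f"Effect equals about {share:.0%} of a typical year's growth")  # ~38%
```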
Policy-relevant demographic performance gaps
• Effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups
• Effect size gaps for groups may vary across grades, years, tests, and districts
Policy-relevant performance gaps between "average" and "weak" schools

Main idea:
• What is the performance gap (in effect size) for the same types of students in different schools?

Approach (sketched in code below):
• Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status
• Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution
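A hedged sketch of what such a model might look like in Python with statsmodels; the column names and the choice of percentiles are illustrative assumptions, not details from the underlying study:

```python
# Sketch: estimate adjusted school effects via student-level regression with
# school fixed effects, then compare schools at different percentiles.
# Column names (score, pretest, frpl, ...) and the percentile choices are
# assumptions for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def school_performance_gap(df: pd.DataFrame, lo=0.10, hi=0.50):
    # Control for race/ethnicity, prior achievement, gender, overage for
    # grade, and free lunch status; C(school) adds school fixed effects.
    fit = smf.ols(
        "score ~ pretest + C(race) + female + overage + frpl + C(school)",
        data=df,
    ).fit()
    # Collect the estimated school fixed effects (reference school = 0).
    school_effects = np.array(
        [v for k, v in fit.params.items() if k.startswith("C(school)")]
    )
    school_effects = np.append(school_effects, 0.0)  # omitted reference school
    gap = np.quantile(school_effects, hi) - np.quantile(school_effects, lo)
    # Express the gap in effect-size units of the student-level outcome.
    return gap / df["score"].std(ddof=1)

# Usage: school_performance_gap(student_df) compares a "weak" (10th
# percentile) school with an "average" (median) school.
```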
In conclusion…
• The native statistical form for intervention effects provides little understanding of their nature or magnitude
• Translating the effects into a more descriptive and intuitive form makes them easier for practitioners, policymakers, and researchers to understand and assess
• There are a number of easily applied translations that could be routinely used in reporting intervention effects
• The practical significance of those effects, however, requires that they be compared with some criterion meaningful in the intervention context
• Assessing practical significance is more difficult, but there are a number of approaches that may be appropriate depending on the intervention and outcome construct