Effect Sizes in Education Research: What They Are, What They Mean, and Why They’re Important
Howard Bloom (MDRC; Howard.Bloom2@mdrc.org)
Carolyn Hill (Georgetown; cjh34@georgetown.edu)
Alison Rebeck Black (MDRC; alison.black@mdrc.org)
Mark Lipsey (Vanderbilt; mark.lipsey@vanderbilt.edu)
Institute of Education Sciences 2006 Research Conference, Washington DC
Today’s Session
• Goal: introduce key concepts and issues
• Approach: focus on the nexus between analytics and interpretation
• Agenda:
  • Core concepts
  • Empirical benchmarks
  • Important applications
Part 1: The Nature (and Pitfalls) of the Effect Size
Howard Bloom, MDRC
Starting Point • Statistical significance vs. substantive importance • Effect size measures for continuous outcomes (our focus) • Effect size measures for discrete outcomes
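For continuous outcomes, the effect size discussed throughout this session is the standardized mean difference: the treatment-control difference in mean outcomes divided by the standard deviation of the outcome. A minimal computational sketch (the function name and the simulated test-score data below are illustrative, not drawn from any study in this deck):

    import numpy as np

    def standardized_mean_difference(treatment, control):
        # Treatment-control difference in means, divided by the pooled SD
        treatment = np.asarray(treatment, dtype=float)
        control = np.asarray(control, dtype=float)
        n_t, n_c = len(treatment), len(control)
        pooled_var = ((n_t - 1) * treatment.var(ddof=1) +
                      (n_c - 1) * control.var(ddof=1)) / (n_t + n_c - 2)
        return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

    # Hypothetical test-score data: control mean 500, treatment mean 512, SD 40
    rng = np.random.default_rng(0)
    control_scores = rng.normal(500, 40, size=200)
    treatment_scores = rng.normal(512, 40, size=200)
    print(round(standardized_mean_difference(treatment_scores, control_scores), 2))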
Variance components framework Decomposing the total national variance
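The slide itself is not reproduced here, but the core idea can be written out. As a minimal sketch, assuming a two-level decomposition (students nested within schools) of the total national variance of an outcome:

\[
\sigma^2_{\text{total}} \;=\; \sigma^2_{\text{between schools}} \;+\; \sigma^2_{\text{within schools}},
\qquad
\rho \;=\; \frac{\sigma^2_{\text{between schools}}}{\sigma^2_{\text{total}}}
\]

Here ρ is the intracluster correlation coefficient (ICC) that reappears in the power calculations of Part 3, and the choice of which standard deviation an effect size is expressed in (total, between-school, or within-school) affects its interpretation.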
Career Academies and Future Earnings for Young Men
Impact on earnings:
  Dollars per month increase: $212
  Percentage increase: 18%
  Effect size: 0.30 s
Aspirin and Heart Attacks
Rate of heart attacks:
  With placebo: 1.71%
  With aspirin: 0.94%
  Difference: 0.77%
  Effect size: 0.06 s
Source: “Measures of Effect Size,” in Harris Cooper and Larry V. Hedges (eds.), The Handbook of Research Synthesis (New York: Russell Sage Foundation).
Five-Year Impacts of the Tennessee Class-Size Experiment
Treatment: 13-17 versus 22-26 students per class
Effect sizes: 0.11 s to 0.22 s for reading and math
Findings summarized from Nye, Barbara, Larry V. Hedges, and Spyros Konstantopoulos (1999), “The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis 21(2): 127-142.
Part 2: What’s a Big Effect Size, and How to Tell?
Carolyn Hill, Georgetown University
Alison Rebeck Black, MDRC
How Big is the Effect?
• Need to interpret an effect size when:
  • Designing an intervention study
  • Interpreting an intervention study
  • Synthesizing intervention studies
• To assess the practical significance of an effect size, compare it to an external criterion/standard:
  • Related to the outcome construct
  • Related to the context
Prevailing Practice for Interpreting Effect Size: “Rules of Thumb”
• Cohen (speculative): Small = 0.20 s, Medium = 0.50 s, Large = 0.80 s
  Cohen, Jacob (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd edition (Hillsdale, NJ: Lawrence Erlbaum).
• Lipsey (empirical): Small = 0.15 s, Medium = 0.45 s, Large = 0.90 s
  Lipsey, Mark W. (1990) Design Sensitivity: Statistical Power for Experimental Research (Newbury Park, CA: Sage Publications).
Preferred Approaches for Assessing Effect Size (K-12) • Compare ES from the study with: • ES distributions from similar studies • Student attainment of performance criterion without intervention • Normative expectations for change • Subgroup performance gaps • School performance gaps
ES Distribution from Similar Studies
Percentile distribution of 145 achievement effect sizes from a meta-analysis of comprehensive school reform studies (Borman et al. 2003):
  Percentile        5th    25th   50th   75th   95th
  Effect Size (σ)  -0.06   0.07   0.16   0.25   0.39
Normative Expectations for Change:Estimating Annual Reading and Math Gains in Effect Size from National Norming Samples for Standardized Tests • Seven tests were used for reading and six tests were used for math • The mean and standard deviation of scale scores for each grade were obtained from test manuals • The standardized mean difference across succeeding grades was computed • These results were averaged across tests and weighted according to Hedges (1982)
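A minimal sketch of the gain calculation for one grade-to-grade transition, assuming scale-score means and SDs taken from a test’s norming tables and a pooled-SD denominator; the numbers below are placeholders, not values from any actual test manual:

    import math

    def annual_gain_effect_size(mean_lower, sd_lower, mean_upper, sd_upper):
        # Standardized mean difference between adjacent grades, dividing by
        # the square root of the average of the two grades' variances.
        pooled_sd = math.sqrt((sd_lower ** 2 + sd_upper ** 2) / 2)
        return (mean_upper - mean_lower) / pooled_sd

    # Placeholder scale scores for grades 3 and 4 on a hypothetical test
    print(round(annual_gain_effect_size(mean_lower=610, sd_lower=40,
                                        mean_upper=625, sd_upper=42), 2))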
Annual Reading and Math Growth
------------------------------------------------------------
 Grade          Reading Growth      Math Growth
 Transition     (Effect Size, s)    (Effect Size, s)
------------------------------------------------------------
 K - 1              1.59                1.13
 1 - 2              0.94                1.02
 2 - 3              0.57                0.83
 3 - 4              0.37                0.50
 4 - 5              0.40                0.59
 5 - 6              0.35                0.41
 6 - 7              0.21                0.30
 7 - 8              0.25                0.32
 8 - 9              0.26                0.19
 9 - 10             0.20                0.22
 10 - 11            0.21                0.15
 11 - 12            0.03                0.00
------------------------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie, MAT8, Terra Nova CAT, and SAT10.
Demographic Performance Gaps from Selected Tests • Interventions may aim to close demographic performance gaps • Effectiveness of interventions can be judged relative to the size of gaps they are designed to close • Effect size gaps vary across grades, years, tests, and districts
Performance Gaps between “Average” and “Weak” Schools • Main idea: • What is the performance gap (effect size) for the same types of students in different schools? • Approach: • Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status. • Infer performance gap (effect size) between schools at different percentiles of the performance distribution
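A minimal sketch of this approach, assuming a student-level dataset with test scores, the listed covariates, and a school identifier; the column names, the use of statsmodels, the 10th-vs-50th-percentile contrast, and the choice to standardize by the overall student-level SD are all illustrative assumptions, not the authors’ actual implementation:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def school_gap_effect_size(df, low_pct=10, high_pct=50):
        # Regression-adjusted gap between a "weak" (low_pct) and an "average"
        # (high_pct) school, expressed in student-level SD units of the outcome.
        model = smf.ols(
            "score ~ prior_score + female + overage + free_lunch"
            " + C(race_ethnicity) + C(school_id)",
            data=df,
        ).fit()
        # Collect the estimated school fixed effects (adjusted school differences);
        # the omitted reference school has an implicit effect of 0.
        school_effects = np.array(
            [v for k, v in model.params.items() if k.startswith("C(school_id)")]
        )
        school_effects = np.append(school_effects, 0.0)
        gap = np.percentile(school_effects, high_pct) - np.percentile(school_effects, low_pct)
        # Standardizing by the overall student-level SD is one possible choice
        return gap / df["score"].std(ddof=1)

    # Tiny synthetic dataset, purely illustrative
    rng = np.random.default_rng(1)
    n = 1000
    df = pd.DataFrame({
        "school_id": rng.integers(0, 20, n),
        "prior_score": rng.normal(0, 1, n),
        "female": rng.integers(0, 2, n),
        "overage": rng.integers(0, 2, n),
        "free_lunch": rng.integers(0, 2, n),
        "race_ethnicity": rng.choice(list("ABC"), n),
    })
    df["score"] = 0.7 * df["prior_score"] + 0.02 * df["school_id"] + rng.normal(0, 1, n)
    print(round(school_gap_effect_size(df), 2))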
Interpreting the Magnitude of Effect Sizes • “One size” does not fit all • Instead, interpret magnitudes of effects in context • Of the interventions being studied • Of the outcomes being measured • Of the samples/subsamples being examined • Consider different frames of reference in context, instead of a universal standard: • ES distributions, external performance criteria, normative change, subgroup/school gaps, etc.
Part 3: Using Effect Sizes in Power Analysis and Research Synthesis
Mark W. Lipsey, Vanderbilt University
Statistical Power • The probability that a true intervention effect will be found statistically significant.
Estimating Statistical Power Prospectively: Finding the MDE
Specify:
• alpha level (conventionally .05)
• sample size (at all levels, if a multilevel design)
• correlation between any covariates to be used and the dependent variable
• intracluster correlation coefficients (ICCs), if a multilevel design
• target power level (conventionally .80)
Estimate: the minimum detectable effect size (MDES), as sketched below.
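A minimal sketch of an MDES calculation for a balanced two-level cluster-randomized design (students nested in classrooms or schools), following the general form of the formulas in Bloom (2005) and Schochet (2005); the specific design settings plugged in below are illustrative assumptions:

    import math
    from scipy import stats

    def mdes_two_level(n_clusters, cluster_size, icc,
                       r2_between=0.0, r2_within=0.0,
                       alpha=0.05, power=0.80):
        # Minimum detectable effect size for a balanced two-level
        # cluster-randomized design with half of the clusters treated.
        df = n_clusters - 2  # rough df: clusters minus intercept and treatment dummy
        multiplier = stats.t.ppf(1 - alpha / 2, df) + stats.t.ppf(power, df)
        variance_term = (4.0 / n_clusters) * (
            icc * (1 - r2_between)
            + (1 - icc) * (1 - r2_within) / cluster_size
        )
        return multiplier * math.sqrt(variance_term)

    # Illustrative: 40 classrooms of 20 students each, ICC = .15
    print(round(mdes_two_level(40, 20, icc=0.15), 2))
    # Same design with a classroom-level covariate explaining 50% of between-classroom variance
    print(round(mdes_two_level(40, 20, icc=0.15, r2_between=0.50), 2))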
Assessing the MDE • Compare with a target effect size-- the smallest ES judged to have practical significance in the intervention context • Design is underpowered if MDE > target (back to the drawing board) • Design is adequately powered if MDE ≤ target value
Where Do You Get the Target Value for Practical Significance?
• NOT some broad rule of thumb (e.g., Cohen’s “small,” “medium,” and “large”)
• Use a frame of reference appropriate to the outcome, population, and intervention:
  • a meaningful success criterion
  • research findings for similar interventions
  • change expected without intervention
  • gaps between relevant comparison groups
  • et cetera
Selecting the Target MDE • Identify one or more reference frames that may be applicable to the intervention circumstances • Use that frame to guide selection of an MDE; involve other stakeholders • Use different reference frames to consider: • which is most applicable to the context • how sensitive the choice is to the frames • what the most conservative selection might be
Power for Different Target MDEs (2-level design: students in classrooms)
[Chart: power curves for ES = .20, .50, and .80 with ICC = .15, plotted against the number of classrooms of N = 20 students each; reference line at power = .80.]
Power for Different Target MDEs (same design with a classroom-level covariate, R² = .50)
[Chart: power curves for ES = .20, .50, and .80 with ICC = .15, plotted against the number of classrooms of N = 20 students each; reference line at power = .80.]
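A minimal sketch of the power calculation behind charts like these, for the same balanced two-level design (ICC = .15, classrooms of 20 students); it reuses the variance expression from the MDES sketch above, uses a shifted central-t approximation, and the cluster counts looped over are illustrative:

    import math
    from scipy import stats

    def power_two_level(effect_size, n_clusters, cluster_size, icc,
                        r2_between=0.0, alpha=0.05):
        # Approximate power to detect a given effect size in a balanced
        # two-level cluster-randomized design (half of the clusters treated).
        df = n_clusters - 2
        se = math.sqrt((4.0 / n_clusters) *
                       (icc * (1 - r2_between) + (1 - icc) / cluster_size))
        critical_t = stats.t.ppf(1 - alpha / 2, df)
        # Shifted central-t approximation; ignores the tiny probability of
        # rejecting in the wrong direction.
        return 1 - stats.t.cdf(critical_t - effect_size / se, df)

    for es in (0.20, 0.50, 0.80):
        print(es, [round(power_two_level(es, j, 20, icc=0.15), 2)
                   for j in (10, 20, 40, 80)])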
Interpreting Effect Sizes Found in Individual Studies & Meta-Analysis • The practical significance of empirically observed effect sizes should be interpreted using approaches like those described here • This is especially important when disseminating research results to practitioners and policymakers • For standardized achievement measures, the practical significance of ES values will vary by student population and grade.
Example: Computer-Assisted Instruction for Beginning Reading (Grades 1-4) Consider an MDE = .25 • Mean ES = .25 found in the Blok et al. (2002) meta-analysis • 27-65% increase over “normal” year-to-year growth, depending on age • About 30% of the Grade 4 majority-minority achievement gap
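One way to see where the growth comparison comes from, using reading gains from the annual growth table above as reference points (treat these as approximations: the exact endpoints of the quoted 27-65% range depend on which grade transitions and tests are used):

\[
\frac{0.25}{0.94} \approx 0.27 \quad \text{(relative to the grade 1-2 gain)},
\qquad
\frac{0.25}{0.40} \approx 0.63 \quad \text{(relative to the grade 4-5 gain)}
\]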
References
Bloom, Howard S. 2005. “Randomizing Groups to Evaluate Place-Based Programs.” In Howard S. Bloom, editor, Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation, pp. 115-172.
Bloom, Howard S. 1995. “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.” Evaluation Review 19(5): 547-556.
Borman, Geoffrey D., Gina M. Hewes, Laura T. Overman, and Shelly Brown. 2003. “Comprehensive School Reform and Achievement: A Meta-Analysis.” Review of Educational Research 73(2): 125-230.
Hedges, Larry V. 1982. “Estimation of Effect Size from a Series of Independent Experiments.” Psychological Bulletin 92(2): 490-499.
Kane, Thomas J. 2004. “The Impact of After-School Programs: Interpreting the Results of Four Recent Evaluations.” William T. Grant Foundation Working Paper, January 16. http://www.wtgrantfoundation.org/usr_doc/After-school_paper.pdf
Konstantopoulos, Spyros, and Larry V. Hedges. 2005. “How Large an Effect Can We Expect from School Reforms?” Working Paper 05-04, Institute for Policy Research, Northwestern University. http://www.northwestern.edu/ipr/publications/papers/2005/WP-05-04.pdf
Lipsey, Mark W. 1990. Design Sensitivity: Statistical Power for Experimental Research. Thousand Oaks, CA: Sage Publications.
Schochet, Peter Z. 2005. “Statistical Power for Random Assignment Evaluations of Education Programs.” Project report submitted by Mathematica Policy Research, Inc. to the Institute of Education Sciences, U.S. Department of Education. http://www.mathematica-mpr.com/publications/PDFs/statisticalpower.pdf
Contact Information
Howard Bloom (Howard.Bloom2@mdrc.org)
Carolyn Hill (cjh34@georgetown.edu)
Alison Rebeck Black (alison.black@mdrc.org)
Mark Lipsey (mark.lipsey@vanderbilt.edu)