Explore the difficulties and pitfalls of conducting high-quality meta-analyses in education research, including issues with inappropriate comparisons, selection bias, intervention quality, and outcome measures.
Dylan Wiliam, UCL (@dylanwiliam)
Why meta-analysis is really hard to do well in education
www.dylanwiliamcenter.com | www.dylanwiliam.org
Approaches to research synthesis
• Philosophy: Idealist vs. Realist
• Relation to theory: Generate/explore vs. Test
• Approach to synthesis: Configurating vs. Aggregating
• Methods: Iterative, theoretical search vs. A priori, exhaustive search
• Quality assessment: Value contribution vs. Avoid bias
• Product: Emergent concepts vs. Magnitude/precision
• Use: Enlightenment vs. Instrumental
Gough (2012)
Systematic reviews “A systematic review attempts to collate all empirical evidence that fits pre-specified eligibility criteria in order to answer a specific research question. It uses explicit, systematic methods that are selected with a view to minimizing bias, thus providing more reliable findings from which conclusions can be drawn and decisions made” (p. 6) Green, Higgins, Alderson, Clarke, Mulrow, and Oxman (2008)
Meta-analysis “Many systematic reviews contain meta-analyses. Meta-analysis is the use of statistical methods to summarize the results of independent studies (Glass 1976). By combining information from all relevant studies, meta-analyses can provide more precise estimates of the effects of health care than those derived from the individual studies included within a review […] Meta-analyses facilitate investigations of the consistency of evidence across studies, and the exploration of differences across studies” Green, Higgins, Alderson, Clarke, Mulrow, and Oxman (2008)
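To illustrate the "more precise estimates" claim in the quotation, here is a minimal sketch (mine, not from the slides) of a fixed-effect, inverse-variance meta-analysis: each study's estimate is weighted by the inverse of its variance, and the pooled standard error is smaller than that of any single study. The three study results are invented purely for illustration.

```python
# Minimal fixed-effect (inverse-variance) meta-analysis sketch.
# Each study contributes an effect estimate and its standard error;
# the three results below are invented purely for illustration.
studies = [
    {"effect": 0.30, "se": 0.15},
    {"effect": 0.10, "se": 0.10},
    {"effect": 0.25, "se": 0.20},
]

# Weight each study by the inverse of its variance (1 / se**2), so more
# precise studies count for more in the pooled estimate.
weights = [1 / s["se"] ** 2 for s in studies]
pooled_effect = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5   # smaller than any single study's SE

print(f"pooled effect = {pooled_effect:.2f}, pooled SE = {pooled_se:.3f}")
```

A random-effects model would additionally estimate between-study variance; the point of the sketch is only the precision-pooling idea in the quotation.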
Key characteristics of systematic reviews
• a clearly stated set of objectives with pre-defined eligibility criteria for studies;
• an explicit, reproducible methodology;
• a systematic search that attempts to identify all studies that would meet the eligibility criteria;
• an assessment of the validity of the findings of the included studies, for example through the assessment of risk of bias; and
• a systematic presentation, and synthesis, of the characteristics and findings of the included studies.
Green, Higgins, Alderson, Clarke, Mulrow, and Oxman (2008)
Problems with meta-analysis in education
• Inappropriate comparisons
• Aptitude x treatment interaction
• The “file drawer” problem
• Variations in intervention quality
• Variation in population variability
• Selection of studies
• Sensitivity of outcome measures
Inappropriate comparisons
• Effects of interventions or associations?
• Cross-level comparisons
• Net effects versus gross effects
• “Business-as-usual” vs. alternative treatment
Aptitude-treatment interaction
• 113 non-formal education centres run by Seva Mandir
• In 56 centres, teachers were paid Rs. 1,000 per month
• In 57 centres, teachers were paid
  • Rs. 500 per month for attendance up to 10 days, plus
  • Rs. 50 for each day over the 10-day threshold
• Attendance rate:
  • Fixed-pay group: 58%
  • Incentive group: 79%
• For the incentive group:
  • Increase in instructional time: 32%
  • Increase in annual progress: 25%
Duflo, Hanna, and Ryan (2012)
The importance of statistical power
• The statistical power of an experiment is the probability that it will yield a statistically significant result when the effect it is looking for is really there.
• In single-level designs, power depends on:
  • the significance level set
  • the magnitude of the effect
  • the size of the experiment
• The power of most experiments in the social sciences is low:
  • Psychology: 0.4 (Sedlmeier & Gigerenzer, 1989)
  • Neuroscience: 0.2 (Button et al., 2013)
  • Education: 0.4
• Only lucky experiments get published…
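To make the last point concrete, here is a small simulation sketch (mine, not from the deck; every number is invented for illustration). With a true effect of d = 0.2 and 50 students per arm, only a small fraction of trials reach significance, and the average effect among those "lucky" significant trials is more than double the true effect, which is why the file-drawer problem listed earlier biases meta-analytic averages upward.

```python
# Illustrative simulation (not from the slides): with a true effect of d = 0.2
# and 50 students per arm, power is low, and the studies that happen to reach
# significance (the "lucky" ones) report inflated effect sizes, which is the
# mechanism behind the file-drawer problem. All numbers are invented.
import random
import statistics

random.seed(1)
TRUE_D, N, Z_CRIT, TRIALS = 0.2, 50, 1.96, 20_000

significant_effects = []
for _ in range(TRIALS):
    control = [random.gauss(0.0, 1.0) for _ in range(N)]
    treated = [random.gauss(TRUE_D, 1.0) for _ in range(N)]
    sd = statistics.pstdev(control + treated)    # rough pooled SD
    d_hat = (statistics.mean(treated) - statistics.mean(control)) / sd
    z = d_hat / (2 / N) ** 0.5                   # normal-approximation test
    if abs(z) > Z_CRIT:
        significant_effects.append(abs(d_hat))

print(f"power ≈ {len(significant_effects) / TRIALS:.2f}")          # roughly 0.17
print(f"mean |d| among significant studies ≈ "
      f"{statistics.mean(significant_effects):.2f}")               # roughly 0.5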
Quality
• Interventions vary in their:
  • Duration
  • Intensity
    • class size reduction by 20%, 30%, or 50%
    • response to intervention
  • Collateral effects
    • assignment of teachers
“It is also known, as an empirical—not definitional—fact that the standard deviation of most achievement tests in elementary school is 1.0 grade-equivalent units; hence the effect size of one year’s instruction at the elementary school level is about +1” (Glass, McGaw, & Smith, 1981 p. 103)
Annual growth in achievement, by age
Bloom, Hill, Black, and Lipsey (2008)
Sequential Tests of Educational Progress
Educational Testing Service (1957)
Annual achievement growth in Connecticut Wibowo, Hendrawan, and Deville (2009)
Variation in variability
• Studies with younger children will produce larger effect size estimates
• Studies with restricted populations (e.g., children with special needs, gifted students) will generally produce larger effect size estimates
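Both observations follow from the denominator of a standardized effect size: d divides the raw gain by the standard deviation of the sample (or norming population) used, so the same raw gain looks larger wherever that spread is smaller. A minimal sketch, with standard deviations invented purely to show the direction of the distortion:

```python
# Same raw gain, different reference populations: d = raw gain / SD of the
# sample studied, so a less spread-out sample yields a larger d.
# All numbers are invented for illustration.

def cohens_d(raw_gain, sample_sd):
    return raw_gain / sample_sd

raw_gain = 5.0  # e.g., 5 points on some test, identical in every scenario
for label, sd in [("full cross-age cohort", 20.0),
                  ("single year group of younger children", 12.0),
                  ("restricted group (e.g., gifted students)", 8.0)]:
    print(f"{label:42s}: d = {cohens_d(raw_gain, sd):.2f}")
```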
Feedback in STEM subjects
Ruiz-Primo and Li (2013)
• Review of 9,000 papers on feedback in mathematics, science and technology
• Only 238 papers retained:
  • Background papers: 24
  • Descriptive papers: 79
  • Qualitative papers: 24
  • Quantitative papers: 111
    • Mathematics: 60
    • Science: 35
    • Technology: 16
Classification of feedback studies
• Who provided the feedback (teacher, peer, self, or technology-based)?
• How was the feedback delivered (individual, small group, or whole class)?
• What was the role of the student in the feedback (provider or receiver)?
• What was the focus of the feedback (e.g., product, process, or self-regulation for cognitive feedback; goal orientation or self-efficacy for affective feedback)?
• On what was the feedback based (student product or process)?
• What type of feedback was provided (evaluative, descriptive, or holistic)?
• How was feedback provided or presented (written, oral, or video)?
• What was the referent of feedback (self, others, or mastery criteria)?
• How, and how often, was feedback given in the study (one time or multiple times; with or without pedagogical use)?
Sensitivity of outcome measures
• Distance of assessment from the curriculum:
  • Immediate
    • e.g., science journals, notebooks, and classroom tests
  • Close
    • e.g., where an immediate assessment asked about the number of pendulum swings in 15 seconds, a close assessment asks about the time taken for 10 swings
  • Proximal
    • e.g., if an immediate assessment asked students to construct boats out of paper cups, the proximal assessment would ask for an explanation of what makes bottles float
  • Distal
    • e.g., where the assessment task is sampled from a different domain and where the problem, procedures, materials and measurement methods differed from those used in the original activities
  • Remote
    • standardized national achievement tests
Ruiz-Primo, Shavelson, Hamilton, and Klein (2002)
Impact of sensitivity to instruction
[Chart: effect sizes for close vs. proximal assessments]
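One way to see why distance from the curriculum matters so much is to treat the outcome measure as a mix of taught and untaught content. The sketch below is my own illustration, not Ruiz-Primo et al.'s model: it assumes the test score is a standardized mixture in which the taught domain accounts for a share w of the score variance, so the observed effect shrinks by a factor of √w; all the w values are hypothetical.

```python
# Illustrative sketch (not from Ruiz-Primo et al.): the same instructional gain
# yields a smaller standardized effect on tests further from the taught content.
# Assumption: test score = sqrt(w) * taught component + sqrt(1 - w) * unrelated
# content (both with unit variance), so a gain of true_d SDs on the taught
# component shifts the test score by sqrt(w) * true_d SDs. The w values are
# hypothetical, chosen only to show the direction of the shrinkage.

def observed_d(true_d, taught_variance_share):
    return true_d * taught_variance_share ** 0.5

true_d = 0.8  # effect on a measure perfectly aligned with what was taught
for label, w in [("immediate", 1.0), ("close", 0.7), ("proximal", 0.4),
                 ("distal", 0.15), ("remote", 0.05)]:
    print(f"{label:9s}: observed d ≈ {observed_d(true_d, w):.2f}")
```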
Meta-analysis in education
• Some problems are unavoidable:
  • Aptitude x treatment interactions
  • Sensitivity to instruction
  • Selection of studies
• Some problems are avoidable:
  • Inappropriate comparisons
  • File-drawer problems
  • Intervention quality
  • Variation in variability
• Unfortunately, many of the people doing meta-analysis in education:
  • don’t discuss the unavoidable problems, and
  • don’t avoid the avoidable ones
Responses • The effects average out • The rank order of effects is still OK
More significant challenges
• “Tales not told” (Kvernbekk, 2019)
• Evidence about “What worked” not “What works”
• Finding conditions for the use of standardized effect size that are both justifiable and useful
In the meantime…
• Educators need to become “critical consumers” of educational research
• Four questions:
  • Does this solve a problem we have?
  • How much improvement will we get?
  • How much will it cost?
  • Will it work here?
Thank You www.dylanwiliam.net