Threats to the Validity of Measures of Achievement Gains

Laura Hamilton and Daniel McCaffrey, RAND Corporation; Daniel Koretz, Harvard University. November 8, 2005

Growth Measures are Becoming More Common in State Accountability Systems • NCLB is not primarily a growth-based approach to accountability, except through its safe harbor provision • Many states supplement NCLB with growth-based measures • California’s Academic Performance Index • Massachusetts Performance and Improvement ratings • The U.S. Department of Education has recently expressed willingness to explore growth measures

Today’s Presentation Examines Threats to Validity of Growth Measures • Background: How growth is measured • Framework for validating measures of change • Threats to validity • Dimensionality • Score inflation • Implications

Growth Metrics Come in Several Forms • Cohort to cohort (CTC) • E.g., the average for this year’s fifth graders compared to last year’s fifth graders • Quasi-longitudinal • E.g., the average for this year’s fifth graders compared to last year’s fourth graders • True longitudinal or individual growth (IG) • E.g., the average of the individual gains for this year’s fifth graders
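The three forms of growth metric listed above can be compared on a toy data set. The sketch below is illustrative only: the student IDs, scores, and function names are invented, and it assumes both years are reported on the same scale.

```python
from statistics import mean

# Hypothetical scale scores keyed by student ID (all values invented).
last_year_grade5 = {"a": 510, "b": 530, "c": 550}
last_year_grade4 = {"d": 480, "e": 500, "f": 520}
# This year's fifth graders: "f" left the school and "g" arrived.
this_year_grade5 = {"d": 515, "e": 525, "g": 565}


def mean_difference(current, prior):
    """Difference between two group averages (used for CTC and quasi-longitudinal)."""
    return mean(current.values()) - mean(prior.values())


def individual_growth(current, prior):
    """Average of individual gains, using only students tested in both years."""
    matched = current.keys() & prior.keys()
    return mean(current[s] - prior[s] for s in matched)


# Cohort to cohort: this year's fifth graders vs. last year's fifth graders.
ctc = mean_difference(this_year_grade5, last_year_grade5)
# Quasi-longitudinal: this year's fifth graders vs. last year's fourth graders.
quasi = mean_difference(this_year_grade5, last_year_grade4)
# Individual growth: average gain for students observed in both years.
ig = individual_growth(this_year_grade5, last_year_grade4)
print(ctc, quasi, ig)
```

In this invented example the three metrics disagree because one student left and another arrived; only the individual growth figure is computed from the same students in both years.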

Individual Growth Models are Generally Preferred • Address problems stemming from changes in student populations over time • Can yield biased estimates if students with incomplete data are different from other students • Provide better information to inform decisions about individual students or groups of students • CTC changes provide little information for stable schools

All Growth Models Require Assumptions about Consistency of Constructs Measured • Users of information from growth models assume construct remains constant • For CTC models, nature of achievement and test content in a single grade should not change • For IG models, nature of achievement and constructs measured should not change as students progress through school • Assumption of consistency is violated to varying degrees depending on features of models, tests, curriculum

Consistency is One Aspect of Validity • Validity applies to inferences, not just to tests • Growth modeling raises concerns about validity of inferences about change • Need to understand what users infer from change scores • These inferences might vary by group (e.g., parents, school administrators) • Match between what is inferred and what is actually measured is critical to validity

Framework for Validating Measures of Change • Validation of change scores has focused mainly on comparing trends between scores on two tests or on correlations between alternate measures • These traditional approaches do not address degree of match between tests or nonuniformity of changes within a test • Koretz, McCaffrey, and Hamilton (2001) developed a framework for validating tests under high-stakes conditions, with a focus on measuring change

Framework Addresses Nonuniformity of Gains Within a Test • Test scores and inferences are considered in terms of specific performance elements • Substantive elements represent the domain of interest • Non-substantive elements are irrelevant to the domain of interest • Performance elements are associated with weights • Weights are typically not explicit • Some may be unintentional • Validity requires close match between test weights and inference weights

A Simple Linear Model for Test Scores • If we assume performance elements are additive, then a student’s score in year t is $\sum_j l_{jt} q_{jt}$, where $q_{jt}$ denotes the student’s performance on element j in year t and $l_{jt}$ is the test weight • The inference about a score assumes it is also a weighted sum of elements but might use different weights • Some weights can be zero
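The additive model can be made concrete with a small sketch. It assumes three hypothetical performance elements, one of them non-substantive, and invented weights. The function name and every number are illustrative and are not taken from the framework itself.

```python
def weighted_score(performance, weights):
    """Additive model: sum of weight * performance over performance elements.

    Elements missing from `weights` are treated as having weight zero.
    """
    return sum(weights.get(element, 0.0) * value
               for element, value in performance.items())


# Hypothetical performance on each element in one year (invented values).
performance = {
    "computation": 60.0,
    "problem_solving": 40.0,
    "test_format_familiarity": 80.0,  # non-substantive element
}

# Weights implied by the test; typically implicit, possibly unintentional.
test_weights = {
    "computation": 0.6,
    "problem_solving": 0.3,
    "test_format_familiarity": 0.1,
}

# Weights implied by the user's inference; zero on the non-substantive element.
inference_weights = {"computation": 0.5, "problem_solving": 0.5}

test_score = weighted_score(performance, test_weights)
inferred_score = weighted_score(performance, inference_weights)
print(test_score, inferred_score)
```

The same performance yields different totals under the two sets of weights, which is the kind of mismatch the framework asks validators to examine.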

Several Factors Undermine Validity of Inferences About Change • Changing nature of sample in CTC models • Differences in characteristics of students included at different time points undermine comparability • We do not address this problem here • Dimensionality: Changes in performance elements and their weights • Score inflation: Special case of dimensionality problem stemming from increases in scores that do not match increases in achievement

Dimensionality • Tests typically assess multiple performance elements • Test specifications or maps to standards provide explicit information about performance elements • But implicit and unintended elements are also likely to affect performance • We use the term “dimensionality” broadly to cover all types of performance elements • Users’ inferences are also likely to be multidimensional • Empirical unidimensionality is not sufficient to conclude dimensionality is not a problem

Dimensionality Affects Inferences about Influences on Achievement • Analyses of NELS:88 math and science assessments examine relationships among achievement, student background, and school and classroom experiences using subscales of achievement measure • For example, gender differences in science depend on what is measured • Difference is larger on items that require out-of-school knowledge or spatial reasoning • Focus on total score or on publisher-developed test specifications masks this difference • Similar findings for relationships with other student characteristics and school experiences

Dimensionality is Relevant to Value-Added Modeling • Subscales from a single mathematics achievement test produce dramatically different results • Study used Procedures and Problem Solving subscores from the Stanford Achievement Test • Variation within teachers across subscores was as large as or larger than variation across teachers • Results suggest that decisions about teacher or school effectiveness depend strongly on outcome measure • Changes in weights given to subscores could affect estimates of teacher or school effectiveness

The Effects of Different Weightings of Computation and Problem Solving Scores on Teacher Effects
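A minimal sketch of why the weighting of subscores matters for teacher rankings follows. The two teachers and their effect estimates are invented, and the composite is assumed to be a simple convex combination of the two subscore effects. This is not the value-added model used in the study.

```python
# Hypothetical teacher effects on two subscores (invented values).
teacher_effects = {
    "teacher_A": {"procedures": 0.30, "problem_solving": -0.10},
    "teacher_B": {"procedures": -0.10, "problem_solving": 0.20},
}


def composite_effect(effects, procedures_weight):
    """Weighted combination of the two subscore effects (weights sum to 1)."""
    return (procedures_weight * effects["procedures"]
            + (1.0 - procedures_weight) * effects["problem_solving"])


def rank_teachers(procedures_weight):
    """Teachers ordered from highest to lowest composite effect."""
    return sorted(
        teacher_effects,
        key=lambda t: composite_effect(teacher_effects[t], procedures_weight),
        reverse=True,
    )


print(rank_teachers(0.8))  # composite dominated by Procedures
print(rank_teachers(0.2))  # composite dominated by Problem Solving
```

With these invented numbers the ordering of the two teachers reverses as the weight shifts from one subscore to the other.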

Threats Stem from Changing Performance Weights or Mismatch with Inference Weights • Many performance elements are likely to be inadvertent and non-substantive; most measures of change will not be fully aligned with users’ inferences • Sensitivity of test items to instruction is likely to vary across grades and across performance elements within the test, resulting in changing weights and/or incorrect inferences about educator effectiveness • When tests measure multiple elements, weights that change over time can contribute to gain scores independent of any gains on the performance elements
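The point that changing weights can produce a gain score without any gain on the performance elements can be checked with a short calculation. The sketch assumes the additive weighted-sum model described earlier. The element names, weights, and performance values are all invented.

```python
def weighted_score(performance, weights):
    """Additive model: sum of weight * performance over performance elements."""
    return sum(weights[element] * value
               for element, value in performance.items())


# Hypothetical performance that is identical in both years (no real growth).
performance = {"computation": 70.0, "problem_solving": 50.0}

# Hypothetical test weights that shift between year 1 and year 2.
weights_year1 = {"computation": 0.5, "problem_solving": 0.5}
weights_year2 = {"computation": 0.7, "problem_solving": 0.3}

score_year1 = weighted_score(performance, weights_year1)
score_year2 = weighted_score(performance, weights_year2)
apparent_gain = score_year2 - score_year1
print(score_year1, score_year2, apparent_gain)
```

The apparent gain here comes entirely from the shift in weight toward the element on which the student was already stronger.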

Implications for CTC and IG Models Vary • Most CTC models use the same test or parallel test forms from one year to the next • Test weights and inference weights will tend to remain reasonably constant over time • But performance elements might differ in their sensitivity to instruction • IG models face additional problem of changes in dimensionality and instructional sensitivity across grades • Problem is likely to be most severe for far-apart grade levels and for subjects in which the curriculum is not cumulative

Score Inflation • Score inflation refers to increases in test scores that are not matched by increases in the underlying achievement construct the test was intended to measure • Score inflation represents a special case of dimensionality-related problems

Score Inflation is Common in High-Stakes Testing Contexts • Analyses of high-stakes test scores show gains in those scores are not matched by gains on other tests of the same content • Discrepancies in trends on high- and low-stakes tests suggest gains on high-stakes tests do not accurately reflect gains in the underlying achievement the test was intended to measure

Example of Score Inflation • Figure: mathematics test scores • Source: Koretz, Linn, Dunbar, & Shepard, 1991

Variation in Teachers’ Responses to Tests Leads to Variation in Inflation • Teachers respond to high-stakes testing in ways that are intended to maximize score increases • Placing more emphasis on tested topics than on untested topics, even when the latter are relevant to users’ inferences • Focusing on “bubble kids” (those just below the cut score) • Coaching on item styles, prompts, or rubrics (aspects of the test that are incidental to the domain being tested) • Many of these actions inflate scores by producing test-score gains that are larger than the gains in the broader achievement domain

Recent Surveys Suggest Teachers’ Practices are Influenced by Tests • Data from surveys of teachers in California, Georgia, and Pennsylvania • Most teachers report increased focus on standards and on content emphasized on tests • More than half of elementary teachers report increasing time spent on test-taking strategies • Approximately 25% of teachers say they focus more on students near the “proficient” cut score • Responses tend to be stronger in math than in science

Score Inflation Exacerbates Inconsistencies in Test and Inference Weights

Threats Stemming from Score Inflation • Problems arising from inflation are similar to those arising from dimensionality • Occurs when students make substantial gains on elements that might or might not have large inference weights, but fail to make gains on other elements that have high inference weights • Threatens the validity of inferences about gains in achievement when achievement is measured using high-stakes tests
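One way to picture the inflation described above is to compare the gain implied by the test weights with the gain implied by the inference weights. The sketch below is an illustration under the additive model. The element names, gains, and weights are invented, and treating the difference between the two gains as "inflation" is an assumption made for this example rather than a measure proposed in the presentation.

```python
def weighted_gain(gains, weights):
    """Weighted sum of element-level gains; missing weights count as zero."""
    return sum(weights.get(element, 0.0) * gain
               for element, gain in gains.items())


# Hypothetical gains on each performance element (invented values).
gains = {
    "tested_topics": 10.0,
    "untested_topics": 0.0,        # relevant to the inference, but no gain
    "item_format_coaching": 6.0,   # incidental to the domain
}

test_weights = {"tested_topics": 0.8, "item_format_coaching": 0.2}
inference_weights = {"tested_topics": 0.5, "untested_topics": 0.5}

test_gain = weighted_gain(gains, test_weights)
inferred_gain = weighted_gain(gains, inference_weights)
# Assumption for illustration: inflation is the part of the test gain
# that is not matched by a gain under the inference weights.
inflation = test_gain - inferred_gain
print(test_gain, inferred_gain, inflation)
```

In this example the test-score gain is larger than the gain in the broader domain because growth is concentrated on elements the test weights heavily and the inference does not.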

Implications for CTC and IG Models • Most research on score inflation has focused on CTC measures • Evidence suggests score inflation is large in the first few years of test implementation but eventually plateaus • Even if inflation lessens over time, inferences about change should be limited to tested material; change scores provide no information about untested material • IG models can be affected by variation in inflation across grades; plateau effects might never occur

Improving the Validity of Inferences about Change • Users of test-score information need to recognize that measuring change is not necessarily the same as measuring growth • Test developers should make their measures as resistant to inflation as possible • Future research should address dimensionality and score inflation in the context of CTC and true longitudinal (IG) measures

Summary • Test scores and inferences depend on multiple performance elements • Valid inferences require consistency between inference and test weights • Inconsistency implies that changes in scores could be unrelated to the performance elements of interest • Score inflation • CTC susceptible to errors from growth on non-substantive or restricted set of elements • Effects likely to plateau • IG susceptible to changes in elements or content across grades • Can have big impact on growth and related measures
