Accuracy, Transparency, and Incentives: Contrasting Criteria for Evaluating Growth Models

Andrew Ho
Harvard Graduate School of Education
Maryland Assessment Research Center for Education Success (MARCES) Assessment Conference: Value Added Modeling and Growth Modeling with Particular Application to Teacher and School Effectiveness
College Park, Maryland, October 18, 2012

What makes a good growth model?
• How can we advance from passing judgment on schools and teachers to facilitating their improvement?
• By which criteria should we evaluate accountability models?
• Predictive accuracy of individual student “growth” models for school-level accountability.
  • Projection Model
  • Trajectory Model
  • Conditional Status Percentile Rank models (e.g., SGPs)
• Incentives.
  • Conditional incentive diagrams and alignment to policy goals.
• Transparency, black boxes, and score reporting.

Context
• School accountability metrics that count students who are “on track” to proficiency, career and college readiness, or some other future outcome.
• A seemingly straightforward criterion is minimization of the distance between predicted and actual future performance.

How “predictive accuracy” is like “standards”
• Once the discussion is framed in terms of standards, the only rhetorically acceptable choice is high standards.
• Once the discussion is framed in terms of predictive accuracy, the only rhetorically acceptable choice is maximal accuracy.

A simple projection model
• Define $X_g$ as a test score in Grade $g$.
• Consider the problem of predicting a “future” Grade 8 test score $X_8$ from a current Grade 7 score $X_7$ and a past score $X_6$.
• Take data from a past cohort with complete Grade 6, 7, and 8 data.
• Estimate a simple prediction equation: $\hat{X}_8 = b_0 + b_7 X_7 + b_6 X_6$
• Assume the equation holds for the current cohort.
• Plug $X_7$ and $X_6$ data from the current cohort into the equation estimated from the past cohort (often looks like this for standardized data):
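The steps above can be sketched in Python with NumPy. This is a minimal illustration, assuming standardized scores; the function names, the simulated past cohort, and its correlations (0.8 between adjacent grades, 0.75 two grades apart) are assumptions made for the example and do not come from the talk.

```python
import numpy as np

def fit_projection(x6, x7, x8):
    """Estimate X8 = b0 + b7*X7 + b6*X6 by OLS on a past cohort."""
    design = np.column_stack([np.ones(len(x8)), x7, x6])
    coefs, *_ = np.linalg.lstsq(design, x8, rcond=None)
    return coefs  # array [b0, b7, b6]

def project(coefs, x6, x7):
    """Apply the past-cohort equation to current-cohort scores."""
    b0, b7, b6 = coefs
    return b0 + b7 * np.asarray(x7) + b6 * np.asarray(x6)

# Simulated past cohort with complete Grade 6, 7, and 8 data (hypothetical).
rng = np.random.default_rng(0)
corr = np.array([[1.00, 0.80, 0.75],
                 [0.80, 1.00, 0.80],
                 [0.75, 0.80, 1.00]])
past = rng.multivariate_normal(np.zeros(3), corr, size=5000)
coefs = fit_projection(past[:, 0], past[:, 1], past[:, 2])

# Current cohort: only Grade 6 and Grade 7 scores are known (hypothetical).
current_x6 = np.array([-1.0, 0.0, 1.0])
current_x7 = np.array([-0.5, 0.0, 0.5])
projected_x8 = project(coefs, current_x6, current_x7)
print(coefs, projected_x8)
```

The key design point is that the coefficients come only from the past cohort; the current cohort contributes inputs, never outcomes.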


Minimize Squared Error
• Define a criterion that describes the average “miss” on the scale (lower the better): $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{8,i} - \hat{X}_{8,i}\right)^2}$, where $X_{8,i}$ and $\hat{X}_{8,i}$ are student $i$'s actual and predicted Grade 8 scores.
• As you might expect of a regression model given only prior-year variables, the projection model does about as well as possible with prediction, with RMSEs between 0.4 and 0.6 standard deviation units.
• A convenient representation of RMSE assumes equal intergrade correlations, $r$ (this is unrealistic but tolerable for approximation), for $k$ prior years (in this case, $k = 2$).
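One way to write the convenient representation is sketched below. It is a reconstruction from the stated simplification, not necessarily the expression on the original slide. It assumes scores standardized within grade, a single correlation $r$ between any two grades, and $k$ prior years as predictors; the symbols $r$, $k$, and $R^2$ are notational assumptions.

```latex
% Squared multiple correlation with k equally correlated predictors
R^2 = \frac{k r^2}{1 + (k-1) r}

% Projection RMSE in standard deviation units
\mathrm{RMSE}_{\text{proj}} = \sqrt{1 - R^2}
  = \sqrt{\frac{(1 - r)(1 + k r)}{1 + (k-1) r}}

% Example: k = 2, r = 0.8
\mathrm{RMSE}_{\text{proj}} = \sqrt{\frac{(0.2)(2.6)}{1.8}} \approx 0.54
```

The example value falls inside the 0.4 to 0.6 range quoted above.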

A simple trajectory model
• Requires an argument for a vertical scale.
• Extends past gains into the future.
• Compare with a typical projection model.
• Under the caricatured conditions of equal intergrade correlations, $r$, and $k$ prior years, an “average gain” trajectory model has a larger RMSE than the projection model.
• The ratio of trajectory RMSE to projection RMSE is 1.4 to 2 over common scenarios.
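The 1.4 to 2 range can be recovered with a short derivation. This is a sketch under assumptions that are not spelled out on the slide: scores are standardized within grade, every pair of grades correlates at $r$, and the “average gain” trajectory adds the mean yearly gain across the $k$ prior years to the current score. The symbol $b = 1/(k-1)$ is introduced here only for convenience.

```latex
% Average-gain trajectory with k prior years (Grades 8-k through 7)
\hat{X}_8^{\text{traj}} = X_7 + b\,(X_7 - X_{8-k}), \qquad b = \frac{1}{k-1}

% Error variance with unit variances and common correlation r
\operatorname{Var}\!\left(X_8 - (1+b) X_7 + b X_{8-k}\right)
  = 2(1 - r)(1 + b + b^2)

% Ratio to the projection RMSE, \sqrt{(1-r)(1+kr)/(1+(k-1)r)}
\frac{\mathrm{RMSE}_{\text{traj}}}{\mathrm{RMSE}_{\text{proj}}}
  = \sqrt{\frac{2(1 + b + b^2)\,\bigl(1 + (k-1) r\bigr)}{1 + k r}}

% k = 2, r = 0.8: \sqrt{6(1.8)/2.6} \approx 2.04
% k \to \infty: the ratio tends to \sqrt{2} \approx 1.41
```

Under these assumptions the ratio does not depend strongly on $r$; it is driven mostly by how many prior years are available.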


Projections from Conditional Status Percentile Ranks
• Castellano and Ho (2012) describe metrics like SGPs (Betebenner, 2008) and Percentile Ranks of OLS Residuals (PRRs) as Conditional Status Percentile Ranks.
• These can be used to make predictions as follows:
  • Regress “current” Grade 7 scores on “past” Grade 6 scores.
  • Define $p$ as the percentile rank of a student’s residual in the distribution of Grade 7 residuals.
  • Obtain the predicted Grade 8 score following the projection model.
  • For each student, add the $p$th percentile residual from the projection model’s residual distribution, where $p$ corresponds with their percentile rank from the Grade 7 regression.
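The four steps can be sketched with ordinary least squares residuals (the PRR flavor of a CSPR, not the quantile regression used for SGPs). The function names, the mid-rank percentile convention, and the simulated cohorts are assumptions made for this illustration.

```python
import numpy as np

def ols(y, *predictors):
    """Return OLS coefficients (intercept first) and residuals."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs, y - design @ coefs

def percentile_rank(values):
    """Mid-rank percentile (strictly between 0 and 100); assumes no ties."""
    ranks = np.argsort(np.argsort(values))
    return 100.0 * (ranks + 0.5) / len(values)

def cspr_projection(past_x6, past_x7, past_x8, cur_x6, cur_x7):
    # Step 1: regress current Grade 7 on past Grade 6.
    _, resid7 = ols(cur_x7, cur_x6)
    # Step 2: percentile rank p of each student's Grade 7 residual.
    p = percentile_rank(resid7)
    # Step 3: projection model estimated on the past cohort.
    (b0, b7, b6), resid8 = ols(past_x8, past_x7, past_x6)
    projected = b0 + b7 * cur_x7 + b6 * cur_x6
    # Step 4: add the p-th percentile residual of the Grade 8 regression.
    adjusted = projected + np.percentile(resid8, p)
    return p, projected, adjusted

# Simulated data (hypothetical correlations, for illustration only).
rng = np.random.default_rng(1)
corr = np.array([[1.00, 0.80, 0.75],
                 [0.80, 1.00, 0.80],
                 [0.75, 0.80, 1.00]])
past = rng.multivariate_normal(np.zeros(3), corr, size=5000)
current = rng.multivariate_normal(np.zeros(3), corr, size=1000)[:, :2]

p, projected, adjusted = cspr_projection(
    past[:, 0], past[:, 1], past[:, 2], current[:, 0], current[:, 1])
```

A student whose Grade 7 score sits above expectation is assumed to stay equally far above expectation in Grade 8, which is why the adjustment moves predictions away from the regression line.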

Contrasting all predictive models

An important contrast in “growth” use and interpretation
Growth Description
• Gain Scores, Trajectories: Where a student was, where a student is, and what has been learned in between.
• Status Beyond Prediction, CSPR: Where a student is, above and beyond where we would have predicted she would be, given past scores.
Growth Prediction
• Trajectory, Gain-Score Model: Extend past gains in systematic fashion into the future. Consider whether future performance is adequate.
• Projection/Regression Model: Use a regression model to predict future scores from past scores, statistically, empirically.

Two Approaches to Growth Description
• Gain Scores, Trajectories. Figure labels: “Gain”; “Adding two students with equal gains.”
• Status Beyond Prediction. Figure labels: “Prediction from previous score (or scores, or scores and demographics)”; “Status beyond prediction”; “Adding two different students with equal status beyond predictions.”

Two Approaches to Growth Prediction
Trajectory Model
• Extends gains over time in straightforward fashion.
• With more prior years, a best-fit line or curve can be extended similarly.
• Extended trajectories do not have to be linear.
Projection/Regression Model
• Estimates a prediction equation for the “future” score.
• Because current students have unknown future scores, estimate the prediction equation from a previous cohort that does have their “future” year’s score.
• Input current cohort data into this prediction equation.

Stark Contrasts in Projections
• Three students with equal projections from a gain-score model.
• The same three students’ predictions with a regression model.
• Three students with equal projections from a regression model.

Models by RMSE
• As noted, the ratio of trajectory RMSE to projection RMSE ranges from 1.4 to 2 under common conditions.
• The ratio of CSPR RMSE to projection RMSE is about 1.4 (Castellano & Ho, in preparation).
• Compared to a projection model (regression) baseline, CSPRs have an average “miss” that is 40% greater, and trajectory models are worse still.
• Regression does well what regression does well.
• Case closed?
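The trajectory-to-projection range can be checked numerically. The sketch below uses closed forms that follow from simplifying assumptions rather than from the talk itself: standardized scores, one common intergrade correlation r, k prior years, and a trajectory that adds the average yearly gain across those k years to the current score. Function names are hypothetical.

```python
import math

def rmse_projection(r, k):
    """Projection RMSE with k equally correlated prior years."""
    return math.sqrt((1 - r) * (1 + k * r) / (1 + (k - 1) * r))

def rmse_trajectory(r, k):
    """Average-gain trajectory RMSE; requires k >= 2 prior years."""
    b = 1 / (k - 1)  # weight on the average gain
    return math.sqrt(2 * (1 - r) * (1 + b + b * b))

ratios = {}
for k in (2, 3, 5):
    for r in (0.6, 0.7, 0.8):
        ratios[(k, r)] = rmse_trajectory(r, k) / rmse_projection(r, k)
        print(f"k={k} r={r:.1f} ratio={ratios[(k, r)]:.2f}")
```

With two prior years the ratio sits near 2; with more prior years it falls toward the square root of 2, which is consistent with the range quoted above.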

Models by RMSE
• What sorts of decisions do these models support?
• How do these decisions set incentives for teachers and school administrators?
• Consider a future Grade 8 standard, $c$, where a student whose predicted Grade 8 score reaches $c$ is deemed to be “on track.”
• Given a Grade 6 score, a Grade 7 score, the standard $c$, and any of our three models, we can graph the “on track” boundary on a Grade 7 vs. Grade 6 plane (Ho, 2011; Ho, et al., 2009; Hoffer, et al., 2011).
• For example, we can plot the boundary, assuming a 50th percentile cut score ($c = 0$ on a standardized scale) for simplicity:

Decision Plots
For a Grade 6 score of -2, projection models require a Grade 7 score of 1.6, CSPRs require a -.5, and trajectory models require a -1.
For a Grade 6 score of 2, projection models require anything above a -1.6, CSPRs require a .5, and trajectory models require a 1.
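A minimal sketch of how such boundaries can be computed for standardized scores follows. Several pieces are assumptions and not taken from the talk: the function names, the normal-theory form of the CSPR projection (a student keeps the same standardized residual in Grade 8 as in Grade 7), and the correlations 0.8 (adjacent grades) and 0.78 (Grade 6 to Grade 8), which were chosen only because they give values close to those quoted above.

```python
import math

def projection_weights(r67, r78, r68):
    """Standardized OLS weights (b7, b6) for predicting Grade 8."""
    denom = 1 - r67 ** 2
    return (r78 - r68 * r67) / denom, (r68 - r78 * r67) / denom

def required_grade7(x6, r67, r78, r68, cut=0.0):
    """Lowest Grade 7 score that is 'on track' under each model."""
    b7, b6 = projection_weights(r67, r78, r68)
    # Projection: b7*x7 + b6*x6 >= cut
    projection = (cut - b6 * x6) / b7
    # Trajectory: x7 + (x7 - x6) >= cut
    trajectory = (cut + x6) / 2
    # CSPR: projection plus s * (x7 - r67*x6) >= cut, where s is the
    # ratio of Grade 8 to Grade 7 residual standard deviations.
    r_squared = b7 * r78 + b6 * r68
    s = math.sqrt(1 - r_squared) / math.sqrt(1 - r67 ** 2)
    cspr = (cut + (r67 * s - b6) * x6) / (b7 + s)
    return {"projection": projection, "cspr": cspr, "trajectory": trajectory}

low = required_grade7(-2.0, r67=0.8, r78=0.8, r68=0.78)
high = required_grade7(2.0, r67=0.8, r78=0.8, r68=0.78)
print(low)
print(high)
```

The three boundaries have different slopes in the Grade 7 vs. Grade 6 plane: negative for projection, positive for CSPR and trajectory, which is the contrast the decision plots display.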

Torquing Projection Lines
The Projection line (and the CSPR line) is empirically derived.
When does the projection line start looking more like the Trajectory line?
As an illustration, the projection line slope is 0 when the distal (Grade 6 to Grade 8) correlation equals the product of the two adjacent-grade correlations.
Adjacent grade correlations generally range from 0.6 to 0.8. Distal grade correlations are rarely that much lower, so this is an uncommon occurrence.
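The zero-slope condition can be derived from the standardized two-predictor regression. The notation below ($r_{67}$, $r_{78}$, $r_{68}$ for the correlations, $b_7$, $b_6$ for the weights, $c$ for the cut score) is assumed for this sketch and may differ from the original slide.

```latex
% Standardized regression weights for predicting Grade 8
b_7 = \frac{r_{78} - r_{68}\, r_{67}}{1 - r_{67}^2}, \qquad
b_6 = \frac{r_{68} - r_{78}\, r_{67}}{1 - r_{67}^2}

% "On track" boundary in the Grade 7 vs. Grade 6 plane
b_7 X_7 + b_6 X_6 = c
\;\Longrightarrow\;
X_7 = \frac{c}{b_7} - \frac{b_6}{b_7}\, X_6

% The slope -b_6 / b_7 is zero exactly when b_6 = 0
r_{68} = r_{67}\, r_{78}
```

With adjacent-grade correlations of 0.6 to 0.8, the distal correlation would have to drop to somewhere between 0.36 and 0.64 for the line to flatten.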

Aspirational Models vs. Predictive Models
The CSPR and trajectory model lines are, from this perspective, more aspirational than predictive. They envision a covariance structure where relative position is less fixed over time than it is empirically. Can this be okay, even if it decreases predictive accuracy?

From Decision Plots to Effort Plots
The regression line or “conditional expectation” line gives us a baseline expectation given our Grade 6 scores. Anything above this line may require “effort.” We can plot this effort on prior-year scores by subtracting out this regression line.
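The subtraction can be sketched as follows, assuming standardized scores so that the expected Grade 7 score is the Grade 6 score times the Grade 6 to Grade 7 correlation. The function name and the correlation of 0.8 are hypothetical; the required scores are the values quoted on the Decision Plots slide.

```python
def conditional_effort(required_x7, x6, r67):
    """Required Grade 7 score minus the expected Grade 7 score given Grade 6."""
    expected_x7 = r67 * x6  # conditional expectation for standardized scores
    return required_x7 - expected_x7

R67 = 0.8  # hypothetical adjacent-grade correlation

# Required Grade 7 scores for a Grade 6 score of -2 and of 2.
required = {
    "projection": {-2.0: 1.6, 2.0: -1.6},
    "cspr": {-2.0: -0.5, 2.0: 0.5},
    "trajectory": {-2.0: -1.0, 2.0: 1.0},
}

effort = {
    model: {x6: conditional_effort(x7, x6, R67) for x6, x7 in points.items()}
    for model, points in required.items()
}
for model, values in effort.items():
    print(model, values)
```

Under these assumed numbers, the low scorer needs a gain of about 3.2 standard deviations beyond expectation under the projection model, while the high scorer can fall about 3.2 below expectation and remain on track; the CSPR and trajectory models ask for far smaller departures.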

Conditional Effort Plots
Required gain beyond expectation.
Maximizing predictive accuracy may lead to implausible gains required to get low-achieving students to be “on track.”
A low score as a “ball and chain.”
A high score as a “free pass.”

Conditional Incentive Plots
In a zero-sum model for incentives to teach certain students, these conditional effort plots imply conditional incentive plots, as shown. The question may be, what is the goal of the policy? This informs conditional incentive plots, and these can inform model selection. This is a useful alternative to letting prediction drive model selection and then being surprised by the shape of incentives.

Stark Contrasts in Incentives
Trajectory, Gain-Score Model
• Lower initial scores can inflate trajectories:
• New Model Rewards Low Scores, Encourages “Fail-First Strategy”
• Very intuitive, requires vertical scales, less accurate in terms of future classifications.
Regression/Prediction Model
• Low scorers require huge gains. High scorers can fall comfortably.
• New Model Labels High and Low Achievers Early, Permanently.
• Counterintuitive, does not require vertical scales, more accurate classifications.

Conditional Incentive Plots for VAMs
• I argue that school accountability metrics should be designed with less attention to “standards” and “prediction” and closer attention to conditional incentives and their alignment with policy goals.
• What about teacher accountability metrics?
• Conditional incentive plots for VAMs are generally uniform across the distributions of variables included in the model.
• Scaling anomalies may lead to distortions, if equal intervals do not correspond with equal “effort for gains,” although this is difficult to game.

Transparency and Score Reporting
• As accountability calculations become increasingly complicated, score reporting and transparency become even more necessary mechanisms for the improvement of schools and teaching.
• Systems will be more successful with clear reporting of actionable (and presumably defensible) results.
• An example: I used to be very suspicious of categorical/value-table models, as they create pseudo-vertical scales and sacrifice information.
• I still have reservations, but they are nonetheless excellent tools for reporting and communicating results, even when the underlying models are not themselves categorical.
• An actionable, interpretable categorical framework layered over a continuous model.

Communicating Incentives Clearly
• If I had a choice between
  • A simple VAM that communicated actionable responses and incentives clearly, vs.
  • A complicated VAM that did not have any guidance about how to improve…
• This is a false dichotomy.
• We can make the complex seem simple.
• Conditional incentive plots and similar attention to differential student contribution to VAM estimates are one approach to this.
  • Both to anticipate gaming behavior and encourage desired responses.

Haertel (2012) Measurement vs. Influence
• In his NCME Career Award address, Haertel distinguished between two categories of purposes of large-scale testing: Measurement and Influence.
• An “Influencing” purpose often depends less on the results of the test itself.
  • Directing student effort
  • “Shaping public perception”
• Validation arguments for Influencing purposes are rarely well described.
• These plots are a modest first step for visualizing the Influencing mechanisms of proposed models.

School/Teacher Effectiveness and House, MD
• A medical analogy (thanks to Catherine McClellan) can be helpful in thinking about where VAM and school accountability research should continue to go.
• Doctors must gather data, identify symptoms, reach a diagnosis, and prescribe a treatment.
• In school and teacher effectiveness conversations, we often get stuck at “symptoms.”
• Doctors do not average blood pressure results with fMRI results to get increasingly reliable and accurate measures of “health.” Or at least they don’t stop there.
• We need to continue advancing the science of diagnosis (what’s wrong) and treatment (now what).
• We must continue beyond predictive accuracy and even conditional incentives to deeper understanding of teachers’ and administrators’ learning in response to evaluation systems.
