“Value added” measures of teacher quality: use and policy validity Sean P. Corcoran New York University NYU Abu Dhabi Conference, January 22, 2009
Overview An introduction to the use of “value added” measures (VAM) of teacher effectiveness – in both research and practice. A discussion of the policy validity of VAM – motivated by current work on “teacher effects” on multiple assessments of similar skills. With: Jennifer L. Jennings (Columbia U) and Andrew A. Beveridge (Queens College)
What are “value added” measures? Essentially, an indirect estimate of a teacher’s contribution to learning, measured using gains in students’ standardized test scores What makes them “indirect”? A statistical model accounts for certain student characteristics (key: past achievement), and the remaining test score gains are attributed to the teacher Clearly an improvement over using test score levels alone
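The residual-gain logic above can be sketched on simulated data. Everything here is hypothetical (variable names, effect sizes, sample sizes are assumptions for illustration, not from any real district): regress current scores on prior achievement, then average each classroom's residual gains.

```python
# Minimal value-added sketch on simulated data (all magnitudes assumed).
import numpy as np

rng = np.random.default_rng(0)
n_students, n_teachers = 1000, 50
teacher = rng.integers(0, n_teachers, n_students)   # teacher assignment
prior = rng.normal(0, 1, n_students)                # last year's score
true_effect = rng.normal(0, 0.15, n_teachers)       # unobservable in practice
score = 0.7 * prior + true_effect[teacher] + rng.normal(0, 0.5, n_students)

# Step 1: OLS of current score on prior achievement
X = np.column_stack([np.ones(n_students), prior])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
residual = score - X @ beta

# Step 2: a teacher's "value added" is the mean residual in her classroom
vam = np.array([residual[teacher == t].mean() for t in range(n_teachers)])
```

In this toy setup the estimated `vam` correlates strongly, but imperfectly, with `true_effect`; the gap is exactly the noise problem discussed later in the talk.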
What are “value added” measures? Generally, “teacher effects” cannot be separated from “classroom effects” E.g. two classrooms of similarly situated students where one has a particularly disruptive student May be able to improve VAM with multiple years of results for teachers This approach raises a range of additional issues and questions, some of which I will address in a moment
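One common way to pool multiple years of results, sketched below with assumed variances, is empirical-Bayes shrinkage: each raw estimate is pulled toward the overall mean, more strongly when it rests on fewer students. The `shrink` function and all numbers are illustrative, not the method used by any particular district.

```python
# Empirical-Bayes shrinkage sketch (assumed variances, for illustration only).
def shrink(raw_estimate, n_students, effect_var=0.10**2, noise_var=0.5**2):
    """Weight = signal variance / (signal variance + noise variance of the mean)."""
    weight = effect_var / (effect_var + noise_var / n_students)
    return weight * raw_estimate

# The same raw estimate based on one year (~20 students) vs. three (~60):
one_year = shrink(0.30, 20)    # -> ~0.133: heavily shrunk toward zero
three_years = shrink(0.30, 60) # -> ~0.212: more data, less shrinkage
```

The design choice this illustrates: with more years of data the estimate is trusted more, which is one reason multi-year VAM are more stable than single-year ones.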
Growth in VAM VAM of teacher effectiveness were initially mostly of academic interest Rivkin et al. (2005): effect size of .10/.11 SD for reading/math Nye et al. (2004): a shift from the 25th to the 75th percentile of teacher quality increased reading/math achievement by .35/.48 SD
Growth in VAM Value added assessment of teachers is becoming widespread practice in the U.S. Houston, Dallas, Denver, Minneapolis, Charlotte EVAAS New York City – for now a “development tool” only The Teacher Data Tool Kit
Why the sudden interest? A logical extension of school accountability Movement to collect and publicly report student achievement measures at the school level In some cases, rewards and sanctions (e.g. NCLB) Common sense appeal (both Obama and McCain supported “pay for performance” for teachers)
Why the sudden interest? Data availability Large longitudinal databases of student performance enabled these calculations Concurrent advancements in methodology
Why the sudden interest? Improving our assessment and measurement of teacher quality Easily observed characteristics of teachers are often poor predictors of classroom achievement (Hanushek and Rivkin 2006) Especially true of qualifications for which teachers are remunerated (e.g. education, certification, experience)
Issues with VAM (to name a few…) Focus on a narrow measure of educational outcomes: does “the test” adequately reflect our expectations of the educational system? E.g. skill content, short-term vs. long-term benefits Validity: assuming “the test” reflects outcomes we care about, is the instrument a valid one? Teaching to the test and test inflation (Koretz 2007) – even “good” tests lose validity over time
Issues with VAM (to name a few…) Modeling for causal inference: how can we be confident that our VAM are providing “good” estimates of the teacher’s true (i.e. causal) contribution to student learning? Students are not randomly assigned to teachers Dynamic tracking “Teacher effects” may be context dependent
Issues with VAM (to name a few…) Precision Estimates of teacher effects are just that: estimates Each student’s test score gain is a small—and noisy—indicator of teacher effectiveness Are our estimates precise enough to base personnel decisions on them?
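A back-of-the-envelope sketch of the precision problem (the standard deviations below are assumed for illustration, roughly in line with the effect sizes cited earlier): with one classroom of ~20 students, the standard error of a teacher-effect estimate can be as large as the effect itself.

```python
# Standard error of a classroom-mean residual vs. the size of teacher effects.
# All magnitudes are assumed for illustration.
import math

sd_gain = 0.5      # assumed SD of student gain residuals (test SD units)
effect_sd = 0.10   # teacher effect size in the ~0.10 SD range cited earlier

for class_size in (10, 20, 60):   # 60 ~ pooling three years of classes
    se = sd_gain / math.sqrt(class_size)
    print(class_size, round(se, 3), round(se / effect_sd, 2))
```

At 20 students the standard error exceeds `effect_sd`, which is why single-year rankings move around so much; pooling three years roughly halves the noise.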
Issues with VAM (to name a few…) Other Perverse incentives (gaming / cheating) Subject dependency Persistence Scaling issues – e.g. ceiling effects Missing data – e.g. absent or exempted students
The “policy validity” of VAM Do VAM of teacher effectiveness have “policy validity”? That is, are they appropriate for practical implementation, and for what purposes? (Harris 2007) If one were to make personnel decisions based on VAM, at the very least these measures should be: Convincing as “causal” estimates Relatively precise
Our research question If VAM are meaningful indicators of teacher effectiveness, they should be relatively consistent across alternative assessments of the same skills (especially for narrowly defined skills) In most cases we only observe one assessment – the “high stakes” state assessment – upon which teacher effects are estimated
Houston Houston is unusual in that one can observe two measures of student achievement: TAKS – a “high stakes” exam Stanford 10 – a “low stakes” exam Both test reading and math skills How consistent are VAM of effectiveness on these two tests?
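Before looking at the data, it is worth asking how much inconsistency estimation noise alone would produce. The simulation below (not Houston data; all magnitudes are assumptions) gives every teacher an identical true effect on both tests and adds independent estimation noise to each; even then, the cross-test correlation of the estimates falls well below 1.

```python
# How far estimation noise alone pushes the cross-test correlation below 1.
# Simulated, not Houston data; all magnitudes assumed.
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 500, 20
true_effect = rng.normal(0, 0.10, n_teachers)   # identical on both tests
noise_se = 0.5 / np.sqrt(class_size)            # per-test estimation noise

est_test_a = true_effect + rng.normal(0, noise_se, n_teachers)
est_test_b = true_effect + rng.normal(0, noise_se, n_teachers)
r = float(np.corrcoef(est_test_a, est_test_b)[0, 1])
```

So an observed cross-test correlation below 1 is expected even under a single true effect; the policy question is whether the observed consistency is low even after accounting for this noise floor.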
Houston data and method Longitudinal student-level data on all students in the Houston ISD, 1998 – 2006 (we use 2003 – 06) Students are linked to their teachers, with student background characteristics About 127,000 students We estimate teacher effects for 4th and 5th grade teachers on both TAKS and Stanford tests Using both one and three years of results
Conclusions Teachers who are good at promoting growth on a high-stakes test are not necessarily those who are good at promoting growth on a low-stakes test of the same subject. Teacher effects vary significantly across years and subjects Useful for policy? Probably, but we should resist relying too heavily on these measures Of course, more research is needed!