Measuring and Enhancing Teacher Effectiveness: Data, Methods, and Policies

Measuring and Enhancing Teacher Effectiveness: Data, Methods, and Policies • Susanna Loeb* • Higher School of Economics National Research University, Moscow • September 2014 • *Content joint with Jim Wyckoff & Allison Atteberry, Ben Master, Matt Ronfeldt, or Luke Miller

Why Measure Teacher Effectiveness? • Better decisions • Direct • e.g. whom to promote • Indirect • Improved understanding • e.g. what experiences improve teacher effectiveness?

Today • A bit of history on teacher effectiveness measures in the US • Considerations of Measurement • Four examples of potential uses • focus on the last one

Large-Scale Test Data Availability • Test-Based Accountability • State Level First • TX, NC, SC, FL and others introduced yearly tests to track school performance. • Federal Level - No Child Left Behind Act • Required ELA and math tests in 3rd-8th grade plus one in high school • State and district data allowed researchers to assess policy effects and the effects of teachers • Teachers vary widely in their ability to improve student achievement (Gordon, Kane, & Staiger 2006; Rivkin, Hanushek, & Kain 2005; Sanders & Rivers 1996) • Teachers improve with experience, particularly during their first two years (e.g. Rockoff, 2004)

The Widget Effect • 2009 Study in 12 large school districts • Schools and districts • Not measuring teacher effectiveness • In districts that use binary evaluation ratings (generally “satisfactory” or “unsatisfactory”), more than 99 percent of teachers receive the satisfactory rating. • Districts that use a broader range of rating options do little better; in these districts, 94 percent of teachers receive one of the top two ratings and less than 1 percent are rated unsatisfactory. • Not considering teacher effectiveness in decisions

Push for Evaluation • Combination of • Recognition of Teacher Importance • Recognition of the Widget Effect • Led to a strong push for new evaluation systems • Not based solely on subjective assessments, given the forces leading to little variation. • Speed of change probably due to Obama administration policies • close ties to entrepreneurial educators: TNTP, TFA…

Race to the Top • $4.35 Billion Competition as part of the American Recovery and Reinvestment Act of 2009 • Most points for “Great Teachers and Leaders” (138/500) • Improving teacher and principal effectiveness based on performance (58 points) • Ensuring equitable distribution of effective teachers and principals (25 points) • Providing high-quality pathways for aspiring teachers and principals (21 points) • Providing effective support to teachers and principals (20 points) • Improving the effectiveness of teacher and principal preparation programs (14 points)

Improving teacher effectiveness using performance measures • Raises Questions • How to measure effectiveness? • How to use measures of effectiveness once you have them? • What are different kinds? • Output based (e.g., based on student test performance) • Process based (e.g., based on structured observational protocol) • Holistic / Subjective (e.g., principal evaluations) • What features do we want? • Validity (measurement property) • Reliability (measurement property) • Stability (effectiveness property) • Focus today on measures based on student test scores • Similar analyses could be done with other measures

Value-Added • Measure teacher effectiveness by how much students’ test performance improves from the spring of the prior year to the spring of the current year • Idea is to isolate the teacher’s effect from other effects on learning – “value-added” • Can only be calculated for teachers in grades and subject areas for which there are tests in the prior year as well as the current year • Clearly better than using test performance levels • Far from perfect • e.g., based on imperfect tests, subject to random fluctuations and potential gaming

VAMs: How are they calculated? • Student test score gains relative to what we think they would be • Most are a basic regression • Predict what a student would score in the spring based on a linear function of prior score, demographic characteristics, program participation (maybe), class characteristics, school characteristics • Value-added is the average difference between actual and predicted scores for a teacher’s students • “Colorado Growth Model” • For each student, how much do they learn relative to other students with the same prior test score (percentiles)? • Median percentile of growth for the class • Do Different Value-Added Models Tell Us the Same Things? • Models vary in how they account for student backgrounds, school, and classroom resources and whether they compare teachers across a district (or state) or just within schools. • Correlations between models are often high, but even so different models will categorize many teachers differently. (Goldhaber & Theobald, 2013)
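
The regression approach above can be sketched in a few lines. This is a minimal illustration, not the model used in any district: it predicts the current score from the prior score alone (real models add demographic, class, and school controls), and the teacher labels, effect sizes, and sample sizes are hypothetical simulated values.

```python
import random
from collections import defaultdict

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def value_added(records):
    """records: list of (teacher_id, prior_score, current_score).
    Returns {teacher_id: mean(actual - predicted)} over that teacher's students."""
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    intercept, slope = fit_line(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        residuals[teacher].append(c - (intercept + slope * p))
    return {t: sum(v) / len(v) for t, v in residuals.items()}

# Hypothetical data: three teachers with assumed true effects (in SD units).
random.seed(0)
true_effect = {"A": 0.3, "B": 0.0, "C": -0.3}
records = []
for teacher, effect in true_effect.items():
    for _ in range(200):
        prior_score = random.gauss(0, 1)
        current_score = 0.7 * prior_score + effect + random.gauss(0, 0.5)
        records.append((teacher, prior_score, current_score))

va = value_added(records)
for teacher in sorted(va):
    print(teacher, round(va[teacher], 2))
```

With 200 students per teacher the estimates land near the assumed effects; with a single class of 25 they would be far noisier, which is the measurement-error concern raised on later slides.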

A detailed example: NYC • Standard deviations: ELA: 0.24 (0.19 shrunk); Math: 0.28 (0.21 shrunk)

Is VA a “Good” Measure? • Carnegie Knowledge Network • http://www.carnegieknowledgenetwork.org/ • Test scores are an imperfect measure of all we care about for students • No obvious bias (especially within schools) • Substantial measurement error • Less when considering groups of teachers • Benefits of use depend on alternatives

Understanding and Decision Making • Potential uses: 2 direct and 2 indirect

Example 1: simulated use, the case of Layoffs • Several school districts confronted teacher layoffs in spring 2010 and 2011 • Some avoided layoffs, e.g., New York City • Others did not, e.g., LA and DC • Layoffs nearly always determined by a measure of seniority • Many superintendents raised concerns that seniority layoffs compromise teacher quality

What might we expect if we substituted VA for Seniority? • Seniority layoffs typically affect teachers with two or fewer years of experience • On average teachers improve markedly during their first 3-4 years • Large variance in teacher effectiveness within and across experience • Many districts have recently focused on recruiting more able teachers

Simulate: Who is laid off by 5% Salary Savings under Seniority vs. VA? • Simply simulated what would happen if 5% of the workforce had been laid off two years earlier by seniority or value-added • Fewer teachers laid off with VA layoffs: • Seniority-based layoff system would lay off 7% of teachers • VA system would terminate 5% of teachers • Little overlap • Only 13% of seniority layoffs would also be laid off by VA • VA estimates that control for experience reduce overlap to 5% • VA layoffs are, on average, 7 years more experienced than seniority layoffs
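
A toy version of this comparison shows why a seniority rule needs more layoffs to reach the same salary savings. This is a sketch under stated assumptions: the salary schedule, the value-added distribution, and the workforce size are all hypothetical, so the printed shares will not reproduce the 7%, 5%, or 13% figures above.

```python
import random

def layoffs(teachers, order_key, savings_target):
    """Lay off teachers in order_key order until savings_target (a share of
    total payroll) has been saved. Returns the set of laid-off teacher ids."""
    payroll = sum(t["salary"] for t in teachers)
    saved = 0.0
    laid_off = set()
    for t in sorted(teachers, key=order_key):
        if saved >= savings_target * payroll:
            break
        laid_off.add(t["id"])
        saved += t["salary"]
    return laid_off

# Hypothetical workforce: salary rises with experience; value-added improves
# over the first four years and varies widely at every experience level.
random.seed(1)
teachers = []
for i in range(2000):
    experience = random.randint(0, 30)
    teachers.append({
        "id": i,
        "experience": experience,
        "salary": 45000 + 2000 * experience,
        "va": random.gauss(0.02 * min(experience, 4), 0.2),
    })

by_seniority = layoffs(teachers, lambda t: t["experience"], 0.05)
by_va = layoffs(teachers, lambda t: t["va"], 0.05)

share_seniority = len(by_seniority) / len(teachers)
share_va = len(by_va) / len(teachers)
overlap = len(by_seniority & by_va) / len(by_seniority)

def mean_experience(ids):
    return sum(teachers[i]["experience"] for i in ids) / len(ids)

print(f"laid off by seniority: {share_seniority:.1%}")
print(f"laid off by value-added: {share_va:.1%}")
print(f"seniority layoffs also laid off by VA: {overlap:.1%}")
print(f"experience gap (VA minus seniority): "
      f"{mean_experience(by_va) - mean_experience(by_seniority):.1f} years")
```

Because the least senior teachers earn the least, more of them must be laid off to save the same share of payroll, and because value-added varies widely within each experience level, the two rules select largely different teachers.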

4th and 5th grade Value-Added of Layoffs by Seniority and VA

How would principals have rated laid-off teachers? • 2.5% of our sample received an “Unsatisfactory” rating by their principal from 2006-09 • Of these, 16% would have been VA layoffs, but only 8% of VA layoffs would have received a “U” rating • none would have been seniority layoffs

Effects on Student Learning • Small effect overall since only 5% laid off, but large effects on students with the affected teachers.

Layoff Example • Dismissal based on teacher performance measures likely to have less negative effects on students than dismissal based on experience • In reality, given coverage and reliability concerns, value-added measures would likely be used in combination with other performance measures • Availability of performance measures allowed for simulation of policy effects that could be helpful for policy decisions

Example 2: actual use, the case of Promotion • Teacher Tenure: job protection most often received after 3 years • Tenure history • NJ first tenure law 1909; NY 1917; CA 1921; MI, PA, WI 1937 • 48 states • Contentious then, contentious now • Policy on two tracks • Eliminate tenure • GA: eliminated 2001, reinstated 2003 • ID: passed 2011, voters repealed 2012 • SD: passed 2012, voters upheld, will eliminate by 2016 • FL: eliminated in 2011; NC: will eliminate by 2018 • Make more rigorous • More than half the states require meaningful evaluation • 20 states require student test performance • 25 states have multiple categories for evaluation

New York City tenure policy • Principal recommends, superintendent decides • Tenure decisions: approve, extend or deny • Prior to 2009-10 tenure largely automatic • Reform encouraged careful review • 2009-10 • Classroom obs, evals of teacher work products, annual S/D/U ratings • Teacher data reports (value-added measures for some teachers); in-class assessments aligned with NY standards • District guidance: “tenure in doubt”, “tenure likely”; rationale for cases that countered district guidance • 2010-11 • All teachers rated as highly effective, effective, developing, ineffective • District performance flags, but no guidance • 2011-12 • Same as before except value-added measures not available in time • 2012-13 • Same as before with State provided growth scores and growth ratings replacing local value-added measures

How did tenure rates change following reform? New tenure Policy

Which teachers were affected by the policy? • Attributes of teachers by tenure decision, 2010-11 to 2012-13 • 38% of a SD in teacher effectiveness • * Value added results for only 2010-11. Extend v. Approve: p<0.05; Extend v. Deny: p<0.05

How did the composition of continuing teachers change following reform? • Attributes of extended teachers by attrition behavior, 2010-11 & 2011-12 • Notes: ** p<0.01, * p<0.05, ~ p<0.1 – compares same school to transfer/exit

Tenure Example • Effectiveness measures used directly in practice • Reform of practice, not policy, that worked within the current contract • Imprecision is part of all evaluation measures • Here structure of reform allows for corrections

Example 3: to understand schooling, the case of Turnover • Nationally, about 1/3 of teachers leave the profession in first 5 years • Higher in high-poverty, urban, & low-performing schools (Hanushek, Kain & Rivkin, 1999) • In NYC, about 14% of 4th & 5th grade teachers leave their school each year • 4% migrate schools, 10% leave district • Is this problematic?

Background • Teacher turnover often assumed to harm student achievement…but is it? • Little empirical evidence for direct effect (Guin, 2004) • Turnover rates are higher in lower-performing schools (Guin, 2004; Hanushek et al. 1999) • Causal? A third factor explaining both (principal leaving)? • Direction? • Some turnover can be beneficial – new ideas, person-job match (Organizational management lit, e.g. Abelson & Baysinger, 1984)

Consider 2 Theories of Action • Compositional – turnover changes composition of teachers (esp. quality) which, in turn, impacts achievement • Disruption – disruptive effect beyond changes in composition of teachers • Organizational -- ALL teachers • NOT just leavers & their replacements

Methods • Unique identification strategy – school-by-grade-by-year level turnover (2 measures) • Two classes of fixed-effects regression models • Grade-by-School: Look within same school and grade across time • Lower achievement in years with more turnover? • School-by-Year: Look within same school and year across grades • Lower achievement in grades with more turnover?
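
The two fixed-effects comparisons can be illustrated with a within-group (demeaning) estimator. This is a simplified sketch on simulated data, not the study's specification: it omits the student, classroom, and year or grade controls a real model would include, and the assumed true effect, school quality distribution, and turnover process are hypothetical.

```python
import random
from collections import defaultdict

def within_slope(rows, group_key):
    """Slope of score on turnover after subtracting group means, which is
    equivalent to a regression with a fixed effect for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    num = den = 0.0
    for members in groups.values():
        mean_t = sum(r["turnover"] for r in members) / len(members)
        mean_s = sum(r["score"] for r in members) / len(members)
        for r in members:
            num += (r["turnover"] - mean_t) * (r["score"] - mean_s)
            den += (r["turnover"] - mean_t) ** 2
    return num / den

# Hypothetical data: weaker schools have more turnover, which confounds a
# pooled comparison; the assumed true effect of full turnover is -0.09 SD.
TRUE_EFFECT = -0.09
random.seed(3)
rows = []
for school in range(200):
    quality = random.gauss(0, 0.3)
    for grade in (3, 4, 5):
        for year in range(5):
            turnover = 0.2 - 0.2 * quality + random.gauss(0, 0.15)
            turnover = min(1.0, max(0.0, turnover))
            score = quality + TRUE_EFFECT * turnover + random.gauss(0, 0.05)
            rows.append({"school": school, "grade": grade, "year": year,
                         "turnover": turnover, "score": score})

pooled = within_slope(rows, lambda r: 0)  # one group: no fixed effects
grade_by_school = within_slope(rows, lambda r: (r["school"], r["grade"]))
school_by_year = within_slope(rows, lambda r: (r["school"], r["year"]))

print(f"pooled: {pooled:.3f}")
print(f"grade-by-school fixed effects: {grade_by_school:.3f}")
print(f"school-by-year fixed effects: {school_by_year:.3f}")
```

The pooled slope is far more negative than the assumed effect because it also picks up differences between schools, while both within-group estimates recover something close to it. That is the motivation for comparing the same school and grade across years, or the same school and year across grades.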

Findings • Student achievement is lower in years/grades when turnover rates were higher • Math scores are 8-10 percent of a standard deviation lower in years when there is 100 percent turnover (vs. no turnover). ELA smaller effect: 5-6 percent • In a grade level that has 5 teachers, reducing turnover from 2 teachers leaving to none increases math achievement by 3% of SD • Small but meaningful, and applies to all students in grade level • Roughly the same magnitude as the coefficient on free lunch eligibility • Probably underestimates the effect, since the models exploit “idiosyncratic” turnover (ignoring systemic effects)
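
The 5-teacher example follows from scaling the 100-percent-turnover estimate by the share of teachers leaving. The short check below assumes the effect is linear in the turnover rate, which is what a linear regression coefficient implies; the function name is hypothetical.

```python
def scaled_effect(effect_at_full_turnover, teachers_leaving, teachers_in_grade):
    """Effect in SD units, assuming it is proportional to the turnover rate."""
    return effect_at_full_turnover * teachers_leaving / teachers_in_grade

# Math estimates from the slide: 8-10 percent of a SD at 100 percent turnover.
low = scaled_effect(0.08, 2, 5)   # 2 of 5 teachers leaving = 40% turnover
high = scaled_effect(0.10, 2, 5)

print(f"math effect of 2 of 5 teachers leaving: {low:.1%} to {high:.1%} of a SD")
```

The lower end of this range matches the roughly 3% of a standard deviation quoted above.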

Is the effect compositional? • Control for teaching experience, new to the school, and value-added • Evidence for compositional theory of action • A significant share of the effect (30-70%) remains unexplained by composition • Also, evidence for disruptive effect beyond changes in teacher composition • Students of stayers do worse in years with more turnover

Turnover Example • Student test score measures used to better understand the implications of turnover for students • Value-added measures allowed for distinguishing compositional effects of turnover from disruptive effects

Example 4: to understand Teaching & Learning, the case of Persistent Learning • Final example • explores what students learn in school and how that impacts their later achievements

Getting on the same page

Getting on the same page • [Diagram labels: Current, Prior, Prior Teacher, Current Teacher]

Cross-subject effects • [Diagram labels: Current Other Subject, Prior, Prior Teacher]

Why Might Teachers Vary In Persistence?

Relevant Extant Research

What’s missing (and interesting)? • Few persistence studies • Replication • No cross-subject persistence studies for test performance • Distinguishing general and specific knowledge gains • Few studies of variance in persistence

Research Questions • What is the persistence of teachers’ value-added within and across subject areas? • Does value-added persistence vary by teachers’ ability? • Does value-added persistence vary by students’ background or prior achievement? • Does variation in persistence stem from students’ differential rates of forgetting previously acquired long-term knowledge? • Do school-level characteristics predict variation in teachers’ persistence?

1. What is the persistence of teachers’ value-added within and across subject areas? • Use method from Jacob, Lefgren and Sims (2010) • Predict current test score with students’ prior test score • Same subject: Gives observed relationship between prior and current score. • Other subject: Gives observed relationship between prior and current score in other subject. • Instruments prior score with twice lagged score (only using variation in score that was there the prior year) • Same subject: How much of long-term knowledge is retained • Other subject: How much long-term knowledge is general (applies to both subjects) • Instruments prior knowledge with prior teacher value-added (only using variation in score that came from teacher) • Same subject: How much of learning from teacher is persistent • Other subject: How much learning from teacher is general
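
The logic of the three estimates can be illustrated with a single-instrument IV estimator on simulated data. This is a sketch of the idea only, not the Jacob, Lefgren and Sims implementation: it has no covariates or classroom fixed effects, it treats the prior teacher's effect as observed rather than estimated, and the retention and persistence parameters are hypothetical.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def ols_slope(x, y):
    """Slope from regressing y on x."""
    return cov(x, y) / cov(x, x)

def iv_slope(z, x, y):
    """IV estimate of the effect of x on y with one instrument z:
    cov(z, y) / cov(z, x). Equals 2SLS when there are no covariates."""
    return cov(z, y) / cov(z, x)

# Hypothetical parameters for the simulation.
LONG_RUN_RETENTION = 0.9   # share of long-term knowledge kept a year later
TEACHER_PERSISTENCE = 0.2  # share of teacher-induced learning kept a year later

random.seed(2)
twice_lagged, prior, current, prior_teacher_va = [], [], [], []
for _ in range(20000):
    knowledge = random.gauss(0, 1)   # long-term knowledge
    teacher = random.gauss(0, 0.5)   # prior-year teacher's contribution
    twice_lagged.append(knowledge + random.gauss(0, 0.5))
    prior.append(knowledge + teacher + random.gauss(0, 0.5))
    current.append(LONG_RUN_RETENTION * knowledge
                   + TEACHER_PERSISTENCE * teacher
                   + random.gauss(0, 0.5))
    prior_teacher_va.append(teacher)

observed = ols_slope(prior, current)
long_term = iv_slope(twice_lagged, prior, current)
teacher_persistence = iv_slope(prior_teacher_va, prior, current)

print(f"observed relationship (OLS): {observed:.2f}")
print(f"long-term knowledge retained (IV, twice-lagged score): {long_term:.2f}")
print(f"teacher value-added persistence (IV, prior teacher VA): {teacher_persistence:.2f}")
```

Each instrument isolates a different part of the prior score. The twice-lagged score shares only the long-term knowledge component, and the prior teacher's value-added shares only the teacher-induced component, so the two IV estimates recover the two assumed parameters while OLS returns a blend attenuated by test noise. Replacing `current` with the other subject's score gives the cross-subject versions.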

Cross subject • Replace the outcome measure with the other subject score (and classroom fixed effects with other subject classroom fixed effects) • Long-run knowledge • Same approach captures percent of long-term knowledge that is general knowledge • Persistence • Same approach captures percent of teacher effect that is persistent through only general knowledge

Context: Correlations ELA teachers’ value added Not Much

Research Question 1 • What is the persistence of teachers’ value-added within and across subject areas?

Persistence of Observed Knowledge, Long Term Knowledge, and Teacher Value Added • Retain most long-term knowledge • Retain about 20% of learned knowledge

Cross-subject • About 60% of long-term knowledge goes across subjects • Learning from ELA teachers affects future math 3+ times as much as math teachers affect ELA (almost as much as math learning affects math)

Research Question 2 Does value-added persistence vary by teachers’ ability?

Table 4: Heterogeneity of ELA Teachers’ Persistence

Table 5: Heterogeneity of Math Teachers’ Persistence
