Teacher Productivity & Models of Employer Learning

Economic Models in Education Research Workshop
University of Chicago, April 7, 2011
Douglas O. Staiger, Dartmouth College

Teacher Productivity & Models of Employer Learning
• Teacher productivity
  • Estimating value added models
  • Statistical tests of model assumptions
  • Stability of the effects
• Models of employer learning
  • Searching for effective teachers: heterogeneity
  • Career concerns: heterogeneity & effort

Teacher Productivity
• Huge non-experimental literature on “teacher effects”
• Non-experimental studies estimate a standard deviation of teacher effects of .10 to .25 student-level standard deviations (2-5 percentiles) each year.
• Key findings in the non-experimental literature:
  • Teacher effects unrelated to traditional teacher credentials
  • Payoff to experience steep in first 3 years but flat afterwards
  • 1-3 years of prior performance predict sizable differences
• One experimental study (TN class-size experiment) yields a similar estimate of the variance.

Variation in Value Added Within & Between Groups of Teachers in NYC, by Teacher Certification

Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Teacher Certification

Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Years of Experience

Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Prior Value Added

How Are Teacher Effects Estimated?
• Growing use of “value added” estimates to identify effective teachers for pay, promotion, and professional development.
• But growing concern that the statistical assumptions needed to estimate teacher effects are strong and untested: are these “causal” effects of teachers?

Basics of Value Added Analysis
• Teacher value added compares actual student achievement to a counterfactual expectation
• Difference between actual and expected achievement, averaged over teacher’s students (average residual)
• Expected achievement is average achievement for students who looked similar at start of year
  • Same prior-year test scores
  • Same demographics, program participation
  • Same characteristics of peers in classroom
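The "average residual" idea above can be sketched in a few lines. This is only an illustration under strong simplifying assumptions: expected achievement is predicted from a single covariate (the prior-year score) by ordinary least squares, whereas the slides also condition on demographics, program participation, and peer characteristics. The function name `value_added` and the record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def value_added(records):
    """Average residual per teacher.

    records: list of (teacher_id, prior_score, current_score) tuples.
    Assumption: expected achievement is a linear function of the prior
    score only; a real specification would add further controls.
    """
    priors = [r[1] for r in records]
    currents = [r[2] for r in records]
    x_bar, y_bar = mean(priors), mean(currents)

    # One-covariate OLS fit of current score on prior score.
    sxx = sum((x - x_bar) ** 2 for x in priors)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(priors, currents))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual = actual minus expected achievement, grouped by teacher.
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        expected = intercept + slope * prior
        residuals[teacher].append(current - expected)

    return {teacher: mean(res) for teacher, res in residuals.items()}
```

With made-up records in which teacher "A" adds 0.2 and teacher "B" subtracts 0.2 on top of a common dependence on prior scores, the function recovers those two numbers as the teachers' average residuals.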

Estimating Value Added
1. Estimating Non-Experimental Teacher Effects
• Similar teacher residual by OLS, RE, FE (β driven by within variation).
• What matters is whether X includes baseline score & peer measures.

Estimating Value Added
2. Generating Empirical Bayes Estimates of Non-Experimental Teacher Effects

Empirical Bayes Methods
• Goal: Forecast teacher performance next year (BLUP)
• Forecast is prediction of persistent teacher component
• “Shrinkage” estimator = posterior mean: E(μ|M) = Mβ
• Weight (β) placed on a measure increases with:
  • Correlation with persistent component of interest
  • Reliability with which measure is estimated (which may vary by teacher, e.g. based on sample size)
• Can apply to any measure (value added, video rating, etc.) or combination of measures (composite estimates)

Error components
• Performance measure ($M_{jc}$) for teacher $j$ in classroom $c$ is a noisy estimate of the persistent teacher effect ($\mu_j$).
• Noise consists of two independent components:
  • classroom component ($\theta_{jc}$) representing peer effects, etc.
  • sampling error ($\nu_{jc}$) if measure averages over students, videos, raters, etc. (variance depends on sample size)
• Model for error-prone measure: $M_{jc} = \mu_j + \theta_{jc} + \nu_{jc}$

Prediction in simple case
• Using one measure ($M$) to predict teacher performance on a possibly different measure ($M'$) in a different classroom simplifies to predicting the persistent teacher component: $E(\mu'_j \mid M_{jc}) = M_{jc}\,\beta_j$
• Optimal weights ($\beta_j$) are analogous to regression coefficients:
$$\beta_j = \frac{Cov(\mu'_j, M_{jc})}{Var(M_{jc})} = \frac{Cov(\mu'_j, \mu_j)}{Var(\mu_j) + Var(\theta_{jc}) + Var(\nu_{jc})} = \frac{Cov(\mu'_j, \mu_j)}{Var(\mu_j)} \cdot \frac{Var(\mu_j)}{Var(\mu_j) + Var(\theta_{jc}) + Var(\nu_{jc})}$$
• That is, $\beta_j$ = {β if $M_{jc}$ had no noise} × {reliability of $M_{jc}$}
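A minimal sketch of the weight formula above, assuming the sampling-error variance equals a student-level variance divided by the number of students in the classroom (the slides say only that it depends on sample size). The function names and the numbers in the usage note are hypothetical, not estimates from the talk.

```python
def shrinkage_weight(var_teacher, var_classroom, var_student, n_students,
                     cov_target=None):
    """Weight beta_j = Cov(mu'_j, mu_j) / Var(M_jc).

    Assumption: Var(nu_jc) = var_student / n_students.
    If cov_target is None, the target is the same measure, so
    Cov(mu'_j, mu_j) = Var(mu_j) and the weight equals the reliability.
    """
    if cov_target is None:
        cov_target = var_teacher
    var_sampling = var_student / n_students
    return cov_target / (var_teacher + var_classroom + var_sampling)


def eb_forecast(measure, weight):
    """Shrinkage forecast of the persistent component: E(mu | M) = M * beta."""
    return measure * weight
```

For example, with a teacher variance of 0.0225, a classroom variance of 0.01, and a student-level variance of 0.5, the weight rises as the class grows from 20 to 40 students, so a measure based on more students is shrunk less toward zero.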

Two Key Measurement Problems
• Reliability/Instability
  • Imprecision → transitory measurement error
  • E.g., low correlation across classrooms
• Validity/Bias
  • Persistently misrepresent performance (e.g. student sorting)
  • Test scores capture only one dimension of performance
  • Depends on design, content, & scaling of test
• Validity & reliability determine a measure’s ability to predict performance
  • Correlation of measure with true performance = (correlation of persistent part of measure with true performance) × (square root of reliability)
  • E.g., Teacher certification versus value added
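The product rule above is easy to check numerically. The sketch below uses hypothetical inputs chosen only to contrast a noisy but valid measure with a precise but weakly related one; they are not estimates for certification or value added.

```python
from math import sqrt


def correlation_with_true_performance(validity, reliability):
    """Correlation of a measure with true performance.

    validity: correlation of the measure's persistent part with true
              performance.
    reliability: share of the measure's variance that is persistent.
    """
    return validity * sqrt(reliability)


# Hypothetical inputs, for illustration only.
noisy_but_valid = correlation_with_true_performance(validity=1.0, reliability=0.4)
precise_but_weak = correlation_with_true_performance(validity=0.1, reliability=1.0)
```

Under these made-up inputs the noisy measure is still the better predictor, because its correlation is scaled down only by the square root of its reliability.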

Statistical Tests of Model Assumptions
• Experimental forecasting test (Kane & Staiger)
• Observational specification tests (Rothstein)
• Quasi-experimental forecasting test (Carrell & West)

What Kane/Staiger do
• Randomly assign 78 pairs of teachers to classrooms in LAUSD elementary schools
• Provides experimental estimate of parameter of interest
  • If a given classroom of students were to have teacher A rather than teacher B, how much different would their average test scores be at the end of the year?
• Evaluate whether pre-experimental estimates from various value-added models predict experimental results

Experimental Design
• All NBPTS applicants from Los Angeles area.
• For each NBPTS applicant, identified comparison teachers working in same school, grade, calendar track.
• LAUSD chief of staff wrote letters to principals inviting them to draw up two classrooms that they would be willing to assign to either teacher.
• If principal agreed, classroom rosters (not individual students) were randomly assigned by LAUSD on the day of switching. LAUSD made paper copies of rosters on day of switch.
• Yielded 78 pairs of teachers (156 classrooms and 3500 students) for whom we had estimates of “value-added” impacts from the pre-experimental period.

LAUSD Data
• Grades 2 through 5
• Three Time Periods:
  • Years before Random Assignment: Spring 2000 through Spring 2003
  • Years of Random Assignment: Either Spring 2004 or 2005
  • Years after Random Assignment: Spring 2005 (or 2006) through Spring 2007
• Outcomes (all standardized by grade and year):
  • California Standards Test (Spring 2004-2007)
  • Stanford 9 Tests (Spring 2000 through 2002)
  • California Achievement Test (Spring 2003)
• Covariates:
  • Student: baseline math and reading scores (interacted with grade), race/ethnicity (Hispanic, white, black, other or missing), ever retained, Title I, eligible for free lunch, gifted and talented, special education, English language development (level 1-5).
  • Peers: Means of all the above for students in classrooms.
  • Fixed Effects: School x Grade x Track x Year
• Sample Exclusions:
  • Classes with >20 percent special education students
  • Classes with fewer than 5 or more than 36 students

Evaluating Value Added
3. Test validity of VAj against experimental outcomes

How much of the variance in (μ2p –μ1p) is “explained” by (VA2p –VA1p)?

Not Clear How To Interpret Fade-out
• Forgetting, transitory teaching-to-test → Value added overstates long-term impact
• Knowledge that is not used becomes inoperable → Need string of good teachers to maintain effect
• Grade-specific content of tests not cumulative → Later tests understate contribution of current teacher
• Students of best teachers mixed with students of worst teachers in following year, and new teacher will focus effort on students who are behind (peer effects) → No fade-out if teachers were all effective

Reconciling with Rothstein(2010)

Correlation with VAM4: .94, .93, .98, .998
• Both of us find that past teachers have lingering effects due to fade-out.
• Rothstein finds that richer set of covariates has negligible effects.
• While Rothstein speculates that selection on unobservables could cause problems, our results fail to find evidence of bias.

Reconciling Kane/Staiger with Rothstein
• Both Rothstein and Kane/Staiger find evidence of fade-out
  • Rothstein finds current student gain is associated with past teacher assignment, conditional on student’s prior test score.
  • Consistent with fade-out of prior teacher’s effect in Kane/Staiger
  • Bias in current teacher effect depends on correlation between current & past teacher value added (small in Rothstein & Kane/Staiger data).
• Both Rothstein and Kane/Staiger find that after conditioning on prior test score, other observables don’t matter much
  • Rothstein finds prior student gain is associated with current teacher assignment, conditional on student’s prior test score.
  • i.e., current teacher assignment is associated with past 2 tests.
  • Rothstein (and others) finds that controlling for earlier tests has little effect on estimates of teacher effect (corr>.98)
• Rothstein speculates that other unobservables used to track students may bias estimates of teacher effects
  • Kane/Staiger find no substantial bias from such omitted factors

Carrell/West: A Cautionary Tale!
• Quasi-experimental evidence from the Air Force (AF) Academy
  • Randomized to classes, common test & grading
  • Estimate teacher effect in 1st-year intro classes
  • Does it predict performance in 2nd-year class?
• Strong evidence of teaching-to-test
  • Big teacher effects in 1st year
  • Lower-rank instructors have larger “value added” & satisfaction
  • But predicts worse performance in 2nd-year class
  • AF system facilitated teaching to test

Summary of Statistical Tests
• Value-added estimates in low-stakes environment yielded unbiased predictions of causal effects of teachers on short-term student achievement
  • Controlling for baseline score yielded unbiased predictions
  • Further controlling for peer characteristics yielded highest explanatory power, explaining over 50% of teacher variation
• Relative differences in achievement between teachers’ students fade out at annual rate of .4-.6. Understanding the mechanism is key to long-term benefits of using value added.
• Performance measures can go wrong when easily gamed.

Are Teacher Effects Stable?
• Different across students within a class? No.
• Change over time?
  • Correlation falls slowly at longer lags
  • Teacher peer effects (Jackson/Bruegmann)
  • Effect of evaluation on performance (Taylor/Tyler)
• Depend on match/context?
  • Correlation falls when teachers change grades or courses.
  • Correlation falls when teachers change schools (Jackson)

How Should Value Added Be Used?
• Growing use of value added to identify effective teachers for pay, promotion, and professional development
• Concern that current value added estimates are too imprecise & volatile to be used in high-stakes decisions
  • Year-to-year correlation (reliability) around 0.3-0.5
  • Of top quartile one year, >10% in bottom quartile next year
• No systematic analysis of what this evidence implies for how measures could be used
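The link between the year-to-year correlation and quartile switching can be illustrated by simulation. The sketch below assumes that the two years' measures are standardized and jointly normal with correlation `rho`; that distributional assumption and the function name are not from the slides.

```python
import random
from statistics import NormalDist


def top_to_bottom_share(rho, n_teachers=100_000, seed=0):
    """Share of top-quartile teachers in year 1 who land in the bottom
    quartile in year 2, assuming bivariate-normal measures with
    year-to-year correlation rho (an assumption for illustration)."""
    rng = random.Random(seed)
    top_cut = NormalDist().inv_cdf(0.75)     # population 75th percentile
    bottom_cut = NormalDist().inv_cdf(0.25)  # population 25th percentile
    scale = (1 - rho ** 2) ** 0.5

    in_top = fell_to_bottom = 0
    for _ in range(n_teachers):
        year1 = rng.gauss(0, 1)
        # Year 2 has unit variance and correlation rho with year 1.
        year2 = rho * year1 + scale * rng.gauss(0, 1)
        if year1 > top_cut:
            in_top += 1
            if year2 < bottom_cut:
                fell_to_bottom += 1
    return fell_to_bottom / in_top
```

Calling the function for correlations between 0.3 and 0.5 gives simulated shares that can be compared with the ">10%" figure above; lower correlations produce more switching.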

Models of Employer Learning
• Motivating facts
  • Large persistent variation across teachers (heterogeneity)
  • Difficult to predict at hire (not inspection good)
  • Predictable after hire (experience good → learning)
  • Return to experience in first few years (cost of hiring)

Searching For Effective Teachers
• Use simple search model to illustrate how one could use imperfect information on effectiveness to screen teachers
• Use estimates of model parameters from NYC & LAUSD to simulate the potential gains from screening teachers
• Evaluate potential gains from:
  • Observing teacher performance for more years
  • Obtaining more reliable information on teacher performance
  • Obtaining more reliable information at time of hire

Simple search model: Setup
• Teacher effect: $\mu \sim N(0, \sigma_\mu^2)$
• Pre-hire signal (if available): $Y_0 \sim N(\mu, \sigma_0^2)$, reliability $= \sigma_\mu^2 / (\sigma_\mu^2 + \sigma_0^2)$
• #applicants = 10 times natural attrition
• Constraint: #hired = #dismissed + natural turnover
• Annual performance on the job ($t = 1, \dots, 30$): $Y_t \sim N(\mu + \beta_t, \sigma^2)$, reliability $= \sigma_\mu^2 / (\sigma_\mu^2 + \sigma^2)$
• Return to experience: $\beta_t < 0$ for early $t$, a cost of hiring
• Exogenous annual turnover rate ($t < 30$): $\pi$
• Can dismiss up until tenure at $t = T$

Simple search model: Solution
• Objective: Maximize student achievement by screening out ineffective teachers using imperfect performance measure
• Solution is similar to Jovanovic (1979) matching model
• Principal sets reservation value ($r_t$), increasing with $t$
  • Dismiss after period $t$ if $E(\mu \mid Y_0, \dots, Y_t) < r_t$, with the expectation taken from the normal learning model
  • Reservation value increases because of declining option value
• No simple analytic solution to general model
  • Numerically estimate optimal $r_t$ through simulations
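The posterior mean used in the dismissal rule can be sketched with standard normal-normal updating. This assumes a prior mean of zero for the teacher effect and independent normal noise in each signal, as in the setup; signals are assumed to be already net of the experience term. The function name and input layout are hypothetical.

```python
def posterior_mean(signals, var_mu):
    """Posterior mean and variance of the teacher effect mu.

    signals: list of (value, noise_variance) pairs, where value is the
             observed signal net of the experience term beta_t.
    var_mu:  prior variance of mu (prior mean assumed to be 0).
    """
    precision = 1.0 / var_mu   # prior precision
    weighted_sum = 0.0         # prior mean is 0, so it adds nothing
    for value, noise_variance in signals:
        precision += 1.0 / noise_variance
        weighted_sum += value / noise_variance
    return weighted_sum / precision, 1.0 / precision
```

With a single on-the-job signal, the posterior mean reduces to the reliability times the signal: for a prior variance of 0.0225 and a noise variance of 0.03375 (reliability 0.4), a signal of 0.1 gives a posterior mean of 0.04.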

Tenure cutoff in simple case
• Suppose:
  • No pre-hire signal (new hire is random draw)
  • Tenure after 1 year (no option value)
  • Return to experience only in year 1 ($\beta_1 < 0$)
• f.o.c.: marginal tenured teacher = average teacher next year

Simulation assumptions from NYC & LAUSD
• Maintained assumptions across all simulations
  • SD of teacher effect: $\sigma_\mu = 0.15$ (in student SD units; national black-white gap = .8-.9)
  • Turnover rate if not dismissed: $\pi$ = 5%
• Assumptions for simplest base case (will be varied later)
  • No useful information at time of hire
  • Reliability of $Y_t$: $\sigma_\mu^2 / (\sigma_\mu^2 + \sigma^2) = 0.4$ (40% reliability)
  • Cost of hiring new teacher: $\beta_t = -.07$ in 1st year, $-.02$ in 2nd year
  • Dismissal only after first year (e.g. tenure decision after 1 year)

Simple model: dismiss 80% of probationary teachers(!)
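The first-order condition for the simple case (marginal tenured teacher = average teacher) can be solved numerically. The sketch below is not the simulation behind the slides: it assumes an infinite-horizon steady state, a hiring cost in year 1 only, exogenous turnover that also applies to first-year teachers, and a tenure rule that retains a teacher when the posterior mean (reliability times the first-year signal net of the hiring cost) is at least a cutoff `r`. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def steady_state_average(r, sd_mu, reliability, beta1, turnover):
    """Average teacher effect in steady state for tenure cutoff r."""
    sd_m = sqrt(reliability) * sd_mu      # SD of the posterior mean
    z = r / sd_m
    retained = 1 - STD_NORMAL.cdf(z)      # share of novices given tenure
    # retained * E(mu | posterior mean >= r) for a normal posterior mean
    retained_times_mean = sd_m * STD_NORMAL.pdf(z)
    numerator = turnover * beta1 + (1 - turnover) * retained_times_mean
    denominator = turnover + (1 - turnover) * retained
    return numerator / denominator


def optimal_cutoff(sd_mu, reliability, beta1, turnover, iterations=100):
    """Bisection on average(r) - r = 0, i.e. the first-order condition."""
    sd_m = sqrt(reliability) * sd_mu
    low, high = -4 * sd_m, 4 * sd_m
    for _ in range(iterations):
        mid = (low + high) / 2
        gap = steady_state_average(mid, sd_mu, reliability, beta1, turnover) - mid
        if gap > 0:
            low = mid    # average still above cutoff: raise the bar
        else:
            high = mid
    cutoff = (low + high) / 2
    dismissed = STD_NORMAL.cdf(cutoff / sd_m)
    return cutoff, dismissed


cutoff, dismissed = optimal_cutoff(sd_mu=0.15, reliability=0.4,
                                   beta1=-0.07, turnover=0.05)
```

Under the base-case parameters from the previous slide, this simplified version also dismisses a large majority of first-year teachers, in the same neighborhood as the 80% reported above, even though it omits the second-year hiring cost and the 30-year horizon.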

Why dismiss so many probationary teachers?
• Differences in teacher effects are large & persistent, relative to short-lived costs of hiring a new teacher
• Even unreliable performance measures predict substantial differences in teacher effects
• Costs of retaining an ineffective teacher outweigh costs of dismissing an effective teacher
• Option value of new hires
  • For every 5 new hires, one will be highly effective
  • Trade off short-term cost of 4 dismissed vs. long-term benefit of 1 retained

Why not dismiss so many probationary teachers?
• Smaller benefits than assumed in the model?
  • High turnover rates
  • Teacher differences that do not persist in future (including if PD can help ineffective teachers)
  • High stakes → distortion of performance measures
• Larger costs than assumed in the model?
  • Direct costs of recruiting/firing (little effect if added)
  • Difficulty recruiting applicants (but LAUSD did)
  • Higher pay required to offset job insecurity (particularly if teacher training is required up front)

Requiring a 2nd or 3rd year to evaluate a probationary teacher is a bad idea.

Allowing a 2nd or 3rd year to evaluate a probationary teacher is a good idea.

Obtaining more reliable information on teacher performance is valuable, little effect on dismissal

Obtaining more reliable information at time of hire is even more valuable, and reduces dismissal rate.

Implications
• Why do principals set a low tenure bar?
  • Poor incentives (private schools?)
  • Lack verifiable performance information
  • Current up-front training requirements (not necessary?)
  • Lose best teachers if cannot raise pay
• Why don’t other occupations & professions dismiss 80%?
  • Job ladder: low-stakes entry-level job used to screen
  • MD, JD: require up-front training, job differentiation later
• Alternatives to current system
  • No up-front investment: can train later
  • Rather than credentials, base certification on performance
  • Develop “job ladder” pre-screen, e.g. initial job where few students are put at risk, but which reveals your ability (summer school?)

Summary
• Potential gain is large
  • Could raise average annual achievement gains by ≈0.08
  • Similar magnitude to STAR class-size experiment and to recent results from charter school lotteries
  • Gains could be doubled with a more reliable performance measure, and tripled if it were observed pre-hire
• Select only the most effective teachers, and do it quickly
• May be practical reasons limiting success of this strategy
• May require rethinking teacher training & job ladder
• Focused on screening, but other uses may yield large gains

Combining Heterogeneity & Effort: Model of Career Concerns
• Gibbons & Murphy (1992)
• Output ($y_t$) is the sum of ability ($\eta$), effort ($a_t$), and noise ($e_t$).
• Workers risk-averse, convex costs of effort
• Information imperfect, but symmetric, so firms pay expected output.

Combining Heterogeneity & Effort: Model of Career Concerns
• Gibbons & Murphy (1992)
• Simple optimal linear contract: $w_t = c_t + b_t (y_t - c_t)$, with $c_t = a_t^* + m_{t-1}$
• Base pay ($c_t$) is the expected value at $t-1$ of output at $t$, and is the sum of two terms:
  • Equilibrium effort ($a_t^*$): an experience effect
  • Posterior mean of ability ($m_{t-1}$) based on earlier output
• Incentive payment depends on:
  • How much output exceeds expectations ($y_t - c_t$)
  • Weight ($b_t^*$) declines with noise in $y_t$, and grows with experience: early effort is rewarded indirectly through its impact on beliefs ($m_{t-1}$)
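The linear contract above is simple enough to evaluate directly. The sketch below only computes pay for given inputs; it does not solve for the optimal weight or equilibrium effort, and the function names and example numbers are hypothetical.

```python
def base_pay(equilibrium_effort, prior_ability_mean):
    """c_t = a_t* + m_{t-1}: expected output at t given information at t-1."""
    return equilibrium_effort + prior_ability_mean


def wage(output, equilibrium_effort, prior_ability_mean, incentive_weight):
    """w_t = c_t + b_t * (y_t - c_t)."""
    c_t = base_pay(equilibrium_effort, prior_ability_mean)
    return c_t + incentive_weight * (output - c_t)


# Hypothetical numbers: expected output is 1.0 + 0.5 = 1.5, realized
# output is 2.0, and a quarter of the surprise is paid out.
example_wage = wage(output=2.0, equilibrium_effort=1.0,
                    prior_ability_mean=0.5, incentive_weight=0.25)
```

In this example the worker receives the base pay of 1.5 plus 0.25 times the 0.5 output surprise, for a wage of 1.625. A weight of zero would pay expected output regardless of the realization.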
