Evaluating and Improving Value-Added Modeling Douglas N. Harris University of Wisconsin at Madison
Background • IES Teacher Quality Grant; Harris and Sass • 2006 IES conference • November mini-conference at UW-Madison • Caveat: Multidisciplinary group but “econ-centric” presentation
Summary • Purposes of value-added modeling (VAM) • Criteria for evaluating VAM • Some problematic results • Methodological issues • A research agenda and upcoming conference
Different Purposes • There are two main purposes of value-added models: (1) VAM for program evaluation (VAM-P) (2) VAM for accountability (VAM-A) • In both cases, arguably trying to mimic random assignment experiments
Criteria for Evaluating VAM • Different purposes, different criteria for evaluation: Criteria for VAM-P: validity and reliability of the program/policy effect parameter Criteria for VAM-A: validity and reliability of individual personnel effects • Meeting the criteria appears more difficult with VAM-A, which involves hundreds or thousands of parameters
Tentative, But Problematic, Findings • In some VAM-A models, teacher effects are unstable for individual teachers over time • When comparing teacher effects estimated from the same data but different VAM-A models, the results are weakly correlated • VAM-A teacher effects are imprecise, making it difficult to distinguish teacher effectiveness with the usual degree of confidence
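The instability finding above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the analysis behind the presentation: every teacher has a fixed true effect, each year's estimate is the mean gain of one class, and the parameter values (`effect_sd`, `noise_sd`, `class_size`) are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def simulate_year(true_effects, class_size, noise_sd, rng):
    """Estimate each teacher's effect as the mean gain of one class."""
    estimates = []
    for effect in true_effects:
        # Each student's gain = teacher's true effect + student-level noise.
        gains = [effect + rng.gauss(0, noise_sd) for _ in range(class_size)]
        estimates.append(sum(gains) / class_size)
    return estimates


def year_to_year_correlation(n_teachers=200, class_size=25,
                             effect_sd=0.1, noise_sd=1.0, seed=0):
    """Correlate estimated teacher effects from two simulated years.

    All parameter values are illustrative assumptions, not estimates
    taken from the presentation.
    """
    rng = random.Random(seed)
    true_effects = [rng.gauss(0, effect_sd) for _ in range(n_teachers)]
    year1 = simulate_year(true_effects, class_size, noise_sd, rng)
    year2 = simulate_year(true_effects, class_size, noise_sd, rng)
    return pearson(year1, year2)


if __name__ == "__main__":
    print("noisy classes:   ", round(year_to_year_correlation(), 2))
    print("noise-free case: ", round(year_to_year_correlation(noise_sd=0.0), 2))
```

In this toy setup the true effects never change, so any year-to-year instability comes purely from sampling noise in small classes. Real instability could also reflect genuine changes in effectiveness, which the sketch does not model.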
Methodological Issues • Assumptions about student test scores • Assumptions about teaching and learning • Others: amount of information, complexity of computation, missing data • Significance of methodological issues varies by purpose (VAM-P vs. VAM-A)
Assumptions about Test Scores • VAM assumes that test scores are on an interval scale - In other words, a one-point increase means the same thing no matter where we start - Put another way, vertical scaling works • Some (many?) psychometricians believe that, despite best efforts, test scores are not really interval scale • Ad hoc adjustments may not solve the problem - non-linear term on right-hand side - grade-by-year fixed effects
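To see why the interval-scale assumption matters, the sketch below compares two hypothetical teachers by mean gain, first on the raw scale and then after an order-preserving rescaling. The scores and the `convex` rescaling are invented for illustration; the point is only that a comparison of gains can flip when the scale is not interval.

```python
def mean_gain(pre, post, rescale=lambda s: s):
    """Average gain after applying a monotone rescaling to every score."""
    gains = [rescale(b) - rescale(a) for a, b in zip(pre, post)]
    return sum(gains) / len(gains)


def convex(score):
    """A hypothetical order-preserving rescaling that stretches high scores."""
    return score ** 2 / 100


# Hypothetical classes: teacher A's students start low, teacher B's start high.
teacher_a = {"pre": [10, 12, 14], "post": [20, 22, 24]}
teacher_b = {"pre": [80, 82, 84], "post": [88, 90, 92]}

raw_a, raw_b = mean_gain(**teacher_a), mean_gain(**teacher_b)
conv_a = mean_gain(**teacher_a, rescale=convex)
conv_b = mean_gain(**teacher_b, rescale=convex)

if __name__ == "__main__":
    print("raw scale:      A ahead of B?", raw_a > raw_b)
    print("rescaled score: A ahead of B?", conv_a > conv_b)
```

Both scales rank every individual student identically, yet the teacher with the larger mean gain differs between them. The flip depends on the two teachers serving students at different points on the scale.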
Assumptions about Learning • VAM models make assumptions about the decay of past learning/inputs • All VAM models assume that nothing happens between the test administration and the beginning of the subsequent school year - summer learning loss • VAM models do NOT assume, however, that students learn “smoothly” - some express concern that students learn in spurts in ways that are independent of instructional quality
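The decay assumption can be made concrete with one common way of writing a value-added model. This is an illustrative sketch, not necessarily the specification used in the work described here, and all symbols are assumptions introduced for this example.

```latex
A_{it} = \lambda A_{i,t-1} + X_{it}\beta + \tau_{j(i,t)} + \varepsilon_{it}
```

Here $A_{it}$ is the test score of student $i$ in year $t$, $X_{it}$ collects student and classroom characteristics, $\tau_{j(i,t)}$ is the effect of the teacher assigned to student $i$ in year $t$, and $\varepsilon_{it}$ is an error term. The parameter $\lambda$ governs how much prior achievement persists: setting $\lambda = 1$ assumes no decay and reduces the model to one for the gain score $A_{it} - A_{i,t-1}$, while $\lambda < 1$ allows past learning to fade.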
Assumptions about Teaching • VAM-A assumes that the mediating factors influencing student achievement influence effectiveness of all teachers in the same way - e.g., class size • A specific and important example is the assumption that teachers are equally effective with all types of students
Lots of Assumptions & Problems, But . . . • Even with modest validity and reliability, VAM-A could improve education: - The education system already uses student test scores—and uses them badly - Violations of assumptions per se do not invalidate VAM-A • Little question that VAM-P should be pursued
Short-Term Research Agenda • Follow-up on earlier “problematic” findings - in progress: testing robustness of teacher effects across VAM-A models • Clarify assumptions being made in each type of VAM model • Test sensitivity of VAM results to test scaling (and test type) • Test whether teachers have different levels of effectiveness with different types of students (e.g., different initial test scores)
Long-Term Research Agenda • Test VAM with experiments • Study the effects of VAM-A on school decision-making - Does VAM-A (w/o high stakes) appear to yield better decisions about, for example, the allocation of school resources? - Does VAM-A w/ merit pay result in higher student test scores? (i.e., use VAM-P to evaluate VAM-A) - Do these changes in scores reflect real improvements in learning or gaming the system? - Studies in progress
For All Future VAM Work . . . • Be explicit about assumptions and their potential implications • Test the assumptions • Where assumptions fail, compare different models to test for robustness
Steps Down the Path • A larger national conference in Madison, WI in Spring, 2008 • Co-Chairs: Harris, Gamoran, Raudenbush • Program Committee members: Braun, Lockwood, Meyer, Sass • Interdisciplinary • 10 commissioned papers, plus policy discussions
Final Thoughts • There is considerable interest in VAM and policymakers are eager for direction • There is (or should be) near consensus that VAM-P is an important advance - policymakers should push forward in collecting student-level data with unique student identifiers • VAM-A is worth cautious experimentation and further study, but not yet widespread adoption with high stakes