Robert W. Lissitz University of Maryland

The Evaluation of Teacher and School Effectiveness Using Growth Models and Value Added Modeling: Hope Versus Reality. Robert W. Lissitz, University of Maryland. http://marces.org/Completed.htm


Presentation Transcript


  1. The Evaluation of Teacher and School Effectiveness Using Growth Models and Value Added Modeling: Hope Versus Reality Robert W. Lissitz University of Maryland http://marces.org/Completed.htm Maryland Assessment Research Center for Education Success

  2. Thank you • First, I want to thank… • The creators of this symposium • Burcu Kaniskan • The State of Maryland • MARCES: • Laura Reiner, Yuan Zhang, Xiaoshu Zhu, and Dr. Bill Schafer • Drs. Xiaodong Hou and Ying Li • Yong Luo, Matt Griffin, Tiago Calico, and Christy Lewis

  3. Preview • Overview of the Literature: • Reliability • Validity • Application of VAM to real data • Direction of VAM in the future

  4. Introduction RACE TO THE MIDDLE • The federal government is asking psychometricians to help make decisions • Race to the Top: evaluating teachers and schools • Earlier: No Child Left Behind (“Race to the Middle”) repealed the law of individual differences • The government wants a system that • Pressures educational administrations to do the right thing • Combats the teachers’ unions, which are perceived as obstacles • Seems to assume that teachers don’t want to teach effectively

  5. Introduction and History WHAT IS VAM? • Value-added modeling (VAM) is a system that we hope can determine the effectiveness of some mechanism • Usually teachers or schools • Most popular models include • Simple regression, Transitions between performance levels in adjacent grades, Mixed effects or multilevel regression models (Teacher or school as level 2 effect) • Models students’ performance over or under expectation, aggregated by their teacher or school (usually normative)
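The simple-regression variant described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model used in the MARCES study: the function names (`fit_line`, `teacher_value_added`) and the six student records are hypothetical. Each student's current score is predicted from the prior score, and a teacher's estimate is the average amount by which that teacher's students land over or under the prediction.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def teacher_value_added(records):
    """records: (teacher, prior_score, current_score) tuples.

    Returns {teacher: mean residual}, i.e. average performance over or
    under expectation for that teacher's students.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    intercept, slope = fit_line(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        residuals[teacher].append(c - (intercept + slope * p))
    return {t: sum(v) / len(v) for t, v in residuals.items()}


# Hypothetical data: two teachers, three students each.
records = [
    ("A", 400, 420), ("A", 450, 465), ("A", 500, 515),
    ("B", 400, 405), ("B", 450, 445), ("B", 500, 500),
]
effects = teacher_value_added(records)
print(effects)
```

Because the expectation is estimated from the same pool of students, the estimates are normative: with equal class sizes they sum to zero, so one teacher can only look effective relative to another.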

  6. Introduction VAM: CHALLENGES – CRITICISM • Nonrandom assignment of students to teachers • Past effects or nuisance variables not controlled by use of prior performance level • Bias reduced using multiple prior measures, but not eliminated • A teacher is advantaged when the class was unsuccessful last year • “Dynamic” interaction between students and teachers • Association between teacher effectiveness and student characteristics • Effects may have different influence on students of different ability • Testing is selective • Many teachers teach subjects that are not tested • Memphis, TN – VAM does not apply to 70% of teachers

  7. Reliability GENERALIZABILITY • Think of the reliability of VAM as a generalizability problem. Are inferences you draw from one situation true in another situation?

  8. Reliability STABILITY OVER A ONE-YEAR PERIOD • Mandeville (1988): • School effectiveness estimates had stability correlations in the 0.34 to 0.66 range • Large differences across grade level and subject matter • McCaffrey (2009): • Teacher effect estimates one year apart had correlations around 0.2 to 0.3 • Teaching itself may not be a stable phenomenon • Variability may be due to actual performance changes from year to year; instability may be intractable
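The stability coefficients quoted on this slide are correlations between effectiveness estimates for the same teachers or schools in two different years. A minimal sketch of that calculation follows; the `pearson` helper and the two lists of teacher estimates are hypothetical and chosen only to give a value in the low range that the slide reports.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical effect estimates for the same six teachers in two years.
year1 = [0.30, -0.10, 0.05, -0.25, 0.15, -0.15]
year2 = [0.05, 0.10, -0.20, -0.05, 0.20, -0.10]

stability = pearson(year1, year2)
print(round(stability, 2))
```

A coefficient of this size means that a teacher's estimate in one year says little about the next year's estimate, whether the cause is measurement noise or real change in teaching.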

  9. Reliability STABILITY OVER A SHORT PERIOD OF TIME • Sass (2008) and Newton, et al (2010): • Estimates of teacher effectiveness from test-retest assessments over a short time period • Correlations in the range of 0.6 • Results may indicate a real phenomenon, but a modest one

  10. Reliability STABILITY ACROSS GRADE AND SUBJECT • Mandeville & Anderson (1987) and others (Rockoff, 2004; Newton, et al, 2010): • Effectiveness fluctuates across grade and subject matter • Stability, though modest, found more often with math courses, less often with reading courses • Does success depend on what class you are assigned rather than your ability? To some extent it does. • Serious issues of fairness and comparability

  11. Reliability STABILITY AT THE SCHOOL LEVEL • Newton, et al (2010): • Students who are less advantaged, ESL, or on a lower track can have a negative impact on teacher effect estimates • Perception that entire school is good or bad is very popular, but generally untrue • Different grades and different subjects get different evaluations • Bottom line: • Rankings or groupings of schools or teachers (e.g., quintiles) are not highly stable.

  12. Reliability STABILITY ACROSS TEST FORMS • Sass (2008): • Top quintile and bottom quintile seem the most stable • Correlation of teacher effectiveness in those groups was 0.48 across comparable exams over a short time • Time extended to a year between tests: correlation dropped to 0.27 • Papay (2011): • Three different tests • Rank order correlations of teacher effectiveness across time ranged from 0.15 to 0.58 across different tests • Test timing and measurement error have effects

  13. Reliability STABILITY ACROSS STATISTICAL MODELS • Tekwe, et al (2004): • Compared four similar regression models • Unless such models involve different variables, results tend to be similar • Dawes (1979): • Linear composites seem to be pretty much the same regardless of how one gets the weights • Hill, et al (2011): • A big convergent validity problem

  14. Reliability SOURCES OF UNRELIABILITY • 30-60% of variation is due to sampling error • In part due to small numbers of students as the basis of effectiveness estimates • Regression to the mean • Class sizes vary within a school or district • Classroom measures based on fewer students tend toward the mean • Bayes estimates in multilevel modeling introduce bias that is a function of sample size • Other occupations: Lack of consistency of performance is typical of complex professions – baseball players, stock investors…
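The sample-size-dependent bias mentioned on this slide can be illustrated with the standard shrinkage formula for a random-intercept model. This is a sketch that assumes the variance components are known; the `shrink` function and every number below are hypothetical. A classroom's raw mean is pulled toward the grand mean by a factor equal to its reliability, between_var / (between_var + within_var / n), so smaller classes are pulled harder.

```python
def shrink(class_mean, n_students, grand_mean, between_var, within_var):
    """Shrunken (empirical Bayes style) estimate of a classroom mean.

    Returns (estimate, reliability). Reliability falls as class size
    falls, so small classes are pulled harder toward the grand mean.
    """
    reliability = between_var / (between_var + within_var / n_students)
    estimate = grand_mean + reliability * (class_mean - grand_mean)
    return estimate, reliability


# Two hypothetical classes with the same raw mean gain of 5 points.
small_class, rel_small = shrink(5.0, 10, 0.0, between_var=4.0, within_var=100.0)
large_class, rel_large = shrink(5.0, 30, 0.0, between_var=4.0, within_var=100.0)
print(small_class, large_class)
```

Both teachers produced the same observed gain, yet the teacher with 10 students receives a smaller estimate than the teacher with 30, which is the sense in which the bias is a function of sample size.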

  15. Validity JOB APPLICATIONS AS PREDICTIVE MEASURES • Years of experience, advanced degrees, certification, licensure, school quality, etc. have low relationship (if any) to teacher effectiveness • National Board little better than a coin flip (Sanders and Wright, 2008) • Knowledge of mathematics positively correlated with teaching mathematics effectively • VAM estimates provide better measures of teacher impact on student test scores than measures taken from a teacher’s job application • Having trouble isolating teaching factors that relate to VAM

  16. Validity TRIANGULATION OF MULTIPLE INDICATORS • Reliability is the easy thing to study – Validity is much harder • Goe, et al (2008): • Context for evaluation • To draw valid conclusions, teachers should be compared to other teachers who: • Teach similar courses • In same grade • In a similar context • Assessed by same or similar examination • Similar student characteristics

  17. Validity COMPARABILITY • Student ability is correlated with growth and status • Gifted students learn at a faster rate • Gifted students and their teachers have an advantage • Interaction between student ability and teachers’ opportunity to be effective

  18. Validity CAUSALITY, RESEARCH DESIGN, AND THEORY • Rubin (2004): • Missing data are not missing at random • Missing in a way that confounds results and complicates inferences • We do not have a clear idea what our hypothesis is • Multiple operational definitions of growth, but no developmental science for the phenomenon • No standardization for effectiveness

  19. Validity CAUSALITY, RESEARCH DESIGN, AND THEORY • Without carefully controlled experiments, we cannot isolate teacher effects • Students have multiple teachers and other influences • Effect of prior performance and experience • What do we even mean by teachers have a causal effect? • How do teachers and schools impart their supposed effect? • How is it internalized by the student? • Lord’s paradox • ANCOVA does not lead to unambiguous interpretations • We do not know what optimal teacher decision-making is

  20. Validity WHY SHOULD WE CARE? • Are teachers the most important factor determining student achievement? NO. • Nye, et al (2004): 11% of variation in student gains explained by teacher effects • Rockoff (2004): • Teacher effects: 5.0-6.4% • School effects: 2.7-6.1% • Student fixed effects: 59-68%

  21. Validity WHY SHOULD WE CARE? • Importance of classroom context • Kennedy (2010), etc.: • Situational factors influence teacher success • Time on task, materials, work assignments • Might add controlling behavioral issues; mainstreaming only students who are willing and able to be non-disruptive • Technical assistance with teaching (computers…) • New teacher’s goal: Maximize the context for learning

  22. Validity WHY SHOULD WE CARE? • New paradigm? – different orientation toward the teaching-learning process • Teacher optimizes the context of the learning environment • Adding to motivation • Preventing disruption • Providing opportunity for enhanced learning engagement • Use of assistive teaching devices (computers) will change teacher’s role • Develop a learning science • Current paradigm emphasizes immediate generality and immediate usage, with questionable validity • Instead, create laboratory for education science

  23. OUR STUDY COMPARING MODELS USING REAL DATA • The MARCES Center has studied 11 of the simplest models that might be applied • The full VAM report and the full text supporting this presentation can be accessed at • http://marces.org/Completed.htm

  24. OUR STUDY COMPARING MODELS USING REAL DATA • We obtained 3 years of data on the same students, linked to their teachers • Students divided into four cohorts: (N ≈ 5000 per cohort) • Math and reading data from yearly spring state assessment (2008-2010) • No vertical scale • Horizontally equated from year to year • VAM models chosen for comparison do not require vertical scaling • Nine models compare growth from first to second year • Two models compare growth from first and second to third year

  25. OUR STUDY MODELS • Quantile regression conditional on prior year(s) – Betebenner, using percentiles • Simplification using deciles of students • Simplification using conditional deciles of z-scores (effect size) – Thum • Least squares regression predicted by prior year(s) • Models using spline scores to create vertical scale – Schafer • Transition models
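The idea behind the conditional models listed here is to compare a student only with peers who started from a similar place. The sketch below is a crude grouping analogue under stated assumptions, not Betebenner's quantile-regression method and not the exact decile simplification used in the study: the function `conditional_percentiles`, its `n_groups` parameter, and the eight students are all hypothetical. Students are split into equal-sized groups by prior score, and each student's current score is ranked within that group.

```python
def conditional_percentiles(prior, current, n_groups=10):
    """Percentile rank of each current score among students who fall in
    the same prior-score group (groups are equal-sized by prior rank)."""
    n = len(prior)
    order = sorted(range(n), key=lambda i: prior[i])
    group = [0] * n
    for rank, i in enumerate(order):
        group[i] = rank * n_groups // n
    result = [0.0] * n
    for i in range(n):
        peers = [current[j] for j in range(n) if group[j] == group[i]]
        below = sum(1 for c in peers if c < current[i])
        ties = sum(1 for c in peers if c == current[i])
        # Mid-rank convention: a tie counts as half a student below.
        result[i] = 100.0 * (below + 0.5 * ties) / len(peers)
    return result


# Hypothetical scores: four low-prior and four high-prior students.
prior = [10, 11, 12, 13, 30, 31, 32, 33]
current = [15, 12, 18, 20, 31, 40, 35, 29]
percentiles = conditional_percentiles(prior, current, n_groups=2)
print(percentiles)
```

In this toy data the fourth student scores 20 and the last scores 29, yet the fourth receives the higher growth percentile because the comparison is made only against students with similar prior scores.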

  26. OUR STUDY MODELS • TRANSITION MODELS: Performance Levels • TRSG rewards students for maintaining previous status and for growth within and across performance levels • Reward increases with higher performance level status • TRUD values reflect growth as well as decreased performance, but not status • TRUG rewards students only for growth and does not punish for regressing
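The qualitative difference between the three transition models can be made concrete with a toy scoring scheme. The point values below are hypothetical and chosen only to mimic the descriptions on this slide; the actual value tables used in the study are not given here. Performance levels are coded as ordered integers (an assumption), and a teacher's score is the mean over that teacher's students.

```python
def trud(prev_level, curr_level):
    """Signed change in level: growth is positive, decline is negative,
    and the level itself (status) is ignored."""
    return curr_level - prev_level


def trug(prev_level, curr_level):
    """Growth only: a decline scores zero rather than a negative value."""
    return max(0, curr_level - prev_level)


def trsg(prev_level, curr_level):
    """Status plus growth (hypothetical rule): holding or raising a level
    earns points that rise with the level reached; a decline earns zero."""
    if curr_level < prev_level:
        return 0
    return curr_level + (curr_level - prev_level)


def class_mean(rule, transitions):
    """Average a scoring rule over (previous, current) level pairs."""
    return sum(rule(p, c) for p, c in transitions) / len(transitions)


# Hypothetical class: one student rises, one holds, one falls.
students = [(1, 2), (2, 2), (3, 2)]
for rule in (trud, trug, trsg):
    print(rule.__name__, class_mean(rule, students))
```

The same class scores zero under the signed rule, slightly positive under the growth-only rule, and highest under the status-plus-growth rule, which is why the choice of model is a policy decision about how to balance status against growth.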

  29. OUR STUDY INTER-CORRELATION OF STUDENT GROWTH SCORES FROM EACH MODEL AND THEIR DIMENSIONALITY • Factor analysis of the intercorrelated student growth scores from these models for years 1-2, replicated for years 2-3 • One dimension accounts for largest percentage of variance • Great deal of noise in results • Over 80% of variance not accounted for by the first dimension • Results of factor analysis essentially the same for each pair of years, for each cohort and for each content area
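The share of variance carried by the first dimension is the largest eigenvalue of the correlation matrix divided by the number of variables. The sketch below shows that arithmetic with a plain power iteration; the function `first_eigenvalue` and the correlation matrix are hypothetical (11 scores that all intercorrelate at 0.1), not the study's data, and a principal-component style calculation is assumed in place of whatever factor extraction the study used.

```python
def first_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a correlation matrix by power iteration.

    Correlation matrices are symmetric and positive semi-definite, so the
    dominant eigenvalue found here is also the largest one.
    """
    n = len(matrix)
    vec = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(p * p for p in product) ** 0.5
        vec = [p / norm for p in product]
        # Rayleigh quotient of the normalized vector.
        value = sum(
            vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
            for i in range(n)
        )
    return value


# Hypothetical correlation matrix: 11 growth scores, all pairs at r = 0.1.
n_models = 11
corr = [[1.0 if i == j else 0.1 for j in range(n_models)] for i in range(n_models)]

share = first_eigenvalue(corr) / n_models
print(round(share, 3))
```

With weak intercorrelations of this kind the first dimension is the largest single component yet still carries under one fifth of the total variance, which matches the pattern described on the slide.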

  30. OUR STUDY INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY • Example: Scree Plot for Math 2008-2009, Cohort 1

  31. OUR STUDY THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING

  32. OUR STUDY THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3)

  33. OUR STUDY TEACHER EFFECTIVENESS: RELIABILITY

  34. OUR STUDY SCHOOL EFFECTIVENESS: RELIABILITY

  35. OUR STUDY COMPARISON BETWEEN SCHOOL AND TEACHER EFFECTIVENESS • Levels of Effectiveness • 2008-2009 (Results are similar in 2009-2010)

  36. OUR STUDY METHODOLOGICAL ISSUES • Math Cohort 1 in Year 2008-2009

  37. OUR CONCLUSIONS The model you use can make a difference • Decide how to balance status against growth • No standardization for the modeling of VAM • Traditional qualitative approaches used by principals are not likely to be an improvement on VAM • Using either approach for high stakes testing and decision-making seems premature • Combining two procedures that are not highly valid will not necessarily result in a more valid system

  38. OUR CONCLUSIONS Interactions should be modeled • All students do NOT react the same way • Teachers are NOT the same over time • Many differences exist within a school • Context effects SHOULD BE STUDIED • Teacher’s role should be changed • Need to create a learning science • Context may add to the modest results for teachers and schools

  39. OUR CONCLUSIONS Change in instruction involving supportive technology • Paradigm shift in education may be closer than we think • Cognitive, computer, econometric, engineering, and neuro scientists are beginning to study education • Field can be expected to change as these researchers and their students become more involved • Teacher’s decision-making becoming more systematic • Radical changes for the better are expected

  40. OUR CONCLUSIONS VAM for high stakes • Right now, I do not encourage using VAM for high stakes applications • Might use VAM for initial screening, then follow-up • It makes a difference which VAM model we implement • Choose the model based on policy decisions that capture the goals, values and intent of the school system • Factors not in teacher’s control will have an effect

  41. OUR CONCLUSIONS Relate VAM to what teachers are doing • Create causal models and explore with experiments • Effective teaching requires good measurement, presents a great challenge, and is a worthy goal… • Interested in implementing a VAM? Read Finlay and Manavi (2008) and others first • Practical political issues of using VAM in schools involve unions, federal government, state government, special education advocates… and the list goes on and on …

  42. Questions? • Visit http://marces.org to find references, the full text of this talk, and a comparison of value-added models • There will be a MARCES conference on VAM (October 18 & 19) • Robert W. Lissitz University of Maryland Maryland Assessment Research Center for Education Success
