The Evaluation of Teacher and School Effectiveness Using Growth Models and Value Added Modeling: Hope Versus Reality Robert W. Lissitz University of Maryland http://marces.org/Completed.htm Maryland Assessment Research Center for Education Success
Thank you • First, I want to thank… • The creators of this symposium • Burcu Kaniskan • The State of Maryland • MARCES: • Laura Reiner, Yuan Zhang, Xiaoshu Zhu, and Dr. Bill Schafer • Drs. Xiaodong Hou and Ying Li • Yong Luo, Matt Griffin, Tiago Calico, and Christy Lewis
Preview • Overview of the Literature: • Reliability • Validity • Application of VAM to real data • Direction of VAM in the future
Introduction RACE TO THE MIDDLE • The federal government is asking psychometricians to help make decisions • Race to the Top evaluating teachers and schools • Earlier: No Child Left Behind (“Race to the Middle”) repealed the law of individual differences • The government wants a system that • Pressures educational administrations to do the right thing • Combats the teachers’ unions perceived as obstacles • Seems to assume that teachers don’t want to teach effectively
Introduction and History WHAT IS VAM? • Value-added modeling (VAM) is a system that we hope can determine the effectiveness of some mechanism • Usually teachers or schools • Most popular models include • Simple regression, Transitions between performance levels in adjacent grades, Mixed effects or multilevel regression models (Teacher or school as level 2 effect) • Models students’ performance over or under expectation, aggregated by their teacher or school (usually normative)
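A minimal sketch of the simple-regression variant listed above, assuming a single prior-year score per student: fit a least-squares line predicting the current score from the prior score, then average each teacher's student residuals (performance over or under expectation). The function names and the toy data are hypothetical and are not taken from the talk or the MARCES report.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def teacher_value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns {teacher: mean residual}, i.e. the average amount by which
    a teacher's students beat or miss the score expected from their
    prior-year result.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    a, b = fit_line(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        residuals[teacher].append(c - (a + b * p))
    return {t: sum(v) / len(v) for t, v in residuals.items()}


# Hypothetical toy data: (teacher, prior-year score, current-year score).
toy = [
    ("A", 400, 420), ("A", 450, 470), ("A", 500, 515),
    ("B", 400, 405), ("B", 450, 455), ("B", 500, 500),
]
effects = teacher_value_added(toy)
```

Because the expectation is estimated from the same students, the estimates are normative: one teacher can only look effective relative to the others in the data set.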
Introduction VAM: CHALLENGES – CRITICISM • Nonrandom assignment of students to teachers • Past effects or nuisance variables not controlled by use of prior performance level • Bias reduced using multiple prior measures, but not eliminated • A teacher is advantaged when the class was unsuccessful the previous year • “Dynamic” interaction between students and teachers • Association between teacher effectiveness and student characteristics • Effects may have different influence on students of different ability • Testing is selective • Many teachers teach subjects that are not tested • Memphis, TN – VAM does not apply to 70% of teachers
Reliability GENERALIZABILITY • Think of the reliability of VAM as a generalizability problem. Are inferences you draw from one situation true in another situation?
Reliability STABILITY OVER A ONE-YEAR PERIOD • Mandeville (1988): • School effectiveness estimates were stable in the 0.34 to 0.66 range of correlations • Large differences across grade level and subject matter • McCaffrey (2009): • Teacher effect estimates one year apart had correlations around 0.2 to 0.3 • Teaching itself may not be a stable phenomenon • Variability may be due to actual performance changes from year to year; instability may be intractable
Reliability STABILITY OVER A SHORT PERIOD OF TIME • Sass (2008) and Newton, et al (2010): • Estimates of teacher effectiveness from test-retest assessments over a short time period • Correlations in the range of 0.6 • Results may indicate a real phenomenon, but modest
Reliability STABILITY ACROSS GRADE AND SUBJECT • Mandeville & Anderson (1987) and others (Rockoff, 2004; Newton, et al, 2010): • Effectiveness fluctuates across grade and subject matter • Stability, though modest, found more often with math courses, less often with reading courses • Does success depend on what class you are assigned rather than your ability? To some extent it does. • Serious issues of fairness and comparability
Reliability STABILITY AT THE SCHOOL LEVEL • Newton, et al (2010): • Students who are less advantaged, ESL, or on a lower track can have a negative impact on teacher effect estimates • Perception that entire school is good or bad is very popular, but generally untrue • Different grades and different subjects get different evaluations • Bottom line: • Rankings or groupings of schools or teachers (e.g., quintiles) are not highly stable.
Reliability STABILITY ACROSS TEST FORMS • Sass (2008): • Top quintile and bottom quintile seem the most stable • Correlation of teacher effectiveness in those groups was 0.48 across comparable exams over a short time • Time extended to a year between tests: correlation dropped to 0.27 • Papay (2011): • Three different tests • Rank order correlations of teacher effectiveness across time ranged from 0.15 to 0.58 across different tests • Test timing and measurement error have effects
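The rank-order correlations quoted above can be illustrated with a short sketch. It assumes each teacher has one effectiveness estimate per test and computes Spearman's coefficient as the Pearson correlation of the ranks. The estimates below are invented for illustration and are not data from Sass (2008) or Papay (2011).

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical effectiveness estimates for five teachers on two tests.
test_a = [0.3, -0.1, 0.5, 0.0, -0.4]
test_b = [0.1, 0.2, 0.4, -0.3, -0.2]
rho = spearman(test_a, test_b)
```

A coefficient near 1 would mean the two tests order teachers almost identically; values in the reported 0.15 to 0.58 range mean many teachers change rank depending on which test is used.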
Reliability STABILITY ACROSS STATISTICAL MODELS • Tekwe, et al (2004): • Compared four similar regression models • Unless such models involve different variables, results tend to be similar • Dawes (1979): • Linear composites seem to be pretty much the same regardless of how one gets the weights • Hill, et al (2011): • A big convergent validity problem
Reliability SOURCES OF UNRELIABILITY • 30-60% of variation is due to sampling error • In part due to small numbers of students as the basis of effectiveness estimates • Regression to the mean • Class sizes vary within a school or district • Classroom measures based on fewer students tend toward the mean • Bayes estimates in multilevel modeling introduce bias that is a function of sample size • Other occupations: Lack of consistency of performance is typical of complex professions – baseball players, stock investors…
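The point about Bayes estimates and sample size can be made concrete with a small sketch. It assumes the textbook shrinkage form, in which a class mean is pulled toward the grand mean by a reliability weight tau2 / (tau2 + sigma2 / n), and it treats both variance components as known. The function name and all numbers are hypothetical.

```python
def shrink(raw_mean, n_students, tau2, sigma2, grand_mean=0.0):
    """Shrink a class mean toward the grand mean.

    tau2:   assumed between-teacher variance of true effects
    sigma2: assumed within-class (student-level) variance
    """
    reliability = tau2 / (tau2 + sigma2 / n_students)
    return grand_mean + reliability * (raw_mean - grand_mean)


# Same raw class mean, different class sizes (hypothetical values).
small_class = shrink(raw_mean=10.0, n_students=5, tau2=4.0, sigma2=100.0)
large_class = shrink(raw_mean=10.0, n_students=30, tau2=4.0, sigma2=100.0)
```

With these inputs the five-student class keeps about one sixth of its raw advantage and the thirty-student class keeps a little over half, so two teachers with identical raw results receive different estimates purely because of class size.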
Validity JOB APPLICATIONS AS PREDICTIVE MEASURES • Years of experience, advanced degrees, certification, licensure, school quality, etc. have low relationship (if any) to teacher effectiveness • National Board little better than a coin flip (Sanders and Wright, 2008) • Knowledge of mathematics positively correlated with teaching mathematics effectively • VAM estimates provide better measures of teacher impact on student test scores than measures on teacher’s job application • Having trouble isolating teaching factors that relate to VAM
Validity TRIANGULATION OF MULTIPLE INDICATORS • Reliability is the easy thing to study – Validity is much harder • Goe, et al (2008): • Context for evaluation • To draw valid conclusions, teachers should be compared to other teachers who: • Teach similar courses • In same grade • In a similar context • Assessed by same or similar examination • Similar student characteristics
Validity COMPARABILITY • Student ability is correlated with growth and status • Gifted students learn at a faster rate • Gifted students and their teachers have an advantage • Interaction between student ability and teachers’ opportunity to be effective
Validity CAUSALITY, RESEARCH DESIGN, AND THEORY • Rubin (2004): • Missing data are not missing at random • Missing in a way that confounds results and complicates inferences • We do not have a clear idea what our hypothesis is • Multiple operational definitions of growth, but no developmental science for the phenomenon • No standardization for effectiveness
Validity CAUSALITY, RESEARCH DESIGN, AND THEORY • Without carefully controlled experiments, we cannot isolate teacher effects • Students have multiple teachers and other influences • Effect of prior performance and experience • What do we even mean by teachers have a causal effect? • How do teachers and schools impart their supposed effect? • How is it internalized by the student? • Lord’s paradox • ANCOVA does not lead to unambiguous interpretations • We do not know what optimal teacher decision-making is
Validity WHY SHOULD WE CARE? • Are teachers the most important factor determining student achievement? NO. • Nye, et al (2004): 11% of variation in student gains explained by teacher effects • Rockoff (2004): Teacher effects 5.0-6.4% School effects 2.7-6.1% Student fixed effects 59-68%
Validity WHY SHOULD WE CARE? • Importance of classroom context • Kennedy (2010), etc.: • Situational factors influence teacher success • Time on task, materials, work assignments • Might add controlling behavioral issues; mainstreaming only students who are willing and able to be non-disruptive • Technical assistance with teaching (computers, etc.) • New teacher’s goal: Maximize the context for learning
Validity WHY SHOULD WE CARE? • New paradigm? – different orientation toward the teaching - learning process • Teacher optimizes the context of the learning environment • Adding to motivation • Preventing disruption • Providing opportunity for enhanced learning engagement • Use of assistive teaching devices (computers) will change teacher’s role • Develop a learning science • Current paradigm emphasizes immediate generality and immediate usage, with questionable validity • Instead, create laboratory for education science
OUR STUDY COMPARING MODELS USING REAL DATA • The MARCES Center has studied 11 of the simplest models that might be applied • The full VAM report and the full text supporting this presentation can be accessed at • http://marces.org/Completed.htm
OUR STUDY COMPARING MODELS USING REAL DATA • We obtained 3 years of data on the same students, linked to their teachers • Students divided into four cohorts: (N ≈ 5000 per cohort) • Math and reading data from yearly spring state assessment (2008-2010) • No vertical scale • Horizontally equated from year to year • VAM models chosen for comparison do not require vertical scaling • Nine models compare growth from first to second year • Two models compare growth from first and second to third year
OUR STUDY MODELS • Quantile regression conditional on prior year(s) – Betebenner, using percentiles • Simplification using deciles of students • Simplification using conditional deciles of z-scores (effect size) – Thum • Least squares regression predicted by prior year(s) • Models using spline scores to create vertical scale – Schafer • Transition models
OUR STUDY MODELS • TRANSITION MODELS: Performance Levels • TRSG rewards students for maintaining previous status and for growth within and across performance levels • Reward increases with higher performance level status • TRUD values reflect growth as well as decreased performance, but not status • TRUG rewards students only for growth and does not punish for regressing
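The three transition models differ only in how a move between performance levels is valued. The sketch below illustrates that contrast under loud assumptions: levels are coded as integers, and the point values are placeholders chosen to match the verbal descriptions on the slide. They are not the value tables used in the MARCES report.

```python
def trud(prev_level, curr_level):
    """Growth and decline both count; status itself does not."""
    return curr_level - prev_level


def trug(prev_level, curr_level):
    """Only growth is rewarded; regressing scores zero, not negative."""
    return max(0, curr_level - prev_level)


def trsg(prev_level, curr_level):
    """Status plus growth: a higher current level earns more, and
    moving up adds to it. This weighting is a placeholder."""
    return curr_level + max(0, curr_level - prev_level)


def teacher_mean(transitions, value_fn):
    """Average transition value over a teacher's (prev, curr) pairs."""
    return sum(value_fn(p, c) for p, c in transitions) / len(transitions)


# Hypothetical class: levels coded 0 (lowest) to 2 (highest).
classroom = [(0, 1), (1, 1), (2, 1), (2, 2)]
scores = {fn.__name__: teacher_mean(classroom, fn) for fn in (trud, trug, trsg)}
```

The same four students give this hypothetical teacher a neutral score under TRUD, a small positive score under TRUG, and the largest score under TRSG, which is one way to see why the choice between status and growth is a policy decision.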
OUR STUDY INTER-CORRELATION OF STUDENT GROWTH SCORES FROM EACH MODEL AND THEIR DIMENSIONALITY • Student growth scores from these models were intercorrelated and factor analyzed for years 1-2, and the analysis was replicated for years 2-3 • One dimension accounts for largest percentage of variance • Great deal of noise in results • Over 80% of variance undefined by first dimension • Results of factor analysis essentially the same for each pair of years, for each cohort and for each content area
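One rough way to read "variance accounted for by the first dimension" is the largest eigenvalue of the correlation matrix divided by the number of variables. The sketch below uses that principal-component shortcut, which is a swapped-in simplification and not necessarily the factor-analysis procedure used in the study. The correlation matrix is hypothetical.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Power iteration for the dominant eigenvalue of a symmetric matrix."""
    n = len(matrix)
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in product) ** 0.5
        vec = [v / norm for v in product]
        # Rayleigh quotient of the normalized vector.
        eigenvalue = sum(
            vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)
        )
    return eigenvalue


def first_dimension_share(corr):
    """Share of total variance on the first dimension.

    The trace of a correlation matrix equals the number of variables,
    so the share is the largest eigenvalue divided by len(corr).
    """
    return largest_eigenvalue(corr) / len(corr)


# Hypothetical correlations among growth scores from three models.
corr = [
    [1.0, 0.2, 0.2],
    [0.2, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
share = first_dimension_share(corr)
```

Plotting all eigenvalues in descending order gives the kind of scree plot used to judge how many dimensions are worth keeping.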
OUR STUDY INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY • Example: Scree Plot for Math 2008-2009, Cohort 1
OUR STUDY THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING
OUR STUDY THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3)
OUR STUDY TEACHER EFFECTIVENESS: RELIABILITY
OUR STUDY SCHOOL EFFECTIVENESS: RELIABILITY
OUR STUDY COMPARISON BETWEEN SCHOOL AND TEACHER EFFECTIVENESS • Levels of Effectiveness • 2008-2009 (Results are similar in 2009-2010)
OUR STUDY METHODOLOGICAL ISSUES • Math Cohort 1 in Year 2008-2009
OUR CONCLUSIONS THE MODEL YOU USE CAN MAKE A DIFFERENCE • Decide how to balance status against growth • No standardization for the modeling of VAM • Traditional qualitative approaches used by principals are not likely to be an improvement on VAM • Using either approach for high stakes testing and decision-making seems premature • Combining two procedures that are not highly valid will not necessarily result in a more valid system
OUR CONCLUSIONS INTERACTIONS SHOULD BE MODELED • All students do NOT react the same way • Teachers are NOT the same over time • Many differences exist within a school CONTEXT EFFECTS SHOULD BE STUDIED • Teacher’s role should be changed • Need to create a learning science • Context may add to the modest results for teachers and schools
OUR CONCLUSIONS CHANGE IN INSTRUCTION INVOLVING SUPPORTIVE TECHNOLOGY • Paradigm shift in education may be closer than we think • Cognitive scientists, computer scientists, econometricians, engineers, and neuroscientists are beginning to study education • Field can be expected to change as these researchers and their students become more involved • Teacher’s decision-making becoming more systematic • Radical changes for the better are expected
OUR CONCLUSIONS VAM FOR HIGH STAKES • Right now, I do not encourage using VAM for high stakes applications • Might use VAM for initial screening, then follow-up • It makes a difference which VAM model we implement • Choose the model based on policy decisions that capture the goals, values and intent of the school system • Factors not in teacher’s control will have an effect
OUR CONCLUSIONS RELATE VAM TO WHAT TEACHERS ARE DOING • Create causal models and explore with experiments • Effective teaching requires good measurement; it presents a great challenge and is a worthy goal… INTERESTED IN IMPLEMENTING A VAM? • Read Finlay and Manavi (2008) and others first • Practical political issues of using VAM in schools involve unions, federal government, state government, special education advocates… and the list goes on and on …
Questions? • Visit http://marces.org to find references, the full text of this talk, and a comparison of value-added models • There will be a MARCES conference on VAM (October 18 & 19) • Robert W. Lissitz University of Maryland Maryland Assessment Research Center for Education Success