The Evaluation of Teacher and School Effectiveness Using Growth Models and Value Added Modeling: Hope Versus Reality. Robert W. Lissitz University of Maryland. http://marces.org/Completed.htm. Thank you. First, I want to thank… The creators of this symposium Burcu Kaniskan
The Evaluation of Teacher and School Effectiveness Using Growth Models and Value Added Modeling: Hope Versus Reality Robert W. Lissitz University of Maryland http://marces.org/Completed.htm Maryland Assessment Research Center for Education Success
Thank you • First, I want to thank… • The creators of this symposium • Burcu Kaniskan • The State of Maryland • MARCES: • Laura Reiner, Yuan Zhang, Xiaoshu Zhu, and Dr. Bill Schafer • Drs. Xiaodong Hou and Ying Li • Yong Luo, Matt Griffin, Tiago Calico, and Christy Lewis
Preview • History of VAM • Literature: • Reliability • Validity • Application of VAM • Direction of VAM in the future • Applied viewpoint • Psychometric viewpoint
Introduction and History RACE TO THE MIDDLE • The federal government is asking psychometricians to help make decisions • Race to the Top • Earlier: No Child Left Behind (“Race to the Middle”) • The government wants a system that will • Pressure educational administrations to do the right thing • Combat the teachers’ unions perceived as obstacles
Introduction and History WHAT IS VAM? • Value-added modeling (VAM) is a system that we hope can determine the effectiveness of some mechanism • Usually teachers or schools • Most popular models include • Simple regression • Recording transitions between performance levels in adjacent grades • Mixed effects or multilevel regression models • Teacher or school as level 2 effect
Introduction and History WHAT IS VAM? • Results for each student are usually aggregated • Provides summaries of every student for each teacher • Attempt to show whether students associated with a teacher are performing above or below statistically expected values, or values associated with other teachers • Usually normative in nature
Introduction and History MANDEVILLE – late 1980’s • Investigated school effectiveness and reliability of indicators • Findings: • Some schools are better than others • Differences in quality are inconsistent • Across years • Within schools across grade levels and subject areas
Introduction and History DALLAS – mid-1990s • 1994: School effects • 1995-1996: Teacher effects • Model with two stages: • Regression to control for “fairness variables” • Gender, ethnicity, English proficiency, SES, etc. • HLM to control for prior achievement, attendance, and school-level variables • High stakes decisions • Bonuses • Frequency of classroom observations
Introduction and History TVAAS – mid-1990s • Sanders et al. • “Layered” multiple regression model • Effects of teachers and past teachers • Multiple years of prior performance on several subject matter exams • Used to covary out the effect of undesirable student characteristics on growth • Complex interactions could not be statistically removed • Effects may have different influence on students of different ability levels • Probably not possible to eliminate statistically • Future might look at latent classes of students and teachers
Introduction and History CHALLENGES – CRITICISM • Nonrandom assignment of students to teachers • Effect not controlled by use of prior performance level • Bias reduced by using multiple prior measures • “Dynamic” interaction between students and teachers • Association between teacher effectiveness and student characteristics • VAM for high-stakes decisions not for all • Many teachers with subjects not tested • Memphis, TN – VAM does not apply to 70% of teachers
Reliability GENERALIZABILITY • Think of the reliability of VAM as a generalizability problem. • Is teacher effectiveness justified as a main effect, or are teachers actually effective in some circumstances and ineffective in others? • If interactions exist, the problem for the principal changes from “who is ineffective?” to “are there conditions in which this teacher can be effective?”
Reliability STABILITY OVER A ONE-YEAR PERIOD • Mandeville (1988): • School effectiveness estimates were stable in the 0.34 to 0.66 range of correlations • Large differences across grade level and subject matter • McCaffrey (2009): • Teacher effect estimates one year apart had correlations around 0.2 to 0.3 • Teaching itself may not be a stable phenomenon • Variability may be due to actual performance changes from year to year; instability may be intractable
Reliability STABILITY OVER A SHORT PERIOD OF TIME • Sass (2008) and Newton, et al (2010): • Estimates of teacher effectiveness from test-retest assessments over a short time period • Correlations in the range of 0.6 • For high stakes testing, we usually require reliability greater than 0.8 • Still may indicate a real phenomenon, but modest
Reliability STABILITY ACROSS GRADE AND SUBJECT • Mandeville & Anderson (1987) and others (Rockoff, 2004; Newton, et al, 2010): • Stability fluctuates across grade and subject matter • Limited stability found more often with math courses, less often with reading courses • Success depends on what class you are assigned rather than your ability? • Serious issues of fairness and comparability
Reliability STABILITY AT THE SCHOOL LEVEL • Perception that entire school is good or bad is very popular • St. Louis, early 1990s • Challenged advisory committee to find a school that remained at the top 3 years in a row • No system that reported back had even one • Fed Blue Ribbon Schools • “Winning school in one year was typically not at the top a year or two later” • Bottom line: • Rankings or groupings of schools (e.g., quintiles) are not stable.
Reliability STABILITY ACROSS TEST FORMS • Sass (2008): • Top quintile and bottom quintile seem the most stable • Correlation of teacher effectiveness in those groups was 0.48 across comparable exams over a short time • Time extended to a year between tests: correlation dropped to 0.27 • Papay (2011): • Three different tests • Rank order correlations of teacher effectiveness across time ranged from 0.15 to 0.58 across different tests • Test timing and measurement error have effects
Reliability STABILITY ACROSS STATISTICAL MODELS • Tekwe, et al (2004): • Compared four regression models • Unless models involve different variables, results tend to be similar • Dawes (1979): • Linear composites seem to be pretty much the same regardless of how one gets the weights • Hill, et al (2011): • Convergent validity problem
Reliability STABILITY ACROSS CLASSROOMS • Newton, et al (2010): • Students who are less advantaged, ESL, or on a lower track can have a negative impact on teacher effect estimates • Multiple VAM models were tested • Success of matching teacher characteristics to VAM outcomes was modest • VAM could be used as a criterion to judge other variables, but validity is questionable
Reliability SOURCES OF UNRELIABILITY • Persistent effects (teacher consistency), non-persistent effects (inconsistency), and non-persistence due to sampling error (unknown) • 30-60% of variation is due to sampling error • In part due to small numbers of students as the basis of effectiveness estimates • Regression to the mean • Class sizes vary within a school or district • Classrooms with fewer students tend toward the mean • Bayes estimates in multilevel modeling also introduce bias that is a function of sample size • Other occupations: Lack of consistency is typical of complex professions – baseball players, stock investors…
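The pull toward the mean for small classrooms can be illustrated with the textbook shrinkage formula used in multilevel models. This is a minimal sketch, not the estimator used in any study cited here; the names `tau2` (between-teacher variance), `sigma2` (within-classroom variance), and `grand_mean` are illustrative assumptions.

```python
def shrunken_estimate(class_mean, n_students, tau2, sigma2, grand_mean=0.0):
    """Textbook empirical-Bayes shrinkage of a classroom mean (illustrative).

    weight = tau2 / (tau2 + sigma2 / n_students)
    Fewer students -> smaller weight -> estimate pulled harder toward grand_mean.
    """
    weight = tau2 / (tau2 + sigma2 / n_students)
    return weight * class_mean + (1.0 - weight) * grand_mean


# Same observed classroom mean, different class sizes (hypothetical numbers).
small_class = shrunken_estimate(10.0, n_students=5, tau2=1.0, sigma2=5.0)
large_class = shrunken_estimate(10.0, n_students=20, tau2=1.0, sigma2=5.0)
```

With these made-up variances, the five-student classroom is shrunk halfway to the mean, while the twenty-student classroom keeps most of its observed value, which is the sample-size-dependent bias the slide refers to.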
Validity JOB APPLICATIONS AS PREDICTIVE MEASURES • Years of experience, advanced degrees, certification, licensure, school quality, etc. have low relationships (if any) to teacher effectiveness • Weak relationship between effectiveness and advanced degree • Knowledge of mathematics positively correlated with teaching mathematics effectively • VAM estimates provide better measures of teacher impact on student test scores than measures on teacher’s job application
Validity TRIANGULATION OF MULTIPLE INDICATORS • Goe, et al (2008): • Context for evaluation • Teachers should be compared to other teachers who: • Teach similar courses • In same grade • In a similar context • Assessed by same or similar examination • Probably necessary to establish validity
Validity COMPARABILITY • Ability is very likely correlated with growth and status • Do gifted students learn at the same rate as others? • Gifted students and their teachers have an advantage • Interaction between student ability and teachers’ ability to be effective • Mixture models are in development
Validity CAUSALITY, RESEARCH DESIGN, AND THEORY • Rubin (2004): • Missing data is not missing at random • Missing in a way that confounds results and complicates inferences • We do not have a clear idea what our hypothesis is • Multiple operational definitions of growth, but no developmental science for the phenomenon
Validity CAUSALITY, RESEARCH DESIGN, AND THEORY • Without carefully controlled experiments, we cannot isolate teacher effects • Students have multiple teachers • Influence of prior performance and experience • What do we even mean by causal effect? • How do teachers and schools impart their effect? • How is it internalized by the student? • Lord’s paradox • ANCOVA does not lead to unambiguous interpretations • Only experimental efforts will provide adequate results • Eminent faculty member: teacher decision-making - unclear what is optimal
Validity WHY SHOULD WE CARE? • Are teachers the most important factor determining student achievement? • Nye, et al (2004): 11% of variation in student gains explained by teacher effects • Rockoff (2004): Teacher effects 5.0-6.4% School effects 2.7-6.1% Student fixed effects 59-68%
Validity WHY SHOULD WE CARE? • Importance of classroom context • Kennedy (2010), etc.: • Situational factors influence teacher success • Time, materials, work assignments • Controlling behavioral issues; mainstreaming only students who are willing/capable to be non-disruptive • Technical assistance with teaching (computers..) • New teacher’s Goal: Maximize context for learning
Validity WHY SHOULD WE CARE? • New paradigm – different orientation toward the learning process • Teacher optimizes the context of the classroom • Adding to motivation • Preventing disruption • Providing opportunity for enhanced learning engagement • Use of assistive teaching devices (computers) will change teacher’s role • Develop a learning science • Current paradigm emphasizes external validity and immediate generality • Instead, create laboratory for education science
Validity WHY SHOULD WE CARE? • Fairness • Little evidence VAM is ready for high stakes use • But… Is it less fair than traditional personnel selection that focuses on advanced degrees and certificates, more credit hours, and working more years? Classroom observations?
OUR STUDY COMPARING MODELS USING REAL DATA • The MARCES Center has studied 11 of the simplest models that might be applied • The full VAM report and the full text supporting this presentation can be accessed at • http://marces.org/Completed.htm
OUR STUDY COMPARING MODELS USING REAL DATA • We obtained 3 years of data on the same students, linked to their teachers • Students divided into four cohorts: (N ≈ 5000 per cohort) • Math and reading data from yearly spring state assessment (2008-2010) • No vertical scale • Horizontally equated from year to year • VAM models chosen for comparison do not require vertical scaling • Nine models compare growth from first to second year • Two models compare growth from first and second to third year
TABLE 2: Data used in our study
OUR STUDY MODELS
OUR STUDY MODELS • BETEBENNER’S MODEL • Used in Colorado • Looks at conditional percentile of each student’s performance in the second year, compared to other students who started in same percentile the first year • Aggregates conditional percentiles of students exposed to each teacher • QRG1 uses prior year to condition the percentile the next year • ConD is a simplification: aggregates students into deciles one year and compares to deciles the second year • QRG2 uses 2 prior years to condition the percentile the 3rd year
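The conditional-percentile idea can be sketched in a few lines. This is a rough illustration in the spirit of the decile simplification, not the QRG1, QRG2, or ConD procedure from the report: students are grouped by year-one decile, each student's year-two score is ranked within that group, and the per-teacher mean is an assumed aggregation rule. All function names are hypothetical.

```python
from collections import defaultdict


def conditional_percentiles(year1, year2, n_groups=10):
    """Percentile (0-100) of each year-2 score among students who
    started in the same year-1 decile. Uses mid-ranks for ties."""
    n = len(year1)
    order = sorted(range(n), key=lambda i: year1[i])
    groups = defaultdict(list)
    for rank, i in enumerate(order):
        groups[min(n_groups - 1, rank * n_groups // n)].append(i)
    pct = [0.0] * n
    for members in groups.values():
        scores = [year2[i] for i in members]
        for i in members:
            below = sum(s < year2[i] for s in scores)
            ties = sum(s == year2[i] for s in scores)
            pct[i] = 100.0 * (below + 0.5 * ties) / len(scores)
    return pct


def teacher_means(values, teachers):
    """Average a per-student value over each teacher's students (assumed rule)."""
    by_teacher = defaultdict(list)
    for v, t in zip(values, teachers):
        by_teacher[t].append(v)
    return {t: sum(v) / len(v) for t, v in by_teacher.items()}


# Toy data: 20 students, two per decile; teacher "A" has the higher scorer in each pair.
year1 = list(range(20))
year2 = [i % 2 for i in range(20)]
teachers = ["A" if i % 2 else "B" for i in range(20)]
summary = teacher_means(conditional_percentiles(year1, year2), teachers)
```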
OUR STUDY MODELS • THUM’S MODEL • Similar to ConD, but looks at effect size • Uses z score to identify student’s performance level compared to the average student the first year • In second year, compares student’s z score to students who started at same z position (within a decile) in the prior year • Conditional z scores aggregated for each teacher to provide measure of effectiveness • Our simplification: z score conditional on prior deciles: • Rank order all students’ year one scale scores; divide into 10 deciles • Compute mean of year 2 scale scores for students within each decile • Compute deviation scores from the decile mean of year 2 scale scores for students within each decile • Compute pooled within-decile SD of year 2 scale scores • Compute growth z score for each student
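The five steps of the simplification translate fairly directly into code. The sketch below follows the listed steps, with one assumption the slide does not spell out: the pooled within-decile SD is computed as the square root of the summed squared deviations divided by (N minus the number of deciles). The function name is hypothetical.

```python
import math
from collections import defaultdict


def growth_z_scores(year1, year2, n_groups=10):
    """Growth z score per student, conditional on year-1 decile."""
    n = len(year1)
    # Step 1: rank order year-1 scores and split into deciles.
    order = sorted(range(n), key=lambda i: year1[i])
    group_of = [0] * n
    for rank, i in enumerate(order):
        group_of[i] = min(n_groups - 1, rank * n_groups // n)
    members = defaultdict(list)
    for i, g in enumerate(group_of):
        members[g].append(i)
    # Step 2: mean of year-2 scores within each decile.
    means = {g: sum(year2[i] for i in idx) / len(idx) for g, idx in members.items()}
    # Step 3: deviation of each year-2 score from its decile mean.
    dev = [year2[i] - means[group_of[i]] for i in range(n)]
    # Step 4: pooled within-decile SD (assumed denominator: N - number of deciles).
    pooled_sd = math.sqrt(sum(d * d for d in dev) / (n - len(members)))
    # Step 5: growth z score.
    return [d / pooled_sd for d in dev]


# Toy data: two students per decile, year-2 scores of 0 and 1 within each pair.
z = growth_z_scores(list(range(20)), [i % 2 for i in range(20)])
```

Averaging these z scores over each teacher's students would then give the teacher-level effectiveness measure described on the slide.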
OUR STUDY MODELS • ORDINARY LEAST SQUARES REGRESSION • Aggregates errors of prediction across teachers to see which teacher’s students tend to perform above or below prediction • OLS1 • Independent variable: first year scale score • Effectiveness measure: deviation from expected scale score for year two • OLS2 • Independent variables: first two years’ scale scores • Effectiveness measure: deviation from expected scale score for year three
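A minimal sketch of the OLS1 idea, assuming the teacher-level summary is the mean residual (the slide says only that prediction errors are aggregated across teachers). The function names and toy scores are hypothetical.

```python
from collections import defaultdict


def ols_fit(x, y):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def teacher_mean_residuals(year1, year2, teachers):
    """Mean (observed - predicted) year-2 score for each teacher's students."""
    intercept, slope = ols_fit(year1, year2)
    residuals = defaultdict(list)
    for x, y, t in zip(year1, year2, teachers):
        residuals[t].append(y - (intercept + slope * x))
    return {t: sum(r) / len(r) for t, r in residuals.items()}


# Toy data: teacher "A" students score above the fitted line, "B" below.
effects = teacher_mean_residuals([1, 2, 3, 4], [3, 3, 7, 7], ["A", "B", "A", "B"])
```

OLS2 would follow the same pattern with two predictors (the first two years' scale scores) and year three as the outcome.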
OUR STUDY MODELS • REGRESSION USING SPLINE SCORES • Calculated with scores that had been transformed by a spline function • Gives relational meaning to points along the performance continuum across grades • Builds a quasi-vertical scale without common items • Transformation matched to cut scores for 3 proficiency levels: basic, proficient, advanced • OLSS applies ordinary least squares to the spline scale scores and looks at deviations from predicted scores • DIFS subtracts the spline-transformed score at year 1 from the spline-transformed score at year 2, as though they were on a true vertical scale
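The slides do not give the spline function itself, so the following is only a guess at the general shape of the idea: a piecewise-linear transformation that maps each grade's cut scores onto shared anchor points, followed by the DIFS subtraction. Every number below (score ranges, cut scores, anchor values) is made up for illustration.

```python
def piecewise_linear(score, knots, targets):
    """Map a scale score through straight-line segments joining knots -> targets.
    Scores outside the knot range are clamped to the end targets."""
    if score <= knots[0]:
        return targets[0]
    if score >= knots[-1]:
        return targets[-1]
    for k0, k1, t0, t1 in zip(knots, knots[1:], targets, targets[1:]):
        if k0 <= score <= k1:
            return t0 + (t1 - t0) * (score - k0) / (k1 - k0)


# Hypothetical knots: [lowest score, proficient cut, advanced cut, highest score].
GRADE4_KNOTS = [240, 380, 440, 650]
GRADE5_KNOTS = [250, 390, 455, 650]
ANCHORS = [0, 100, 200, 300]  # assumed common scale shared by both grades


def difs(score_year1, score_year2):
    """DIFS-style growth: difference of transformed scores across years."""
    return (piecewise_linear(score_year2, GRADE5_KNOTS, ANCHORS)
            - piecewise_linear(score_year1, GRADE4_KNOTS, ANCHORS))


# A student exactly at the proficient cut in grade 4 who lands halfway
# between the proficient and advanced cuts in grade 5.
growth = difs(380, 422.5)
```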
OUR STUDY MODELS • TRANSITION MODELS • Used in Delaware and Arkansas • Classify students into categories in year one (basic, proficient, advanced) • Divide each category into three subcategories • Observe year two category conditional on year one performance • Matrix associated with transition from level at year one to level at year two • Values represent importance of each transition; determined by educators • TRUG rewards students only for growth • Does not punish for regressing • Does not distinguish much between amounts of growth • TRUD values reflect growth as well as decreased performance • Does not reward for status • TRSG rewards students for maintaining previous status and for growth within and across performance levels • Reward increases with higher performance level status
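The actual transition values were set by educators and are not shown in the slides. The sketch below therefore uses invented scoring rules that merely mimic the described behavior of TRUG, TRUD, and TRSG on a 9 x 9 matrix (three proficiency levels, each split into three subcategories, indexed 0 to 8). The per-teacher average is also an assumed aggregation.

```python
N_CELLS = 9  # 3 proficiency levels x 3 subcategories, indexed 0 (lowest) to 8


def build_matrix(rule):
    """matrix[i][j] = value of moving from year-1 cell i to year-2 cell j."""
    return [[rule(i, j) for j in range(N_CELLS)] for i in range(N_CELLS)]


# Hypothetical rules, chosen only to match the verbal descriptions.
TRUG = build_matrix(lambda i, j: 1 if j > i else 0)   # growth only, no penalty
TRUD = build_matrix(lambda i, j: j - i)               # growth and decline, no status credit
TRSG = build_matrix(                                  # status plus growth, higher levels worth more
    lambda i, j: (j // 3 + 1) + (j - i) if j >= i else 0
)


def teacher_score(matrix, transitions):
    """Average transition value over a teacher's students."""
    return sum(matrix[i][j] for i, j in transitions) / len(transitions)


# Three students: one moves up a cell, one holds steady, one drops two cells.
students = [(0, 1), (4, 4), (8, 6)]
scores = {name: teacher_score(m, students)
          for name, m in (("TRUG", TRUG), ("TRUD", TRUD), ("TRSG", TRSG))}
```

Even on this toy input the three rules rank the same classroom differently, which is one reason the models need not agree on teacher effectiveness.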
OUR STUDY INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY • Each student had growth calculation from year 1-2 and year 2-3 • Factor analysis of student growth from these models intercorrelated for year 1-2 and replicated for 2-3 • One dimension accounts for largest percentage of variance • Great deal of noise in results • Over 80% of variance undefined by first dimension • Results of factor analysis same for each pair of years, for each cohort and for each content area
OUR STUDY INTER-CORRELATION OF STUDENT GROWTH SCORES AND THEIR DIMENSIONALITY • Example: Scree Plot for Math 2008-2009, Cohort 1
OUR STUDY RELATION TO DEMOGRAPHIC VARIABLES AND PRE- AND POSTTEST SCORES • Growth in reading tends to be slightly more correlated with SES and race than growth in math • Correlations between TRSG and pre- and post-tests are strongest among all the models • Correlation between TRSG and pretest around 0.5 • Correlation between TRSG and posttest around 0.8 • Correlations otherwise… • Between pretest and regression-based models: low • Between pretest and transition-based models: medium • Between posttest and regression-based models: higher • Between posttest and transition-based models: lower
OUR STUDY THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING • Year 2008-2009
OUR STUDY THE CORRELATION BETWEEN GROWTH IN MATH AND GROWTH IN READING • Year 2009-2010
OUR STUDY THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3) • Math
OUR STUDY THE CORRELATION BETWEEN THE TWO GROWTH PERIODS (YEAR 1-2 AND YEAR 2-3) • Reading
OUR STUDY TEACHER EFFECTIVENESS AND TEACHER RELIABILITY • Square Root of Intra-Class Correlations for Year 2008-2009
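The slides do not say which form of the intraclass correlation was used. As one possible reading, the sketch below computes the one-way ANOVA ICC for equal-sized classrooms and takes its square root; the equal-size restriction and the function names are assumptions made to keep the example short.

```python
import math


def icc_oneway(groups):
    """One-way ANOVA intraclass correlation for equal-sized groups.

    groups: list of lists, one inner list of student growth scores per teacher.
    """
    k = len(groups[0])          # students per teacher (assumed equal)
    n_groups = len(groups)
    grand = sum(sum(g) for g in groups) / (k * n_groups)
    means = [sum(g) / k for g in groups]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n_groups - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def sqrt_icc(groups):
    """Square root of the ICC; negative estimates are floored at zero."""
    return math.sqrt(max(0.0, icc_oneway(groups)))


# Toy growth scores for three teachers with three students each.
classes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
icc = icc_oneway(classes)
root = sqrt_icc(classes)
```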
OUR STUDY TEACHER EFFECTIVENESS AND TEACHER RELIABILITY • Square Root of Intra-Class Correlations for Year 2009-2010
OUR STUDY TEACHER EFFECTIVENESS AND TEACHER RELIABILITY • Year to Year Reliability of Teacher Effectiveness • Between 2008-2009 and 2009-2010
OUR STUDY SCHOOL EFFECTIVENESS AND SCHOOL RELIABILITY • Sq. root of School Intra-Class Correlation for Year 2008-2009
OUR STUDY SCHOOL EFFECTIVENESS AND SCHOOL RELIABILITY • Sq. root of School Intra-Class Correlation for Year 2009-2010
OUR STUDY SCHOOL EFFECTIVENESS AND SCHOOL RELIABILITY • Year to Year Reliability of School Effectiveness • Between 2008-2009 and 2009-2010