Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias
Jose-Felipe Martinez-Fernandez
Ann M. Mastergeorge
UCLA Graduate School of Education & Information Studies
Center for the Study of Evaluation
National Center for Research on Evaluation, Standards, and Student Testing
American Educational Research Association, New Orleans, April 1-5, 2001
Introduction • Performance assessments are an increasingly popular method for evaluating academic performance. • A number of studies have shown that well-trained raters can score performance assessments reliably for the general population of students. • This study addressed whether trained raters introduce any bias when scoring performance assessments of students with disabilities.
Purpose • Compare the sources of score variability for students with and without disabilities in Language Arts and Mathematics performance assessments. • Determine whether important differences exist across student groups in terms of variance components and, if so, whether rater (teacher) bias plays a role. • Complement the results with raters’ perceptions of bias (their own and others’).
Method • Student and Rater samples come from a larger district-wide validation study involving thousands of performance assessments. • Teachers from each grade and content area were trained as Raters. • A total of 6 studies (each with different raters and students) were performed for 3rd, 7th, and 9th grade assessments in Language Arts and Mathematics.
Method (continued) • For each study, 60 assessments (30 from regular education students and 30 from students who received some kind of accommodation) were rated by 4 raters on two occasions. • Raters were aware of each student’s disability status only on the second rating occasion. Bias is defined as systematic differences in scores across occasions. • No practice or memory effects were expected. • The score scale ranges from 1 to 4.
Method (continued) • Two kinds of generalizability designs were used. First, a “nested-within-disability” design with all 60 students [P(D) x R x O]. • Second, separate fully crossed [P x R x O] designs for each disability group of 30 students. • Math assessments consisted of two tasks. Both a random [P x R x O x T] design and a fixed [P x R x O] design averaging over tasks were used (see the score decomposition sketch below). • A survey asked about raters’ perceptions of bias in rating students with disabilities (their own and other raters’).
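For reference, a minimal sketch of the score decomposition assumed by the fully crossed [P x R x O] design, written in standard generalizability-theory notation (added here for clarity; not part of the original slides):

X_{pro} = \mu + \nu_p + \nu_r + \nu_o + \nu_{pr} + \nu_{po} + \nu_{ro} + \nu_{pro,e}

\sigma^2(X_{pro}) = \sigma^2_p + \sigma^2_r + \sigma^2_o + \sigma^2_{pr} + \sigma^2_{po} + \sigma^2_{ro} + \sigma^2_{pro,e}

Here p indexes persons (students), r raters, and o rating occasions; the triple interaction is confounded with residual error. The nested design replaces the person term with persons nested within disability groups, and the Mathematics design adds a task facet t with its corresponding interaction components.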
Generalizability Results. Nested Design: Language Arts [Score = Rater x Occasion x Person (Disability)]
Generalizability Results (continued). Nested Design: Mathematics [Score = Task x Rater x Occasion x Person (Disability)]
Generalizability Results (continued). Crossed Design by Disability: Language Arts [Score = Rater x Occasion x Person]
Generalizability Results (continued). Crossed Design by Disability: Mathematics [Score = Task x Rater x Occasion x Person]
Generalizability Results (continued). Crossed Design by Disability: Mathematics with Task facet fixed [Score = Person x Rater x Occasion, averaging over the two tasks]
Rater Survey (continued). Mean Scores Given by Raters to Self and Others Regarding Fairness and Bias in Scoring
Discussion Variance Components: • The Person (P) component is always the largest (50% to 70% of variance across designs). However, a substantial amount of measurement error remains (triple interaction, ignored facets). • Some differences exist between the regular education and disability groups in terms of variance components.
Discussion (continued) Differences between groups: • The total amount of variance is always smaller in the disability groups (more skewed score distribution). • Person (P) variance, and therefore the dependability coefficients, is lower for the disability group in Language Arts. This also holds in Mathematics when the task facet is fixed and averaged over, but not with two random tasks (see the dependability formula below).
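To see why a smaller person variance lowers dependability, recall the standard G-theory dependability coefficient for the crossed [P x R x O] design with n_r raters and n_o occasions (a standard formula added here for clarity; not from the original slides):

\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}, \qquad \sigma^2_\Delta = \frac{\sigma^2_r}{n_r} + \frac{\sigma^2_o}{n_o} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{po}}{n_o} + \frac{\sigma^2_{ro}}{n_r n_o} + \frac{\sigma^2_{pro,e}}{n_r n_o}

If \sigma^2_p shrinks while the error components stay roughly constant, \Phi drops, which is the pattern observed for the disability group in Language Arts.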
Discussion (continued) Rater Bias: • No Rater (R) main effects. No leniency differences across raters. • No “rating occasion” (O) effect. Overall there is no bias introduced by rater knowledge of disability status. • No rater interactions with tasks or occasions.
Discussion (continued) • However, there is a non-negligible Person by Rater (PxR) interaction, which is considerably larger for students with disabilities. • This does not necessarily constitute bias, but it can still compromise the validity of scores for accommodated students. • Are features in papers from students with disabilities differentially salient to different raters?
Discussion (continued) • There is a large Person by Task (PxT) interaction in Mathematics, but it is considerably smaller for students with disabilities: • Students with disabilities may be less sensitive to the differing nature of the tasks, so that this otherwise expected interaction (Miller & Linn, 2000, and others) does not fully appear. • Accommodations may not be having the intended leveling effects. • With a random task facet, the smaller PxT interaction “increases reliability” for students with disabilities (see the error-term sketch below).
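The reason the smaller PxT interaction inflates reliability under the random-task design can be read off the relative error term of the generalizability coefficient for the crossed [P x R x O x T] design (standard formulas added for clarity; not from the original slides):

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}, \qquad \sigma^2_\delta = \frac{\sigma^2_{pt}}{n_t} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{po}}{n_o} + \frac{\sigma^2_{ptr}}{n_t n_r} + \frac{\sigma^2_{pto}}{n_t n_o} + \frac{\sigma^2_{pro}}{n_r n_o} + \frac{\sigma^2_{ptro,e}}{n_t n_r n_o}

With only n_t = 2 tasks, a smaller \sigma^2_{pt} reduces \sigma^2_\delta directly, so the coefficient appears higher for the disability group even though its person variance \sigma^2_p is not larger.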
Discussion (continued) From Rater Survey: • Teachers believe that there is some bias and unfairness from raters when scoring performance assessments from students with disabilities. • Raters see themselves as more fair and unbiased than the general population of raters. • Whether this is due to training or to initially high self-perceptions is not clear; a common “I am fair, but others are less so” self-enhancement effect could account for it entirely.
Future Directions and Questions • Are there different patterns for different kinds of disabilities/accommodations? • Are accommodations being used appropriately and having the intended effects? • Do the patterns hold for raters at local school sites, who generally receive less training? • Does rater background influence the size and nature of these effects and interactions? • How does the testing occasion facet influence variance components and other interactions?