
Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias


Presentation Transcript


  1. Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias. Jose-Felipe Martinez-Fernandez, Ann M. Mastergeorge. UCLA Graduate School of Education & Information Studies, Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing. American Educational Research Association, New Orleans, April 1-5, 2001

  2. Introduction • Performance assessments are an increasingly popular method for evaluating academic performance. • A number of studies have shown that well-trained raters can score performance assessments reliably for the general population of students. • This study addressed whether trained raters show bias when scoring performance assessments of students with disabilities.

  3. Purpose • Compare the sources of score variability for students with and without disabilities in Language Arts and Mathematics performance assessments. • Determine whether important differences exist across student groups in terms of variance components and, if so, whether rater (teacher) bias plays a role. • Complement the results with raters’ perceptions of bias (their own and others’).

  4. Method • The student and rater samples come from a larger district-wide validation study involving thousands of performance assessments. • Teachers from each grade and content area were trained as raters. • A total of six studies (each with different raters and students) were conducted for the 3rd, 7th, and 9th grade assessments in Language Arts and Mathematics.

  5. Method (continued) • For each study, 60 assessments (30 from regular education students and 30 from students who received some kind of accommodation) were rated by 4 raters on two occasions. • Raters were aware of each student’s disability status only on the second rating occasion. Bias is defined as systematic differences in the scores across occasions. • No practice or memory effects were expected. • The score scale ranges from 1 to 4.

  6. Method (continued) • Two kinds of generalizability designs were used. First, a “nested-within-disability” design with all 60 students [P(D) x R x O]. • Second, separate fully crossed [P x R x O] designs for each disability group of 30 students (a numerical sketch of the crossed-design estimation follows this slide). • Math assessments consisted of two tasks; both a random [P x R x O x T] design and a fixed [P x R x O] design averaging over tasks were used. • A survey asked about raters’ perceptions of bias in rating students with disabilities (their own and other raters’).
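The slides do not show the authors' estimation procedure, so the following is only a minimal sketch of how variance components for one fully crossed [P x R x O] group can be obtained with the classical ANOVA expected-mean-squares method. The data are simulated (not from the study), and the facet sizes simply mirror the design described above: 30 students, 4 raters, 2 occasions, scores 1 to 4.

# Minimal sketch: ANOVA-based variance component estimates for a balanced,
# fully crossed person x rater x occasion G-study. Simulated data only.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_r, n_o = 30, 4, 2                       # persons, raters, occasions
scores = rng.integers(1, 5, size=(n_p, n_r, n_o)).astype(float)  # 1-4 scale

grand = scores.mean()
m_p  = scores.mean(axis=(1, 2))                # person means
m_r  = scores.mean(axis=(0, 2))                # rater means
m_o  = scores.mean(axis=(0, 1))                # occasion means
m_pr = scores.mean(axis=2)                     # person x rater means
m_po = scores.mean(axis=1)                     # person x occasion means
m_ro = scores.mean(axis=0)                     # rater x occasion means

# Mean squares for each effect in the balanced crossed design.
ms = {}
ms["p"]  = n_r * n_o * np.sum((m_p - grand) ** 2) / (n_p - 1)
ms["r"]  = n_p * n_o * np.sum((m_r - grand) ** 2) / (n_r - 1)
ms["o"]  = n_p * n_r * np.sum((m_o - grand) ** 2) / (n_o - 1)
ms["pr"] = n_o * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
ms["po"] = n_r * np.sum((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2) / ((n_p - 1) * (n_o - 1))
ms["ro"] = n_p * np.sum((m_ro - m_r[:, None] - m_o[None, :] + grand) ** 2) / ((n_r - 1) * (n_o - 1))
resid = (scores
         - m_pr[:, :, None] - m_po[:, None, :] - m_ro[None, :, :]
         + m_p[:, None, None] + m_r[None, :, None] + m_o[None, None, :]
         - grand)
ms["pro"] = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1) * (n_o - 1))

# Solve the expected-mean-square equations (negative estimates set to zero).
var = {}
var["pro"] = ms["pro"]                                  # PxRxO interaction + error
var["pr"]  = max((ms["pr"] - ms["pro"]) / n_o, 0)
var["po"]  = max((ms["po"] - ms["pro"]) / n_r, 0)
var["ro"]  = max((ms["ro"] - ms["pro"]) / n_p, 0)
var["p"]   = max((ms["p"] - ms["pr"] - ms["po"] + ms["pro"]) / (n_r * n_o), 0)
var["r"]   = max((ms["r"] - ms["pr"] - ms["ro"] + ms["pro"]) / (n_p * n_o), 0)
var["o"]   = max((ms["o"] - ms["po"] - ms["ro"] + ms["pro"]) / (n_p * n_r), 0)

total = sum(var.values())
for effect, v in var.items():
    print(f"{effect:>4}: {v:6.3f}  ({100 * v / total:4.1f}% of total)")

With real ratings in place of the simulated array, the printed percentages correspond to the variance-component breakdowns reported on the results slides.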

  7. Score Distributions

  8. Generalizability Results. Nested Design: Language Arts [Score = Rater x Occasion x Person (Disability)]

  9. Generalizability Results (continued). Nested Design: Mathematics [Score = Task x Rater x Occasion x Person (Disability)]

  10. Generalizability Results (continued). Crossed Design by Disability: Language Arts [Score = Rater x Occasion x Person]

  11. Generalizability Results (continued). Crossed Design by Disability: Mathematics [Score = Task x Rater x Occasion x Person]

  12. Generalizability Results (continued). Crossed Design by Disability: Mathematics with Task facet fixed [Score = Person x Rater x Occasion, averaging over the two tasks]

  13. Rater Survey. Rater Perceptions (** p < .01, N = 40)

  14. Rater Survey (continued). Mean Score of Raters on Self and Others Regarding Fairness and Bias in Scoring

  15. Discussion. Variance Components: • The person (P) component is always the largest (50% to 70% of variance across designs). However, a substantial amount of measurement error remains (triple interaction, ignored facets). • Some differences exist between the regular education and disability groups in terms of variance components.

  16. Discussion (continued). Differences between groups: • The total amount of variance is always smaller in the disability groups (more skewed score distributions). • Variance due to persons (P), and therefore the dependability coefficients, are lower for the disability group in Language Arts. This also holds in Mathematics with a fixed, averaged task facet, but not with two random tasks (the dependability coefficient is written out below).
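For reference, the dependability (absolute-error) coefficient for a crossed [P x R x O] design takes the standard generalizability-theory form below; the notation is assumed here rather than taken from the slides. With similar error components, a smaller person variance directly yields a lower coefficient.

\Phi = \frac{\sigma^2_{P}}{\sigma^2_{P} + \sigma^2_{\Delta}},
\qquad
\sigma^2_{\Delta} = \frac{\sigma^2_{R}}{n_R} + \frac{\sigma^2_{O}}{n_O}
 + \frac{\sigma^2_{PR}}{n_R} + \frac{\sigma^2_{PO}}{n_O}
 + \frac{\sigma^2_{RO}}{n_R n_O} + \frac{\sigma^2_{PRO,e}}{n_R n_O}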

  17. Discussion (continued). Rater Bias: • No rater (R) main effects: no leniency differences across raters. • No rating occasion (O) effect: overall, no bias is introduced by raters’ knowledge of disability status. • No rater interactions with tasks or occasions.

  18. Discussion (continued) • However, there is a non-negligible Person by Rater (PxR) interaction, which is considerably larger for students with disabilities. • This does not necessarily constitute bias, but it can still compromise the validity of scores for accommodated students. • Are features in papers from students with disabilities differentially salient to different raters?

  19. Discussion (continued) • There is a large Person by Task (PxT) interaction in Math, but it is considerably smaller for students with disabilities: • Students with disabilities may be less aware of the different nature of the tasks, so this otherwise natural interaction (Miller & Linn, 2000, and others) does not show as strongly. • Accommodations may not be having the intended leveling effects. • With a random task facet, the lower PxT interaction “increases reliability” for students with disabilities (see the expression following this slide).
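The quotation marks around “increases reliability” can be made concrete with the standard relative-error expression for the random [P x R x O x T] design (notation again assumed): the person-by-task component enters the error term divided by the number of tasks, so a smaller \sigma^2_{PT} lowers the error and raises the coefficient even if it reflects reduced sensitivity to task differences rather than better measurement.

E\rho^2 = \frac{\sigma^2_{P}}{\sigma^2_{P} + \sigma^2_{\delta}},
\qquad
\sigma^2_{\delta} = \frac{\sigma^2_{PR}}{n_R} + \frac{\sigma^2_{PO}}{n_O} + \frac{\sigma^2_{PT}}{n_T}
 + \frac{\sigma^2_{PRO}}{n_R n_O} + \frac{\sigma^2_{PRT}}{n_R n_T} + \frac{\sigma^2_{POT}}{n_O n_T}
 + \frac{\sigma^2_{PROT,e}}{n_R n_O n_T}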

  20. Discussion (continued). From the Rater Survey: • Teachers believe that raters show a certain amount of bias and unfairness when scoring performance assessments from students with disabilities. • Raters see themselves as more fair and unbiased than the general population of raters. • Whether this is due to training or to initially high self-perceptions is not clear; a not-uncommon “I’m great, but others not so much” kind of effect could be the sole reason.

  21. Future Directions and Questions • Are there different patterns for different kinds of disabilities/accommodations? • Are accommodations being used appropriately, and are they having the intended effects? • Do the patterns hold for raters at local school sites, who generally receive less training? • Does rater background influence the size and nature of these effects and interactions? • How does the testing occasion facet influence variance components and other interactions?
