Statistical Analysis of Scorer Interrater Reliability


Presentation Transcript


  1. Statistical Analysis of Scorer Interrater Reliability JENNA PORTER DAVID JELINEK SACRAMENTO STATE UNIVERSITY

  2. Background & Overview • Piloted since 2003-2004 • Began with our own version of training & calibration • Overarching question: scorer interrater reliability • i.e., Are we confident that the TE scores our candidates receive are consistent from one scorer to the next? • If not, what are the reasons? What can we do about it? • Jenna: analysis of the data to address these questions • Then we’ll open it up to “Now what?” “Is there anything we can/should do about it?” “Like what?”

  3. Presentation Overview • Is there interrater reliability among our PACT scorers? • How do we know? • Two methods of analysis • Results • What do we do about it? • Implications

  4. Data Collection • 8 credential cohorts • 4 Multiple Subject and 4 Single Subject • 181 Teaching Events total • 11 rubrics (excludes the pilot Feedback rubric) • 10% of TEs randomly selected for double scoring • 10% of TEs were failing and double scored • 38 Teaching Events double scored in total (20%)
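A minimal sketch (not part of the presentation) of the double-scoring selection described above; the function name and ID-based data structures are hypothetical:

```python
import random

def select_for_double_scoring(te_ids, failing_ids, sample_rate=0.10, seed=None):
    """Pick TEs to receive a second, independent score: a random ~10% sample
    plus every failing TE (names and data structures here are hypothetical)."""
    rng = random.Random(seed)
    n_random = max(1, round(sample_rate * len(te_ids)))
    random_sample = set(rng.sample(te_ids, n_random))
    return sorted(random_sample | set(failing_ids))

# With 181 TEs, roughly 18 random picks plus the failing TEs yields about the
# 38 double-scored Teaching Events (about 20%) reported on this slide.
```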

  5. Scoring Procedures • Trained and calibrated scorers • University faculty • Calibrate once per academic year • Followed the PACT calibration standard: scores must • Result in the same pass/fail decision (overall) • Have exact agreement with the benchmark at least 6 times • Be within 1 point of the benchmark • All TEs scored independently once • If failing, scored by a second scorer and the evidence reviewed by the chief trainer
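The calibration standard above can be expressed as a simple check. This is a hedged sketch, not the official PACT procedure: the pass/fail decisions are taken as given booleans because the slide does not spell out the decision rule, and the 6-exact-match threshold is applied across the 11 rubric scores.

```python
def meets_calibration_standard(scorer_scores, benchmark_scores,
                               scorer_pass, benchmark_pass):
    """Hypothetical check of the calibration standard listed above:
    same overall pass/fail decision as the benchmark, exact agreement
    on at least 6 rubrics, and every rubric score within 1 point."""
    same_decision = scorer_pass == benchmark_pass
    exact_hits = sum(s == b for s, b in zip(scorer_scores, benchmark_scores))
    within_one = all(abs(s - b) <= 1 for s, b in zip(scorer_scores, benchmark_scores))
    return same_decision and exact_hits >= 6 and within_one
```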

  6. Methods of Analysis • Percent Agreement • Exact Agreement • Agreement within 1 point • Combined (Exact and within 1 point) • Cohen’s Kappa (Cohen, 1960) • Indicates the percentage of agreement between raters above what is expected by chance • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
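To make the two methods concrete, here is a small sketch (not from the presentation) that computes exact, within-1, and combined percent agreement for two scorers’ rubric levels, plus Cohen’s kappa via scikit-learn; the scores shown are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score  # chance-corrected agreement (Cohen, 1960)

def agreement_rates(scores_a, scores_b):
    """Percent agreement between two scorers on the same rubric:
    exact, off by exactly 1 point, and the two combined."""
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) == 1 for a, b in pairs) / len(pairs)
    return {"exact": exact, "within_1": within_one, "combined": exact + within_one}

# Hypothetical rubric levels (1-4) from two scorers on eight Teaching Events:
scorer_1 = [2, 3, 3, 2, 4, 1, 3, 2]
scorer_2 = [2, 3, 2, 2, 4, 2, 3, 3]
print(agreement_rates(scorer_1, scorer_2))    # {'exact': 0.625, 'within_1': 0.375, 'combined': 1.0}
print(cohen_kappa_score(scorer_1, scorer_2))  # ~0.43: much lower once chance agreement is removed
```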

  7. Percent Agreement • Benefits • Easy to Understand • Limitations • Does not account for chance agreement • Tends to overestimate true agreement (Berk, 1979; Grayson, 2001). • Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83, 460-472. • Grayson, K. (2001). Interrater Reliability. Journal of Consumer Psychology, 10, 71-73.

  8. Cohen’s Kappa • Benefits • Accounts for chance agreement • Can be used to compare across different conditions (Ciminero, Calhoun, & Adams, 1986) • Limitations • Kappa may decrease when base rates are low, so at least 10 occurrences are needed (Nelson & Cicchetti, 1995) • Ciminero, A. R., Calhoun, K. S., & Adams, H. E. (Eds.). (1986). Handbook of behavioral assessment (2nd ed.). New York: Wiley. • Nelson, L. D., & Cicchetti, D. V. (1995). Assessment of emotional functioning in brain impaired individuals. Psychological Assessment, 7, 404–413.
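The low-base-rate limitation is easy to see with a worked example (the numbers are invented): when nearly all scores fall on one rubric level, chance agreement is already high, so even strong raw agreement yields a modest kappa.

```python
# 20 TEs, raters agree on 18 of them (90%), but level "3" dominates the scale:
# each rater gives "3" 18 times and "2" twice.
p_o = 18 / 20                               # observed agreement = 0.90
p_e = (18/20) * (18/20) + (2/20) * (2/20)   # chance agreement from the marginals = 0.82
kappa = (p_o - p_e) / (1 - p_e)             # (0.90 - 0.82) / 0.18
print(round(kappa, 2))                      # 0.44 despite 90% raw agreement
```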

  9. Kappa coefficient • Kappa = (proportion of observed agreement − proportion of chance agreement) / (1 − proportion of chance agreement) • The coefficient ranges from −1.0 (complete disagreement) to 1.0 (perfect agreement) • Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall. • Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley. • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
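As a sketch of the formula above (again, not code from the presentation), Cohen’s kappa can be computed directly from two scorers’ ratings; the helper name and the example scores are hypothetical:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's (1960) kappa:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Same hypothetical scores as in the earlier sketch: p_o = 0.625, p_e ~ 0.344
print(round(cohen_kappa([2, 3, 3, 2, 4, 1, 3, 2], [2, 3, 2, 2, 4, 2, 3, 3]), 2))  # 0.43
```

Identical score lists give a kappa of 1.0, and systematic disagreement pushes it toward −1.0, matching the range cited above.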

  10. Percent Agreement

  11. Percent Agreement

  12. Pass/Fail Disagreement

  13. Cohen’s Kappa

  14. Interrater Reliability Compared

  15. Implications • Overall interrater reliability poor to fair • Consider/reevaluate protocol for calibration • Calibration protocol may be interpreted differently • Drifting scorers? • Use multiple methods to calculate interrater reliability • Other?

  16. How can we increase interrater reliability? • Your thoughts . . . • Training protocol • Adding “Evidence Based” training (Jeanne Stone, UCI) • More calibration

  17. Contact Information • Jenna Porter • jmporter@csus.edu • David Jelinek • djelinek@csus.edu
