
Combined Human and Automated Scoring of Writing

Stuart Kahl, Measured Progress. The Challenges of Performance Tasks. Contention: the number of independent score points should reflect the amount of evidence. Uses of automated essay scoring: (1) second reads (double scoring); (2) additional independent scores; (3) both.


Presentation Transcript


  1. Combined Human and Automated Scoring of Writing. Stuart Kahl, Measured Progress

  2. The Challenges of Performance Tasks

  3. Contention: The number of independent score points should reflect the amount of evidence.

  4. Uses of Automated Essay Scoring: (1) second reads (double scoring); (2) additional independent scores; (3) both.

  5. Questions: (1) Are a couple of human scores plus computer-generated trait scores “better than” many human analytic scores? (2) When humans focus on fewer traits, are their agreement rates higher?

  6. The Student Work We Used: From Grade 8 NECAP: 1 long and 1 short essay (based on 2 common prompts) and 10 multiple-choice (MC) responses from each of 1694 students. From Grade 11 NECAP: 2 long essays (based on 2 common prompts) from each of 590 students.

  7. The Essay Score Data: (a) MP/Gates human: 1 holistic, 5 traits (organization, support, focus, language, conventions), double-scored; (b) MP/Gates human: 1 holistic, 1 trait (support), double-scored; (c) computer-generated trait scores (word choice, mechanics, style, organization, development); (d) NECAP human: 1 holistic, double-scored.

  8. Statistics: Scorer agreement: discrepancy (>1) rates. Decision accuracy: estimate of the proportion of categorization decisions that would match the decisions that would result if scores contained no measurement error. Decision consistency: estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel form. Standard error at cut points.

  9. MP/Gates Scorer Agreement – # Discrepancies (>1)

  10. Decision Accuracy (and Consistency) – Grade 8

  11. Decision Accuracy (and Consistency) – Grade 8, continued

  12. Decision Accuracy (and Consistency) – Grade 11

  13. Standard Errors at Cuts – Grade 8

  14. Standard Errors at Cuts – Grade 8, continued

  15. Standard Errors at Cuts – Grade 11

  16. Preliminary Findings. Primary: The approach (human holistic + 5 traits vs. human holistic + 1 trait + automated 5 traits) made no difference to decision accuracy/consistency, but it did affect standard errors: the first approach was associated with lower standard errors. Scorer discrepancy rates were lower when scorers evaluated fewer traits. Secondary: Including MC items with student essays made no difference to decision accuracy/consistency, but it did reduce standard errors at the cuts. Adding a second essay both improved decision accuracy/consistency and reduced standard errors at the cuts.

  17. Still need to: investigate other score combinations relative to the ones we looked at, especially “holistic alone”; understand why approach (the ones investigated) and MC items made no difference with respect to decision accuracy/consistency, but did with respect to standard errors at the cuts; test significance.

  18. What Might Be: Human holistic and limited analytic scores + “trained” automated holistic scores as second read and as check of human scores to determine need for arbitration + “untrained” automated analytic trait scores

  19. It’s all about student learning. Period.
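
The scorer-agreement statistic on slide 8, the rate of discrepancies greater than 1 between two reads of the same essay, can be illustrated with a short sketch. This is not the presenters' code: the function name `discrepancy_rate`, the `threshold` parameter, and the score lists are hypothetical, and the sketch assumes integer scores with each essay scored once by each of two readers.

```python
from typing import Sequence


def discrepancy_rate(first: Sequence[int], second: Sequence[int],
                     threshold: int = 1) -> float:
    """Share of essays whose two reads differ by more than `threshold` points.

    Hypothetical helper: the slides only name the statistic
    ("discrepancy (>1) rates"); this is one plausible way to compute it.
    """
    if len(first) != len(second):
        raise ValueError("both reads must cover the same essays")
    if not first:
        raise ValueError("at least one essay is required")
    discrepant = sum(1 for a, b in zip(first, second) if abs(a - b) > threshold)
    return discrepant / len(first)


# Made-up scores for six essays (not data from the study).
first_reads = [4, 3, 5, 2, 6, 3]
second_reads = [4, 5, 5, 1, 3, 3]

rate = discrepancy_rate(first_reads, second_reads)
print(f"Discrepancy (>1) rate: {rate:.3f}")
```

Adjacent scores (differing by exactly 1) are not counted as discrepant here, which matches the ">1" wording on the slide. Whether the study pooled traits or computed the rate per trait is not stated in the transcript.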
