Combined Human and Automated Scoring of Writing. Stuart Kahl Measured Progress. The Challenges of Performance Tasks. Contention. The number of independent score points should reflect the amount of evidence. 1. Second reads (double scoring) 2. Additional independent scores 3. Both.
Combined Human and Automated Scoring of Writing
Stuart Kahl, Measured Progress
Contention
The number of independent score points should reflect the amount of evidence.
Uses of Automated Essay Scoring
1. Second reads (double scoring)
2. Additional independent scores
3. Both
Questions
1. Are a couple of human scores plus computer-generated trait scores “better than” many human analytic scores?
2. When humans focus on fewer traits, are their agreement rates higher?
The Student Work We Used
From Grade 8 NECAP: 1 long and 1 short essay (based on 2 common prompts) and 10 MC responses from each of 1694 students.
From Grade 11 NECAP: 2 long essays (based on 2 common prompts) from each of 590 students.
The Essay Score Data
MP/Gates human – 1 holistic, 5 traits (organization, support, focus, language, conventions), double-scored
MP/Gates human – 1 holistic, 1 trait (support), double-scored
Computer-generated trait scores (word choice, mechanics, style, organization, development)
NECAP human – 1 holistic, double-scored
Statistics
Scorer agreement – discrepancy (>1) rates
Decision accuracy – estimate of the proportion of categorization decisions that would match the decisions that would result if scores contained no measurement error
Decision consistency – estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel form
Standard error at cut points
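To make two of these statistics concrete, here is a minimal Python sketch. It assumes that a "discrepancy" means two reads of the same response differing by more than 1 point, as the ">1" in the slide suggests. The second function computes only a simple observed agreement between two sets of scores, not the model-based decision-consistency estimate the study would have used. The function names, the cut scores, and the sample data are all hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def discrepancy_rate(first: Sequence[int], second: Sequence[int],
                     tolerance: int = 1) -> float:
    """Proportion of responses whose two reads differ by more than `tolerance`."""
    if len(first) != len(second):
        raise ValueError("both reads must cover the same responses")
    if not first:
        raise ValueError("no scores supplied")
    discrepant = sum(1 for a, b in zip(first, second) if abs(a - b) > tolerance)
    return discrepant / len(first)


def classification_agreement(form_a: Sequence[float], form_b: Sequence[float],
                             cuts: Sequence[float]) -> float:
    """Observed proportion of students placed in the same performance category
    by two sets of total scores. A score equal to a cut falls in the upper
    category (an assumption made for this sketch)."""
    if len(form_a) != len(form_b):
        raise ValueError("both forms must cover the same students")
    if not form_a:
        raise ValueError("no scores supplied")
    ordered = sorted(cuts)
    same = sum(
        1 for a, b in zip(form_a, form_b)
        if bisect_right(ordered, a) == bisect_right(ordered, b)
    )
    return same / len(form_a)


# Hypothetical data: two reads of six essays on a 1-6 scale.
read_1 = [4, 3, 5, 2, 6, 3]
read_2 = [4, 4, 3, 2, 4, 3]
rate = discrepancy_rate(read_1, read_2)

# Hypothetical total scores on two forms, with made-up cut scores.
agreement = classification_agreement([10, 14, 19, 25], [11, 16, 17, 25],
                                     cuts=[12, 18, 24])
```

With the sample data, two of the six essays have reads more than 1 point apart, and three of the four students land in the same category on both forms.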
MP/Gates Scorer Agreement – # Discrepancies (>1)
Standard Errors at Cuts – Grade 8, continued
Preliminary Findings
Primary
The approach (human holistic + 5 traits vs. human holistic + 1 trait + 5 automated traits) made no difference with respect to decision accuracy/consistency, but it did affect standard errors: the first approach was associated with lower standard errors.
Scorer discrepancy rates were lower when scorers evaluated fewer traits.
Secondary
Including MC items with student essays made no difference with respect to decision accuracy/consistency, but it did reduce standard errors at the cuts.
Adding a second essay both improved decision accuracy/consistency and reduced standard errors at the cuts.
Still need to:
Investigate score combinations other than the ones we looked at, especially “holistic alone.”
Understand why the scoring approaches investigated and the MC items made no difference with respect to decision accuracy/consistency, but did with respect to standard errors at the cuts.
Test significance.
What Might Be
Human holistic and limited analytic scores
+ “trained” automated holistic scores as a second read and as a check of human scores to determine the need for arbitration
+ “untrained” automated analytic trait scores
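The "check of human scores" idea can be sketched as a simple routing rule. This is only an illustration. It assumes an essay goes to arbitration when the human and automated holistic scores differ by more than 1 point, mirroring the discrepancy threshold used for human reads. The slides do not specify a rule, and the function name, threshold, and scores below are hypothetical.

```python
from typing import List, Sequence


def essays_needing_arbitration(human: Sequence[int],
                               automated: Sequence[float],
                               threshold: float = 1.0) -> List[int]:
    """Return indices of essays whose human and automated holistic scores
    differ by more than `threshold` (an assumed rule, not from the slides)."""
    if len(human) != len(automated):
        raise ValueError("each essay needs one human and one automated score")
    return [
        i for i, (h, a) in enumerate(zip(human, automated))
        if abs(h - a) > threshold
    ]


# Hypothetical scores for four essays.
human_scores = [4, 2, 5, 3]
automated_scores = [4.2, 3.6, 3.5, 3.0]
flagged = essays_needing_arbitration(human_scores, automated_scores)
```

Here the second and third essays (indices 1 and 2) would be routed to a human arbitrator, while the others would keep their human scores.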