Combined Human and Automated Scoring of Writing

Stuart Kahl, Measured Progress

The Challenges of Performance Tasks

Contention

The number of independent score points should reflect the amount of evidence.

Uses of Automated Essay Scoring

1. Second reads (double scoring)
2. Additional independent scores
3. Both

Questions

1. Are a couple of human scores plus computer-generated trait scores “better than” many human analytic scores?
2. When humans focus on fewer traits, are their agreement rates higher?

The Student Work We Used

From Grade 8 NECAP
* 1 long and 1 short essay (based on 2 common prompts) and 10 MC responses from each of 1694 students.

From Grade 11 NECAP
* 2 long essays (based on 2 common prompts) from each of 590 students.

The Essay Score Data

* MP/Gates human – 1 holistic, 5 traits (organization, support, focus, language, conventions), double-scored
* MP/Gates human – 1 holistic, 1 trait (support), double-scored
* Computer-generated trait scores (word choice, mechanics, style, organization, development)
* NECAP human – 1 holistic, double-scored

Statistics

* Scorer agreement – discrepancy (>1) rates
* Decision accuracy – estimate of the proportion of categorization decisions that would match the decisions that would result if scores contained no measurement error
* Decision consistency – estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel form
* Standard error at cut points
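To make two of these statistics concrete, here is a minimal Python sketch. It is an illustration only, not the procedure used in the study. The function names, the score scale, the cut scores, and all data values are made up. In particular, `decision_agreement` simply compares categorizations from two observed sets of scores; operational decision consistency and accuracy figures are usually model-based estimates from a single administration, which this sketch does not attempt.

```python
from bisect import bisect_right
from typing import Sequence


def discrepancy_rate(first: Sequence[int], second: Sequence[int],
                     tolerance: int = 1) -> float:
    """Share of double-scored responses whose two reads differ by more than `tolerance`."""
    if len(first) != len(second):
        raise ValueError("both reads must cover the same responses")
    if not first:
        raise ValueError("no scores supplied")
    discrepant = sum(1 for a, b in zip(first, second) if abs(a - b) > tolerance)
    return discrepant / len(first)


def categorize(score: float, cuts: Sequence[float]) -> int:
    """Return the category index for a score; `cuts` must be sorted ascending.

    A score equal to a cut falls in the higher category.
    """
    return bisect_right(cuts, score)


def decision_agreement(form_a: Sequence[float], form_b: Sequence[float],
                       cuts: Sequence[float]) -> float:
    """Proportion of students placed in the same category by two sets of scores."""
    if len(form_a) != len(form_b):
        raise ValueError("both forms must cover the same students")
    if not form_a:
        raise ValueError("no scores supplied")
    same = sum(1 for a, b in zip(form_a, form_b)
               if categorize(a, cuts) == categorize(b, cuts))
    return same / len(form_a)


# Hypothetical data: six essays scored twice on an assumed 1-6 holistic scale.
read_1 = [4, 3, 5, 2, 6, 3]
read_2 = [4, 5, 4, 2, 3, 3]
rate = discrepancy_rate(read_1, read_2)

# Hypothetical total scores on two forms, with assumed cut scores.
cuts = [10, 20, 30]
form_a = [8, 15, 22, 31, 19]
form_b = [9, 21, 25, 29, 18]
agreement = decision_agreement(form_a, form_b, cuts)
```

The `tolerance` default of 1 mirrors the ">1" discrepancy definition on the slide; everything else here is an assumption chosen for illustration.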

MP/Gates Scorer Agreement – # Discrepancies (>1)

Decision Accuracy (and Consistency) – Grade 8

Decision Accuracy (and Consistency) – Grade 8, continued

Decision Accuracy (and Consistency) – Grade 11

Standard Errors at Cuts – Grade 8

Standard Errors at Cuts – Grade 8, continued

Standard Errors at Cuts – Grade 11

Preliminary Findings

Primary
* The approach (human holistic + 5 traits vs. human holistic + 1 trait + automated 5 traits) did not make a difference with respect to decision accuracy/consistency, but it did with respect to standard error: the first approach was associated with lower standard errors.
* Scorer discrepancy rates were lower when scorers evaluated fewer traits.

Secondary
* The inclusion of MC items with student essays did not make a difference with respect to decision accuracy/consistency, but it did reduce standard errors at the cuts.
* The addition of a second essay both improved decision accuracy/consistency and reduced standard errors at the cuts.

Still need to:

* investigate other score combinations relative to the ones we looked at, especially “holistic alone.”
* understand why approach (the ones investigated) and MC items made no difference with respect to decision accuracy/consistency, but did with respect to standard errors at the cuts.
* test significance.

What Might Be

Human holistic and limited analytic scores
+ “trained” automated holistic scores as second read and as check of human scores to determine need for arbitration
+ “untrained” automated analytic trait scores
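The idea of using an automated holistic score as a check on the human score could be sketched as follows. This is a hypothetical illustration, not a described implementation: the `ScoredEssay` class, the `needs_arbitration` function, the essay IDs, and the scores are all invented, and the threshold of more than 1 point is borrowed from the discrepancy definition used for scorer agreement rather than stated for this purpose.

```python
from dataclasses import dataclass


@dataclass
class ScoredEssay:
    """One essay with a human holistic score and an automated holistic score (hypothetical)."""
    essay_id: str
    human_holistic: int
    automated_holistic: int


def needs_arbitration(essay: ScoredEssay, tolerance: int = 1) -> bool:
    """Flag the essay when the human and automated scores differ by more than `tolerance`."""
    return abs(essay.human_holistic - essay.automated_holistic) > tolerance


# Made-up essays on an assumed 1-6 holistic scale.
essays = [
    ScoredEssay("E1", human_holistic=4, automated_holistic=4),
    ScoredEssay("E2", human_holistic=2, automated_holistic=5),
    ScoredEssay("E3", human_holistic=5, automated_holistic=4),
]

# Only essays flagged here would be routed to an additional human read.
flagged = [e.essay_id for e in essays if needs_arbitration(e)]
```

Under this assumed rule, adjacent scores count as agreement, so the automated score triggers extra human reading only for larger gaps.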

It’s all about student learning. Period.
