Explore the evolution of AZELLA's Speaking Test, from manual scoring to automated methods, enhancing consistency and efficiency. Learn about the challenges faced and the proposed solutions.
Automated Scoring for Speaking Assessments
Arizona English Language Learner Assessment
Irene Hunting - Arizona Department of Education
Yuan D’Antilio - Pearson
Erica Baltierra - Pearson
June 24, 2015
Arizona English Language Learner Assessment (AZELLA) • AZELLA is Arizona’s own English Language Proficiency Assessment. • AZELLA has been in use since school year 2006-2007. • Arizona revised its English Language Proficiency (ELP) Standards following the adoption of the Arizona College and Career Ready Standards in 2010. • AZELLA had to be revised to align with the new ELP Standards. • Arizona revised not only the alignment of AZELLA but also its administration practices and procedures. • Revisions to the Speaking portion of the AZELLA are particularly notable.
AZELLA Speaking Test Administration: Prior to School Year 2012-2013 • Administered orally by test administrator • One-on-one administration • Scored by test administrator • Immediate scores • Training for test administrators • Minimal • Not required
AZELLA Speaking Test Concerns: Prior to School Year 2012-2013 • Inconsistent test administration • Not able to standardize test delivery • Inconsistent scoring • Not able to replicate or verify scoring
AZELLA Speaking Test Desires: For School Year 2012-2013 and Beyond • Consistent test administration • Every student has the same testing experience • Consistent and quick scoring • Record student responses • Reliability statistics for scoring • Minimal burden for schools • No special equipment • No special personnel requirements or trainings • Similar amount of time to administer
AZELLA Speaking Test Administration: For School Year 2012-2013 and Beyond • Consistent test administration • Administered one-on-one via speaker telephone • Consistent and quick scoring • Student responses are recorded • Reliable machine scoring • Minimal burden for schools • Requires a landline speaker telephone • No special personnel requirements or training • Slightly longer test administration time
Development of Automated Scoring Method [diagram: workflow connecting test specs, test developers, item text, field testing data, recorded items, human transcribers, human raters, the testing system, automated scores, and validation]
Why does automated scoring of speaking work? • The acoustic models used for speech recognition are optimized for various accents • Young children’s speech, foreign accents • The test questions have been modeled from field test data • The system anticipates the various ways that students respond
Field-Tested Items: The test questions have been modeled from field test data; the system anticipates the various ways that students respond, e.g., “What is in the picture?”
Language models [diagram: lattice of anticipated responses such as “a protractor”, “it’s a protractor”, “I don’t know”, “a compass”]
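To make the idea of item-specific response modeling concrete, here is a minimal, purely illustrative Python sketch. It builds a toy model of anticipated responses from hypothetical field-test transcriptions for the “What is in the picture?” item and looks up how common a recognized response was. The data, function names, and logic are invented for illustration and do not represent Pearson’s actual scoring engine.

```python
# Illustrative sketch only: a toy "language model" of anticipated responses
# for one item, built from hypothetical field-test transcriptions.
from collections import Counter

# Hypothetical transcriptions collected during field testing for the item
# "What is in the picture?"
field_test_responses = [
    "a protractor",
    "it's a protractor",
    "a protractor",
    "a compass",
    "i don't know",
]

# Build a simple model: the relative frequency of each anticipated response.
counts = Counter(field_test_responses)
total = sum(counts.values())
anticipated = {resp: n / total for resp, n in counts.items()}

def response_likelihood(recognized_text: str) -> float:
    """Return how frequent the recognized response was among anticipated
    responses (0.0 if it never appeared in the field-test data)."""
    return anticipated.get(recognized_text.lower().strip(), 0.0)

print(response_likelihood("It's a protractor"))  # 0.2
print(response_likelihood("a banana"))           # 0.0
```

In a real system the anticipated responses would feed an automatic speech recognizer and a trained scoring model rather than a simple lookup, but the principle is the same: responses seen in field testing shape what the system expects to hear.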
Field Testing and Data Preparation • Two rounds of field testing: 2011-2012 • Number of students: 31,685 (1st-12th grade) and 13,141 (Kindergarten)
Sample student responses • Example transcription: “first you wake up and then you put on your clothes # and eat breakfast” • Scores: 3, 3.35
Validity evidence: Are machine scores comparable to human scores? Measures we looked at: • Reliability (internal consistency) • Candidate-level (or test-level) correlations • Item-level correlations
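As a rough illustration of how these comparability measures can be computed, the following Python sketch calculates a test-level correlation, item-level correlations, and Cronbach’s alpha (internal consistency) on a small set of invented scores. The numbers are hypothetical; the actual AZELLA analyses were run on the 2011-2012 field-test data described earlier.

```python
# Illustrative sketch only: comparing machine scores to human ratings.
import numpy as np

# Hypothetical item-level scores: rows = students, columns = items.
human = np.array([[3, 2, 4], [4, 4, 5], [2, 1, 2], [5, 4, 4]], dtype=float)
machine = np.array([[3.2, 2.1, 3.8], [4.1, 3.7, 4.9],
                    [1.8, 1.2, 2.3], [4.6, 4.2, 4.1]])

# Candidate-level (test-level) correlation: compare each student's total
# human score with their total machine score.
r_test = np.corrcoef(human.sum(axis=1), machine.sum(axis=1))[0, 1]

# Item-level correlations: compare human and machine scores item by item.
r_items = [np.corrcoef(human[:, j], machine[:, j])[0, 1]
           for j in range(human.shape[1])]

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency of a set of item scores (rows = students)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print(f"test-level r = {r_test:.2f}")
print(f"item-level r = {[round(r, 2) for r in r_items]}")
print(f"alpha (machine scores) = {cronbach_alpha(machine):.2f}")
```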
Scatterplots by Stage [figure: panels for Stage II, Stage III, Stage IV, and Stage V]
Summary of Score Comparability Machine-generated scores are comparable to human ratings • Reliability (internal consistency) • Test-level correlations • Item-type-level correlations
Test Administration • Preparation • One-on-one practice – student and test administrator • Demonstration Video • Landline Speaker Telephone for one-on-one administration • Student Answer Document – Unique Speaking Test Code
Test Administration • Warm-Up Questions • What is your first and last name? • What is your teacher’s name? • How old are you? • Purpose of the Warm-Up Questions • Student becomes more familiar with prompting • Sound check for student voice level and equipment • Capture demographic data to resolve future inquiries • Responses are not scored
Challenges • Landline speaker telephone availability • ADE purchased speaker telephones for the first year of administration • Difficulty scoring the young population • Additional warm-up questions • Added beeps to prompt the student to respond • Adjusted the acceptable audio threshold • Rubric update and scoring engine recalibration • Incorrect Speaking Codes • Captured demographics from warm-up questions • Updated the Speaking code key entry process • Documented the test administrator name and time of administration
Summary • Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments • Standardized test delivery • Minimal test set-up and training required • Consistent scoring • Availability of test data for analysis and review