Explore the evolution of AZELLA's Speaking Test, from manual scoring to automated methods, enhancing consistency and efficiency. Learn about the challenges faced and the proposed solutions.
Automated Scoring for Speaking Assessments
Arizona English Language Learner Assessment
Irene Hunting - Arizona Department of Education
Yuan D’Antilio - Pearson
Erica Baltierra - Pearson
June 24, 2015
Arizona English Language Learner Assessment (AZELLA) • AZELLA is Arizona’s own English Language Proficiency Assessment. • AZELLA has been in use since school year 2006-2007. • Arizona revised its English Language Proficiency (ELP) Standards following the adoption of the Arizona College and Career Ready Standards in 2010. • AZELLA had to be revised to align with the new ELP Standards. • Arizona revised not only the alignment of AZELLA but also its administration practices and procedures. • Revisions to the Speaking portion of the AZELLA are particularly notable.
AZELLA Speaking Test Administration: Prior to School Year 2012-2013 • Administered orally by test administrator • One-on-one administration • Scored by test administrator • Immediate scores • Training for test administrators • Minimal • Not required
AZELLA Speaking Test Concerns: Prior to School Year 2012-2013 • Inconsistent test administration • Not able to standardize test delivery • Inconsistent scoring • Not able to replicate or verify scoring
AZELLA Speaking Test Desires: For School Year 2012-2013 and Beyond • Consistent test administration • Every student has the same testing experience • Consistent and quick scoring • Record student responses • Reliability statistics for scoring • Minimal burden for schools • No special equipment • No special personnel requirements or trainings • Similar amount of time to administer
AZELLA Speaking Test Administration: For School Year 2012-2013 and Beyond • Consistent test administration • Administered one-on-one via speaker telephone • Consistent and quick scoring • Student responses are recorded • Reliable machine scoring • Minimal burden for schools • Requires a landline speaker telephone • No special personnel requirements or training • Slightly longer test administration time
Development of Automated Scoring Method [diagram: workflow connecting test specs, test developers, item text, field testing data, recorded items, human transcribers, human raters, the testing system, automated scores, and validation]
Why does automated scoring of speaking work? • The acoustic models used for speech recognition are optimized for various accents • Young children’s speech, foreign accents • The test questions have been modeled from field test data • The system anticipates the various ways that students respond
Field-Tested Items: The test questions have been modeled from field test data; the system anticipates the various ways that students respond, e.g., “What is in the picture?”
Language models [diagram: lattice of anticipated responses such as “a protractor”, “it’s a protractor”, “I don’t know”, “a compass”]
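To make the idea of item-specific response modeling concrete, here is a minimal, purely illustrative Python sketch. It builds a toy model of anticipated responses from hypothetical field-test transcriptions for the “What is in the picture?” item and looks up how common a recognized response was. The data, function names, and logic are invented for illustration and do not represent Pearson’s actual scoring engine.

```python
# Illustrative sketch only: a toy "language model" of anticipated responses
# for one item, built from hypothetical field-test transcriptions.
from collections import Counter

# Hypothetical transcriptions collected during field testing for the item
# "What is in the picture?"
field_test_responses = [
    "a protractor",
    "it's a protractor",
    "a protractor",
    "a compass",
    "i don't know",
]

# Build a simple model: the relative frequency of each anticipated response.
counts = Counter(field_test_responses)
total = sum(counts.values())
anticipated = {resp: n / total for resp, n in counts.items()}

def response_likelihood(recognized_text: str) -> float:
    """Return how frequent the recognized response was among anticipated
    responses (0.0 if it never appeared in the field-test data)."""
    return anticipated.get(recognized_text.lower().strip(), 0.0)

print(response_likelihood("It's a protractor"))  # 0.2
print(response_likelihood("a banana"))           # 0.0
```

In a real system the anticipated responses would feed an automatic speech recognizer and a trained scoring model rather than a simple lookup, but the principle is the same: responses seen in field testing shape what the system expects to hear.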
Field Testing and Data Preparation • Two rounds of field testing: 2011-2012 • Number of students: 31,685 (1st-12th grade) and 13,141 (Kindergarten)
Sample student responses • Example transcription: “first you wake up and then you put on your clothes # and eat breakfast” • Scores: 3, 3.35
Validity evidence: Are machine scores comparable to human scores? Measures we looked at: • Reliability (internal consistency) • Candidate-level (or test-level) correlations • Item-level correlations
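As a rough illustration of how these comparability measures can be computed, the following Python sketch calculates a test-level correlation, item-level correlations, and Cronbach’s alpha (internal consistency) on a small set of invented scores. The numbers are hypothetical; the actual AZELLA analyses were run on the 2011-2012 field-test data described earlier.

```python
# Illustrative sketch only: comparing machine scores to human ratings.
import numpy as np

# Hypothetical item-level scores: rows = students, columns = items.
human = np.array([[3, 2, 4], [4, 4, 5], [2, 1, 2], [5, 4, 4]], dtype=float)
machine = np.array([[3.2, 2.1, 3.8], [4.1, 3.7, 4.9],
                    [1.8, 1.2, 2.3], [4.6, 4.2, 4.1]])

# Candidate-level (test-level) correlation: compare each student's total
# human score with their total machine score.
r_test = np.corrcoef(human.sum(axis=1), machine.sum(axis=1))[0, 1]

# Item-level correlations: compare human and machine scores item by item.
r_items = [np.corrcoef(human[:, j], machine[:, j])[0, 1]
           for j in range(human.shape[1])]

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency of a set of item scores (rows = students)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print(f"test-level r = {r_test:.2f}")
print(f"item-level r = {[round(r, 2) for r in r_items]}")
print(f"alpha (machine scores) = {cronbach_alpha(machine):.2f}")
```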
Scatterplots by Stage [figure: panels for Stage II, Stage III, Stage IV, and Stage V]
Summary of Score Comparability Machine-generated scores are comparable to human ratings • Reliability (internal consistency) • Test-level correlations • Item-type-level correlations
Test Administration • Preparation • One-on-one practice – student and test administrator • Demonstration Video • Landline Speaker Telephone for one-on-one administration • Student Answer Document – Unique Speaking Test Code
Test Administration • Warm-Up Questions • What is your first and last name? • What is your teacher’s name? • How old are you? • Purpose of the Warm-Up Questions • Student becomes more familiar with prompting • Sound check for student voice level and equipment • Capture demographic data to resolve future inquiries • Responses are not scored
Challenges • Landline speaker telephone availability • ADE purchased speaker telephones for the first year of administration • Difficulty scoring the young population • Additional warm-up questions • Added beeps to prompt the student to respond • Adjusted the acceptable audio threshold • Rubric update and scoring engine recalibration • Incorrect Speaking Codes • Captured demographics from warm-up questions • Updated the Speaking code key entry process • Documented the test administrator name and time of administration
Summary • Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments • Standardized test delivery • Minimal test set-up and training required • Consistent scoring • Availability of test data for analysis and review