
Automated Scoring for Speaking Assessments

Explore the evolution of AZELLA's Speaking Test, from manual scoring to automated methods, enhancing consistency and efficiency. Learn about the challenges faced and the proposed solutions.


Presentation Transcript


  1. Automated Scoring for Speaking Assessments • Arizona English Language Learner Assessment • Irene Hunting - Arizona Department of Education • Yuan D’Antilio - Pearson • Erica Baltierra - Pearson • June 24, 2015

  2. Arizona English Language Learner Assessment (AZELLA) • AZELLA is Arizona’s own English Language Proficiency Assessment. • AZELLA has been in use since school year 2006-2007. • Arizona revised its English Language Proficiency (ELP) Standards due to the adoption of the Arizona College and Career Ready Standards in 2010. • AZELLA had to be revised to align with the new ELP Standards. • Arizona not only revised the alignment of AZELLA but also revised administration practices and procedures. • Revisions to the Speaking portion of the AZELLA are particularly notable.

  3. AZELLA Speaking Test Administration: Prior to School Year 2012-2013 • Administered orally by test administrator • One-on-one administration • Scored by test administrator • Immediate scores • Training for test administrators: minimal; not required

  4. AZELLA Speaking Test Concerns: Prior to School Year 2012-2013 • Inconsistent test administration • Not able to standardize test delivery • Inconsistent scoring • Not able to replicate or verify scoring

  5. AZELLA Speaking Test Desires: For School Year 2012-2013 and beyond • Consistent test administration • Every student has the same testing experience • Consistent and quick scoring • Record student responses • Reliability statistics for scoring • Minimal burden for schools • No special equipment • No special personnel requirements or trainings • Similar amount of time to administer

  6. AZELLA Speaking Test Administration: For School Year 2012-2013 and beyond • Consistent test administration • Administered one-on-one via speaker telephone • Consistent and quick scoring • Student responses are recorded • Reliable machine scoring • Minimal burden for schools • Requires a landline speaker telephone • No special personnel requirements or training • Slightly longer test administration time

  7. Proposed Solution

  8. Development of Automated Scoring Method [Flow diagram with components: test developers, test specifications, item text, recorded items, field testing data, human transcribers, human raters, testing system, validation, automated scores]
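
The flow diagram can be read as a simple data pipeline: field-test responses are recorded, transcribed by humans, rated by human raters, used to calibrate the automated system, and then held-out responses are used to validate machine scores against human scores. Below is a minimal structural sketch of that pipeline in Python; the class and function names are hypothetical, not Pearson's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class FieldTestResponse:
    """One recorded field-test response (hypothetical field names)."""
    item_id: str
    audio_path: str      # recorded item
    transcript: str      # produced by a human transcriber
    human_score: float   # assigned by a trained human rater

def calibrate_scoring_system(training: list[FieldTestResponse]):
    """Stand-in for calibration: fit acoustic/language models and
    per-item scoring models from transcribed, human-rated data."""
    ...

def validate_against_humans(system, held_out: list[FieldTestResponse]):
    """Stand-in for validation: score held-out responses with the
    system and compare the machine scores with the human scores."""
    ...
```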

  9. Why does automated scoring of speaking work? • The acoustic models used for speech recognition are optimized for various accents • Young children’s speech, foreign accents • The test questions have been modeled from field test data • The system anticipates the various ways that students respond

  10. Field-Tested Items The test questions have been modeled from field test data – the system anticipates the various ways that students respond, e.g. “What is in the picture?”

  11. Language Models [Diagram: word lattice of anticipated responses built from field-test transcripts, e.g. “it’s a protractor”, “a protractor”, “a compass”, “I don’t know”]
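
A toy illustration of the idea behind slides 10 and 11: responses collected during field testing tell the system which answers to expect for an item such as “What is in the picture?”. The counts and the helper below are made up for illustration; the operational system works with acoustic and word-level language models, not whole-response lookups.

```python
from collections import Counter

# Hypothetical field-test transcripts for one item; the counts act as a
# toy "language model" of anticipated responses.
observed = Counter({
    "a protractor": 41,
    "it's a protractor": 17,
    "a compass": 9,
    "i don't know": 6,
})
total = sum(observed.values())

def response_likelihood(response: str) -> float:
    """Relative frequency of an anticipated response; unseen responses
    get 0.0 in this toy model (a real recognizer smooths and scores
    word sequences instead of whole responses)."""
    return observed[response.lower().strip()] / total

print(response_likelihood("A protractor"))  # a common, anticipated answer
print(response_likelihood("a ruler"))       # 0.0: not seen in field testing
```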

  12. Field Testing and Data Preparation • Two field tests: 2011-2012 • Number of students: 31,685 (1st-12th grade); 13,141 (Kindergarten)

  13. Item Type for Automated Scoring

  14. Sample Speaking Rubric: 0 – 4 Point Item

  15. Sample Student Responses • “first you wake up and then you put on your clothes # and eat breakfast” • Scores: 3, 3.35

  16. Validity evidence: Are machine scores comparable to human scores? Measures we looked at: • Reliability (internal consistency) • Candidate-level (or test-level) correlations • Item-level correlations
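
The test-level and item-level comparisons on slides 16 through 20 are ordinary Pearson correlations between machine and human scores. Here is a small self-contained sketch with made-up numbers; internal-consistency reliability (e.g. Cronbach’s alpha) would additionally need the full matrix of per-item scores.

```python
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between paired score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Made-up totals for five students: candidate-level (test-level) correlation.
human_total   = [12, 18, 9, 22, 15]
machine_total = [11, 19, 10, 21, 16]
print(round(pearson(human_total, machine_total), 3))

# Applying the same function to scores on a single item (or item type)
# gives the item-level correlations reported later in the deck.
```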

  17. Structural reliability

  18. Scatterplot by Stage [Four scatterplot panels: Stage II, Stage III, Stage IV, Stage V]
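
The scatterplots on slide 18 presumably place human scores on one axis and machine scores on the other, one panel per AZELLA stage. A matplotlib sketch of that layout with illustrative data (not the actual AZELLA results):

```python
import matplotlib.pyplot as plt

# Illustrative (human, machine) score pairs, one panel per stage.
stages = {
    "Stage II":  ([1, 2, 2, 3, 4], [1, 2, 3, 3, 4]),
    "Stage III": ([0, 1, 3, 3, 4], [0, 2, 3, 4, 4]),
    "Stage IV":  ([2, 2, 3, 4, 4], [2, 3, 3, 4, 4]),
    "Stage V":   ([1, 2, 2, 3, 4], [1, 1, 2, 3, 4]),
}

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, (stage, (human, machine)) in zip(axes.flat, stages.items()):
    ax.scatter(human, machine)
    ax.set_title(stage)
    ax.set_xlabel("Human score")
    ax.set_ylabel("Machine score")
fig.tight_layout()
plt.show()
```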

  19. Item-level performance: by item type

  20. Item-level performance: by item type

  21. Summary of Score Comparability Machine-generated scores are comparable to human ratings • Reliability (internal consistency) • Test-level correlations • Item-type-level correlations

  22. Test Administration • Preparation • One-on-one practice – student and test administrator • Demonstration Video • Landline Speaker Telephone for one-on-one administration • Student Answer Document – Unique Speaking Test Code

  23. Test Administration

  24. Test Administration • Warm Up Questions • What is your first and last name? • What is your teacher’s name? • How old are you? • Purpose of the Warm Up Questions • Student becomes more familiar with prompting • Sound check for student voice level, equipment • Capture Demographic data to resolve future inquiries • Responses are not scored

  25. Challenges • Landline speaker telephone availability: ADE purchased speaker telephones for the first year of administration • Difficulty scoring young population: additional warm up questions; added beeps to prompt the student to respond; adjusted acceptable audio threshold; rubric update and scoring engine recalibration • Incorrect Speaking Codes: captured demographics from warm up questions; speaking code key entry process updated; documentation of test administrator name and time of administration
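
One of the fixes listed for the young-population challenge, adjusting the acceptable audio threshold, comes down to deciding when a recording is loud enough to attempt scoring. A rough standard-library sketch of such a check; the function name and the threshold value are assumptions for illustration, not the vendor's actual logic.

```python
import math
import struct
import wave

def loud_enough(path: str, rms_threshold: float = 300.0) -> bool:
    """Return True if a 16-bit mono WAV file's RMS level clears the
    threshold. Lowering the threshold accepts quieter speakers (young
    children); setting it too low lets background noise pass as speech."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= rms_threshold

# Hypothetical usage:
# print(loud_enough("student_response.wav", rms_threshold=250.0))
```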

  26. Summary • Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments • Standardized test delivery • Minimal test set-up and training required • Consistent scoring • Availability of test data for analysis and review

  27. Questions
