
Automatic Assessment of the Speech of Young English Learners

Explore the development and validation of automatic speech scoring methods for the speaking part of the Arizona English Language Learner Assessment (AZELLA) test. The research focuses on item type analysis, data collection, human transcriptions and scoring methods, and machine scoring techniques. Various models are utilized, from ASR to content and spectral modeling, to predict human holistic scores accurately. This study aims to enhance assessment efficiency for English learners.


Presentation Transcript


  1. Automatic Assessment of the Speech of Young English Learners • Jian Cheng, Yuan Zhao D’Antilio, Xin Chen, Jared Bernstein • Knowledge Technologies, Pearson, Menlo Park, California, USA • BEA 2014

  2. Overview • Introduction • Item type analysis • Data • Human transcriptions and scoring • Machine scoring methods • Experimental results • Unscorable test detection • Future work • Conclusions

  3. Introduction • The Arizona English Language Learner Assessment (AZELLA) is an English Learners (ELs) test administered in the state of Arizona to K-12 students by the Arizona Department of Education (ADE). • Five stages: Kindergarten, Elementary, Primary, Middle, and High School. • AZELLA is a four-skills test; this research focuses on generating scores automatically for the speaking part. • The first field test (Stages 2-5) took place around Nov. 2011; Pearson Knowledge Technologies (PKT) delivered over 31K tests. The second field test (Stage 1) took place around April 2012; PKT delivered over 13K tests. • The first operational AZELLA test with automatic speech scoring took place between January and February 2013, with approximately 140K tests delivered. After that, PKT is expected to deliver around 180K tests annually.

  4. Item type analysis Constrained item types: • Naming • Read syllables for one word • Read a three-word sequence • Repeat Fairly unconstrained item types: • Questions about an image • Give directions from a map • Ask questions about a thing • Open questions about a topic • Give instructions to do something • Similarities & differences • Ask questions about a statement • Detailed response to a topic

  5. Data • From the data in the first field test (Stages 2-5), for each AZELLA stage, we randomly sampled 300 tests (75 tests/form x 4 forms) as a validation set and 1,200 tests as a development set. For the data in the second field test (Stage 1), we randomly sampled 167 tests from the four forms as the validation set and 1,200 tests as the development set. • No validation data was used for model training.

  6. Human transcriptions and scoring • In the development sets, we had 100 to 300 responses per item transcribed, depending on the complexity of the item type. • All responses from the tests were scored by trained professional raters according to predefined ADE rubrics. • Every response has one trait: a human holistic score. • We used the average score from the different raters as the final score for machine learning. • The responses in each validation set were double rated (producing two final scores) for use in validation. • For responses to open-ended item types, the AZELLA holistic score rubrics require raters to consider both the content and the manner of speaking used in the response.

  7. Machine scoring methods • We used different features (content and manner) derived from speech to predict the final human holistic score. • ASR (Automatic Speech Recognition) • Acoustic models • Language models • Content modeling • Duration modeling • Spectral modeling • Confidence modeling • Final models

  8. Machine scoring methods - Content modeling • Content indicates how well the test-taker understood the prompt and could respond with appropriate linguistic content. • has_keywords: the occurrence of the correct sequence of syllables or words. • word_errors: the minimum number of substitutions, deletions, and/or insertions required to best match the response against the answer choices. • word_vector: a scaled, weighted sum of the occurrences of a large set of expected words and word sequences that may be recognized in the spoken response. Weights are assigned to the expected words and word sequences automatically, according to their relation to good responses, using latent semantic analysis (LSA). (A sketch of the first two features follows below.)
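As a rough illustration of the two simpler content features, the sketch below (Python, not from the paper; the function names, answer choices, and example prompt are hypothetical) computes word_errors as a word-level edit distance against answer choices and has_keywords as a contiguous-sequence match over the recognized word string.

```python
# Illustrative sketch only, not PKT's implementation: content features computed
# from the recognized word string of a response.

def word_errors(hypothesis, answer_choices):
    """Minimum number of substitutions, deletions, and insertions needed to match
    the recognized response against the closest answer choice (word level)."""
    def edit_distance(a, b):
        # Standard dynamic-programming Levenshtein distance over word tokens.
        dp = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, wb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (wa != wb))  # substitution
        return dp[-1]
    return min(edit_distance(hypothesis, choice) for choice in answer_choices)

def has_keywords(hypothesis, keyword_sequence):
    """1.0 if the correct word/syllable sequence occurs contiguously in the response."""
    n = len(keyword_sequence)
    return float(any(hypothesis[i:i + n] == keyword_sequence
                     for i in range(len(hypothesis) - n + 1)))

# Hypothetical Repeat item whose expected answer is "the dog runs fast"
hyp = "the dog run fast".split()
print(word_errors(hyp, ["the dog runs fast".split()]))  # -> 1
print(has_keywords(hyp, "the dog".split()))             # -> 1.0
```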

  9. Machine scoring methods - Duration modeling • These features capture whether test-takers produce appropriate durations for different phonemes. • The duration statistics models were built from native data from an unrelated test, the Versant Junior English Test. • The statistics of the phoneme durations of native responses were stored as non-parametric cumulative density functions (CDFs). • Duration statistics from native speakers were used to compute the log likelihood of the durations of phonemes produced by candidates (see the sketch below). • If enough samples existed for a phoneme in a specific word, we built a unique duration model for that phoneme in context. • Separate statistics were kept for all phones vs. pauses.
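The sketch below illustrates one plausible way, not necessarily PKT's exact formulation, to turn native phoneme-duration statistics stored as empirical CDFs into a per-response log-likelihood feature. The class, its methods, and the two-tailed probability approximation are assumptions for illustration; the native durations in the example are made up.

```python
# Illustrative sketch only: a non-parametric duration model per phoneme,
# built from native speech and used to score candidate phone durations.
import bisect
import math
from collections import defaultdict

class DurationModel:
    def __init__(self):
        self.native_durations = defaultdict(list)  # phoneme -> sorted durations (seconds)

    def add_native(self, phoneme, duration):
        bisect.insort(self.native_durations[phoneme], duration)

    def cdf(self, phoneme, duration):
        """Empirical cumulative density F(duration) for this phoneme."""
        durs = self.native_durations[phoneme]
        return bisect.bisect_right(durs, duration) / len(durs)

    def log_likelihood(self, segments, floor=1e-4):
        """Average log probability of candidate phone durations under the native CDFs.
        `segments` is a list of (phoneme, duration) pairs from the recognition alignment."""
        total, count = 0.0, 0
        for phoneme, duration in segments:
            if phoneme not in self.native_durations:
                continue
            f = self.cdf(phoneme, duration)
            # Two-tailed probability: durations near the native median score highest.
            p = max(2.0 * min(f, 1.0 - f), floor)
            total += math.log(p)
            count += 1
        return total / max(count, 1)

model = DurationModel()
for d in [0.06, 0.07, 0.08, 0.09, 0.10]:   # made-up native /S/ durations
    model.add_native("S", d)
print(model.log_likelihood([("S", 0.08), ("S", 0.30)]))  # the 0.30 s /S/ is penalized
```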

  10. Machine scoring methods - Spectral modeling • To capture manner beyond duration, we computed a few spectral likelihood features from native and learner segment models applied to the recognition alignment of segmental units. • We force-aligned the utterance against the word string of the recognized sentence using the native monophone acoustic model. • For every phoneme, constrained by the time boundaries from that alignment, we ran an all-phone recognition with the same native monophone acoustic model. • Different features were obtained by focusing on different phonemes of interest. • ppm: the percentage of phonemes from the all-phone recognition that match the phonemes from the forced alignment (see the sketch below).
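The ppm feature can be illustrated with a short sketch (hypothetical function and data; it assumes the forced alignment and the all-phone recognition produce phone labels over the same segment boundaries):

```python
# Illustrative sketch only: a ppm-style agreement feature between a forced alignment
# and a boundary-constrained all-phone recognition of the same utterance.

def ppm(forced_alignment, allphone_recognition):
    """Percentage of segments where the all-phone recognizer agrees with the
    forced alignment on the phoneme identity."""
    assert len(forced_alignment) == len(allphone_recognition)
    matches = sum(ref_phone == hyp_phone
                  for (ref_phone, _, _), (hyp_phone, _, _)
                  in zip(forced_alignment, allphone_recognition))
    return 100.0 * matches / len(forced_alignment)

# (phoneme, start, end) tuples for "the dog", with one disagreement on /D/
forced = [("DH", 0.00, 0.08), ("AH", 0.08, 0.15), ("D", 0.15, 0.22), ("AO", 0.22, 0.35), ("G", 0.35, 0.42)]
free   = [("DH", 0.00, 0.08), ("AH", 0.08, 0.15), ("T", 0.15, 0.22), ("AO", 0.22, 0.35), ("G", 0.35, 0.42)]
print(ppm(forced, free))  # -> 80.0
```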

  11. Machine scoring methods - Confidence modeling • After speech recognition, we can assign confidence scores to words and phonemes. For every response, we then compute the average confidence and the percentage of words or phonemes whose confidence falls below a threshold, and use these as features to predict test-takers' performance (see the sketch below).
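A minimal sketch of these two confidence features, assuming per-word confidences in [0, 1] are already available from the recognizer (the threshold value is a placeholder):

```python
# Illustrative sketch only: response-level features from per-word recognizer confidences.

def confidence_features(word_confidences, threshold=0.5):
    """Return (average confidence, fraction of words below the threshold)."""
    if not word_confidences:
        return 0.0, 1.0  # treat an empty recognition result as maximally suspect
    avg = sum(word_confidences) / len(word_confidences)
    low = sum(c < threshold for c in word_confidences) / len(word_confidences)
    return avg, low

print(confidence_features([0.9, 0.8, 0.3, 0.95]))  # -> (0.7375, 0.25)
```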

  12. Machine scoring methods - Final models • The features word_vector, has_keywords, word_errors, and percent_correct effectively define content scores based on what is spoken. • The features log_seg_prob, iw_log_seg_prob, spectral_1, and spectral_2 effectively capture the rhythmic and segmental aspects of the performance as native likelihoods of producing the observed base physical measures. • By combining these features, we can effectively predict human holistic scores. • PKT tried both simple multiple linear regression models and neural network models and selected the best model; in most cases, the neural network models performed better (see the sketch below).
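The sketch below shows the general recipe rather than PKT's actual models or data: fit both a multiple linear regression and a small neural network on per-response feature vectors and keep whichever predicts the human holistic scores better. The scikit-learn estimators stand in for whatever toolkit was actually used, and the feature matrix here is random placeholder data.

```python
# Illustrative sketch only: comparing a linear regression and a small neural network
# as final scoring models over placeholder feature vectors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Columns: e.g. word_vector, has_keywords, word_errors, log_seg_prob, spectral_1, confidence
X = np.random.rand(1200, 6)               # stand-in for development-set features
y = np.random.uniform(0, 4, size=1200)    # stand-in for averaged human holistic scores

models = {
    "linear": LinearRegression(),
    "neural": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

# Keep whichever model generalizes better under cross-validation on the development set.
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(name, round(score, 3))
```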

  13. Experimental results • Distribution of the average human holistic scores of participants in the validation set for Stage 5 (Grades 9-12). • All results presented here are validation results computed on the validation sets; the models knew nothing about the validation sets.

  14. Experimental results – Stage I

  15. Experimental results – Stage II, III, IV, V (item level)

  16. Experimental results – Stage II, III, IV, V (participant level)

  17. Experimental results:

  18. Experimental results – Test reliability by stage

  19. Unscorable test detection • There were several outliers where the machine scores were significantly lower than the human scores. The main reason is low Signal-to-Noise Ratio (SNR): either the background noise was high, or the speech was low in volume (low-volume recordings made by shy kids). Those cases are hard for ASR. The solution is to filter these tests out and pass them to human grading. • We identified features to deal with low-volume tests (maximum energy, the number of frames with a fundamental frequency, etc.), plus many features mentioned in Cheng and Shen (2011), to build an unscorable test detector (a sketch follows below). • More details in the poster this afternoon: Angeliki Metallinou, Jian Cheng, "Syllable and language model based features for detecting non-scorable tests in spoken language proficiency assessment applications"
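Two of the signal-level features mentioned above can be sketched as follows. This is illustrative only: the frame size and energy thresholds are arbitrary placeholders, and a real voiced-frame count would use a pitch tracker rather than an energy floor.

```python
# Illustrative sketch only: simple signal-level features for flagging low-volume or
# low-SNR responses that should go to human graders instead of machine scoring.
import numpy as np

def unscorable_features(waveform, sample_rate=8000, frame_ms=25):
    """Return (maximum frame energy in dB, fraction of frames above a fixed energy floor)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    frames = waveform[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    max_energy = float(energy_db.max())
    active_fraction = float(np.mean(energy_db > -40.0))  # crude stand-in for a voiced-frame count
    return max_energy, active_fraction

# A nearly silent recording (a very soft speaker) scores low on both features,
# so a downstream detector can route the test to human grading.
quiet = 0.001 * np.random.randn(8000 * 5)   # 5 seconds of near-silence at 8 kHz
print(unscorable_features(quiet))
```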

  20. Future work • We may train a better native acoustic model using more native data from the AZELLA project once we have demographic information for the test-takers. • We may catch soft or noisy recordings automatically and exclude them from machine grading. • For Repeat items, we used a simple average as the final score; a partial-credit Rasch model may improve performance. • The current items on the forms did not go through a post-screening process; if we select only the items with the best predictive power for the test forms, the correlations could be improved. • Some kids speak very softly; this problem should be addressed. • Apply deep neural network (DNN) acoustic models instead of traditional GMM-HMMs to achieve better performance.

  21. Angeliki Metallinou, Jian Cheng, "Using Deep Neural Networks to Improve Proficiency Assessment for Children English Language Learners", to appear in Interspeech, September 2014, Singapore. Targeting AZELLA Stage II data: • Experimental results show that the DNN-based recognition approach achieved a 31% relative WER reduction compared to GMM-HMMs. • The averaged item-type-level correlation increased from 0.772 (the result in this paper) to 0.795 (new GMM-HMMs) to 0.826 (DNN-HMMs), a 0.054 absolute improvement.

  22. Post Validation Studies • After this study, we went through several post-validation studies; our customer (the Arizona Department of Education) is happy with the results.

  23. Conclusions • We considered both what the student says and the way in which the student speaks to generate the final holistic scores. • We provided validity evidence for machine-generated scores; the average human-machine correlation was 0.92. • The assessments include 10 open-ended item types. For 9 of the 10 open item types, machine scoring performed at a level similar to human scoring at the item-type level. • We described the design, implementation, and evaluation of a detector to catch problematic, unscorable tests. • Automatic assessment of the speech of young English learners works. It works well.
