This study explores automated speech scoring systems for language fluency assessment, focusing on SpeechRater and an alternative method called autorater. The text discusses the challenges of measuring fluency and the criteria for effective assessment. The SpeechRater architecture, performance metrics, and the autorater approach are detailed, highlighting the correlation between human-rated scores and automated fluency measures. Results from experiments demonstrate the effectiveness of low-level acoustic measurements in assessing fluency and the potential application of logistic regression models in automated scoring of spontaneous speech. This innovative approach presents a valuable alternative for fluency assessment, particularly in resource-scarce testing environments.
Automatic Fluency Assessment Suma Bhat
The Problem • Language fluency • Component of oral proficiency • Indicative of effort of speech production • Indicates effectiveness of speech • Language proficiency testing • Automated methods of language assessment • Fundamental importance • Automatic assessment of language fluency
Why is it hard? • Fluency is a subjective quantity • Measurement of fluency requires • Choice of right quantifiers • Means of measuring the quantifiers • Automatic scores should • Correlate well with human assessment • Be interpretable
Automatic Speech Scoring • Automatic scoring of predictable speech • factual information in short answers (Leacock & Chodorow, 2003) • read speech • PhonePass (Bernstein, 1999) • Automatic scoring of unpredictable speech • spontaneous speech • SpeechRater (Zechner, 2009)
State of the art • SpeechRater from Educational Testing Service (2008, 2009) • Uses ASR for automatic assessment of English speaking proficiency • In use as online practice test for TOEFL internet-based test (iBT) takers since 2006
Proficiency assessment in SpeechRater • Test aspects of language competence • Delivery (fluency, pronunciation) • Language use (vocabulary and grammar) • Topical development (content, coherence and organization) • Current system • Scores fluency and language use • Overall proficiency score • Combination of measures of fluency and language use • Multiple Regression and CART scoring module
System • Speech recognizer • Trained on 40 hours of non-native speech • Evaluation set 10 hours of non-native speech • Word accuracy 50% • Feature set • Fluency features • Mean silence duration, Articulation rate • Vocabulary • Word types per second • Pronunciation • Global acoustic model score • Grammar • Global language model score
Performance • Measured in Human-Computer score correlation • Multiple Regression based scoring 0.57 • CART based scoring 0.57 • Compared with inter-human agreement 0.74
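The human-computer agreement figures above are correlations between two sets of scores for the same responses. The slides do not say which correlation coefficient was used; the sketch below assumes Pearson's r, and the function name `pearson` and the sample scores are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for four responses (not data from the study).
human_scores = [1, 2, 3, 4]
machine_scores = [1.2, 1.9, 3.2, 3.8]
r = pearson(human_scores, machine_scores)
```

The same function can be applied to two human raters' scores to obtain an inter-human figure for comparison.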
Requirements • Language-specific resources • Superior quality audio recordings for ASR training • Tens of hours of language-specific speech • Tens of hours of transcription
Is this the end? • What if language-specific resources are scarce? • Superior quality audio recordings for ASR training • Hours of language-specific speech • Hours of transcription • Tested language is a minority language • ASR performance affected • Alternative methods sought
Alternative method • Our approach (autorater) • Makes signal-level measurements to obtain quantifiers of fluency • Constructs classifier based on 20-second segments of speech • Requires no transcription
Autorater • Speech signal → Preprocessor → Feature Extractor → Classifier (Scorer) → Fluency score
Measurements • Convert stereo to mono • Downsample to 16kbps • Extract pitch and intensity information • Segment signal into speech and silence • Feature extraction • Tools: sox (stereo-to-mono conversion, downsampling) and Praat (feature extraction)
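The speech/silence segmentation step can be illustrated with a simple intensity threshold. This is a minimal sketch, not the Praat procedure used in the study: the function name `segment`, the 10 ms frame step, the 50 dB threshold, and the 0.2 s minimum pause length are all assumptions chosen for illustration.

```python
def segment(intensities_db, frame_s=0.01, threshold_db=50.0, min_silence_s=0.2):
    """Split per-frame intensities into (label, start_s, end_s) segments.

    label is 'speech' or 'silence'. All default values are hypothetical.
    """
    # Label each frame by comparing its intensity to the threshold.
    labels = ["speech" if v >= threshold_db else "silence" for v in intensities_db]

    # Merge consecutive frames with the same label into runs of frame indices.
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = i + 1
        else:
            runs.append([label, i, i + 1])

    # Silences shorter than min_silence_s are treated as part of speech.
    min_frames = int(round(min_silence_s / frame_s))
    merged = []
    for label, start, end in runs:
        if label == "silence" and (end - start) < min_frames:
            label = "speech"
        if merged and merged[-1][0] == label:
            merged[-1][2] = end
        else:
            merged.append([label, start, end])

    # Convert frame indices to seconds.
    return [(label, start * frame_s, end * frame_s) for label, start, end in merged]


# Synthetic intensity track: speech, a 50 ms dip, speech, a 300 ms pause, speech.
frames = [60.0] * 50 + [30.0] * 5 + [60.0] * 45 + [30.0] * 30 + [60.0] * 20
segments = segment(frames)
```

In this synthetic example the 50 ms dip is absorbed into the surrounding speech, while the 300 ms pause is kept as a silence segment.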
Feature Extractor • dur1 = duration of speech without silent pauses • dur2 = total duration of speech
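The two durations above are enough to sketch how the quantifiers fed to the classifier (PTR, ROS, MLS) might be computed. The slides do not spell out the formulas, so the definitions below are assumptions based on common usage: PTR as phonation time ratio (dur1 / dur2), ROS as rate of speech (syllables per second of dur2), and MLS as mean length of silence. The function name, the segment format, and the externally supplied syllable count are hypothetical.

```python
def fluency_features(segments, n_syllables):
    """Compute PTR, ROS and MLS from labeled segments.

    segments: list of (label, start_s, end_s), label in {'speech', 'silence'}.
    n_syllables: syllable count for the utterance (assumed to come from a
    separate syllable-nucleus detector, not shown here).
    """
    speech = [end - start for label, start, end in segments if label == "speech"]
    silence = [end - start for label, start, end in segments if label == "silence"]

    dur1 = sum(speech)            # duration of speech without silent pauses
    dur2 = dur1 + sum(silence)    # total duration of speech
    if dur2 <= 0:
        raise ValueError("utterance has zero duration")

    return {
        "PTR": dur1 / dur2,                                     # assumed: phonation time ratio
        "ROS": n_syllables / dur2,                              # assumed: syllables per second
        "MLS": sum(silence) / len(silence) if silence else 0.0,  # assumed: mean silence length
    }


# A made-up 20-second segment with two 2-second pauses and 60 syllables.
example_segments = [
    ("speech", 0.0, 8.0),
    ("silence", 8.0, 10.0),
    ("speech", 10.0, 18.0),
    ("silence", 18.0, 20.0),
]
features = fluency_features(example_segments, n_syllables=60)
```

For this example, dur1 is 16 s and dur2 is 20 s, so PTR is 0.8, ROS is 3 syllables per second, and MLS is 2 s.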
Classifier • Logistic regression model • Target scores: Human-rated scores • Variables: Measurements of the quantifiers • PTR, ROS, MLS • Observed scores: Real values between 0 and 1 (inclusive)
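A logistic regression scorer of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the study's implementation: the training procedure (batch gradient descent on cross-entropy), the learning rate, the epoch count, and the toy data are all hypothetical. Targets are real values in [0, 1], matching the observed-score range above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features):
    """Predicted fluency score in (0, 1) for one feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on cross-entropy loss.

    X: list of feature vectors, e.g. [PTR, ROS, MLS]; y: targets in [0, 1].
    """
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (made up): [PTR, ROS, MLS] with human scores rescaled to [0, 1].
X = [[0.9, 4.0, 0.3], [0.8, 3.5, 0.4], [0.5, 2.0, 0.9], [0.4, 1.5, 1.2]]
y = [0.9, 0.8, 0.3, 0.2]
weights, bias = fit(X, y)
score_fluent = predict(weights, bias, X[0])
score_disfluent = predict(weights, bias, X[3])
```

Because the output passes through the logistic function, predicted scores always fall strictly between 0 and 1 and can be mapped back to the human rating scale.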
Experiments • 3 configurations of the classifier • Rater-independent model: • Most general form • Rated utterances are considered independent of the raters • Does not take into account individual rater bias • Rater-biased model: • Additional binary features, one per rater • Each indicates individual rater bias • Rater-tuned model: • One model per rater
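The rater-biased configuration can be illustrated by appending one binary indicator per rater to the acoustic feature vector. This is a sketch of one plausible encoding; the function name, rater identifiers, and feature values are hypothetical.

```python
def rater_biased_features(features, rater_id, rater_ids):
    """Append one binary indicator per rater to the acoustic features.

    features: acoustic measurements, e.g. [PTR, ROS, MLS].
    rater_id: the rater who scored this utterance.
    rater_ids: fixed, ordered list of all raters (hypothetical identifiers).
    """
    if rater_id not in rater_ids:
        raise ValueError(f"unknown rater: {rater_id!r}")
    indicators = [1.0 if r == rater_id else 0.0 for r in rater_ids]
    return list(features) + indicators


raters = ["r1", "r2", "r3"]
row = rater_biased_features([0.8, 3.0, 0.4], "r2", raters)
```

With this encoding, the weight learned for each indicator acts as a per-rater offset, whereas the rater-tuned configuration would instead fit a separate model on each rater's utterances.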
Results – Pilot rating (Trained) • Rater-tuned: • Best model 0.197 • Worst model 0.5 • Inter-rater agreement 48.2%
Summary • Quantifiers obtained from low-level acoustic measurements are good indicators of fluency • Logistic regression models are appropriate for automated scoring of spontaneous speech • Main contribution • Alternative method of automatic fluency assessment • Useful in resource-scarce testing • Main result: • Rater-biased logistic regression model for scoring fluency