Using the HTK speech recogniser to analyse prosody in a corpus of German spoken learners’ English
Toshifumi Oba, Eric Atwell
University of Leeds, School of Computing
tosh@comp.leeds.ac.uk eric@comp.leeds.ac.uk
Outline
• Introduction
• Intonation and Speech Recognition
• Tendency of Speech Recognition Research
• ISLE Speech Corpus
• HTK Hidden Markov Model Toolkit
• Prosodic Annotation
• Human Evaluation of Intonation Abilities
• Grouping of German Speakers by Intonation Ability
• HTK speech recognition experiments
• Conclusions
• Q & A
Intonation and Speech Recognition
• Intonation is important in human communication.
  • It conveys the meaning and attitude of the speaker.
• Intonation is important for speech recognition.
  • Acoustic models (duration, F0, intensity)
  • Language models (identify the dialogue type)
Tendency of Speech Recognition Research
• Intonation << Pronunciation
• Non-native speaker << Native speaker
→ Speech recognition research on non-native speakers’ intonation is therefore unusual.
Also,
• In CALL (computer-assisted language learning), intonation receives less attention than pronunciation.
Objectives
• Analysis of non-native speakers’ English intonation.
  • Is HTK able to distinguish intonation ability?
  • Is it possible to train distinct models for different intonation ability groups?
• Prosodic annotation of written English text to produce ‘model’ intonation patterns.
• Human evaluation to group German speakers by English intonation ability.
ISLE Speech Corpus (1)
• Re-use of the speech corpus collected in the ISLE (Interactive Spoken Language Education) project.
  • Leeds University, Universität Hamburg, Università di Milano-Bicocca, Entropic Ltd., Ernst Klett Verlag GmbH, and Dida*El S.R.L.
• Time-aligned audio recordings of English spoken by 23 German and 23 Italian learners, plus 2 native English speakers.
ISLE Speech Corpus (2)
• Speaker adaptation
  • 82 sentences edited from ‘The Ascent of Everest’, e.g. ‘It is in fact a story of many years, in which men tried to climb that mountain.’
• Typical EFL exercises
  • Minimal pairs and polysyllabic words, e.g. ‘I said bad not bed.’ ‘He's a photographer.’
ISLE Speech Corpus (3)
• Annotated corpus
  • Pronunciation errors at word and phone levels
  • Stress errors at word level
• In our research, prosodic annotation was added to a written transcription of the speech corpus.
HTK Hidden Markov Model Toolkit
• Developed at Cambridge University Engineering Department (CUED).
• Free toolkit for building Hidden Markov Models (HMMs).
• Module calls: available from both the command line and script files.
• Used in speech recognition research and other pattern recognition research, e.g.
  • Handwriting recognition
  • Facial recognition
Prosodic Annotation
• Purpose: Predict ‘model’ intonation patterns to be compared against German learners’ spoken English.
• Instructions: ‘From text structure to prosodic structure’ (Knowles, 1996)
• Environment: Microsoft Excel (Windows)
• Amount: First 27 sentences from ‘The Ascent of Everest’
Result of Prosodic Annotation (1)
• The 27 sentences, consisting of 429 words, were divided into 84 tone groups (prosodic ‘phrases’).
→ 1 ‘low rise’, 3 ‘high rise’, 52 ‘fall-rise’ and 28 ‘fall’ patterns.
• The first 10 sentences were modified according to native speakers’ recordings.
→ 15 ‘fall-rise’ and 10 ‘fall’ patterns
• 1 ‘low rise’, 2 ‘high rise’ and 4 ‘fall-rise’ were deleted.
Result of Prosodic Annotation (2)
(A_01) This is the story <HR> of how two men <FR> reached the top of Everest <FR> on the twenty-ninth of May nineteen fifty-three <FR> and came back safely <HR> to their friends below <F>.
(A_02) Yet this will not be the whole story <F>.
(A_03) The ascent of Everest <FR> was not the work of one day <FR>, nor even of those few unforgettable weeks <FR> in which we prepared and climbed that summer <F>.
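The annotated sentences above can be split into tone groups and tallied mechanically. The following is a minimal Python sketch, not the authors' tooling (the slides only mention Excel). It assumes that each tone group ends at an angle-bracket tag and that `<HR>`, `<FR>` and `<F>` stand for high rise, fall-rise and fall; the function names are hypothetical.

```python
import re
from collections import Counter

# Tags as they appear in the annotated example, e.g. <F>, <FR>, <HR>.
TONE_TAG = re.compile(r"<([A-Z]+)>")


def split_tone_groups(annotated):
    """Return (text, tag) pairs, one per tone group, in order of appearance."""
    groups = []
    start = 0
    for match in TONE_TAG.finditer(annotated):
        # Text between the previous tag and this one, minus stray punctuation.
        text = annotated[start:match.start()].strip(" ,.")
        groups.append((text, match.group(1)))
        start = match.end()
    return groups


def count_tones(annotated):
    """Count how often each tone tag occurs in an annotated sentence."""
    return Counter(tag for _, tag in split_tone_groups(annotated))


A_01 = (
    "This is the story <HR> of how two men <FR> reached the top of Everest <FR> "
    "on the twenty-ninth of May nineteen fifty-three <FR> "
    "and came back safely <HR> to their friends below <F>."
)

if __name__ == "__main__":
    for text, tag in split_tone_groups(A_01):
        print(f"{tag:>2}  {text}")
    print(count_tones(A_01))
```

Summing such counts over all annotated sentences would give totals of the kind reported on the previous slide.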
Human Evaluation of German Learners’ English Intonation Abilities
• Purpose: Group German speakers into ‘good’ and ‘poor’ intonation groups.
• Evaluator I: Computational linguistics researcher
• Evaluator II: English language teaching researcher
• Quantity: First 10 utterances from each speaker.
• If all the tone types of an utterance matched the model pattern, the utterance was judged correct; otherwise incorrect.
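The all-or-nothing judgement rule on this slide can be written down in a few lines. This is only an illustrative Python sketch: the evaluators judged by ear, and the tone sequences, function names and data layout below are hypothetical. The slides do not state the threshold used to assign a speaker to the ‘good’ or ‘poor’ group, so none is assumed here.

```python
def utterance_is_correct(model_tones, observed_tones):
    """An utterance counts as correct only if every tone type matches the model."""
    return list(model_tones) == list(observed_tones)


def count_correct_utterances(model, observed):
    """Count correct utterances for one speaker.

    model and observed map an utterance id to its sequence of tone tags.
    An utterance missing from observed is treated as incorrect (an assumption).
    """
    return sum(
        utterance_is_correct(tones, observed.get(utt_id, ()))
        for utt_id, tones in model.items()
    )


if __name__ == "__main__":
    # Hypothetical data: model patterns and one evaluator's transcription.
    model = {"A_02": ["F"], "A_03": ["FR", "FR", "FR", "F"]}
    speaker = {"A_02": ["F"], "A_03": ["FR", "F", "FR", "F"]}
    print(count_correct_utterances(model, speaker))
```

Because a single mismatched tone makes the whole utterance incorrect, longer utterances with many tone groups are harder to get right under this rule.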
Grouping of 23 German Speakers
• Grouping I: based on Evaluator I (computational linguistics researcher)
• Grouping II: based on Evaluator II (English language teaching (ELT) researcher)
• Grouping III: agreement of Evaluators I and II
23 speakers: 3 exceptionally poor pronunciation speakers, 8 ‘good’, 4 ‘intermediate’ and 8 ‘poor’ intonation speakers
Result of Human Evaluation and Grouping
• The two evaluators agreed on about 63% of utterances (144 out of 230).
• Evaluator II marked 109 errors, while Evaluator I marked 78.
However,
• 7 ‘poor’ and 5 ‘good’ speakers were the same in Grouping I and Grouping II.
→ 2 speakers were added to the ‘good’ intonation group in Grouping III.
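The agreement figure is a simple percentage of utterances on which both evaluators gave the same verdict. A minimal Python sketch follows, assuming each evaluator's judgements are available as a list of booleans in the same utterance order (a hypothetical layout; the function name is also hypothetical).

```python
def percent_agreement(judgements_a, judgements_b):
    """Return (number of agreements, percentage) for two equal-length lists."""
    if len(judgements_a) != len(judgements_b):
        raise ValueError("both evaluators must judge the same utterances")
    if not judgements_a:
        raise ValueError("no utterances to compare")
    agreed = sum(a == b for a, b in zip(judgements_a, judgements_b))
    return agreed, 100.0 * agreed / len(judgements_a)


if __name__ == "__main__":
    # Figures from the slide: 144 agreements out of 230 utterances.
    print(round(100.0 * 144 / 230, 1))  # 62.6, reported as "about 63%"
```

Raw percentage agreement does not correct for agreement expected by chance, so it is best read as a rough indicator only.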
Conditions of HTK Speech Recognition Experiments
• Monophone and triphone HMMs were trained.
• No language models were used.
• A Perl script and a configuration file were used for module calls.
• Number of training speakers: 6 speakers from the same intonation group.
• Number of test speakers: 2 speakers from each group (1 for Grouping III).
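The module calls were scripted in Perl; the sketch below shows, in Python, roughly what scripting such calls could look like. It only builds command lines for one `HERest` re-estimation pass and one `HResults` scoring call, following the option layout of the HTK Book tutorial, and does not run them. All file and directory names are hypothetical, the decoding step (`HVite`) is omitted, and nothing here is taken from the authors' actual script.

```python
import shlex

GROUPS = ("good", "poor")


def herest_command(config, labels_mlf, train_scp, src_dir, dst_dir, hmm_list):
    """Build one HERest embedded re-estimation pass (the command is not executed)."""
    return [
        "HERest", "-C", config, "-I", labels_mlf,
        "-t", "250.0", "150.0", "1000.0",  # pruning thresholds used in the HTK Book tutorial
        "-S", train_scp,
        "-H", f"{src_dir}/macros", "-H", f"{src_dir}/hmmdefs",
        "-M", dst_dir, hmm_list,
    ]


def hresults_command(reference_mlf, hmm_list, recognised_mlf):
    """Build the HResults call that scores recogniser output against a reference."""
    return ["HResults", "-I", reference_mlf, hmm_list, recognised_mlf]


def experiment_plan():
    """Pair every training group with every test group (matched and mismatched)."""
    plan = []
    for train_group in GROUPS:
        for test_group in GROUPS:
            plan.append({
                "train": train_group,
                "test": test_group,
                "matched": train_group == test_group,
                # Hypothetical file names, one set per intonation group.
                "herest": herest_command(
                    "config", f"phones_{train_group}.mlf", f"train_{train_group}.scp",
                    f"hmm_{train_group}/0", f"hmm_{train_group}/1", "monophones",
                ),
                "hresults": hresults_command(
                    f"ref_{test_group}.mlf", "monophones",
                    f"rec_{train_group}_on_{test_group}.mlf",
                ),
            })
    return plan


if __name__ == "__main__":
    for step in experiment_plan():
        print(shlex.join(step["herest"]))
        print(shlex.join(step["hresults"]))
```

With HTK installed, each list could be passed to `subprocess.run`; keeping the commands as lists avoids shell quoting problems.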
Results of HTK experiments
• Recognition accuracy was generally higher when the test and training speakers’ intonation abilities were the same.
• Improvement was higher with triphone HMMs.
• Improvement was most significant in Experiment II.
• One ‘poor’ intonation speaker showed negative improvement in all three experiments.
• Another ‘poor’ speaker also showed negative improvement in Experiment I.
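For reference, HTK's `HResults` tool reports two word-level scores: percent correct, (N − D − S) / N × 100, and accuracy, (N − D − S − I) / N × 100, where N is the number of reference words and D, S and I are deletions, substitutions and insertions. The Python sketch below computes both. Treating ‘improvement’ as the matched-condition score minus the mismatched-condition score is an assumption, since the slides do not define the term, and the counts in the example are made up.

```python
def percent_correct(n_words, deletions, substitutions):
    """HResults-style percent correct: insertions are ignored."""
    hits = n_words - deletions - substitutions
    return 100.0 * hits / n_words


def percent_accuracy(n_words, deletions, substitutions, insertions):
    """HResults-style accuracy: insertions are also penalised."""
    hits = n_words - deletions - substitutions
    return 100.0 * (hits - insertions) / n_words


def improvement(matched_score, mismatched_score):
    """Assumed definition: gain from matching training and test intonation groups."""
    return matched_score - mismatched_score


if __name__ == "__main__":
    # Hypothetical error counts for one test speaker under the two conditions.
    matched = percent_accuracy(100, deletions=5, substitutions=10, insertions=5)
    mismatched = percent_accuracy(100, deletions=10, substitutions=20, insertions=10)
    print(matched, mismatched, improvement(matched, mismatched))
```

Under this reading, a negative improvement means the speaker was recognised better by models trained on the other intonation group.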
Average Recognition Accuracies of Good Intonation Speakers (parentheses show results against monophone HMMs)
Average Recognition Accuracies of Poor Intonation Speakers (parentheses show results against monophone HMMs)
Prosodic Keywords
• Tone type is decided by the last accented syllable (Knowles, 1996).
→ We call the word containing the last accented syllable of each tone group the ‘prosodic keyword’.
→ Recognition accuracy among ‘prosodic keywords’ was counted for the triphone cases of Experiment II.
• Improvement in recognition accuracy was higher among prosodic keywords than overall.
  • Good test speakers: 26.00% (overall 19.20%)
  • Poor test speakers: 24.50% (overall 15.50%)
Irrelevance of Pronunciation Abilities
• Good intonation speakers tended to have slightly better pronunciation ability than poor intonation speakers, even though the 3 exceptionally poor pronunciation speakers had been excluded.
→ Additional experiments were run using the 2 ‘best’ pronunciation speakers from the poor intonation group and the 2 ‘worst’ pronunciation speakers from the good intonation group.
→ A similar improvement was observed in these experiments too.
Conclusions
• Matching the test and training speakers’ intonation abilities brought about higher recognition accuracy.
• HTK was able to distinguish ‘good’ and ‘poor’ intonation.
• Confirmed that German speakers’ weakness in English intonation generally lay in ‘fall-rise’ patterns.
• Human evaluation was successful enough.
Future Work
• Expand the tone types (not only ‘fall-rise’ and ‘fall’ patterns).
• Apply the approach to other languages and to different native-speaker groups.
• Use the results in practical language-teaching systems.