Speech Technology for Language Learning

Yoon Kim, NeoSpeech, Inc.
IEEE SCV Signal Processing Society Meeting, February 5, 2004

About NeoSpeech
• Launched in August 2002
• Based in Fremont, CA
• Backed by Voiceware, Korea’s leading speech technology provider
• Products & Services
  • Core technology: TTS, ASR, Speaker Verification and Voice Animation
  • Applications: Computer-assisted language learning; Automatic Outbound Notification
• Over 36 multimedia/PC, telephony and reseller customers
• One of the fastest-growing speech technology providers

Outline
• Introduction
  • What is CALL?
  • Why is speech technology useful for CALL?
• Speech Technologies for CALL
  • Automatic Speech Recognition (ASR)
  • Text-to-Speech (TTS) Synthesis
• Demos
• Challenges and the Future

Introduction

What is CALL?
• CALL: Computer Aided Language Learning
  • General term for using all computing resources for the acquisition, training and evaluation of language skills in the following areas: Reading, Writing, Listening, Speaking
• Why is CALL useful?
  • Convenient, anytime access to language education
  • Self-paced tool that aids human language instruction
  • Can ease the fear learners feel when practicing a new language in human-to-human interactions
  • Human-computer interactions can intrigue the young generation of users who are familiar with computers

Why is Speech Technology Important in CALL?
• Speech is perhaps the most effective means of communication between humans
• Listening and speaking involve processing of speech from acoustic/phonetic and linguistic perspectives
• Computers are multi-modal in nature
• Speech technology enables systems to use these different modalities (speech, visual/haptic) for CALL
• Results in a more complete interaction for students, increasing learning efficacy and user satisfaction

Speech Technologies used in CALL Systems
• Speech Input: Speech Recognition
  • ASR for grammar-based verbal interaction
  • Pronunciation Scoring
  • Detection/Feedback of Mispronunciation
• Speech Output: Text-to-Speech
  • Listening and verification of dynamic content

Automatic Speech Recognition for CALL

Automatic Speech Recognition (ASR) and Understanding
• Automatic Speech Recognition
  • Process of decoding the raw speech waveform and extracting linguistic information for human-machine communication
• Speech Understanding
  • Process of comprehending communicative intent in addition to the linguistic decoding of the raw acoustic speech signal

Speech Recognition Process
Objective: Given a sequence of acoustic feature vectors X extracted from the input speech, find the most likely word string W that could have been uttered.
Pipeline: Input Speech (“Call George Bush at home”) → Acoustic Front-end (Cepstral Analysis) → Search, using Acoustic Models P(X|W) and Language Model P(W) → Recognized Utterance
• The input speech signal is converted to a sequence of feature vectors X, based on a cepstral, time-quefrency analysis.
• Acoustic models P(X|W) represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
• The language model P(W) predicts the next set of words, and controls which models are hypothesized.
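
The search step is commonly described as picking the word string W that maximizes P(X|W) · P(W), usually in the log domain. The following is only a toy sketch of that idea, assuming a tiny hand-written list of hypotheses with invented log scores; the `candidates` table, the `decode` function and the `lm_weight` scaling parameter are hypothetical and not taken from the talk. A real recognizer searches a far larger hypothesis space.

```python
# Hypothetical scores for one utterance: log P(X|W) from the acoustic models
# and log P(W) from the language model. All numbers are invented.
candidates = {
    "call george bush at home": {"acoustic": -120.0, "language": -9.0},
    "call george bush at rome": {"acoustic": -119.5, "language": -14.0},
    "tall george bush at home": {"acoustic": -121.0, "language": -16.5},
}

def decode(hypotheses, lm_weight=1.0):
    """Return the word string W maximizing log P(X|W) + lm_weight * log P(W)."""
    def total(item):
        _, scores = item
        return scores["acoustic"] + lm_weight * scores["language"]
    best_words, _ = max(hypotheses.items(), key=total)
    return best_words

print(decode(candidates))                 # acoustic and language scores combined
print(decode(candidates, lm_weight=0.0))  # acoustic score only
```

With the language model switched off (`lm_weight=0.0`), the acoustically closest but less plausible string wins, which illustrates why P(W) matters in the search.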

ASR-CALL Applications

ASR for Verbal Interaction
• Use continuous grammar to handle words and phrases
• Interaction-specific, dynamic grammar
• Applications: Interactive lessons with voice input using ASR as an option
  • Simple multiple-choice questions
  • Fill-in-the-blank questions
  • Word unscrambling drills

Pronunciation Scoring
• Scoring performed by analyzing the following cues from non-native and native acoustic models
  • Statistical match
  • Duration
  • Prosody
  • Rate of Speech
• Grammar is singular and well defined
• Scoring can be done at the following levels
  • Specific phone segments
  • Words/Phrases
  • Sentences
  • Overall student proficiency
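
One way to picture the "statistical match" cue is as a log-likelihood ratio between native and non-native acoustic models, computed per phone and then averaged up to the word level. The sketch below assumes that formulation; it is not necessarily the scoring used in the products discussed here. The field names `ll_native` and `ll_nonnative`, the functions, and all numbers are hypothetical, and the duration, prosody and rate-of-speech cues are left out.

```python
from statistics import mean

# Hypothetical per-phone measurements for one word. "ll_native" and
# "ll_nonnative" stand for average per-frame log-likelihoods under native
# and non-native acoustic models. All values are invented.
phones = [
    {"phone": "AE", "ll_native": -6.0, "ll_nonnative": -4.5},
    {"phone": "F", "ll_native": -3.0, "ll_nonnative": -5.0},
    {"phone": "T", "ll_native": -3.5, "ll_nonnative": -5.0},
]

def phone_score(p):
    """Log-likelihood ratio: positive means a closer match to the native model."""
    return p["ll_native"] - p["ll_nonnative"]

def word_score(phone_list):
    """Word-level score as the mean of the phone-level scores."""
    return mean(phone_score(p) for p in phone_list)

# The lowest-scoring phone is a natural candidate for targeted feedback.
weakest = min(phones, key=phone_score)["phone"]
print(weakest, round(word_score(phones), 3))
```

The same averaging could be repeated over words to get sentence-level or overall proficiency scores.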

Mispronunciation Feedback
Example: pronunciation of the word “Afternoon”
  Native:  AE2 F T ERO N UW1 N
  Student: AE1 F T ELb N UW1 N
Phones /AE1/, /ELb/ are detected as mispronunciations → Student is given tips on how to pronounce /AE/ and /ER/ correctly.
• Detection
  • Similar to keyword spotting
  • Alternative pronunciation networks can be used
  • Detection hot list
• Correction
  • Segment-specific training
  • Confusable pair training (e.g. /r/ versus /l/ for Korean students)
  • Can provide feedback/tips on potential correction
• Applications
  • Reading tutor for children
  • Detection and correction of common pronunciation mistakes (depends on the source language of student)
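
Once a recognizer has produced the student's phone string, comparing it with the native reference is a sequence alignment problem. The sketch below assumes both phone strings are already available as text (the acoustic detection itself is not shown) and uses Python's standard `difflib` for the alignment; the function name `find_mismatches` is hypothetical. The phone labels are copied verbatim from the "Afternoon" example.

```python
from difflib import SequenceMatcher

native = "AE2 F T ERO N UW1 N".split()
student = "AE1 F T ELb N UW1 N".split()

def find_mismatches(reference, observed):
    """Align two phone sequences and return the (reference, observed) spans that differ."""
    matcher = SequenceMatcher(a=reference, b=observed, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((reference[i1:i2], observed[j1:j2]))
    return mismatches

for expected, produced in find_mismatches(native, student):
    print("expected", expected, "but heard", produced)
```

Each reported pair could then be looked up in a table of tips or confusable-pair drills for the student's source language.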

Text-to-Speech Synthesis for CALL

Definition of TTS Synthesis
• Text-To-Speech (TTS) Synthesis
  • Automatic production of an acoustic speech waveform from arbitrary text input
• Better than humans in some ways
  • Cheaper
  • Can be more intelligible
  • More flexible than recording
• Worse than humans in other ways
  • Ungraceful degradation for longer sentences
  • Mechanical timbre

Speech Synthesis Process
Input Text:
  My office was on St. Mary’s St. one block from the coffee shop.
→ Text Processing:
  My office was on Saint Mary’s Street, one block from the coffee shop.
→ Prosody Prediction:
  *My office |was on Saint *Mary’s Street || *one block | from the *coffee shop.
→ Phonetic Processing:
  m *ay ao1 f ax s|w ax z ao n s ey n t m *eh r iy z s t r iy t || w *ah n b l aa k | f r ax m dh ax k ao* f iy sh aa p
→ Waveform Generation:
  Synthesized Output
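
The text processing stage has to resolve ambiguous abbreviations such as "St.", which stands for "Saint" in one place and "Street" in another in the example sentence. The following is a toy heuristic for just that one abbreviation, assuming "St." means "Saint" when the next word is capitalized and "Street" otherwise. The function name `expand_st` and the rule itself are illustrative assumptions, not the method of any particular TTS engine, and the sketch does not insert the comma shown in the processed text.

```python
import re

def expand_st(text):
    """Expand 'St.' to 'Saint' before a capitalized word, otherwise to 'Street'."""
    def replace(match):
        # Look at the first non-space character after the abbreviation.
        following = text[match.end():].lstrip()
        return "Saint" if following[:1].isupper() else "Street"
    return re.sub(r"\bSt\.", replace, text)

sentence = "My office was on St. Mary's St. one block from the coffee shop."
print(expand_st(sentence))
```

Real systems combine many such rules, or statistical models, to cover numbers, dates, abbreviations and symbols.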

TTS-CALL Applications

TTS-Based Learning and Comprehension
• Large-corpus, concatenation-based TTS systems
• Fortified grapheme-to-phoneme rules
• Offers instant multimedia content generation for learning new words, phrases or sentences
• Any text content can be “read out” using a TTS voice
• Interactive, focused topics
• Easy accessibility

Demos and Conclusion

Demos
• NeoSpeech/Voiceware (www.neospeech.com)
  • Magic English Plus (TTS)
  • Cong Cong – Talking in English (ASR)
• BravoBrava! (www.bravobrava.com)
  • SpeaK! (ASR/TTS)

Challenges and the Future
• ASR: Accuracy and robustness of speech recognition for non-native speakers
  • Variety of source and target language configurations
  • Micro-level pronunciation feedback
  • Normalizing speaker characteristics (acoustic, linguistic) and channel/environment
  • Robust rejection schemes
• TTS: Accuracy and naturalness of TTS systems for advanced listening lessons
• Combination of CALL with Spoken Language Translation

Thank You!
yoon.kim@neospeech.com
www.neospeech.com
