
Speech Technology for Language Learning

Speech Technology for Language Learning. Yoon Kim, NeoSpeech, Inc. IEEE SCV Signal Processing Society Meeting, February 5, 2004. About NeoSpeech: launched August of 2002; based in Fremont, CA; backed by Voiceware, Korea’s leading speech technology provider.


Presentation Transcript


  1. Speech Technology for Language Learning Yoon Kim NeoSpeech, Inc. IEEE SCV Signal Processing Society Meeting February 5, 2004

  2. About NeoSpeech • Launched August of 2002 • Based in Fremont, CA • Backed by Voiceware, Korea’s leading speech technology provider • Products & Services • Core technology: TTS, ASR, Speaker Verification and Voice Animation • Applications: Computer assisted language learning; Automatic Outbound Notification • Over 36 Multimedia/PC, telephony and reseller customers • One of the fastest-growing speech technology providers

  3. Outline • Introduction • What is CALL? • Why is speech technology useful for CALL? • Speech Technologies for CALL • Automatic Speech Recognition (ASR) • Text-to-Speech (TTS) Synthesis • Demos • Challenges and the Future

  4. Introduction

  5. What is CALL? • CALL: Computer Aided Language Learning • General term for using all computing resources for the acquisition, training and evaluation of language skills in the following areas: Reading, Writing, Listening, Speaking • Why is CALL useful? • Convenient, anytime access to language education • Self-paced tool that aids human language instruction • Can alleviate the fear that comes with learning a new language through human-to-human interaction • Human-computer interaction can intrigue the young generation of users who are familiar with computers

  6. Why is Speech Technology Important in CALL? • Speech is perhaps the most effective means of communication between humans • Listening and speaking involve processing of speech from acoustic/phonetic and linguistic perspectives • Computers are multi-modal in nature • Speech technology enables systems to use these different modalities (speech, visual/haptic) for CALL • Results in a more complete interaction for students, increasing learning efficacy and user satisfaction

  7. Speech Technologies used in CALL Systems • Speech Input: Speech Recognition • ASR for grammar-based verbal interaction • Pronunciation Scoring • Detection/Feedback of Mispronunciation • Speech Output: Text-to-Speech • Listening and verification of dynamic content

  8. Automatic Speech Recognition for CALL

  9. Automatic Speech Recognition (ASR) and Understanding • Automatic Speech Recognition • Process of decoding the raw speech waveform and extracting linguistic information for human-machine communication • Speech Understanding • Process of comprehending communicative intent in addition to the linguistic decoding of the raw acoustic speech signal

  10. Speech Recognition Process • Objective: Given the sequence of acoustic feature vectors X extracted from the input speech, find the most likely word string W that could have been uttered • Pipeline: Input Speech (“Call George Bush at home”) → Acoustic Front-end (Cepstral Analysis) → Search, guided by the Acoustic Models P(X|W) and the Language Model P(W) → Recognized Utterance • The input speech signal is converted to a sequence of feature vectors X, based on a cepstral, time-quefrency analysis • Acoustic models P(X|W) represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure • The language model P(W) predicts the next set of words, and controls which models are hypothesized
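
The objective on slide 10 is the usual Bayes decision rule: choose the word string W that maximizes P(X|W) · P(W). A minimal Python sketch of that rule follows, assuming a tiny candidate list with made-up log scores. The function name `decode`, the `lm_weight` parameter, and all numbers are illustrative assumptions, not values from the talk; a real recognizer searches a much larger hypothesis space.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob, lm_weight=1.0):
    """Return the word string W maximizing log P(X|W) + lm_weight * log P(W)."""
    return max(
        candidates,
        key=lambda w: acoustic_logprob[w] + lm_weight * lm_logprob[w],
    )

# Hypothetical scores for one utterance X (illustrative values only).
candidates = ["call george bush at home", "call george bush at rome"]
acoustic_logprob = {candidates[0]: -120.0, candidates[1]: -119.5}
lm_logprob = {candidates[0]: math.log(1e-6), candidates[1]: math.log(1e-9)}

best = decode(candidates, acoustic_logprob, lm_logprob)
print(best)
```

In this toy setup the acoustic score slightly prefers the second candidate, and the language model prior is what tips the decision toward the first.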

  11. ASR-CALL Applications

  12. ASR for Verbal Interaction • Use continuous grammar to handle words and phrases • Interaction specific, dynamic grammar • Applications: Interactive lessons with voice input using ASR as an option • Simple multiple choice questions • Fill in the blank questions • Word unscrambling drills
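
An interaction-specific, dynamic grammar for a multiple choice question can be sketched as a table built from the answer choices at lesson time. The Python below is a hedged illustration only: `build_grammar`, `interpret`, and the carrier phrases are hypothetical names and values, and a string stands in for the recognizer output.

```python
def build_grammar(choices, carriers=("", "the answer is ", "i think it is ")):
    """Map every accepted phrase to the choice label it selects."""
    grammar = {}
    for label, text in choices.items():
        for carrier in carriers:
            # Accept both the answer text and its label, with or without a carrier phrase.
            grammar[(carrier + text).lower()] = label
            grammar[(carrier + label).lower()] = label
    return grammar

def interpret(recognized, grammar):
    """Return the selected choice label, or None for an out-of-grammar utterance."""
    return grammar.get(recognized.strip().lower())

# Grammar generated dynamically for one question (hypothetical content).
choices = {"a": "apple", "b": "banana", "c": "cherry"}
grammar = build_grammar(choices)
print(interpret("The answer is banana", grammar))
```

Because the grammar is rebuilt per question, the same two functions could serve fill-in-the-blank or word-unscrambling drills by changing only the `choices` table.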

  13. Pronunciation Scoring • Scoring performed by analyzing the following cues from non-native and native acoustic models • Statistical match • Duration • Prosody • Rate of Speech • Grammar is singular and well defined • Scoring can be done at the following levels • Specific phone segments • Words/Phrases • Sentences • Overall student proficiency
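
One way to picture how the cues on slide 13 might combine is a per-phone score that rewards a better statistical match to the native model than to the non-native model and penalizes atypical duration. The sketch below covers only those two cues (prosody and rate of speech are omitted), and the formula, field names, and weights are assumptions made for illustration, not the scoring method used in the talk.

```python
from dataclasses import dataclass

@dataclass
class PhoneSegment:
    phone: str
    native_loglik: float      # log-likelihood under a native acoustic model
    nonnative_loglik: float   # log-likelihood under a non-native acoustic model
    duration: float           # observed duration in seconds
    expected_duration: float  # typical native duration in seconds

def phone_score(seg, w_match=1.0, w_dur=2.0):
    """Higher is better: native-vs-non-native match minus a duration penalty."""
    # Normalize the log-likelihood difference by segment duration.
    match = (seg.native_loglik - seg.nonnative_loglik) / max(seg.duration, 1e-3)
    # Relative deviation from the expected native duration.
    dur_penalty = abs(seg.duration - seg.expected_duration) / seg.expected_duration
    return w_match * match - w_dur * dur_penalty

def utterance_score(segments):
    """Average phone-level scores to obtain a word, phrase, or sentence score."""
    return sum(phone_score(s) for s in segments) / len(segments)

# Hypothetical segments (illustrative values only).
good = PhoneSegment("AE", -10.0, -12.0, 0.1, 0.1)
bad = PhoneSegment("ER", -15.0, -11.0, 0.2, 0.1)
print(phone_score(good), phone_score(bad), utterance_score([good, bad]))
```

Averaging over larger groups of segments gives the coarser levels listed on the slide, from single phones up to overall proficiency.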

  14. Mispronunciation Feedback • Example: pronunciation of the word “Afternoon” • Native: AE2 F T ER0 N UW1 N • Student: AE1 F T ELb N UW1 N • Phones /AE1/ and /ELb/ are detected as mispronunciations → student is given tips on how to pronounce /AE/ and /ER/ correctly • Detection • Similar to keyword spotting • Alternative pronunciation networks can be used • Detection hot list • Correction • Segment specific training • Confusable pair training (e.g. /r/ versus /l/ for Korean students) • Can provide feedback/tips on potential correction • Applications • Reading tutor for children • Detection and correction of common pronunciation mistakes (depends on the source language of student)
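
The “Afternoon” example can be reproduced by aligning the native and student phone strings and reporting where they differ. This Python sketch uses the standard-library `difflib.SequenceMatcher` as a stand-in for the detection step; it assumes the phone strings are already available from a recognizer and that the fourth native phone is ER0 (ARPAbet ER with stress digit zero). The function name `find_mispronunciations` is hypothetical.

```python
from difflib import SequenceMatcher

def find_mispronunciations(native, student):
    """Return (native_phones, student_phones) pairs wherever the sequences differ."""
    matcher = SequenceMatcher(a=native, b=student, autojunk=False)
    return [
        (native[i1:i2], student[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

native = "AE2 F T ER0 N UW1 N".split()
student = "AE1 F T ELb N UW1 N".split()
errors = find_mispronunciations(native, student)
print(errors)
```

Each reported pair could then index a table of tips, for example /r/ versus /l/ drills when the target phone is ER.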

  15. Text-to-Speech Synthesis for CALL

  16. Definition of TTS Synthesis • Text-To-Speech (TTS) Synthesis • Automatic production of acoustic speech waveform from arbitrary text input • Better than humans in some ways • Cheaper • Can be more intelligible • More flexible than recording • Worse than humans in other ways • Ungraceful degradation for longer sentences • Mechanical timbre

  17. Speech Synthesis Process • Input Text: My office was on St. Mary’s St. one block from the coffee shop. • Text Processing: My office was on Saint Mary’s Street, one block from the coffee shop. • Prosody Prediction: *My office |was on Saint *Mary’s Street || *one block | from the *coffee shop. • Phonetic Processing: m *ay ao1 f ax s|w ax z ao n s ey n t m *eh r iy z s t r iy t || w *ah n b l aa k | f r ax m dh ax k ao* f iy sh aa p • Waveform Generation: Synthesized Output
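
The text processing stage on slide 17 has to decide whether “St.” means “Saint” or “Street”. A minimal Python sketch of one possible rule follows: read “St.” as “Saint” when a capitalized word comes next, and as “Street” otherwise. The rule and the function name `expand_st` are assumptions for illustration; a production TTS front end would use far richer context, and this sketch does not insert the comma shown on the slide.

```python
import re

def expand_st(text):
    """Expand 'St.' to 'Saint' before a capitalized word, otherwise to 'Street'."""
    # 'St.' followed by a capitalized word is treated as 'Saint'.
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # A sentence-final 'St.' becomes 'Street.' so the closing period is kept.
    text = re.sub(r"\bSt\.(?=\s*$)", "Street.", text)
    # Any remaining 'St.' is treated as 'Street'.
    text = re.sub(r"\bSt\.", "Street", text)
    return text

sentence = "My office was on St. Mary's St. one block from the coffee shop."
normalized = expand_st(sentence)
print(normalized)
```

The capitalization heuristic fails when “St.” ends a sentence and the next sentence starts with a capital letter, which is one reason real systems combine such rules with sentence-boundary detection.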

  18. TTS-CALL Applications

  19. TTS-Based Learning and Comprehension • Large-corpus, concatenative based TTS systems • Fortified grapheme-to-phoneme rules • Offers instant multimedia content generation for learning new words, phrases or sentences • Any text content can be “read out” using a TTS voice • Interactive, focused topics • Easy accessibility

  20. Demos and Conclusion

  21. Demos • NeoSpeech/Voiceware (www.neospeech.com) • Magic English Plus (TTS) • Cong Cong – Talking in English (ASR) • BravoBrava! (www.bravobrava.com) • SpeaK! (ASR/TTS)

  22. Challenges and the Future • ASR: Accuracy and robustness of speech recognition for non-native speakers • Variety of source and target language configurations • Micro-level pronunciation feedback • Normalizing speaker characteristics (acoustic, linguistic) and channel/environment • Robust rejection schemes • TTS: Accuracy and naturalness of TTS systems for advanced listening lessons • Combination of CALL with Spoken Language Translation

  23. Thank You! • yoon.kim@neospeech.com • www.neospeech.com
