
Development of the Embedded Speech Recognition Interface done for AIBO



Presentation Transcript


  1. Development of the Embedded Speech Recognition Interface done for AIBO ICSI Presentation January 2003

  2. Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan, Lex Olorenshaw, Hitoshi Honda, Helmut Lucke. Spoken Language Technology, SONY NSCA, 3300 Zanker Rd MS/SJ1B5, San Jose, CA. E-mail: xavier@slt.sel.sony.com

  3. ABSTRACT This presentation highlights three key techniques used in the embedded isolated-command recognition system developed for AIBO: • Robust broadband HMMs • Small context-dependent HMMs • An efficient, task-independent confidence measure

  4. Sony’s AIBO entertainment robot

  5. General AIBO ASR Overview and Features • Training data: clean speech, mixed with noises and artificially reverberated • Noise attenuation: NSS • Channel normalization: CMS, CMV, or dB equalization • End-point detection + feature extraction • ASR based on PLUs: engine based on Viterbi with beam search; lexicon of ~100 to 300 dictionary entries; triphone CHMMs with 3 states / 1 Gaussian per state • CM for speech verification • The AIBO dialogue manager also draws on other sensors (vision, touch) and on AIBO's state (activity, personality, mood)
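As an illustration of the channel-normalization step named on this slide, here is a minimal NumPy sketch of CMS (cepstral mean subtraction) and CMV (mean and variance normalization) over a matrix of per-frame cepstral features. The function names, the (frames x coefficients) layout, and the epsilon guard are assumptions for illustration; NSS and dB equalization are not shown.

```python
import numpy as np

def cms(features):
    """Cepstral mean subtraction: remove the per-utterance mean of each
    cepstral coefficient to cancel a stationary channel/filter effect.
    `features` has shape (num_frames, num_coefficients)."""
    return features - features.mean(axis=0)

def cmv(features):
    """Cepstral mean and variance normalization: CMS plus scaling each
    coefficient to unit variance over the utterance."""
    centered = features - features.mean(axis=0)
    return centered / (features.std(axis=0) + 1e-8)  # epsilon avoids /0
```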

  6. HMM Training Strategies TRAINING OBJECTIVES: • Obtain a recognizer that is robust in noisy, far-field conditions. We SIMULATE matched noisy conditions by: mixing clean speech with the expected noises at target SNRs, and artificially reverberating the training corpus using the frequency-response filters of expected far-field room environments (0.5 ~ 1.5 m) • Obtain an accurate recognizer in near-field, high-SNR conditions • The recognizer should run close to real time. A tradeoff is obtained by training on matched noisy conditions and clean speech conditions together: the “Broadband HMM”
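A rough sketch of the data-simulation step described on this slide: mixing clean speech with noise at a target SNR and convolving with a room impulse response. The function names, the noise-tiling policy, and the output truncation are my assumptions, not the original Sony training tooling.

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    """Mix clean speech with noise at a target SNR (in dB)."""
    # Tile or trim the noise to cover the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Choose a gain so that 10*log10(P_speech / P_noise) equals the target.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10.0)))
    return speech + gain * noise

def reverberate(speech, room_impulse_response):
    """Artificially reverberate an utterance with a measured far-field
    (0.5-1.5 m) room response, truncated back to the input length."""
    return np.convolve(speech, room_impulse_response)[: len(speech)]
```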

  7. Robust “Broadband” HMMs: Room_Response_1 * Speech + Noise_1 → HMM accumulators (noise + reverberation 1); … ; Room_Response_N * Speech + Noise_N → HMM accumulators (noise + reverberation N); Clean Speech → clean accumulators. Accumulators 1~N and the clean accumulators are combined into the final Broadband HMM
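The diagram pools per-condition Baum-Welch accumulators (one set per noise/reverberation condition, plus clean speech) into a single parameter update. A minimal sketch for one single-Gaussian HMM state with diagonal covariance; the accumulator dict layout is assumed for illustration, though any Baum-Welch implementation collects equivalent sufficient statistics.

```python
import numpy as np

def pool_accumulators(accs):
    """Pool Baum-Welch accumulators from several training conditions into
    one 'broadband' Gaussian update for a single HMM state.

    Each accumulator is a dict (assumed layout) with:
      'occ'   -- total state occupancy (scalar)
      'sum'   -- sum of occupancy-weighted feature vectors
      'sqsum' -- sum of occupancy-weighted squared features
    """
    occ = sum(a["occ"] for a in accs)
    s = sum(a["sum"] for a in accs)
    sq = sum(a["sqsum"] for a in accs)
    mean = s / occ
    var = sq / occ - mean ** 2   # E[x^2] - E[x]^2, diagonal covariance
    return mean, var
```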

  8. Embedded ASR System Specification • HMMs with small memory size: < 500 KB • CPU-efficient ASR: the CPU can calculate a maximum of 300 Gaussians per frame • Compressed front-end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC • Vocabulary can be easily modified: phone-based approach
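A rough sketch of the 20-dimensional front-end using librosa. The window/hop parameters are library defaults rather than the original AIBO settings, and the assumption that the two delta streams cover c0..c6 (energy plus the 6 static MFCCs) while the static stream drops c0 is mine, made so the dimensions add up to 6 + 7 + 7 = 20.

```python
import librosa
import numpy as np

def aibo_frontend(wav, sr):
    """20-dim front-end: 6 MFCC + 7 delta-MFCC + 7 delta-delta-MFCC."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=7)  # c0..c6, (7, T)
    static = mfcc[1:7]                          # 6 static MFCCs (drop c0)
    d1 = librosa.feature.delta(mfcc)            # 7 first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)   # 7 second-order deltas
    return np.vstack([static, d1, d2])          # shape (20, T)
```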

  9. Monophone vs Triphone

  10. CM computation: the CM generator computes a confidence value for the recognized command and compares it to a threshold; if CM > Thres (yes), AIBO performs the action; otherwise (no), the command is rejected or the user is asked for confirmation
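A minimal sketch of this decision flow; the default threshold and the return strings are placeholders, not AIBO's actual behavior or one of the tuned thresholds reported on the later slides.

```python
def handle_command(cm_score: float, threshold: float = 0.5) -> str:
    """Act only when the confidence measure clears the threshold;
    otherwise reject or ask the user to confirm."""
    if cm_score > threshold:
        return "perform AIBO action"
    return "reject or ask for confirmation"
```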

  11. Recognition process: SPEECH is fed to an N-best recognizer driven by the acoustic model (AM) and the vocabulary, which outputs a ranked list of hypotheses: Hypo 1, Hypo 2, …, Hypo N

  12. CM Formulation 1 • Likelihood score ratio • Approximation with the N-best list • Used in combination with a test for in-vocabulary errors, a confidence measure is built (a hedged reconstruction is sketched below)
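The formulas on this slide did not survive extraction. What follows is a hedged reconstruction of a standard N-best approximation to the likelihood score ratio, using only the quantities named on this slide and the next (log scores S_1 ≥ … ≥ S_N); the exact expressions on the original slide may differ.

```latex
% Hedged reconstruction; the slide's own equations were lost.
% Likelihood score ratio against an alternative (filler) model:
%   LR(W_1) = P(X \mid W_1) \, / \, P(X \mid \text{alt})
% In the log domain, the filler score is approximated from the N-best list:
\[
  \log \mathrm{LR}(W_1) \;\approx\; S_1 - \frac{1}{N-1}\sum_{i=2}^{N} S_i
\]
% where S_i is the i-th best log score and N the number of hypotheses.
% Thresholding this ratio, combined with a test for in-vocabulary errors,
% builds the confidence measure.
```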

  13. CM Formulation 2 • S1 is the top score and Si the i-th score in the N-best list, with N the number of hypos in the list • Saverage, the average of the competing scores, serves as a pseudo-filler score; SN serves as a pseudo-background score • The resulting confidence value lies in [0, 1]
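Combining the two formulations, here is a minimal NumPy sketch of a [0, 1] confidence value built from the pseudo-filler and pseudo-background scores. The normalization is my reconstruction from the labels on this slide, not the verified original formula.

```python
import numpy as np

def nbest_confidence(scores):
    """Confidence measure from an N-best list of log-likelihood scores.

    The average of the N-1 competing scores acts as a pseudo-filler
    score and the worst score S_N as a pseudo-background score,
    normalizing the result to [0, 1]."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # S_1 >= ... >= S_N
    if s.size < 2 or s[0] == s[-1]:   # degenerate list: no separation
        return 0.0
    pseudo_filler = s[1:].mean()      # average of the competitors
    pseudo_background = s[-1]
    return float((s[0] - pseudo_filler) / (s[0] - pseudo_background))
```

Values near 1 mean the top hypothesis clearly dominates its competitors; values near 0 suggest an out-of-vocabulary utterance or confusable commands, feeding the rejection step of slide 10.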

  14. CM Thresholds for several AM’s and AIBO life

  15. CM thresholds for different vocabularies

  16. Conclusions • Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions. • HMMs with context-dependent units (triphones or biphones) and 1 Gaussian/state are computationally less expensive, more accurate, and more robust to noise than monophones. • The CM presented is very simple to compute yet effective at separating correct results from incorrect ones and OOVs. • CMs are robust to changes in the vocabulary and in the architecture of the recognizer. • Due to its simplicity and stability, the CM looks appealing for real-life command applications.

  17. References • H. Lucke, H. Honda, K. Minamino, A. Hiroe, H. Mori, H. Ogawa, Y. Asano, H. Kishi, “Development of a Spontaneous Speech Recognition Engine for an Entertainment Robot”, ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003. • G. Hernández Ábrego, X. Menéndez-Pidal, T. Kemp, K. Minamino, H. Lucke, “Automatic Set-up for Spontaneous Speech Recognition Engines Based on Merit Optimization”, ICASSP 2003, Hong Kong. • X. Menéndez-Pidal, L. Duan, J. Lu, B. Dukes, M. Emonts, G. Hernández-Ábrego, L. Olorenshaw, “Efficient Phone-Based Recognition Engines for Chinese and English Isolated Command Applications”, International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, August 2002. • G. Hernández Ábrego, X. Menéndez-Pidal, L. Olorenshaw, “Robust and Efficient Confidence Measure for Isolated Command Application”, Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU), Madonna di Campiglio, Trento, Italy, December 2001.
