Development of the Embedded Speech Recognition Interface done for AIBO. ICSI Presentation January 2003. Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan, Lex Olorenshaw, Honda Hitoshi, Helmut Lucke. Spoken Language Technology, SONY NSCA
Development of the Embedded Speech Recognition Interface done for AIBO ICSI Presentation January 2003
Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan, Lex Olorenshaw, Honda Hitoshi, Helmut Lucke
Spoken Language Technology, SONY NSCA, 3300 Zanker Rd MS/SJ1B5, San Jose CA
E-mail: xavier@slt.sel.sony.com
ABSTRACT
This presentation highlights three key techniques used in the embedded isolated-command recognition system developed for AIBO:
• Robust Broadband HMMs
• Small context-dependent HMMs
• Efficient Confidence Measure (task independent)
General AIBO ASR Overview and Features
Training data: Clean Speech + Mixed with Noises + Artificially Reverberated
End-point Detection + Feature Extraction
• Noise Attenuation: NSS
• Channel Normalization: CMS, CMV, or DB eq.
ASR + CM for Speech Verification
• ASR based on PLUs
• Engine based on Viterbi with Beam Search
• Lexicon: ~ 100 to 300 dictionary entries
• 3 states / 1 Gaussian per state, CHMM triphones
AIBO Dialogue Manager, which also receives:
• Other Sensors: Vision, Tact
• AIBO: Activity, Personality, mood
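The channel normalization step listed above can be illustrated with a minimal per-utterance sketch. It assumes that "CMS" means cepstral mean subtraction and that "CMV" refers to cepstral mean and variance normalization; the slide does not spell out either acronym, and the function name and parameters below are hypothetical.

```python
import numpy as np

def normalize_cepstra(features, normalize_variance=True, eps=1e-8):
    """Per-utterance cepstral normalization (illustrative sketch).

    features: array of shape (num_frames, num_coeffs).
    With normalize_variance=False this is plain mean subtraction (CMS);
    with True it also scales each coefficient to unit variance
    (assumed meaning of "CMV" on the slide).
    """
    features = np.asarray(features, dtype=float)
    # Subtracting the utterance mean removes a stationary channel offset.
    normalized = features - features.mean(axis=0)
    if normalize_variance:
        # eps guards against division by zero for constant coefficients.
        normalized = normalized / (features.std(axis=0) + eps)
    return normalized

# Usage with synthetic 20-dimensional frames (not real speech features).
rng = np.random.default_rng(0)
frames = rng.normal(loc=3.0, scale=2.0, size=(50, 20))
cms_frames = normalize_cepstra(frames, normalize_variance=False)
cmvn_frames = normalize_cepstra(frames)
```

An embedded system would more likely use a running estimate of the mean than a whole-utterance average, but the principle is the same.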
HMM Training Strategies 1
TRAINING OBJECTIVES:
• Obtain a robust recognizer in noisy far-field conditions. We simulate noisy matched conditions by:
  • Mixing clean speech with the expected noises at the target SNR
  • Artificially reverberating the training corpus using the frequency-response filters of the expected far-field room environments (0.5 ~ 1.5 m)
• Obtain an accurate recognizer in near-field, high-SNR conditions.
• The recognizer should be close to real-time.
A tradeoff is obtained by training on both matched noisy conditions and clean speech: the “Broadband HMM”.
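The two simulation steps above (reverberation with a room response, then noise mixing at a target SNR) can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the room response is assumed to be available as a time-domain impulse response, and all function names are hypothetical.

```python
import numpy as np

def reverberate(speech, room_impulse_response):
    """Convolve speech with a room impulse response, keeping the input length."""
    wet = np.convolve(speech, room_impulse_response)
    return wet[:len(speech)]

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so that it covers the whole utterance.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db/10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def simulate_far_field(speech, noise, room_impulse_response, snr_db):
    """Room_Response * Speech + Noise, in the order shown in the deck."""
    return mix_at_snr(reverberate(speech, room_impulse_response), noise, snr_db)

# Usage with synthetic signals (white noise stands in for speech and noise).
rng = np.random.default_rng(1)
clean = rng.normal(size=16000)
babble = rng.normal(size=4000)
toy_rir = np.array([1.0, 0.0, 0.5, 0.0, 0.25])  # hypothetical impulse response
noisy = simulate_far_field(clean, babble, toy_rir, snr_db=10.0)
```

Here the SNR is measured against the reverberated speech; measuring it against the dry signal instead is an equally plausible convention that the slides do not settle.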
Robust “Broadband” HMMs
• Clean Speech → Clean Accumulators
• Room_Response_1 * Speech + Noise_1 → HMM-Accumulators (Noise + Reverberation 1)
• ...
• Room_Response_N * Speech + Noise_N → HMM-Accumulators (Noise + Reverberation N)
• Clean Accumulators + HMM-Accumulators 1 ~ N → Final Broadband HMM
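One way to read the accumulator diagram is that sufficient statistics are gathered separately for each condition and summed before the final parameter estimate. The sketch below shows this for a single diagonal Gaussian. It assumes hard frame-to-state assignment (real HMM training weights frames by state occupancy), and the class and its methods are hypothetical, not the authors' code.

```python
import numpy as np

class GaussianAccumulator:
    """Zeroth-, first-, and second-order statistics for one diagonal Gaussian."""

    def __init__(self, dim):
        self.count = 0.0
        self.total = np.zeros(dim)
        self.total_sq = np.zeros(dim)

    def add_frames(self, frames):
        frames = np.asarray(frames, dtype=float)
        self.count += frames.shape[0]
        self.total += frames.sum(axis=0)
        self.total_sq += (frames ** 2).sum(axis=0)

    def merge(self, other):
        # Pooling conditions is just a sum of their statistics.
        self.count += other.count
        self.total += other.total
        self.total_sq += other.total_sq

    def estimate(self, var_floor=1e-4):
        mean = self.total / self.count
        variance = np.maximum(self.total_sq / self.count - mean ** 2, var_floor)
        return mean, variance

# Usage: one clean condition plus two simulated noisy conditions (synthetic data).
rng = np.random.default_rng(2)
clean_frames = rng.normal(0.0, 1.0, size=(30, 20))
noisy_frames = [rng.normal(0.5, 1.5, size=(40, 20)) for _ in range(2)]

broadband = GaussianAccumulator(dim=20)
broadband.add_frames(clean_frames)
for frames in noisy_frames:
    condition = GaussianAccumulator(dim=20)
    condition.add_frames(frames)
    broadband.merge(condition)
broadband_mean, broadband_var = broadband.estimate()
```

Because the pooled variance covers both clean and noisy data, the resulting Gaussian is broader than one trained on a single condition, which is consistent with the "Broadband" name.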
Embedded ASR System Specification
• HMM with small memory size: < 500 Kb
• CPU-efficient ASR:
  • The CPU can compute a maximum of 300 Gaussians per frame
  • Compressed front-end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC
• Vocabulary can be easily modified: phone-based approach
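To make the per-frame cost concrete, the sketch below evaluates a batch of diagonal-covariance Gaussians on one 20-dimensional feature frame and refuses to exceed the 300-Gaussian budget quoted above. The diagonal-covariance form is an assumption (the slides only say "1 Gaussian per state"), and the function and constant names are hypothetical.

```python
import numpy as np

MAX_GAUSSIANS_PER_FRAME = 300  # CPU budget quoted on the slide

def diag_gaussian_log_likelihoods(frame, means, variances):
    """Log-likelihood of one frame under each diagonal Gaussian.

    frame: shape (dim,); means, variances: shape (num_gaussians, dim).
    """
    frame = np.asarray(frame, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if means.shape[0] > MAX_GAUSSIANS_PER_FRAME:
        raise ValueError("active Gaussians exceed the per-frame budget")
    dim = means.shape[1]
    # The normalization term depends only on the model, so a real engine
    # would precompute it once rather than per frame.
    log_norm = -0.5 * (dim * np.log(2.0 * np.pi) + np.log(variances).sum(axis=1))
    distance = ((frame - means) ** 2 / variances).sum(axis=1)
    return log_norm - 0.5 * distance

# Usage: 20 features per frame (6 MFCC + 7 delta + 7 delta-delta).
rng = np.random.default_rng(3)
frame = rng.normal(size=20)
means = rng.normal(size=(300, 20))
variances = np.ones((300, 20))
scores = diag_gaussian_log_likelihoods(frame, means, variances)
```

With a budget of this kind, the beam width of the Viterbi search indirectly limits how many states, and therefore how many Gaussians, remain active in each frame.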
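CM computation
• The CM Generator computes a confidence measure (CM) for the recognition result and compares it with a threshold (Thres).
• CM > Thres, yes: perform AIBO action
• CM > Thres, no: reject or ask for confirmation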
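The decision flow above reduces to a single comparison. The sketch below is illustrative only: the function name, the return labels, and the threshold value used in the usage lines are hypothetical and do not come from the slides.

```python
def decide_action(confidence, threshold):
    """Map a confidence value to the two branches of the flow chart."""
    # Strict comparison, matching the "CM > Thres" test on the slide.
    if confidence > threshold:
        return "perform_action"
    return "reject_or_confirm"

# Usage with an arbitrary threshold of 0.5 (not a value from the deck).
high = decide_action(0.8, threshold=0.5)
low = decide_action(0.3, threshold=0.5)
```

In practice the threshold would be tuned on held-out data to balance false acceptances against false rejections.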
Recognition process
SPEECH → N-best Recognizer (using the AM and the Vocabulary) → Hypo 1, Hypo 2, ..., Hypo N
CM Formulation 1
• Likelihood score ratio: [equation not recovered]
• Approximation with the N-best: [equation not recovered]
• Used in combination with a test for in-vocabulary errors, a confidence measure is built: [equation not recovered]
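CM Formulation 2
[equation not recovered] Terms labeled on the slide:
• S1, Saverage, SN: scores taken from the N-best list
• Pseudo-filler score
• Pseudo-background score
• Confidence value: in [0,1]
• N: number of hypos in the list
• Si: i-th score in the N-best list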
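The slide's equation did not survive extraction, so the sketch below is only one plausible form consistent with the surviving labels, not the authors' formula. It assumes that the pseudo-filler score is the average of the N-best scores (Saverage), that the pseudo-background score is the worst score in the list (SN), and that the confidence is the ratio (S1 - Saverage) / (S1 - SN), which always lies in [0,1]. The function name is hypothetical.

```python
def nbest_confidence(scores):
    """Confidence in [0, 1] from N-best scores (higher score = better).

    ASSUMED form: (S1 - Saverage) / (S1 - SN). The original equation
    was not recoverable from the slide.
    """
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two hypotheses")
    s_best = ordered[0]
    s_average = sum(ordered) / len(ordered)  # assumed pseudo-filler score
    s_worst = ordered[-1]                    # assumed pseudo-background score
    spread = s_best - s_worst
    if spread == 0:
        # All hypotheses tie, so nothing favours the top one.
        return 0.0
    return (s_best - s_average) / spread

# Usage with made-up log-likelihood scores.
evenly_spread = nbest_confidence([-100.0, -150.0, -200.0])
clear_winner = nbest_confidence([-100.0, -190.0, -195.0, -200.0])
```

Under this assumed form, a top hypothesis that stands well clear of the rest of the list scores higher than one sitting in an evenly spread list, and the measure needs only the N-best scores that the recognizer already produces.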
Conclusions
• Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions.
• HMMs with context-dependent units (triphones or biphones) and 1 Gaussian/state are computationally less expensive, more accurate, and more robust to noise than monophones.
• The CM presented is very simple to compute, yet effective at separating correct results from incorrect ones and OOVs.
• CMs are robust to changes in the vocabulary and architecture of the recognizer.
• Due to its simplicity and stability, the CM looks appealing for real-life command applications.