Development of the Embedded Speech Recognition Interface done for AIBO

ICSI Presentation, January 2003

Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan, Lex Olorenshaw, Honda Hitoshi, Helmut Luke
Spoken Language Technology, SONY NSCA
3300 Zanker Rd MS/SJ1B5, San Jose CA
E-mail: xavier@slt.sel.sony.com

ABSTRACT
This presentation highlights three key techniques used in the embedded isolated command recognition system developed for AIBO:
• Robust Broadband HMMs
• Small context-dependent HMMs
• Efficient Confidence Measure (task independent)

Sony’s AIBO entertainment robot

General AIBO ASR Overview and Features

Training data: Clean Speech + Mixed with Noises + Artificially Reverberated

Front end:
• Noise Attenuation: NSS
• Channel Normalization: CMS, CMV, or DB eq.
• End-point Detection + Feature Extraction

ASR based on PLUs:
• Engine based on Viterbi with Beam Search
• Lexicon: ~100 to 300 dictionary entries
• CHMM triphones, 3 states / 1 Gaussian per state

Inputs to the AIBO Dialogue Manager:
• ASR + CM for Speech Verification
• Other sensors: vision, tactile
• AIBO: activity, personality, mood
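The channel normalization step can be illustrated with a minimal sketch. It assumes that CMS stands for the usual cepstral mean subtraction, where the per-utterance mean of each cepstral coefficient is removed from every frame. The function name and the toy frames are hypothetical and are not taken from the AIBO front end.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean of each coefficient from every frame.

    `frames` is a list of feature vectors (one list of floats per frame).
    This is a generic illustration, not the AIBO implementation.
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient over the whole utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy utterance: two frames, two coefficients each.
frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_subtraction(frames)
```

A constant offset in the cepstral domain corresponds to a fixed convolutional channel, which is why removing the mean reduces the channel's effect.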

HMM Training Strategies 1

TRAINING OBJECTIVES:
• Obtain a robust recognizer in noisy far-field conditions. We simulate noisy matched conditions by:
  • Mixing clean speech with expected noises at the target SNR
  • Artificially reverberating the training corpus using the frequency response filter of expected far-field room environments (0.5 ~ 1.5 m)
• Obtain an accurate recognizer in near-field, high-SNR conditions.
• The recognizer should be close to real time.

A tradeoff is obtained by training in both matched noisy conditions and clean speech conditions: the “Broadband HMM”.
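The two simulation steps can be sketched as follows. This is a minimal illustration that assumes time-domain signals stored as lists of floats, a room response given as a short impulse response, and noise with non-zero power. The function names and toy signals are hypothetical.

```python
import math


def power(signal):
    """Average power of a signal."""
    return sum(x * x for x in signal) / len(signal)


def reverberate(signal, impulse_response):
    """Convolve with a room impulse response, truncated to the input length."""
    out = [0.0] * len(signal)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            if i + j < len(out):
                out[i + j] += x * h
    return out


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add."""
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy data (assumed values, for illustration only).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
room_response = [1.0, 0.5]

reverberant = reverberate(speech, room_response)   # Room_Response * Speech
noisy = mix_at_snr(reverberant, noise, 10.0)       # ... + Noise at 10 dB SNR
```

Reverberating first and adding noise afterwards follows the "Room_Response * Speech + Noise" ordering used on the Broadband HMM slide.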

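Robust “Broadband” HMMs

Training data and accumulators:
• Clean Speech → Clean accumulators
• Room_Response_1 * Speech + Noise_1 → HMM accumulators, Noise + Reverberation 1
• ...
• Room_Response_N * Speech + Noise_N → HMM accumulators, Noise + Reverberation N

Clean accumulators + Noise + Reverberation accumulators 1 ~ N → Final Broadband HMM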
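The accumulator pooling can be sketched for a single Gaussian. This simplified illustration assumes diagonal covariances and assigns every frame fully to one state, whereas real HMM training weights frames by state occupancy. All names and the toy data are hypothetical.

```python
def accumulate(frames):
    """Collect sufficient statistics (count, sum, sum of squares) for one condition."""
    dim = len(frames[0])
    acc = {"n": 0, "sum": [0.0] * dim, "sq": [0.0] * dim}
    for frame in frames:
        acc["n"] += 1
        for d, x in enumerate(frame):
            acc["sum"][d] += x
            acc["sq"][d] += x * x
    return acc


def merge(accumulators):
    """Add the statistics of several conditions into one accumulator."""
    dim = len(accumulators[0]["sum"])
    total = {"n": 0, "sum": [0.0] * dim, "sq": [0.0] * dim}
    for acc in accumulators:
        total["n"] += acc["n"]
        for d in range(dim):
            total["sum"][d] += acc["sum"][d]
            total["sq"][d] += acc["sq"][d]
    return total


def estimate(acc):
    """Maximum-likelihood mean and variance from pooled statistics."""
    mean = [s / acc["n"] for s in acc["sum"]]
    var = [q / acc["n"] - m * m for q, m in zip(acc["sq"], mean)]
    return mean, var


# Toy one-dimensional features for a clean and a noisy+reverberated condition.
clean_acc = accumulate([[0.0], [2.0]])
noisy_acc = accumulate([[4.0], [6.0]])
mean, var = estimate(merge([clean_acc, noisy_acc]))
```

Because the statistics are additive, each condition can be processed separately and combined only at the final estimation step.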

Embedded ASR System Specification
• HMM with small memory size: < 500 Kb
• CPU-efficient ASR: the CPU can compute a maximum of 300 Gaussians per frame
• Compressed front end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC
• Vocabulary can be easily modified: phone-based approach
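A rough back-of-the-envelope reading of these limits is sketched below. The 300 Gaussians per frame, 20 features and 3 states with 1 Gaussian per state come from the slides. The diagonal covariance, the 4 bytes per parameter and the reading of "500 Kb" as 500 × 1024 bytes are assumptions made only for illustration. Transition probabilities and the lexicon are ignored.

```python
FEATURES = 20                  # from the slide: 6 + 7 + 7 features
STATES_PER_MODEL = 3           # from the slide: 3 states per triphone
GAUSSIANS_PER_STATE = 1        # from the slide: 1 Gaussian per state
MAX_GAUSSIANS_PER_FRAME = 300  # from the slide: CPU limit


def max_models_per_frame():
    """How many complete phone models can be evaluated in one frame."""
    return MAX_GAUSSIANS_PER_FRAME // (STATES_PER_MODEL * GAUSSIANS_PER_STATE)


def gaussians_in_budget(budget_bytes, bytes_per_param):
    """How many diagonal Gaussians (mean + variance vectors) fit in the budget.

    Diagonal covariance and the storage size per parameter are assumptions.
    """
    params_per_gaussian = 2 * FEATURES
    return budget_bytes // (params_per_gaussian * bytes_per_param)


models_per_frame = max_models_per_frame()
gaussians = gaussians_in_budget(500 * 1024, 4)  # assumed: 500 * 1024 bytes, 4-byte floats
```

Under these assumptions about 100 three-state models can be scored per frame, which suggests why beam pruning and a small model inventory matter on this platform.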

Monophone vs Triphone

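CM computation
• The CM generator computes a CM for the recognition result, which is compared with a threshold (Thres).
• CM > Thres, yes: perform AIBO action
• CM > Thres, no: reject or ask for confirmation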
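The decision step is a single comparison. A minimal sketch, with a hypothetical function name and an arbitrary example threshold:

```python
def decide(cm, threshold):
    """Accept the recognized command only when the CM exceeds the threshold."""
    if cm > threshold:
        return "perform AIBO action"
    return "reject or ask for confirmation"


# Example threshold of 0.5 is an arbitrary value chosen for illustration.
accepted = decide(0.8, 0.5)
rejected = decide(0.3, 0.5)
```

A CM exactly equal to the threshold is rejected here, following the strict "CM > Thres" test on the slide.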

Recognition process
SPEECH → N-best Recognizer (using the AM and the Vocabulary) → Hypo 1, Hypo 2, ..., Hypo N

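CM Formulation 1
• Likelihood score ratio: [equation not preserved in this transcript]
• Approximation with the N-best: [equation not preserved in this transcript]
• Used in combination with a test for in-vocabulary errors, a confidence measure is built: [equation not preserved in this transcript]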

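CM Formulation 2
[Equation not preserved in this transcript. Its labeled terms:]
• S1, Saverage, SN: scores from the N-best list
• Pseudo-filler score
• Pseudo-background score
• Confidence value in [0,1]
• N: number of hypos in the list
• Si: i-th score in the N-best list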
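The exact formula is not recoverable from this transcript, so the sketch below is only an illustrative assumption of how a value in [0,1] could be derived from N-best scores. It compares the gap between the best score and the average of the remaining scores with the total spread of the list. It is not the authors' formula, and the function name is hypothetical.

```python
def confidence(scores):
    """Illustrative N-best confidence in [0, 1] (an assumed form, not the slide's formula).

    `scores` are hypothesis scores where higher is better, e.g. log-likelihoods.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return 0.0
    best, worst = ranked[0], ranked[-1]
    spread = best - worst
    if spread <= 0.0:
        return 0.0  # all hypotheses score the same: no evidence for the best one
    average_rest = sum(ranked[1:]) / (len(ranked) - 1)
    return (best - average_rest) / spread


# The best hypothesis stands out clearly from the competitors.
clear = confidence([-100.0, -139.0, -140.0])
# A close runner-up makes the result less trustworthy.
ambiguous = confidence([-100.0, -101.0, -140.0])
```

A measure of this kind needs only the scores already produced by the N-best recognizer, which fits the low computational cost claimed for the CM.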

CM Thresholds for several AM’s and AIBO life

CM thresholds for different vocabularies

Conclusions
• Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions.
• HMMs with context-dependent units (triphones or biphones) and 1 Gaussian per state are computationally less expensive, more accurate, and more robust to noise than monophones.
• The CM presented is very simple to compute yet effective at separating correct results from incorrect ones and OOVs.
• CMs are robust to changes in the vocabulary and architecture of the recognizer.
• Due to its simplicity and stability, the CM looks appealing for real-life command applications.

