Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
• Authors: Naveen Parihar and Joseph Picone, Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University
• Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi 39762, Tel: 662-325-8335, Fax: 662-325-2298, Email: {parihar,picone}@isip.msstate.edu
• URL: http://www.isip.msstate.edu/publications/seminars/msstate_misc/2004/gsa/
INTRODUCTION: BLOCK DIAGRAM APPROACH
Core components:
• Transduction
• Feature extraction
• Acoustic modeling (hidden Markov models)
• Language modeling (statistical N-grams)
• Search (Viterbi beam)
• Knowledge sources
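As a concrete illustration of the search component, the following is a minimal, self-contained sketch of Viterbi decoding with beam pruning over an HMM (Python/NumPy; the toy model values at the bottom are invented for illustration and are not taken from the evaluation or the ISIP decoder):

```python
import numpy as np

def viterbi_beam(log_trans, log_emit, log_init, beam=10.0):
    """Viterbi search with beam pruning.

    log_trans[i, j]: log P(state j | state i)
    log_emit[t, j] : log P(observation at frame t | state j)
    log_init[j]    : log P(initial state j)
    beam           : prune states whose score falls this far below the best
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]                 # scores after frame 0
    backptr = np.zeros((n_frames, n_states), dtype=int)

    for t in range(1, n_frames):
        # Beam pruning: deactivate states far below the current best score.
        active = score >= score.max() - beam
        score = np.where(active, score, -np.inf)

        # Standard Viterbi recursion over the surviving states.
        cand = score[:, None] + log_trans          # cand[i, j]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]

    # Trace back the best state sequence.
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())

# Toy 3-state left-to-right example, just to exercise the function.
rng = np.random.default_rng(0)
path, best = viterbi_beam(
    log_trans=np.log(np.array([[0.8, 0.2, 0.0],
                               [0.0, 0.8, 0.2],
                               [0.0, 0.0, 1.0]]) + 1e-12),
    log_emit=np.log(rng.dirichlet(np.ones(3), size=10)),
    log_init=np.log(np.array([1.0, 1e-12, 1e-12])),
)
print(path, best)
```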
INTRODUCTION: AURORA EVALUATION OVERVIEW
• Client/server applications
• Evaluate robustness in noisy environments
• Propose a standard for LVCSR applications
• WSJ 5K (closed task) with seven (digitally-added) noise conditions
• Common ASR system
• Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)
INTRODUCTION: MOTIVATION (ALV Evaluation Results)
• The Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end
• Is the 31% relative improvement (34.5% vs. 50.3%) operationally significant?
• The baseline was a generic LVCSR system with no front-end-specific tuning
• Would front-end-specific tuning change the rankings?
EVALUATION PARADIGM: THE AURORA-4 DATABASE
Acoustic Training:
• Derived from the 5000-word WSJ0 task
• TS1 (clean) and TS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sample frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
Development and Evaluation Sets:
• Derived from the WSJ0 Evaluation and Development sets
• 14 test sets for each
• 7 test sets recorded on Sennheiser; 7 on secondary
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
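The multi-condition material is built by mixing noise into clean speech at a randomly chosen SNR. Below is a minimal sketch of that mixing step (Python/NumPy); it is not the actual Aurora-4 corpus-preparation script and it omits the G.712/P.341 channel filtering:

```python
import numpy as np

def add_noise_at_random_snr(speech, noise, snr_low_db=10.0, snr_high_db=20.0, seed=None):
    """Mix `noise` into `speech` at an SNR drawn uniformly from [snr_low_db, snr_high_db].

    Both inputs are 1-D float arrays at the same sample rate; the noise is
    tiled or truncated to the speech length. The G.712/P.341 filtering used
    in the real Aurora-4 preparation is not modeled here.
    """
    rng = np.random.default_rng(seed)
    snr_db = rng.uniform(snr_low_db, snr_high_db)

    # Tile or truncate the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise, snr_db
```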
EVALUATION PARADIGM: BASELINE LVCSR SYSTEM
Standard context-dependent cross-word HMM-based system:
• Acoustic models: state-tied 4-mixture cross-word triphones
• Language model: WSJ0 5K bigram
• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
• Lexicon: based on CMUlex
• Real-time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
[Training flow: monophone modeling → CD-triphone modeling → state tying → CD-triphone modeling → mixture modeling (2, 4)]
EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END
• Zero-mean debiasing
• 10 ms frame duration
• 25 ms Hamming window
• Absolute energy
• 12 cepstral coefficients
• First and second derivatives
[Block diagram: input speech → zero-mean and pre-emphasis → energy / Fourier transform analysis → cepstral analysis]
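A minimal sketch of this style of MFCC extraction follows (Python/NumPy). The filterbank size, FFT length, and pre-emphasis coefficient are typical values assumed here, not read from the WI007 specification, and energy and derivative computation are left out:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_ms=10, window_ms=25,
         n_filters=23, n_ceps=12, n_fft=512, preemph=0.97):
    """Frame-based MFCCs: zero-mean, pre-emphasis, Hamming window,
    mel filterbank, log, DCT. Energy and deltas are left to the caller."""
    # Zero-mean debiasing and pre-emphasis.
    signal = np.asarray(signal, dtype=float)
    signal = signal - np.mean(signal)
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    frame_step = int(sample_rate * frame_ms / 1000)
    frame_len = int(sample_rate * window_ms / 1000)
    if len(signal) < frame_len:                      # pad very short inputs
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    window = np.hamming(frame_len)

    # Triangular mel filterbank spanning 0 .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    feats = np.zeros((n_frames, n_ceps))
    for t in range(n_frames):
        frame = signal[t * frame_step : t * frame_step + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        logmel = np.log(fbank @ spec + 1e-10)
        # DCT-II, keeping cepstral coefficients c1..c12 (c0 is dropped).
        n = np.arange(n_filters)
        feats[t] = [np.sum(logmel * np.cos(np.pi * k * (n + 0.5) / n_filters))
                    for k in range(1, n_ceps + 1)]
    return feats
```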
FRONT END PROPOSALS: QIO FRONT END
• 10 ms frame duration
• 25 ms analysis window
• 15 RASTA-like filtered cepstral coefficients
• MLP-based VAD
• Mean and variance normalization
• First and second derivatives
[Block diagram: input speech → Fourier transform → mel-scale filter bank → MLP-based VAD → RASTA → DCT → mean/variance normalization]
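Of these steps, mean and variance normalization is the easiest to illustrate in isolation. The sketch below (Python/NumPy) assumes per-utterance normalization, which the slides do not specify:

```python
import numpy as np

def mean_variance_normalize(features, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance
    over the utterance (cepstral mean/variance normalization, CMVN).

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```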
FRONT END PROPOSALS: MFA FRONT END
• 10 ms frame duration
• 25 ms analysis window
• Mel-warped Wiener filter based noise reduction
• Energy-based VADNest
• Waveform processing to enhance SNR
• Weighted log-energy
• 12 cepstral coefficients
• Blind equalization (cepstral domain)
• VAD based on acceleration of various energy-based measures
• First and second derivatives
[Block diagram: input speech → noise reduction → VADNest → waveform processing → cepstral analysis → blind equalization → feature processing → VAD]
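The core idea behind the Wiener-filter noise reduction can be sketched as a per-frame spectral gain. The version below (Python/NumPy) estimates noise from the first few frames and applies a plain (not mel-warped) Wiener gain with overlap-add resynthesis, so it is a simplified stand-in for the MFA design rather than the actual advanced front end:

```python
import numpy as np

def wiener_denoise(signal, frame_len=400, hop=160, n_fft=512, noise_frames=10):
    """Simple spectral Wiener filtering.

    Estimates the noise power spectrum from the first `noise_frames` frames
    (assumed to be speech-free) and applies the gain
        G = max(P_noisy - P_noise, 0) / P_noisy
    frame by frame, reconstructing the waveform with overlap-add.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:                      # pad very short inputs
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))

    # Noise power estimate from the leading frames.
    noise_psd = np.zeros(n_fft // 2 + 1)
    for t in range(min(noise_frames, n_frames)):
        frame = signal[t * hop : t * hop + frame_len] * window
        noise_psd += np.abs(np.fft.rfft(frame, n_fft)) ** 2
    noise_psd /= max(min(noise_frames, n_frames), 1)

    for t in range(n_frames):
        start = t * hop
        frame = signal[start : start + frame_len] * window
        spec = np.fft.rfft(frame, n_fft)
        psd = np.abs(spec) ** 2
        gain = np.maximum(psd - noise_psd, 0.0) / (psd + 1e-10)
        enhanced = np.fft.irfft(gain * spec, n_fft)[:frame_len]
        out[start : start + frame_len] += enhanced * window
        norm[start : start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-10)
```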
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
• Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors.
• Tuning parameters:
• State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
• Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ; see the scoring sketch below)
• Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
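For reference, the language model scale and word insertion penalty typically enter hypothesis scoring in the log domain as shown in this minimal sketch (Python; the formula is the standard one, but the variable names and example values are illustrative, not tuned values from the evaluation or the ISIP decoder):

```python
def hypothesis_score(acoustic_logprob, lm_logprob, num_words,
                     lm_scale=15.0, word_insertion_penalty=-10.0):
    """Combine model scores for one hypothesis in the log domain.

    score = acoustic + lm_scale * lm + word_insertion_penalty * num_words

    Raising lm_scale trusts the language model more relative to the acoustic
    models; making the penalty more negative discourages inserting extra
    short words, which helps control insertions in noisy audio.
    """
    return (acoustic_logprob
            + lm_scale * lm_logprob
            + word_insertion_penalty * num_words)
```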
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
• QIO FE: 7.5% relative improvement
• MFA FE: 9.4% relative improvement
• Ranking is still the same (14.9% vs. 12.5%)!
EXPERIMENTAL RESULTS: COMPARISON OF TUNING
• Same ranking: the relative performance gap increased from 9.6% to 15.8%
• On TS1, MFA FE significantly better on all 14 test sets (MAPSSWE, p=0.1%)
• On TS2, MFA FE significantly better only on test sets 5 and 14
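The significance claims use NIST's MAPSSWE (matched-pairs sentence-segment word error) test. The sketch below (Python/SciPy) is a simplified stand-in that only captures the matched-pairs idea, comparing per-segment error-count differences with a paired t-test; it is not the full NIST procedure, which also defines how segments are constructed around error regions:

```python
import numpy as np
from scipy import stats

def matched_pairs_test(errors_system_a, errors_system_b):
    """Simplified matched-pairs significance test.

    errors_system_a / errors_system_b: per-segment word error counts for
    the same segments under two systems. Returns the two-sided p-value for
    the hypothesis that the mean per-segment error difference is zero.
    """
    diffs = (np.asarray(errors_system_a, dtype=float)
             - np.asarray(errors_system_b, dtype=float))
    t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
    return p_value
```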
EXPERIMENTAL RESULTS: MICROPHONE VARIATION
[Chart: WER for the ETSI, MFA, and QIO front ends under Sennheiser vs. secondary microphone conditions]
• Train on Sennheiser mic.; evaluate on secondary mic.
• Matched conditions result in optimal performance
• Significant degradation for all front ends on mismatched conditions
• Both QIO and MFA provide improved robustness relative to the MFCC baseline
EXPERIMENTAL RESULTS: ADDITIVE NOISE
[Charts: WER for the ETSI, MFA, and QIO front ends on test sets TS2 through TS7, for clean training (top) and multi-condition training (bottom)]
• Performance degrades on noise conditions when systems are trained only on clean data
• Both QIO and MFA deliver improved performance
• Exposing systems to noise and microphone variations (TS2) improves performance
SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?
• Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative over the ETSI baseline
• WER is still high (~35%), while human benchmarks report much lower error rates (~1%), so the improvement in performance is not operationally significant
• Front-end-specific parameter tuning did not significantly change overall performance (MFA still outperforms QIO)
• Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline
APPENDIX: AVAILABLE RESOURCES
• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end
• Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
• ETSI DSR Website: reports and front end standards