
Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation

This study evaluates the performance of advanced front ends on the Aurora large vocabulary evaluation and explores the impact of front end specific tuning. The comparison of the front end proposals and the experimental results highlight how the different approaches behave in noisy environments for LVCSR applications.

Presentation Transcript


  1. Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
     • Authors: Naveen Parihar and Joseph Picone, Institute for Signal and Information Processing, Dept. of Electrical and Computer Engineering, Mississippi State University
     • Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi 39762; Tel: 662-325-8335; Fax: 662-325-2298; Email: {parihar,picone}@isip.msstate.edu
     • URL: http://www.isip.msstate.edu/publications/seminars/msstate_misc/2004/gsa/

  2. INTRODUCTION: BLOCK DIAGRAM APPROACH
     Core components:
     • Transduction
     • Feature extraction
     • Acoustic modeling (hidden Markov models)
     • Language modeling (statistical N-grams)
     • Search (Viterbi beam)
     • Knowledge sources
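
     The block diagram realizes the standard statistical formulation of recognition, which is worth stating explicitly since it is only implicit in the component list (this equation is background knowledge, not taken from the slide):

         W* = argmax_W P(A | W) P(W)

     Here A is the sequence of feature vectors produced by the front end, the HMM acoustic models supply P(A | W), the statistical N-gram language model supplies the prior P(W), and the Viterbi beam search carries out the maximization over candidate word sequences W.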

  3. INTRODUCTION: AURORA EVALUATION OVERVIEW
     • Client/server applications
     • Evaluate robustness in noisy environments
     • Propose a standard for LVCSR applications
     • WSJ 5K (closed task) with seven (digitally-added) noise conditions
     • Common ASR system
     • Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)

  4. INTRODUCTION: MOTIVATION (ALV Evaluation Results)
     • The Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end
     • Is the 31% relative improvement (34.5% vs. 50.3% WER) operationally significant?
     • The baseline was a generic LVCSR system with no front end specific tuning
     • Would front end specific tuning change the rankings?
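
     For reference, the quoted relative improvement is simply the reduction in word error rate divided by the baseline word error rate: (50.3 − 34.5) / 50.3 ≈ 0.314, i.e. roughly the 31% relative improvement cited above.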

  5. EVALUATION PARADIGM: THE AURORA-4 DATABASE
     Acoustic Training:
     • Derived from the 5000-word WSJ0 task
     • TS1 (clean) and TS2 (multi-condition)
     • Clean plus 6 noise conditions
     • Randomly chosen SNR between 10 and 20 dB
     • 2 microphone conditions (Sennheiser and secondary)
     • 2 sample frequencies: 16 kHz and 8 kHz
     • G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
     Development and Evaluation Sets:
     • Derived from the WSJ0 Evaluation and Development sets
     • 14 test sets each: 7 recorded on the Sennheiser microphone, 7 on a secondary microphone
     • Clean plus 6 noise conditions
     • Randomly chosen SNR between 5 and 15 dB
     • G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
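
     To make the SNR specification concrete, the sketch below shows one way noise can be digitally added to a clean waveform at a randomly drawn SNR. It is purely illustrative and is not the actual tooling used to build Aurora-4.

```python
import numpy as np

def add_noise_at_random_snr(speech, noise, snr_range_db=(10.0, 20.0), rng=None):
    """Mix a noise recording into a clean utterance at an SNR drawn uniformly
    from snr_range_db. Purely illustrative; not the actual Aurora-4 tooling."""
    rng = rng if rng is not None else np.random.default_rng()
    snr_db = rng.uniform(*snr_range_db)

    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    # Scale the noise so 10*log10(P_speech / P_noise) hits the drawn SNR.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, snr_db
```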

  6. EVALUATION PARADIGM: BASELINE LVCSR SYSTEM
     Standard context-dependent cross-word HMM-based system:
     • Acoustic models: state-tied, 4-mixture, cross-word triphones
     • Language model: WSJ0 5K bigram
     • Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
     • Lexicon: based on CMUlex
     • Real-time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
     [Training flow: Training Data → Monophone Modeling → CD-Triphone Modeling → State-Tying → CD-Triphone Modeling → Mixture Modeling (2, 4)]

  7. EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END
     • Zero-mean debiasing
     • 10 ms frame duration
     • 25 ms Hamming window
     • Absolute energy
     • 12 cepstral coefficients
     • First and second derivatives
     [Block diagram: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Cepstral Analysis, with an Energy branch]
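
     As a rough illustration of the processing chain above, the following NumPy sketch computes 12 cepstral coefficients plus log energy with a 25 ms Hamming window and a 10 ms frame shift. The FFT length, filterbank size, and pre-emphasis coefficient are assumptions for illustration, not values taken from the WI007 standard.

```python
import numpy as np

def mfcc_frontend(signal, fs=16000, frame_ms=10, win_ms=25,
                  n_fft=512, n_mels=23, n_ceps=12):
    """Illustrative WI007-style MFCC front end: 12 cepstra + log energy per
    10 ms frame from a 25 ms Hamming window. FFT length, filterbank size,
    and pre-emphasis coefficient are assumed, not standard values."""
    # Zero-mean debiasing and pre-emphasis
    signal = signal - np.mean(signal)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    hop, win = fs * frame_ms // 1000, fs * win_ms // 1000
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop

    # Triangular mel filterbank spanning 0 .. fs/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    edges = 700.0 * (10.0 ** (np.linspace(0.0, mel(fs / 2), n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ce, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fbank[i, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)

    feats = []
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + win] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        logmel = np.log(fbank @ power + 1e-10)
        # DCT-II of the log mel energies; keep c1..c12 (c0 replaced by energy)
        k = np.arange(1, n_ceps + 1)[:, None]
        ceps = (logmel * np.cos(np.pi * k * (np.arange(n_mels) + 0.5) / n_mels)).sum(axis=1)
        feats.append(np.append(ceps, np.log(np.sum(frame ** 2) + 1e-10)))
    return np.array(feats)  # shape (n_frames, 13); deltas appended downstream
```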

  8. FRONT END PROPOSALS: QIO FRONT END
     • 10 ms frame duration
     • 25 ms analysis window
     • 15 RASTA-like filtered cepstral coefficients
     • MLP-based VAD
     • Mean and variance normalization
     • First and second derivatives
     [Block diagram: Input Speech → Fourier Transform → Mel-scale Filter Bank → RASTA → DCT → Mean/Variance Normalization, with an MLP-based VAD]
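
     Two of the steps listed above, mean/variance normalization and the derivative computation, are simple enough to sketch directly. These are common realizations; the exact variants used in the QIO front end may differ.

```python
import numpy as np

def mvn(features):
    """Per-utterance cepstral mean and variance normalization
    (a common realization; the QIO front end's exact variant may differ)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10
    return (features - mu) / sigma

def add_deltas(features, width=2):
    """Append first and second derivatives using the standard
    regression-based delta formula over +/- `width` frames."""
    def deltas(x):
        pad = np.pad(x, ((width, width), (0, 0)), mode='edge')
        num = sum(n * (pad[width + n: len(x) + width + n] -
                       pad[width - n: len(x) + width - n])
                  for n in range(1, width + 1))
        return num / (2 * sum(n * n for n in range(1, width + 1)))
    d1 = deltas(features)
    d2 = deltas(d1)
    return np.hstack([features, d1, d2])
```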

  9. FRONT END PROPOSALS: MFA FRONT END
     • 10 ms frame duration
     • 25 ms analysis window
     • Mel-warped Wiener filter based noise reduction
     • Energy-based VADNest
     • Waveform processing to enhance SNR
     • Weighted log-energy
     • 12 cepstral coefficients
     • Blind equalization (cepstral domain)
     • VAD based on acceleration of various energy-based measures
     • First and second derivatives
     [Block diagram: Input Speech → Noise Reduction (VADNest) → Waveform Processing → Cepstral Analysis → Blind Equalization → Feature Processing (VAD)]
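
     The noise reduction stage is the distinguishing element of this front end. The sketch below shows a basic single-pass Wiener gain applied per frame in the linear-frequency domain; the actual MFA front end uses a mel-warped formulation, so treat this only as a conceptual stand-in. The noise power spectrum would typically be estimated from frames the VAD flags as non-speech.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=0.1):
    """Single-pass Wiener filter gain per frequency bin: a greatly simplified
    stand-in for the MFA mel-warped Wiener filter (floor value is assumed)."""
    snr_prior = np.maximum(noisy_psd / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = snr_prior / (snr_prior + 1.0)
    return np.maximum(gain, floor)

def denoise_frame(frame, noise_psd, n_fft=512):
    """Apply the Wiener gain to one windowed frame and resynthesize it.
    noise_psd holds the estimated noise power spectrum (n_fft//2 + 1 bins)."""
    spec = np.fft.rfft(frame, n_fft)
    gain = wiener_gain(np.abs(spec) ** 2, noise_psd)
    return np.fft.irfft(gain * spec, n_fft)[:len(frame)]
```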

  10. EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
     • Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors
     • Tuning parameters (the language model scale and word insertion penalty enter the decoding score as sketched below):
       • State-tying threshold: addresses sparsity of training data by sharing state distributions among phonetically similar states
       • Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
       • Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
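
     The language model scale and word insertion penalty are typically simple log-domain terms in the decoder's path score; the sketch below shows that typical combination. The exact form and the tuned values used by the evaluation's decoder are assumptions here.

```python
def path_score(acoustic_logprob, lm_logprob, n_words,
               lm_scale=15.0, word_insertion_penalty=-10.0):
    """Typical log-domain combination of acoustic score, scaled language
    model score, and a per-word insertion penalty. The scale and penalty
    values are placeholders, not the tuned evaluation settings."""
    return (acoustic_logprob
            + lm_scale * lm_logprob
            + word_insertion_penalty * n_words)
```

     Raising the language model scale makes the N-gram dominate the acoustics, while a more negative insertion penalty suppresses spurious short words, trading insertions for deletions.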

  11. EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
     • QIO FE: 7.5% relative improvement
     • MFA FE: 9.4% relative improvement
     • Ranking is still the same (14.9% vs. 12.5%)!

  12. EXPERIMENTAL RESULTS: COMPARISON OF TUNING
     • Same ranking: the relative performance gap increased from 9.6% to 15.8%
     • On TS1, the MFA FE is significantly better on all 14 test sets (MAPSSWE, p = 0.1%)
     • On TS2, the MFA FE is significantly better only on test sets 5 and 14
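
     MAPSSWE is NIST's matched-pairs sentence-segment word error test. A greatly simplified matched-pairs version is sketched below to show the idea: per-segment error-count differences between the two systems are tested for zero mean with a normal approximation. This is an approximation for illustration, not the official sclite implementation.

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_test(errors_sys_a, errors_sys_b):
    """Simplified matched-pairs significance test in the spirit of MAPSSWE:
    per-segment error counts for two systems are compared, and the mean
    difference is tested against zero via a normal approximation."""
    diffs = np.asarray(errors_sys_a, float) - np.asarray(errors_sys_b, float)
    n = len(diffs)
    z = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n) + 1e-12)
    p_two_sided = 2.0 * (1.0 - norm.cdf(abs(z)))
    return z, p_two_sided
```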

  13. EXPERIMENTAL RESULTS: MICROPHONE VARIATION
     [Figure: performance of the ETSI, MFA, and QIO front ends under Sennheiser vs. secondary microphone conditions]
     • Train on the Sennheiser mic.; evaluate on the secondary mic.
     • Matched conditions result in optimal performance
     • Significant degradation for all front ends on mismatched conditions
     • Both QIO and MFA provide improved robustness relative to the MFCC baseline

  14. EXPERIMENTAL RESULTS: ADDITIVE NOISE
     [Figure: performance of the ETSI, MFA, and QIO front ends on test sets TS2-TS7, for clean-only and multi-condition training]
     • Performance degrades on the noise conditions when systems are trained only on clean data
     • Both QIO and MFA deliver improved performance
     • Exposing systems to noise and microphone variations (TS2, the multi-condition training set) improves performance

  15. SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?
     • Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative over the ETSI baseline
     • WER is still high (~35%) while human benchmarks have reported much lower error rates (~1%), so the improvement in performance is not operationally significant
     • Front end specific parameter tuning did not result in a significant change in overall performance (MFA still outperforms QIO)
     • Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline

  16. APPENDIX: AVAILABLE RESOURCES
     • Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end
     • Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
     • ETSI DSR Website: reports and front end standards
