
Burhan Necioglu Bryan George George Shuttic The MITRE Corporation

Burhan Necioglu Bryan George George Shuttic The MITRE Corporation. Ramasubramanian Sundaram Joe Picone Mississippi State U. Inst. for Signal & Information Processing.


Presentation Transcript


  1. Burhan Necioglu Bryan George George Shuttic The MITRE Corporation Ramasubramanian Sundaram Joe Picone Mississippi State U. Inst. for Signal & Information Processing The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments MITRE / MS State - ISIP

  2. INTRODUCTION • Collaboration between The MITRE Corporation and the Mississippi State Institute for Signal and Information Processing (ISIP) • Primary goal: Evaluate the impact of noise pre-processing developed for other DoD applications • MITRE: • Focus on robust speech recognition using noise reduction techniques, including effects of tactical communications links • Distributed information access systems for military applications (DARPA Communicator) • Mississippi State: • Focus on stable, practical, advanced LVCSR technology • Open source large vocabulary speech recognition tools • Training, education and dissemination of information related to all aspects of speech research • ISIP-STT system used a combination of technologies from both organizations

  3. OVERVIEW OF THE SYSTEM • Standard MFCC front-end with side-based CMS (cepstral mean subtraction) • Acoustic modeling: • Left-right model topology • Skip states for special models like silence • Continuous density mixture Gaussian HMMs • Both Baum-Welch and Viterbi training supported • Phonetic decision tree-based state-tying • Hierarchical search Viterbi decoder
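
The left-right topology and Viterbi search named on the slide above can be illustrated with a toy sketch. The three-state model, its transition probabilities (including the 0-to-2 skip transition) and the per-frame likelihoods are invented for illustration; this is not the ISIP-STT implementation, which uses mixture Gaussian HMMs and a hierarchical search.

```python
import math

NEG_INF = float("-inf")

def to_log(matrix):
    """Convert a matrix of probabilities to log probabilities (0 -> -inf)."""
    return [[math.log(p) if p > 0 else NEG_INF for p in row] for row in matrix]

def viterbi(log_trans, log_obs, start=0):
    """Best state path through an HMM for one observation sequence.

    log_trans[i][j]: log transition probability from state i to state j.
    log_obs[t][j]:   log-likelihood of frame t under state j.
    The path is forced to begin in `start` and to end in the last state.
    Returns (best_log_score, state_path).
    """
    n_states = len(log_trans)
    score = [NEG_INF] * n_states
    score[start] = log_obs[0][start]
    backptrs = []
    for t in range(1, len(log_obs)):
        new_score = [NEG_INF] * n_states
        ptr = [-1] * n_states
        for j in range(n_states):
            for i in range(n_states):
                cand = score[i] + log_trans[i][j]
                if cand > new_score[j]:
                    new_score[j], ptr[j] = cand, i
            if new_score[j] > NEG_INF:
                new_score[j] += log_obs[t][j]
        score = new_score
        backptrs.append(ptr)
    if score[-1] == NEG_INF:
        raise ValueError("final state is unreachable for this sequence")
    # Trace back from the final state to recover the best path.
    state = n_states - 1
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return score[-1], path

# Hypothetical left-right model: self-loops, forward steps, one skip (0 -> 2).
TRANS = to_log([[0.5, 0.4, 0.1],
                [0.0, 0.6, 0.4],
                [0.0, 0.0, 1.0]])
# Hypothetical per-frame state likelihoods for a four-frame utterance.
OBS = to_log([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.10, 0.80, 0.10],
              [0.05, 0.05, 0.90]])

best_score, best_path = viterbi(TRANS, OBS)
```

With only two frames, the same model can reach the final state only through the skip transition, which is the behaviour a skip state buys for short segments such as brief silences.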

  4. STATE-TYING: MOTIVATION • Context-dependent models for better performance • Increased parameter count • Need to reduce computations without degrading performance

  5. FEATURES AND PERFORMANCE • Batch processing • Real-time performance of the training process during various stages:

  6. DECODER: OVERVIEW • Algorithmic features: • Single-pass decoding • Hierarchical Viterbi search • Dynamic network expansion • Functional features: • Cross-word context-dependent acoustic models • Word graph rescoring, forced alignments, N-gram decoding • Structural features: • Word graph compaction • Multiple pronunciations • Memory management

  7. EVALUATION SYSTEM - NOISE PREPROCESSING • Using Harsh Environment Noise Pre-Processor (HENPP) front-end to remove noise from input speech • HENPP developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox; Malah et al., ICASSP 1999) • Multiplicative spectral processing - minimal distortion, eliminates “doodley-doos” (aka “musical noise”) • “Minimum statistics” noise adaptation - handles quasi-stationary additive noise (random and stochastic) without assumptions • Limitations: • Not designed to address transient noise • Noise adaptation sensitive to “push-to-talk” effects • Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR [chart labels: LPC-10, MELP, MELP/HENPP]
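
The two ideas on the slide above, a noise estimate tracked from spectral minima and a multiplicative per-bin gain, can be sketched in a few lines. This is a deliberately simplified stand-in and not the HENPP algorithm: the function name, the smoothing constant, the window length and the gain floor are all assumptions, and a full minimum-statistics estimator would also compensate for the bias of taking a minimum.

```python
from collections import deque

def suppress_noise(power_frames, window=50, alpha=0.9, gain_floor=0.1):
    """Compute per-bin gains from a minimum-tracking noise estimate.

    power_frames: list of frames, each a list of per-bin power values.
    window:       number of recent frames searched for the minimum (assumed).
    alpha:        recursive smoothing constant for the power track (assumed).
    gain_floor:   lower bound on the gain, which limits "musical noise".
    Returns (gains, noise_estimates), each shaped like power_frames.
    """
    n_bins = len(power_frames[0])
    smoothed = list(power_frames[0])
    history = deque(maxlen=window)
    gains, noises = [], []
    for frame in power_frames:
        # Recursively smooth the power in each frequency bin.
        smoothed = [alpha * s + (1.0 - alpha) * p
                    for s, p in zip(smoothed, frame)]
        history.append(smoothed)
        # Noise estimate: minimum of the smoothed power over the window.
        noise = [min(h[k] for h in history) for k in range(n_bins)]
        # Multiplicative gain, floored so no bin is driven to zero.
        gain = [max(gain_floor, 1.0 - noise[k] / smoothed[k])
                if smoothed[k] > 0 else gain_floor
                for k in range(n_bins)]
        gains.append(gain)
        noises.append(noise)
    return gains, noises

# Toy single-bin input: 20 noise-only frames, then 5 frames with speech.
frames = [[1.0]] * 20 + [[101.0]] * 5
gains, noises = suppress_noise(frames)
```

In the toy input the gain sits at the floor during the noise-only frames and rises toward 1 once the speech frames arrive, while the noise estimate stays near the noise-only level. A sudden change in the noise floor, such as a push-to-talk onset, would not be followed until the window slides past it, which is consistent with the limitations listed on the slide.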

  8. EVALUATION SYSTEM - DATA AND TRAINING • 10 hours of SPINE data used for training - no DRT (Diagnostic Rhyme Test) words • 100 frames per second, 25 ms Hamming window • 12 base FFT-derived mel cepstra with side-based CMS and log-energy • Delta and acceleration coefficients • 44-phone set to cover SPINE data • 909 models, 2725 states
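
The front-end parameters on the slide above can be made concrete with a small sketch. The 16 kHz sample rate, the two-frame regression span and the choice to append deltas to all 13 base features (12 cepstra plus log-energy, giving 39 dimensions) are assumptions for illustration; the slide does not state them, and the mel-cepstrum computation itself is omitted.

```python
import math

def hamming(n):
    """Hamming window of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_starts(n_samples, sample_rate, frame_rate=100, window_sec=0.025):
    """Start indices of analysis frames and the window width in samples."""
    shift = sample_rate // frame_rate            # 10 ms shift at 100 frames/s
    width = int(round(window_sec * sample_rate)) # 25 ms window
    return list(range(0, n_samples - width + 1, shift)), width

def side_cms(features):
    """Subtract the per-dimension mean taken over one whole conversation side."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]

def deltas(features, span=2):
    """Regression-based time derivatives, replicating the edge frames."""
    n, dim = len(features), len(features[0])
    denom = 2 * sum(k * k for k in range(1, span + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = sum(k * (features[min(n - 1, t + k)][d]
                           - features[max(0, t - k)][d])
                      for k in range(1, span + 1))
            row.append(num / denom)
        out.append(row)
    return out

def add_dynamics(features):
    """Append delta and acceleration coefficients to each frame."""
    d1 = deltas(features)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(features, d1, d2)]

# One second of audio at the assumed 16 kHz rate.
starts, width = frame_starts(16000, 16000)
```

Side-based CMS uses a single mean per conversation side rather than per utterance, so the whole side must be available before normalization; that fits the batch processing mentioned earlier in the deck.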

  9. EVALUATION SYSTEM - LM and LEXICON • 5226 words in the SPINE lexicon, provided by CMU • CMU language model • Bigrams obtained by throwing away the trigrams • LM size: 5226 unigrams, 12511 bigrams
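
"Throwing away the trigrams" can be sketched as a filter over a language model file. The sketch assumes the model is stored in the text ARPA format, which the slide does not state, and the function name is hypothetical. It drops the higher-order sections and the back-off weights on the new highest order, and it does not renormalize the remaining probabilities.

```python
def truncate_arpa(lines, max_order=2):
    """Return ARPA-format LM lines with all n-grams above max_order removed."""
    out = []
    current_order = 0  # 0 means header or outside any n-gram section
    for line in lines:
        s = line.strip()
        # Header counts such as "ngram 3=1234".
        if s.startswith("ngram ") and "=" in s:
            order = int(s[len("ngram "):].split("=")[0])
            if order <= max_order:
                out.append(s)
            continue
        # Section markers such as "\3-grams:".
        if s.startswith("\\") and s.endswith("-grams:"):
            current_order = int(s[1:s.index("-")])
            if current_order <= max_order:
                out.append(s)
            continue
        if s == "\\end\\":
            current_order = 0
            out.append(s)
            continue
        if current_order > max_order:
            continue  # skip entries of the discarded orders
        if current_order == max_order and s:
            # Keep log-probability and words; drop the trailing back-off
            # weight, which only served the discarded higher order.
            s = " ".join(s.split()[:1 + max_order])
        out.append(s)
    return out

# Tiny invented model used only to exercise the function.
toy_lm = [
    "\\data\\", "ngram 1=2", "ngram 2=1", "ngram 3=1", "",
    "\\1-grams:", "-1.0 a -0.5", "-1.0 b -0.5", "",
    "\\2-grams:", "-0.3 a b -0.2", "",
    "\\3-grams:", "-0.1 a b a", "",
    "\\end\\",
]
bigram_lm = truncate_arpa(toy_lm)
```

A bigram model obtained this way keeps probabilities that were estimated for use under a trigram model, so it is generally not identical to a bigram model trained directly.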

  10. EVALUATION SYSTEM - DECODING • Single stage decoding using word-internal acoustic models and bigram LM

  11. RESULTS AND ANALYSIS • Lattice generation/lattice rescoring will improve results. • Informal analysis of evaluation data and results: • Negative correlation between recognition performance and SNR

  12. RESULTS AND ANALYSIS (cont.) • Clean speech : “B” side of spine_eval_033 (281 total words) • Low SNR example: “A” side of spine_eval_021 (115 total words):
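
Per-side comparisons like the ones above rest on word error rate, which can be computed with a standard edit-distance alignment. The sketch below is a generic implementation, not the scoring tool used in the evaluation, and the transcripts in the usage lines are invented.

```python
def word_errors(ref, hyp):
    """Minimum substitutions + deletions + insertions between two transcripts.

    Returns (error_count, reference_word_count).
    """
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances against an empty reference
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # match or substitution
        prev = cur
    return prev[-1], len(r)

def wer(pairs):
    """Pooled word error rate over (reference, hypothesis) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words

# Invented transcripts, for illustration only.
pairs = [("noise level is high", "noise level high"),
         ("a b c", "a x c d")]
pooled = wer(pairs)
```

Because the pooled rate divides total errors by total reference words, a 281-word side contributes more to the overall figure than a 115-word side; per-side rates have to be examined separately to see how performance varies with SNR.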

  13. RESULTS AND ANALYSIS (cont.) • HENPP designed for human listening purposes • Optimized to raise DRT scores in presence of noise and coding • DRT scores and WER tend to be poorly correlated; minor perceptual distortions often have a magnified adverse effect on speech recognizers • Need to retune the HENPP • Algorithm is very effective for robust recognition of noisy speech at low SNRs • Too aggressive when applied to clean speech - some information is lost • Minor adjustments will preserve noisy speech performance and boost clean speech performance

  14. ISSUES • Decoding slow on this task • 100x real-time (on 600 MHz Pentium) • Newer version of ISIP-STT decoder will be faster • Had to use bigram LM in the allowed time frame • Large amount of eval data • With slow decoding, seriously limited experiments • The devil is in the details: • Certain training data problematic: “Noise field is <long silence> up” • Automatic segmentation (having eval segmentations would help)

  15. CONCLUSIONS • MITRE / MS State-ISIP system; standard recognition approach using advanced noise preprocessing front end • Time limitation: could only officially report on the baseline system • Performed initial experiment with noise pre-processing (AT&T HENPP) • Overall word error rate did not improve • Informal analysis suggests that for low SNR conversations, noise pre-processing does help • Difficulty with high SNR conversations • There is potential for improvement with application-specific tuning of HENPP • Approach is very promising for coded speech in commercial and military environments
