Burhan Necioglu, Bryan George, George Shuttic (The MITRE Corporation); Ramasubramanian Sundaram, Joe Picone (Mississippi State University, Institute for Signal and Information Processing)
The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments
MITRE / MS State - ISIP
INTRODUCTION
• Collaboration between The MITRE Corporation and the Mississippi State Institute for Signal and Information Processing (ISIP)
• Primary goal: evaluate the impact of noise pre-processing developed for other DoD applications
• MITRE:
  • Focus on robust speech recognition using noise reduction techniques, including the effects of tactical communications links
  • Distributed information access systems for military applications (DARPA Communicator)
• Mississippi State:
  • Focus on stable, practical, advanced LVCSR technology
  • Open source large vocabulary speech recognition tools
  • Training, education and dissemination of information related to all aspects of speech research
• The ISIP-STT system combined technologies from both organizations
OVERVIEW OF THE SYSTEM
• Standard MFCC front-end with side-based cepstral mean subtraction (CMS)
• Acoustic modeling:
  • Left-right model topology
  • Skip states for special models such as silence
  • Continuous-density Gaussian mixture HMMs
  • Both Baum-Welch and Viterbi training supported
  • Phonetic decision tree-based state-tying
• Hierarchical search Viterbi decoder
STATE-TYING: MOTIVATION
• Context-dependent models give better performance
• But they increase the parameter count
• Need to reduce computation without degrading performance
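The parameter growth that motivates state-tying can be illustrated with some rough arithmetic. The sketch below takes the 44-phone set and the 2725 tied states quoted on the data and training slide, and assumes three emitting states per phone model (that figure is not stated in the slides). The untied count is only an upper bound, since not every triphone context occurs and the evaluation system used word-internal models.

```python
# Rough parameter-count arithmetic for context-dependent models.
NUM_PHONES = 44        # from the data and training slide
TIED_STATES = 2725     # from the data and training slide
STATES_PER_MODEL = 3   # assumption: three emitting states per phone model


def untied_triphone_states(num_phones, states_per_model):
    """Upper bound on distinct states if every left/center/right
    phone context had its own untied model."""
    return num_phones ** 3 * states_per_model


untied = untied_triphone_states(NUM_PHONES, STATES_PER_MODEL)
reduction = untied / TIED_STATES

print(f"untied states (upper bound): {untied}")
print(f"tied states after clustering: {TIED_STATES}")
print(f"reduction factor: {reduction:.1f}x")
```

Each state carries its own Gaussian mixture, so the number of distinct states drives both the amount of training data needed and the likelihood computation at decode time.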
FEATURES AND PERFORMANCE
• Batch processing
• Real-time performance of the training process during various stages:
DECODER: OVERVIEW
• Algorithmic features:
  • Single-pass decoding
  • Hierarchical Viterbi search
  • Dynamic network expansion
• Functional features:
  • Cross-word context-dependent acoustic models
  • Word graph rescoring, forced alignments, N-gram decoding
• Structural features:
  • Word graph compaction
  • Multiple pronunciations
  • Memory management
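For readers unfamiliar with Viterbi search, the core recursion can be sketched on a toy HMM. This is the plain textbook algorithm in the log domain, not the hierarchical search with dynamic network expansion used in the ISIP-STT decoder; the two-state left-right model and all probabilities below are made up for illustration.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Return (best_log_score, best_state_path).

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(num_states):
            # Pick the best predecessor for state s at frame t.
            best_prev = max(range(num_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        backptrs.append(ptr)
        score = new_score
    # Trace back from the best final state.
    last = max(range(num_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


# Toy two-state left-right model (hypothetical numbers).
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)],
             [NEG_INF, 0.0]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.2), math.log(0.8)],
            [math.log(0.1), math.log(0.9)]]

best_score, best_path = viterbi(log_init, log_trans, log_emit)
print(best_path, best_score)
```

A production decoder applies the same recursion over a much larger network of words, phones and states, with pruning, which is where the dynamic network expansion and memory management listed above come in.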
EVALUATION SYSTEM - NOISE PREPROCESSING
• Harsh Environment Noise Pre-Processor (HENPP) front-end used to remove noise from the input speech
• HENPP was developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox; Malah et al., ICASSP 1999)
• Multiplicative spectral processing: minimal distortion, eliminates “doodley-doos” (aka “musical noise”)
• “Minimum statistics” noise adaptation: handles quasi-stationary additive noise (random and stochastic) without assumptions
• Limitations:
  • Not designed to address transient noise
  • Noise adaptation sensitive to “push-to-talk” effects
• Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR (original chart compared LPC-10, MELP, and MELP/HENPP)
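The two ideas named above, minimum-statistics noise tracking and a multiplicative spectral gain, can be sketched in a few lines. This is a heavily simplified illustration and not the HENPP algorithm: the function names, the window length, the smoothing constant, the gain floor and the Wiener-like gain rule are all assumptions, and the bias compensation used in published minimum-statistics estimators is omitted.

```python
def track_noise_floor(power_frames, window=50, smoothing=0.9):
    """Per-bin noise estimate: the minimum of the recursively smoothed
    power over the last `window` frames (simplified minimum statistics)."""
    num_bins = len(power_frames[0])
    smoothed = list(power_frames[0])
    history = []
    estimates = []
    for frame in power_frames:
        smoothed = [smoothing * s + (1.0 - smoothing) * p
                    for s, p in zip(smoothed, frame)]
        history.append(smoothed)
        if len(history) > window:
            history.pop(0)
        estimates.append([min(h[b] for h in history)
                          for b in range(num_bins)])
    return estimates


def spectral_gain(power, noise, floor=0.1):
    """Multiplicative gain per bin, clamped to [floor, 1].
    Wiener-like rule chosen for illustration only."""
    gains = []
    for p, n in zip(power, noise):
        g = 1.0 - n / p if p > 0 else floor
        gains.append(max(floor, min(1.0, g)))
    return gains


# One frequency bin: steady noise at power 1.0 with a short burst at 10.0.
frames = [[1.0]] * 10 + [[10.0]] * 3 + [[1.0]] * 7
noise = track_noise_floor(frames)
for t in (9, 11, 19):
    print(t, noise[t], spectral_gain(frames[t], noise[t]))
```

Because the estimate follows the minimum of the smoothed power, short speech bursts do not raise it, which is why no explicit speech/non-speech decision is needed. The same property explains the limitations listed above: a transient noise looks like speech to the tracker, and a channel that switches on and off changes the noise floor abruptly.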
EVALUATION SYSTEM - DATA AND TRAINING
• 10 hours of SPINE data used for training; no DRT words
• 100 frames per second, 25 ms Hamming window
• 12 base FFT-derived mel cepstra with side-based CMS, plus log-energy
• Delta and acceleration coefficients
• 44-phone set to cover the SPINE data
• 909 models, 2725 states
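The front-end numbers above imply a few derived quantities, sketched below together with a minimal side-based CMS. The 16 kHz sample rate is an assumption (the slides do not state it), as is the assumption that delta and acceleration coefficients are appended for all 13 static features; `side_based_cms` is a hypothetical helper name.

```python
FRAME_RATE_HZ = 100      # from the slide
WINDOW_MS = 25           # from the slide
NUM_CEPSTRA = 12         # from the slide
SAMPLE_RATE_HZ = 16000   # assumption: not stated in the slides

frame_shift_ms = 1000 / FRAME_RATE_HZ
window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000
static_dim = NUM_CEPSTRA + 1      # 12 cepstra plus log-energy
feature_dim = static_dim * 3      # static + delta + acceleration (assumed)


def side_based_cms(cepstra):
    """Subtract the per-coefficient mean computed over every frame of one
    conversation side. `cepstra` is a list of equal-length frames."""
    num_frames = len(cepstra)
    dim = len(cepstra[0])
    means = [sum(frame[d] for frame in cepstra) / num_frames
             for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in cepstra]


print(frame_shift_ms, window_samples, feature_dim)
print(side_based_cms([[1.0, 2.0], [3.0, 6.0]]))
```

Computing the mean over a whole conversation side, rather than per utterance, gives a more stable estimate of the channel offset when individual utterances are short.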
EVALUATION SYSTEM - LM AND LEXICON
• 5226 words in the SPINE lexicon, provided by CMU
• CMU language model
  • Bigram LM obtained by discarding the trigrams
  • LM size: 5226 unigrams, 12511 bigrams
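Discarding the trigrams from a backoff language model can be sketched as a simple text filter, assuming the model is stored in the common ARPA text format (the slides do not say which format was used). The function name `drop_trigrams` and the tiny sample model are hypothetical. Backoff weights on the bigram entries are dropped, since they are only needed when backing off from a higher order.

```python
def drop_trigrams(arpa_text):
    """Return ARPA-format text containing only the 1-gram and 2-gram sections."""
    out = []
    section = 0  # 0 = header or outside any n-gram section
    for line in arpa_text.splitlines():
        s = line.strip()
        if s.startswith("ngram "):
            # Header count line such as "ngram 3=1": skip orders above 2.
            if int(s[len("ngram "):].split("=")[0]) > 2:
                continue
        elif s.startswith("\\") and s.endswith("-grams:"):
            section = int(s[1:-len("-grams:")])
        elif s == "\\end\\":
            section = 0
        elif section == 2 and s:
            # Keep log10 probability and the two words; drop the backoff weight.
            line = "\t".join(s.split()[:3])
        if section <= 2:
            out.append(line)
    return "\n".join(out) + "\n"


# Tiny made-up model, for illustration only.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2
ngram 3=1

\\1-grams:
-0.5\t<s>\t-0.3
-0.5\t</s>
-0.5\tup\t-0.3

\\2-grams:
-0.2\t<s>\tup\t-0.1
-0.2\tup\t</s>

\\3-grams:
-0.1\t<s>\tup\t</s>

\\end\\
"""

print(drop_trigrams(SAMPLE))
```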
EVALUATION SYSTEM - DECODING
• Single-stage decoding using word-internal acoustic models and bigram LM
RESULTS AND ANALYSIS
• Lattice generation and lattice rescoring are expected to improve results
• Informal analysis of evaluation data and results:
  • Negative correlation between recognition performance and SNR
RESULTS AND ANALYSIS (cont.)
• Clean speech: “B” side of spine_eval_033 (281 total words)
• Low SNR example: “A” side of spine_eval_021 (115 total words):
RESULTS AND ANALYSIS (cont.)
• HENPP designed for human listening purposes
  • Optimized to raise DRT scores in the presence of noise and coding
  • DRT scores and WER tend to be poorly correlated; minor perceptual distortions often have a magnified adverse effect on speech recognizers
• Need to retune the HENPP
  • Algorithm is very effective for robust recognition of noisy speech at low SNRs
  • Too aggressive when applied to clean speech: some information is lost
  • Minor adjustments should preserve noisy speech performance and boost clean speech performance
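Since the analysis is stated in terms of WER, a minimal sketch of the metric may help: word error rate is the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. This is the generic definition, not the scoring tool used in the evaluation, and the hypothesis string below is made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.
    Assumes a non-empty reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical recognizer output: one substitution and one deletion.
print(word_error_rate("noise field is up", "noise feel is"))
```

Because every word error counts equally, a small distortion that flips a few acoustically similar words costs as much in WER as a gross error, which is one way perceptual scores such as DRT and WER can diverge.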
ISSUES
• Decoding slow on this task
  • 100x real-time (on a 600 MHz Pentium)
  • Newer version of the ISIP-STT decoder will be faster
  • Had to use a bigram LM in the allowed time frame
• Large amount of eval data
  • With slow decoding, experiments were seriously limited
• The devil is in the details:
  • Certain training data problematic: “Noise field is <long silence> up”
  • Automatic segmentation (having eval segmentations would help)
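The cost of a 100x real-time decoder is easy to put in concrete terms. The helper below is a hypothetical sketch; the amount of audio and the number of machines are illustrative values, not figures from the evaluation.

```python
def decode_hours(audio_hours, real_time_factor, num_cpus=1):
    """Wall-clock hours to decode, assuming the work splits evenly across CPUs."""
    return audio_hours * real_time_factor / num_cpus


# Illustrative values only: one hour of audio at 100x real-time.
print(decode_hours(1.0, 100))              # on a single CPU
print(decode_hours(1.0, 100, num_cpus=10)) # spread over ten CPUs
```

At this rate every additional experiment over the full evaluation set costs days of compute, which is consistent with the limited number of experiments reported.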
CONCLUSIONS
• MITRE / MS State-ISIP system: standard recognition approach using an advanced noise pre-processing front end
• Time limitation: could only officially report on the baseline system
• Performed an initial experiment with noise pre-processing (AT&T HENPP)
  • Overall word error rate did not improve
  • Informal analysis suggests that noise pre-processing does help for low SNR conversations
  • Difficulty with high SNR conversations
• There is potential for improvement with application-specific tuning of HENPP
• Approach is very promising for coded speech in commercial and military environments