This presentation summarizes the HIWIRE project's research on improving environmental and speaker robustness in automatic speech recognition (ASR). The project addresses noise and speaker variation, showcasing fixed cockpit and PDA platforms and exploring multi-microphone ASR, robust feature extraction, and robust modeling. Experimental results and ongoing work cover beamforming, adaptive noise cancellation, robust feature selection, and modulation and acoustic features. Industrial and research partners collaborate on model transformation, speech statistical processing, and feature fusion.
Towards speaker and environmental robustness in ASR: the HIWIRE project
A. Potamianos1, G. Bouselmi2, D. Dimitriadis3, D. Fohr2, R. Gemello4, I. Illina2, F. Mana4, P. Maragos3, M. Matassoni5, V. Pitsikalis3, J. Ramírez6, E. Sanchez-Soto1, J. Segura6, and P. Svaizer5
1 Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece 2 Speech Group, LORIA, Nancy, France 3 School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece 4 Loquendo, via Valdellatorre, 4-10149, Torino, Italy 5 ITC-irst, via Sommarive 18 - Povo (TN), Italy 6 Dept. of Signal Theory, Univ. of Granada, Spain
Outline • Introduction: the HIWIRE project • Goals and objectives • Research areas: • Environmental robustness • Speaker robustness • Experimental results • Ongoing work
HIWIRE project • http://www.hiwire.org • Goals: environment- and speaker-robust ASR • Showcase: fixed cockpit platform, PDA platform • Industrial partners: Thales Avionics, Loquendo • Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales Research • FP6 project: 6/2004 to 5/2007
Research areas • Environmental robustness • Multi-microphone ASR • Robust feature extraction • Feature fusion and audio-visual ASR • Feature equalization • Voice-activity detection • Speech enhancement • Speaker robustness • Model transformation • Acoustic modeling for non-native speech
Multi-microphone ASR: Outline • Beamforming and Adaptive Noise Cancellation • Environmental Acoustics Estimation
Beamforming: D&S Multi-channel signals make it possible to selectively capture the desired source. • Issue: estimation of reliable TDOAs • Method: CSP (cross-power spectrum phase) analysis over multiple frames • Advantages: robustness, reduced computational cost
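To make the CSP/TDOA step concrete, here is a minimal Python sketch of delay-and-sum beamforming with GCC-PHAT, the usual implementation of CSP analysis; function names and the integer-sample alignment are illustrative, not the project code:

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the TDOA (seconds) between two channels via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # phase transform (PHAT) weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, fs, ref_idx=0):
    """Align each channel to the reference by its estimated TDOA
    (rounded to whole samples) and average: delay-and-sum beamforming."""
    ref = channels[ref_idx]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        tau = gcc_phat_tdoa(ch, ref, fs)
        out += np.roll(ch, -int(round(tau * fs)))
    return out / len(channels)
```

Averaging TDOA estimates (or the cross-power spectra) over multiple frames, as on the slide, makes the delay estimates more robust at low SNR.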
D&S with MarkIII • Test set: • set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels • clean models, trained on original TIDIGITS • Results reported as word error rate reduction (WERR, %)
Robust Features for ASR • Modulation Features • AM-FM Modulations • Teager Energy Cepstrum • Fractal Features • Dynamical Denoising • Correlation Dimension • Multiscale Fractal Dimension • Hybrid-Merged Features • Improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)
Speech Modulation Features • Filterbank Design • Short-Term AM-FM Modulation Features • Short-Term Mean Inst. Amplitude IA-Mean • Short-Term Mean Inst. Frequency IF-Mean • Frequency Modulation Percentages FMP • Short-Term Energy Modulation Features • Average Teager Energy, Cepstrum Coef. TECC
[Block diagram: modulation acoustic features. Speech passes through multiband filtering with regularization, nonlinear demodulation, statistical processing, and robust feature transformation/selection (with V.A.D.); outputs are the AM-FM modulation features (IA-Mean, IF-Mean, FMP) and the Teager energy cepstrum coefficients (TECC).]
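As an illustration of the demodulation stage, a minimal sketch of the Teager-Kaiser energy operator and DESA-2 energy separation, which produce the instantaneous amplitude and frequency that the IA-Mean/IF-Mean features are built from; this is a simplified stand-in for the project front end:

```python
import numpy as np

def teager(x):
    # Teager-Kaiser energy: Psi[x](n) = x(n)^2 - x(n-1) * x(n+1)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, eps=1e-12):
    """DESA-2 energy separation for one bandpass (filterbank) channel:
    returns instantaneous amplitude and frequency (rad/sample)."""
    y = x[2:] - x[:-2]                    # y(n) = x(n+1) - x(n-1)
    psi_x = teager(x)[1:-1]               # Psi[x], aligned with Psi[y]
    psi_y = teager(y)
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x + eps), -1.0, 1.0))
    amp = 2.0 * psi_x / (np.sqrt(np.abs(psi_y)) + eps)
    return amp, omega

# Short-term IA-Mean / IF-Mean features are frame averages of amp / omega;
# the TECC features instead apply cepstral analysis to the Teager energy.
```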
TIMIT-based Speech Databases • TIMIT Database: • Training Set: 3696 sentences, ~35 phonemes/utterance • Testing Set: 1344 utterances, 46680 phonemes • Sampling Frequency 16 kHz • Feature Vectors: • MFCC+C0+AM-FM+1st+2nd Time Derivatives • Stream Weights: (1) for MFCC and (2) for AM-FM • 3-state left-right HMMs, 16 mixtures • All-pair, unweighted grammar • Performance Criterion: Phone Accuracy Rates (%) • Back-end System: HTK v3.2.0
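For reference, a hedged sketch of the static MFCC part of this feature vector, using librosa as a stand-in for the HTK front end actually used (the AM-FM stream is computed separately and appended):

```python
import numpy as np
import librosa

def mfcc_with_deltas(wav_path, n_mfcc=13):
    """13 cepstral coefficients (incl. C0) plus 1st and 2nd time
    derivatives: the 39-dim static part of the feature vector above."""
    x, fs = librosa.load(wav_path, sr=16000)     # TIMIT is 16 kHz
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)             # 1st time derivative
    d2 = librosa.feature.delta(mfcc, order=2)    # 2nd time derivative
    return np.vstack([mfcc, d1, d2])             # shape: (39, n_frames)
```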
Results: TIMIT + Noise: up to +106% improvement
Aurora 3 - Spanish • Connected digits, sampling frequency 8 kHz • Training Set: • WM (Well-Matched): 3392 utterances (quiet 532, low 1668, and high noise 1192) • MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211) • HM (High-Mismatch): 1696 utterances (quiet 266, low 834, and high noise 596) • Testing Set: • WM: 1522 utterances (quiet 260, low 754, and high noise 508), 8056 digits • MM: 850 utterances (quiet 0, low 0, and high noise 850), 4543 digits • HM: 631 utterances (quiet 0, low 377, and high noise 254), 3325 digits • 2 back-end ASR systems (HTK and BLasr) • Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC • All-pair, unweighted grammar (or word-pair grammar) • Performance Criterion: Word (digit) Accuracy Rates
Results: Aurora 3: up to +62% improvement
[Block diagram: fractal features. The noisy N-d time-delay embedding is geometrically filtered via local SVD, yielding a filtered embedding and a cleaned speech signal; from these, the Filtered Dynamics Correlation Dimension (FDCD) and the Multiscale Fractal Dimension (MFD) features are computed.]
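A minimal sketch of the correlation-dimension side of this pipeline (a Grassberger-Procaccia estimate on a time-delay embedding); the local-SVD geometrical filtering step that gives the "filtered" dynamics is omitted here for brevity:

```python
import numpy as np
from scipy.spatial.distance import pdist

def delay_embed(x, dim, tau):
    """Time-delay embedding of a scalar signal into R^dim."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def correlation_dimension(x, dim=5, tau=2, n_radii=10):
    """Slope of log C(r) vs log r, where C(r) is the fraction of
    embedded point pairs closer than radius r."""
    X = delay_embed(np.asarray(x, dtype=float), dim, tau)
    d = pdist(X)
    radii = np.logspace(np.log10(d.min() + 1e-12), np.log10(d.max()), n_radii)
    C = np.array([np.mean(d < r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C + 1e-12), 1)
    return slope
```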
Databases: Aurora 2 • Task: speaker-independent recognition of digit sequences • TI-Digits at 8 kHz • Training (8440 utterances per scenario, 55M/55F): • Clean (8 kHz, G712) • Multi-Condition (8 kHz, G712) • 4 noises (artificial): subway, babble, car, exhibition • 5 SNRs: 5, 10, 15, 20 dB, clean • Testing, artificially added noise: • 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean • A: noises as in multi-condition training, G712 (28028 utterances) • B: restaurant, street, airport, train station, G712 (28028 utterances) • C: subway, street (MIRS) (14014 utterances)
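The artificial noise conditions amount to scaling a noise recording so the mixture hits a target global SNR; a small illustrative sketch (not the official Aurora tooling):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at the requested global SNR (dB)."""
    noise = np.resize(noise, speech.shape)        # loop/crop noise to length
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```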
Results: Aurora 2: up to +61% improvement
Feature Fusion • Merge synchronous feature streams • Investigate both supervised and unsupervised algorithms
Feature Fusion: multi-stream • Compute “optimal” exponent weights for each stream [HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifiers] • Optimality in the sense of minimizing the “total classification error”
Multi-Stream Classification • Two-class problem w1, w2 • Feature vector x is broken up into two independent streams x1 and x2 • Stream weights s1 and s2 are used to “equalize” the “probabilities”
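Written out, the exponent-weighted combination and the resulting two-class decision rule look as follows (a sketch under the independence assumption above):

```latex
% Weighted stream combination: P_s(x \mid w) = P(x_1 \mid w)^{s_1} \, P(x_2 \mid w)^{s_2}
\[
\text{decide } w_1 \iff
s_1 \log P(x_1 \mid w_1) + s_2 \log P(x_2 \mid w_1)
\;>\;
s_1 \log P(x_1 \mid w_2) + s_2 \log P(x_2 \mid w_2)
\]
```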
Multi-Stream Classification • Bayes classification decision • Non-unity weights increase the Bayes error, but the estimation/modeling error may decrease • Stream weights can decrease the total error • “Optimal” weights minimize the estimation error variance σz²
Optimal Stream Weights • Equal error rate in the single-stream classifiers ⇒ optimal stream weights are inversely proportional to the total stream estimation error variance
Optimal Stream Weights • Equal estimation error variance in each stream ⇒ optimal weights are approximately inversely proportional to the single-stream classification error
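In symbols, the two special cases above give (a sketch; σi² denotes the total estimation error variance of stream i and ei its single-stream classification error):

```latex
\[
\frac{s_1}{s_2} \propto \frac{\sigma_2^2}{\sigma_1^2}
\quad \text{(equal single-stream error rates)},
\qquad
\frac{s_1}{s_2} \approx \frac{e_2}{e_1}
\quad \text{(equal estimation error variances)}
\]
```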
Experimental Results • Subset of CUAVE database used: • 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • Features: • Audio: 39 features (MFCC_D_A) • Visual: 105 features (ROIDCT_D_A) • Multi-stream HMM models, middle integration: • 8-state, left-to-right HMM whole-digit models • Single Gaussian mixture • AV-HMM uses separate audio and video feature streams
Optimal Stream Weights: Results • Assumption: σV²/σA² = 2, SNR-independent • Correlation: 0.96
Parametric non-linear equalization • Parametric histogram equalization • Smoother estimates • Bi-modal transformation (speech vs. non-speech)
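A hedged sketch of the bi-modal parametric equalization idea: fit a two-Gaussian (speech vs. non-speech) model to each feature component, then map through the fitted CDF and the inverse CDF of a reference bi-modal distribution. The reference parameters below are illustrative placeholders, not the project's values:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def parametric_heq(feat, ref_means=(-1.5, 0.5), ref_stds=(0.5, 1.0),
                   ref_w=(0.4, 0.6)):
    """Equalize one feature component (1-D array over time) via a
    parametric bi-modal CDF match; smoother than raw histogram HEQ."""
    gmm = GaussianMixture(n_components=2).fit(feat.reshape(-1, 1))
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sd = np.sqrt(gmm.covariances_).ravel()
    # CDF of the observed features under the fitted bi-modal model
    u = sum(wi * norm.cdf(feat, mi, si) for wi, mi, si in zip(w, mu, sd))
    # invert the reference bi-modal CDF numerically on a grid
    grid = np.linspace(-6, 6, 2001)
    ref_cdf = sum(wi * norm.cdf(grid, mi, si)
                  for wi, mi, si in zip(ref_w, ref_means, ref_stds))
    return np.interp(u, ref_cdf, grid)
```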
Voice Activity Detection • Bi-spectrum based VAD • Support vector machine based VAD • Combination of VAD with speech enhancement
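For flavor, a toy SVM-based VAD in the spirit of the second bullet; the frame features here (log energy, zero-crossing rate) are deliberately simple stand-ins for the bispectrum and long-term features actually investigated:

```python
import numpy as np
from sklearn.svm import SVC

def vad_frame_features(x, fs, frame=0.025, hop=0.010):
    """Per-frame log energy and zero-crossing rate."""
    n, h = int(frame * fs), int(hop * fs)
    feats = []
    for start in range(0, len(x) - n, h):
        f = x[start:start + n]
        log_e = np.log(np.sum(f ** 2) + 1e-10)
        zcr = 0.5 * np.mean(np.abs(np.diff(np.sign(f))))
        feats.append([log_e, zcr])
    return np.array(feats)

# Given labeled training frames:
#   clf = SVC(kernel="rbf").fit(vad_frame_features(train_x, fs), labels)
#   speech_flags = clf.predict(vad_frame_features(test_x, fs))
```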
Speech Enhancement • Modified Wiener filtering, with a filter gain that depends on the global SNR • Modified Ephraim-Malah enhancement, based on the Ephraim-Malah spectral attenuation rule
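A minimal sketch of the plain Wiener-filtering idea (per-bin gain SNR/(1+SNR) from a fixed noise PSD estimate); the project's modified version and the Ephraim-Malah rule are more elaborate:

```python
import numpy as np

def wiener_enhance(x, noise_psd, nfft=512, hop=256):
    """Per-bin Wiener gain H = SNR / (1 + SNR), with the SNR derived
    from a fixed noise PSD estimate (length nfft // 2 + 1), e.g.
    averaged over non-speech frames.  Overlap-add window normalization
    is omitted for brevity."""
    win = np.hanning(nfft)
    out = np.zeros(len(x))
    for start in range(0, len(x) - nfft, hop):
        frame = x[start:start + nfft] * win
        spec = np.fft.rfft(frame)
        snr = np.maximum(np.abs(spec) ** 2 / (noise_psd + 1e-10) - 1.0, 0.0)
        out[start:start + nfft] += np.fft.irfft(spec * snr / (1.0 + snr)) * win
    return out
```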
Non-Native Speech Recognition • Build non-native models by combining English and native models • Use phone confusion between English phones and native acoustic models to add alternate model paths • Extract the confusion matrix automatically by running phone recognition with the native models • Phone pronunciation depends on the word's grapheme: (English phone, grapheme) -> French phone
[Table: extracted rules, example for the English phone /t/. The English model /t/ is given alternate paths through French phone models, e.g. /t/ and /k/.]
Graphemic constraints • Example: • APPROACH /ah p r ow ch/ • APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch) • Alignment between graphemes and phones for each word of the lexicon • Lexicon modification: add graphemes to each word • Confusion rules extraction: • (grapheme, English phone) → list of non-native phones • Example: (A, ah) → a
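A small sketch of the rule-extraction step: counting (grapheme, English phone) → non-native phone co-occurrences from aligned recognition output and keeping the frequent ones as alternate pronunciations; the names and threshold are illustrative, not the project code:

```python
from collections import Counter, defaultdict

def extract_confusion_rules(aligned_pairs, min_count=5):
    """aligned_pairs: iterable of ((grapheme, english_phone), nonnative_phone)
    tuples from aligning English lexicon entries with native-model phone
    recognition output.  Returns rules keyed by (grapheme, english_phone)."""
    counts = defaultdict(Counter)
    for (grapheme, eng_phone), nonnative_phone in aligned_pairs:
        counts[(grapheme, eng_phone)][nonnative_phone] += 1
    return {key: [ph for ph, c in cnt.items() if c >= min_count]
            for key, cnt in counts.items()}

# e.g. aligned pairs like (("A", "ah"), "a") would yield the rule
# (A, ah) -> [a] from the APPROACH example above.
```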
Ongoing Work • Front-end • combination and integration of algorithms • Fixed-platform demonstration • non-native speech demo • PDA-platform demonstration • Ongoing research