GSTC UGR. HIWIRE MEETING. Granada, June 9-10, 2005. JOSÉ C. SEGURA, LUZ GARCÍA, JAVIER RAMÍREZ
Schedule
• Non-linear feature normalization
  - ECDF segmental implementation
  - Progressive equalization
  - 2-class normalization
• Non-linear speaker adaptation/independence
  - Non-linear feature normalization
  - Non-linear model adaptation
• VAD and technique combination
  - MO-LRT
  - Bi-spectrum based VAD
  - Combined Front-End
ECDF-based nonlinear transformation (1) • CDF-matching nonlinear transformation • In previous work, we modeled the CDFs using histograms
ECDF-based nonlinear transformation (2) • An alternative algorithm based on order statistics • It is faster: it only requires sorting and table indexing • Results are almost identical to those obtained with histograms
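The order-statistics idea can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `ecdf_equalize`, the standard Gaussian reference, and the `(r - 0.5) / n` ECDF estimate are assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist


def ecdf_equalize(x):
    """Map a 1-D feature sequence onto a standard Gaussian reference by rank.

    Assumption: the ECDF at the sample of rank r (1..n) is estimated as
    (r - 0.5) / n, which keeps it strictly inside (0, 1).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Sorting gives the order statistics; ranks[t] is the rank (1..n) of x[t].
    ranks = np.empty(n, dtype=int)
    ranks[np.argsort(x, kind="stable")] = np.arange(1, n + 1)
    # Precomputed table of reference quantiles, indexed by rank.
    ref = NormalDist()
    table = np.array([ref.inv_cdf((r - 0.5) / n) for r in range(1, n + 1)])
    # Table indexing replaces the histogram-based CDF estimate.
    return table[ranks - 1]
```

The transformation is monotonic, so the ordering of the samples is preserved while their distribution is forced towards the reference.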
ECDF segmental implementation • Based on a sliding window • José C. Segura, M. Carmen Benítez, Ángel de la Torre, Antonio J. Rubio, Javier Ramírez, “Cepstral domain segmental nonlinear feature transformations for robust speech recognition”, IEEE Signal Processing Letters, Vol. 11, No. 5, pp. 517-520, 2004
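A sliding-window variant might look like the sketch below. It assumes a window of `2 * half_window + 1` frames centered on the current frame, a standard Gaussian reference, and windows that are simply truncated at the sequence edges; none of these details are stated on the slide, and `segmental_ecdf_equalize` is a hypothetical name.

```python
import numpy as np
from statistics import NormalDist


def segmental_ecdf_equalize(x, half_window=60):
    """Equalize each frame using only the frames inside a sliding window."""
    x = np.asarray(x, dtype=float)
    ref = NormalDist()
    out = np.empty_like(x)
    for t in range(x.size):
        # Window centered on frame t, truncated at the edges (assumption).
        lo = max(0, t - half_window)
        hi = min(x.size, t + half_window + 1)
        window = x[lo:hi]
        # Rank (1..len(window)) of the current frame within its window.
        rank = np.count_nonzero(window < x[t]) + 1
        # Map the rank to the corresponding quantile of the reference.
        out[t] = ref.inv_cdf((rank - 0.5) / window.size)
    return out
```

With `half_window=60` the output for frame `t` needs 60 future frames, which is the algorithmic delay of this sketch.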
Progressive normalization • Not all MFCCs are equally discriminative • HEQ introduces some distortion • Normalizing only up to a certain MFCC order gives the best performance
2-class normalization (1) • A first approach to parametric non-linear equalization • PDFs are modeled as mixtures of two Gaussian classes for each MFCC • In practice, we use speech-like and noise-like classes • EM is run on each sentence to obtain the Gaussian classes • [Figure: C0 and C1 for Test01 and Test02]
2-class normalization (2) • Nonlinear parametric transformation • [Figure: equalization of C1 between Test02 (car) and Test01 (clean) of WSJ0 data]
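The slides do not give the transformation itself, so the following is only a sketch under stated assumptions: a two-component 1-D Gaussian mixture is fitted per sentence with EM, and each sample is then mapped by a per-class mean/variance match towards reference classes, weighted by the class posteriors. The function names, the initialization, and the posterior-weighted mapping are all assumptions for illustration.

```python
import numpy as np


def posteriors(x, w, mu, var):
    """Posterior probability of each of the two classes for every sample."""
    lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return lik / lik.sum(axis=1, keepdims=True)


def fit_two_gaussians(x, n_iter=50):
    """EM for a two-Gaussian mixture on one coefficient of one sentence."""
    x = np.asarray(x, dtype=float)
    # Initialize the classes from the lower and upper halves (assumption).
    med = np.median(x)
    mu = np.array([x[x <= med].mean(), x[x > med].mean()])
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        post = posteriors(x, w, mu, var)          # E-step
        nk = post.sum(axis=0)                     # M-step
        w = nk / x.size
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var


def two_class_equalize(x, ref_mu, ref_var, n_iter=50):
    """Posterior-weighted per-class linear mapping (assumed form)."""
    x = np.asarray(x, dtype=float)
    ref_mu = np.asarray(ref_mu, dtype=float)
    ref_var = np.asarray(ref_var, dtype=float)
    w, mu, var = fit_two_gaussians(x, n_iter)
    post = posteriors(x, w, mu, var)
    # Each class is shifted and scaled onto its reference class.
    mapped = ref_mu + (x[:, None] - mu) * np.sqrt(ref_var / var)
    # Blending by the posteriors makes the overall mapping nonlinear.
    return (post * mapped).sum(axis=1)
```

Class 0 is the lower-valued class and class 1 the higher-valued one in this sketch, so `ref_mu` and `ref_var` should be ordered the same way.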
ECDF Features Normalization • HEQ as a non-linear speaker normalization technique using ECDF
ECDF Models Adaptation: 2 approaches
• Pure equalization, “HEQ MOD”: new Gaussian distributions
  - shift on the means: $X \rightarrow X_{HEQ}$
  - scale factor on the variances
• Equalization mixed with a linear transformation, “HEQ PLIN”:
  - linear transformation: $X_A = M X + B$
  - $M'$, $B'$ chosen such that $D(X_A, X_{HEQ}) = \| M' X + B' - X_{HEQ} \|^2$ is minimum
• [Diagram: speaker-specific features mapped to speaker-independent features]
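The “HEQ PLIN” criterion is an ordinary least-squares problem, which could be solved along these lines. This is a sketch, assuming the features and their equalized versions are available as arrays of shape `(n_frames, dim)`; `fit_linear_to_heq` is a hypothetical helper name.

```python
import numpy as np


def fit_linear_to_heq(X, X_heq):
    """Find M, B minimizing sum_t || M x_t + B - x_heq_t ||^2.

    X, X_heq: arrays of shape (n_frames, dim).
    Returns M with shape (dim, dim) and B with shape (dim,).
    """
    X = np.asarray(X, dtype=float)
    X_heq = np.asarray(X_heq, dtype=float)
    # Augment each frame with a constant 1 so the bias B is fitted jointly.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A, X_heq, rcond=None)  # W has shape (dim + 1, dim)
    M = W[:-1].T
    B = W[-1]
    return M, B
```

Applying the result is then `X_A = X @ M.T + B`, a linear approximation of the nonlinear equalization.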
Future Work 1/2 • Speaker-adapted (SA) models using MLLR are not robust against noise • Proposed combination: feature normalization + MLLR
Future Work 2/2 • Non-linear feature normalization and model adaptation • Further experiments with more complex tasks on the WSJ1 database (spoke3 and spoke4)
Previous work on VAD
• Voice activity detection:
  - Kullback-Leibler divergence: J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, A. Rubio, “A New Kullback-Leibler VAD for Robust Speech Recognition”, IEEE Signal Processing Letters, Vol. 11, No. 2, pp. 666-669, Feb. 2004
  - Long-term spectral divergence: J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, A. Rubio, “Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information”, Speech Communication, Vol. 42/3-4, pp. 271-287, 2004
  - Subband SNR estimation using OS filters: J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, A. Rubio, “An Effective Subband OSF-based VAD with Noise Reduction for Robust Speech Recognition”, to appear in IEEE Transactions on Speech and Audio Processing, 2005/2006
  - Multiple observation likelihood ratio test: J. Ramírez, J. C. Segura, C. Benítez, L. García, A. Rubio, “Statistical Voice Activity Detection using a Multiple Observation Likelihood Ratio Test”, to appear in IEEE Signal Processing Letters
Likelihood ratio test (LRT)
• Generalization of Sohn's VAD: J. Sohn, N. S. Kim, W. Sung, “A statistical model-based voice activity detection”, IEEE Signal Processing Letters, Vol. 6, No. 1, pp. 1-3, 1999
• Two hypotheses are considered:
  - H0: y = n (absence of speech, silence)
  - H1: y = s + n (speech presence)
• Optimum decision rule (Bayes classifier)
• l-frame observation vector
• LRT evaluation requires an adequate signal model
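For reference, the decision rule for these two hypotheses can be written out as below. This is a sketch of the standard form under the Gaussian model of Sohn et al., not a transcription of the slide: the bin index $j$, the number of bins $J$, the noise and speech variances $\lambda_N(j)$ and $\lambda_S(j)$, and the a priori and a posteriori SNRs $\xi_j$ and $\gamma_j$ are notation assumed here.

```latex
% Bayes decision rule for the l-th frame observation vector y_l
\frac{p(\mathbf{y}_l \mid H_1)}{p(\mathbf{y}_l \mid H_0)}
  \;\underset{H_0}{\overset{H_1}{\gtrless}}\;
  \frac{P(H_0)}{P(H_1)}

% Gaussian model on the DFT coefficients Y_j, j = 0, ..., J-1
p(\mathbf{y}_l \mid H_0) = \prod_{j=0}^{J-1} \frac{1}{\pi \lambda_N(j)}
  \exp\!\left( -\frac{|Y_j|^2}{\lambda_N(j)} \right)

p(\mathbf{y}_l \mid H_1) = \prod_{j=0}^{J-1} \frac{1}{\pi \left[ \lambda_N(j) + \lambda_S(j) \right]}
  \exp\!\left( -\frac{|Y_j|^2}{\lambda_N(j) + \lambda_S(j)} \right)

% Resulting likelihood ratio for bin j
\Lambda_j = \frac{1}{1 + \xi_j} \exp\!\left( \frac{\gamma_j \, \xi_j}{1 + \xi_j} \right),
\qquad \xi_j = \frac{\lambda_S(j)}{\lambda_N(j)},
\qquad \gamma_j = \frac{|Y_j|^2}{\lambda_N(j)}
```

The frame-level ratio is the product of the $\Lambda_j$ over all bins, which is why the quality of the signal model directly determines the quality of the test.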
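Multiple observation likelihood ratio test
• MO-LRT (multiple observation LRT)
• Given a set of N = 2m+1 consecutive observations centered on frame l: $\{\mathbf{y}_{l-m}, \dots, \mathbf{y}_l, \dots, \mathbf{y}_{l+m}\}$
• LRT: $L_{l,m} = \dfrac{p(\mathbf{y}_{l-m}, \dots, \mathbf{y}_{l+m} \mid H_1)}{p(\mathbf{y}_{l-m}, \dots, \mathbf{y}_{l+m} \mid H_0)}$
• Under statistical independence: $L_{l,m} = \prod_{k=l-m}^{l+m} \dfrac{p(\mathbf{y}_k \mid H_1)}{p(\mathbf{y}_k \mid H_0)}$
• Recursive log-LRT: $\ell_{l+1,m} = \ell_{l,m} - \ln \Lambda(\mathbf{y}_{l-m}) + \ln \Lambda(\mathbf{y}_{l+m+1})$, with $\ell_{l,m} = \ln L_{l,m}$ and $\Lambda(\mathbf{y}_k) = p(\mathbf{y}_k \mid H_1) / p(\mathbf{y}_k \mid H_0)$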
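A recursive log-LRT over 2m+1 frames could be computed as in the sketch below. It assumes the per-frame log likelihood ratio follows the Gaussian model of Sohn et al., that noise and speech variances per bin are already estimated, and that the window is centered on the current frame; the function names are hypothetical.

```python
import numpy as np


def frame_log_lr(power, noise_var, speech_var):
    """Per-frame log likelihood ratio, summed over frequency bins.

    power: |Y_j|^2 with shape (n_frames, n_bins); the variances have
    shape (n_bins,). Gaussian model assumed for both hypotheses.
    """
    xi = speech_var / noise_var        # a priori SNR per bin
    gamma = power / noise_var          # a posteriori SNR per bin
    return np.sum(gamma * xi / (1.0 + xi) - np.log1p(xi), axis=-1)


def mo_log_lrt(frame_llr, m):
    """Sum the frame log-LRs over 2m+1 frames using the recursive update.

    out[i] is the statistic for the window centered on frame i + m.
    """
    frame_llr = np.asarray(frame_llr, dtype=float)
    n = frame_llr.size
    if n < 2 * m + 1:
        raise ValueError("need at least 2m+1 frames")
    out = np.empty(n - 2 * m)
    acc = frame_llr[: 2 * m + 1].sum()
    out[0] = acc
    for i in range(1, out.size):
        # Add the frame entering the window, drop the one leaving it.
        acc += frame_llr[i + 2 * m] - frame_llr[i - 1]
        out[i] = acc
    return out
```

A speech/non-speech decision would compare each value of `out` with a threshold; the decision for frame `l` is only available once frame `l + m` has been observed, which is the delay analyzed on the following slides.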
Analysis: Optimum delay • Increasing m (the number of observations): • Reduces the overlap between the distributions • Misclassification errors: reduced for speech, moderately increased for non-speech • [Figure: probability distributions and classification errors as a function of m]
Analysis: Optimum delay • ROC analysis: AURORA 3 Spanish (High-Ch1, 5 dB) • [Figure: ROC curves for MO-LRT and Sohn's VAD]
Speech recognition experiments • [Block diagram: VAD, noise estimation, Wiener filtering (WF), frame dropping (FD), MFCC, HTK] • [Table: AURORA 2, average Wacc (%) for CT and MCT]
Speech recognition experiments • [Table: AURORA 3, Spanish SpeechDat-Car]
Work in progress
• Statistical tests in the bispectrum domain:
  - J. M. Górriz, et al., “Voice Activity Detection Based on HOS”, 8th International Work-Conference on Artificial Neural Networks (IWANN'2005)
  - J. M. Górriz, et al., “Statistical Tests for Voice Activity Detection”, Non-linear Speech Processing (NOLISP'2005), 2005
  - J. M. Górriz, et al., “Bispectra analysis-based VAD for robust speech recognition”, First International Work-Conference on the Interplay Between Natural and Artificial Computation (IWINAC'2005)
• Bispectrum LRT (application of MO-LRT to the bispectra):
  - J. M. Górriz, et al., “An Improved MO-LRT VAD Based on a Bispectra Gaussian Model”, submitted to Electronics Letters
GSTC-UGR speech recognition results
• [Block diagram: noise reduction, LTSE VAD, frame dropping, segmental ECDF (Gaussian ref., progressive), HTK]
• LTSE VAD: J. Ramírez, et al., “Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information”, Speech Communication, Vol. 42/3-4, pp. 271-287, 2004
• Segmental ECDF, 60-frame delay: J. C. Segura, et al., “Cepstral Domain Segmental Nonlinear Feature Transformations for Robust Speech Recognition”, IEEE Signal Processing Letters, Vol. 11, No. 5, pp. 517-520, 2004
• Progressive: Log-E plus cepstral coefficients up to the 4th
GSTC-UGR speech recognition results • AURORA 2, relative WER improvements: 12% (MCT), 59% (CT) • AURORA 3, relative WER improvements: 60% (WM), 46% (MM), 73% (HM)
GSTC-UGR speech recognition results • AURORA 4, WER (%), clean training experiments • Relative WER improvements: 20% (test sets 1-7), 17% (test sets 8-14)