
CHiME Challenge: Approaches to Robustness using Beamforming and Uncertainty-of-Observation Techniques

Dorothea Kolossa¹, Ramón Fernandez Astudillo², Alberto Abad², Steffen Zeiler¹, Rahim Saeidi³, Pejman Mowlaee¹, João Paulo da Silva Neto², Rainer Martin¹


Presentation Transcript


  1. CHiME Challenge: Approaches to Robustness using Beamforming and Uncertainty-of-Observation Techniques
  Dorothea Kolossa¹, Ramón Fernandez Astudillo², Alberto Abad², Steffen Zeiler¹, Rahim Saeidi³, Pejman Mowlaee¹, João Paulo da Silva Neto², Rainer Martin¹

  2. Overview
  • Uncertainty-Based Approach to Robust ASR
  • Uncertainty Estimation by Beamforming & Propagation
  • Recognition under Uncertain Observations
  • Further Improvements
    • Training: Full-Covariance Mixture Splitting
    • Integration: ROVER
  • Results and Conclusions

  3. Introduction: Uncertainty-Based Approach to ASR Robustness
  • Speech enhancement in the time-frequency domain is often very effective.
  • However, speech enhancement itself can neither remove all distortions and sources of mismatch completely, nor avoid introducing artifacts of its own.
  • Simple example: time-frequency masking applied to a mixture.

  4. Introduction: Uncertainty-Based Approach to ASR Robustness
  How can the decoder handle such artificially distorted signals? One possible compromise:
  [Block diagram, all in the time-frequency domain: m(n) → STFT → Speech Processing → estimate Xkl and mask Mkl from noisy Ykl → Missing-Feature HMM Speech Recognition]
  Problem: recognition performs significantly better in other domains, so the missing-feature approach may perform worse than feature reconstruction [1].
  [1] B. Raj and R. Stern, "Reconstruction of Missing Features for Robust Speech Recognition," Speech Communication 43, pp. 275-296, 2004.

  5. Introduction: Uncertainty-Based Approach to ASR Robustness
  Solution used here: transform the uncertain features to the desired domain of recognition.
  [Block diagram: m(n) → STFT → Speech Processing → estimate Xkl and mask Mkl → Uncertainty Propagation (TF domain → recognition domain) → Missing-Data HMM Speech Recognition]

  6. Introduction: Uncertainty-Based Approach to ASR Robustness
  [Build of the same diagram: the point estimate and mask are replaced by the clean-speech posterior p(Xkl|Ykl) in the TF domain.]

  7. Introduction: Uncertainty-Based Approach to ASR Robustness
  [Build of the same diagram: uncertainty propagation maps p(Xkl|Ykl) to the recognition-domain posterior p(xkl|Ykl), which feeds uncertainty-based HMM speech recognition.]

  8. Uncertainty Estimation & Propagation
  • Posterior estimation is performed here using one of four beamformers:
    • Delay and Sum (DS)
    • Generalized Sidelobe Canceller (GSC) [2]
    • Multichannel Wiener Post-Filter (WPF)
    • Integrated Wiener Filtering with Adaptive Beamformer (IWAB) [3]
  [2] O. Hoshuyama, A. Sugiyama, and A. Hirano, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," IEEE Trans. Signal Processing, vol. 47, no. 10, pp. 2677–2684, 1999.
  [3] A. Abad and J. Hernando, "Speech enhancement and recognition by integrating adaptive beamforming and Wiener filtering," in Proc. 8th International Conference on Spoken Language Processing (ICSLP), 2004, pp. 2657–2660.
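  As a minimal illustration of the simplest of these, the sketch below implements a frequency-domain delay-and-sum beamformer. This is not code from the paper; the function and variable names (delay_and_sum, delays) are illustrative.

```python
# Minimal delay-and-sum (DS) beamformer sketch in the STFT domain.
# Y: (n_channels, n_bins, n_frames) complex STFT; delays: per-channel
# time delays (in samples) toward the source. Illustrative only.
import numpy as np

def delay_and_sum(Y: np.ndarray, delays: np.ndarray, n_fft: int, fs: float) -> np.ndarray:
    n_ch, n_bins, n_frames = Y.shape
    freqs = np.arange(n_bins) * fs / n_fft          # bin center frequencies in Hz
    # Phase shifts that time-align each channel: exp(j 2 pi f tau_c)
    steer = np.exp(2j * np.pi * freqs[None, :] * (delays[:, None] / fs))
    # Align the channels and average them
    return np.mean(steer[:, :, None] * Y, axis=0)
```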

  9. Uncertainty Estimation & Propagation
  • The posterior of the clean speech, p(Xkl|Ykl), is then propagated into the domain of the ASR.
  • Feature extraction:
    • STSA-based MFCCs
    • CMS per utterance
    • possibly LDA
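  A minimal sketch of such a pipeline (MFCCs computed from an enhanced short-time spectral amplitude estimate, followed by per-utterance CMS). The paper's actual feature code is not available; the precomputed mel filterbank `mel_fbank` and all names here are assumptions.

```python
# Illustrative STSA-based MFCC + per-utterance CMS pipeline.
import numpy as np
from scipy.fftpack import dct

def stsa_mfcc_cms(A: np.ndarray, mel_fbank: np.ndarray, n_ceps: int = 13) -> np.ndarray:
    """A: (n_bins, n_frames) magnitude estimates -> (n_ceps, n_frames) features."""
    mel = mel_fbank @ (A ** 2)                       # mel-filtered power spectrum
    logmel = np.log(np.maximum(mel, 1e-10))          # floor avoids log(0)
    ceps = dct(logmel, type=2, norm='ortho', axis=0)[:n_ceps]
    return ceps - ceps.mean(axis=1, keepdims=True)   # cepstral mean subtraction
```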

  10. Uncertainty Estimation & Propagation
  • Uncertainty model: complex Gaussian distribution
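  The equation on this slide is an image and missing from the transcript. A standard written-out form of such a model, consistent with [4] and [5], is a circularly symmetric complex Gaussian posterior with the point estimate as mean and an uncertainty λkl as variance:

```latex
p\!\left(X_{kl}\mid Y_{kl}\right)
  = \mathcal{N}_{\mathbb{C}}\!\left(X_{kl};\,\hat{X}_{kl},\,\lambda_{kl}\right)
  = \frac{1}{\pi\lambda_{kl}}
    \exp\!\left(-\,\frac{\lvert X_{kl}-\hat{X}_{kl}\rvert^{2}}{\lambda_{kl}}\right)
```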

  11. Uncertainty Estimation & Propagation
  • Two uncertainty estimators:
  • a) Channel-Asymmetry Uncertainty Estimation
    • The beamformer output is the input to a Wiener filter
    • The noise variance is estimated as the squared channel difference
    • The posterior is directly obtainable for the Wiener filter [4] (see the sketch below)
  [4] R. F. Astudillo and R. Orglmeister, "A MMSE estimator in mel-cepstral domain for robust large vocabulary automatic speech recognition using uncertainty propagation," in Proc. Interspeech, 2010, pp. 713–716.
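  The closed-form posterior shown on the slide is lost from the transcript. The sketch below gives one plausible reading of estimator (a): the squared difference of the two aligned channels serves as the noise variance, and the usual Wiener posterior follows (mean G·Y, variance G·σ_N²). Variable names and the gain floor are assumptions.

```python
# Sketch of channel-asymmetry uncertainty estimation (illustrative).
import numpy as np

def wiener_posterior(Y_bf, Y_left, Y_right, floor=1e-3):
    noise_var = np.abs(Y_left - Y_right) ** 2            # squared channel difference
    gain = np.maximum(1.0 - noise_var / (np.abs(Y_bf) ** 2 + 1e-12), floor)
    mean = gain * Y_bf                                   # posterior mean (Wiener estimate)
    var = gain * noise_var                               # posterior variance
    return mean, var
```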

  12. Uncertainty Estimation & Propagation
  • Two uncertainty estimators:
  • b) Equivalent Wiener Variance
    • The beamformer output is passed directly to feature extraction
    • The variance is estimated using the ratio of beamformer input and output, interpreted as a Wiener gain [4]
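  A sketch of one plausible reading of estimator (b). The closed form λ = G(1 − G)|Y_in|² is an assumption derived from the Wiener relations above (λ = G·σ_N² with σ_N² ≈ (1 − G)|Y_in|²), not a formula taken from the paper.

```python
# Sketch of the "equivalent Wiener variance" estimator (illustrative).
import numpy as np

def equivalent_wiener_variance(Y_in, Y_out):
    # Output-to-input power ratio, reinterpreted as a Wiener gain G in [0, 1]
    G = np.clip(np.abs(Y_out) ** 2 / (np.abs(Y_in) ** 2 + 1e-12), 0.0, 1.0)
    return G * (1.0 - G) * np.abs(Y_in) ** 2   # uncertainty of the point estimate
```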

  13. Uncertainty Propagation
  • The uncertainty propagation from [5] was used:
    • Propagation through the absolute value yields the MMSE-STSA estimate
    • Independent log-normal distributions are assumed after the filterbank
    • The posterior of the clean speech in the cepstral domain is assumed Gaussian
    • The CMS and LDA transformations are simple, since both are linear
  [5] R. F. Astudillo, "Integration of short-time Fourier domain speech enhancement and observation uncertainty techniques for robust automatic speech recognition," Ph.D. thesis, Technical University Berlin, 2010.
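  The log-normal assumption makes the propagation through the logarithm a closed-form moment inversion: if a filterbank output with mean μ and variance σ² is log-normal, the matching Gaussian in the log domain follows directly. The sketch below shows this single step (illustrative, not the thesis code); the subsequent DCT, CMS, and LDA steps are linear, so mean and covariance then propagate exactly.

```python
# Log-normal moment inversion: moments of log(x) from moments of x.
import numpy as np

def lognormal_log_moments(mu, var):
    """Mean/variance after log(), assuming a log-normal input distribution."""
    log_var = np.log1p(var / (mu ** 2 + 1e-12))      # sigma_log^2 = ln(1 + var/mu^2)
    log_mu = np.log(np.maximum(mu, 1e-12)) - 0.5 * log_var
    return log_mu, log_var
```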

  14. Recognition under Uncertain Observations
  • Standard observation likelihood for state q, mixture m
  • Uncertainty Decoding:
    L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. Speech and Audio Processing, vol. 13, no. 3, pp. 412–421, May 2005.
  • Modified Imputation:
    D. Kolossa, A. Klimas, and R. Orglmeister, "Separation and robust recognition of noisy, convolutive speech mixtures using time-frequency masking and missing data techniques," in Proc. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2005, pp. 82–85.
  • Both uncertainty-of-observation techniques collapse to the standard observation likelihood for Σx = 0 (see the formulas below).
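  The three formulas on this slide are images and missing from the transcript. Consistent with the cited papers, they plausibly read as follows, where N(x; μx, Σx) is the propagated feature posterior:

```latex
% Standard observation likelihood for state q, mixture m:
b_{qm}(\mathbf{x}) = \mathcal{N}\!\left(\mathbf{x};\,\boldsymbol{\mu}_{qm},\,\boldsymbol{\Sigma}_{qm}\right)

% Uncertainty decoding (Deng et al., 2005): inflate the model variances
b_{qm}(\mathbf{x}) \approx \mathcal{N}\!\left(\boldsymbol{\mu}_x;\,\boldsymbol{\mu}_{qm},\,\boldsymbol{\Sigma}_{qm}+\boldsymbol{\Sigma}_x\right)

% Modified imputation (Kolossa et al., 2005): evaluate the standard
% likelihood at the MAP point combining feature posterior and model prior
\hat{\mathbf{x}}_{qm}
  = \left(\boldsymbol{\Sigma}_x^{-1}+\boldsymbol{\Sigma}_{qm}^{-1}\right)^{-1}
    \left(\boldsymbol{\Sigma}_x^{-1}\boldsymbol{\mu}_x
        + \boldsymbol{\Sigma}_{qm}^{-1}\boldsymbol{\mu}_{qm}\right),
  \qquad b_{qm}\!\left(\hat{\mathbf{x}}_{qm}\right)
```

  For Σx = 0 the posterior collapses onto μx, so both expressions reduce to the standard likelihood evaluated at μx.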

  15. Further Improvements
  • Training: Informed Mixture Splitting
  • Baum-Welch training is only locally optimal, so good initialization and good split directions matter.
  • Therefore, considering the covariance structure in mixture splitting is advantageous: split along the maximum-variance axis.
  [Figure: scatter plot in the (x1, x2) plane illustrating the split direction]

  16. Further Improvements
  • Training: Informed Mixture Splitting
  • Baum-Welch training is only locally optimal, so good initialization and good split directions matter.
  • Therefore, considering the covariance structure in mixture splitting is advantageous: split along the first eigenvector of the covariance matrix (see the sketch below).
  [Figure: the same scatter plot, with the split now along the first eigenvector]
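  A minimal sketch of such a split for one Gaussian component (illustrative names; the split distance d is the one used on slide 32):

```python
# Split one full-covariance Gaussian along its first eigenvector.
import numpy as np

def split_component(mean, cov, d=0.2):
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    v1 = eigvecs[:, -1] * np.sqrt(eigvals[-1])    # principal axis, scaled by std
    return mean + d * v1, mean - d * v1           # two new component means
```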

  17. Further Improvements
  • Integration: Recognizer Output Voting Error Reduction (ROVER)
  • Recognition outputs are combined at the word level by dynamic programming on a generated lattice, taking into account
    • the frequency of word labels and
    • the posterior word probabilities.
  • We use ROVER on the 3 jointly best systems, selected on the development set (a toy voting example follows below).
  J. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in IEEE Workshop on Automatic Speech Recognition and Understanding, Dec. 1997, pp. 347–354.
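  As a toy illustration of the voting step only: real ROVER first aligns the hypotheses into a word transition network by dynamic programming, which is omitted here, and can also weight votes by word posteriors.

```python
# Frequency-only vote over already-aligned word slots (toy example).
from collections import Counter

def rover_vote(aligned):
    """aligned: list of slots; each slot lists one word per system (None = deletion)."""
    result = []
    for slot in aligned:
        counts = Counter(w for w in slot if w is not None)
        if counts:                                   # skip slots where all systems delete
            result.append(counts.most_common(1)[0][0])
    return result

# e.g. rover_vote([["bin", "bin", "pin"], ["blue", "blue", "green"]]) -> ["bin", "blue"]
```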

  18. Results and Conclusions
  • Evaluation:
    • Two scenarios are considered: clean training and multicondition ('mixed') training.
    • In mixed training, all training data was used at all SNR levels, artificially adding randomly selected noise from the noise-only recordings.
    • Results are determined on the development set first.
    • After selecting the best-performing system on the development data, final results are obtained as keyword accuracies on the isolated sentences of the test set.

  19. Results and Conclusions
  • JASPER results after clean training
  * JASPER uses full-covariance training with MCE iteration control. Its token passing is equivalent to HTK.

  20. Results and Conclusions
  • JASPER results after clean training
  * Best strategy here: delay-and-sum beamformer + noise estimation + modified imputation

  21. Results and Conclusions
  • HTK results after clean training
  * Best strategy here: Wiener post-filter + uncertainty estimation

  22. Results and Conclusions
  • Results after clean training
  * Best strategy here: delay-and-sum beamformer + noise estimation

  23. Results and Conclusions
  • Overall results after clean training
  * ROVER combination: (JASPER + DS + MI) & (HTK + GSC + NE) & (JASPER + WPF + MI)

  24. Results and Conclusions
  • JASPER results after multicondition training

  25. Results and Conclusions
  • JASPER results after multicondition training
  * Best JASPER setup here: delay-and-sum beamformer + noise estimation + modified imputation + LDA to 37 dimensions

  26. Results and Conclusions
  • JASPER results after multicondition training
  * Best JASPER setup here: delay-and-sum beamformer + noise estimation + modified imputation + LDA to 37 dimensions

  27. Results and Conclusions
  • HTK results after multicondition training
  * Best HTK setup here: delay-and-sum beamformer + noise estimation

  28. Results and Conclusions
  • HTK results after multicondition training
  * Best HTK setup here: delay-and-sum beamformer + noise estimation

  29. Results and Conclusions
  • Overall results after multicondition training
  * (JASPER + DS + MI + LDA) & (JASPER + WPF, no observation uncertainties) & (HTK + DS + NE)

  30. Results and Conclusions
  • Conclusions
    • Beamforming provides an opportunity to estimate not only the clean signal but also its standard error.
    • This error, the observation uncertainty, can be propagated to the MFCC domain or another domain suitable for improving ASR by uncertainty-of-observation techniques.
    • Best results were attained for uncertainty propagation with modified imputation.
    • Training is critical, and despite strange philosophical implications, observation uncertainties improve the behaviour after multicondition training as well.
    • The strategy is simple and generalizes easily to LVCSR.

  31. Thank you!

  32. Further Improvements
  • Training: MCE-Guided Training
  • Iteration and splitting control is done by a minimum classification error (MCE) criterion on a held-out dataset.
  • Algorithm for mixture splitting (a runnable sketch follows below):
    • initialize the split distance d
    • while m < numMixtures:
      • split all mixtures by distance d along the 1st eigenvector
      • carry out re-estimations until the accuracy improves no more
      • if acc_m >= acc_(m-1): m = m + 1
      • else: go back to the previous model; d = d / f
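  A structural sketch of this loop in Python. The helper functions (split_all, reestimate, accuracy) stand in for the actual JASPER training code and must be supplied by the caller; the default values of d and f are illustrative assumptions, as the slide does not give them.

```python
# MCE-guided mixture splitting loop (structural sketch of slide 32).
def mce_guided_training(model, num_mixtures, split_all, reestimate, accuracy,
                        d=0.2, f=2.0):
    m, acc_prev = 1, accuracy(model)
    while m < num_mixtures:
        candidate = reestimate(split_all(model, d))   # split along 1st eigenvector,
        acc = accuracy(candidate)                     # then re-estimate until no gain
        if acc >= acc_prev:                           # MCE check on held-out data
            model, acc_prev, m = candidate, acc, m + 1
        else:                                         # reject split: keep old model,
            d = d / f                                 # retry with a smaller distance
    return model
```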
