ICCS-NTUA: WP1+WP2
Prof. Petros Maragos
NTUA, School of ECE
URL: http://cvsp.cs.ntua.gr
HIWIRE
Computer Vision, Speech Communication and Signal Processing Research Group
HIWIRE Meeting, July 2006
ICCS-NTUA in HIWIRE
• Evaluation
  • Databases & Baseline: Completed
  • Platform Front-end: Release of 1st Version
• WP1
  • Noise Robust Features: Completed
  • Multi-mic. Array Enhancement: Prelim. Results
  • Fusion: Prelim. Results
  • Audio-Visual ASR: Baseline + Adv. Visual Features
  • VAD: Completed + Integration
• WP2
  • VTLN Platform Integration: Completed
  • Speaker Normalization Research: Prelim. Results
  • Non-native Speech Database: Completed
HIWIRE Advanced Front-end: Challenges
Points considered during implementation:
• Modular architecture
• Implementation in C code
• Incorporation of different ideas/algorithms
• User-friendly interface providing additional options addressing the on-site demands of the project
HIWIRE Advanced Front-end: Options
[Flow diagram: input speech → optional VAD (LTSD-VAD / MTE-VAD) → optional Wiener denoising → feature extraction (MFCC / TECC)]
1. Support for input speech signals
  • Sampling frequencies: 8 kHz, 11 kHz, 16 kHz
  • Byte ordering: little-endian, big-endian
  • Input file formats: RAW, NIST, HTK
2. Flags/options provided
  • Pre-processing smoothing of speech signals: Hamming windowing, pre-emphasis
  • Denoising/VAD algorithms: LTSD-VAD (UGR), MTE-VAD (ICCS-NTUA), Wiener denoising (used only with a VAD algorithm)
3. Output features
  • MFCC or TECC
  • C0 or logE
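The option flow above can be sketched as a small dispatcher. This is an illustrative sketch only: the function and stage names are placeholders, not the actual C implementation of the front-end.

```python
def front_end(vad=None, denoise=False, features="MFCC", fs=16000):
    """Sketch of the HIWIRE front-end option flow (illustrative names)."""
    if fs not in (8000, 11000, 16000):
        raise ValueError("unsupported sampling frequency")
    if vad not in (None, "LTSD", "MTE"):
        raise ValueError("unknown VAD algorithm")
    if denoise and vad is None:
        # Wiener denoising is used only together with a VAD algorithm
        raise ValueError("Wiener denoising requires a VAD (LTSD or MTE)")
    stages = []
    if vad:
        stages.append(vad + "-VAD")
    if denoise:
        stages.append("Wiener denoising")
    stages.append(features)  # "MFCC" or "TECC", plus C0 or logE
    return stages
```

For example, `front_end(vad="LTSD", denoise=True)` yields the stage list `["LTSD-VAD", "Wiener denoising", "MFCC"]`, mirroring the flow diagram.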
HIWIRE Advanced Front-end: Things to Be Done
• The script is in its testing phase
• Create a CVS repository where additional modules can be included
• Test further on speech databases (evaluation in progress)
• Fine-tuning is necessary
• The final version should be faster (real-time processing)
• Incorporate it into the HIWIRE platform
ICCS-NTUA in HIWIRE: 1st, 2nd Year
• Evaluation
  • Databases & Baseline: Completed
  • Platform Front-end: Release of 1st Version
• WP1
  • Noise Robust Features: Completed
  • Multi-mic. Array Enhancement: Prelim. Results
  • Fusion: Prelim. Results
  • Audio-Visual ASR: Baseline + Adv. Visual Features
  • VAD: Completed + Integration?
• WP2
  • VTLN Platform Integration: Completed
  • Speaker Normalization Research: Prelim. Results
  • Non-native Speech Database: Completed
Microphone Arrays
• Multi-channel speech enhancement for diffuse noise fields
• MVDR (Minimum Variance Distortionless Response) beamforming
• Single-channel linear and non-linear post-filtering
  • The MSE criterion leads to the linear Wiener post-filter.
  • The MSE-STSA and MSE log-STSA criteria lead to non-linear post-filters.
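A minimal numerical sketch of these two components, using the standard formulations rather than the project code: MVDR minimizes the output noise power subject to the distortionless constraint w^H d = 1, and the MSE criterion gives the Wiener post-filter gain on the beamformer output.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """w = Phi_NN^{-1} d / (d^H Phi_NN^{-1} d): minimum output noise
    variance subject to the distortionless constraint w^H d = 1."""
    num = np.linalg.solve(noise_cov, steering)
    return num / (steering.conj() @ num)

def wiener_postfilter(phi_ss, phi_nn):
    """MSE-optimal (Wiener) post-filter gain from the speech and noise
    power spectra, H = Phi_SS / (Phi_SS + Phi_NN)."""
    return phi_ss / (phi_ss + phi_nn)
```

Note that for spatially white noise (identity covariance), the MVDR weights reduce to delay-and-sum, w = d/M.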
Microphone Arrays
• The overall speech enhancement system includes the following steps:
  1. The noisy channels' inputs are fed into a time-alignment module (different propagation paths for every input channel).
  2. The time-aligned noisy observations are projected to a single-channel output with minimum noise variance by the MVDR beamformer.
  3. The beamformer output is further processed by a post-filter according to the chosen speech enhancement criterion (MSE, MSE-STSA, MSE log-STSA).
  4. Since the post-filters depend on second-order statistics of the source and noise signals, an estimation scheme for these statistics must be developed.
• Results on the CMU database
  • 10 speakers (13 utterances)
  • Diffuse noise
  • SSNR enhancement: SSNR_output − E[SSNR_input], where E[·] denotes the mean over the N input channels
  • LAR, LSD, IS, LLR: low values signify high speech quality; these measures are found to correlate highly with human perception
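The SSNR figure of merit can be made concrete with a minimal segmental-SNR sketch; the frame length is an illustrative choice, and the usual per-frame clipping conventions are omitted for brevity.

```python
import math

def seg_snr(clean, processed, frame=256, eps=1e-12):
    """Mean per-frame SNR (dB) of `processed` against the `clean` reference.
    SSNR enhancement is then SSNR(output) minus the mean SSNR of the
    N input channels."""
    snrs = []
    for i in range(0, len(clean) - frame + 1, frame):
        sig = sum(s * s for s in clean[i:i + frame])
        err = sum((s - p) ** 2 for s, p in
                  zip(clean[i:i + frame], processed[i:i + frame]))
        snrs.append(10.0 * math.log10((sig + eps) / (err + eps)))
    return sum(snrs) / len(snrs)
```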
Results: CMU Database
Spectrograms: CMU Database
Multi-Microphone ASR Experiments
• Details of the ASR task setup:
  • 700 sentences for training and 300 for testing
  • 12-state left-right HMMs with Gaussian mixtures
  • All-pair, unweighted grammar
  • MFCC+C0+D+DD (39 coefficients in total)
ICCS-NTUA in HIWIRE: 1st, 2nd Year
• Evaluation
  • Databases & Baseline: Completed
  • Platform Front-end: Release of 1st Version
• WP1
  • Noise Robust Features: Completed
  • Multi-mic. Array Enhancement: Prelim. Results
  • Fusion: Prelim. Results
  • Audio-Visual ASR: Baseline + Adv. Visual Features
  • VAD: Completed + Integration?
• WP2
  • VTLN Platform Integration: Completed
  • Speaker Normalization Research: Prelim. Results
  • Non-native Speech Database: Completed
Multi-Cue Feature Fusion
• Goal:
  • Fuse heterogeneous information streams optimally & adaptively
• Our approach:
  • Explicitly model uncertainty in all feature measurements (due to noise or model-fitting errors)
  • Adjust model training to accommodate uncertainty
  • Dynamically compensate feature uncertainty during decoding
• Feature uncertainty estimation in the AV-ASR case:
  • For the audio stream (MFCC): speech enhancement process
  • For the visual stream: model-fitting variance
• Properties:
  • Adaptation at the frame level
  • Explains and generalizes cue weighting through stream exponents
  • Integrates with a wide range of models, e.g. GMM, HMM
  • Applicable to both audio-audio and audio-visual scenarios
  • Can be combined with asynchronous models, e.g. Product-HMM
Measurement Noise and Adaptive Fusion
[Graphical models: class C → features X (conventional view) vs. class C → clean features X → noisy measurements Y (our view)]
• Conventional view: features are directly observable
• Our view: we can only measure noise-corrupted features
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO'06
EM-Training with Partially Known Features
• Even training data can be uncertain
[Graphical models: conventional view — class C hidden, features X observed; our view — C and X hidden, only noisy measurements Y observed]
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS'06
EM-Training: Results for GMM
• E-step: similar to the conventional update rules, but with uncertainty-compensated scores
• M-step: uses the filtered feature estimate
• The formulas for HMM are similar
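In the scalar, diagonal-covariance case, the uncertainty-compensated scores and the filtered feature estimate can be sketched as follows. This is a simplified illustration of the idea, with hypothetical names, not the slide's exact update rules.

```python
import math

def soft_e_step(y, var_noise, means, variances, weights):
    """E-step with an uncertain observation y ~ N(x, var_noise):
    each component is scored with its variance inflated by var_noise,
    and the filtered (posterior-mean) estimate of the clean feature x
    is a precision-weighted blend of y and the component means."""
    scores = []
    for mu, v, w in zip(means, variances, weights):
        s = v + var_noise  # uncertainty-compensated variance
        scores.append(w * math.exp(-(y - mu) ** 2 / (2 * s))
                      / math.sqrt(2 * math.pi * s))
    total = sum(scores)
    post = [sc / total for sc in scores]
    x_hat = sum(p * (v * y + var_noise * mu) / (v + var_noise)
                for p, mu, v in zip(post, means, variances))
    return post, x_hat
```

With `var_noise = 0` this reduces to the conventional GMM E-step and the filtered estimate equals the observation itself.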
Decoding & Uncertain Features
• Variance-compensated ("soft") scoring
• Probabilistic justification for stream exponents
• The relative measurement error drives adaptation at each frame: stream-, class-, and mixture-dependent stream weights
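The stream-exponent interpretation can be checked numerically in the scalar case: inflating the variance of a two-class Gaussian score shrinks the log-likelihood ratio by exactly var/(var + var_noise), i.e. soft scoring acts like a frame-dependent stream weight below one. A toy illustration, not the project code:

```python
def log_lr(y, mu1, mu2, var):
    """Log-likelihood ratio of two equal-variance scalar Gaussians."""
    return (mu1 - mu2) * (y - 0.5 * (mu1 + mu2)) / var

y, mu1, mu2, var, var_noise = 0.3, 0.0, 1.0, 0.5, 1.5
plain = log_lr(y, mu1, mu2, var)                 # conventional score
compensated = log_lr(y, mu1, mu2, var + var_noise)  # soft score
effective_weight = var / (var + var_noise)       # implied stream exponent
assert abs(compensated - effective_weight * plain) < 1e-12
```

The more uncertain the measurement, the smaller the effective exponent, so the noisy stream contributes less to the decision at that frame.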
Audio-Visual Asynchrony Modeling
• Multi-stream HMM
• Product HMM
Ref: Gravier et al., 2002
Fusion: Multi-Cue Audio-Audio
• Feature uncertainty for audio features
• Baseline audio features: MFCC
  • Enhancement using a GMM of clean speech and a Vector Taylor Series approximation
  • Uncertainty is Gaussian, with variance given by the enhancement process
  • Used for audio-visual fusion
• Fractal audio features: MFD
  • Ongoing research applying a similar framework (GMM, VTS)
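The Vector Taylor Series step relies on the standard log-mel mismatch model y = x + log(1 + e^{n−x}); the derivative of this mapping is what propagates the clean-speech model variance into a feature-uncertainty estimate. A scalar sketch of that standard model (additive noise only; not the project code):

```python
import math

def vts_noisy_logmel(x, n):
    """Log-mel energy of clean speech x corrupted by additive noise n:
    y = x + log(1 + exp(n - x))."""
    return x + math.log1p(math.exp(n - x))

def vts_sensitivity(x, n):
    """dy/dx = 1 / (1 + exp(n - x)); near 1 at high SNR, near 0 at low
    SNR.  A first-order uncertainty estimate scales the clean-model
    variance by the square of this term."""
    return 1.0 / (1.0 + math.exp(n - x))
```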
MFD: From Noisy Speech to Feature Uncertainty
• Ongoing research: noise compensation for MFD
[Figure: MFD trajectories for clean speech, true noisy speech, and estimated noisy speech; white noise at 0 dB]
ICCS-NTUA in HIWIRE: 1st, 2nd Year
• Evaluation
  • Databases & Baseline: Completed
  • Platform Front-end: Release of 1st Version
• WP1
  • Noise Robust Features: Completed
  • Multi-mic. Array Enhancement: Prelim. Results
  • Fusion: Prelim. Results
  • Audio-Visual ASR: Baseline + Adv. Visual Features
  • VAD: Completed + Integration?
• WP2
  • VTLN Platform Integration: Completed
  • Speaker Normalization Research: Prelim. Results
  • Non-native Speech Database: Completed
Showcase: Audio-Visual Speech Recognition
• Both shape & texture can assist lipreading
• Active Appearance Models for face modeling
• Shape and texture of faces "live" in low-dimensional manifolds
• Features: AAM fitting (a nonlinear least-squares problem)
• Visual feature uncertainty is related to the sensitivity of the least-squares solution
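A generic Gauss-Newton sketch of this idea: the AAM fit is a nonlinear least-squares problem, and the sensitivity of its solution (hence the visual-feature uncertainty) is captured, up to a residual-noise-variance scale, by (JᵀJ)⁻¹ at the optimum. The function names are illustrative; this is not the actual fitting code.

```python
import numpy as np

def gauss_newton(residual, jacobian, p0, iters=20):
    """Minimize ||residual(p)||^2 by Gauss-Newton; returns the solution
    and (J^T J)^{-1}, which, scaled by the residual noise variance,
    estimates the covariance (uncertainty) of the fitted parameters."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        r, J = residual(p), jacobian(p)
        p = p - np.linalg.solve(J.T @ J, J.T @ r)
    J = jacobian(p)
    return p, np.linalg.inv(J.T @ J)
```

For a linear model, residual(p) = A p − b, the iteration converges to the least-squares solution in a single step, which makes the sketch easy to check.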
Demo: AAM Fitting and Uncertainty Estimates
• The visual front-end supplies both features and their respective uncertainties.
Audio-Visual ASR: Database
• Subset of the CUAVE database used:
  • 36 speakers (30 training, 6 testing)
  • 5 sequences of 10 connected digits per speaker
  • Training set: 1500 digits (30×5×10)
  • Test set: 300 digits (6×5×10)
• CUAVE also contains more complex data sets: speaker moving around, speaker in profile, continuous digits, two speakers (to be used in future evaluations)
• CUAVE was kindly provided by Clemson University
Evaluation on the CUAVE Database
Audio-Visual Speech Classification with MS-HMM
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO'06
AV Digit Classification Results (Word Accuracy)
• Audio: MFCC_D_Z (26 features)
• Visual: 6 shape + 12 texture AAM coefficients
• AV MS-HMM: audio-visual multi-stream HMM, weights (1,1)
• AV MS-HMM, Var-Comp: audio-visual multi-stream HMM + variance compensation
• AV P-HMM: audio-visual product HMM, weights (1,1)
• AV P-HMM, Var-Comp: audio-visual product HMM + variance compensation
Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP'06
AV-ASR: Results with Uncertain Training
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS'06
ICCS-NTUA in HIWIRE: 1st, 2nd Year
• Evaluation
  • Databases & Baseline: Completed
  • Platform Front-end: Release of 1st Version
• WP1
  • Noise Robust Features: Completed
  • Multi-mic. Array Enhancement: Prelim. Results
  • Fusion: Prelim. Results
  • Audio-Visual ASR: Baseline + Adv. Visual Features
  • VAD: Completed + Integration?
• WP2
  • VTLN Platform Integration: Completed
  • Speaker Normalization Research: Prelim. Results
  • Non-native Speech Database: Completed
VTLN on the Platform
• Warping in the front-end
  • Piecewise-linear warping function
  • Warping in the filterbank domain by stretching or compressing the frequency axis
• Training: HTK implementation
• Testing: fast implementation using a GMM representing normalized speech to estimate warping factors per utterance
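The piecewise-linear warp can be sketched as follows; the breakpoint fraction here is an illustrative choice (HTK uses a comparable two-segment scheme), and the endpoints are pinned so that 0 and the maximum frequency map to themselves.

```python
def vtln_warp(f, alpha, f_max=8000.0, break_frac=0.875):
    """Piecewise-linear VTLN warping: scale frequencies by alpha below
    the breakpoint, then interpolate linearly so that f_max maps
    exactly to f_max (the warped axis stays within the filterbank)."""
    fb = break_frac * f_max
    if f <= fb:
        return alpha * f
    # linear segment through (fb, alpha*fb) and (f_max, f_max)
    return alpha * fb + (f_max - alpha * fb) * (f - fb) / (f_max - fb)
```

Applying this function to the mel filterbank center frequencies, with alpha estimated per speaker (or per utterance, as on the platform), stretches or compresses the frequency axis as described above.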
VTLN on the Platform: Results
VTLN Research: TECC Features
• Teager Energy Cepstrum Coefficients are energy measurements at the output of a Gammatone filterbank, similarly to MFCC
• VTLN can be applied in a similar manner: the Bark scale along which the filters are uniformly positioned is stretched or shrunk appropriately to achieve warping
• Evaluation is currently in progress
VTLN Research: Using Formants
Raw Formants: Dynamic Programming
[Figure: node-time trellis of raw formant candidates]
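The node-time trellis suggests a standard dynamic-programming smoother: pick one raw-formant candidate per frame so that the cumulative frequency jumps are minimized. A toy Viterbi-style sketch (the candidate lists and the absolute-difference transition cost are illustrative, not the actual tracker):

```python
def dp_track(cands):
    """Viterbi over a node-time trellis: choose one candidate frequency
    per frame so that the sum of |f_t - f_{t-1}| jumps is minimal."""
    best = [0.0] * len(cands[0])   # best cumulative cost per node
    back = []                      # back-pointers, one list per frame
    for t in range(1, len(cands)):
        new_best, bp = [], []
        for f in cands[t]:
            costs = [best[j] + abs(f - cands[t - 1][j])
                     for j in range(len(cands[t - 1]))]
            j = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j])
            bp.append(j)
        back.append(bp)
        best = new_best
    # backtrack from the cheapest final node
    j = min(range(len(best)), key=best.__getitem__)
    path = [cands[-1][j]]
    for t in range(len(back) - 1, -1, -1):
        j = back[t][j]
        path.append(cands[t][j])
    return path[::-1]
```

Given frames with a low and a high candidate each, the tracker follows the smooth low track rather than jumping between the two.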
Formant Tracking
ICCS-NTUA in HIWIRE: 1st, 2nd Year
• Evaluation
  • Databases & Baseline: Completed
  • Platform: Release of 1st Version
• WP1
  • Noise Robust Features: Completed
  • Multi-mic. Array Enhancement: Prelim. Results
  • Fusion: Prelim. Results
  • Audio-Visual ASR: Baseline + Adv. Visual Features
  • VAD: Completed + Integration?
• WP2
  • VTLN Platform Integration: Completed
  • Speaker Normalization Research: Prelim. Results
  • Non-native Speech Database: Completed
Next...
• Fusion: audio+audio, audio+visual, nonlinear features + visual
• Visual front-end
• VAD + nonlinear features