HIWIRE Progress Report Technical University of Crete Speech Processing and Dialog Systems Group Presenters: Alex Potamianos (WP1), Vassilis Diakoloukas (WP2)
Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection
Baseline • Baseline Performance Completed • Aurora 2 on HTK • Aurora 3 on HTK • Aurora 4 on HTK • Lattices for Aurora 4 • Baseline Performance Ongoing • WSJ1 (Decipher) • DMHMMs (Decipher)
Aurora 2 Database • Based on TIdigits downsampled to 8 kHz • Noise artificially added at several SNRs • 3 sets of noises • A: subway, babble, car, exhib. hall • B: restaurant, street, airport, train station • C: subway, street (with different frequency characteristics) • Two training conditions • Training on clean data • Multi-condition training on noisy data
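For reference, the noise-addition step above amounts to scaling a noise signal so that the mixture reaches a target SNR. A minimal sketch, assuming a simple energy-based SNR definition (the actual Aurora 2 recipe applies specific filtering and weighting not reproduced here):

```python
import numpy as np

# Minimal sketch of mixing noise into clean speech at a target SNR (dB), as in the
# artificial noise addition used for Aurora 2. The energy-based SNR definition is an
# assumption; the official Aurora 2 tools use a more detailed recipe.

def add_noise_at_snr(speech, noise, snr_db):
    """speech, noise: 1-D float arrays of equal length; returns the noisy mixture."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled_noise

# Example: mix white noise into a synthetic tone at 10 dB SNR (8 kHz sampling rate).
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(speech, np.random.randn(len(t)), snr_db=10)
```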
Aurora 2 Database • 8440 training sentences • 1001 test sentences per test set • Three front-end configurations • HTK default • WI007 (Aurora 2 distribution) • WI008 (thanks to Prof. Segura)
Aurora 2: Clean training • HTK default Front-End
Aurora 2: Multi-Condition training • HTK default Front-End
Aurora 3 Database • 5 languages • Finnish • German • Italian • Spanish • Danish • 3 noise conditions • quiet • low noisy (low) • high noisy (high) • 2 recording modes • close-talking microphone (ch0) • hands-free microphone (ch1)
Aurora 3 Database • 3 experimental setups • Well-Matched (WM) • 70% of all utts in “quiet, low, high” conditions were used for training • remaining 30% were used for testing • Medium Mismatched (MM) • 100% hands-free recordings from “quiet” and “low” for training • 100% hands-free recordings from “high” for testing • High Mismatched (HM) • 70% of close-talking recordings from all noise conditions for training • 30% of hands-free recordings from “low” and “high” for testing
Aurora 4 Database • Based on the WSJ phase 0 (WSJ0) collection • 5000-word vocabulary • 7138 training utterances (as in the ARPA evaluation) • 2 recording microphones • 6 different noises artificially added • Car, Babble, Restaurant, Street, Airport, Train Station
Aurora 4 Training Data Sets • 3 training conditions: Clean, Multicondition, Noisy • Clean training: 7138 utterances (as in the ARPA evaluation), no noise added • Multicondition training: 7138 utterances = 3569 (Sennheiser mic) + 3569 (2nd mic); in each half, 893 utterances have no noise added and 2676 have 1 of 6 noises added at SNRs between 10 and 20 dB
Aurora 4 Test Sets • 14 test sets • 2 sizes: small (166 utts) and large (330 utts) • Set 1: 330 utts, Sennheiser microphone, no noise added • Sets 2-7: 330 utts each, Sennheiser mic, one of noises 1-6 added at SNRs between 5 and 15 dB • Set 8: 330 utts, 2nd microphone, no noise added • Sets 9-14: 330 utts each, 2nd mic, one of noises 1-6 added at SNRs between 5 and 15 dB
Lattices • Obtained from the SONIC recognizer • Real-time decoding for the WSJ 5k task • State-of-the-art performance (8% WERR) • Lattices obtained from clean models • Three lattice sizes: small, medium, large • Fixed branching factor for each lattice size (small = 2.5, medium = 4, large = 5.5) • Speed-up factors compared to HTK decoding: x100, x50, x10
Aurora 4 Baseline: Conclusions on Lattices • Lattices speed up recognition • Medium-size lattices are ~60 times faster • Small-size lattices are ~108 times faster • Problem: improved performance on the noisy test sets • Be careful when using lattices in mismatched conditions (clean training, noisy data)! • Solution: two sets of lattices: matched and mismatched
Audio-Visual ASR: Database • Subset of CUAVE database used: • 36 speakers (30 training, 6 testing) • 5 sequences of 10 connected digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)
Audio-Visual ASR: Feature Extraction • Lip region of interest (ROI) tracking • A fixed size ROI is detected using template matching • ROI minimizes RGB-Euclidean distance with a given ROI template • ROI template is selected from 1st frame of each speaker • Continuity constraint: search within a 20x20 pixel window of previous frame ROI (does not work for rapid speaker movements)
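A minimal sketch of this template-matching tracker, assuming an illustrative fixed ROI size (the actual ROI dimensions are not stated above); the ±10-pixel search corresponds to the 20x20 window around the previous ROI position:

```python
import numpy as np

# Sketch of ROI tracking by template matching: minimize the RGB Euclidean distance
# to a fixed template (taken from the first frame), searching a 20x20 pixel window
# around the previous ROI position. ROI_H/ROI_W are illustrative assumptions.

ROI_H, ROI_W = 64, 96        # fixed ROI size (illustrative)
SEARCH = 10                  # +/- 10 pixels => 20x20 candidate positions

def track_roi(frame, template, prev_xy):
    """Return the top-left corner (x, y) of the ROI in `frame` (H x W x 3, RGB)."""
    px, py = prev_xy
    best_xy, best_dist = prev_xy, np.inf
    for dy in range(-SEARCH, SEARCH):
        for dx in range(-SEARCH, SEARCH):
            x, y = px + dx, py + dy
            if x < 0 or y < 0 or y + ROI_H > frame.shape[0] or x + ROI_W > frame.shape[1]:
                continue
            candidate = frame[y:y + ROI_H, x:x + ROI_W, :].astype(float)
            dist = np.sum((candidate - template) ** 2)   # RGB Euclidean distance
            if dist < best_dist:
                best_dist, best_xy = dist, (x, y)
    return best_xy

# Example on synthetic data: the template is cut from the frame at (50, 40),
# so the tracker should return exactly that position.
frame = np.random.randint(0, 256, size=(240, 320, 3))
template = frame[40:40 + ROI_H, 50:50 + ROI_W, :].astype(float)
print(track_roi(frame, template, prev_xy=(50, 40)))   # -> (50, 40)
```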
Audio-Visual ASR: Feature Extraction • Features extracted from ROI • ROI is transformed to grayscale • ROI is decimated to a 16x16 pixel region • 2D separable DCT is applied to the 16x16 pixel region • The upper-left 6x6 region is kept (excluding the first coefficient), giving 35 static features • The 35-dimensional feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps • First and second time derivatives are computed using a 6-frame window (total feature dimension: 105) • Sanity check: unsupervised k-means clustering of ROI results in …
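A minimal sketch of the per-frame static-feature computation, assuming block-averaging for the decimation step (the resizing method used in the system is not stated above) and scipy's DCT routine:

```python
import numpy as np
from scipy.fftpack import dct

def visual_features(roi_rgb):
    """roi_rgb: (H, W, 3) uint8 lip ROI -> 35-dimensional static feature vector."""
    gray = roi_rgb.astype(float).mean(axis=2)                    # to grayscale
    h, w = gray.shape
    gray = gray[: h - h % 16, : w - w % 16]                      # crop to multiple of 16
    blocks = gray.reshape(16, gray.shape[0] // 16, 16, gray.shape[1] // 16)
    small = blocks.mean(axis=(1, 3))                             # decimate to 16x16
    coeffs = dct(dct(small, axis=0, norm="ortho"), axis=1, norm="ortho")  # separable 2D DCT
    return coeffs[:6, :6].flatten()[1:]                          # upper-left 6x6, drop DC -> 35

# Example: features from a synthetic 64x96 ROI.
roi = np.random.randint(0, 256, size=(64, 96, 3), dtype=np.uint8)
print(visual_features(roi).shape)   # -> (35,)

# In the full pipeline these static features are then resampled from 29.97 fps to
# 100 fps and first/second time derivatives are appended (105 features in total).
```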
Experiments • Recognition experiment: • Open loop digit grammar (50 digits per utterance, no endpointing) • Classification experiment: • Single digit grammar (endpointed digits based on provided segmentation)
Models • Features: • Audio: 39 features (MFCC_D_A) • Visual: 105 features (ROIDCT_D_A) • Audio-Visual: 39+35 feats (MFCC_D_A+ROIDCT) • HMM models • 8 state, left-to-right HMM whole-digit models with no state skipping • Single Gaussian mixture • Audio-Visual HMM uses separate audio and video feature streams with equal weights (1,1)
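For reference, the separate audio and video streams with equal weights correspond to the standard multi-stream HMM state observation likelihood (as used, e.g., in HTK); a sketch in that notation:

\[
b_j(o_t) \;=\; \big[\,b_j^{(a)}(o_t^{(a)})\,\big]^{\gamma_a}\;\big[\,b_j^{(v)}(o_t^{(v)})\,\big]^{\gamma_v},
\qquad \gamma_a = \gamma_v = 1,
\]

where \(o_t^{(a)}\) are the 39 audio features, \(o_t^{(v)}\) the visual features, and each stream density here is a single Gaussian per state.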
Results (Word Accuracy) • Data • Training: 1500 digits (30 speakers) • Testing: 300 digits (6 speakers)
Future Work • Multi-mixture models • Front-end (NTUA) • Tracking algorithms • Feature extraction • Feature Combination • Feature integration • Feature weighting
Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection
Feature extraction and combination • Noise Robust Features (NTUA) – m12 • AM-FM Features (NTUA) – m12 • Feature combination – m12 • Supra-segmental features (see also segment models) – m18
Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection
Segment Models • Baseline system • Supra-segmental features • Phone Transition modeling – m12 • Prosody modeling – m18 • Stress modeling – m18 • Parametric modeling of feature trajectories • Dynamical system modeling • Combine with HMMs
Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection
Blind Source Separation (Mokios, Sidiropoulos) • Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data • Collecting spatial covariance matrix estimates which are sufficiently separated in time: R(t) ≈ A D(t) Aᴴ + σ² I, where A is the mixing matrix • Assumptions • uncorrelated speaker signals and noise • D(t) is a diagonal matrix of speaker powers for measurement period t • σ² denotes the noise power (estimated from silence intervals)
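A minimal numpy sketch of the covariance model above; the mixing matrix, array size, and per-window speaker powers are illustrative assumptions, and the PARAFAC fitting step itself (which recovers A and the speaker powers from the stacked covariances) is only described in the comments, not implemented:

```python
import numpy as np

# Covariance model assumed in the PARAFAC-based BSS approach:
#   R(t) ≈ A D(t) A^H + sigma2 * I
# A (mics x speakers), D(t) (diagonal speaker powers) and sigma2 are illustrative.

rng = np.random.default_rng(0)
n_mics, n_speakers, n_windows = 4, 2, 6

A = rng.standard_normal((n_mics, n_speakers)) + 1j * rng.standard_normal((n_mics, n_speakers))
sigma2 = 0.1   # noise power, in practice estimated from silence intervals

# One spatial covariance estimate per measurement window, stacked into a 3-way array.
R = np.empty((n_windows, n_mics, n_mics), dtype=complex)
for t in range(n_windows):
    d = rng.uniform(0.5, 2.0, size=n_speakers)          # diagonal of D(t): speaker powers
    R[t] = A @ np.diag(d) @ A.conj().T + sigma2 * np.eye(n_mics)

# After subtracting the noise term, each slice is a rank-n_speakers matrix A D(t) A^H;
# the collection over t follows a PARAFAC (low-rank trilinear) model whose factors are
# A, conj(A) and the matrix of speaker powers over time.
R_clean = R - sigma2 * np.eye(n_mics)
print(np.linalg.matrix_rank(R_clean[0]))   # -> n_speakers (here 2)
```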
Outline • Work package 1 • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices) • Audio-Visual ASR: Baseline • Feature extraction and combination • Segment models for ASR • Blind Source Separation for multi-microphone ASR • Work package 2 • Adaptation • Data collection
Acoustic Model Adaptation • Adaptation Method: • Bayes’ Optimal Classification • Acoustic Models: • Discrete Mixture HMMs
Bayes optimal classification • Classifier decision for a test data vector x_test: choose the class c with the highest value of p(c | x_test, X_train) = ∫ p(c | x_test, θ) p(θ | X_train) dθ • i.e., the class predictions of all parameter values θ are averaged, weighted by their posterior given the training data X_train
Bayes optimal versus MAP • Assumption: the posterior p(θ | X_train) is sufficiently peaked around the most probable point • MAP approximation: p(c | x_test, X_train) ≈ p(c | x_test, θ_MAP) • θ_MAP is the set of parameters that maximizes p(θ | X_train) ∝ p(X_train | θ) p(θ)
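A toy sketch contrasting the two decision rules (not the HIWIRE acoustic models): one unknown mean per class with known unit variance and a conjugate Gaussian prior, so both the posterior and the Bayes-averaged predictive density are available in closed form. All distributions and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_of_mean(x, prior_var=1.0, noise_var=1.0):
    # Conjugate Gaussian posterior for the class mean, prior N(0, prior_var).
    n = len(x)
    var = 1.0 / (1.0 / prior_var + n / noise_var)
    mean = var * np.sum(x) / noise_var
    return mean, var

def bayes_class_likelihood(x_test, x_train):
    # p(x_test | class) = ∫ N(x_test; θ, 1) p(θ | x_train) dθ  (Gaussian predictive)
    m, v = posterior_of_mean(x_train)
    return np.exp(-0.5 * (x_test - m) ** 2 / (v + 1.0)) / np.sqrt(2 * np.pi * (v + 1.0))

def map_class_likelihood(x_test, x_train):
    # p(x_test | θ_MAP): plug in the posterior mode (equal to the posterior mean here).
    m, _ = posterior_of_mean(x_train)
    return np.exp(-0.5 * (x_test - m) ** 2) / np.sqrt(2 * np.pi)

# Two classes, very little training data per class: the Bayes rule keeps the
# parameter uncertainty, the MAP rule ignores it.
x_train = {0: rng.normal(-1.0, 1.0, size=2), 1: rng.normal(+1.0, 1.0, size=2)}
x_test = 0.3
bayes_c = max((0, 1), key=lambda c: bayes_class_likelihood(x_test, x_train[c]))
map_c = max((0, 1), key=lambda c: map_class_likelihood(x_test, x_train[c]))
print("Bayes-optimal class:", bayes_c, " MAP class:", map_c)
```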
Why Bayes optimal classification • Optimal classification criterion • The predictions of all parameter hypotheses are combined • Better discrimination • Requires less training data • Faster asymptotic convergence to the ML estimate
Why Bayes optimal classification • However: • Computationally more expensive • Difficult to find analytical solutions • …hence some approximations must still be considered
Discrete-Mixture HMMs (Digalakis et al., 2000) • Based on sub-vector quantization • Introduces a new form of observation distributions
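One common way to write such a sub-vector-quantized mixture observation distribution (our notation; the exact form in Digalakis et al. (2000) may differ in detail) is

\[
b_j(o_t) \;=\; \sum_{k=1}^{K} c_{jk}\,\prod_{s=1}^{S} P^{(s)}_{jk}\!\big(q_s(o_t^{(s)})\big),
\]

where the feature vector \(o_t\) is split into \(S\) sub-vectors \(o_t^{(s)}\), \(q_s(\cdot)\) maps each sub-vector to its codeword in the \(s\)-th sub-vector codebook, \(P^{(s)}_{jk}\) is a discrete distribution over that codebook for mixture component \(k\) of state \(j\), and \(c_{jk}\) are the mixture weights. The mixture over \(k\) is what lets the model capture correlation between sub-vectors, as noted on the next slide.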
DMHMM benefits (Digalakis et al., 2000) • Quantization scheme driven by speech recognition performance • Quantizes the acoustic space in sufficient detail • Mixtures capture the correlation between sub-vectors • Well-matched to client-server applications • Comparable performance to continuous HMMs • Faster decoding speeds