Object Tracking and Asynchrony in Audio-Visual Speech Recognition
Mark Hasegawa-Johnson
AIVR Seminar, August 31, 2006
AVICAR is thanks to: Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, Sarah Borys, and to the Motorola Communications Center.
Some experiments and most good ideas in this talk are thanks to Ming Liu, Karen Livescu, Kate Saenko, and Partha Lal.
Why AVSR is not like ASR
• Use of classifiers as features
  • E.g., the output of an AdaBoost lip tracker is a feature in a face constellation
• Obstruction
  • The tongue is rarely visible, the glottis never
• Asynchrony
  • Visual evidence for a word can start long before the audio evidence
Which digit is she about to say?
Why ASR is like AVSR
• Use of classifiers as features
  • E.g., neural networks or SVMs transform audio spectra into a phonetic feature space
• Obstruction
  • Lip closure “hides” tongue closure
  • Glottal stop “hides” lip or tongue position
• Asynchrony
  • Tongue, lips, velum, and glottis can be out of sync, e.g., “every” → “ervy”
Discriminative Features in Face/Lip Tracking: AdaBoost
• Each wavelet defines a “weak classifier”: h_i(x) = 1 iff f_i > threshold, else h_i(x) = 0
• Start with equal weight for all training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M
• For each learning iteration t:
  • Find the i that minimizes the weighted training error ε_t.
  • w_m ↓ if token m was correctly classified, else w_m ↑.
  • α_t = log((1 − ε_t)/ε_t)
• Final “strong classifier” is H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t
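The training loop above can be sketched in a few lines of Python. This is a minimal illustration, not the tracker used in the talk. It assumes scalar features with one threshold per weak classifier, a Viola-Jones style weight update (correct tokens are scaled by ε/(1 − ε), then all weights are renormalized), and the usual half-of-total-weight decision threshold. The names `train_adaboost` and `strong_classify` are hypothetical.

```python
import math

def train_adaboost(features, labels, n_rounds):
    """features: list of M feature vectors; labels: list of 0/1 targets."""
    M = len(features)
    w = [1.0 / M] * M                      # equal initial weights
    strong = []                            # (alpha, feature index, threshold)
    for _ in range(n_rounds):
        total = sum(w)
        w = [wm / total for wm in w]       # renormalize the weights
        best = None
        for i in range(len(features[0])):
            for thr in sorted({x[i] for x in features}):
                # weighted error of the weak classifier h(x) = [x[i] > thr]
                err = sum(wm for wm, x, y in zip(w, features, labels)
                          if int(x[i] > thr) != y)
                if best is None or err < best[0]:
                    best = (err, i, thr)
        err, i, thr = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = math.log((1 - err) / err)
        beta = err / (1 - err)
        # shrink weights of correctly classified tokens (assumed update rule)
        w = [wm * (beta if int(x[i] > thr) == y else 1.0)
             for wm, x, y in zip(w, features, labels)]
        strong.append((alpha, i, thr))
    return strong

def strong_classify(strong, x):
    """H(x) = 1 iff the weighted vote exceeds half the total weight."""
    vote = sum(a for a, i, thr in strong if x[i] > thr)
    return 1 if vote > 0.5 * sum(a for a, _, _ in strong) else 0
```

The exhaustive threshold search is quadratic in the number of tokens per feature, which is acceptable for a sketch but not for a real wavelet feature bank.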
AdaBoost in a Bayesian Context
• The AdaBoost “margin” MD(x)
• Guaranteed range: 0 ≤ MD(x) ≤ 1
• Inverse sigmoid transform yields nearly normal distributions
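The inverse sigmoid (logit) transform can be sketched as follows. The margin formula itself did not survive on this slide, so the `margin` function below is an assumption: it uses the normalized weighted vote Σ α_t h_t(x) / Σ α_t, which is one definition that satisfies the stated 0 to 1 range. The clipping constant `eps` is also an assumption, added so that margins of exactly 0 or 1 stay finite.

```python
import math

def margin(alphas, weak_outputs):
    """Assumed margin: normalized weighted vote, guaranteed to lie in [0, 1]."""
    return sum(a * h for a, h in zip(alphas, weak_outputs)) / sum(alphas)

def inverse_sigmoid(m, eps=1e-6):
    """Logit transform: maps (0, 1) onto the real line."""
    m = min(max(m, eps), 1 - eps)   # avoid log(0) at the range boundaries
    return math.log(m / (1 - m))
```

After this transform the values are unbounded, which is what makes a Gaussian model of them plausible.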
Prior: Relative Position of Lips in the Face
p(r = r_lips | MD(x)) ∝ p(r = r_lips) p(MD(x) | r = r_lips)
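A minimal sketch of this proportionality, assuming a finite set of candidate lip positions. Each candidate carries a relative position r (reduced to one dimension here for brevity) and a transformed margin value. The Gaussian prior and likelihood, and the names `gaussian_pdf` and `lip_posterior`, are illustrative assumptions rather than the models used in the talk.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def lip_posterior(candidates, prior, likelihood):
    """candidates: list of (r, md) pairs; returns normalized posteriors."""
    scores = [prior(r) * likelihood(md) for r, md in candidates]
    total = sum(scores)              # normalizer over the candidate set
    return [s / total for s in scores]

# Hypothetical numbers: lips expected at 70% of face height, high margin.
prior = lambda r: gaussian_pdf(r, 0.7, 0.1)
likelihood = lambda md: gaussian_pdf(md, 2.0, 1.0)
posterior = lip_posterior([(0.7, 2.0), (0.2, 2.0), (0.7, -1.0)],
                          prior, likelihood)
```

A candidate wins only when both terms agree: a confident detection in an implausible place, or a plausible place with a weak detection, is down-weighted.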
Model-Based Correction for Head-Pose Variability
• If the head is an ellipse, its measured width wF(t) and height hF(t) are functions of roll ρ, yaw ψ, pitch φ, true height ħF, and true width wF according to
• … which can usefully be approximated as…
Robust Correction: Linear Regression
• The additive random part of the lip width (wL(t) = w1 + ħL cosψ(t) sinρ(t)) is proportional to similar additive variation in the head width (wF(t) = wF1 + ħF cosψ(t) sinρ(t)), so we can eliminate it by orthogonalizing wL(t) to wF(t).
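Orthogonalizing one time series to another amounts to subtracting its least-squares projection. The sketch below assumes ordinary least squares over a whole utterance and keeps the mean lip width unchanged; the function name `orthogonalize` and the sample numbers are hypothetical.

```python
def orthogonalize(w_lip, w_face):
    """Remove from w_lip the component linearly predictable from w_face."""
    n = len(w_lip)
    mean_l = sum(w_lip) / n
    mean_f = sum(w_face) / n
    cov = sum((l - mean_l) * (f - mean_f) for l, f in zip(w_lip, w_face))
    var = sum((f - mean_f) ** 2 for f in w_face)
    slope = cov / var if var > 0 else 0.0     # least-squares regression slope
    # Subtract the predicted variation; the residual is uncorrelated
    # with the face width and keeps the original mean lip width.
    return [l - slope * (f - mean_f) for l, f in zip(w_lip, w_face)]

corrected = orthogonalize([2.0, 4.5, 5.5, 8.0], [1.0, 2.0, 3.0, 4.0])
```

The regression needs no pose estimate at all, which is presumably why the slide calls it robust.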
WER Results from AVICAR (testing on the training data; 34 talkers, continuous digits)
• LR = linear regression
• Model = model-based head-pose compensation
• LLR = log-linear regression
• 13+d+dd = 13 static features plus deltas and delta-deltas
• 39 = 39 static features
• All systems have mean and variance normalization and MLLR
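For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a generic implementation, not the scoring tool used for these results, and it assumes a non-empty reference.

```python
def word_error_rate(ref, hyp):
    """ref, hyp: lists of words. Returns edit distance / len(ref)."""
    d = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i               # prev holds the diagonal cell
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,           # deletion of a reference word
                       d[j - 1] + 1,       # insertion of a hypothesis word
                       prev + (r != h))    # match or substitution
            prev = cur
    return d[len(hyp)] / len(ref)

wer = word_error_rate("one two three four".split(), "one too three".split())
```

Here one substitution and one deletion against four reference words give a WER of 0.5.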
Audio-Visual Asynchrony
For example, the tongue touches the teeth before acoustic speech onset in the word “three”; the lips are already rounded in anticipation of the /r/.
Audio-Visual Asynchrony: Coupled HMM is a typical Phoneme-Viseme Model (Chu and Huang, 2002)
A Physical Model of Asynchrony (Slide created by Karen Livescu)
• Articulatory Phonology [Browman & Goldstein ‘90]: The following 8 tract variables are independently and asynchronously controlled: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS
• For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).
Motivation: Pronunciation variation (Slide created by Karen Livescu)
• probably
  • baseform: p r aa b ax b l iy
  • surface (actual): (2) p r aa b iy; (1) p r ay; (1) p r aw l uh; (1) p r ah b iy; (1) p r aa l iy; (1) p r aa b uw; (1) p ow ih; (1) p aa iy; (1) p aa b uh b l iy; (1) p aa ah iy
• don’t
  • baseform: d ow n t
  • surface (actual): (37) d ow n; (16) d ow; (6) ow n; (4) d ow n t; (3) d ow t; (3) d ah n; (3) ow; (3) n ax; (2) d ax n; (2) ax; (1) n uw; ...
• sense
  • baseform: s eh n s
  • surface (actual): (1) s eh n t s; (1) s ih t s
• everybody
  • baseform: eh v r iy b ah d iy
  • surface (actual): (1) eh v r ax b ax d iy; (1) eh v er b ah d iy; (1) eh ux b ax iy; (1) eh r uw ay; (1) eh b ah iy
Explanation: Asynchrony of tract variables (Based on a slide created by Karen Livescu)
[Figure: feature-value timelines for the glottis (G), velum, and tongue (T) streams, aligned with the phone string, for three pronunciations of “sense”]
• Dictionary: phones s eh n s. G: open, critical, open. Velum: nasal. T: crit / alveolar, mid / palatal, closed / alveolar, crit / alveolar.
• Surface variant #1 (example of feature asynchrony): phones s eh n t s. G: open, critical, open. Velum: nas. T: crit / alveolar, mid / palatal, closed / alveolar, crit / alveolar.
• Surface variant #2 (example of feature asynchrony + substitution): phones s ih t s. G: open, critical, critical, open. Velum: nas. T: crit / alveolar, nar / palatal, cl / alv, crit / alveolar.
Implementation: Multi-stream DBN (Slide created by Karen Livescu)
• Phone-based: q (phonetic state), o (observation vector)
• Articulatory feature-based: L (state of lips), T (state of tongue), G (state of glottis), o (obs vector)
Baseline: Audio-only phone-based HMM (Slide created by Partha Lal)
• positionInWordA {0,1,2,...}
• stateTransitionA {0,1}
• phoneStateA { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, …}
• obsA
Baseline: Video-only phone-based HMM (Slide created by Partha Lal)
• positionInWordV {0,1,2,...}
• stateTransitionV {0,1}
• phoneStateV { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, …}
• obsV
Audio-visual HMM without asynchrony (Slide created by Partha Lal)
• positionInWord {0,1,2,...}
• stateTransition {0,1}
• phoneState { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, …}
• obs: obsV, obsA
Phoneme-Viseme CHMM (Slide created by Partha Lal)
• Audio stream: positionInWordA {0,1,2,...}, stateTransitionA {0,1}, phoneStateA { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, …}, obsA
• Video stream: positionInWordV {0,1,2,...}, stateTransitionV {0,1}, phoneStateV { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, …}, obsV
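One way to picture the coupling between the two streams is as a constraint on the joint (audio position, video position) state space. The sketch below is an assumption about how such a constraint could be written, not the DBN from the talk: each stream either stays or advances by one position per frame (mirroring the {0,1} state-transition variables), and the two positions may differ by at most `max_async`, a hypothetical parameter name.

```python
def joint_states(n_states, max_async):
    """All (audio, video) position pairs within the asynchrony limit."""
    return [(a, v) for a in range(n_states) for v in range(n_states)
            if abs(a - v) <= max_async]

def allowed_transitions(state, n_states, max_async):
    """Successors when each stream stays (0) or advances by one (1)."""
    a, v = state
    successors = []
    for da in (0, 1):                 # audio state transition
        for dv in (0, 1):             # video state transition
            na, nv = a + da, v + dv
            in_range = na < n_states and nv < n_states
            if in_range and abs(na - nv) <= max_async:
                successors.append((na, nv))
    return successors
```

With `max_async = 0` the joint space collapses to the diagonal, which corresponds to the synchronous audio-visual HMM; larger limits widen the band around the diagonal.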
Articulatory Feature CHMM
• Lips: positionInWordL {0,1,2,...}, stateTransitionL {0,1}, L { /OP/1, /OP/2, /RND/1, …}
• Tongue: positionInWordT {0,1,2,...}, stateTransitionT {0,1}, T { /CL-ALV/1, /CL-ALV/2, /MID-UV/1, …}
• Glottis: positionInWordG {0,1,2,...}, stateTransitionG {0,1}, G { /OP/1, /OP/2, /CRIT/1, …}
• Observations: obsV, obsA
Asynchrony Experiments: CUAVE
• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup
  • Training on clean data, number of Gaussians tuned on clean dev set
  • Audio/video weights tuned on noise-specific dev sets
  • Uniform (“zero-gram”) language model
  • Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for initial baselines and audio observations
Results, part 1: Should we use video?
Answer: Fusion WER < Single-stream WER
(Novelty: None. Many authors have reported this.)
Results, part 2: Should the streams be asynchronous?
Answer: Asynchronous WER < Synchronous WER (4% absolute at mid SNRs)
(Novelty: First phone-based AVSR with inter-phone asynchrony.)
Results, part 3: Should asynchrony be modeled using articulatory features?
Answer: Articulatory feature WER = Phoneme-viseme WER
(Novelty: First articulatory feature model for AVSR.)
Results, part 4: Can the AF system help the CHMM to correct mistakes?
Answer: Combination AF + PV gives the best results on this database
Details: Systems vote to determine the label of each word (NIST ROVER)
• PV = Phone-viseme
• AF = Articulatory features
• WER on devtest, averaged across SNRs
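The voting step can be illustrated with a much simplified stand-in for ROVER. The real tool first aligns the hypotheses into a word transition network; the sketch below skips that and assumes the hypotheses are already aligned word for word and have equal length. Ties go to the earliest-listed system, which is also an assumption.

```python
from collections import Counter

def vote(hypotheses):
    """hypotheses: equal-length word lists, one per system."""
    n = len(hypotheses[0])
    if any(len(h) != n for h in hypotheses):
        raise ValueError("this sketch needs pre-aligned, equal-length inputs")
    output = []
    for slot in zip(*hypotheses):          # words competing for one position
        counts = Counter(slot)
        top = max(counts.values())
        # first word in system order that reaches the top count
        output.append(next(w for w in slot if counts[w] == top))
    return output

combined = vote([["one", "two", "three"],
                 ["one", "too", "three"],
                 ["nine", "two", "three"]])
```

Each system makes one error here, but in different positions, so the vote recovers the full string. That is the sense in which systems with equal WER can still complement each other.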
Conclusions
• Classifiers as features:
  • AdaBoost “margin” outputs can be used as features in a Gaussian model of facial geometry
• Head-pose correction in noise:
  • Best correction algorithm uses linear regression followed by model-based correction
• Asynchrony matters:
  • Best phone-based recognizer is a CHMM with two states of asynchrony allowed between audio and video
• Articulatory Feature Models complement Phone Models
  • These two systems have identical WER
  • Best result obtained when systems of both types are combined using ROVER