
Object Tracking and Asynchrony in Audio-Visual Speech Recognition




Presentation Transcript


  1. Object Tracking and Asynchrony in Audio-Visual Speech Recognition. Mark Hasegawa-Johnson, AIVR Seminar, August 31, 2006. AVICAR exists thanks to Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, Sarah Borys, and the Motorola Communications Center. Some experiments, and most of the good ideas in this talk, are thanks to Ming Liu, Karen Livescu, Kate Saenko, and Partha Lal.

  2. Why AVSR is not like ASR • Use of classifiers as features • E.g., the output of an AdaBoost lip tracker is a feature in a face constellation • Obstruction • The tongue is rarely visible, the glottis never • Asynchrony • Visual evidence for a word can start long before the audio evidence (figure caption: “Which digit is she about to say?”)

  3. Why ASR is like AVSR • Use of classifiers as features • E.g., neural networks or SVMs transform audio spectra into a phonetic feature space • Obstruction • Lip closure “hides” tongue closure • A glottal stop “hides” lip or tongue position • Asynchrony • Tongue, lips, velum, and glottis can be out of sync, e.g., “every” → “ervy”

  4. Discriminative Features in Face/Lip Tracking: AdaBoost • Each wavelet defines a “weak classifier”: h_i(x) = 1 iff f_i > threshold, else h_i(x) = 0 • Start with equal weight for all training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M • For each learning iteration t: • Find the feature i whose weak classifier minimizes the weighted training error ε_t • w_m ↓ if token m was correctly classified, else w_m ↑ • α_t = log((1 − ε_t)/ε_t) • The final “strong classifier” is H(x) = 1 iff Σ_t α_t h_t(x) ≥ ½ Σ_t α_t
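A minimal sketch of the discrete AdaBoost loop described above, assuming precomputed Haar-wavelet responses and {0, 1} labels; the exhaustive threshold search and all names are illustrative, not the actual face/lip tracker code.

```python
import numpy as np

def adaboost_train(F, y, num_rounds=10):
    """Discrete AdaBoost over threshold 'weak classifiers', one per feature column.

    F : (M, N) array of wavelet responses f_i(x) for M training tokens.
    y : (M,) array of labels in {0, 1}.
    Returns a list of (feature_index, threshold, alpha) triples.
    """
    M, N = F.shape
    w = np.full(M, 1.0 / M)                 # equal initial weights w_m = 1/M
    strong = []
    for _ in range(num_rounds):
        best = None
        # Find the feature/threshold pair with minimum weighted error.
        for i in range(N):
            for thr in np.unique(F[:, i]):
                h = (F[:, i] > thr).astype(int)   # h_i(x) = 1 iff f_i > threshold
                err = np.sum(w * (h != y))
                if best is None or err < best[0]:
                    best = (err, i, thr, h)
        err, i, thr, h = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1.0 - err) / err)         # alpha_t = log((1 - eps_t) / eps_t)
        # Up-weight misclassified tokens; after renormalization the correctly
        # classified tokens are down-weighted.
        w *= np.exp(alpha * (h != y))
        w /= w.sum()
        strong.append((i, thr, alpha))
    return strong

def adaboost_predict(strong, x):
    """Strong classifier: H(x) = 1 iff sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t."""
    votes = sum(a * float(x[i] > thr) for i, thr, a in strong)
    return int(votes >= 0.5 * sum(a for _, _, a in strong))
```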

  5. Example Haar Wavelet Features Selected by AdaBoost

  6. AdaBoost in a Bayesian Context • The AdaBoost “margin” MD(x) • Guaranteed range: 0 ≤ MD(x) ≤ 1 • An inverse-sigmoid (logit) transform yields nearly normal distributions
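The margin formula itself is an image lost from the transcript; a reasonable reading, consistent with the stated 0 ≤ MD(x) ≤ 1 range, is the weighted vote normalized by the total weight, MD(x) = Σ_t α_t h_t(x) / Σ_t α_t. A sketch under that assumption, with the inverse-sigmoid (logit) mapping that spreads the scores toward a roughly Gaussian shape:

```python
import numpy as np

def adaboost_margin(strong, x):
    """Normalized AdaBoost margin: weighted vote divided by total weight, in [0, 1].
    Assumes `strong` is the list of (feature_index, threshold, alpha) triples above."""
    total = sum(a for _, _, a in strong)
    votes = sum(a * float(x[i] > thr) for i, thr, a in strong)
    return votes / total

def inverse_sigmoid(m, eps=1e-6):
    """Logit transform of the margin; maps [0, 1] onto the real line so the
    scores can be modeled with Gaussians."""
    m = np.clip(m, eps, 1.0 - eps)
    return np.log(m / (1.0 - m))
```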

  7. Prior: Relative Position of Lips in the Face • p(r = rlips | MD(x)) ∝ p(r = rlips) p(MD(x) | r = rlips)
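A hedged sketch of how the prior and the detector score might be combined, assuming Gaussian models for both the lips' face-relative position and the (logit-transformed) AdaBoost margin; the parameter names are placeholders, not the tracker's actual model.

```python
import numpy as np
from scipy.stats import norm

def lip_posterior(candidates, margins,
                  prior_mean, prior_std, lik_mean, lik_std):
    """Score candidate lip positions r by p(r | MD) ∝ p(r) p(MD | r).

    candidates : (K, 2) candidate lip centers, in face-box-relative coordinates.
    margins    : (K,) logit-transformed detector margins at each candidate.
    The Gaussian parameters are placeholders that would be fit on training faces.
    """
    log_prior = norm.logpdf(candidates, prior_mean, prior_std).sum(axis=1)
    log_lik = norm.logpdf(margins, lik_mean, lik_std)
    log_post = log_prior + log_lik
    return candidates[np.argmax(log_post)]
```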

  8. Lip Tracking: a few results

  9. Pixel-Based Features

  10. Pixel-Based Features: Dimension

  11. Model-Based Correction for Head-Pose Variability • If the head is an ellipse, its measured width wF(t) and height hF(t) are functions of roll ρ, yaw ψ, pitch φ, true height ħF, and true width wF, according to … • … which can usefully be approximated as …

  12. Robust Correction: Linear Regression • The additive random part of the lip width, wL(t) = wL1 + ħL cos ψ(t) sin ρ(t), is proportional to similar additive variation in the head width, wF(t) = wF1 + ħF cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing wL(t) to wF(t).
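A minimal sketch of the orthogonalization, under the slide's approximation that lip and face widths share the same additive cos ψ(t) sin ρ(t) term: regress the measured lip width onto the measured face width over an utterance and keep the residual (plus the mean), which cancels the shared head-pose variation. Function and variable names are illustrative, not AVICAR code.

```python
import numpy as np

def orthogonalize_lip_width(w_lip, w_face):
    """Remove head-pose-induced variation shared by lip width and face width.

    w_lip, w_face : (T,) measured lip and face widths over an utterance.
    Fits w_lip(t) ≈ a + b * w_face(t) by least squares and returns the
    pose-compensated lip width: the residual plus the utterance mean.
    """
    A = np.column_stack([np.ones_like(w_face), w_face])
    coef, *_ = np.linalg.lstsq(A, w_lip, rcond=None)
    residual = w_lip - A @ coef
    return residual + w_lip.mean()
```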

  13. WER Results from AVICAR (testing on the training data; 34 talkers, continuous digits) • LR = linear regression • Model = model-based head-pose compensation • LLR = log-linear regression • 13+d+dd = 13 static features plus deltas and delta-deltas • 39 = 39 static features • All systems use mean and variance normalization and MLLR

  14. Audio-Visual Asynchrony • For example, the tongue touches the teeth before the acoustic speech onset in the word “three,” and the lips are already rounded in anticipation of the /r/.

  15. Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model (Chu and Huang, 2002)

  16. A Physical Model of Asynchrony (slide created by Karen Livescu) • Articulatory Phonology [Browman & Goldstein ‘90]: the following 8 tract variables are independently & asynchronously controlled: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS • For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).
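For concreteness, a small sketch of the collapse from eight tract variables to the three LTG streams; grouping VELUM with the glottis stream is my assumption, since the slide only names the three streams.

```python
# Mapping from Browman & Goldstein tract variables to the three recognition
# streams used in the talk.  Placing VELUM with the glottis stream is an
# assumption; the slide only names lips, tongue, and glottis (LTG).
TRACT_VARIABLE_TO_STREAM = {
    "LIP-LOC": "lips",
    "LIP-OP": "lips",
    "TT-LOC": "tongue",   # tongue tip location
    "TT-OP": "tongue",    # tongue tip opening
    "TB-LOC": "tongue",   # tongue body location
    "TB-OP": "tongue",    # tongue body opening
    "VELUM": "glottis",
    "GLOTTIS": "glottis",
}

def collapse_to_streams(tract_values):
    """Group a dict of per-tract-variable values into the three LTG streams."""
    streams = {}
    for var, value in tract_values.items():
        streams.setdefault(TRACT_VARIABLE_TO_STREAM[var], {})[var] = value
    return streams
```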

  17. Motivation: Pronunciation Variation (slide created by Karen Livescu) — baseform vs. surface (actual) pronunciations, with counts:
  • probably — baseform: p r aa b ax b l iy; surface: (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy
  • don’t — baseform: d ow n t; surface: (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (2) n ax, (2) d ax n, (1) ax, (1) n uw, ...
  • sense — baseform: s eh n s; surface: (1) s eh n t s, (1) s ih t s
  • everybody — baseform: eh v r iy b ah d iy; surface: (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

  18. Explanation: Asynchrony of Tract Variables (based on a slide created by Karen Livescu) • The slide aligns glottis (G) and tongue (T) feature values (open, critical, nasal; crit/alveolar, closed/alveolar, mid/palatal, nar/palatal) with the phone string for each pronunciation of “sense” • Dictionary form: phones s eh n s • Surface variant #1: phones s eh n t s — an example of feature asynchrony (the tongue-tip closure outlasts the nasal gesture, yielding an epenthetic [t]) • Surface variant #2: phones s ih t s — an example of feature asynchrony plus feature substitution

  19. Implementation: Multi-Stream DBN (slide created by Karen Livescu) • Phone-based: q (phonetic state) → o (observation vector) • Articulatory-feature-based: L (state of lips), T (state of tongue), G (state of glottis) → o (observation vector)

  20. Baseline: Audio-Only Phone-Based HMM (slide created by Partha Lal) • DBN variables: positionInWordA ∈ {0, 1, 2, ...}, stateTransitionA ∈ {0, 1}, phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, obsA

  21. Baseline: Video-Only Phone-Based HMM (slide created by Partha Lal) • DBN variables: positionInWordV ∈ {0, 1, 2, ...}, stateTransitionV ∈ {0, 1}, phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, obsV

  22. Audio-Visual HMM Without Asynchrony (slide created by Partha Lal) • DBN variables: positionInWord ∈ {0, 1, 2, ...}, stateTransition ∈ {0, 1}, phoneState ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … } • A single phone state generates both observations, obsA and obsV
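The synchronous model scores every frame with both observation streams; a common way to balance them, and presumably what the tuned audio/video weights mentioned on slide 25 control, is an exponent weight on each stream's likelihood. A minimal sketch; the exact weighting scheme is my assumption of the usual practice, not a detail confirmed by the transcript.

```python
def fused_log_likelihood(log_b_audio, log_b_video, audio_weight=0.7):
    """Stream-weighted observation score for one HMM state at one frame.

    log_b_audio, log_b_video : per-stream observation log-likelihoods.
    audio_weight             : tuned on a noise-specific dev set (slide 25);
                               the default here is only a placeholder.
    """
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_video
```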

  23. Phoneme-Viseme CHMM (slide created by Partha Lal) • Audio stream: positionInWordA ∈ {0, 1, 2, ...}, stateTransitionA ∈ {0, 1}, phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, obsA • Video stream: positionInWordV ∈ {0, 1, 2, ...}, stateTransitionV ∈ {0, 1}, phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, obsV
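In the CHMM the two streams advance through their sub-word states semi-independently, but (per the conclusions on slide 30) only up to two states of asynchrony are allowed between audio and video. A minimal sketch of that constraint on the joint state space; the variable names mirror the slide, everything else is illustrative.

```python
MAX_ASYNC = 2  # states of asynchrony allowed between the audio and video streams

def joint_state_allowed(position_in_word_a, position_in_word_v, max_async=MAX_ASYNC):
    """Asynchrony constraint used when composing the two stream HMMs into a CHMM:
    the audio and video sub-word positions may drift apart, but only by at most
    `max_async` states.  A sketch of the constraint, not the full decoder."""
    return abs(position_in_word_a - position_in_word_v) <= max_async

# Example: permissible joint states for a 5-state word model.
joint_states = [(a, v) for a in range(5) for v in range(5)
                if joint_state_allowed(a, v)]
```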

  24. Articulatory Feature CHMM • Lips stream: positionInWordL ∈ {0, 1, 2, ...}, stateTransitionL ∈ {0, 1}, L ∈ { /OP/1, /OP/2, /RND/1, … } • Tongue stream: positionInWordT ∈ {0, 1, 2, ...}, stateTransitionT ∈ {0, 1}, T ∈ { /CL-ALV/1, /CL-ALV/2, /MID-UV/1, … } • Glottis stream: positionInWordG ∈ {0, 1, 2, ...}, stateTransitionG ∈ {0, 1}, G ∈ { /OP/1, /OP/2, /CRIT/1, … } • Observations: obsA, obsV

  25. Asynchrony Experiments: CUAVE • 169 utterances used, 10 digits each • NOISEX speech babble added at various SNRs • Experimental setup • Training on clean data, number of Gaussians tuned on clean dev set • Audio/video weights tuned on noise-specific dev sets • Uniform (“zero-gram”) language model • Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning) • Thanks to Amar Subramanya at UW for the video observations • Thanks to Kate Saenko at MIT for initial baselines and audio observations

  26. Results, part 1: Should we use video? Answer: fusion WER < single-stream WER. (Novelty: none; many authors have reported this.)

  27. Results, part 2: Should the streams be asynchronous? Answer: asynchronous WER < synchronous WER (4% absolute at mid SNRs). (Novelty: first phone-based AVSR with inter-phone asynchrony.)

  28. Results, part 3: Should asynchrony be modeled using articulatory features? Answer: articulatory-feature WER = phoneme-viseme WER. (Novelty: first articulatory-feature model for AVSR.)

  29. Results, part 4: Can the AF system help the CHMM correct its mistakes? Answer: the AF + PV combination gives the best results on this database. Details: the systems vote to determine the label of each word (NIST ROVER); WER measured on devtest, averaged across SNRs. (PV = phoneme-viseme, AF = articulatory features.)
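A toy sketch of word-level voting in the spirit of NIST ROVER, assuming the systems' digit strings are already aligned slot by slot (real ROVER builds the alignment with dynamic programming and can weight votes by confidence scores):

```python
from collections import Counter

def vote_on_words(aligned_hypotheses):
    """aligned_hypotheses : list of word sequences, one per system, already
    aligned so that position i in each sequence refers to the same word slot.
    Returns the majority-vote word for each slot (ties broken by first system)."""
    combined = []
    for slot in zip(*aligned_hypotheses):
        counts = Counter(slot)
        top = max(counts.values())
        # Keep the first system's word among the tied winners.
        winner = next(w for w in slot if counts[w] == top)
        combined.append(winner)
    return combined

# Example with three hypothetical systems (AF-CHMM, PV-CHMM, audio-only):
print(vote_on_words([["one", "two", "three"],
                     ["one", "two", "tree"],
                     ["one", "too", "three"]]))   # -> ['one', 'two', 'three']
```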

  30. Conclusions • Classifiers as features: AdaBoost “margin” outputs can be used as features in a Gaussian model of facial geometry • Head-pose correction in noise: the best correction algorithm uses linear regression followed by model-based correction • Asynchrony matters: the best phone-based recognizer is a CHMM with two states of asynchrony allowed between audio and video • Articulatory feature models complement phone models: the two systems have identical WER, and the best result is obtained when systems of both types are combined using ROVER
