This lecture covers topics such as visual speech features, lip tracking, object tracking, computational complexity, AdaBoost, and AVICAR. It also discusses techniques for dealing with visual noise.
Dealing with Acoustic Noise, Part 3: Video
Mark Hasegawa-Johnson, University of Illinois
Lectures at CLSP WS06, July 20, 2006
Audio-Visual Speech Recognition: WS00 and WS06
• Visual speech features
  • DCT of lip rectangle
  • Active Appearance Models
• Feature normalization
  • Mean and variance normalization
  • MLLR, fiducial-point LR and log-LR
  • LDA and PCA
• Audio-video fusion
  • Two-stream HMM
  • Product HMM & coupled HMM
  • Streams based on constriction states
Object Tracking: Multi-Resolution (Neti et al., 2000)
• Computational complexity of lip tracking:
  • In a 480×720 image, there are 1.2×10¹¹ candidate lip rectangles.
• Multi-resolution lip tracking:
  • Train lip detectors at each resolution, i.e., the corners of the lip rectangle must be integer multiples of R_i
• Beam search:
  • Keep the N best candidates at resolution R_i
  • At resolution R_{i+1}, consider only the candidates within ±R_i/2 of a candidate from the N-best list at resolution R_i
• Tune R_1, R_2, … and N to trade off accuracy against computational complexity
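The coarse-to-fine beam search above can be sketched as follows. For a quick illustration this uses a 72×48 image and resolutions 16/8/4 rather than the full 480×720 search, and the `score` callable stands in for a trained lip detector:

```python
import itertools

def refine(rect, prev_R, R, W, H):
    """Candidates whose corners lie within +/- prev_R/2 of rect, on grid R."""
    x0, y0, x1, y1 = rect
    span = range(-prev_R // 2, prev_R // 2 + 1, R)
    for dx0, dy0, dx1, dy1 in itertools.product(span, repeat=4):
        nx0, ny0, nx1, ny1 = x0 + dx0, y0 + dy0, x1 + dx1, y1 + dy1
        if 0 <= nx0 < nx1 <= W and 0 <= ny0 < ny1 <= H:
            yield (nx0, ny0, nx1, ny1)

def beam_search(score, W=72, H=48, resolutions=(16, 8, 4), beam=5):
    """Coarse-to-fine rectangle search: exhaustive at the coarsest
    resolution, then refine only the N-best candidates at each level."""
    R0 = resolutions[0]
    # exhaustive search on the coarsest grid
    cands = [(x0, y0, x1, y1)
             for x0 in range(0, W, R0) for x1 in range(x0 + R0, W + 1, R0)
             for y0 in range(0, H, R0) for y1 in range(y0 + R0, H + 1, R0)]
    for prev_R, R in zip(resolutions, resolutions[1:]):
        best = sorted(cands, key=score, reverse=True)[:beam]  # keep N best
        cands = sorted({r for b in best for r in refine(b, prev_R, R, W, H)})
    return max(cands, key=score)
```

Tuning `resolutions` and `beam` trades accuracy against the number of `score` evaluations, exactly as on the slide.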
Object Tracking: AdaBoost (Schapire, 1999)
• Each Viola-Jones feature defines a "weak classifier": h_{a,b,c,d,i}(x) = 1 iff f_i(a,b,c,d) > threshold, else h_{a,b,c,d,i}(x) = 0
• Start with equal weight for all training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M
• For each learning iteration t:
  • Find the (a,b,c,d,i) that minimizes the weighted training error ε_t
  • w_m(t+1) = w_m(t)(1−ε_t)/ε_t if token m was incorrectly classified, else w_m(t+1) = w_m(t); then renormalize so Σ_m w_m(t+1) = 1
  • α_t = log((1−ε_t)/ε_t)
• Final "strong classifier": H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t
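The boosting loop can be sketched as below; for brevity, generic one-feature threshold stumps stand in for the Viola-Jones rectangle features f_i(a,b,c,d):

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """Discrete AdaBoost with one-feature threshold stumps.
    X: (M, D) features; y: labels in {0, 1}."""
    M, D = X.shape
    w = np.full(M, 1.0 / M)              # equal initial weights
    classifiers = []
    for _ in range(n_rounds):
        best = None
        for d in range(D):               # search all weak classifiers
            for thr in np.unique(X[:, d]):
                for sign in (1, -1):
                    pred = ((sign * X[:, d]) > (sign * thr)).astype(int)
                    eps = w[pred != y].sum()   # weighted training error
                    if best is None or eps < best[0]:
                        best = (eps, d, thr, sign, pred)
        eps, d, thr, sign, pred = best
        eps = max(eps, 1e-12)            # guard against a perfect stump
        alpha = np.log((1 - eps) / eps)
        w[pred != y] *= (1 - eps) / eps  # up-weight misclassified tokens
        w /= w.sum()                     # renormalize
        classifiers.append((alpha, d, thr, sign))
    return classifiers

def strong_classify(classifiers, x):
    """H(x) = 1 iff the weighted vote exceeds half the total alpha."""
    total = sum(a for a, *_ in classifiers)
    vote = sum(a for a, d, thr, s in classifiers if (s * x[d]) > (s * thr))
    return int(vote > 0.5 * total)
```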
AdaBoost in a Bayesian Context
• p(M_D(x) | C_i) is well approximated by a Gaussian
  • C_i = 0: object absent
  • C_i = 1: object present
  • M_D(x) defined by …
• The probability distribution of the face center (x, y) and log(width, height) is well modeled by Gaussians
• The probability distribution of the lip center (x, y) and size (w, h) relative to the face (normalized to the range [−1, 1]) is compact and unimodal
• Find the lip rectangle and face rectangle that jointly maximize the product of probabilities
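A minimal sketch of the joint face-lip scoring idea, assuming independent Gaussians for the face's center and log-size and for the lip center normalized to [−1, 1] within the face. All parameter values here are illustrative guesses, not numbers from the lecture:

```python
import math

def log_gauss(x, mu, sd):
    """Log of a univariate Gaussian density."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

def joint_score(face, lip, face_model, lip_model):
    """Joint log-probability of a face and a lip rectangle (cx, cy, w, h).
    face_model: (mu, sd) pairs for the face's centre x, centre y, and
    log width/height; lip_model: (mu, sd) pairs for the lip centre
    normalized to [-1, 1] within the face."""
    fx, fy, fw, fh = face
    lx, ly, lw, lh = lip
    s = 0.0
    for v, (mu, sd) in zip((fx, fy, math.log(fw), math.log(fh)), face_model):
        s += log_gauss(v, mu, sd)
    rx = 2 * (lx - fx) / fw    # lip centre relative to face, in [-1, 1]
    ry = 2 * (ly - fy) / fh
    for v, (mu, sd) in zip((rx, ry), lip_model):
        s += log_gauss(v, mu, sd)
    return s
```

In the full system one would evaluate `joint_score` over candidate face/lip pairs and keep the maximizing pair.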
Geometry Features Can Be Useful for AVSR (Chu and Huang, 2002, using features of Tsuhan Chen)
[Figure: visual feature extraction pipeline]
Constellation Models (Koch)
• Each patch is recognized by a likelihood p(pixels)
• Relative geometries are controlled by a geometry PDF
• Advantages:
  • Good object-detection accuracy
  • Provides information about object components
• Disadvantage: computational complexity
AVICAR "Constellation"
• Four face rectangles provide information about face location, width, and height (useful for normalization)
• The positions of the lip rectangles within the four faces provide information about head angle (useful for normalization)
• Lip height and width provide information about whether the mouth is open or closed (useful for speech recognition)
• The DCT of the pixels within all four lip rectangles gives information about teeth and tongue (useful for speech recognition)
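The DCT-of-lip-rectangle features used here (and in the overview slide) can be sketched as a separable 2-D DCT of the grayscale lip patch, keeping only the low-order coefficients; the 6×6 coefficient block is an illustrative choice, not the lecture's exact configuration:

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II via two separable 1-D DCTs."""
    def dct1(x):
        N = x.shape[0]
        n = np.arange(N)
        # basis[k, j] = cos(pi * (2j + 1) * k / (2N))
        basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
        scale = np.full(N, np.sqrt(2.0 / N))
        scale[0] = np.sqrt(1.0 / N)      # orthonormal scaling
        return scale[:, None] * (basis @ x)
    return dct1(dct1(block).T).T         # rows, then columns

def lip_dct_features(lip_pixels, n_coef=6):
    """Flattened low-order DCT coefficients of a grayscale lip rectangle."""
    return dct2(lip_pixels)[:n_coef, :n_coef].ravel()
```

Keeping only the top-left coefficients retains the smooth, low-frequency shape of the mouth region (open vs. closed, teeth/tongue visibility) while discarding pixel-level detail.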
Visual Noise
• Lighting variability
  • Physical model
  • Variance normalization
• Head-pose variability
  • Physical model
  • Linear and log-linear regression
• Dimensionality reduction
  • Linear discriminant analysis
  • Within-condition PCA
• Facial feature variability
  • MLLR
Lighting Variability
• Physical model (isotropic reflection): the measured color (r, g, b) is the product of the direction-independent reflectance (γ_r(t), γ_g(t), γ_b(t)) of a moving fleshpoint and its lighting (λ_r, λ_g, λ_b)
• Solution: variance normalization
Lighting Variability
• Variance normalization is useful even if the lip rectangle is marked by high-contrast lighting…
• …but time-varying high-contrast lighting would fool it
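A sketch of why variance normalization removes a constant multiplicative lighting factor: the factor scales both the per-channel mean and the standard deviation, so it cancels in the normalized features.

```python
import numpy as np

def mean_var_normalize(frames):
    """Per-channel mean and variance normalization over an utterance.
    frames: (T, D) array of visual features, e.g. the (r, g, b) values
    of a fleshpoint over time."""
    mu = frames.mean(axis=0)
    sd = frames.std(axis=0)
    return (frames - mu) / np.maximum(sd, 1e-8)

rng = np.random.default_rng(0)
x = rng.random((100, 3))                 # reflectance over time
lighting = np.array([0.5, 1.3, 2.0])     # per-channel lighting factor
z1 = mean_var_normalize(x)
z2 = mean_var_normalize(x * lighting)    # same fleshpoint, new lighting
# z1 and z2 are identical: the lighting factor cancels
```

This is exactly the failure case on the slide, too: if the lighting factor changes *within* the utterance, it no longer scales mean and variance uniformly and the normalization cannot remove it.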
Head-Pose Variability
• If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of the roll ρ, yaw ψ, pitch φ, true height ħ_F, and true width w_F according to … [equation on slide]
• … which can usefully be approximated as … [equation on slide]
Linear Regression
• The additive random part of the lip width, w_L(t) = w_1 + ħ_L cos ψ(t) sin ρ(t), is proportional to the similar additive variation in the head width, w_F(t) = w_{F1} + ħ_F cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).
Log-Linear Regression
• The multiplicative random part of the lip width, w_1(t) = w_2 cos ψ(t) cos ρ(t), is proportional to the similar multiplicative variation in the head width, w_F(t) = w_F cos ψ(t) cos ρ(t), so we can eliminate it by orthogonalizing log w_L(t) to log w_F(t).
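Both slides reduce to the same operation: regress the lip-width track on the face-width track (in the linear or the log domain) and keep the residual. A sketch with synthetic pose and speech signals (the signal shapes are illustrative):

```python
import numpy as np

def orthogonalize(target, reference):
    """Residual of a least-squares regression of `target` on `reference`.
    Removes the component of the lip track that is linearly predictable
    from the face track (head-pose variation), keeping the rest."""
    X = np.column_stack([np.ones_like(reference), reference])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

t = np.linspace(0, 1, 200)
pose = np.cos(2 * np.pi * t)        # stand-in for cos(psi) sin(rho)
w_face = 100 + 30 * pose            # face width shares the pose term
speech = np.sin(10 * np.pi * t)     # the part we want to keep
w_lip = 20 + 6 * pose + speech      # lip width: pose term + speech

residual = orthogonalize(w_lip, w_face)                   # linear regression
log_residual = orthogonalize(np.log(w_lip), np.log(w_face))  # log-linear
```

The residual is decorrelated from the pose-driven face-width variation while preserving the speech-driven lip motion.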
Facial Feature Variability
• … tends to result in large changes in the feature mean (e.g., different talkers have different average lip-rectangle sizes)
• Changes in the class-dependent feature mean can be compensated by MLLR
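A simplified stand-in for MLLR: real MLLR finds the affine transform of the Gaussian means that maximizes the likelihood of the adaptation data under the HMM; here we merely least-squares fit one shared transform from speaker-independent class means to adaptation-data class means, which illustrates the shared-transform idea without the EM machinery.

```python
import numpy as np

def estimate_mean_transform(si_means, adapt_means):
    """Least-squares estimate of a global affine transform (A, b) with
    adapt_means[c] ~ A @ si_means[c] + b for every class c.
    si_means, adapt_means: (C, D) arrays of per-class feature means."""
    C, D = si_means.shape
    X = np.hstack([si_means, np.ones((C, 1))])            # (C, D+1)
    W, *_ = np.linalg.lstsq(X, adapt_means, rcond=None)   # (D+1, D)
    A, b = W[:D].T, W[D]
    return A, b

def adapt(si_means, A, b):
    """Apply the shared transform to all class means at once."""
    return si_means @ A.T + b
```

Because one (A, b) is shared across all classes, even classes with little adaptation data get their means moved, which is the point of MLLR-style adaptation.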
WER Results from AVICAR
• LR = linear regression
• LLR = log-linear regression
• Model = model-based head-pose compensation
• 13+d+dd = 13 static features plus deltas and delta-deltas
• 39 = 39 static features
• All systems have mean and variance normalization and MLLR
[WER numbers shown in the slide's chart]
Audio-Visual Asynchrony
• For example, the tongue touches the teeth before the acoustic speech onset in the word "three"; the lips are already rounded in anticipation of the /r/.
Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model
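A sketch of the coupled/product-HMM idea: decode over the product of a left-to-right audio chain and a left-to-right video chain, allowing the two chains to drift apart by at most one state, which is how the model tolerates audio-visual asynchrony. Transition scores are omitted (uniform) for brevity; this is an illustration, not the lecture's exact model.

```python
import numpy as np

def coupled_viterbi(log_b_a, log_b_v, n_states, max_async=1):
    """Viterbi over the product state space (audio state, video state).
    log_b_a[t, i]: audio log-likelihood of audio state i at frame t;
    log_b_v[t, j]: video log-likelihood of video state j at frame t.
    Both chains are left-to-right and start in state 0; the chains may
    differ by at most `max_async` states at any frame."""
    T = log_b_a.shape[0]
    states = [(i, j) for i in range(n_states) for j in range(n_states)
              if abs(i - j) <= max_async]
    NEG = -1e30
    delta = {s: (log_b_a[0, 0] + log_b_v[0, 0]) if s == (0, 0) else NEG
             for s in states}
    back = []
    for t in range(1, T):
        new, bp = {}, {}
        for (i, j) in states:
            # each chain independently stays put or advances one state
            preds = [(pi, pj) for pi in (i - 1, i) for pj in (j - 1, j)
                     if (pi, pj) in delta]
            best = max(preds, key=lambda p: delta[p])
            new[(i, j)] = delta[best] + log_b_a[t, i] + log_b_v[t, j]
            bp[(i, j)] = best
        delta = new
        back.append(bp)
    end = (n_states - 1, n_states - 1)   # both chains must finish
    path = [end]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path)), delta[end]
```

With `max_async=0` this degenerates to a synchronous two-stream HMM; increasing `max_async` lets the video chain lead or lag the audio chain, as in the "three" example above.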
Asynchrony in Gestural Phonology
[Gestural score for "three" — Lips: round → spread; Tongue: dental critical → retroflex narrow → palatal narrow; Glottis: unvoiced → voiced; time runs left to right]
Modeling Asynchrony Using Constriction State Variables
[Dynamic Bayesian network with variables Word, Glottis, Tongue, Lips, Audio, and Video at frames t and t+1]
Summary
• Video feature extraction: it works!
• Audio-visual fusion using GMTK:
  • Partha has the phoneme-viseme model working
  • The articulatory-feature model is in progress