This lecture covers topics such as visual speech features, lip tracking, object tracking, computational complexity, AdaBoost, and AVICAR. It also discusses techniques for dealing with visual noise.
Dealing with Acoustic Noise, Part 3: Video
Mark Hasegawa-Johnson, University of Illinois
Lectures at CLSP WS06, July 20, 2006
Audio-Visual Speech Recognition: WS00 and WS06
• Visual speech features
  • DCT of lip rectangle
  • Active Appearance Models
• Feature normalization
  • Mean and variance normalization
  • MLLR, fiducial-point LR and log-LR
  • LDA and PCA
• Audio-video fusion
  • Two-stream HMM
  • Product HMM & coupled HMM
  • Streams based on constriction states
Object Tracking: Multi-Resolution (Neti et al., 2000)
• Computational complexity of lip tracking:
  • In a 480×720 image, there are 1.2×10¹¹ candidate lip rectangles.
• Multi-resolution lip tracking:
  • Train lip detectors at each resolution, i.e., the corners of the lip rectangle must be integer multiples of R_i
• Beam search:
  • Keep the N best candidates at resolution R_i
  • At resolution R_{i+1}, consider only the candidates within ±R_i/2 of a candidate from the N-best list at resolution R_i
• Tune R_1, R_2, … and N to trade off accuracy against computational complexity
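The coarse-to-fine beam search above can be sketched as follows. For a quick illustration this uses a 72×48 image and resolutions 16/8/4 rather than the full 480×720 search, and the `score` callable stands in for a trained lip detector:

```python
import itertools

def refine(rect, prev_R, R, W, H):
    """Candidates whose corners lie within +/- prev_R/2 of rect, on grid R."""
    x0, y0, x1, y1 = rect
    span = range(-prev_R // 2, prev_R // 2 + 1, R)
    for dx0, dy0, dx1, dy1 in itertools.product(span, repeat=4):
        nx0, ny0, nx1, ny1 = x0 + dx0, y0 + dy0, x1 + dx1, y1 + dy1
        if 0 <= nx0 < nx1 <= W and 0 <= ny0 < ny1 <= H:
            yield (nx0, ny0, nx1, ny1)

def beam_search(score, W=72, H=48, resolutions=(16, 8, 4), beam=5):
    """Coarse-to-fine rectangle search: exhaustive at the coarsest
    resolution, then refine only the N-best candidates at each level."""
    R0 = resolutions[0]
    # exhaustive search on the coarsest grid
    cands = [(x0, y0, x1, y1)
             for x0 in range(0, W, R0) for x1 in range(x0 + R0, W + 1, R0)
             for y0 in range(0, H, R0) for y1 in range(y0 + R0, H + 1, R0)]
    for prev_R, R in zip(resolutions, resolutions[1:]):
        best = sorted(cands, key=score, reverse=True)[:beam]  # keep N best
        cands = sorted({r for b in best for r in refine(b, prev_R, R, W, H)})
    return max(cands, key=score)
```

Tuning `resolutions` and `beam` trades accuracy against the number of `score` evaluations, exactly as on the slide.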
Object Tracking: AdaBoost (Schapire, 1999)
• Each Viola-Jones feature defines a "weak classifier": h_{a,b,c,d,i}(x) = 1 iff f_i(a,b,c,d) > threshold, else h_{a,b,c,d,i}(x) = 0
• Start with equal weight for all training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M
• For each learning iteration t:
  • Find the (a,b,c,d,i) that minimizes the weighted training error ε_t
  • w_m(t+1) = w_m(t)(1−ε_t)/ε_t if token m was incorrectly classified, else w_m(t+1) = w_m(t); then renormalize so Σ_m w_m(t+1) = 1
  • α_t = log((1−ε_t)/ε_t)
• Final "strong classifier": H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t
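The boosting loop can be sketched as below; for brevity, generic one-feature threshold stumps stand in for the Viola-Jones rectangle features f_i(a,b,c,d):

```python
import numpy as np

def adaboost(X, y, n_rounds=10):
    """Discrete AdaBoost with one-feature threshold stumps.
    X: (M, D) features; y: labels in {0, 1}."""
    M, D = X.shape
    w = np.full(M, 1.0 / M)              # equal initial weights
    classifiers = []
    for _ in range(n_rounds):
        best = None
        for d in range(D):               # search all weak classifiers
            for thr in np.unique(X[:, d]):
                for sign in (1, -1):
                    pred = ((sign * X[:, d]) > (sign * thr)).astype(int)
                    eps = w[pred != y].sum()   # weighted training error
                    if best is None or eps < best[0]:
                        best = (eps, d, thr, sign, pred)
        eps, d, thr, sign, pred = best
        eps = max(eps, 1e-12)            # guard against a perfect stump
        alpha = np.log((1 - eps) / eps)
        w[pred != y] *= (1 - eps) / eps  # up-weight misclassified tokens
        w /= w.sum()                     # renormalize
        classifiers.append((alpha, d, thr, sign))
    return classifiers

def strong_classify(classifiers, x):
    """H(x) = 1 iff the weighted vote exceeds half the total alpha."""
    total = sum(a for a, *_ in classifiers)
    vote = sum(a for a, d, thr, s in classifiers if (s * x[d]) > (s * thr))
    return int(vote > 0.5 * total)
```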
AdaBoost in a Bayesian Context
• p(M_D(x) | C_i) is well approximated by a Gaussian
  • C_i = 0: object absent
  • C_i = 1: object present
  • M_D(x) defined by …
• The probability distribution of the face center (x, y) and log(width, height) is well modeled by Gaussians
• The probability distribution of the lip center (x, y) and size (w, h) relative to the face (normalized to the range [−1, 1]) is compact and unimodal
• Find the lip rectangle and face rectangle that jointly maximize the product of probabilities
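A minimal sketch of the joint face-lip scoring idea, assuming independent Gaussians for the face's center and log-size and for the lip center normalized to [−1, 1] within the face. All parameter values here are illustrative guesses, not numbers from the lecture:

```python
import math

def log_gauss(x, mu, sd):
    """Log of a univariate Gaussian density."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

def joint_score(face, lip, face_model, lip_model):
    """Joint log-probability of a face and a lip rectangle (cx, cy, w, h).
    face_model: (mu, sd) pairs for the face's centre x, centre y, and
    log width/height; lip_model: (mu, sd) pairs for the lip centre
    normalized to [-1, 1] within the face."""
    fx, fy, fw, fh = face
    lx, ly, lw, lh = lip
    s = 0.0
    for v, (mu, sd) in zip((fx, fy, math.log(fw), math.log(fh)), face_model):
        s += log_gauss(v, mu, sd)
    rx = 2 * (lx - fx) / fw    # lip centre relative to face, in [-1, 1]
    ry = 2 * (ly - fy) / fh
    for v, (mu, sd) in zip((rx, ry), lip_model):
        s += log_gauss(v, mu, sd)
    return s
```

In the full system one would evaluate `joint_score` over candidate face/lip pairs and keep the maximizing pair.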
Geometry Features Can Be Useful for AVSR (Chu and Huang, 2002, using features of Tsuhan Chen)
[Figure: visual feature extraction pipeline]
Constellation Models (Koch)
• Each patch is recognized by a likelihood p(pixels)
• Relative geometries are controlled by a geometry PDF
• Advantages:
  • Good object-detection accuracy
  • Provides information about object components
• Disadvantage: computational complexity
AVICAR "Constellation"
• Four face rectangles provide information about face location, width, and height (useful for normalization)
• The positions of the lip rectangles within the four faces provide information about head angle (useful for normalization)
• Lip height and width provide information about whether the mouth is open or closed (useful for speech recognition)
• The DCT of the pixels within all four lip rectangles gives information about teeth and tongue (useful for speech recognition)
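The DCT-of-lip-rectangle features used here (and in the overview slide) can be sketched as a separable 2-D DCT of the grayscale lip patch, keeping only the low-order coefficients; the 6×6 coefficient block is an illustrative choice, not the lecture's exact configuration:

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II via two separable 1-D DCTs."""
    def dct1(x):
        N = x.shape[0]
        n = np.arange(N)
        # basis[k, j] = cos(pi * (2j + 1) * k / (2N))
        basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
        scale = np.full(N, np.sqrt(2.0 / N))
        scale[0] = np.sqrt(1.0 / N)      # orthonormal scaling
        return scale[:, None] * (basis @ x)
    return dct1(dct1(block).T).T         # rows, then columns

def lip_dct_features(lip_pixels, n_coef=6):
    """Flattened low-order DCT coefficients of a grayscale lip rectangle."""
    return dct2(lip_pixels)[:n_coef, :n_coef].ravel()
```

Keeping only the top-left coefficients retains the smooth, low-frequency shape of the mouth region (open vs. closed, teeth/tongue visibility) while discarding pixel-level detail.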
Visual Noise
• Lighting variability
  • Physical model
  • Variance normalization
• Head-pose variability
  • Physical model
  • Linear and log-linear regression
• Dimensionality reduction
  • Linear discriminant analysis
  • Within-condition PCA
• Facial feature variability
  • MLLR
Lighting Variability
• Physical model (isotropic reflection): the measured color (r, g, b) is the product of the direction-independent reflectance (γ_r(t), γ_g(t), γ_b(t)) of a moving fleshpoint and its lighting (λ_r, λ_g, λ_b)
• Solution: variance normalization
Lighting Variability
• Variance normalization is useful even if the lip rectangle is marked by high-contrast lighting…
• …but time-varying high-contrast lighting would fool it
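A sketch of why variance normalization removes a constant multiplicative lighting factor: the factor scales both the per-channel mean and the standard deviation, so it cancels in the normalized features.

```python
import numpy as np

def mean_var_normalize(frames):
    """Per-channel mean and variance normalization over an utterance.
    frames: (T, D) array of visual features, e.g. the (r, g, b) values
    of a fleshpoint over time."""
    mu = frames.mean(axis=0)
    sd = frames.std(axis=0)
    return (frames - mu) / np.maximum(sd, 1e-8)

rng = np.random.default_rng(0)
x = rng.random((100, 3))                 # reflectance over time
lighting = np.array([0.5, 1.3, 2.0])     # per-channel lighting factor
z1 = mean_var_normalize(x)
z2 = mean_var_normalize(x * lighting)    # same fleshpoint, new lighting
# z1 and z2 are identical: the lighting factor cancels
```

This is exactly the failure case on the slide, too: if the lighting factor changes *within* the utterance, it no longer scales mean and variance uniformly and the normalization cannot remove it.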
Head-Pose Variability
• If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of the roll ρ, yaw ψ, pitch φ, true height ħ_F, and true width w_F according to … [equation on slide]
• … which can usefully be approximated as … [equation on slide]
Linear Regression
• The additive random part of the lip width, w_L(t) = w_1 + ħ_L cos ψ(t) sin ρ(t), is proportional to the similar additive variation in the head width, w_F(t) = w_{F1} + ħ_F cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).
Log-Linear Regression
• The multiplicative random part of the lip width, w_1(t) = w_2 cos ψ(t) cos ρ(t), is proportional to the similar multiplicative variation in the head width, w_F(t) = w_F cos ψ(t) cos ρ(t), so we can eliminate it by orthogonalizing log w_L(t) to log w_F(t).
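Both slides reduce to the same operation: regress the lip-width track on the face-width track (in the linear or the log domain) and keep the residual. A sketch with synthetic pose and speech signals (the signal shapes are illustrative):

```python
import numpy as np

def orthogonalize(target, reference):
    """Residual of a least-squares regression of `target` on `reference`.
    Removes the component of the lip track that is linearly predictable
    from the face track (head-pose variation), keeping the rest."""
    X = np.column_stack([np.ones_like(reference), reference])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

t = np.linspace(0, 1, 200)
pose = np.cos(2 * np.pi * t)        # stand-in for cos(psi) sin(rho)
w_face = 100 + 30 * pose            # face width shares the pose term
speech = np.sin(10 * np.pi * t)     # the part we want to keep
w_lip = 20 + 6 * pose + speech      # lip width: pose term + speech

residual = orthogonalize(w_lip, w_face)                   # linear regression
log_residual = orthogonalize(np.log(w_lip), np.log(w_face))  # log-linear
```

The residual is decorrelated from the pose-driven face-width variation while preserving the speech-driven lip motion.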
Facial Feature Variability
• … tends to result in large changes in the feature mean (e.g., different talkers have different average lip-rectangle sizes)
• Changes in the class-dependent feature mean can be compensated by MLLR
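A simplified stand-in for MLLR: real MLLR finds the affine transform of the Gaussian means that maximizes the likelihood of the adaptation data under the HMM; here we merely least-squares fit one shared transform from speaker-independent class means to adaptation-data class means, which illustrates the shared-transform idea without the EM machinery.

```python
import numpy as np

def estimate_mean_transform(si_means, adapt_means):
    """Least-squares estimate of a global affine transform (A, b) with
    adapt_means[c] ~ A @ si_means[c] + b for every class c.
    si_means, adapt_means: (C, D) arrays of per-class feature means."""
    C, D = si_means.shape
    X = np.hstack([si_means, np.ones((C, 1))])            # (C, D+1)
    W, *_ = np.linalg.lstsq(X, adapt_means, rcond=None)   # (D+1, D)
    A, b = W[:D].T, W[D]
    return A, b

def adapt(si_means, A, b):
    """Apply the shared transform to all class means at once."""
    return si_means @ A.T + b
```

Because one (A, b) is shared across all classes, even classes with little adaptation data get their means moved, which is the point of MLLR-style adaptation.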
WER Results from AVICAR
• LR = linear regression
• LLR = log-linear regression
• Model = model-based head-pose compensation
• 13+d+dd = 13 static features plus deltas and delta-deltas
• 39 = 39 static features
• All systems have mean and variance normalization and MLLR
[WER numbers shown in the slide's chart]
Audio-Visual Asynchrony
• For example, the tongue touches the teeth before the acoustic speech onset in the word "three"; the lips are already rounded in anticipation of the /r/.
Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model
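A sketch of the coupled/product-HMM idea: decode over the product of a left-to-right audio chain and a left-to-right video chain, allowing the two chains to drift apart by at most one state, which is how the model tolerates audio-visual asynchrony. Transition scores are omitted (uniform) for brevity; this is an illustration, not the lecture's exact model.

```python
import numpy as np

def coupled_viterbi(log_b_a, log_b_v, n_states, max_async=1):
    """Viterbi over the product state space (audio state, video state).
    log_b_a[t, i]: audio log-likelihood of audio state i at frame t;
    log_b_v[t, j]: video log-likelihood of video state j at frame t.
    Both chains are left-to-right and start in state 0; the chains may
    differ by at most `max_async` states at any frame."""
    T = log_b_a.shape[0]
    states = [(i, j) for i in range(n_states) for j in range(n_states)
              if abs(i - j) <= max_async]
    NEG = -1e30
    delta = {s: (log_b_a[0, 0] + log_b_v[0, 0]) if s == (0, 0) else NEG
             for s in states}
    back = []
    for t in range(1, T):
        new, bp = {}, {}
        for (i, j) in states:
            # each chain independently stays put or advances one state
            preds = [(pi, pj) for pi in (i - 1, i) for pj in (j - 1, j)
                     if (pi, pj) in delta]
            best = max(preds, key=lambda p: delta[p])
            new[(i, j)] = delta[best] + log_b_a[t, i] + log_b_v[t, j]
            bp[(i, j)] = best
        delta = new
        back.append(bp)
    end = (n_states - 1, n_states - 1)   # both chains must finish
    path = [end]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path)), delta[end]
```

With `max_async=0` this degenerates to a synchronous two-stream HMM; increasing `max_async` lets the video chain lead or lag the audio chain, as in the "three" example above.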
Asynchrony in Gestural Phonology
[Gestural score for "three" — Lips: round → spread; Tongue: dental critical → retroflex narrow → palatal narrow; Glottis: unvoiced → voiced; time runs left to right]
Modeling Asynchrony Using Constriction State Variables
[Dynamic Bayesian network with variables Word, Glottis, Tongue, Lips, Audio, and Video at frames t and t+1]
Summary
• Video feature extraction: it works!
• Audio-visual fusion using GMTK:
  • Partha has the phoneme-viseme model working
  • The articulatory-feature model is in progress