Audio-Visual Graphical Models
Nebojsa Jojic, Microsoft Research, Redmond, Washington
Hagai Attias, Microsoft Research, Redmond, Washington
Matthew Beal, Gatsby Unit, University College London
Overview
• Some background to the problem
• A simple video model
• A simple audio model
• Combining these in a principled manner
• Results of tracking experiments
• Further work and thoughts
Beal, Jojic and Attias, ICASSP’02
Motivation – applications
• Teleconferencing
  • We need the speaker’s identity, position, and individual speech.
  • The case of multiple speakers.
• Denoising
  • Speech enhancement using video cues (at different scales).
  • Video enhancement using audio cues.
• Multimedia editing
  • Isolating/removing/adding objects, visually and aurally.
• Multimedia retrieval
  • Efficient multimedia searching.
Motivation – current state of the art
• Video models and audio models
  • Abundance of work on object tracking, image stabilization, …
  • Large amount in speech recognition, ICA (blind source separation), microphone array processing, …
• Very little work on combining these
  • We desire a principled combination.
  • Robust learning of environments using multiple modalities.
• Various past approaches:
  • Information theory: Hershey & Movellan (NIPS 12)
  • SVD-esque (FaceSync): Slaney & Covell (NIPS 13)
  • Subspace statistics: Fisher et al. (NIPS 13)
  • Periodicity analysis: Ross Cutler
  • Particle filters: Vermaak, Blake et al. (ICASSP 2001)
  • System engineering: Yong Rui (CVPR 2001)
• Our approach: graphical models (Bayes nets).
Generative density modeling
• Probability models that
  • reflect desired structure,
  • randomly generate plausible images and sounds,
  • represent the data by parameters.
• ML estimation
• p(image|class) used for recognition, detection, …
• Examples: mixture of Gaussians, PCA/FA/ICA, Kalman filter, HMM
• All parameters can be learned from data!
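The "p(image|class) used for recognition" point can be made concrete with a minimal sketch. This is a toy two-class mixture of Gaussians on 2-D points (all means, variances, and priors here are made-up values, not from the talk): classification is just Bayes' rule applied to the class-conditional densities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: each class s has a mean template mu[s]; data are
# generated by picking a class and adding isotropic Gaussian noise.
mu = np.array([[0.0, 0.0], [5.0, 5.0]])   # per-class means (assumed values)
sigma2 = 1.0                               # shared isotropic variance
pi = np.array([0.5, 0.5])                  # class priors

def log_p_x_given_s(x, s):
    """Log-likelihood of a point x under class s."""
    d = x - mu[s]
    return -0.5 * (d @ d) / sigma2 - np.log(2 * np.pi * sigma2)

def classify(x):
    """Bayes' rule: argmax_s p(s) p(x|s)."""
    scores = [np.log(pi[s]) + log_p_x_given_s(x, s) for s in range(len(pi))]
    return int(np.argmax(scores))

# Generate a plausible sample from class 1 and recognize it:
x = mu[1] + rng.normal(scale=1.0, size=2)
print(classify(x))
```

The same recipe scales up: replace the 2-D point with an image vector and the two means with learned templates, and the classifier becomes a template recognizer.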
Speaker detection & tracking problem
[Figure: in the video scenario, the camera sees the speaker at image position (lx, ly); in the audio scenario, a source at lx reaches mic. 1 and mic. 2 with relative time delay τt.]
Bayes nets for multimedia
• Video models
  • Models such as Jojic & Frey (NIPS’99, CVPR’99, ’00, ’01).
• Audio models
  • Work of Attias (Neural Comp. ’98); Attias, Platt, Deng & Acero (NIPS’00, EuroSpeech’01).
A generative video model for scenes (see Frey & Jojic, CVPR’99, NIPS’01)
[Graphical model: class s → mean μs → latent image z → shift (lx, ly) → transformed image → generated/observed image y]
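The generative chain in the graphical model can be sketched in a few lines. This is a toy version with assumed dimensions and hand-made templates (not the learned ones from the talk): draw a class s, take its mean template as the latent image z, apply a discrete 2-D shift (lx, ly), and add camera noise to produce the observed frame y.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 16x16 frames, 2 classes with simple square templates.
H, W, S = 16, 16, 2
templates = np.zeros((S, H, W))
templates[0, 4:8, 4:8] = 1.0     # class 0: small square
templates[1, 8:14, 8:14] = 1.0   # class 1: larger square

def generate_frame(noise_std=0.05):
    """Sample one frame from the generative chain s -> z -> shift -> y."""
    s = int(rng.integers(S))               # class s
    z = templates[s]                       # latent image (template mean)
    lx, ly = int(rng.integers(W)), int(rng.integers(H))
    shifted = np.roll(np.roll(z, ly, axis=0), lx, axis=1)  # transformed image
    y = shifted + rng.normal(scale=noise_std, size=(H, W)) # observed frame y
    return s, (lx, ly), y

s, shift, y = generate_frame()
print(y.shape)
```

Inference runs this chain backwards: given y, infer a posterior over the class and the shift, which is exactly what the tracking results later in the talk visualize.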
Example
[Figure: frames from the DATA sequence, a one-class summary (mean and variance), and a 5-class model.]
• Hand-held camera
• Moving subject
• Cluttered background
A failure mode of this model
Modeling scenes – the audio part
[Figure: a source at lx produces signals at mic. 1 and mic. 2, offset by the time delay τt.]
Unaided audio model
[Figure: posterior over the time delay τ (range −15 to +15 samples) plotted against time, alongside the audio waveform and video frames.]
• Posterior probability over τ, the time delay.
• Periods of quiet cause uncertainty in τ (grey blurring).
• Occasionally reverberations/noise corrupt inference on τ and we become certain of a false time delay.
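The time-delay cue behind this slide can be sketched with a classical cross-correlation estimator over lags in ±15 samples. The signals and noise levels below are made-up; the point is only that the correlation peaks at the true delay when the source is active, and (as the slide notes) the peak flattens out and becomes unreliable when the source is quiet or reverberant.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-microphone setup: mic 2 receives a delayed copy of mic 1's
# signal plus independent noise (delay and noise levels are assumptions).
N, true_tau, max_lag = 2000, 7, 15
source = rng.normal(size=N)
x1 = source + 0.05 * rng.normal(size=N)
x2 = np.roll(source, true_tau) + 0.05 * rng.normal(size=N)

# Cross-correlation over candidate lags in [-15, +15] samples;
# the argmax is the delay estimate.
lags = np.arange(-max_lag, max_lag + 1)
corr = [np.dot(x1, np.roll(x2, -int(lag))) for lag in lags]
tau_hat = int(lags[int(np.argmax(corr))])
print(tau_hat)
```

Normalizing the correlation profile yields the kind of posterior-over-τ strip shown in the figure; the model in the talk treats τ probabilistically rather than committing to the argmax.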
Limit of this simple audio model
Multimodal localization
• Time delay τ is approximately linear in horizontal position lx.
• Define a stochastic mapping from spatial location to temporal shift:
  p(τ | lx) = N(τ; α lx + β, ντ)
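This linear link is what lets an observed audio delay vote for image positions. A minimal sketch, with made-up values for the calibration parameters α, β, ντ (in the talk they are learned by EM, not set by hand):

```python
import numpy as np

# Hypothetical calibration: tau ~ N(alpha*lx + beta, nu_tau).
alpha, beta, nu_tau = 0.2, -3.0, 0.5

def log_p_tau_given_lx(tau, lx):
    """Gaussian log-density linking delay tau to horizontal position lx."""
    mean = alpha * lx + beta
    return -0.5 * (tau - mean) ** 2 / nu_tau - 0.5 * np.log(2 * np.pi * nu_tau)

# An observed delay scores every candidate image position:
positions = np.arange(0, 32)
tau_obs = 1.0
scores = log_p_tau_given_lx(tau_obs, positions)
best_lx = int(positions[np.argmax(scores)])
print(best_lx)
```

In the full model these scores are combined with the video likelihood over shifts, so a confident audio delay can pull the tracker toward the right lx and vice versa.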
The combined model
The combined model
• Two halves connected by the τ–lx link.
• Maximize n_a log p(x_t) + n_v log p(y_t)
Learning using EM: E-step
The distribution Q over the hidden variables is inferred given the current setting of all model parameters.
Learning using EM: M-step
Given the distribution over the hidden variables, the parameters are set to maximize the data likelihood.
• Video:
  • object templates μs and precisions φs
  • camera noise ψ
• Audio:
  • relative microphone attenuations λ1, λ2 and noise levels ν1, ν2
• AV: calibration between modalities
  • α, β, ντ
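The E/M alternation itself can be illustrated on a much smaller problem than the audio-visual model. The sketch below fits a 1-D two-component Gaussian mixture with known unit variance (a deliberately toy stand-in, with made-up data): the E-step infers the responsibilities Q(s|x), and the M-step re-estimates the means to maximize the expected data likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: two clusters, around -2 and +3 (assumed values).
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
mu = np.array([-0.5, 0.5])  # initial means

for _ in range(20):
    # E-step: responsibilities Q(s|x) under unit-variance Gaussians.
    logp = -0.5 * (data[:, None] - mu[None, :]) ** 2
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: each mean moves to the responsibility-weighted average.
    mu = (q * data[:, None]).sum(axis=0) / q.sum(axis=0)

print(np.round(mu, 1))
```

In the talk's model the hidden variables are s, z, (lx, ly) and τ, and the M-step updates the templates, noise levels, and the α, β, ντ calibration, but the alternation has exactly this shape.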
Efficient inference and integration over all shifts (Frey and Jojic, NIPS’01)
• E-step: estimating the posterior Q(lx, ly) involves computing Mahalanobis distances for all possible shifts in the image.
• M-step: estimating the model parameters involves integrating over all possible shifts, taking into account the probability map Q(lx, ly).
• The E-step reduces to correlation and the M-step to convolution; both are done efficiently using FFTs.
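The FFT trick can be demonstrated on a toy 2-D case. The sketch below (assumed sizes, random template) scores every circular shift of a template against a frame at once: the elementwise product of one FFT with the conjugate of the other, inverse-transformed, is the full cross-correlation map, and its argmax recovers the shift.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy frame: a shifted copy of a random template.
H, W = 32, 32
template = rng.normal(size=(H, W))
true_shift = (5, 11)
frame = np.roll(np.roll(template, true_shift[0], axis=0),
                true_shift[1], axis=1)

# Circular cross-correlation of frame against template, all shifts at once:
corr = np.real(np.fft.ifft2(np.fft.fft2(frame) *
                            np.conj(np.fft.fft2(template))))
ly, lx = np.unravel_index(int(np.argmax(corr)), corr.shape)
print(ly, lx)
```

A naive loop would cost O(HW) per candidate shift, i.e. O((HW)^2) total; the FFT route does all shifts in O(HW log HW), which is what makes integrating over every (lx, ly) in the E- and M-steps practical.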
Demonstration of tracking
[Video: tracking with audio only (A), video only (V), and the combined model (AV), with the weighting n_a/n_v.]
Inside EM iterations
[Figure: the posteriors Q(τ | x1, x2, y) and Q(lx | x1, x2, y) sharpening over EM iterations 1, 2, 4, and 10.]
Tracking / stabilization
[Video: the tracked sequence and the stabilized output.]
Work in progress: models
• Incorporating a more sophisticated speech model
  • Layers of sound
  • Reverberation filters
• Extension to y-localization is trivial.
• Temporal models of speech.
• Incorporating a more sophisticated video model
  • Layered templates (sprites), each with their own audio (circumvents dimensionality issues).
  • Fine-scale correlations between pixel intensities and speech.
  • Hierarchical models? (factor-analyser trees)
• Tractability issues:
  • Variational approximations in both audio and video.
Basic flexible layer model (CVPR’01)
Future work: applications
• Multimedia editing
  • Removing/adding objects’ appearances and associated sounds.
  • With layers in both audio and video (cocktail party / danceclub).
• Video-assisted speech enhancement
  • Improved denoising with knowledge of the source location.
  • Exploit fine-scale correlations of video with audio (e.g. lips).
• Multimedia retrieval
  • Given a short clip as a query, search for similar matches in a database.
Summary
• A generative model of audio-visual data.
• All parameters learned from the data, including the camera/microphone calibration, in a few iterations of EM.
• Extensions to multi-object models.
• Real issue: the other curse of dimensionality.
Pixel-audio correlations analysis
[Figure: the original video sequence alongside latent-variable activations (factors, subspace vectors) inferred by factor analysis (probabilistic PCA) and by SVD.]