Multi-Camera, Multi-Microphone Speech Recognition in a Moving Car
Mark Hasegawa-Johnson
Thanks to the Motorola Communications Center, and to Tom Huang, Ming Liu, Bowon Lee, Laehoon Kim, Camille Goudeseune, Michael McLaughlin, Sarah Borys, Jonathan Boley, Suketu Kamdar, Karen Livescu, and Partha Lal.
AVICAR Recording Hardware (Bowon Lee et al., ICSLP 2004)
• 8 mics, pre-amps, wooden baffle; best placement: sunvisor
• 4 cameras, glare shields, adjustable mounting; best placement: dashboard
• System is not permanently installed; mounting requires 10 minutes.
AVICAR Database
• 100 talkers
• 4 cameras, 7 microphones
• 5 noise conditions: engine idling, 35mph, 35mph with windows open, 55mph, 55mph with windows open
• Three types of utterances:
  • Digits & phone numbers, for training and testing phone-number recognizers
  • Phonetically balanced sentences, for training and testing large-vocabulary speech recognition
  • Isolated letters, to test the use of video for an acoustically hard recognition problem
• Open-IP public release to 15 institutions in 5 countries
Experiments with AVICAR Data
• Audio
  • Multi-microphone voice activity detection
  • Multi-microphone spectral estimation
• Video
  • Facial feature tracking & segmentation
  • Head-pose and facial-feature normalization
• Fusion
  • Models of audio-visual asynchrony
  • Gestural-phonology-based models of constriction-state asynchrony
Noise from the Perspective of the Brainstem
• Something happened!
• What does the new thing look/sound like?
• What is it?
A Conservative Computational Audio(-ish) Model
[Block diagram: the signal x(t), together with the signal from the other ear, passes through the basilar membrane and variable-threshold synapses, which act as amplitude compression (Ghitza, 1986); the auditory nerve carries the loudness spectrum ~ |X(f)|^(2/3) (Fletcher). Average background loudness l_N and video change detection feed an MMSE estimator of the "new event," E[|S(f)|^(2/3) | X]; loudness is then mapped to phonetic features ("tandem") and on to lexical access (networks with feedback).]
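The compressive front end named in the diagram is easy to state in code. Below is a minimal sketch of the loudness spectrum |X(f)|^(2/3), assuming a short-time FFT front end; the frame length, hop, and window are illustrative choices, not values from the slides.

```python
import numpy as np

def loudness_spectrum(x, frame_len=400, hop=160):
    """Short-time loudness spectrum ~ |X(f)|^(2/3): a power-law
    amplitude compression of the magnitude spectrum.
    Frame length, hop, and window are illustrative."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)      # one-sided spectrum per frame
    return np.abs(X) ** (2.0 / 3.0)      # compressive nonlinearity
```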
Optimal Multi-Channel Audio Estimation (Balan and Rosca, 2002; Laehoon Kim, ICASSP 2006)
• Goal: MMSE estimate of the clean speech loudness spectrum |S(f)|^(2/3), given the multichannel reverberant noisy measurement X = HS + V (H = reverberant channel, V = noise).
• Given a good channel estimate, the posterior p(|S(f)|^(2/3) | X) factors into two stages:
  • p(Ŝ | X), the PDF given X of the MMSE estimate of the time-domain speech signal
  • p(|S(f)|^(2/3) | Ŝ), the PDF given Ŝ of any function of S (e.g., of its loudness spectrum)
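In symbols, the two-stage factorization in the bullets above can be written as follows; the integral form is my reconstruction from the slide text, with Ŝ the MMSE time-domain estimate:

$$
p\big(|S(f)|^{2/3} \,\big|\, X\big) \;=\; \int p\big(|S(f)|^{2/3} \,\big|\, \hat{S}\big)\; p\big(\hat{S} \,\big|\, X\big)\, d\hat{S}, \qquad X = HS + V.
$$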
Results: MMSE-optimal beamformer followed by MMSE log-spectral amplitude estimator
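As a concrete (hedged) illustration of that pipeline: the sketch below substitutes a simple delay-and-sum beamformer for the MMSE-optimal beamformer of Balan and Rosca, followed by the Ephraim-Malah MMSE log-spectral amplitude gain. The function names, the decision-directed smoothing constant, and the noise-PSD input are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import exp1   # exponential integral E1

def delay_and_sum(X, delays):
    """X: (n_mics, n_bins) one-sided STFT of one frame; delays: per-mic
    arrival delays in samples (assumed known or estimated elsewhere).
    Phase-aligns the channels and averages them."""
    n_mics, n_bins = X.shape
    n_fft = 2 * (n_bins - 1)
    k = np.arange(n_bins)
    steer = np.exp(2j * np.pi * np.outer(delays, k) / n_fft)
    return (X * steer).mean(axis=0)

def mmse_lsa_gain(xi, gamma):
    """Ephraim-Malah (1985) MMSE log-spectral amplitude gain.
    xi: a-priori SNR, gamma: a-posteriori SNR, per frequency bin."""
    v = xi * gamma / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

def enhance_frame(X, delays, noise_psd, xi_prev, alpha=0.98):
    """Beamform one frame, then apply the spectral gain.  The a-priori
    SNR uses a simplified decision-directed estimate."""
    Y = delay_and_sum(X, delays)
    gamma = np.maximum(np.abs(Y) ** 2 / noise_psd, 1e-2)
    xi = np.maximum(alpha * xi_prev
                    + (1 - alpha) * np.maximum(gamma - 1.0, 0.0), 1e-3)
    return mmse_lsa_gain(xi, gamma) * Y, xi
```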
Word Error Rate with Beamformers
[Figure: WER for ten-digit phone numbers; trained and tested with a 50/50 mix of quiet (engine idling) and very noisy (55mph, windows open) data; MLLR adaptation.]
WER with Spectral Estimation
[Figure: WER for ten-digit phone numbers; trained and tested with a 50/50 mix of quiet (engine idling) and very noisy (55mph, windows open) data; MLLR adaptation.]
AVICAR "Constellation"
• Four face rectangles provide information about head pose (useful for normalization)
• Positions of the lips within the face rectangles provide information about head pose (useful for normalization)
• Lip height and width provide information about speech (useful for speech recognition)
• The DCT of the pixels within all four lip rectangles provides information about speech (useful for speech recognition); see the sketch after this list
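A minimal sketch of the lip-rectangle DCT feature; the fixed patch size and the number of retained low-frequency coefficients are my assumptions, not values from the slides.

```python
import numpy as np
from scipy.fftpack import dct

def lip_dct_features(lip_patch, n_keep=8):
    """2-D DCT of one grayscale lip rectangle; keep the n_keep x n_keep
    low-frequency coefficients (top-left corner) as a feature vector.
    In an AVICAR-style system this would be computed for all four lip
    rectangles and concatenated."""
    patch = np.asarray(lip_patch, dtype=float)
    coeffs = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:n_keep, :n_keep].ravel()
```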
Facial Feature Variability
• … tends to result in large changes in the feature mean (e.g., different talkers have different average lip-rectangle sizes)
• Changes in the class-dependent feature means can be compensated by maximum likelihood linear regression (MLLR); see the sketch below
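A minimal sketch of MLLR mean adaptation, assuming diagonal-covariance Gaussians and a single global transform; the closed-form row-wise solution follows Leggetter and Woodland (1995), and all variable names are illustrative.

```python
import numpy as np

def mllr_mean_transform(obs, gammas, means, variances):
    """Estimate a global MLLR mean transform W (d x (d+1)), so that the
    adapted mean of Gaussian m is W @ [1, mu_m].
    obs: (T, d) adaptation frames; gammas: (T, M) Gaussian posteriors;
    means, variances: (M, d) diagonal-Gaussian parameters."""
    T, d = obs.shape
    M = means.shape[0]
    xi = np.hstack([np.ones((M, 1)), means])   # extended means, (M, d+1)
    occ = gammas.sum(axis=0)                   # (M,) total occupancy
    W = np.zeros((d, d + 1))
    for i in range(d):                         # solve one row at a time
        w = occ / variances[:, i]              # occupancy / variance, (M,)
        G = (xi * w[:, None]).T @ xi           # (d+1, d+1)
        k = xi.T @ ((gammas / variances[:, i]).T @ obs[:, i])
        W[i] = np.linalg.solve(G, k)
    return W

# Adapted mean for Gaussian m: W @ np.concatenate(([1.0], means[m]))
```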
Video-Only Phone Number Recognition
Figure legend:
• LR = linear regression
• Model = model-based head-pose compensation
• LLR = log-linear regression
• 13+d+dd = 13 static features plus deltas and delta-deltas
• 39 = 39 static features
All systems use mean and variance normalization and MLLR (see the sketch below).
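A minimal sketch of the mean and variance normalization step; the per-utterance granularity is my assumption.

```python
import numpy as np

def mean_variance_normalize(feats, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance
    over the utterance.  feats: (T, d) feature matrix."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
```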
Audio-Visual Asynchrony
For example, the tongue touches the teeth before acoustic speech onset in the word "three"; the lips are already rounded in anticipation of the /r/.
Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model (Stephen Chu and Tom Huang, 2001)
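In a coupled HMM, each stream keeps its own state chain, but every chain's transition is conditioned on both chains' previous states, which is what lets the audio and video streams drift apart in time while remaining coupled. In my notation (not taken from the slides), with q^a_t the audio (phoneme) state and q^v_t the video (viseme) state at frame t:

$$
P\big(q^a_t, q^v_t \,\big|\, q^a_{t-1}, q^v_{t-1}\big) \;=\; P\big(q^a_t \,\big|\, q^a_{t-1}, q^v_{t-1}\big)\, P\big(q^v_t \,\big|\, q^a_{t-1}, q^v_{t-1}\big).
$$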
Asynchrony in Gestural Phonology (Browman and Goldstein, 1992)
[Gestural score for "three", one tier per articulator over time: Lips: Round → Spread; Tongue: Dental Critical → Retroflex Narrow → Palatal Narrow; Glottis: Unvoiced → Voiced.]
Modeling Asynchrony Using Constriction State Variables in a DBN (Livescu and Saenko, 2005)
[DBN unrolled over frames t and t+1, with variables Word, Glottis, Tongue, Lips, Audio, and Video at each frame.]
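One plausible reading of the graph gives the per-frame factorization below; the exact edge set, in particular the video observation depending only on the lips, is my assumption rather than a statement from the slides. Writing A_t = (Glottis_t, Tongue_t, Lips_t):

$$
P\big(\text{Audio}_t, \text{Video}_t, A_t \,\big|\, A_{t-1}, \text{Word}_t\big) \;=\; \Big[\prod_{a \in A} P\big(a_t \,\big|\, a_{t-1}, \text{Word}_t\big)\Big]\, P\big(\text{Audio}_t \,\big|\, A_t\big)\, P\big(\text{Video}_t \,\big|\, \text{Lips}_t\big).
$$

Each articulatory chain evolves semi-independently given the word, which is how the model represents constriction-state asynchrony.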
Summary of Results
• A 100-talker audiovisual speech database was recorded in moving automobiles and is available free.
• MMSE multichannel estimation of cepstral features; best current WER in the multi-noise condition: 3.5%.
• Stereo video features: DCT + lip size, with lighting and head-pose normalization; best current WER: 47%.
• Audiovisual fusion: the goal is to test a DBN model based on articulatory phonology.