AVICAR is a removable in-car recording system, with a microphone array on the sun visor and cameras on the dashboard, used to collect a 100-talker audiovisual database for speech recognition experiments in automotive noise. The experiments cover stereo video feature extraction, multichannel beamforming, and speech enhancement. In a preliminary test on speech mixed with music, a factorial HMM recognizer achieved a 57% lower Word Error Rate (WER) than a conventional HMM.
AVICAR: Audiovisual Speech Recognition in a Car
Mark Hasegawa-Johnson, Thomas Huang, Stephen E. Levinson, Camille Goudeseune, Hank Kaczmarski, Michael McLaughlin, Yoshihisa Shinagawa
Bowon Lee, Ming Liu, Laehoon Kim, Ameya Deoras, Sarah Borys, Jonathan Boley, Suketu Kamdar, Danfeng Li
AVICAR Recording Hardware
• 8 mics, pre-amps, wooden baffle. Best place: sun visor.
• 4 cameras, glare shields, adjustable mounting. Best place: dashboard.
• System is not permanently installed; mounting requires 10 minutes.
AVICAR Database
• 100 talkers
• 8 microphones, 4 cameras
• 5 noise conditions: engine idling, 35 mph, 35 mph with windows open, 55 mph, 55 mph with windows open
• Two types of utterances:
  • Digits & phone numbers, for training and testing phone-number recognizers
  • Phonetically balanced sentences, for training and testing large-vocabulary speech recognition
• Open-IP public release to 15 institutions, 3 continents
Experiments with AVICAR Data
• Video
  • Lip Tracking & Video Feature Extraction
  • 3D-from-Stereo Video Feature Extraction
• Audio
  • Beamforming & Speech Detection
  • Noise Modeling & Speech Enhancement
• Speech Recognition
Video: 3D-from-Stereo
[Figure: left image, right image, and transformed right image computed from the left image]
• Point correspondences computed using dense stereo matching
• Correspondence around the lips is good most of the time.
• Occasional large errors caused by large differences in brightness and background.
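The slide does not say which dense matching algorithm was used. As an illustration only, the sketch below uses brute-force block matching with a sum-of-absolute-differences (SAD) cost on a rectified pair; the function name `scanline_disparity`, the window size, and the synthetic images are all assumptions made for this example.

```python
import numpy as np

def scanline_disparity(left, right, max_disp, half_win=2):
    """Brute-force SAD block matching along rows of a rectified stereo pair.

    For each left-image pixel, search the same row of the right image over
    shifts 0..max_disp and keep the shift with the lowest patch difference.
    Border pixels that lack a full search window are left at 0.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            patch = left[y - half_win:y + half_win + 1,
                         x - half_win:x + half_win + 1]
            costs = []
            for d in range(max_disp + 1):
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                costs.append(np.abs(patch - cand).sum())
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic check: the right image is the left image shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.random((12, 30))
true_disp = 3
right = np.roll(left, -true_disp, axis=1)

disp = scanline_disparity(left, right, max_disp=5)
interior = disp[2:-2, 7:-2]   # pixels with a full search window
print(np.unique(interior))    # [3]
```

A raw SAD cost compares intensities directly, so a brightness offset between the two cameras raises the cost of the correct match. That is one plausible way for the large errors noted on the slide to arise, though the slide does not give the cause in that detail.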
Speech Enhancement: MVDR Beamformer + MMSE-logSA
• Goal: MMSE estimate of clean speech cepstrum given multichannel noisy measurement
• Solution:
  • Beamformer based on explicit models of (1) inter-microphone noise coherence and (2) automobile interior frequency response...
  • Followed by a single-channel MMSE log spectral amplitude estimator
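For reference, the beamformer stage in its textbook form computes, per frequency bin, the weights w = R⁻¹d / (dᴴR⁻¹d), where R is the noise spatial covariance (this is where a model of inter-microphone noise coherence would enter) and d is the steering vector (where a model of the car interior response would enter). The sketch below is not the authors' implementation: the covariance, the steering vector, and the function name `mvdr_weights` are hypothetical values chosen only to show the computation.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights for one frequency bin: w = R^-1 d / (d^H R^-1 d)."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

n_mics = 8
idx = np.arange(n_mics)

# Hypothetical steering vector: unit-magnitude phase ramp across the array.
d = np.exp(-1j * 2 * np.pi * 0.1 * idx)

# Hypothetical noise covariance: coherence decays with microphone separation,
# plus a small diagonal term standing in for uncorrelated sensor noise.
R = (0.9 ** np.abs(idx[:, None] - idx[None, :])
     + 1e-3 * np.eye(n_mics)).astype(complex)

w = mvdr_weights(R, d)

response = w.conj() @ d                 # distortionless constraint: w^H d = 1
noise_out = np.real(w.conj() @ R @ w)   # noise power after beamforming
noise_in = np.real(R[0, 0])             # noise power at a single microphone

print(abs(response))
print(noise_out, noise_in)
```

The beamformer output for a bin is y = wᴴx. Under these assumptions the output noise power cannot exceed that of a single microphone, because a single-microphone pick-off is itself a distortionless weight vector. The single-channel MMSE-logSA stage would then operate on y.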
MVDR+MMSE-logSA
• MVDR eliminates high-frequency noise; MMSE-logSA eliminates low-frequency noise.
• MMSE-logSA adds reverberation at low frequencies; this reverberation does not seem to affect speech recognition accuracy.
Summary of Results
• 100-talker audiovisual speech database recorded in moving automobiles
• Stereo video features for visual speech recognition
• MMSE multichannel estimate of cepstral features
• Recognition using factorial HMM. Preliminary results: 57% WER reduction
Multimodal Speech Recognition in Noise: Factorial HMM
• Chain q(t) models speech audio
• Chain r(t) models noise audio
• Third chain (not shown) will model speech video; speech video & audio synchronized via joint transition probabilities (Chu & Huang, 2002)
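One common way to evaluate a two-chain factorial HMM is to work in the product state space of (q, r) pairs. The sketch below assumes the speech and noise chains evolve independently, so the joint transition matrix is a Kronecker product, and that the likelihood of each frame under each joint state is already available. All numbers and function names are toy values for illustration; the coupled audio-video chain mentioned on the slide is not modeled here.

```python
import numpy as np

def joint_transitions(a_speech, a_noise):
    """Transition matrix over (q, r) pairs for two independently evolving chains.

    Joint state index is q * n_noise_states + r.
    """
    return np.kron(a_speech, a_noise)

def forward_loglik(pi, trans, obs_lik):
    """Scaled forward algorithm; obs_lik[t, s] = p(frame t | joint state s)."""
    alpha = pi * obs_lik[0]
    loglik = 0.0
    for t in range(len(obs_lik)):
        if t > 0:
            alpha = (alpha @ trans) * obs_lik[t]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return loglik

# Toy chains: 2 speech states q, 2 noise states r.
a_speech = np.array([[0.9, 0.1], [0.2, 0.8]])
a_noise = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.kron(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

A = joint_transitions(a_speech, a_noise)

# Hypothetical per-frame likelihoods for the 4 joint states (3 frames).
obs_lik = np.array([
    [0.6, 0.1, 0.2, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.1, 0.1, 0.3, 0.5],
])

loglik = forward_loglik(pi, A, obs_lik)
print(A.shape, loglik)
```

The product space grows multiplicatively with each added chain, which is why larger factorial models usually rely on structured or approximate inference instead of this direct expansion.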
Factorial HMM Preliminary Test: Recognize Speech in Music

| Movement | Sequence | Digits | Trials | HMM WER | HMM Ave. SNR | FHMM WER | FHMM Ave. SNR |
|---|---|---|---|---|---|---|---|
| Allegro Assai | Random | 0 - 6 | 35 | 83% | -0.39 | 35% | -0.65 |
| Allegro Assai | Sequential | 0 - 6 | 35 | 77% | 0.55 | 34% | 0.55 |
| Andante | Random | 3 - 8 | 35 | 70% | 4.34 | 27% | 4.77 |
| Andante | Sequential | 1 - 8 | 72 | 70% | 10.43 | 34% | 10.43 |

Factorial HMM has 57% lower Word Error Rate (WER) than HMM
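The 57% figure can be reproduced from the four table rows. The sketch below assumes the headline number is the relative reduction between the unweighted mean WERs of the two systems. The slide does not state how the average was formed, and weighting each row by its trial count gives a slightly lower value (about 56%).

```python
# Rows transcribed from the table: (condition, trials, HMM WER %, FHMM WER %).
ROWS = [
    ("Allegro Assai, random", 35, 83.0, 35.0),
    ("Allegro Assai, sequential", 35, 77.0, 34.0),
    ("Andante, random", 35, 70.0, 27.0),
    ("Andante, sequential", 72, 70.0, 34.0),
]

def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction, as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def weighted_mean(values, weights):
    values, weights = list(values), list(weights)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Assumption 1: simple average over the four conditions.
hmm_mean = mean(r[2] for r in ROWS)
fhmm_mean = mean(r[3] for r in ROWS)
reduction = relative_wer_reduction(hmm_mean, fhmm_mean)

# Assumption 2: average weighted by the number of trials per condition.
trials = [r[1] for r in ROWS]
hmm_w = weighted_mean((r[2] for r in ROWS), trials)
fhmm_w = weighted_mean((r[3] for r in ROWS), trials)
weighted_reduction = relative_wer_reduction(hmm_w, fhmm_w)

print(f"unweighted: {100 * reduction:.1f}%")          # unweighted: 56.7%
print(f"weighted:   {100 * weighted_reduction:.1f}%")  # weighted:   55.6%
```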