AVICAR: Audiovisual Speech Recognition in a Car

AVICAR is an audiovisual speech database recorded in moving automobiles, together with the removable recording hardware used to collect it: 8 microphones mounted on the sun visor and 4 cameras mounted on the dashboard. The presentation covers experiments on the data in video feature extraction, multichannel speech enhancement, and speech recognition. In a preliminary speech-in-music test, a factorial HMM achieved a 57% lower Word Error Rate (WER) than an HMM.

Presentation Transcript


  1. AVICAR: Audiovisual Speech Recognition in a Car
     Mark Hasegawa-Johnson, Thomas Huang, Stephen E. Levinson, Camille Goudeseune, Hank Kaczmarski, Michael McLaughlin, Yoshihisa Shinagawa, Bowon Lee, Ming Liu, Laehoon Kim, Ameya Deoras, Sarah Borys, Jonathan Boley, Suketu Kamdar, Danfeng Li

  2. AVICAR Recording Hardware
     • 8 mics, pre-amps, wooden baffle. Best place = sun visor.
     • 4 cameras, glare shields, adjustable mounting. Best place = dashboard.
     • System is not permanently installed; mounting requires 10 minutes.

  3. AVICAR Database
     • 100 talkers
     • 8 microphones, 4 cameras
     • 5 noise conditions: engine idling, 35 mph, 35 mph with windows open, 55 mph, 55 mph with windows open
     • Two types of utterances:
       • Digits & phone numbers, for training and testing phone-number recognizers
       • Phonetically balanced sentences, for training and testing large-vocabulary speech recognition
     • Open-IP public release to 15 institutions on 3 continents

  4. AVICAR Database

  5. Experiments with AVICAR Data
     • Video
       • Lip Tracking & Video Feature Extraction
       • 3D-from-Stereo Video Feature Extraction
     • Audio
       • Beamforming & Speech Detection
       • Noise Modeling & Speech Enhancement
     • Speech Recognition

  6. Video: 3D-from-Stereo
     [Figure panels: left image; right image; transformed right image computed from left image]
     • Point correspondences computed using dense stereo matching
     • Correspondence around the lips is good most of the time.
     • Occasional large errors caused by big differences of brightness and background.
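
The dense stereo matching mentioned on this slide can be illustrated with a minimal block-matching sketch. It is not the authors' algorithm, which the slides do not specify: the sum-of-absolute-differences (SAD) cost, the function name `scanline_disparity`, the window size, and the disparity range are all assumptions chosen for illustration. The sketch matches a single rectified scanline of the left image against the right image.

```python
def scanline_disparity(left, right, max_disp, half_win):
    """Return a per-pixel disparity estimate for one rectified scanline.

    Assumes (for illustration) that a point at column x in the left image
    appears at column x - d in the right image, with 0 <= d <= max_disp.
    """
    n = len(left)
    disparity = [0] * n
    for x in range(half_win, n - half_win):
        best_cost, best_d = float("inf"), 0
        for d in range(max_disp + 1):
            if x - d - half_win < 0:
                break  # window would fall off the left edge of the right image
            # Sum of absolute differences over the matching window.
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-half_win, half_win + 1))
            if cost < best_cost:
                best_cost, best_d = cost, d
        disparity[x] = best_d
    return disparity


# Synthetic example: the right scanline is the left one shifted by 2 pixels.
left = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
right = left[2:] + [0, 0]
disparity = scanline_disparity(left, right, max_disp=4, half_win=1)
```

Because SAD compares raw intensities, a brightness difference between the two cameras raises the cost of the correct match, which is consistent with the occasional large errors noted on the slide.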

  7. Speech Enhancement: MVDR Beamformer + MMSE-logSA
     • Goal: MMSE estimate of clean speech cepstrum given multichannel noisy measurement
     • Solution:
       • Beamformer based on explicit models of (1) inter-microphone noise coherence and (2) auto interior frequency response...
       • Followed by a single-channel MMSE log spectral amplitude estimator
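
As a rough illustration of the beamforming step, the sketch below computes textbook MVDR weights, w = R⁻¹d / (dᴴR⁻¹d), for a single frequency bin. Everything specific here is an assumption rather than the authors' configuration: the 4-microphone line array, the 5 cm spacing, the 1 kHz bin, the 60° talker direction, the diffuse-field sin(x)/x coherence model with diagonal loading, and the free-field steering vector (which stands in for the car-interior response model mentioned on the slide). The single-channel MMSE-logSA post-filter is not shown.

```python
import cmath
import math


def solve(R, d):
    """Solve R x = d by Gauss-Jordan elimination with partial pivoting."""
    n = len(d)
    M = [list(R[i]) + [d[i]] for i in range(n)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def mvdr_weights(R, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    r_inv_d = solve(R, d)
    denom = sum(di.conjugate() * xi for di, xi in zip(d, r_inv_d))
    return [x / denom for x in r_inv_d]


def output_noise_power(w, R):
    """Noise power at the beamformer output, w^H R w."""
    n = len(w)
    return sum(w[i].conjugate() * R[i][j] * w[j]
               for i in range(n) for j in range(n)).real


# Hypothetical setup: 4 microphones, 5 cm apart, one 1 kHz bin.
C, FREQ, SPACING, N_MICS = 343.0, 1000.0, 0.05, 4
ANGLE = math.radians(60)  # assumed talker direction relative to the array axis


def coherence(i, j):
    """Diffuse-field noise coherence sin(x)/x, plus small diagonal loading."""
    if i == j:
        return 1.0 + 1e-2
    x = 2 * math.pi * FREQ * SPACING * abs(i - j) / C
    return math.sin(x) / x


R = [[complex(coherence(i, j)) for j in range(N_MICS)] for i in range(N_MICS)]
# Free-field steering vector (stands in for the interior response model).
d = [cmath.exp(-2j * math.pi * FREQ * m * SPACING * math.cos(ANGLE) / C)
     for m in range(N_MICS)]

w = mvdr_weights(R, d)
gain_toward_talker = sum(wi.conjugate() * di for wi, di in zip(w, d))
delay_and_sum = [di / N_MICS for di in d]
```

By construction the response toward the assumed talker direction (`gain_toward_talker`) is 1, and `output_noise_power(w, R)` cannot exceed that of the `delay_and_sum` weights, since both satisfy the same distortionless constraint and MVDR minimizes noise power under it.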

  8. MVDR+MMSE-logSA
     • MVDR eliminates high-frequency noise; MMSE-logSA eliminates low-frequency noise.
     • MMSE-logSA adds reverberation at low frequencies; the reverberation does not seem to affect speech recognition accuracy.

  9. Speech Recognition Accuracy

  10. Summary of Results
      • 100-talker audiovisual speech database recorded in moving automobiles
      • Stereo video features for visual speech recognition
      • MMSE multichannel estimate of cepstral features
      • Recognition using factorial HMM. Preliminary results: 57% WER reduction

  11. Multimodal Speech Recognition in Noise: Factorial HMM
      • Chain q(t) models speech audio
      • Chain r(t) models noise audio
      • Third chain (not shown) will model speech video; speech video & audio synchronized via joint transition probabilities (Chu & Huang, 2002)
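
The two-chain structure can be sketched as an ordinary HMM over the product state space (q, r). The sketch assumes, as in the standard factorial HMM formulation, that the speech chain q(t) and the noise chain r(t) evolve independently while each audio frame depends on both; under that assumption the joint transition matrix is the Kronecker product of the per-chain matrices. All state counts, probabilities, and frame likelihoods below are made up for illustration, and the planned video chain with joint transition probabilities is not modelled.

```python
def kron(A, B):
    """Kronecker product of two row-stochastic matrices (lists of lists)."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def forward_likelihood(prior, transition, likelihoods):
    """Forward algorithm: p(observations) summed over all joint state paths.

    likelihoods[t][s] is p(observation at time t | joint state s).
    """
    n = len(prior)
    alpha = [prior[s] * likelihoods[0][s] for s in range(n)]
    for obs in likelihoods[1:]:
        alpha = [obs[j] * sum(alpha[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)


# Hypothetical 2-state speech chain q(t) and 2-state noise chain r(t).
A_speech = [[0.9, 0.1], [0.2, 0.8]]
A_noise = [[0.7, 0.3], [0.4, 0.6]]
prior_speech, prior_noise = [0.5, 0.5], [0.6, 0.4]

# Joint state index s = q * len(A_noise) + r.
A_joint = kron(A_speech, A_noise)
prior_joint = [pq * pr for pq in prior_speech for pr in prior_noise]

# Made-up likelihoods of 3 audio frames under each of the 4 joint states.
frame_likelihoods = [
    [0.50, 0.10, 0.30, 0.05],
    [0.40, 0.20, 0.10, 0.10],
    [0.05, 0.30, 0.20, 0.60],
]
total = forward_likelihood(prior_joint, A_joint, frame_likelihoods)
```

The product state space grows multiplicatively with each added chain, which is why practical factorial HMM decoders usually exploit the factored structure rather than building `A_joint` explicitly; the explicit form is used here only because it is the easiest to read.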

  12. Factorial HMM Preliminary Test: Recognize Speech in Music

      | Movement | Speaker sequence | Digits | Trials | HMM WER | HMM Ave. SNR | FHMM WER | FHMM Ave. SNR |
      |---|---|---|---|---|---|---|---|
      | Allegro Assai | Random | 0 - 6 | 35 | 83% | -0.39 | 35% | -0.65 |
      | Allegro Assai | Sequential | 0 - 6 | 35 | 77% | 0.55 | 34% | 0.55 |
      | Andante | Random | 3 - 8 | 35 | 70% | 4.34 | 27% | 4.77 |
      | Andante | Sequential | 1 - 8 | 72 | 70% | 10.43 | 34% | 10.43 |

      Factorial HMM has 57% lower Word Error Rate (WER) than HMM.
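
The slide does not say how the 57% figure was derived. One reading that reproduces it is the relative reduction in WER averaged, unweighted, over the four test conditions; the short calculation below checks this and also shows the trial-weighted alternative. The variable names and the choice of averaging are assumptions, and only the WER percentages and trial counts come from the slide.

```python
# (HMM WER %, FHMM WER %, trials) for the four test conditions on the slide.
conditions = [
    (83, 35, 35),  # Allegro Assai, random
    (77, 34, 35),  # Allegro Assai, sequential
    (70, 27, 35),  # Andante, random
    (70, 34, 72),  # Andante, sequential
]


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Unweighted mean over the four conditions.
hmm_mean = sum(h for h, _, _ in conditions) / len(conditions)
fhmm_mean = sum(f for _, f, _ in conditions) / len(conditions)
unweighted = relative_reduction(hmm_mean, fhmm_mean)

# Mean weighted by the number of trials in each condition.
total_trials = sum(n for _, _, n in conditions)
hmm_weighted = sum(h * n for h, _, n in conditions) / total_trials
fhmm_weighted = sum(f * n for _, f, n in conditions) / total_trials
weighted = relative_reduction(hmm_weighted, fhmm_weighted)

print(f"unweighted: {unweighted:.1%}, trial-weighted: {weighted:.1%}")
# prints "unweighted: 56.7%, trial-weighted: 55.6%"
```

The unweighted mean WER falls from 75% to 32.5%, a 56.7% relative reduction that rounds to the quoted 57%.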
