Multimodal Speech and Dialog Recognition

This research explores robust features from multimodal sensors (binaural hearing, robust acoustic features, face tracking, and gesture recognition) and recognition with dynamic Bayesian networks (speechreading, prosody, user state recognition, and automatic language acquisition).

Presentation Transcript


  1. Multimodal Speech and Dialog Recognition Stephen Levinson, Thomas Huang, Mark Hasegawa-Johnson, Ken Chen, Stephen Chu, Ashutosh Garg, Zhinian Jing, Danfeng Li, John Lin, Mohamed Omar, Zhen Wen

  2. Outline • Robust Features from Multimodal Sensors • Binaural Hearing • Robust Acoustic Features • Face Tracking • Gesture Recognition • Recognition: Dynamic Bayesian Networks • Speechreading (CHMM) • Prosody (FHMM) • User State Recognition (HHMM) • Automatic Language Acquisition (Association)

  3. Research Environment 1: Multimodal HS Physics Tutor • User State Recognition determines: • System initiative vs. User initiative • Choice of topics, wording of explanations • User State based on observation of: • Dialog content (correctness, complexity) • Emotion recognition (pitch, speaking rate, video) • Usability Testers: Urbana HS Students
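As a rough illustration of how those cues could feed the initiative decision, here is a toy Python rule; every feature name and threshold below is a hypothetical placeholder, and the tutor itself infers user state probabilistically (with a hierarchical HMM, per the outline) rather than with hand-set rules.

```python
def choose_initiative(answer_correct, answer_complexity, pitch_variance, speaking_rate):
    """Toy mapping from dialog-content and emotion cues to an initiative decision.

    Feature scales and thresholds are hypothetical placeholders; the actual
    tutor estimates user state with an HHMM over these cues.
    """
    # A struggling user (incorrect, low-complexity answers; flat, slow speech)
    # gets more system initiative and a more detailed explanation.
    struggling = (not answer_correct and answer_complexity < 0.3
                  and pitch_variance < 0.2 and speaking_rate < 0.8)
    if struggling:
        return "system_initiative", "detailed_explanation"
    return "user_initiative", "brief_confirmation"
```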

  4. Research Environment 2: Illy, the Autonomous Language Learning Robot

  5. Binaural Hearing • Modified Griffiths-Jim beamformer • Direction of arrival cues learned from training sessions • Physically mobile platform allows efficient off-axis noise cancellation
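A minimal two-microphone sketch of the Griffiths-Jim (generalized sidelobe canceller) structure in Python/NumPy, assuming the channels are already time-aligned toward the target; the filter length, step size, and simple sum/difference fixed and blocking paths are illustrative assumptions, and the learned direction-of-arrival cues and mobile-platform steering mentioned above are not shown.

```python
import numpy as np

def griffiths_jim_2mic(left, right, n_taps=32, mu=0.01):
    """Two-microphone Griffiths-Jim beamformer sketch (NLMS adaptation).

    Fixed path: sum of the time-aligned channels (target + noise).
    Blocking path: their difference, which cancels the on-axis target and
    passes off-axis noise.  An adaptive filter then removes whatever in the
    fixed path is correlated with the blocking-path noise reference.
    """
    fixed = 0.5 * (left + right)
    blocked = left - right
    w = np.zeros(n_taps)
    out = np.zeros_like(fixed)
    for n in range(n_taps, len(fixed)):
        x = blocked[n - n_taps:n][::-1]      # most recent noise-reference samples
        noise_est = w @ x
        e = fixed[n] - noise_est             # enhanced output sample
        w += mu * e * x / (x @ x + 1e-8)     # normalized LMS update
        out[n] = e
    return out
```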

  6. MMI Feature Selection • MMI Acoustic Features • Select an approximate “Markov Blanket” • Independence measured using mutual information • Phoneme Recognition in quiet: • MFCC: 58% • LPCC: 56% • MMIA: 62%
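One way to realize the mutual-information / Markov-blanket idea is a greedy, mRMR-style selection that keeps features with high MI to the phone label and low MI to features already selected. The histogram MI estimator and the relevance-minus-redundancy score below are illustrative assumptions, not necessarily the exact criterion used in this work.

```python
import numpy as np

def hist_mi(x, y, bins=16):
    """Histogram estimate of mutual information I(x; y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def greedy_mmi_select(X, labels, n_select):
    """Greedily pick columns of X (frames x features) that are informative
    about the phone labels but not redundant with already-chosen features,
    approximating a Markov blanket of the label."""
    relevance = np.array([hist_mi(X[:, j], labels) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best_j, best_score = -1, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([hist_mi(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```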

  7. Auditory Scene Analysis • Voice-Index LPC: • VI = sub-band source membership function • Based on Auditory Scene Analysis (Meddis & Hewitt, 1992) • Digit Recognition: • MFCC: 20% @ 0dB • LPCC: 10% @ 0dB • VI-LPCC: 75% @ 0dB
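The slide does not define the voice index precisely, but in the spirit of correlogram-based auditory scene analysis it can be sketched as a per-band membership score: how strongly each sub-band follows the common pitch period of the target voice. Everything in the sketch below (band layout, filter order, the normalized-autocorrelation score, a known f0) is an assumption for illustration, not the authors' exact definition.

```python
import numpy as np
from scipy.signal import butter, lfilter

def subband_voice_index(frame, fs, f0, n_bands=8):
    """Illustrative sub-band voice index for one analysis frame.

    For each band, score the normalized autocorrelation of the band-passed
    signal at the pitch period: bands dominated by the periodic target voice
    score near 1, bands dominated by other sources score near 0.  The frame
    must be longer than one pitch period.
    """
    lag = int(round(fs / f0))
    edges = np.linspace(100.0, 0.95 * fs / 2, n_bands + 1)
    vi = np.zeros(n_bands)
    for b in range(n_bands):
        bb, ba = butter(2, [edges[b] / (fs / 2), edges[b + 1] / (fs / 2)], btype="band")
        x = lfilter(bb, ba, frame)
        num = np.dot(x[:-lag], x[lag:])
        den = np.sqrt(np.dot(x[:-lag], x[:-lag]) * np.dot(x[lag:], x[lag:])) + 1e-8
        vi[b] = max(0.0, num / den)
    return vi
```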

  8. Face Tracking • Piecewise Bézier Volume Deformation model
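The core operation in a Bézier-volume deformation model is evaluating a point inside a control-point lattice: mesh vertices embedded in the volume move as Bernstein-weighted sums of the control points, so displacing a few control points deforms the face smoothly. The sketch below evaluates a single Bézier volume; the actual PBVD model is piecewise, stitching several such volumes over the face mesh with continuity constraints.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_volume_point(ctrl, u, v, w):
    """Evaluate one Bézier volume at parametric coordinates (u, v, w) in [0, 1].

    ctrl has shape (n+1, m+1, l+1, 3): a 3-D lattice of control points.
    A mesh vertex embedded at (u, v, w) is the Bernstein-weighted sum of the
    control points, so moving control points deforms the embedded mesh.
    """
    n, m, l = ctrl.shape[0] - 1, ctrl.shape[1] - 1, ctrl.shape[2] - 1
    point = np.zeros(3)
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                weight = bernstein(n, i, u) * bernstein(m, j, v) * bernstein(l, k, w)
                point += weight * ctrl[i, j, k]
    return point
```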

  9. Gesture Recognition • Model-based and appearance-based methods both running • Model-based recognition: • 21 model dimensions • 7-dimensional PCA • 28 basis shapes map out most useful configurations
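For the model-based path, the dimensionality reduction can be sketched as ordinary PCA on 21-dimensional hand-model configuration vectors, with the 28 basis shapes stored as points in the resulting 7-dimensional subspace; the code below is a generic PCA sketch, not the project's exact pipeline.

```python
import numpy as np

def fit_hand_pca(configs, n_components=7):
    """PCA of hand-model configurations (rows are, e.g., 21 joint parameters)."""
    mean = configs.mean(axis=0)
    # SVD of the centered data; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(configs - mean, full_matrices=False)
    return mean, vt[:n_components]            # components: (7, 21)

def project(config, mean, components):
    """Map one 21-D hand configuration to its 7-D subspace coordinates,
    e.g. for matching against a library of 28 basis shapes."""
    return components @ (config - mean)
```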

  10. Audiovisual Speechreading • Audio = MFCC • Visual = 8 lip contour positions • A+V = feature vector concatenation • CHMM = coupled HMM, implemented in HTK.
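The A+V stream itself is just frame-level concatenation; the only subtlety is aligning the video-rate lip features with the audio-rate MFCC frames. In the sketch below the frame rates (100 and 30 frames/s) and the 13 MFCCs are assumptions, the 8 lip-contour positions come from the slide, and the coupled-HMM training and decoding (done in HTK) is not shown.

```python
import numpy as np

def concat_av_features(mfcc, lip, audio_rate=100.0, video_rate=30.0):
    """Concatenate acoustic and visual features at the audio frame rate.

    mfcc: (T_audio, 13) acoustic frames       (frame rate = audio_rate, assumed)
    lip:  (T_video, 8)  lip-contour positions (frame rate = video_rate, assumed)
    Returns (T_audio, 21) observations; asynchrony between the streams is
    handled by the coupled HMM, not by this concatenation step.
    """
    t_audio = np.arange(len(mfcc)) / audio_rate
    t_video = np.arange(len(lip)) / video_rate
    lip_up = np.column_stack(
        [np.interp(t_audio, t_video, lip[:, d]) for d in range(lip.shape[1])]
    )
    return np.hstack([mfcc, lip_up])
```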

  11. Prosody • Duration & Observation PDFs depend on (q = Phoneme State) & (k = Prosody State)
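Making the observation PDF depend on both the phoneme state q and the prosody state k amounts to indexing one output density per (q, k) pair; a Gaussian version is sketched below with assumed parameter shapes (full covariance, one Gaussian per pair). Duration models would be indexed by (q, k) in the same way.

```python
import numpy as np

def joint_emission_loglik(obs, means, log_dets, inv_covs):
    """Log-likelihood of one observation under every (phoneme q, prosody k) state.

    obs:      (D,)          one feature frame
    means:    (Q, K, D)     mean of p(obs | q, k)
    log_dets: (Q, K)        log|Sigma_qk|
    inv_covs: (Q, K, D, D)  inverse covariances
    Returns a (Q, K) table of Gaussian log-likelihoods.
    """
    diff = obs - means                                   # (Q, K, D)
    maha = np.einsum('qkd,qkde,qke->qk', diff, inv_covs, diff)
    d = obs.shape[0]
    return -0.5 * (d * np.log(2.0 * np.pi) + log_dets + maha)
```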

  12. User State Modeling

  13. Automatic Language Acquisition

  14. Conclusions • Objectives: • Theoretical Understanding • Practical Applications • Noise robustness achieved through: • Audiovisual Feature Design • Multimodal Integration
