
Presentation Transcript


  1. UNIVERSITE du MAINE
  Adaptive Fusion of Acoustic and Visual Sources for Automatic Speech Recognition
  Alexandrina Rogozan
  rogozan@lium.univ-lemans.fr

  2. Bio Sketch
  Assistant Professor in Computer Science and Electrical Engineering at University of Le Mans, France & member of the Speech Processing Group at LIUM
  1999: Ph.D. in Computer Science from University of Paris XI - Orsay
  • Heterogeneous Data Fusion for Audio-Visual Speech Recognition
  1995-1997: Participant in the French project AMIBE
  • Improvement of the Robustness and Confidentiality of Man-Machine Communication by using Audio and Visual Data
  • Universities of Grenoble, Le Mans, Toulouse, Avignon, Paris 6 & INRIA

  3. Research Activity
  • GOAL: Study the benefit of visual information for ASR
  • METHOD: Develop different audio-visual ASR systems
  • APPROACH: Copy the synergy observed in speech perception
  • EVALUATION: Test the accuracy of the recognition process on a speaker-dependent connected-letter task

  4. Overview
  1. Challenges in Audio-Visual ASR
  2. Audio-Visual Fusion Models
  3. Implementation of the Proposed Hybrid-fusion Model
  4. Improvements of the Hybrid-fusion Model
  5. Results and Comparisons on the AMIBE Database
  6. Conclusions and Perspectives

  5. 1. Audio-Visual Speech System Overview
  [Diagram: Visual Front End (Face Tracking, Lip Localization, Visual-features Extraction) and Acoustic-features Extraction feed a Joint Treatment module (Integration Strategies)]
  • Obtaining the synergy of acoustic and visual modalities
  • Audio-visual fusion results > uni-modal results

  6. 1. Unanswered questions in AV ASR:
  • When should the audio-visual fusion take place: before or after the categorization in each modality?
  • How to take into account the differences in the temporal evolution of speech events in the acoustic and visual modalities?
  • How to adapt the relative contribution of the acoustic and visual modalities during the recognition process?

  7. 1. Relative contribution of acoustic and visual modalities
  • Speech features: place & manner of articulation & voicing
  • Vary with the phonemic content
  Ex: Which modality distinguishes /m/ from /n/, and which /m/ from /p/?
  • Vary with the environmental context
  Ex: Acoustic features on the place of articulation are the least robust ones
  • Exploit the complementary nature of the modalities

  8. 1. Differences in temporal evolution of phonemes in acoustic and visual modalities
  • Anticipation & retention phenomena: temporal shift up to 250 ms [Abry & Lalouache, 1991]
  • 'Natural asynchrony'
  • Handled with different phonemic boundaries per modality
  • Vary with the phonemic content
  • Exploit the 'natural asynchrony'

  9. Overview
  1. Challenges in Audio-Visual ASR
  2. Audio-Visual Fusion Models
     - One-level Fusion Architectures
     - Hybrid Fusion Architecture
  3. Implementation of the Proposed Hybrid-Fusion Model
  4. Improvements of the Hybrid-Fusion Model
  5. Results and Comparisons on the AMIBE Database
  6. Conclusions and Perspectives

  10. 2. One-level fusion architectures
  • At the Data (Features) Level: Acoustic Data + Visual Data => Fusion => Categorization => Recognized Speech Unit
  • At the Results (Decision) Level: Acoustic Data => Categorization, Visual Data => Categorization, then Fusion => Recognized Speech Unit
  (A minimal sketch of both schemes follows below.)
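For concreteness, here is a minimal sketch of the two one-level schemes. The feature vectors, the `classify` function and the weight `lam` are hypothetical stand-ins for illustration, not the original system's components:

```python
# Minimal sketch of the two one-level fusion schemes, assuming
# per-frame acoustic features `a` (e.g. MFCCs) and visual features
# `v` (e.g. lip-shape parameters); names are illustrative only.
import numpy as np

def fuse_at_feature_level(a, v, classify):
    """Data-level fusion: one categorizer sees the joint vector."""
    av = np.concatenate([a, v])        # direct-identification (DI) input
    return classify(av)                # single audio-visual decision

def fuse_at_decision_level(score_a, score_v, lam=0.5):
    """Results-level fusion: combine per-modality log-scores.
    `lam` weights the acoustic stream against the visual one."""
    return lam * score_a + (1.0 - lam) * score_v
```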

  11. 2. Fusion before categorization
  [Diagram: Acoustic Data and Visual Data pass through an Adaptation stage and a single Categorization stage => Recognized Speech Unit]
  • Concatenation or Direct Identification (DI)
  • Re-coding in the Dominant (RD) or in a Motor space (RM) [Robert-Ribes, 1995]
  Problems: choice of the nature of the dominant space & temporal 'resynchronization' in the common space

  12. 2. Fusion after categorization
  [Diagram: Acoustic Data => Categorization, Visual Data => Categorization, then Fusion with Adaptation => Recognized Speech Unit]
  • Separate Identification (SI)
  • Parallel structure
  • Serial structure

  13. [Figure-only slide: diagram not recoverable from the transcript]

  14. 2. Level of audio-visual fusion in speech perception
  Ex: Lip image + larynx-frequency pulse train => voicing features [Grant, 1985] (scores: 4.7%, 28.9%, 51.1%)
  • Audio-visual fusion before categorization
  Ex: Lip image (t) + speech signal (t+T) => McGurk illusions [Massaro, 1996]: visual /ga/ + acoustic /ba/ => perceived audio-visual /da/
  • Audio-visual fusion after categorization
  • Flexibility and robustness of speech perception
  • Adaptability of the fusion mechanisms

  15. 2. Hybrid-fusion model for Audio-Visual ASR
  [Diagram: in the continuous, time-varying space of data, the acoustic (a), visual (v) and joint audio-visual (av) streams are each categorized; the DI path feeds the audio-visual (AV) categorizer, while the SI path fuses the per-modality results (A, V) with adaptation in the discrete, categorical space of results => sequence of phonemes]

  16. Overview
  1. Challenges in Audio-Visual ASR
  2. Audio-Visual Fusion Models
  3. Implementation of the Proposed Hybrid-fusion Model
     - Structure of the DI-based Fusion
     - Structure of the SI-based Fusion
  4. Improvements of the Hybrid-fusion Model
  5. Results and Comparisons on the AMIBE Database
  6. Conclusions and Perspectives

  17. 3. Implementation of the DI-based fusion
  [Diagram: as in the hybrid-fusion model of slide 15, with the audio-visual (AV) categorization realized by a phonemic HMM]

  18. 3. Characteristics of the DI-based fusion
  • Synchronization of acoustic and visual speech events on the phonemic HMM states
  • A visual stream that is too strong perturbs the acoustic one at the time of TRANSITIONS between HMM states and of speech-unit LABELING
  • Necessity to adapt the DI-based fusion

  19. 3. Adaptation of the DI-based fusion
  • To the RELEVANCE of speech features in each modality
  • To the RELIABILITY of processing in each modality
  • Necessity to estimate a posteriori the reliability of the global process

  20. 3. Realization of the adaptation in the DI-based fusion
  [Diagram: several DI fusions of the acoustic (A) and visual (V) streams, each with its own exponential weight λi ... λj, feed parallel phonemic HMMs; a choice module selects among the resulting phoneme sequences]
  • Exponential weight λ:
  • Global to the recognition hypothesis
  • Selected a posteriori
  • According to the SNR & the phonemic content
  (A sketch of this selection scheme follows below.)
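A minimal sketch of the weight-selection scheme drawn on this slide, assuming a hypothetical `decode` function standing in for the phonemic-HMM Viterbi decoder; the candidate weights and the selection criterion are illustrative assumptions:

```python
# Decode once per candidate exponent lambda, then choose a hypothesis
# a posteriori. Exponential weighting of probabilities is a linear
# combination in the log domain.
import numpy as np

def di_fusion_with_selection(log_p_audio, log_p_visual, decode,
                             candidate_lambdas=(0.2, 0.4, 0.6, 0.8)):
    hypotheses = []
    for lam in candidate_lambdas:
        joint = lam * log_p_audio + (1.0 - lam) * log_p_visual
        seq, score = decode(joint)      # Viterbi over phonemic HMM states
        hypotheses.append((score, lam, seq))
    best_score, best_lam, best_seq = max(hypotheses, key=lambda h: h[0])
    return best_seq, best_lam           # one global lambda per hypothesis
```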

  21. 3. Choice of the hybrid-fusion architecture
  [Diagram: hybrid-fusion model of slide 15, with the DI path realized by a phonemic HMM]
  DI + V => asynchronous fusion of information

  22. 3. Implementation of SI-based fusion
  [Diagram: the DI phonemic HMM produces the N-best phonetic solutions; a visual (v) phonemic HMM re-evaluates them, and the SI fusion with adaptation outputs the sequence of phonemes]
  • Serial structure => visual evaluation of the DI solutions

  23. 3. Characteristics of the SI-based fusion
  • Multiplication of the modality output probabilities
  • Possibility of a temporal shift up to 100 ms between the modality phonemic boundaries
  • 'Natural asynchrony' allowed
  • A visual stream that is too strong perturbs the acoustic one at the time of speech-unit LABELING
  • Necessity to adapt the SI-based fusion

  24. 3. Realization of the adaptation in the SI-based fusion
  • Exponential weight λ:
  • Calculated a posteriori according to the relative reliability of the acoustic and visual modalities
  • Dispersion of the 4-best solutions
  • Variation of λ with the SNR on the test data
  (See the reliability sketch below.)
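A sketch of the reliability estimate named above: the spread of the 4-best hypothesis scores within a modality serves as a proxy for that modality's reliability (a flat N-best list suggests an unreliable stream). The exact mapping from dispersion to the exponent is an assumption:

```python
# N-best dispersion as a per-modality reliability proxy.
import numpy as np

def nbest_dispersion(nbest_log_scores):
    """Mean gap between the best score and the next ones (4-best)."""
    s = np.sort(np.asarray(nbest_log_scores))[::-1][:4]
    return float(np.mean(s[0] - s[1:]))   # large gap => confident stream

def si_weight(nbest_audio, nbest_visual):
    """Relative acoustic reliability in [0, 1], used as exponent lambda."""
    da = nbest_dispersion(nbest_audio)
    dv = nbest_dispersion(nbest_visual)
    return da / (da + dv + 1e-9)
```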

  25. Overview
  1. Challenges in Audio-Visual ASR
  2. Audio-Visual Fusion Models
  3. Implementation of the Proposed Hybrid-fusion Model
  4. Improvements of the Hybrid-fusion Model
     - Visual Categorization
     - Parallel Structure for the SI-based Fusion
  5. Results and Comparisons on the AMIBE Database
  6. Conclusions and Perspectives

  26. 4. Type of interaction in the SI-based fusion
  • Effective integration vs. coherence verification
  • Depends on the ratio:
  • IMPROVEMENT: reinforcement of the purely-visual component
  • Discriminative learning
  • Effective visual categorization

  27. 4. Discriminative learning of visual speech by Neural Networks (NN)
  • Necessity of relevant visual differences between the classes to discriminate
  • Inconsistent with phonemic classes because of visual doubles, i.e. /p/, /b/, /m/
  • Use of adapted classes: VISEMES
  • Sources of variability: language, speech rate, variation among speakers

  28. 4. Definition of visemes
  [Figure: consonant visemes over s, d, k, ch, z, t, p, b, m, f, v, j, r, l, g and vowel visemes over a, u, e, i, o, y; some symbols lost in the transcript]
  • Extraction of visual phonemes from the training data
  • The middle of each acoustic-phonemic segment anchors a visual segment of 140 ms
  • Mapping of the extracted visual phonemes
  • Kohonen's algorithm for Self-Organising Map (SOM)
  • Identification of visemes
  • 3 resolution levels
  (A minimal SOM sketch follows below.)
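A minimal 1-D Kohonen SOM sketch for grouping visual-phoneme vectors into viseme candidates, in the spirit of the procedure above. The 140 ms visual segments are assumed to be pre-extracted into fixed-length feature vectors; all sizes, rates and the grouping step are illustrative assumptions:

```python
import numpy as np

def train_som(data, n_units=12, epochs=50, lr0=0.5, sigma0=3.0):
    """Train a 1-D self-organising map over (N, D) feature vectors."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=(n_units, data.shape[1]))   # codebook vectors
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # decaying learning rate
        sigma = max(sigma0 * (1 - t / epochs), 0.5) # shrinking neighborhood
        for x in rng.permutation(data):
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))
            d = np.abs(np.arange(n_units) - bmu)    # grid distance to BMU
            h = np.exp(-(d ** 2) / (2 * sigma ** 2))
            w += lr * h[:, None] * (x - w)          # pull units toward x
    return w

# Each visual phoneme is then assigned to its best-matching unit; units
# whose codebooks cluster together define one viseme class.
```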

  29. 4. Reinforced purely-visual component in the SI parallel structure
  • Getting rid of the temporal dependence between DI and V
  • Effective visual categorization
  • Difficulty in taking into account the temporal dimension of speech with NNs
  • Towards hybrid HMM-NN based categorization

  30. 4. Hybrid HMM-NN based categorization
  [Diagram: three variants, each taking visible speech as input and outputting a recognized sequence of visemes]
  • NN + HMM
  • HMM + NN
  • HMM / NN
  [The variants differ in whether the NN or the HMM provides the segmentation, the a posteriori probabilities, or the handling of viseme confusions; see the sketch after this list]
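One common way such HMM/NN hybrids are built, shown here as an illustrative assumption rather than the slide's exact variant, is to turn the NN's per-frame viseme posteriors into scaled likelihoods and use them as HMM emission scores. `net` is a hypothetical frame classifier, not the original TDNN:

```python
# Scaled-likelihood trick: log p(x|viseme) is approximated, up to the
# class-independent term log p(x), by log p(viseme|x) - log p(viseme).
import numpy as np

def emission_log_scores(frames, net, log_priors):
    """HMM emission scores from NN posteriors over (T, n_visemes)."""
    log_post = np.log(net(frames) + 1e-12)   # per-frame NN posteriors
    return log_post - log_priors             # divide out the priors
```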

  31. 4. Reinforced purely-visual component in the SI parallel structure
  [Diagram: hybrid-fusion model where the purely-visual (v) categorization is realized by a visemic HMM / NN, alongside the DI phonemic HMM]
  • Non-homogeneity of the output scores
  • Inconsistent with the previous multiplicative-based SI fusion

  32. 4. Implementation of SI-based fusion in a parallel structure
  [Diagram: the N phonemic solutions (1 ... N) are mapped from phonemes to visemes and matched, via an edit-distance based alignment, against the purely-visual output v; a likelihood-rate calculation with adaptation yields the sequence of phonemes; see the sketch below]
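A sketch of the alignment step above: each phonemic solution is mapped to a viseme string (through a hypothetical phoneme-to-viseme table, `to_visemes`) and aligned against the purely-visual output with a classic edit-distance DP; the rescoring rule shown is an illustrative simplification of the likelihood-rate calculation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two viseme sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i
    for j in range(n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[m][n]

def rescore(phoneme_solutions, to_visemes, visual_visemes):
    """Keep the N-best solution closest to the purely-visual output."""
    return min(phoneme_solutions,
               key=lambda s: edit_distance(to_visemes(s), visual_visemes))
```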

  33. 4. 'Phonetic Plus Post-categorical' model proposed by Burnham (1998)
  [Diagram: audible speech and visual speech are each categorized; fusion occurs both before and after categorization, with adaptation, yielding the perceived speech]
  • 2-level fusion architecture
  • Visual categorization by comparison to visemic prototypes
  • Optional use of the purely-visual component after categorization

  34. Overview
  1. Challenges in Audio-Visual ASR
  2. Audio-Visual Fusion Models
  3. Implementation of the Proposed Hybrid-fusion Model
  4. Improvements of the Hybrid-fusion Model
  5. Results and Comparisons on the AMIBE Database
  6. Conclusions and Perspectives

  35. 5. Experiments
  • Audio-visual data of the AMIBE project: connected letters; 'dining-hall' noise at SNRs of 10 dB, 0 dB and -10 dB
  • Speech features
    - Visual: internal lip-shape height, width and area + ' + '' (first- and second-order temporal derivatives)
    - Acoustic: 12 MFCC + energy + ' + ''
  • Speech modeling: HMM + duration model [Suaudeau & André-Obrecht, 1994]; TDNN, SOM
  (A sketch of the derivative features follows below.)
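A sketch of the delta (') and delta-delta ('') computation implied above, using a standard regression window over the static MFCC or lip-shape trajectories; the window width is an assumption:

```python
import numpy as np

def deltas(feats, k=2):
    """First-order regression deltas over a (T, D) feature matrix."""
    T = feats.shape[0]
    idx = np.arange(T)
    num = sum(n * (feats[np.clip(idx + n, 0, T - 1)]
                   - feats[np.clip(idx - n, 0, T - 1)])
              for n in range(1, k + 1))
    return num / (2 * sum(n * n for n in range(1, k + 1)))

def add_dynamics(static):
    d = deltas(static)                  # '
    dd = deltas(d)                      # ''
    return np.hstack([static, d, dd])   # full observation vector
```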

  36. 5. Results
  The hybrid-fusion model DI+V achieves the audio-visual synergy: fused results exceed both uni-modal results.

  37. 5. Results (in %)

         | -10 dB | 0 dB | 10 dB | clean
  AUDIO  |  -2.1  | 67.9 | 88.0  | 91.5
  VISUAL |  30.9  | 30.9 | 30.9  | 30.9
  DI     |  40.8  | 76.4 | 90.8  | 95.4
  SI     |   6.3  | 81.6 | 89.4  | 91.9
  DI+V   |  41.9  | 77.8 | 91.2  | 95.8

  38. 5. Comparisons
  • Master-Slave Model proposed at IRIT, Univ. Toulouse [André-Obrecht et al., 1997]
  • Product of Models proposed at LIUAPV, Univ. Avignon [Jourlin, 1998]

  39. 5. Master-Slave Model of IRIT (1997)
  [Diagram: a master labial HMM over three lip states (open lips, semi-open lips, closed lips) drives a slave acoustic HMM]
  The acoustic HMM parameters are probabilistic functions of the master labial HMM; a toy sketch of this conditioning follows below.
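An illustrative sketch of the master-slave idea: the master labial HMM evolves over a few lip states, and the slave acoustic HMM's parameters are looked up as functions of the current master state. The state names, the table layout and the emission interface are assumptions for illustration, not IRIT's implementation:

```python
# The slave's emission parameters depend on the master's lip state.
LIP_STATES = ("open", "semi-open", "closed")

def slave_emission_logprob(frame, lip_state, param_table, emission):
    """Acoustic emission score conditioned on the master's lip state."""
    means, covs = param_table[lip_state]   # state-dependent parameters
    return emission(frame, means, covs)    # e.g. Gaussian log-density
```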

  40. 5. Product of Models of LIUAPV (1998)
  [Diagram: a 3-state acoustic HMM (states 1-3, distributions D1(A)-D3(A), transitions T11, T12, T22, T23, T33) and a 3-state visual HMM (states 4-6, distributions D4(V)-D6(V), transitions T44, T45, T55, T56, T66) are combined into an audio-visual HMM over state pairs (1,4), (1,5), ..., (3,6), with transition probabilities given by products such as T11 x T44, T11 x T45, T12 x T56, T23 x T66]
  The audio-visual HMM parameters are computed from the separate acoustic and visual HMMs; a sketch of the transition construction follows below.
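Since the composite transitions are products Tij(A) x Tkl(V) over state pairs, the construction amounts to a Kronecker product of the two transition matrices. A minimal sketch, assuming row-stochastic matrices:

```python
import numpy as np

def product_hmm_transitions(trans_a, trans_v):
    """Composite transition matrix over (acoustic, visual) state pairs:
    T[(i,k) -> (j,l)] = trans_a[i, j] * trans_v[k, l]."""
    return np.kron(trans_a, trans_v)

# With the 3-state acoustic and 3-state visual chains drawn above,
# product_hmm_transitions yields a 9x9 matrix over the pairs (1,4)...(3,6).
```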

  41. 6. Conclusion: Contributions
  • Addressing the range of open problems in AV ASR: fusion, (a)synchrony, adaptation, visemes
  • Proposition of the hybrid-fusion DI+V model
  • A posteriori adaptation of the audio-visual fusion to variations of both the context and the content
  • Definition of specific visual units, visemes, by self-organization and grouping

  42. 6. Conclusion: Further Work
  • Use of visemes also during the DI-based fusion
  • Learning of the temporal shifts between modalities for the SI-based fusion
  • Definition of a dependency function between pre- and post-categorical weights
  • Modality-weight estimation at a finer level
  • Learning on larger training data & extensive testing

  43. 6. Perspectives
  Towards a global platform for Audio-Visual Speech Communication
  • Preprocessing: source localization, enhancement of the speech signal, scene analysis
  • Recognition
  • Synthesis
  • Coding
