UNIVERSITE du MAINE
Adaptive Fusion of Acoustic and Visual Sources for Automatic Speech Recognition
Alexandrina Rogozan
rogozan@lium.univ-lemans.fr
Bio Sketch
• Assistant Professor in Computer Science and Electrical Engineering at the University of Le Mans, France, and member of the Speech Processing Group at LIUM
• 1999: Ph.D. in Computer Science from the University of Paris XI - Orsay
  - Heterogeneous Data Fusion for Audio-Visual Speech Recognition
• 1995-1997: Participant in the French project AMIBE
  - Improvement of the Robustness and Confidentiality of Man-Machine Communication by using Audio and Visual Data
  - Universities of Grenoble, Le Mans, Toulouse, Avignon, Paris 6 & INRIA
Research Activity
• GOAL: Study the benefit of visual information for ASR
• METHOD: Develop different audio-visual ASR systems
• APPROACH: Copy the synergy observed in speech perception
• EVALUATION: Test the accuracy of the recognition process on a speaker-dependent connected-letter task
Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives
1. Audio-Visual Speech System Overview
[Diagram: a visual front end (face tracking, lip localization, visual-feature extraction) and an acoustic-feature extraction stage feed a joint treatment stage implementing the integration strategies]
• Goal: obtaining the synergy of the acoustic and visual modalities
• Audio-visual fusion results > uni-modal results
1. Unanswered Questions in AV ASR
• When should the audio-visual fusion take place: before or after the categorization in each modality?
• How to take into account the differences in the temporal evolution of speech events between the acoustic and visual modalities?
• How to adapt the relative contribution of the acoustic and visual modalities during the recognition process?
1. Relative Contribution of the Acoustic and Visual Modalities
• Speech features: place and manner of articulation, and voicing
• They vary with the phonemic content
  Ex.: which modality distinguishes /m/ from /n/, and which /m/ from /p/?
• They vary with the environmental context
  Ex.: acoustic cues to the place of articulation are the least robust ones
• => Exploit the complementary nature of the modalities
1. Differences in the Temporal Evolution of Phonemes across the Acoustic and Visual Modalities
• Anticipation and retention phenomena: temporal shifts of up to 250 ms [Abry & Lalouache, 1991]
• This 'natural asynchrony' must be handled with different phonemic boundaries in each modality
• It varies with the phonemic content
• => Exploit the 'natural asynchrony'
Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
   - One-level Fusion Architectures
   - Hybrid Fusion Architecture
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives
2. One-level Fusion Architectures
• At the data (features) level: the acoustic and visual data are fused first, then categorized into the recognized speech unit
• At the results (decision) level: the acoustic and visual data are categorized separately, then the results are fused into the recognized speech unit
2. Fusion before Categorization
[Diagram: the acoustic and visual data are fused, with adaptation, then categorized into the recognized speech unit]
• Concatenation or Direct Identification (DI)
• Re-coding in the Dominant (RD) or in a Motor space (RM) [Robert-Ribes, 1995]
• Problem: choice of the nature of the dominant space & temporal 'resynchronization' in the common space
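To make the DI (concatenation) variant concrete, here is a minimal Python sketch; the feature dimensions, frame rates, and the nearest-frame upsampling are illustrative assumptions, not the AMIBE front end:

```python
import numpy as np

def concatenate_features(acoustic, visual):
    """Direct Identification by feature concatenation.

    acoustic: (T_a, D_a) frames, e.g. MFCCs at 100 Hz
    visual:   (T_v, D_v) frames, e.g. lip-shape features at 25 Hz
    The slower visual stream is upsampled by frame repetition so
    both streams share one frame rate before concatenation.
    """
    T_a = acoustic.shape[0]
    # nearest-frame indices into the visual stream
    idx = np.minimum((np.arange(T_a) * visual.shape[0]) // T_a,
                     visual.shape[0] - 1)
    return np.hstack([acoustic, visual[idx]])

# usage: 100 acoustic frames of 13 MFCCs, 25 visual frames of 3 lip features
av = concatenate_features(np.random.randn(100, 13), np.random.randn(25, 3))
print(av.shape)  # (100, 16)
```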
2. Fusion after Categorization
[Diagram: the acoustic and visual data are categorized separately, then the results are fused, with adaptation, into the recognized speech unit]
• Separate Identification (SI)
• Parallel structure
• Serial structure
2. Level of Audio-Visual Fusion in Speech Perception
• Ex.: lip image + larynx-frequency pulse train => voicing features [Grant, 1985] (4.7% / 28.9% / 51.1%)
• => Audio-visual fusion before categorization
• Ex.: lip image (t) + speech signal (t+T) => McGurk illusions [Massaro, 1996]: visual /ga/ + acoustic /ba/ perceived as audio-visual /da/
• => Audio-visual fusion after categorization
• Flexibility and robustness of speech perception
• => Adaptability of the fusion mechanisms
2. Hybrid-Fusion Model for Audio-Visual ASR
[Diagram: in the continuous, time-varying space of data, the acoustic (a) and visual (v) streams feed both a joint DI categorization of the fused stream (av) and separate acoustic (A) and visual (V) categorizations; in the discrete, categorical space of results, the SI fusion, with adaptation, produces the sequence of phonemes]
Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
   - Structure of the DI-based Fusion
   - Structure of the SI-based Fusion
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives
3. Implementation of the DI-based Fusion
[Diagram: the hybrid-fusion model, with the joint audio-visual (av) categorization realized by a phonemic HMM]
3. Characteristics of the DI-based Fusion
• Synchronization of the acoustic and visual speech events on the phonemic HMM states
• A TOO strong visual modality perturbs the acoustic one at the time of TRANSITIONS between HMM states and of speech-unit LABELING
• => Necessity to adapt the DI-based fusion
3. Adaptation of the DI-based Fusion
• To the RELEVANCE of the speech features in each modality
• To the RELIABILITY of the processing in each modality
• => Necessity to estimate the reliability of the global process a posteriori
3. Realization of the Adaptation in the DI-based Fusion
[Diagram: parallel DI decoders, each fusing the acoustic and visual streams with a different weight into a phonemic HMM, feed a final choice between candidate phoneme sequences]
• Exponential weight:
  - global to the recognition hypothesis
  - selected a posteriori
  - according to the SNR and the phonemic content
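As a hedged illustration of such a weight (the symbol itself was lost in the slide export, so it is called lam below), here is a sketch of an SNR-indexed exponential stream weighting; the table values and the nearest-SNR lookup are invented for the example, not taken from the thesis:

```python
import numpy as np

# hypothetical weight table indexed by estimated SNR (dB); the thesis
# selects the weight a posteriori from the SNR and the phonemic
# content -- the values here are illustrative only
LAMBDA_BY_SNR = {-10: 0.2, 0: 0.5, 10: 0.7, 30: 0.9}

def fused_log_likelihood(logp_audio, logp_visual, snr_db):
    """Exponentially weighted stream combination:
    log P_AV = lam * log P_A + (1 - lam) * log P_V."""
    # pick the table entry closest to the estimated SNR
    snr_key = min(LAMBDA_BY_SNR, key=lambda s: abs(s - snr_db))
    lam = LAMBDA_BY_SNR[snr_key]
    return lam * logp_audio + (1.0 - lam) * logp_visual

# usage: score one recognition hypothesis at an estimated SNR of 5 dB
print(fused_log_likelihood(-120.0, -80.0, snr_db=5))
```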
3. Choice of the Hybrid-Fusion Architecture
[Diagram: the hybrid-fusion model, with the DI phonemic HMM and the purely-visual categorization both feeding the SI fusion]
• DI + V => asynchronous fusion of information
3. Implementation of the SI-based Fusion
[Diagram: the DI phonemic HMM produces the N-best phonetic solutions, which a visual phonemic HMM re-evaluates; the SI fusion, with adaptation, outputs the sequence of phonemes]
• Serial structure => visual evaluation of the DI solutions
3. Characteristics of the SI-based Fusion
• Multiplication of the modality output probabilities
• Temporal shifts of up to 100 ms are possible between the modality phonemic boundaries
• => The 'natural asynchrony' is allowed
• A TOO strong visual modality perturbs the acoustic one at the time of speech-unit LABELING
• => Necessity to adapt the SI-based fusion
3. Realization of the Adaptation in the SI-based Fusion
• Exponential weight:
  - calculated a posteriori according to the relative reliability of the acoustic and visual modalities
  - from the dispersion of the 4-best solutions
  - varying with the SNR on the test data
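A sketch of how the dispersion of the 4-best scores can yield such a weight; the dispersion measure and the normalization are illustrative assumptions, not the thesis's exact formula:

```python
import numpy as np

def nbest_dispersion(log_scores):
    """Dispersion of the N-best hypothesis scores: the mean gap
    between the best score and the others. A small dispersion means
    the modality barely separates its candidates (unreliable)."""
    s = np.sort(np.asarray(log_scores, dtype=float))[::-1]
    return float(np.mean(s[0] - s[1:]))

def si_weight(audio_4best, visual_4best):
    """Relative reliability of audio vs. visual, mapped to a weight
    in [0, 1] (illustrative normalization only)."""
    da = nbest_dispersion(audio_4best)
    dv = nbest_dispersion(visual_4best)
    return da / (da + dv) if (da + dv) > 0 else 0.5

# usage: noisy audio gives a nearly flat 4-best list -> low audio weight
print(si_weight([-100.0, -100.5, -101.0, -101.2],
                [-60.0, -75.0, -80.0, -90.0]))
```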
Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
   - Visual Categorization
   - Parallel Structure for the SI-based Fusion
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives
4. Type of Interaction in the SI-based Fusion
• Effective integration vs. coherence verification
• Depends on the ratio:
• IMPROVEMENT: reinforcement of the purely-visual component
  - discriminative learning
  - effective visual categorization
4. Discriminative Learning of Visual Speech by Neural Networks (NN)
• Requires relevant visual differences between the classes to discriminate
• Inconsistent with phonemic classes because of visual doubles, i.e. /p/, /b/, /m/
• Use of adapted classes: VISEMES
• Sources of variability: language, speech rate, inter-speaker differences
4. Definition of Visemes
[Diagram: a map grouping the consonants s, d, k, ch, z, t, p, b, m, f, v, j, r, l, g into consonant visemes and the vowels a, u, e, i, o, y into vowel visemes]
• Extraction of visual phonemes from the training data
  - the middle of each acoustic-phonemic segment anchors a 140 ms visual segment
• Mapping of the extracted visual phonemes
  - Kohonen's algorithm for Self-Organising Maps (SOM)
• Identification of visemes
  - 3 resolution levels
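A minimal Kohonen SOM in plain numpy, sketching the mapping step under assumed inputs (flattened lip-feature segments); the grid size, learning-rate schedule, and data are illustrative, not the thesis configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(6, 6), epochs=30, lr0=0.5, sigma0=2.0):
    """Minimal Kohonen Self-Organising Map for mapping fixed-length
    visual-phoneme segments (e.g. 140 ms of lip features, flattened
    into one vector per segment). Illustrative, not the thesis code."""
    h, w = grid
    weights = rng.normal(size=(h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    n_steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * (1 - t / n_steps)            # decaying learning rate
            sigma = sigma0 * (1 - t / n_steps) + 0.5  # shrinking neighbourhood
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            g = np.exp(-d2 / (2 * sigma ** 2))      # neighbourhood function
            weights += lr * g[:, None] * (x - weights)
            t += 1
    return weights

# usage: 200 hypothetical lip-feature segments (3 features x 4 frames)
som = train_som(rng.normal(size=(200, 12)))
# visemes are then read off by grouping nearby map units, e.g. at
# several resolution levels, and labelling them with phoneme classes
```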
4. Reinforced Purely-Visual Component in the SI Parallel Structure
• Getting rid of the temporal dependence between DI and V
• Effective visual categorization
• Difficulty of taking the temporal dimension of speech into account with NNs
• => Towards hybrid HMM-NN categorization
4. Hybrid HMM-NN Categorization
[Diagram: three configurations recognizing a sequence of visemes from visible speech: NN + HMM, with the NN providing a segmentation and a posteriori probabilities to the HMM; HMM + NN, with the HMM providing a segmentation to the NN; and HMM / NN, combining both via the visemes confusion]
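For the NN + HMM configuration, a standard hybrid recipe (assumed here; the slide does not spell out the exact coupling) converts the NN's a posteriori probabilities into scaled likelihoods by dividing by the class priors:

```python
import numpy as np

def scaled_likelihoods(nn_posteriors, class_priors):
    """Convert NN a posteriori probabilities P(viseme | frame) into
    scaled likelihoods P(frame | viseme) / P(frame) via Bayes' rule,
    usable as HMM emission scores (a common hybrid NN/HMM recipe,
    assumed rather than taken from the slide)."""
    return nn_posteriors / np.asarray(class_priors)

# usage: 3 viseme classes, one frame
print(scaled_likelihoods(np.array([0.7, 0.2, 0.1]), [0.5, 0.3, 0.2]))
```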
4. Reinforced Purely-Visual Component in the SI Parallel Structure
[Diagram: the DI phonemic HMM and a visemic HMM / NN categorizer both feed the SI fusion, with adaptation, which outputs the sequence of phonemes]
• Non-homogeneity of the output scores
• => Inconsistent with the previous multiplicative SI fusion
4. Implementation of the SI-based Fusion in a Parallel Structure
[Diagram: each of the N best phoneme-sequence solutions is mapped from phonemes to visemes, aligned with the purely-visual output (v) by an edit-distance based alignment, scored by a likelihood-ratio calculation, and combined under the adaptation]
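A sketch of the alignment step: the phonemes of a hypothesis are first mapped to visemes, then compared to the purely-visual output with a Levenshtein distance; the mapping table and the unit costs are illustrative assumptions:

```python
def edit_distance(hyp_visemes, ref_visemes):
    """Levenshtein distance between a viseme string derived from an
    N-best phonemic hypothesis and the purely-visual viseme output;
    a sketch of the edition-distance based alignment step."""
    n, m = len(hyp_visemes), len(ref_visemes)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (hyp_visemes[i - 1] != ref_visemes[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]

# usage: map one hypothesis to visemes, then align (mapping invented)
PHONEME_TO_VISEME = {'p': 'B', 'b': 'B', 'm': 'B', 'f': 'F', 'v': 'F'}
hyp = [PHONEME_TO_VISEME.get(p, p) for p in ['p', 'a', 'f']]
print(edit_distance(hyp, ['B', 'a', 'F']))  # 0 -> visually consistent
```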
4. 'Phonetic Plus Post-categorical' Model Proposed by Burnham (1998)
[Diagram: audible and visual speech are fused before a first categorization; the visual stream is also categorized on its own, and a second, adaptive fusion after categorization yields the perceived speech]
• Two-level fusion architecture
• Visual categorization by comparison to visemic prototypes
• Optional use of the purely-visual component after categorization
Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives
5. Experiments
• Audio-visual data from the AMIBE project
  - connected letters
  - 'dining-hall' noise at SNRs of 10 dB, 0 dB and -10 dB
• Speech features
  - visual: internal lip-shape height, width and area + Δ + ΔΔ
  - acoustic: 12 MFCCs + energy + Δ + ΔΔ
• Speech modeling
  - HMM with a duration model [Suaudeau & André-Obrecht, 1994]
  - TDNN, SOM
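The Δ and ΔΔ derivatives can be computed with the usual regression-based delta recipe, sketched below; the window width and dimensions are assumptions, not the AMIBE configuration:

```python
import numpy as np

def deltas(features, width=2):
    """First temporal derivative (Δ) by linear regression over
    +/- width frames, with edge frames repeated; apply twice for ΔΔ.
    A standard recipe, assumed rather than taken from the slides."""
    T, _ = features.shape
    padded = np.vstack([np.repeat(features[:1], width, axis=0),
                        features,
                        np.repeat(features[-1:], width, axis=0)])
    num = sum(k * (padded[width + k: width + k + T] -
                   padded[width - k: width - k + T])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

# usage: 12 MFCCs + energy -> 13 static features, plus Δ and ΔΔ
static = np.random.randn(100, 13)
d1 = deltas(static)
d2 = deltas(d1)
obs = np.hstack([static, d1, d2])  # 39-dimensional observation vectors
```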
5. Results
The hybrid-fusion model DI+V achieves the audio-visual synergy: the fused result exceeds both uni-modal results.
5. Results (recognition accuracy, %)

         -10 dB   0 dB   10 dB   clean
AUDIO      -2.1   67.9    88.0    91.5
VISUAL     30.9   30.9    30.9    30.9
DI         40.8   76.4    90.8    95.4
SI          6.3   81.6    89.4    91.9
DI+V       41.9   77.8    91.2    95.8
5. Comparisons
• Master-Slave Model proposed at IRIT, Univ. Toulouse [André-Obrecht et al., 1997]
• Product of Models proposed at LIUPAV, Univ. Avignon [Jourlin, 1998]
5. Master-Slave Model of IRIT (1997)
[Diagram: a master labial HMM with open-lips, semi-open-lips and closed-lips states drives a slave acoustic HMM]
The acoustic HMM parameters are probabilistic functions of the master labial HMM.
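A toy sketch of the master-slave idea: the slave acoustic HMM's emission parameters are looked up as a function of the master lip state; the states, Gaussians, and values below are all invented for illustration, not the IRIT model:

```python
import numpy as np

# hypothetical coupling: the slave acoustic Gaussian's mean depends
# on the current master labial state (values invented)
SLAVE_MEANS = {
    'open':      np.array([1.0, 0.2]),
    'semi-open': np.array([0.5, 0.5]),
    'closed':    np.array([0.1, 0.9]),
}

def slave_emission_logprob(x, lip_state, var=1.0):
    """Log-likelihood of acoustic frame x under the slave model's
    Gaussian, whose mean is a function of the master lip state."""
    mu = SLAVE_MEANS[lip_state]
    return float(-0.5 * np.sum((x - mu) ** 2) / var
                 - 0.5 * len(x) * np.log(2 * np.pi * var))

print(slave_emission_logprob(np.array([0.9, 0.3]), 'open'))
```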
5. Product of Models of LIUPAV (1998)
[Diagram: an audio-visual HMM whose composite states pair the states of an acoustic HMM (1-3, densities D1(A)-D3(A)) with those of a visual HMM (4-6, densities D4(V)-D6(V)); composite transitions are products such as T11 x T44, T11 x T45, T12 x T56, T23 x T66]
The audio-visual HMM parameters are computed from the separate acoustic and visual HMMs.
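Since the composite transition probabilities are products T_ij x T_kl of the two separate matrices' entries, the audio-visual transition matrix can be built as a Kronecker product; a minimal sketch of the construction (not the original implementation):

```python
import numpy as np

def product_hmm_transitions(trans_audio, trans_visual):
    """Composite audio-visual transition matrix for the product of
    models: a composite state is a pair (acoustic state, visual
    state) and its transition probability is the product of the
    two separate transition probabilities, i.e. the Kronecker
    product of the two matrices."""
    return np.kron(trans_audio, trans_visual)

# usage: 3-state acoustic HMM x 2-state visual HMM -> 6 composite states
A = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
V = np.array([[0.9, 0.1],
              [0.0, 1.0]])
print(product_hmm_transitions(A, V).shape)  # (6, 6)
```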
6. Conclusion: Contribution
• Addressing the interrelated problems in AV ASR: fusion, (a)synchrony, adaptation, visemes
• Proposal of the hybrid-fusion DI+V model
• A posteriori adaptation of the audio-visual fusion to variations of both the context and the content
• Definition of specific visual units, visemes, by self-organization and grouping
6. Conclusion: Further Work
• Use of visemes also during the DI-based fusion
• Learning of the temporal shifts between modalities for the SI-based fusion
• Definition of a dependency function between the pre- and post-categorical weights
• Modality-weight estimation at a finer level
• Training on larger data sets & extensive testing
6. Perspectives
Towards a global platform for Audio-Visual Speech Communication
• Preprocessing: source localization, enhancement of the speech signal, scene analysis
• Recognition
• Synthesis
• Coding