
Learning optimal audiovisual phasing for an HMM-based control model for facial animation

O. Govokhina (1,2), G. Bailly (2), G. Breton (1). (1) France Telecom R&D, Rennes; (2) GIPSA-lab, dpt. Parole&Cognition, Grenoble. SSW6, Bonn, August 2007.


Presentation Transcript


  1. Learning optimal audiovisual phasing for an HMM-based control model for facial animation O. Govokhina (1,2), G. Bailly (2), G. Breton (1) (1) France Telecom R&D – Rennes (2) GIPSA-lab, dpt. Parole&Cognition – Grenoble SSW6, Bonn, August 2007

  2. Agenda • 1. Facial animation • 2. Data and articulatory model • 3. Trajectory formation models: state of the art; first improvement: Task-Dynamics for Animation (TDA) • 4. Multimodal coordination: AV asynchrony; PHMM: Phased Hidden Markov Model; results and conclusions

  3. 1 Facial Animation

  4. Facial animation • Domain: Visual speech synthesis • Control model • Computes multiparametric trajectories from phonetic input • Shape model • Specifies how facial geometry is modified by articulatory parameters • Appearance model • Final image rendering • Data from Motion Capture [Diagram labels: Analysis: AV data, Motion Capture, Learning; Synthesis: Phonetic input, Control model, Shape model, Appearance model, Facial animation]

  5. 2 Data and articulatory model

  6. Data and articulatory model • Audiovisual database FT • 540 sentences, one female subject • 150 colored beads, automatic tracking • Cloning methodology developed at ICP (Badin et al., 2002; Revéret et al., 2000) • Visual parameters: • 3 geometric parameters: lip aperture/closure, lip width, lip protrusion • 6 articulatory parameters: jaw opening, jaw advance, lip rounding, upper lip movements, lower lip movements, throat movements [Figure labels: aperture, width, protrusion]

  7. 3 Trajectory formation systems

  8. Trajectory formation systems. State of the art • Control models • Visual-only • Coarticulation models • Massaro-Cohen; Öhman, … • Triphones, kinematic models • Deng; Okadome, Kaburagi & Honda, … • From acoustics • Linear vs. nonlinear mappings • Yehia et al; Berthommier • Nakamura et al: voice conversion (GMM, HMM) used for speech-to-articulatory inversion • Multimodal • Synthesis by concatenation • Minnis et al; Bailly et al, … • HMM synthesis • Masuko et al; Tokuda et al, …

  9. Trajectory formation systems. Concatenation • Principles • Multi-represented multimodal segments • Selection & concatenation costs • Optimal selection by DTW • Selection costs • Between features or more complex phonological structures • Between stored cues and cues computed by external models, e.g. prosody • Post-processing • Smoothing • Advantages/disadvantages + Quality of the synthetic speech (units from natural speech). MOS test (rule-based, concatenation, linear acoustic-to-visual mapping): concatenation is rated almost equivalent to the original movements (Gibert et al, IEEE SS 2002) - Requires a very large audiovisual database - Bad joins and/or inappropriate units are very visible [Diagram labels: Linguistic processing, Prosodic model, Unit selection/concatenation, Parametric synthesis]
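The selection principle on this slide can be illustrated with a small sketch. It is not the authors' system: the slide names DTW, whereas this sketch uses a generic Viterbi-style dynamic-programming search over candidate units, and the names `select_units`, `target_cost` and `concat_cost`, the scalar "units" and the cost definitions are all hypothetical.

```python
# Hypothetical sketch: pick one candidate unit per target position so that the
# sum of selection (target) costs and concatenation (join) costs is minimal.

def select_units(candidates, target_cost, concat_cost):
    """candidates[t] is the list of candidate units for target position t."""
    # Best accumulated cost for each candidate at position 0.
    best = [target_cost(0, u) for u in candidates[0]]
    back = []  # back[t-1][k] = index of the best predecessor of candidate k at t
    for t in range(1, len(candidates)):
        new_best, ptr = [], []
        for u in candidates[t]:
            costs = [best[j] + concat_cost(p, u)
                     for j, p in enumerate(candidates[t - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j] + target_cost(t, u))
            ptr.append(j)
        best = new_best
        back.append(ptr)
    # Trace back the cheapest path.
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)], total


# Toy data (assumed): units are scalar parameter values, targets are desired values.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 5.0], [2.2, 1.0], [3.1, 0.0]]
units, total = select_units(
    candidates,
    target_cost=lambda t, u: abs(u - targets[t]),   # mismatch with the target
    concat_cost=lambda p, u: 0.5 * abs(u - p),      # discontinuity at the join
)
```

In a real system the costs would compare phonological features and stored versus predicted cues, as the slide lists; the toy costs here only show how the two cost types combine in the search.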

  10. Trajectory formation systems. HMM-based synthesis • Principles • Learning • Contextual phone-sized HMM • Static & dynamic parameters • Gaussian/multi-Gaussian pdfs • Generation • Selection of HMM • Distribution of phone durations among states (z-scoring) • Solving linear equations • Smoothing due to dynamic pdfs • Advantages/disadvantages + Statistical parametric synthesis • Requires a relatively small database • Can easily be modified for different applications (languages, speaking rate, emotions, …) MOS test (concatenation, HMM, linear acoustic-to-visual mapping): on average, HMM synthesis is rated better than concatenation… but under-articulated (Govokhina et al, Interspeech 2006) [Diagram labels: Learning: Audio, Segmentation, Visual parameters, HMM learning, HMM and state duration models; Generation: Phonetic input, HMM sequence, State duration generation, Visual parameters generation, Synthetic trajectories]
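The "distribution of phone durations among states (z-scoring)" step can be sketched as follows. This is one plausible reading, not a confirmed description of the authors' implementation: it assumes each state has a Gaussian duration model (mean and standard deviation) and that all states of a phone share the same z-score, chosen so the state durations sum to the phone duration. The function name and the numbers are hypothetical.

```python
# Hypothetical sketch: spread a phone duration over HMM states using one shared z-score.

def distribute_duration(total, means, stds):
    """Return per-state durations d_k = mean_k + z * std_k that sum to `total`."""
    spread = sum(stds)
    if spread == 0:
        raise ValueError("at least one state needs a non-zero standard deviation")
    # The shared z-score that makes the state durations add up to the phone duration.
    z = (total - sum(means)) / spread
    return [m + z * s for m, s in zip(means, stds)]


# Assumed example: a 100 ms phone and three states with (mean, std) in ms.
durations = distribute_duration(100.0, [20.0, 30.0, 30.0], [5.0, 10.0, 5.0])
```

States with a larger standard deviation absorb more of the stretch or compression. A practical system would also round to frame counts and guard against negative durations, which this sketch omits.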

  11. First improvement: TDA (HMM + concatenation) [Diagram labels: Planning; Execution; Phonetic input; HMM synthesis; Geometric score; Unit selection/concatenation; Articulatory score; Dictionary: visual segments (geometric and articulatory)]

  12. 4 PHMM: Phased Hidden Markov Model

  13. AV asynchrony • Possible/known asynchrony • Non-audible gestures: during silences (e.g. pre-phonatory gestures), plosives, etc. • Visual salience with little acoustic impact • Anticipatory gestures: rounding within consonants (/stri/ vs. /stry/) • Predominance of phonatory modes over articulation for determining phone boundaries • Cause (articulation) precedes effect (sound) • Modeling synchrony • Few attempts in AV recognition • Coupled HMMs: Alissali, 1996; Luettin et al, 2001; Gravier et al, 2002 • Non-significant improvements (Hazen, 2005) • But AV fusion more problematic than timing • Very few in AV synthesis • Okadome et al

  14. PHMM: Phased Hidden Markov Model • Visual speech synthesis • State-of-the-art systems synchronize gestures with sound boundaries • Simultaneous automatic learning • Classical HMM learning is applied to articulatory parameters • The proposed audiovisual delay learning algorithm is then applied. This iterative analysis-by-synthesis algorithm is based on the Viterbi algorithm. • Simple phasing model: an average delay associated with each context-dependent HMM • Tested using the FT AV database

  15. Results • Rapid convergence • Within a few iterations • But constraints • Simple phasing model • Min. durations for gestures • Large improvement • 10% for context-independent HMMs • Combines with context • Further & larger improvement for context-dependent HMMs • Significant delays • Largest for first & last segments (prephonatory gestures ~150 ms) • Positive for vowels, glides and bilabials • Negative for back and nasal consonants • In accordance with Öhman's numerical theory of coarticulation: slow vocalic gestures expand whereas rapid consonantal gestures shrink [Figure: axis ticks from -100 to 250; categories: F.ph., L.ph., Unr.V., R.V., Sv., Blb., Alv., Lbd., Cons.]

  16. Illustration • Features • Prephonation • Postphonation (see final /o/) • Rounding (see /ɥi/): longer gestural duration enables complete protrusion [Figure panels: Original, HMM synthesis, PHMM synthesis]

  17. Conclusions • Speech-specific trajectory formation models • Trainable and parameterized by data • TDA: robustness & detailed articulation • PHMM: learning phasing relations between modalities • Perspectives • Combining TDA and PHMM • Notably segmenting multimodal units using PHMM • Subjective evaluation • Intelligibility, adequacy & cognitive load • PHMM • More sophisticated phasing models: regression trees, etc • Using state boundaries as possible anchor points • Applying to other gestures: CS, deictic/iconic gestures that should be coordinated with speech

  18. Examples

  19. Thank you for your attention • For further details • Mail me at : oxana.govokhina@orange-ftgroup.com

  20. PHMM • Temporal information on phonetic boundaries (audio segmentation: SA) • Classical context-dependent HMM learning on articulatory parameters • Phoneme realignment on articulatory parameters by Viterbi • Average audiovisual delay, with a constraint of minimal phoneme duration (30 ms) • Visual segmentation SV(i) calculated from the average audiovisual delay model and the audio segmentation (SA) • Stop when Corr(SV(i), SV(i-1)) is close to 1
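The delay-averaging and re-segmentation steps of this loop can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: HMM training and Viterbi realignment are left out, the delay is assumed to attach to each phone's onset boundary, and its sign convention (visual onset minus audio onset, in seconds) and the function names are hypothetical. Only the 30 ms minimal duration comes from the slide.

```python
from collections import defaultdict

MIN_DUR = 0.030  # seconds; minimal phoneme duration constraint from the slide


def average_delays(labels, audio_onsets, visual_onsets):
    """Mean audiovisual delay (visual onset minus audio onset) per model label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lab, a, v in zip(labels, audio_onsets, visual_onsets):
        sums[lab] += v - a
        counts[lab] += 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def apply_delays(labels, audio_onsets, delays, min_dur=MIN_DUR):
    """Visual segmentation from the audio segmentation and the average delay model."""
    visual = []
    for lab, a in zip(labels, audio_onsets):
        v = a + delays.get(lab, 0.0)
        if visual:
            # Keep every gesture at least `min_dur` long.
            v = max(v, visual[-1] + min_dur)
        visual.append(v)
    return visual


# Assumed toy utterance: onsets in seconds; visual onsets as if produced by realignment.
labels = ["a", "p", "a"]
audio = [0.00, 0.10, 0.20]
realigned = [-0.02, 0.12, 0.18]
delays = average_delays(labels, audio, realigned)
sv = apply_delays(labels, audio, delays)
clamped = apply_delays(["x", "y"], [0.00, 0.05], {"x": 0.04, "y": -0.02})
```

In the full loop, the new segmentation would be used to retrain the HMMs and realign again, stopping once successive segmentations are almost perfectly correlated. The sketch constrains only the spacing between consecutive onsets and does not handle the end of the last phone.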

  21. Examples
