Learning optimal audiovisual phasing for an HMM-based control model for facial animation
O. Govokhina (1,2), G. Bailly (2), G. Breton (1)
(1) France Telecom R&D – Rennes
(2) GIPSA-lab, dpt. Parole & Cognition – Grenoble
SSW6, Bonn, August 2007
Agenda
1. Facial animation
2. Data and articulatory model
3. Trajectory formation models
   • State of the art
   • First improvement: Task-Dynamics for Animation (TDA)
4. Multimodal coordination
   • AV asynchrony
   • PHMM: Phased Hidden Markov Model
• Results and conclusions
1 Facial animation
Facial animation
[Diagram: motion-capture AV data → analysis & learning → control, shape and appearance models → synthesis driven by phonetic input]
• Domain: visual speech synthesis
• Control model: computes multiparametric trajectories from the phonetic input
• Shape model: specifies how facial geometry is modified by the articulatory parameters
• Appearance model: final image rendering
• Data from motion capture
2 Data and articulatory model
Data and articulatory model
[Figure: the three geometric lip parameters: aperture, width, protrusion]
• Audiovisual database (France Telecom)
  • 540 sentences, one female subject
  • 150 colored beads, automatic tracking
  • Cloning methodology developed at ICP (Badin et al., 2002; Revéret et al., 2000)
• Visual parameters
  • 3 geometric parameters: lip aperture/closure, lip width, lip protrusion
  • 6 articulatory parameters: jaw opening, jaw advance, lip rounding, upper-lip movements, lower-lip movements, throat movements
3 Trajectory formation systems
Trajectory formation systems: state of the art
• Control models
  • Visual-only
    • Coarticulation models (Massaro & Cohen; Öhman, …)
    • Triphone, kinematic models (Deng; Okadome, Kaburagi & Honda, …)
  • From acoustics
    • Linear vs. nonlinear mappings (Yehia et al.; Berthommier)
    • Nakamura et al.: voice conversion (GMM, HMM) used for speech-to-articulatory inversion
  • Multimodal
    • Synthesis by concatenation (Minnis et al.; Bailly et al., …)
    • HMM synthesis (Masuko et al.; Tokuda et al., …)
Trajectory formation systems: concatenation
[Diagram: linguistic processing → prosodic model → unit selection/concatenation → parametric synthesis]
• Principles
  • Multi-represented multimodal segments
  • Selection & concatenation costs
  • Optimal selection by DTW (a minimal dynamic-programming sketch follows below)
  • Selection costs
    • Between features or more complex phonological structures
    • Between stored cues and cues computed by external models (e.g. prosody)
  • Post-processing: smoothing
• Advantages/disadvantages
  + Quality of the synthetic speech (units come from natural speech). In a MOS test comparing rule-based synthesis, concatenation and linear acoustic-to-visual mapping, concatenation was judged almost equivalent to the original movements (Gibert et al., IEEE SS 2002)
  - Requires a very large audiovisual database
  - Bad joins and/or inappropriate units are very visible
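The cost-based selection above lends itself to a short illustration. Below is a minimal dynamic-programming sketch of unit selection, not the authors' system: the unit representation and both cost functions are toy stand-ins, and real systems add pruning and richer phonological matching.

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """Pick one unit per slot, minimizing summed selection (target) and
    concatenation (join) costs by dynamic programming."""
    best = [np.array([target_cost(0, u) for u in candidates[0]])]
    back = []
    for i in range(1, len(candidates)):
        costs = np.empty((len(candidates[i - 1]), len(candidates[i])))
        for j, prev in enumerate(candidates[i - 1]):
            for k, cur in enumerate(candidates[i]):
                costs[j, k] = (best[i - 1][j] + join_cost(prev, cur)
                               + target_cost(i, cur))
        back.append(costs.argmin(axis=0))   # best predecessor for each unit
        best.append(costs.min(axis=0))
    path = [int(best[-1].argmin())]         # backtrack the optimal sequence
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]                       # one candidate index per slot

# Toy usage: scalar "units", per-slot targets captured by the closures.
targets = [0.0, 1.0, 0.5]
cands = [[0.1, 0.9], [1.1, 0.2], [0.4, 0.6]]
idx = select_units(cands,
                   target_cost=lambda i, u: abs(u - targets[i]),
                   join_cost=lambda a, b: abs(a - b))
```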
Trajectory formation systems: HMM-based synthesis
[Diagram: learning from audio segmentation and visual parameters → phone HMMs and state-duration models; generation: phonetic input → HMM sequence → state-duration generation → visual-parameter generation → synthetic trajectories]
• Principles
  • Learning
    • Context-dependent phone-sized HMMs
    • Static & dynamic parameters
    • Gaussian/multi-Gaussian pdfs
  • Generation
    • Selection of HMMs
    • Distribution of phone durations among states (z-scoring)
    • Solving linear equations (see the sketch below)
    • Smoothing due to the dynamic pdfs
• Advantages/disadvantages
  + Statistical parametric synthesis
    • Requires a relatively small database
    • Easily modified for different applications (languages, speaking rates, emotions, …)
  • MOS test comparing concatenation, HMM synthesis and linear acoustic-to-visual mapping: on average, HMM synthesis was rated better than concatenation… but under-articulated (Govokhina et al., Interspeech 2006)
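The "solving linear equations" step is, in the cited HMM-synthesis line of work (Tokuda et al.), maximum-likelihood parameter generation: given the per-frame Gaussian means and variances of the static and delta features selected by the HMM state sequence, the static trajectory c solves W' V^-1 W c = W' V^-1 m. A minimal sketch for one feature stream with diagonal covariances (a dense solve for clarity; real systems exploit the banded structure):

```python
import numpy as np

def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """ML parameter generation for one feature stream.

    mu, var: (T, 2) per-frame means/variances of the static (column 0) and
    delta (column 1) features, gathered from the selected HMM states.
    Returns the (T,) static trajectory c solving  W' V^-1 W c = W' V^-1 m.
    """
    T = mu.shape[0]
    W = np.zeros((2 * T, T))            # stacks [static; delta] = W @ c
    W[:T, :] = np.eye(T)
    for t in range(T):                  # banded delta rows: 0.5*(c[t+1]-c[t-1])
        for off, w in zip((-1, 0, 1), delta_win):
            if 0 <= t + off < T:
                W[T + t, t + off] = w
    prec = 1.0 / np.concatenate([var[:, 0], var[:, 1]])   # diagonal V^-1
    m = np.concatenate([mu[:, 0], mu[:, 1]])
    A = W.T @ (prec[:, None] * W)
    return np.linalg.solve(A, W.T @ (prec * m))
```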
First improvement: TDA (HMM + concatenation)
[Diagram: Planning: phonetic input → unit selection/concatenation over a dictionary of visual segments (geometric and articulatory) → geometric score. Execution: HMM synthesis → articulatory score]
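A rough sketch of how the two stages might compose, reusing select_units from the concatenation sketch above. hmm_generate is a hypothetical stand-in for the execution stage, and the exact interface between the planned geometric score and the HMM stage is an assumption here:

```python
def tda_synthesize(slots, dictionary, target_cost, join_cost, hmm_generate):
    """Two-stage TDA sketch.  Planning: select and concatenate stored visual
    segments into a geometric score.  Execution: hand that score to HMM
    synthesis as targets (hmm_generate is hypothetical)."""
    idx = select_units([dictionary[s] for s in slots], target_cost, join_cost)
    geometric_score = [dictionary[s][i] for s, i in zip(slots, idx)]
    return hmm_generate(slots, targets=geometric_score)
```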
4 PHMM: Phased Hidden Markov Model
AV asynchrony
• Possible/known asynchrony
  • Non-audible gestures: during silences (e.g. prephonatory gestures), plosives, etc.
  • Visual salience with little acoustic impact
    • Anticipatory gestures: rounding within consonants (/stri/ vs. /stry/)
  • Predominance of phonatory modes over articulation for determining phone boundaries
  • Cause (articulation) precedes effect (sound)
• Modeling synchrony
  • Few attempts in AV recognition
    • Coupled HMMs: Alissali, 1996; Luettin et al., 2001; Gravier et al., 2002
    • No significant improvements (Hazen, 2005), but AV fusion is more problematic than timing
  • Very few in AV synthesis (Okadome et al.)
PHMM: Phased Hidden Markov Model
• Visual speech synthesis: state-of-the-art systems synchronize gestures with sound boundaries
• Simultaneous automatic learning
  • Classical HMM learning applied to the articulatory parameters
  • The proposed audiovisual-delay learning algorithm, an iterative analysis-by-synthesis procedure based on the Viterbi algorithm
• Simple phasing model: an average delay associated with each context-dependent HMM (see the sketch below)
• Tested on the FT AV database
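Applied at synthesis time, the simple phasing model just shifts each acoustic phone onset by the average delay learned for the corresponding model, under the minimal-duration constraint. A minimal sketch (whether the delay applies to onsets only, and the exact data layout, are assumptions here):

```python
def apply_phasing(onsets, labels, avg_delay, min_dur=0.030):
    """Derive visual phone onsets from acoustic ones (times in seconds).

    onsets: per-phone acoustic onset times; labels: matching phone/context
    labels; avg_delay: dict label -> learned average AV delay (s).
    """
    shifted = [t + avg_delay.get(lab, 0.0) for t, lab in zip(onsets, labels)]
    for i in range(1, len(shifted)):        # keep onsets ordered, 30 ms floor
        shifted[i] = max(shifted[i], shifted[i - 1] + min_dur)
    return shifted
```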
Results
[Bar chart: average audiovisual delay in ms (-100 to 250) per phone class: F.ph., L.ph., Unr.V., R.V., Sv., Blb., Alv., Lbd., Cons.]
• Rapid convergence: within a few iterations
• But constraints: a simple phasing model, and minimum durations for gestures
• Large improvement
  • 10% for context-independent HMMs
  • Combines with context: a further & larger improvement for context-dependent HMMs
• Significant delays
  • Largest for the first & last segments (prephonatory gestures, ~150 ms)
  • Positive for vowels, glides and bilabials
  • Negative for back and nasal consonants
  • In accordance with Öhman's numerical theory of coarticulation: slow vocalic gestures expand whereas rapid consonantal gestures shrink
Illustration
[Videos: original movements vs. HMM synthesis vs. PHMM synthesis]
• Features
  • Prephonation
  • Postphonation (see the final /o/)
  • Rounding (see /ɥi/): the longer gestural duration enables complete protrusion
Conclusions
• Speech-specific trajectory formation models
  • Trainable and parameterized by data
  • TDA: robustness & detailed articulation
  • PHMM: learning phasing relations between modalities
• Perspectives
  • Combining TDA and PHMM, notably segmenting multimodal units using PHMM
  • Subjective evaluation: intelligibility, adequacy & cognitive load
  • PHMM
    • More sophisticated phasing models: regression trees, etc.
    • Using state boundaries as possible anchor points
    • Applying to other gestures: Cued Speech, deictic/iconic gestures that should be coordinated with speech
Thank you for your attention
• For further details, mail me at: oxana.govokhina@orange-ftgroup.com
PHMM learning algorithm
1. Temporal information on phonetic boundaries from the audio segmentation (SA)
2. Classical context-dependent HMM learning on the articulatory parameters
3. Phoneme realignment on the articulatory parameters by Viterbi
4. Average audiovisual delay per model, with a constraint of minimal phoneme duration (30 ms)
5. Visual segmentation SV(i) calculated from the average audiovisual-delay model and the audio segmentation SA
Stop when Corr(SV(i), SV(i-1)) → 1
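Steps 1-5 can be written down schematically as below, reusing apply_phasing from the PHMM slide's sketch. train_hmms and viterbi_align are hypothetical stand-ins for an HMM toolkit's training and forced-alignment calls (the slides do not name one), and the convergence test follows the slide: stop once successive visual segmentations are almost perfectly correlated.

```python
import numpy as np

def learn_av_phasing(audio_segs, feats_per_utt, train_hmms, viterbi_align,
                     min_dur=0.030, tol=0.999, max_iter=20):
    """Iterative analysis-by-synthesis estimation of average AV delays.

    audio_segs: per utterance, a list of (label, acoustic_onset) pairs (SA).
    feats_per_utt: per utterance, the articulatory feature matrix.
    """
    labels = [[lab for lab, _ in utt] for utt in audio_segs]
    a_onsets = [[t for _, t in utt] for utt in audio_segs]
    sv = [list(o) for o in a_onsets]                       # step 1: SV := SA
    for _ in range(max_iter):
        models = train_hmms(feats_per_utt, sv)             # step 2: retrain
        delays = {}
        for labs, a_on, feats in zip(labels, a_onsets, feats_per_utt):
            v_on = viterbi_align(models, feats, labs)      # step 3: realign
            for lab, a_t, v_t in zip(labs, a_on, v_on):    # step 4: delays
                delays.setdefault(lab, []).append(v_t - a_t)
        avg = {lab: float(np.mean(d)) for lab, d in delays.items()}
        new_sv = [apply_phasing(a_on, labs, avg, min_dur)  # step 5: new SV
                  for a_on, labs in zip(a_onsets, labels)]
        flat_old = np.concatenate([np.asarray(u) for u in sv])
        flat_new = np.concatenate([np.asarray(u) for u in new_sv])
        sv = new_sv
        if np.corrcoef(flat_old, flat_new)[0, 1] >= tol:   # stop criterion
            break
    return avg, sv
```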