Learning optimal audiovisual phasing for an HMM-based control model for facial animation
O. Govokhina (1,2), G. Bailly (2), G. Breton (1)
(1) France Telecom R&D – Rennes
(2) GIPSA-lab, dpt. Parole & Cognition – Grenoble
SSW6, Bonn, August 2007
Agenda
1. Facial animation
2. Data and articulatory model
3. Trajectory formation models
   • State of the art
   • First improvement: Task-Dynamics for Animation (TDA)
4. Multimodal coordination
   • AV asynchrony
   • PHMM: Phased Hidden Markov Model
• Results and conclusions
1 Facial animation
Facial animation
[Diagram: motion-capture AV data → analysis & learning → control, shape and appearance models → synthesis driven by phonetic input]
• Domain: visual speech synthesis
• Control model: computes multiparametric trajectories from the phonetic input
• Shape model: specifies how facial geometry is modified by the articulatory parameters
• Appearance model: final image rendering
• Data from motion capture
2 Data and articulatory model
Data and articulatory model
[Figure: the three geometric lip parameters: aperture, width, protrusion]
• Audiovisual database (France Telecom)
  • 540 sentences, one female subject
  • 150 colored beads, automatic tracking
  • Cloning methodology developed at ICP (Badin et al., 2002; Revéret et al., 2000)
• Visual parameters
  • 3 geometric parameters: lip aperture/closure, lip width, lip protrusion
  • 6 articulatory parameters: jaw opening, jaw advance, lip rounding, upper-lip movements, lower-lip movements, throat movements
3 Trajectory formation systems
Trajectory formation systems: state of the art
• Control models
  • Visual-only
    • Coarticulation models (Massaro & Cohen; Öhman, …)
    • Triphone, kinematic models (Deng; Okadome, Kaburagi & Honda, …)
  • From acoustics
    • Linear vs. nonlinear mappings (Yehia et al.; Berthommier)
    • Nakamura et al.: voice conversion (GMM, HMM) used for speech-to-articulatory inversion
  • Multimodal
    • Synthesis by concatenation (Minnis et al.; Bailly et al., …)
    • HMM synthesis (Masuko et al.; Tokuda et al., …)
Trajectory formation systems: concatenation
[Diagram: linguistic processing → prosodic model → unit selection/concatenation → parametric synthesis]
• Principles
  • Multi-represented multimodal segments
  • Selection & concatenation costs
  • Optimal selection by DTW (a minimal dynamic-programming sketch follows below)
  • Selection costs
    • Between features or more complex phonological structures
    • Between stored cues and cues computed by external models (e.g. prosody)
  • Post-processing: smoothing
• Advantages/disadvantages
  + Quality of the synthetic speech (units come from natural speech). In a MOS test comparing rule-based synthesis, concatenation and linear acoustic-to-visual mapping, concatenation was judged almost equivalent to the original movements (Gibert et al., IEEE SS 2002)
  - Requires a very large audiovisual database
  - Bad joins and/or inappropriate units are very visible
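The cost-based selection above lends itself to a short illustration. Below is a minimal dynamic-programming sketch of unit selection, not the authors' system: the unit representation and both cost functions are toy stand-ins, and real systems add pruning and richer phonological matching.

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """Pick one unit per slot, minimizing summed selection (target) and
    concatenation (join) costs by dynamic programming."""
    best = [np.array([target_cost(0, u) for u in candidates[0]])]
    back = []
    for i in range(1, len(candidates)):
        costs = np.empty((len(candidates[i - 1]), len(candidates[i])))
        for j, prev in enumerate(candidates[i - 1]):
            for k, cur in enumerate(candidates[i]):
                costs[j, k] = (best[i - 1][j] + join_cost(prev, cur)
                               + target_cost(i, cur))
        back.append(costs.argmin(axis=0))   # best predecessor for each unit
        best.append(costs.min(axis=0))
    path = [int(best[-1].argmin())]         # backtrack the optimal sequence
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]                       # one candidate index per slot

# Toy usage: scalar "units", per-slot targets captured by the closures.
targets = [0.0, 1.0, 0.5]
cands = [[0.1, 0.9], [1.1, 0.2], [0.4, 0.6]]
idx = select_units(cands,
                   target_cost=lambda i, u: abs(u - targets[i]),
                   join_cost=lambda a, b: abs(a - b))
```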
Trajectory formation systems: HMM-based synthesis
[Diagram: learning from audio segmentation and visual parameters → phone HMMs and state-duration models; generation: phonetic input → HMM sequence → state-duration generation → visual-parameter generation → synthetic trajectories]
• Principles
  • Learning
    • Context-dependent phone-sized HMMs
    • Static & dynamic parameters
    • Gaussian/multi-Gaussian pdfs
  • Generation
    • Selection of HMMs
    • Distribution of phone durations among states (z-scoring)
    • Solving linear equations (see the sketch below)
    • Smoothing due to the dynamic pdfs
• Advantages/disadvantages
  + Statistical parametric synthesis
    • Requires a relatively small database
    • Easily modified for different applications (languages, speaking rates, emotions, …)
  • MOS test comparing concatenation, HMM synthesis and linear acoustic-to-visual mapping: on average, HMM synthesis was rated better than concatenation… but under-articulated (Govokhina et al., Interspeech 2006)
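The "solving linear equations" step is, in the cited HMM-synthesis line of work (Tokuda et al.), maximum-likelihood parameter generation: given the per-frame Gaussian means and variances of the static and delta features selected by the HMM state sequence, the static trajectory c solves W' V^-1 W c = W' V^-1 m. A minimal sketch for one feature stream with diagonal covariances (a dense solve for clarity; real systems exploit the banded structure):

```python
import numpy as np

def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """ML parameter generation for one feature stream.

    mu, var: (T, 2) per-frame means/variances of the static (column 0) and
    delta (column 1) features, gathered from the selected HMM states.
    Returns the (T,) static trajectory c solving  W' V^-1 W c = W' V^-1 m.
    """
    T = mu.shape[0]
    W = np.zeros((2 * T, T))            # stacks [static; delta] = W @ c
    W[:T, :] = np.eye(T)
    for t in range(T):                  # banded delta rows: 0.5*(c[t+1]-c[t-1])
        for off, w in zip((-1, 0, 1), delta_win):
            if 0 <= t + off < T:
                W[T + t, t + off] = w
    prec = 1.0 / np.concatenate([var[:, 0], var[:, 1]])   # diagonal V^-1
    m = np.concatenate([mu[:, 0], mu[:, 1]])
    A = W.T @ (prec[:, None] * W)
    return np.linalg.solve(A, W.T @ (prec * m))
```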
First improvement: TDA (HMM + concatenation)
[Diagram: Planning: phonetic input → unit selection/concatenation over a dictionary of visual segments (geometric and articulatory) → geometric score. Execution: HMM synthesis → articulatory score]
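A rough sketch of how the two stages might compose, reusing select_units from the concatenation sketch above. hmm_generate is a hypothetical stand-in for the execution stage, and the exact interface between the planned geometric score and the HMM stage is an assumption here:

```python
def tda_synthesize(slots, dictionary, target_cost, join_cost, hmm_generate):
    """Two-stage TDA sketch.  Planning: select and concatenate stored visual
    segments into a geometric score.  Execution: hand that score to HMM
    synthesis as targets (hmm_generate is hypothetical)."""
    idx = select_units([dictionary[s] for s in slots], target_cost, join_cost)
    geometric_score = [dictionary[s][i] for s, i in zip(slots, idx)]
    return hmm_generate(slots, targets=geometric_score)
```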
4 PHMM: Phased Hidden Markov Model
AV asynchrony
• Possible/known asynchrony
  • Non-audible gestures: during silences (e.g. prephonatory gestures), plosives, etc.
  • Visual salience with little acoustic impact
    • Anticipatory gestures: rounding within consonants (/stri/ vs. /stry/)
  • Predominance of phonatory modes over articulation for determining phone boundaries
  • Cause (articulation) precedes effect (sound)
• Modeling synchrony
  • Few attempts in AV recognition
    • Coupled HMMs: Alissali, 1996; Luettin et al., 2001; Gravier et al., 2002
    • No significant improvements (Hazen, 2005), but AV fusion is more problematic than timing
  • Very few in AV synthesis (Okadome et al.)
PHMM: Phased Hidden Markov Model
• Visual speech synthesis: state-of-the-art systems synchronize gestures with sound boundaries
• Simultaneous automatic learning
  • Classical HMM learning applied to the articulatory parameters
  • The proposed audiovisual-delay learning algorithm, an iterative analysis-by-synthesis procedure based on the Viterbi algorithm
• Simple phasing model: an average delay associated with each context-dependent HMM (see the sketch below)
• Tested on the FT AV database
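Applied at synthesis time, the simple phasing model just shifts each acoustic phone onset by the average delay learned for the corresponding model, under the minimal-duration constraint. A minimal sketch (whether the delay applies to onsets only, and the exact data layout, are assumptions here):

```python
def apply_phasing(onsets, labels, avg_delay, min_dur=0.030):
    """Derive visual phone onsets from acoustic ones (times in seconds).

    onsets: per-phone acoustic onset times; labels: matching phone/context
    labels; avg_delay: dict label -> learned average AV delay (s).
    """
    shifted = [t + avg_delay.get(lab, 0.0) for t, lab in zip(onsets, labels)]
    for i in range(1, len(shifted)):        # keep onsets ordered, 30 ms floor
        shifted[i] = max(shifted[i], shifted[i - 1] + min_dur)
    return shifted
```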
Results
[Bar chart: average audiovisual delay in ms (-100 to 250) per phone class: F.ph., L.ph., Unr.V., R.V., Sv., Blb., Alv., Lbd., Cons.]
• Rapid convergence: within a few iterations
• But constraints: a simple phasing model, and minimum durations for gestures
• Large improvement
  • 10% for context-independent HMMs
  • Combines with context: a further & larger improvement for context-dependent HMMs
• Significant delays
  • Largest for the first & last segments (prephonatory gestures, ~150 ms)
  • Positive for vowels, glides and bilabials
  • Negative for back and nasal consonants
  • In accordance with Öhman's numerical theory of coarticulation: slow vocalic gestures expand whereas rapid consonantal gestures shrink
Illustration
[Videos: original movements vs. HMM synthesis vs. PHMM synthesis]
• Features
  • Prephonation
  • Postphonation (see the final /o/)
  • Rounding (see /ɥi/): the longer gestural duration enables complete protrusion
Conclusions
• Speech-specific trajectory formation models
  • Trainable and parameterized by data
  • TDA: robustness & detailed articulation
  • PHMM: learning phasing relations between modalities
• Perspectives
  • Combining TDA and PHMM, notably segmenting multimodal units using PHMM
  • Subjective evaluation: intelligibility, adequacy & cognitive load
  • PHMM
    • More sophisticated phasing models: regression trees, etc.
    • Using state boundaries as possible anchor points
    • Applying to other gestures: Cued Speech, deictic/iconic gestures that should be coordinated with speech
Thank you for your attention
• For further details, mail me at: oxana.govokhina@orange-ftgroup.com
PHMM learning algorithm
1. Temporal information on phonetic boundaries from the audio segmentation (SA)
2. Classical context-dependent HMM learning on the articulatory parameters
3. Phoneme realignment on the articulatory parameters by Viterbi
4. Average audiovisual delay per model, with a constraint of minimal phoneme duration (30 ms)
5. Visual segmentation SV(i) calculated from the average audiovisual-delay model and the audio segmentation SA
Stop when Corr(SV(i), SV(i-1)) → 1
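Steps 1-5 can be written down schematically as below, reusing apply_phasing from the PHMM slide's sketch. train_hmms and viterbi_align are hypothetical stand-ins for an HMM toolkit's training and forced-alignment calls (the slides do not name one), and the convergence test follows the slide: stop once successive visual segmentations are almost perfectly correlated.

```python
import numpy as np

def learn_av_phasing(audio_segs, feats_per_utt, train_hmms, viterbi_align,
                     min_dur=0.030, tol=0.999, max_iter=20):
    """Iterative analysis-by-synthesis estimation of average AV delays.

    audio_segs: per utterance, a list of (label, acoustic_onset) pairs (SA).
    feats_per_utt: per utterance, the articulatory feature matrix.
    """
    labels = [[lab for lab, _ in utt] for utt in audio_segs]
    a_onsets = [[t for _, t in utt] for utt in audio_segs]
    sv = [list(o) for o in a_onsets]                       # step 1: SV := SA
    for _ in range(max_iter):
        models = train_hmms(feats_per_utt, sv)             # step 2: retrain
        delays = {}
        for labs, a_on, feats in zip(labels, a_onsets, feats_per_utt):
            v_on = viterbi_align(models, feats, labs)      # step 3: realign
            for lab, a_t, v_t in zip(labs, a_on, v_on):    # step 4: delays
                delays.setdefault(lab, []).append(v_t - a_t)
        avg = {lab: float(np.mean(d)) for lab, d in delays.items()}
        new_sv = [apply_phasing(a_on, labs, avg, min_dur)  # step 5: new SV
                  for a_on, labs in zip(a_onsets, labels)]
        flat_old = np.concatenate([np.asarray(u) for u in sv])
        flat_new = np.concatenate([np.asarray(u) for u in new_sv])
        sv = new_sv
        if np.corrcoef(flat_old, flat_new)[0, 1] >= tol:   # stop criterion
            break
    return avg, sv
```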