Video Rewrite: Driving Visual Speech with Audio

1 Video Rewrite: Driving Visual Speech with Audio • Christoph Bregler • Michele Covell • Malcolm Slaney • Interval Research Corporation

2 Goal: Photo-realistic Talking Face • Handcoded 3D Model OR Video Rewrite

2 Facial Animation History: • Parke (1972) • Cohen & Massaro, Benoit et al. (1993) • Waters & Terzopoulos (1990), DEC-Face • Lewis (1991) • Litwinowicz & Williams (1994) • Chen, Graf, Petajan, et al. (1995) • Scott et al. (1994) • Ezzat & Poggio (1997) • Pighin et al. + Guenter et al. (1998) • Brand (1999) • Cosatto, Graf (2000)

3 Video Rewrite: Overview • Analysis • Synthesis /D/ /IY/ /P/ /AH/

5 Annotation • Phonetic • Head Pose • Mouth Shape /D/ /OH/ /AH/ /N/

6 Phonetic Annotation HMM Labels /D/ /IY/ /P/ /AH/ /IY-P-AH/ /D-IY-P/

6 Phonetic Annotation • Acoustic Front-End: RASTA-PLP (Channel Invariant) • HMM Models / Gaussian Mixture Models (HTK) • Phoneme Set: 56 categories (CMU) • Triphone models trained on TIMIT • Annotation using Forced-Viterbi • (and CMU pronunciation dictionary)
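The forced-Viterbi step above can be illustrated with a toy aligner. This is a minimal sketch, not the HTK pipeline named on the slide: it assumes one HMM state per phone, no transition penalties, and a hypothetical precomputed table `log_likes[t][j]` giving the log-likelihood of audio frame t under the j-th phone of the known transcript. The function names `forced_align` and `to_segments` are illustrative only.

```python
def forced_align(log_likes):
    """Align frames to a known phone sequence (left-to-right, no skips).

    log_likes[t][j] is the (assumed, precomputed) log-likelihood of frame t
    under the j-th phone of the transcript. Returns one phone index per frame.
    Requires at least as many frames as phones.
    """
    T, J = len(log_likes), len(log_likes[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * J for _ in range(T)]
    back = [[0] * J for _ in range(T)]
    score[0][0] = log_likes[0][0]  # the path must start in the first phone
    for t in range(1, T):
        for j in range(J):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else neg_inf
            if advance > stay:
                score[t][j] = advance + log_likes[t][j]
                back[t][j] = j - 1
            else:
                score[t][j] = stay + log_likes[t][j]
                back[t][j] = j
    # Backtrace from the last phone at the last frame.
    path = [J - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def to_segments(path, phones):
    """Collapse a per-frame path into (phone, start_frame, end_frame) tuples."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((phones[path[start]], start, t))
            start = t
    return segments


# Toy data: frames 0-2 favour the first phone, frames 3-4 the second.
toy = [[0, -5], [0, -5], [0, -5], [-5, 0], [-5, 0]]
print(to_segments(forced_align(toy), ["D", "IY"]))
```

The resulting segment boundaries are what label each stretch of training video with its phoneme.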

7 Head Pose Annotation match planar template

8 Mouth / Chin Annotation Eigenpoints

8 Eigenpoints - Training - Graylevel + XY Control points

8 Eigenpoints - Mapping - Graylevel + XY Control point Space
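The idea of mapping from graylevel space to control-point space can be sketched with a joint PCA. This is a simplified stand-in, not the coupled eigen-analysis used by Eigenpoints itself: it assumes NumPy is available, that training images are flattened to vectors, and that the hypothetical helpers `train_joint_model` and `estimate_points` are illustrative names only.

```python
import numpy as np


def train_joint_model(images, points, k):
    """Fit a k-dimensional PCA on stacked [graylevel | XY control point] vectors.

    images: (n, d_img) flattened graylevel patches
    points: (n, d_pts) flattened XY control points
    """
    joint = np.hstack([images, points])
    mean = joint.mean(axis=0)
    # Rows of vt are principal directions of the joint space.
    _, _, vt = np.linalg.svd(joint - mean, full_matrices=False)
    basis = vt[:k]
    d = images.shape[1]
    return mean[:d], mean[d:], basis[:, :d], basis[:, d:]


def estimate_points(image, model):
    """Estimate control points for a new image from its graylevels alone."""
    mean_img, mean_pts, basis_img, basis_pts = model
    # Least-squares coefficients that explain the graylevel part...
    coeffs, *_ = np.linalg.lstsq(basis_img.T, image - mean_img, rcond=None)
    # ...are reused to reconstruct the control-point part.
    return mean_pts + coeffs @ basis_pts


# Synthetic check: images and points both depend linearly on a 2-D latent.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
to_img = rng.normal(size=(2, 6))
to_pts = rng.normal(size=(2, 4))
model = train_joint_model(latent @ to_img, latent @ to_pts, k=2)
z = np.array([0.3, -1.2])
print(np.allclose(estimate_points(z @ to_img, model), z @ to_pts))
```

On real images the relation is only approximately linear, so the estimate is approximate and depends on the number of components k.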

11 Synthesis - Overview - background face

12 Synthesis: • Transcribe • Find Lip Clips • Stitch Together /J/ /EH/ /IY/ /L/

13 Matching: /AA/ /T/ /AA/

14 Matching: Co-Articulation /AA/ /T/ /AA/ /UW-T-UW/ ?

15 Matching: Co-Articulation match /AA/ /T/ /AA/ /UW-T-UW/ /AA-T-AA/

16 Co-Articulation: Tri-Phones • More than 20,000 Tri-Phones in English: /UW-T-UW/, /AA-T-AA/, /AA-S-AA/, …
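Splitting a phoneme transcript into overlapping triphones can be sketched as follows. The helper name `triphones` is hypothetical, and the sketch ignores word boundaries and silence padding, which a real system would have to handle.

```python
def triphones(phones):
    """Return every window of three consecutive phonemes.

    Neighbouring triphones share two phonemes, so each phoneme's
    left and right context is carried along with it.
    """
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


print(triphones(["D", "IY", "P", "AH"]))
```

For the transcript /D/ /IY/ /P/ /AH/ this yields the two triphones /D-IY-P/ and /IY-P-AH/.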

16 Viseme-based Perceptual Match • 11 Consonant Clusters, including: • CH, JH, SH, ZH • K, G, N, L • T, D, S, Z • P, B, M • F, V • TH, DH • Owens (1985) Confusion Matrix (rows and columns: P B S T K …)
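A viseme-based distance between triphones might be sketched like this. The clusters are the six groups listed on the slide; everything numeric is an assumption made for illustration: the cost 0.2 for two different phonemes in the same cluster, the cost 1.0 otherwise, and the weights that count the centre phoneme twice as much as its context. The slide derives its distances from a confusion matrix, which this sketch does not reproduce.

```python
# Consonant groups copied from the slide (only some of the 11 clusters are listed).
VISEME_CLUSTERS = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]


def phone_distance(a, b, same_viseme_cost=0.2):
    """0 for identical phonemes, a small cost inside one cluster, 1 otherwise.

    The numeric costs are assumptions, not values from the slides.
    """
    if a == b:
        return 0.0
    for cluster in VISEME_CLUSTERS:
        if a in cluster and b in cluster:
            return same_viseme_cost
    return 1.0


def triphone_distance(target, candidate, weights=(1.0, 2.0, 1.0)):
    """Weighted sum of per-position phoneme distances (weights are assumed)."""
    return sum(w * phone_distance(t, c)
               for w, t, c in zip(weights, target, candidate))


target = ("AA", "T", "AA")
print(triphone_distance(target, ("AA", "S", "AA")))  # same viseme, same context
print(triphone_distance(target, ("UW", "T", "UW")))  # same phone, other context
```

Under these assumed costs, /AA-S-AA/ scores closer to the target /AA-T-AA/ than /UW-T-UW/ does, because /T/ and /S/ share a cluster while the vowel context differs completely.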

McGurk Effect -- Baldy by Cohen & Massaro

17 Matching: Viseme-Distance • Target: /AA/ /T/ /AA/ • Correct phone, wrong context: /UW-T-UW/ • Correct viseme, correct context: /AA-S-AA/

18 Matching: Viseme-Distance approximate match /AA/ /T/ /AA/ /UW-T-UW/ /AA-S-AA/

18 Matching: Overlapping Triphones • Shape Distance

18 Matching: Trade-Offs • N-Viseme Distance • Shape Distance • Rate of Speech Distance /IY/ /P/ /AA/ /T/ /AA/

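18 Matching: N-Best Dynamic Programming • $\mathrm{Error} = \sum_t \left[ \alpha V(t) + \beta R(t) + \gamma S(t-1, t) \right]$ (V: viseme distance, R: rate of speech distance, S: shape distance between consecutive clips) • N-best candidates per time step t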
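The minimisation over N-best candidates can be sketched as a Viterbi-style dynamic program. This is an illustration under stated assumptions, not the authors' implementation: the inputs `viseme[t][i]`, `rate[t][i]` and `shape[t][i][j]` are hypothetical precomputed cost tables, where `shape[t][i][j]` is the shape distance between candidate i at step t-1 and candidate j at step t (`shape[0]` is unused), and the function name is made up.

```python
def best_clip_sequence(viseme, rate, shape, alpha=1.0, beta=1.0, gamma=1.0):
    """Pick one candidate clip per time step with minimum total error.

    Total error = sum over t of alpha*V(t) + beta*R(t) + gamma*S(t-1, t).
    Returns (chosen candidate index per step, total error).
    """
    cost = [alpha * v + beta * r for v, r in zip(viseme[0], rate[0])]
    back = []
    for t in range(1, len(viseme)):
        new_cost, pointers = [], []
        for j in range(len(viseme[t])):
            # Cheapest predecessor for candidate j, including the join cost.
            prev = min(range(len(cost)),
                       key=lambda i: cost[i] + gamma * shape[t][i][j])
            new_cost.append(cost[prev] + gamma * shape[t][prev][j]
                            + alpha * viseme[t][j] + beta * rate[t][j])
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    last = min(range(len(cost)), key=cost.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, cost[last]


# Two steps, two candidates each; the best-matching first clip joins badly.
viseme = [[0, 1], [0, 2]]
rate = [[0, 0], [0, 0]]
shape = [None, [[10, 0], [0, 10]]]
print(best_clip_sequence(viseme, rate, shape))
```

In the toy data a greedy choice would take candidate 0 at the first step, but the shape term makes the sequence (1, 0) cheaper overall, which is the trade-off the weights α, β and γ control.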

19 Stitching + +

20 Stitching + +

21 Stitching Morphing

21 Morphing Affine-Warp + Beier-Neely

21 Simple Lighting Correction 1.) Intensity X 2.) Alpha Blending X
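One plausible reading of the two steps is a multiplicative intensity gain followed by an alpha blend. The sketch below assumes that reading; the slide does not spell out the formulas. It works on flat lists of graylevels for brevity, matches the mean brightness of the mouth patch to the background, and then blends with a per-pixel alpha mask. The function name and the mean-matching rule are assumptions.

```python
def correct_and_blend(patch, background, alpha):
    """Scale the patch to the background's mean intensity, then alpha-blend.

    patch, background: equal-length lists of graylevels (patch mean must be > 0)
    alpha: per-pixel weights in [0, 1]; 1 keeps the patch, 0 keeps the background
    """
    # Step 1: intensity correction by a single multiplicative gain (assumed rule).
    gain = (sum(background) / len(background)) / (sum(patch) / len(patch))
    corrected = [gain * p for p in patch]
    # Step 2: alpha blending of the corrected patch over the background.
    return [a * c + (1 - a) * b for a, c, b in zip(alpha, corrected, background)]


print(correct_and_blend([50, 100, 150], [100, 200, 300], [1, 0.5, 0]))
```

A mask that falls smoothly from 1 in the mouth centre to 0 at the patch border hides the seam between the inserted clip and the background face.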

22 Video Rewrite Results • Ellen - Video Model: 8 minutes of data • JFK - Video Model: 2 minutes of data

23 Contributions • Data-driven lip animation • Automatic, using vision and speech recognition • Photo-realistic: implicitly captures specific appearance + dynamics

24 Video Rewrite Thanks! Acknowledgments: S. Ahmad, M. Bajura, F. Crow, T. Darrell, M. Davis, G. Gordon, K. Force, B. Fuson, B. Lassiter, J. Lewis, K. Rahardja, S. Snibbe, C. Sequine, E. Tauber, B. Verplank, S. White, J. Woodfill, John F. Kennedy

1994: Scott et al (JPL + Graphco Technologies) /e/ /o/ /n/
