1 Video Rewrite: Driving Visual Speech with Audio • Christoph Bregler • Michele Covell • Malcolm Slaney • Interval Research Corporation
2 Goal: Photo-realistic Talking Face • Video Rewrite OR Handcoded 3D Model
2 Facial Animation History: • Parke (1972) • Cohen & Massaro, Benoit et al. (1993) • Waters & Terzopoulos (1990), DEC-Face • Lewis (1991) • Litwinowicz & Williams (1994) • Chen, Graf, Petajan, et al. (1995) • Scott et al. (1994) • Ezzat & Poggio (1997) • Pighin et al. + Guenter et al. (1998) • Brand (1999) • Cosatto, Graf (2000)
3 Video Rewrite: Overview /D/ /IY/ /P/ /AH/ Analysis Synthesis
5 Annotation • Phonetic • Head Pose • Mouth Shape /D/ /OH/ /AH/ /N/
6 Phonetic Annotation HMM Labels /D/ /IY/ /P/ /AH/ /IY-P-AH/ /D-IY-P/
6 Phonetic Annotation • Acoustic Front-End: RASTA-PLP (Channel Invariant) • HMM Models / Gaussian Mixture Models (HTK) • Phoneme Set: 56 categories (CMU) • Triphone models trained on TIMIT • Annotation using Forced-Viterbi (and CMU pronunciation dictionary)
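The labels /D-IY-P/ and /IY-P-AH/ on the Phonetic Annotation slide are the overlapping triphones of the phoneme string /D/ /IY/ /P/ /AH/. A minimal sketch of that windowing step, assuming the forced alignment has already produced an ordered phoneme list. The function names `to_triphones` and `label` are hypothetical and do not come from the talk.

```python
def to_triphones(phonemes):
    """Return every window of three consecutive phonemes as a tuple."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


def label(triphone):
    """Format a triphone in the slide's /A-B-C/ notation."""
    return "/" + "-".join(triphone) + "/"


# Usage: the phoneme string from the slide.
for tri in to_triphones(["D", "IY", "P", "AH"]):
    print(label(tri))
```

Sequences shorter than three phonemes simply yield no triphones in this sketch; how the original system handles utterance boundaries is not stated on the slides.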
7 Head Pose Annotation • Match planar template
8 Mouth / Chin Annotation Eigenpoints
8 Eigenpoints - Training - Graylevel + XY Control points
8 Eigenpoints - Mapping - Graylevel + XY Control point Space
9 Video Rewrite: Overview /D/ /IY/ /P/ /AH/ Analysis Synthesis
11 Synthesis - Overview - background face
12 Synthesis: • Transcribe • Find Lip Clips • Stitch Together /J/ /EH/ /IY/ /L/
13 Matching: /AA/ /T/ /AA/
14 Matching: Co-Articulation /AA/ /T/ /AA/ /UW-T-UW/ ?
15 Matching: Co-Articulation /AA/ /T/ /AA/ /UW-T-UW/ /AA-T-AA/ match
16 Co-Articulation: Tri-Phones • More than 20,000 Tri-Phones in English • /UW-T-UW/ /AA-T-AA/ /AA-S-AA/ …
16 Viseme-based Perceptual Match • Owens (1985) Confusion Matrix (rows and columns: P B S T K …) • 11 Consonant Clusters: • CH, JH, SH, ZH • K, G, N, L • T, D, S, Z • P, B, M • F, V • TH, DH
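The consonant clusters above can be read as a lookup table from phoneme to viseme class. A minimal sketch, assuming only the six clusters actually listed on the slide (the slide counts 11, and the unlisted ones are left out rather than guessed). The names `CONSONANT_CLUSTERS`, `VISEME_OF`, and `same_viseme` are hypothetical.

```python
# The six consonant clusters listed on the slide. The slide counts 11 in
# total; the clusters it does not list are deliberately omitted here.
CONSONANT_CLUSTERS = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]

# Map each phoneme to the index of its cluster.
VISEME_OF = {
    phone: idx
    for idx, cluster in enumerate(CONSONANT_CLUSTERS)
    for phone in cluster
}


def same_viseme(a, b):
    """True if a and b are the same phoneme or share a listed cluster."""
    if a == b:
        return True
    va, vb = VISEME_OF.get(a), VISEME_OF.get(b)
    return va is not None and va == vb


print(same_viseme("T", "S"))  # both in the T, D, S, Z cluster
print(same_viseme("T", "P"))  # different clusters
```

Phonemes outside the table, such as vowels, only match themselves in this sketch.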
17 Matching: Viseme-Distance • Target: /AA/ /T/ /AA/ • /UW-T-UW/: correct phone, wrong context • /AA-S-AA/: correct viseme, correct context
18 Matching: Viseme-Distance /AA/ /T/ /AA/ /UW-T-UW/ /AA-S-AA/ approximate match
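One way to make the comparison on the Viseme-Distance slides concrete is a per-position cost summed over the triphone. This is only a sketch: the cost values (0 for the same phone, 0.2 for the same viseme, 1 otherwise) and the equal position weights are assumptions chosen for illustration, not figures from the talk, and `phone_cost`, `triphone_distance`, and `toy_same_viseme` are hypothetical names.

```python
def phone_cost(a, b, same_viseme):
    """Cost of substituting phone b for phone a (values are illustrative)."""
    if a == b:
        return 0.0
    if same_viseme(a, b):
        return 0.2  # assumed small penalty for a visually similar phone
    return 1.0


def triphone_distance(target, candidate, same_viseme, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-position phone costs for two triphones."""
    return sum(
        w * phone_cost(t, c, same_viseme)
        for w, t, c in zip(weights, target, candidate)
    )


def toy_same_viseme(a, b):
    """Toy predicate: only the T, D, S, Z cluster from the slides is known."""
    return {a, b} <= {"T", "D", "S", "Z"}


target = ("AA", "T", "AA")
print(triphone_distance(target, ("UW", "T", "UW"), toy_same_viseme))
print(triphone_distance(target, ("AA", "S", "AA"), toy_same_viseme))
```

Under these assumed costs, /AA-S-AA/ scores closer to the target than /UW-T-UW/, because a matching context outweighs an exact centre phone.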
18 Matching: Overlapping Triphones • Shape Distance
18 Matching: Trade-Offs /IY/ /P/ /AA/ /T/ /AA/ • N-Viseme Distance • Shape Distance • Rate of Speech Distance
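18 Matching: N-Best Dynamic Programming • $\mathrm{Error} = \sum_{t} \left[ \alpha\, V(t) + \beta\, R(t) + \gamma\, S(t-1, t) \right]$ • $V$: viseme distance, $R$: rate-of-speech distance, $S$: shape distance between overlapping triphones • N-best over $t$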
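The error on this slide has the usual shape of a dynamic-programming objective: a per-triphone cost plus a cost linking each triphone to its predecessor. The sketch below is a simplified 1-best search (a Viterbi-style pass), not the N-best variant named on the slide. The data layout (`V[t][i]`, `R[t][i]`, and a callable `S(t, i, j)`) and the name `best_sequence` are assumptions made for illustration.

```python
def best_sequence(V, R, S, alpha=1.0, beta=1.0, gamma=1.0):
    """Choose one candidate clip per target triphone with minimum total error.

    V[t][i], R[t][i]: viseme and rate-of-speech distance of candidate i at step t.
    S(t, i, j): shape distance between candidate i at step t-1 and candidate j at step t.
    Returns (total_error, list of chosen candidate indices).
    """
    n = len(V)
    if n == 0:
        return 0.0, []
    # Best cost of any path ending in each candidate of the first step.
    cost = [alpha * V[0][j] + beta * R[0][j] for j in range(len(V[0]))]
    back = []  # back[t-1][j] = best predecessor of candidate j at step t
    for t in range(1, n):
        new_cost, pointers = [], []
        for j in range(len(V[t])):
            local = alpha * V[t][j] + beta * R[t][j]
            i_best = min(range(len(cost)),
                         key=lambda i: cost[i] + gamma * S(t, i, j))
            new_cost.append(cost[i_best] + gamma * S(t, i_best, j) + local)
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


# Usage: two steps, two candidates each; switching candidates costs 0.5 in shape.
V = [[0.0, 1.0], [2.0, 0.0]]
R = [[0.0, 0.0], [0.0, 0.0]]
shape = lambda t, i, j: 0.0 if i == j else 0.5
print(best_sequence(V, R, shape))
print(best_sequence(V, R, shape, gamma=10.0))
```

Raising `gamma` makes the search prefer neighbouring clips whose shapes agree, even when their individual viseme distances are worse, which is the trade-off the preceding slide describes.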
19 Stitching
21 Stitching: Morphing
21 Morphing: Affine Warp + Beier-Neely
21 Simple Lighting Correction • 1.) Intensity • 2.) Alpha Blending
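The two steps named on this slide can be sketched on a single row of grayscale pixels. The slide gives no formulas, so both functions are assumptions: `match_intensity` uses a plain multiplicative gain as a stand-in for the intensity step, and `alpha_blend` is the standard per-pixel blend `alpha * fg + (1 - alpha) * bg`. Both function names are hypothetical.

```python
def match_intensity(patch, target_mean):
    """Scale a patch so its mean intensity equals target_mean (assumed gain model)."""
    mean = sum(patch) / len(patch)
    if mean == 0:
        return list(patch)  # nothing to scale; avoid division by zero
    gain = target_mean / mean
    return [gain * p for p in patch]


def alpha_blend(foreground, background, alpha):
    """Blend two equally sized pixel rows using a per-pixel alpha mask."""
    if not (len(foreground) == len(background) == len(alpha)):
        raise ValueError("foreground, background and alpha must have equal length")
    return [a * f + (1.0 - a) * b
            for f, b, a in zip(foreground, background, alpha)]


# Usage: brighten a mouth patch, then fade it into the background face.
mouth = match_intensity([10, 20, 30], target_mean=40)
print(alpha_blend(mouth, [0, 0, 0], [0.0, 0.5, 1.0]))
```

A mask that ramps from 0 at the patch border to 1 in its interior hides the seam between the stitched mouth region and the background face.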
22 Video Rewrite Results • Ellen - Video Model: 8 minutes of data • JFK - Video Model: 2 minutes of data
23 Contributions • Data-driven lip animation • Automatic, using vision and speech recognition • Photo-realistic: implicitly captures specific appearance + dynamics
24 Video Rewrite Thanks! Acknowledgments: S. Ahmad M. Bajura F. Crow T. Darrell M. Davis G. Gordon K. Force B. Fuson B. Lassiter J. Lewis K. Rahardja S. Snibbe C. Séquin E. Tauber B. Verplank S. White J. Woodfill John F. Kennedy
1994: Scott et al (JPL + Graphco Technologies) /e/ /o/ /n/