Create Photo-Realistic Talking Face Changbo Hu 2001.11.26 *This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang
Outline • Introduction of talking face • Motivations • System overview • Techniques • Conclusions
Introduction • What is a talking face? • Face (lip) animation, driven by voice • Applications • The process of building a talking face • Face model • Motion capture • Mapping between audio and video • Rendering: photo-realistic?
Literature • Waters, 93, DECface, 2D wireframe model • Terzopoulos, 95, skin and muscle model • Bregler, 97, Video Rewrite, sample-image based • T. S. Huang, 98, mesh model from range data • Poggio, 98, MikeTalk, viseme morphing • Guenter, 99, Making Faces, 3D from multi-camera capture • Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint • Cosatto, 00, planar-quads model
Motivations • Aim: a graphics interface for a conversation agent • Photo-realistic • Driven by Chinese speech • Smooth connection between sentences • Extended from "Video Rewrite"
System overview: pipeline of the system. New text → TTS system → wav sound → segmentation → triphone sequence; this sequence is matched against the train database to produce a synthesized triphone (lip motion) sequence, which is combined with a background sequence and rewritten to faces.
Techniques • Analysis: • Audio process • Image process • Synthesis • Lip image • Background image • Stitch together
Audio part: Sound Segmentation • Given the wav file and the script • An HMM is trained for the segmentation system • The wav file is segmented into a phoneme sequence • Example of the segmentation result (phoneme, start, end): SILOPEN 0 23, SILOPEN 24 42, s 43 61, if4 62 74, j 75 80, ia1 81 97, sh 98 109, ang1 110 121, y 122 130, e4 131 133, y 134 145, in2 146 154, h 155 164, ang2 165 194
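As a small illustration, the alignment output above can be read as (phoneme, start, end) triples. A minimal sketch, assuming whitespace-separated triples exactly as in the example; the parser itself is not from the original system.

```python
# Parse forced-alignment output such as "s 43 61 if4 62 74 ..." into
# (phoneme, start, end) triples. The whitespace-separated triple format
# is assumed from the example above.
def parse_segmentation(text):
    tokens = text.split()
    return [(tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
            for i in range(0, len(tokens), 3)]

example = "SILOPEN 0 23 s 43 61 if4 62 74 j 75 80 ia1 81 97"
print(parse_segmentation(example))
# [('SILOPEN', 0, 23), ('s', 43, 61), ('if4', 62, 74), ('j', 75, 80), ('ia1', 81, 97)]
```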
Annotation with Phonemes • Phonemes are used to annotate video frames • Each phoneme in a sentence corresponds to a short segment of the video sequence
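A minimal sketch of this phoneme-to-frame annotation, assuming hypothetical audio and video rates (the actual rates are not stated in the slides):

```python
# Map a phoneme's audio interval to the video frames it annotates.
# The 100 units/s audio rate and 25 fps video rate are assumptions
# for illustration only.
AUDIO_RATE = 100.0
VIDEO_FPS = 25.0

def phoneme_video_frames(start, end):
    first = int(start / AUDIO_RATE * VIDEO_FPS)
    last = int(end / AUDIO_RATE * VIDEO_FPS)
    return list(range(first, last + 1))

print(phoneme_video_frames(43, 61))   # video frames annotated with phoneme "s"
```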
Phoneme Distance Analysis • Phoneme & triphone basics • Chinese phonemes vs. English phonemes • Distance metric definitions • Results
Phoneme Basics • Phonemes represent the basic elements of speech. All possible speech can be represented by combinations of phonemes. CH, JH, S, EH, EY, OY, AE, SIL… • Triphones are three consecutive phonemes. A triphone not only represents pronunciation characteristics but also carries context information. T-IY-P, IY-P-AA, P-AA-T…
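A triphone sequence is simply a sliding window of three phonemes over the phoneme sequence; a minimal sketch:

```python
# Build triphones (three consecutive phonemes) from a phoneme sequence.
def to_triphones(phonemes):
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

print(to_triphones(["T", "IY", "P", "AA", "T"]))
# [('T', 'IY', 'P'), ('IY', 'P', 'AA'), ('P', 'AA', 'T')]
```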
Chinese Phonemes vs. English • Chinese phonemes fall into two basic groups, initials and finals. Initials: B, P, M, F, … Finals: a3, o1, e2, eng3, iang4, ue5, … • Each Chinese final has 5 tones: 1, 2, 3, 4, 5. Different tones: a1, a2, a3, a4, a5. • Chinese finals are actually not basic elements of speech; for example: iang1, iao1, uang1, iong1… • The Chinese phoneme set is much larger than the English one.
Phoneme Distance Analysis • Define the distance between any two phonemes • Since we synthesize only video, not sound, tone is ignored • Lip-shape motion is the core element of the distance metric
Phoneme Distance Analysis • For each phoneme, collect its video instances (Video 1, Video 2, …), time-align them to a uniform length, and average them to get an average video • By comparing the aligned average videos of two phonemes, we generate the distance matrix of the whole phoneme set
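A hedged sketch of this averaging-and-comparison step, assuming each video instance is a (frames, lip-features) array; the linear resampling used for uniform-length alignment and the Euclidean comparison are illustrative choices, not necessarily the original ones.

```python
import numpy as np

TARGET_LEN = 10  # assumed uniform length for alignment

def time_align(video, target_len=TARGET_LEN):
    """Linearly resample a (frames, features) array to target_len frames."""
    idx = np.linspace(0, len(video) - 1, target_len)
    lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
    w = idx - lo
    return (1 - w)[:, None] * video[lo] + w[:, None] * video[hi]

def average_video(videos):
    """Average several time-aligned instances of one phoneme."""
    return np.mean([time_align(v) for v in videos], axis=0)

def phoneme_distance_matrix(videos_per_phoneme):
    """videos_per_phoneme: dict phoneme -> list of (frames, features) arrays."""
    names = sorted(videos_per_phoneme)
    avgs = {p: average_video(videos_per_phoneme[p]) for p in names}
    dist = np.zeros((len(names), len(names)))
    for i, p in enumerate(names):
        for j, q in enumerate(names):
            dist[i, j] = np.linalg.norm(avgs[p] - avgs[q])
    return names, dist
```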
Image part: Pose Tracking • Assume a planar model for the face • A standard minimization method finds the transform matrix (affine transform) [Black, 95] • A mask is used to constrain the region of interest on the face (figure: template picture, mask image)
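A minimal sketch of masked affine registration. It uses OpenCV's ECC alignment as a modern stand-in for the Black '95 minimization method named above; grayscale images and a uint8 mask are assumed.

```python
import cv2
import numpy as np

def track_pose(template_gray, frame_gray, mask):
    """Estimate a 2x3 affine transform aligning the frame to the template,
    with the mask restricting which pixels contribute to the alignment."""
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    # Arguments: template, input, initial warp, motion model, stop criteria,
    # input mask, Gaussian smoothing size.
    _, warp = cv2.findTransformECC(template_gray, frame_gray, warp,
                                   cv2.MOTION_AFFINE, criteria, mask, 5)
    return warp
```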
Pose tracking • Motion prediction using parameters with physical meaning
Pose Tracking Some tracking results:
Lip Motion Tracking • Using eigen points (Covell, 91) • Feature points include the jaw, lips, and teeth • Training database specified manually • Automatic tracking through all pose-tracked images
Lip Motion Tracking (figure: train database, hand-labeled; auto tracking results)
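A simplified stand-in for the eigen-points idea: learn a linear map from mouth-region pixels to feature-point coordinates on the hand-labeled training set, then apply it to pose-tracked frames. This sketch illustrates the principle only; it is not Covell's exact formulation.

```python
import numpy as np

def fit_lip_tracker(train_patches, train_points, reg=1e-3):
    """train_patches: list of 2D image patches; train_points: list of (N, 2) arrays."""
    X = np.array([p.ravel() for p in train_patches], dtype=np.float64)
    Y = np.array([pts.ravel() for pts in train_points], dtype=np.float64)
    X_mean, Y_mean = X.mean(0), Y.mean(0)
    Xc, Yc = X - X_mean, Y - Y_mean
    # Ridge-regularized least squares from pixels to point coordinates.
    W = np.linalg.solve(Xc.T @ Xc + reg * np.eye(X.shape[1]), Xc.T @ Yc)
    return X_mean, Y_mean, W

def track_lip(patch, model):
    X_mean, Y_mean, W = model
    pts = (patch.ravel() - X_mean) @ W + Y_mean
    return pts.reshape(-1, 2)   # feature points (jaw, lip, teeth) as (x, y) pairs
```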
Synthesis of new sentences • The new text is converted by the TTS system to a wav file • The wav is segmented into a phoneme sequence • Dynamic programming finds an optimal video sequence from the training database • The triphone videos are time-aligned and stitched together • The lip sequence is transformed and pasted onto background faces
Lip sequence synthesis • For each position in the new phoneme sequence, the train database supplies several candidate triphone videos (Triphone 1–3, 4–6, 7–9, A–C in the diagram) • The optimal phoneme sequence is the best candidate chosen at each position
Dynamic Programming • The candidate triphones (Triphone 1 … Triphone 5) form a lattice from Begin to End; the lowest-cost path through the lattice selects the video sequence
Edge Cost Definition • Two parts: • Phoneme distance: the distances of the 3 phonemes added together • Lip shape distance over the overlapping portion of the triphone videos • The two parts are added together with weights
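A sketch of the dynamic-programming search implied by the two slides above. It assumes per-position candidate lists, a node cost built from the phoneme distances, and a pairwise lip-shape distance for the overlapping frames of consecutive triphone videos; the weight alpha and both cost functions are placeholders, not values from the original work.

```python
# Viterbi-style search over candidate triphone videos.
# candidates[t]   : list of candidate triphone videos at position t
# node_cost[t][i] : summed phoneme distance of candidate i at position t
# lip_dist(a, b)  : lip-shape distance over the overlap of two candidates
def best_triphone_path(candidates, node_cost, lip_dist, alpha=0.5):
    T = len(candidates)
    best = [node_cost[0][i] for i in range(len(candidates[0]))]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        cur, ptr = [], []
        for j, cand_j in enumerate(candidates[t]):
            costs = [best[i] + alpha * lip_dist(candidates[t - 1][i], cand_j)
                     for i in range(len(candidates[t - 1]))]
            i_best = min(range(len(costs)), key=costs.__getitem__)
            cur.append(costs[i_best] + node_cost[t][j])
            ptr.append(i_best)
        best, back = cur, back + [ptr]
    # Backtrack the optimal sequence of candidate indices.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return list(reversed(path))
```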
Background video generation • The background is a video sequence recorded while the virtual character is speaking something else • Similarity measurement of background frames • Select a "standard frame": the frame with the maximal number of frames similar to it • Filter out the frames with jerkiness
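A minimal sketch of the "standard frame" selection, assuming a simple mean-absolute-difference frame distance and a hypothetical similarity threshold (the actual measure is not given in the slides):

```python
import numpy as np

def select_standard_frame(frames, threshold):
    """Return the index of the frame similar to the largest number of other frames."""
    def frame_dist(a, b):
        return np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64)))
    counts = []
    for i, f in enumerate(frames):
        counts.append(sum(frame_dist(f, g) < threshold
                          for j, g in enumerate(frames) if j != i))
    return int(np.argmax(counts))
```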
Stitch the time-aligned result onto background faces • Write back with a mask • Transform the synthesized lips onto the background face
(Figure: mask image for the write-back operation; original background frame; write-back result of the same frame)
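A minimal sketch of the masked write-back, assuming the synthesized lip image, its mask, and a 2x3 affine transform into background coordinates (e.g., from the pose tracking); the blending itself is a straightforward per-pixel composite.

```python
import cv2
import numpy as np

def write_back(background, lip_image, mask, warp_to_background):
    """Warp the synthesized lip region into the background frame and blend with the mask."""
    h, w = background.shape[:2]
    warped_lip = cv2.warpAffine(lip_image, warp_to_background, (w, h))
    warped_mask = cv2.warpAffine(mask, warp_to_background, (w, h))
    m = (warped_mask[..., None] if warped_mask.ndim == 2 else warped_mask) / 255.0
    return (m * warped_lip + (1.0 - m) * background).astype(np.uint8)
```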
Conclusion and Future Work • Pose tracking and lip motion tracking • Size of the training database • Talking face with expression • Real-time generation? • Fast modeling for different persons