MikeTalk: An Adaptive Man-Machine Interface • Tony Ezzat, Volker Blanz, Tomaso Poggio
TTVS Overview • Input: Text • Output: Photo-realistic talking face uttering text
Desktop Agents • "You have received 1 email from Tommy Poggio."
Customer Support • "You have bought 20 shares of SONY at $40 each."
Advertisements • "Hi Tony, would you be interested in a ticket from Boston to New York for $50.00?"
Phoneme Corpus Step 1: • collect a visual corpus from a subject • corpus contains 44 words • one word for each American English phoneme
6 Consonantal Visemes Step 2: • extract one image per phoneme: viseme • group visemes together by visual similarity
Solution: Morphing! Simultaneous interpolation of shape & texture. (Beier & Neely 1992) Problem 2: too tedious to specify correspondence by hand across many images!
Solution: Optical Flow (Horn & Schunck 1986) (Lucas & Kanade 1988) • To interpolate between two visemes, optical flow is first computed • A 2D motion vector field is produced: dx(x,y) dy(x,y)
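To make the motion-field idea concrete, here is a minimal sketch of the brightness-constancy equation Ix*u + Iy*v = -It that gradient-based optical flow methods build on. It is an illustration under strong assumptions, not the system's implementation: it solves for one global translation (u, v) over the whole image by least squares, whereas the slides describe a dense field dx(x,y), dy(x,y). The function name `estimate_translation` and the synthetic test images are hypothetical.

```python
import numpy as np

def estimate_translation(img_a, img_b):
    """Least-squares solve of Ix*u + Iy*v = -It over the whole image.

    Assumes the motion is small (well under a pixel) and identical everywhere.
    """
    iy, ix = np.gradient(img_a)   # derivatives along rows (y) and columns (x)
    it = img_b - img_a            # temporal derivative between the two images
    A = np.stack([ix.ravel(), iy.ravel()], axis=1)
    (u, v), *_ = np.linalg.lstsq(A, -it.ravel(), rcond=None)
    return u, v

# Synthetic smooth pattern shifted by (0.3, 0.2) pixels.
ys, xs = np.mgrid[0:64, 0:64].astype(float)
pattern = lambda x, y: np.sin(x / 5.0) + np.cos(y / 7.0)
img_a = pattern(xs, ys)
img_b = pattern(xs - 0.3, ys - 0.2)
u, v = estimate_translation(img_a, img_b)
```

A dense method would solve the same equation per pixel or per local window (with a smoothness term or a windowed sum) instead of once for the whole frame.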
Morphing • Forward warping A to B • Forward warping B to A • Blending • Holefilling
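The four morphing steps can be sketched as follows. This is a simplified illustration, assuming grayscale images, nearest-pixel forward warping, and a naive row-wise hole fill; the function names (`forward_warp`, `fill_holes`, `morph`) and the flow layout `(dx, dy)` are hypothetical, and the actual system may differ in each step.

```python
import numpy as np

def forward_warp(img, dx, dy, alpha):
    """Push each pixel along alpha * (dx, dy); targets never hit stay NaN."""
    h, w = img.shape
    out = np.full((h, w), np.nan)
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + alpha * dx).astype(int)
    ty = np.rint(ys + alpha * dy).astype(int)
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    out[ty[ok], tx[ok]] = img[ok]
    return out

def fill_holes(img):
    """Fill NaN holes with the nearest valid pixel in the same row."""
    out = img.copy()
    for row in out:  # each row is a view, so writes modify `out`
        valid = np.flatnonzero(~np.isnan(row))
        if valid.size == 0:
            continue  # a fully empty row is left as-is in this sketch
        holes = np.flatnonzero(np.isnan(row))
        nearest = valid[np.abs(holes[:, None] - valid[None, :]).argmin(axis=1)]
        row[holes] = row[nearest]
    return out

def morph(a, b, flow_ab, flow_ba, alpha):
    """Warp A toward B, warp B toward A, blend, then fill remaining holes."""
    wa = forward_warp(a, flow_ab[0], flow_ab[1], alpha)
    wb = forward_warp(b, flow_ba[0], flow_ba[1], 1.0 - alpha)
    both = ~np.isnan(wa) & ~np.isnan(wb)
    out = np.where(np.isnan(wa), wb, wa)   # use whichever warp is defined
    out[both] = (1.0 - alpha) * wa[both] + alpha * wb[both]
    return fill_holes(out)

# Toy example: a bright column that moves two pixels to the right.
a = np.zeros((3, 6)); a[:, 2] = 1.0
b = np.zeros((3, 6)); b[:, 4] = 1.0
zeros = np.zeros((3, 6))
flow_ab = (np.full((3, 6), 2.0), zeros)
flow_ba = (np.full((3, 6), -2.0), zeros)
halfway = morph(a, b, flow_ab, flow_ba, 0.5)
```

In the toy example the halfway frame shows the bright column at the midpoint position rather than two half-bright columns, which is the difference between morphing and a plain cross-dissolve.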
Synthesis Database • 16 visemes total • 256 optical flow fields total (16 × 16), one for each ordered pair of visemes
Concatenation and Lip Sync • Load the correct viseme transitions • Concatenate viseme transitions • Sample the viseme transitions using audio durations
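The sampling step can be sketched as a mapping from video frame times to a position inside a viseme transition. The timing convention here is an assumption made for illustration: viseme k is fully shown at the start of phoneme k, and the transition to viseme k+1 lasts for the duration of phoneme k. The function name `sample_transitions` and the example labels are hypothetical.

```python
def sample_transitions(visemes, durations, fps=30.0):
    """Return one (src, dst, alpha) tuple per video frame.

    visemes   : viseme labels, one per phoneme (at least two)
    durations : phoneme durations in seconds, taken from the audio
    alpha     : position inside the src -> dst transition, from 0.0 to 1.0
    """
    # starts[k] is the time at which viseme k is fully shown. The last
    # duration is unused because the final viseme is the end point.
    starts = [0.0]
    for d in durations[:-1]:
        starts.append(starts[-1] + d)

    frames = []
    n_frames = int(round(starts[-1] * fps)) + 1
    k = 0
    for f in range(n_frames):
        t = f / fps
        # Advance to the transition that contains time t.
        while k < len(visemes) - 2 and t >= starts[k + 1]:
            k += 1
        span = starts[k + 1] - starts[k]
        alpha = min(1.0, (t - starts[k]) / span)
        frames.append((visemes[k], visemes[k + 1], alpha))
    return frames

frames = sample_transitions(["m", "a", "p"], [0.5, 0.25, 0.25], fps=4.0)
```

Each tuple selects a stored transition (src to dst) and the point along it to render, so longer phonemes simply receive more frames from the same transition.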
Examples “1, 2, 3, 4, 5” “you have received 10 email messages.” “cat, dog, pig, cow, moose, horse, sheep”
Current Work • Coarticulation • Eye + head movements • Emotion • 3D instead of 2D • Psychophysics
3D With Volker Blanz
Co-articulation • Problem: Current method does not handle coarticulation, so speech looks overly articulated • Can record all possible triphones/quadriphones, but this approach requires a lot of data! • Best method is to learn a model for coarticulation, but what is the representation for the lips?
Principal Components Analysis • Each image is a vector in a high-dimensional space • Using PCA, find the optimal set of vectors that span the space • Project the entire corpus onto those basis vectors
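The three bullets above can be sketched with NumPy's SVD. This is a generic PCA illustration on synthetic data, not the authors' code; the function name `pca` and the random "corpus" are hypothetical stand-ins for the real lip images.

```python
import numpy as np

def pca(images, n_components):
    """Flatten images to vectors, then return (mean, basis, coeffs).

    basis  : n_components orthonormal row vectors (principal directions)
    coeffs : projection of each mean-centered image onto the basis
    """
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]
    coeffs = (X - mean) @ basis.T
    return mean, basis, coeffs

# Synthetic corpus: 20 images of 8x8 pixels generated from two hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(2, 8 * 8))
params = rng.normal(size=(20, 2))
images = (params @ factors).reshape(20, 8, 8)

mean, basis, coeffs = pca(images, n_components=2)
recon = mean + coeffs @ basis   # rebuild every image from two coefficients
```

Because the synthetic corpus has only two underlying factors, two components reconstruct it exactly; real image data would leave a residual that shrinks as components are added.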
Top 2 PCA Bases for /get/ Problem: Too nonlinear!
Flow Component Analysis • Compute optical flow from a reference lip image to all other images in the corpus • Compute PCA on all the flows
Top 2 FPCA Bases for /get/ Much more linear behavior!
Current Work • Now that we have parameterized the mouth, what is the model for mouth synthesis? • How is that model fit to the PCA data?