MikeTalk: An Adaptive Man-Machine Interface

Tony Ezzat, Volker Blanz, Tomaso Poggio

TTVS Overview
• Input: Text
• Output: Photo-realistic talking face uttering the text

Desktop Agents

You have received 1 email from Tommy Poggio.

Customer Support

You have bought 20 shares of SONY at $40 each.

Advertisements

Hi Tony, would you be interested in a ticket from Boston to New York for $50.00?

Modules

Phoneme Corpus
Step 1:
• Collect a visual corpus from a subject
• Corpus contains 44 words
• One word for each American English phoneme

6 Consonantal Visemes
Step 2:
• Extract one image per phoneme: its viseme
• Group visemes together by visual similarity

9 Vocalic Visemes (+ 1 Silence Viseme)

Problem 1: Need to Interpolate!

Solution: Morphing!
Simultaneous interpolation of shape & texture (Beier & Neely 1992).
Problem 2: Too tedious to specify correspondence by hand across many images!

Solution: Optical Flow (Horn & Schunck 1986) (Lucas & Kanade 1988)
• To interpolate between two visemes, optical flow is first computed
• A 2D motion vector field is produced: dx(x,y), dy(x,y)
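As a rough illustration of how such a motion estimate could be obtained, the sketch below solves the brightness-constancy constraint Ix·dx + Iy·dy + It = 0 by least squares over one whole patch, in the spirit of Lucas & Kanade. It is a simplification and not the MikeTalk implementation: it returns a single (dx, dy) instead of a dense field dx(x,y), dy(x,y), and the names `estimate_translation` and `blob`, as well as the synthetic images, are assumptions made for the example.

```python
import numpy as np

def estimate_translation(a, b):
    """Least-squares estimate of one (dx, dy) that moves image a onto image b.

    Assumes small motion, so that Ix*dx + Iy*dy + It = 0 holds approximately.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    iy, ix = np.gradient(a)          # spatial derivatives along rows (y) and columns (x)
    it = b - a                       # temporal derivative between the two images
    A = np.stack([ix.ravel(), iy.ravel()], axis=1)
    solution, *_ = np.linalg.lstsq(A, -it.ravel(), rcond=None)
    dx, dy = solution
    return float(dx), float(dy)

def blob(cx, cy, size=33, sigma=5.0):
    """Hypothetical test image: a smooth Gaussian blob centred at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# The second blob is shifted half a pixel to the right of the first.
dx, dy = estimate_translation(blob(16.0, 16.0), blob(16.5, 16.0))
print(round(dx, 2), round(dy, 2))
```

A dense field would typically repeat this solve in a small window around every pixel, usually within a coarse-to-fine pyramid so that larger lip motions remain within the small-motion assumption.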

Morphing
• Forward warping A to B
• Forward warping B to A
• Blending
• Hole filling
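The four morphing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: flows are given as (dx, dy) array pairs, warping rounds to the nearest pixel, and holes are filled from the nearest valid pixel in the same row. The function names and the toy bar images are hypothetical.

```python
import numpy as np

def forward_warp(img, dx, dy, alpha):
    """Push every pixel of img along alpha * (dx, dy); untouched targets stay NaN (holes)."""
    h, w = img.shape
    out = np.full((h, w), np.nan)
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + alpha * dx).astype(int)
    ty = np.rint(ys + alpha * dy).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    out[ty[inside], tx[inside]] = img[inside]
    return out

def fill_holes(img):
    """Fill NaN holes with the nearest valid pixel in the same row (a deliberately simple rule)."""
    out = img.copy()
    for row in out:                      # each row is a view, so edits land in out
        valid = np.flatnonzero(~np.isnan(row))
        holes = np.flatnonzero(np.isnan(row))
        if valid.size and holes.size:
            nearest = valid[np.abs(holes[:, None] - valid[None, :]).argmin(axis=1)]
            row[holes] = row[nearest]
    return out

def morph(a, b, flow_ab, flow_ba, alpha):
    """Blend a warped toward b with b warped toward a (alpha in [0, 1])."""
    warped_a = fill_holes(forward_warp(a, flow_ab[0], flow_ab[1], alpha))
    warped_b = fill_holes(forward_warp(b, flow_ba[0], flow_ba[1], 1.0 - alpha))
    return (1.0 - alpha) * warped_a + alpha * warped_b

a = np.zeros((4, 10)); a[:, 2] = 1.0                   # bright bar at x = 2
b = np.zeros((4, 10)); b[:, 6] = 1.0                   # same bar at x = 6
right = (np.full(a.shape, 4.0), np.zeros(a.shape))     # flow a -> b: 4 pixels right
left = (np.full(a.shape, -4.0), np.zeros(a.shape))     # flow b -> a: 4 pixels left
halfway = morph(a, b, right, left, 0.5)
print(int(np.argmax(halfway[0])))
```

In this toy case the halfway frame shows a single bar at x = 4, whereas a plain cross-dissolve of the two images would show two half-intensity bars at x = 2 and x = 6.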

Synthesis Database
• 16 visemes total
• 256 optical flow vector fields total (16 × 16), from every viseme to every other viseme
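One simple way to picture this database is a lookup table keyed by ordered viseme pairs. The sketch below is only an assumption about layout: the integer viseme labels are hypothetical, and `None` stands in for a stored flow field. Note that 256 corresponds to all 16 × 16 ordered pairs, which includes each viseme paired with itself.

```python
from itertools import product

N_VISEMES = 16                       # 6 consonantal + 9 vocalic + 1 silence
viseme_ids = list(range(N_VISEMES))  # hypothetical integer labels

# One entry per ordered (source, target) pair; None stands in for a stored flow field.
flow_db = {(src, dst): None for src, dst in product(viseme_ids, repeat=2)}

print(len(flow_db))  # 256
```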

Concatenation and Lip Sync
• Load the correct viseme transitions
• Concatenate viseme transitions
• Sample the viseme transitions using audio durations
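The sampling step can be sketched as below. The slides do not say how durations map onto transitions, so this example assumes each transition is given its own duration in seconds and is sampled at a fixed video frame rate; the function name, the tuple layout, and the viseme labels are all hypothetical.

```python
def sample_transitions(transitions, fps=30.0):
    """Return one (src, dst, alpha) per video frame.

    transitions: list of (src_viseme, dst_viseme, duration_seconds), already
    concatenated in utterance order. alpha in [0, 1] is the morph position
    inside the current transition.
    """
    total = sum(duration for _, _, duration in transitions)
    n_frames = int(round(total * fps))
    frames = []
    k, t_start = 0, 0.0                  # current transition and its start time
    for f in range(n_frames):
        t = f / fps
        # Advance to the transition that contains time t.
        while k < len(transitions) - 1 and t >= t_start + transitions[k][2]:
            t_start += transitions[k][2]
            k += 1
        src, dst, duration = transitions[k]
        alpha = min((t - t_start) / duration, 1.0)
        frames.append((src, dst, alpha))
    return frames

# Hypothetical utterance: silence -> /k/ lasting 0.25 s, then /k/ -> /ae/ lasting 0.5 s.
frames = sample_transitions([("sil", "k", 0.25), ("k", "ae", 0.5)], fps=8.0)
for frame in frames:
    print(frame)
```

Each returned triple names the stored transition to load and the position along it at which to render the morphed frame, so the mouth stays aligned with the audio track.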

Examples
• “1, 2, 3, 4, 5”
• “you have received 10 email messages.”
• “cat, dog, pig, cow, moose, horse, sheep”

Current Work
• Coarticulation
• Eye + head movements
• Emotion
• 3D instead of 2D
• Psychophysics

3D With Volker Blanz

The End

Coarticulation
• Problem: Current method does not handle coarticulation, so speech looks overly articulated
• Can record all possible triphones/quadriphones, but this approach requires a lot of data!
• Best method is to learn a model for coarticulation, but what is the representation for the lips?

Principal Components Analysis
• Each image is a vector in a high-dimensional space
• Using PCA, find the optimal set of vectors that span the space
• Project the entire corpus onto those basis vectors
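These three steps can be sketched with a singular value decomposition. The example is a minimal illustration, assuming the corpus is an array of equally sized grayscale images; the random array is only a stand-in for real mouth images, and the function name `pca` is hypothetical.

```python
import numpy as np

def pca(images, k):
    """Return the mean image vector, the top-k basis vectors, and per-image coefficients."""
    X = np.asarray(images, dtype=float).reshape(len(images), -1)   # one row per image
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)        # rows of vt are orthonormal
    basis = vt[:k]                                                 # top-k principal directions
    coeffs = (X - mean) @ basis.T                                  # project corpus onto basis
    return mean, basis, coeffs

rng = np.random.default_rng(0)
corpus = rng.random((20, 8, 8))          # stand-in for 20 mouth images of 8x8 pixels
mean, basis, coeffs = pca(corpus, k=2)
approx = mean + coeffs @ basis           # reconstruction from 2 coefficients per image
print(coeffs.shape, approx.shape)
```

Plotting the two columns of `coeffs` against each other for one word gives the kind of trajectory that "top 2 bases" plots display.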

Top 2 PCA Bases for /buut/

Top 2 PCA Bases for /get/
Problem: Too nonlinear!

Flow Component Analysis
• Compute optical flow from a reference lip image to all other images in the corpus
• Compute PCA on all the flows
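A sketch of the second step, assuming the flows from the reference image have already been computed: each (dx, dy) field is flattened into one long vector, and PCA is run on those vectors instead of on pixel intensities. The helper names and the synthetic "mouth opening" flows are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def flows_to_matrix(flows):
    """Stack each (dx, dy) pair into one row vector [dx.ravel(), dy.ravel()]."""
    return np.stack([np.concatenate([dx.ravel(), dy.ravel()]) for dx, dy in flows])

def flow_pca(flows, k):
    """PCA on flattened flow fields; returns mean flow, top-k bases, and coefficients."""
    F = flows_to_matrix(flows)
    mean = F.mean(axis=0)
    _, _, vt = np.linalg.svd(F - mean, full_matrices=False)
    basis = vt[:k]
    return mean, basis, (F - mean) @ basis.T

def synthesize_flow(mean, basis, coeffs, shape):
    """Rebuild a (dx, dy) field from PCA coefficients."""
    vec = mean + coeffs @ basis
    half = vec.size // 2
    return vec[:half].reshape(shape), vec[half:].reshape(shape)

# Hypothetical corpus: 11 flows that are scaled copies of one vertical "opening" field.
shape = (6, 6)
ys, xs = np.mgrid[0:6, 0:6]
opening_dy = (ys - 2.5) / 2.5
flows = [(np.zeros(shape), t * opening_dy) for t in np.linspace(0.0, 1.0, 11)]

mean, basis, coeffs = flow_pca(flows, k=1)
dx3, dy3 = synthesize_flow(mean, basis, coeffs[3], shape)
print(np.allclose(dy3, 0.3 * opening_dy))
```

Because the toy flows vary along a single direction, one coefficient reproduces each of them exactly; real lip flows would need more components.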

Top 2 FPCA Bases for /buut/

Top 2 FPCA Bases for /get/
Much more linear behavior!

Current Work
• Now that we have parameterized the mouth, what is the model for mouth synthesis?
• How is that model fit to the PCA data?
