Towards Perceptually Realistic Talking Heads: Models, Methods and McGurk David Marshall, Darren Cosker and Paul Rosin

Towards Perceptually Realistic Talking Heads: Models, Methods and McGurk David Marshall, Darren Cosker and Paul Rosin Cardiff School of Computer Science Susan Paddock and Simon Rushton Cardiff School of Psychology Cardiff University

Context: A Talking Head • Development of a Video-Realistic Talking Head • Animation from Continuous Speech • Perceptual Analysis -> Realism

Contribution of this Paper: Perceptual Realism Test • Perceptual Analysis via McGurk Test • Perceptual Test with no prior bias • Used to improve talking head synthesis

Outline of Talk • Video Realistic Talking Head (Overview) • Perceptual Analysis and Testing • The McGurk Effect + McGurk Test • Results: Implications of McGurk • Conclusions + Future Work

Our Talking Head • Image based synthesis • Continuous Speech • Flexible framework – emotion, behaviour BASIC IDEA: • Train on input video and audio • Extracting only low level image and audio features • No phonetic labelling • Synthesise new video using only input audio • Unseen utterances • Speaker Independent

Hierarchical Facial Model • Active Appearance Models – Control of shape and texture using single ‘appearance parameter’ • Based on Principal Component Analysis (PCA) • Non-linear Hierarchical PCA (developed at Cardiff) • Greater Separation of Variation • High Degree of Control – Sub-Facial variation not orthogonal in standard PCA model • Coupling of Speech Model (Cardiff Idea)
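As a rough illustration of the PCA step that underlies this kind of model, here is a minimal sketch using NumPy. It is plain linear PCA on toy vectors, not the non-linear hierarchical variant described on the slide. The function names (`build_pca_model`, `project`, `reconstruct`) and the toy data are assumptions made for illustration only.

```python
import numpy as np

def build_pca_model(samples, n_components):
    """Fit a linear PCA model; returns (mean, basis) with one component per row."""
    data = np.asarray(samples, dtype=float)
    mean = data.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(sample, mean, basis):
    """Map a sample (e.g. a flattened landmark shape) to its model parameters."""
    return basis @ (np.asarray(sample, dtype=float) - mean)

def reconstruct(params, mean, basis):
    """Map model parameters back to a sample vector."""
    return mean + basis.T @ params

# Toy "shapes" (hypothetical data): one base vector varied along a single mode.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
mode = rng.normal(size=8)
shapes = [base + w * mode for w in np.linspace(-1.0, 1.0, 20)]

mean, basis = build_pca_model(shapes, n_components=1)
params = project(shapes[3], mean, basis)
approx = reconstruct(params, mean, basis)
```

Because the toy data varies along a single direction, one component is enough to reconstruct every sample. Real shape and texture data would need more components, chosen for example by retained variance.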

Building A Talking Head - Initialisation • For Each Video Frame Extract: • Shape – Key Landmark Points (Tracker Helps) • Textures – Colour Pixel Values Normalised to Shape • Speech Features – Mel-Cepstral, Linear Predictive Coding (LPC)
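Extracting speech features "for each video frame" implies pairing every frame with a window of audio samples. A minimal sketch of that alignment follows. The frame rate (25 fps), the sample rate (44100 Hz) and the function name are assumptions, not values taken from the slides.

```python
def audio_window_for_frame(frame_index, sample_rate=44100, fps=25.0):
    """Return the (start, end) audio sample indices spanned by one video frame.

    Both defaults are illustrative assumptions.
    """
    start = round(frame_index * sample_rate / fps)
    end = round((frame_index + 1) * sample_rate / fps)
    return start, end

# Consecutive frames get contiguous, non-overlapping windows.
first = audio_window_for_frame(0)
second = audio_window_for_frame(1)
```

Speech features such as Mel-cepstral or LPC coefficients would then be computed over each window, possibly with some overlap between neighbouring windows.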

Building A Talking Head - Tracking • Semi Automated • Hand Place Few Frames • Build Interim Shape Model • Track Other Frames • Build Final Shape Model

Building A Talking Head - Learning/Model Building Active Appearance Model (AAM)-> Shape (PCA) and Texture (PCA) Speech/Appearance Model (SAAM NEW) -> Speech (PCA) and AAM • Nonlinear PCA: • Gaussian Mixture Model (GMM) • Model of Dynamics: • Hidden Markov Model (HMM)

Building A Talking Head - Synthesis + Reconstruction Input Speech -> Extract Speech Features + Find Best Clusters Bottom up reconstruction: Mouth Driven
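The "find best clusters" step can be sketched as a nearest-centre lookup. This is a deliberately simplified stand-in: it uses Euclidean distance instead of GMM likelihoods and ignores the HMM dynamics. All names and numbers are hypothetical.

```python
import math

def nearest_cluster(feature, centres):
    """Index of the cluster centre closest (Euclidean) to a speech feature vector."""
    return min(range(len(centres)), key=lambda i: math.dist(feature, centres[i]))

def synthesise(features, centres, mouth_params):
    """Map each audio frame's features to the mouth parameters of its best cluster."""
    return [mouth_params[nearest_cluster(f, centres)] for f in features]

# Hypothetical toy model: three clusters in a 2-D speech-feature space,
# each paired with a 1-D mouth appearance parameter.
centres = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
mouth_params = [(-0.5,), (0.2,), (0.9,)]

speech = [(0.1, -0.1), (3.6, 0.3), (0.9, 1.2)]
trajectory = synthesise(speech, centres, mouth_params)
```

In a bottom-up, mouth-driven reconstruction, a trajectory like this would drive the mouth region first, with the rest of the face reconstructed from it.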

Talking Head Examples

Talking Head Example: Independent Speaker

Perceptual Analysis of Talking Heads Current Talking Head Analysis Methods • Subjective Evaluation • Analyse and Compare Trajectories • Improved Perception in Noisy environments • Forced Choice Testing

Perceptual Analysis of Talking Heads: Subjective and Trajectory Evaluation • Subjective Evaluation • Does it “look good”? • No formative comparison • No feedback to improve model • Analyse and Compare Trajectories • Ground truth quantitative assessment • Comparison to “seen” data • No perceptual quality measurement

Perceptual Analysis of Talking Heads:Noisy Environment Evaluation • Noisy Environment Evaluation • Perceptual Evaluation • Compare Performance of Synthetic v Real Talking Head in realistic situations • Good overall test of talking head • Lip-syncing, realism • No Quantitative Measure of Performance

Perceptual Analysis of Talking Heads:Forced Choice Testing Forced Choice Testing: • Users Asked if Video is Real or Synthetic • Only says if it looks realistic + lip sync is good • Big Prior Introduced • Users look for artefacts • Randomness Bias in User selection • Bored/Uninterested User • No Quantitative Feedback for Model Improvement • What makes it real/synthetic?

Perceptual Analysis of Talking Heads: A New McGurk Test • McGurk Test for Perceptual Analysis • Subject doesn’t develop a prior • Helps address strengths and weaknesses • Suggests improvements based on these • Complements other tests

Perceptual Analysis of Talking Heads: The McGurk Effect McGurk and MacDonald (1976): • Auditory Syllable Dubbed onto Videotape of a Different Syllable Gives Perception of an Entirely Different Syllable, e.g.: • Audio ‘Ba’ • Visual ‘Ga’ • Perception ‘Da’ • “Close Eyes – Illusion Vanishes” • Raises Psychological Audio-Visual questions: • How are Auditory and Visual Stimuli combined? • Why combine when audio is enough?

Perceptual Analysis of Talking Heads: Some More McGurk Effect Examples

Perceptual Analysis of Talking Heads: McGurk Effect Examples (REAL)

Perceptual Analysis of Talking Heads: McGurk Effect Examples (ANSWERS) Real Examples Tuple: Mat/Dead/Gnat Tuple: Bent/Vest/Vent

Perceptual Analysis of Talking Heads: McGurk Effect Examples (Synthetic)

Perceptual Analysis of Talking Heads: McGurk Effect Examples (ANSWERS) Synthetic Examples Tuple: Fame/Face/Feign Tuple: Mat/Dead/Gnat

Perceptual Analysis of Talking Heads: Our McGurk Test McGurk Perceptual Evaluation Test: • Mix Real and Synthetic tuples. • What word do you perceive? • Users asked to note any differences • NO PRIORS as to real/synthetic forced choice • User only asked about what they hear/perceive • Best Viewing resolution • Tested different resolutions (72x75, 36x289, 720x576 pixels)

Perceptual Analysis of Talking Heads: Our McGurk Experimental Procedure • Mix of Real and Synthetic McGurk Examples • Real examples are a control • Users presented with a series of 60 (30 real, 30 synthetic) examples in random order • Users asked to focus only on the mouth area • Two initial example “training” sequences (not in trial) • Soundproofed booths with adjustable volume and artificial lighting • Replay option for all examples • Users simply record the word they perceive • Users asked three questions after viewing all clips • “Did you notice anything about the videos that you can comment on?” • “Could you tell that some of the videos were computer generated?” • “Did you use the replay button at all?” • 20 psychology undergraduate test subjects (4 male, 16 female) with normal hearing/vision
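The mixed presentation described above can be sketched as a shuffled trial list in which the real/synthetic label is kept for later analysis but never shown to the participant. The clip names, the seed and the function name are hypothetical. The actual test software (Macromedia Director) is not reproduced here.

```python
import random

def build_trial_order(real_clips, synthetic_clips, seed=None):
    """Return a shuffled list of (clip, source) pairs mixing real and synthetic clips."""
    trials = [(clip, "real") for clip in real_clips]
    trials += [(clip, "synthetic") for clip in synthetic_clips]
    # A seeded generator makes a participant's order reproducible.
    random.Random(seed).shuffle(trials)
    return trials

# Hypothetical clip identifiers: 30 real and 30 synthetic examples.
real = [f"real_{i:02d}" for i in range(30)]
synthetic = [f"synth_{i:02d}" for i in range(30)]

order = build_trial_order(real, synthetic, seed=1)
```

Keeping the source label out of the interface is what avoids the real-versus-synthetic prior that forced choice testing introduces.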

Perceptual Analysis of Talking Heads: How is Our McGurk Test a Test • How is this a test? • Correct Lip Synch = McGurk Effect • Incorrect Lip Synch = Audio/Other • Audio should be dominant • Questions Assess Behaviour/Output • After the test procedure, participants are asked whether they noticed anything unnatural

Perceptual Analysis of Talking Heads: Results • Four Types of Analysis of Results: • Standard McGurk Response • From each tuple, take the accepted audio and accepted McGurk responses • Original McGurk observation • Enhanced McGurk Response • Assemble a List of All Participants’ McGurk Responses • Allows for greater variability in accents/articulation • Allows for greater analysis and Improvement of Head Models • Effects of Resolution on McGurk Effect • End of Test Questions Analysis • General overall response, qualitative analysis
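The difference between the standard and enhanced analyses can be sketched as a change in which answers count as a McGurk response. The sketch assumes a tuple is ordered audio/visual/McGurk percept, following the Ba/Ga/Da example. The participant responses and the extra accepted word "nat" are made up for illustration and are not results from the study.

```python
from collections import Counter

def classify_response(response, audio_word, mcgurk_words):
    """Label one written response as 'audio', 'mcgurk' or 'other'."""
    word = response.strip().lower()
    if word == audio_word:
        return "audio"
    if word in mcgurk_words:
        return "mcgurk"
    return "other"

def score(responses, audio_word, mcgurk_words):
    """Count the response categories for one clip."""
    return Counter(classify_response(r, audio_word, mcgurk_words) for r in responses)

# Tuple Mat/Dead/Gnat, assumed to mean audio "mat", visual "dead", McGurk "gnat".
responses = ["Gnat", "mat", "gnat", "Nat", "dead"]  # made-up answers

standard = score(responses, "mat", {"gnat"})         # accepted McGurk word only
enhanced = score(responses, "mat", {"gnat", "nat"})  # plus other observed McGurk-like words
```

Under the standard scoring the clip gets two McGurk responses, and under the enhanced scoring it gets three, which is the extra tolerance for accent and articulation that the enhanced analysis is meant to provide.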

Perceptual Analysis of Talking Heads: Standard McGurk Response

Perceptual Analysis of Talking Heads: Enhanced McGurk Response

Perceptual Analysis of Talking Heads: Image Resolution

Perceptual Analysis of Talking Heads:End of Test Questions Results • “Notice anything to comment on?” Some audio didn’t match video • “Could you tell some synthetic?” No, 1 participant = some unnatural? • “Did you use replay?” Few = once, One = twice

Perceptual Analysis of Talking Heads:Overall Results Analysis • Realistic behaviour • Most users were unaware of synthetic output • More McGurk effects in real output • Points to some weakness in model • Good Synthesis of /F/, /D/, /S/, /A/ and /E/ • Poor Synthesis of /V/ • Some weak real and synthetic McGurk responses • Beige-Gaze-Deige -> 2X Audio v McGurk • Mock-Dock-Knock -> 50:50 Audio:McGurk • Resolution has effect on real only • Due to overall lower synthetic McGurk response

Conclusions • Suggested a perceptual approach to analysis and development of a Talking Head • Unbiased by prior forced choice making • Insight into performance of algorithms • Complements other tests

Perceptual Analysis of Talking Heads:Future Work • Talking Head • Full Emotion • Performance Driven Animation • 3D Modelling • Full 3D appearance modelling • Other perceptual tests • Longer videos – McGurk sentences • Real/Synthesised correct lip synch: McGurk = bad synch? • Emotion – A McGurk emotion test?

Web Links • Paper Downloads www.cs.cf.ac.uk/user/D.P.Cosker/publications.html www.cs.cf.ac.uk/Dave/Publications.html • McGurk Video Clips and McGurk Test Software (Macromedia Director) www.cs.cf.ac.uk/user/D.P.Cosker/McGurk/
