Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion Presented by: Iwan Boksebeld and Marijn Suijten
Authors Tero Karras NVIDIA Timo Aila NVIDIA Samuli Laine NVIDIA Antti Herva Remedy Entertainment Jaakko Lehtinen NVIDIA & Aalto University
Goals of the paper • Create a 3D face mesh animation from audio alone • With the use of CNNs • While keeping latency low • And factoring in emotions
The Problem • Create a full face mesh for added realism • Incorporate emotions into the animation • Deal with the ambiguity of audio • Design and train a CNN that does this
Linguistic-based animation • Input is audio, often with a transcript • Animation results from language-based rules • The strength of this method is the high level of control • The weakness is the complexity of the system • An example of such a model is the Dominance model
Machine learning techniques • Mostly work in 2D • Learn the rules used in linguistic-based animation • Blend and/or concatenate images to produce results • Not really useful for the application here
Capturing emotions • Mostly based on user-supplied parameters • Some work uses neural networks • Creates a mapping from emotion parameters to facial expressions
Audio processing • 16 kHz mono; normalized volume • 260 ms of past and future samples, for a total of 520 ms • Window size chosen empirically • Split into 64 audio frames of 16 ms • 2x overlap: every 8 ms is used twice • Hann window: removes temporal aliasing effects
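A minimal NumPy sketch of this framing and windowing step, assuming 16 ms frames of 256 samples at 16 kHz with an 8 ms (128-sample) hop; the exact padding and frame boundaries used in the paper may differ.

```python
import numpy as np

def frame_audio(window, frame_len=256, hop=128):
    """Split a 520 ms analysis window (16 kHz mono, volume-normalized)
    into 64 overlapping 16 ms frames (256 samples, 8 ms hop) and apply
    a Hann window to reduce temporal aliasing."""
    n_frames = 1 + (len(window) - frame_len) // hop   # 64 frames for 8320 samples
    hann = np.hanning(frame_len)
    frames = np.stack([
        window[i * hop : i * hop + frame_len] * hann
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```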
Autocorrelation • Calculate K = 32 autocorrelation coefficients per frame • 12 would be enough to identify individual phonemes • More are needed to identify pitch • No special techniques for linear separation of phonemes • Tests indicate this representation is clearly superior to the alternatives tried
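A sketch of the per-frame autocorrelation step, applied to each of the 64 windowed frames to build the 64×32 input of the formant analysis network. Only K=32 comes from the slides; the mean removal and energy normalization shown here are assumptions.

```python
import numpy as np

def autocorr_coeffs(frame, K=32):
    """Compute the first K autocorrelation coefficients of one windowed frame."""
    x = frame - frame.mean()                      # remove DC offset (assumption)
    full = np.correlate(x, x, mode="full")        # lags -(N-1) .. N-1
    ac = full[len(x) - 1 : len(x) - 1 + K]        # keep lags 0 .. K-1
    return ac / (ac[0] + 1e-8)                    # normalize by lag-0 energy (assumption)
```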
CNN Layout • Formant analysis network: • First layer is the audio-processing and autocorrelation step • Input: time axis of 64 frames × 32 autocorrelation coefficients • Followed by 5 convolution layers • Converts the formant audio features into 256 abstract feature maps
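A rough PyTorch sketch of the formant analysis network. Only the 5-layer depth, the 64×32 input and the 256 output feature maps come from the slides; kernel sizes, strides, activations and the intermediate channel counts are assumptions.

```python
import torch.nn as nn

class FormantAnalysis(nn.Module):
    """Five strided convolutions that collapse the 32 autocorrelation
    coefficients while keeping the 64-step time axis, ending in 256
    abstract feature maps."""
    def __init__(self):
        super().__init__()
        chans = [1, 32, 64, 128, 192, 256]  # assumed progression to 256
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            # convolve along the coefficient axis only: kernel (1,3), stride (1,2)
            layers += [nn.Conv2d(cin, cout, kernel_size=(1, 3),
                                 stride=(1, 2), padding=(0, 1)),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):           # x: (batch, 1, 64, 32)
        return self.net(x)          # -> (batch, 256, 64, 1)
```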
CNN Layout • Articulation network • Analyzes the temporal evolution of the formant features • Also 5 convolution layers • The emotion vector is concatenated to each layer
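Continuing the sketch, a hedged version of the articulation network with the emotion vector concatenated as extra channels before every layer (see the slides below); kernel sizes, strides and the emotion dimensionality E are assumptions.

```python
import torch
import torch.nn as nn

class Articulation(nn.Module):
    """Five convolutions along the time axis that analyze the temporal
    evolution of the formant features, with the E-dimensional emotion
    vector concatenated before every layer."""
    def __init__(self, emotion_dims=16):     # E=16 is a placeholder value
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(256 + emotion_dims, 256, kernel_size=(3, 1),
                      stride=(2, 1), padding=(1, 0))
            for _ in range(5)
        ])

    def forward(self, x, emotion):      # x: (B, 256, 64, 1), emotion: (B, E)
        for conv in self.layers:
            # broadcast the emotion vector over the time axis and concatenate
            e = emotion[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
            x = torch.relu(conv(torch.cat([x, e], dim=1)))
        return x                        # time axis reduced 64 -> 2
```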
Working with emotions • Speech is highly ambiguous • Consider silence: what should the face look like?
Representing Emotions • Emotional state stored as a "meaningless" E-dimensional vector • Learned jointly with the network • The vector is concatenated to the convolution layers of the articulation network • Concatenating it to every layer gives a significantly better result • Early layers gain nuanced control over details such as coarticulation • Later layers have more control over the overall output pose
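The emotion vectors are free latent variables, one per training sample, optimized jointly with the network weights. A minimal sketch using an embedding table; the dimensionality E and the initialization scale are assumptions.

```python
import torch.nn as nn

class EmotionDatabase(nn.Module):
    """One learnable E-dimensional emotion vector per training sample,
    with no predefined semantic meaning."""
    def __init__(self, n_training_samples, emotion_dims=16):
        super().__init__()
        self.table = nn.Embedding(n_training_samples, emotion_dims)
        nn.init.normal_(self.table.weight, std=0.01)   # small random init (assumption)

    def forward(self, sample_idx):      # sample_idx: (B,) long tensor
        return self.table(sample_idx)   # -> (B, E)
```

At inference time, one of the learned vectors is selected or interpolated by hand, as described on the "Inferring emotion" slide.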
Training Target • Use 9 cameras to obtain an unstructured mesh and optical flow • Project a template mesh onto the unstructured mesh • Link the optical flow to the template • The template mesh is then tracked across the performance • A subset of vertices is used to stabilize the head • Limitation: the tongue is not captured
Training Data • Pangrams and in-character material • 3-5 minutes per actor (trade-off of quality vs. time/cost) • Pangrams: sentences designed to cover as many sounds of a language as possible • In-character: captures emotions based on the character's narrative • Time-shifting data augmentation (see the sketch below)
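One possible reading of the time-shifting augmentation: jitter the position of the 520 ms analysis window by a few milliseconds relative to the target frame. The ±8 ms range and the clipping behaviour are assumptions.

```python
import numpy as np

def time_shift_augment(audio, center, max_shift=0.008, sr=16000):
    """Extract a 520 ms window around `center`, randomly shifted by up to
    +/- max_shift seconds, so the network does not latch onto exact
    frame alignment."""
    shift = np.random.randint(-int(max_shift * sr), int(max_shift * sr) + 1)
    half = int(0.260 * sr)                               # 260 ms of past and future
    start = np.clip(center + shift - half, 0, len(audio) - 2 * half)
    return audio[start : start + 2 * half]
```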
Loss Function • The loss function consists of 3 terms: • Position term • Motion term • Regularization term • A normalization scheme is used to balance these terms
Position Term • Ensures correct vertex locations • V: # of vertices • y: desired (captured) position • ŷ: predicted position (network output)
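A hedged reconstruction of the position term from the slide's symbols (V vertices, desired y, predicted ŷ); the exact scaling constant follows the paper's normalization scheme and is not reproduced here.

```latex
P(x) \;=\; \frac{1}{V} \sum_{v=1}^{V} \bigl\lVert \hat{y}_v(x) - y_v(x) \bigr\rVert^{2}
```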
Motion Term • Ensures correct motion • Compares paired frames • m(·): difference between paired frames
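Correspondingly, a hedged form of the motion term, with m[·] the difference between the paired frames defined on this slide; again up to the paper's normalization constants.

```latex
M(x) \;=\; \frac{1}{V} \sum_{v=1}^{V} \bigl\lVert m\!\bigl[\hat{y}_v(x)\bigr] - m\!\bigl[y_v(x)\bigr] \bigr\rVert^{2}
```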
Regularization Term • Ensures the emotion does not change erratically • Normalized to prevent the term from becoming ineffective • E: # of emotion components • e(i): i-th component of the emotion vector for sample x
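And a hedged form of the regularization term, penalizing short-term changes of the emotion components with the same paired-frame difference m[·]; the normalization that keeps the term from becoming ineffective is applied on top of this and is omitted here.

```latex
R(x) \;=\; \frac{1}{E} \sum_{i=1}^{E} m\!\bigl[e_i(x)\bigr]^{2}
```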
Inferring emotion • Step 1: Cull “wrong” emotion vectors • Bilabials -> closed mouth • Vowels -> open mouth • Step 2: Visually inspect the animation • Remove vectors with short-term effects • Step 3: Use a voice from a different actor • Unnatural results -> lack of generalization • Manually assign semantic meaning to the remaining vectors • Interpolate emotion vectors for transitions/complex emotions
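A trivial sketch of the interpolation mentioned in the last bullet: blending two inferred emotion vectors to build transitions or composite emotional states. Linear interpolation is an assumption about how the blending is done.

```python
import numpy as np

def blend_emotions(e_a, e_b, t):
    """Blend two E-dimensional emotion vectors; t=0 gives e_a, t=1 gives e_b.
    Intermediate t values produce transitional/composite emotional states."""
    return (1.0 - t) * np.asarray(e_a) + t * np.asarray(e_b)
```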
User Study Setup • Blind user study with 20 participants • Users were asked to choose which of two clips was more realistic • Two sets of experiments • Comparison against other methods • DM (dominance model) vs. PC (performance capture) vs. theirs • Audio from a validation set not used in training • 13 clips of 3-8 seconds each • Generalization over language and gender • 14 clips from several languages • Taken from an online database without checking the output beforehand
User Study Results • Clearly better than DM • Still quite a bit worse than PC • Generalizes quite well across languages • Even compared to the linguistic method
Drawbacks of the solution • No residual motion • No blinking • No head movement • Assumes a higher-level system handles these • Problems with similar-sounding phonemes • E.g. confusing B and G • Fast languages are a problem • Novel data needs to be somewhat similar to the training data • Misses detail compared to PC • Emotions have no defined meaning
Discussion • Is it useful to infer emotions the way this paper describes? • Why not simply specify the emotion explicitly?
Discussion • Should they have used more participants?
Discussion • Why do you think blinking and eye/head motion are not covered by the network?