Audio-Driven Facial Animation: Learning Pose and Emotion from Audio Using CNNs

The paper explores creating a 3D face mesh solely from audio input, using CNNs for low-latency animation while factoring in emotions. It addresses the ambiguity of audio and details the CNN layout, the handling of emotions, the training method, and the inference techniques used to represent emotion accurately in facial animation.

campbellm

Presentation Transcript


  1. Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion Presented by: Iwan Boksebeld and Marijn Suijten

  2. Authors Tero Karras NVIDIA Timo Aila NVIDIA Samuli Laine NVIDIA Antti Herva Remedy Entertainment Jaakko Lehtinen NVIDIA & Aalto University

  3. Goals of the paper • Create 3D mesh from just audio • With the use of CNNs • While keeping low latency • And factoring in emotions

  4. The Problem • Create a full face mesh for added realism • Use emotions in the animation • Dealing with ambiguity of audio • Creating a CNN and training this

  5. Related Work

  6. Linguistics-based animation • Input is audio, often with a transcript • Animation results from language-based rules • The strength of this method is the high level of control • The weakness is the complexity of the system • An example of such a model is the Dominance model

  7. Machine learning techniques • Mostly in 2D • Learn the rules given in linguistics-based animation • Blend and/or concatenate images to produce results • Not well suited to the 3D mesh animation targeted here

  8. Capturing emotions • Mostly based on user parameters • Some work with neural networks • Creates mapping from emotion parameters to facial expression

  9. The technical side

  10. Audio processing • 16 kHz mono; normalized volume • 260 ms of past and 260 ms of future samples, 520 ms in total • Window length chosen empirically • Split into 64 audio frames of 16 ms • 2x overlap: every 8 ms is used twice • Hann window: reduces temporal aliasing effects
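
The framing arithmetic on this slide can be checked with a small sketch. It assumes a hop of 8 ms between frame starts (which is what "2x overlap" implies) and a symmetric Hann window; the names `split_into_frames` and `hann` are illustrative, not taken from the paper.

```python
import math

SAMPLE_RATE = 16_000   # 16 kHz mono, as on the slide
WINDOW_MS = 520        # 260 ms past + 260 ms future
FRAME_MS = 16
HOP_MS = 8             # assumption: 2x overlap means frames start every 8 ms


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def split_into_frames(samples):
    """Cut the audio window into overlapping, Hann-weighted frames."""
    frame_len = SAMPLE_RATE * FRAME_MS // 1000   # 256 samples
    hop = SAMPLE_RATE * HOP_MS // 1000           # 128 samples
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# A synthetic 220 Hz tone stands in for real speech.
window_samples = SAMPLE_RATE * WINDOW_MS // 1000  # 8320 samples
audio = [math.sin(2.0 * math.pi * 220.0 * t / SAMPLE_RATE) for t in range(window_samples)]
frames = split_into_frames(audio)
print(len(frames), len(frames[0]))  # 64 256
```

With these numbers the 520 ms window yields exactly 64 frames: 63 hops of 128 samples plus one 256-sample frame is 8320 samples.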

  11. Autocorrelation • Calculate K=32 autocorrelation coefficients per frame • 12 would be enough for identifying individual phonemes • More are needed to identify pitch • No special techniques for linear separation of phonemes • Tests indicate this representation is clearly superior to the alternatives tried
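
As a rough illustration of what K autocorrelation coefficients per frame look like, here is a minimal sketch using the plain, unnormalized definition. The exact variant used in the paper may differ; the function name `autocorrelation` is an assumption made for this example.

```python
def autocorrelation(frame, num_coeffs=32):
    """Return the first num_coeffs autocorrelation values of one frame.

    Coefficient k is the sum of frame[i] * frame[i + k] over all valid i.
    This is the plain unnormalized form, used here only as an illustration.
    """
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(num_coeffs)
    ]


# Tiny worked example with three coefficients.
print(autocorrelation([1.0, 2.0, 3.0, 4.0], num_coeffs=3))  # [30.0, 20.0, 11.0]
```

Applied to each of the 64 frames with `num_coeffs=32`, this gives the 64 x 32 input that the next slide describes.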

  12. CNN Layout • Formant analysis network: • First layer is audio-processing and autocorrelation • Time axis of 64 samples • 32 autocorrelation coefficients • Followed by 5 convolution layers • Convert formant audio features to 256 abstract feature maps
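
The slide gives only the input size (64 time frames by 32 coefficients) and the number of layers. A hypothetical shape trace, assuming each of the five convolution layers halves the coefficient axis and leaves the time axis untouched, shows how 32 coefficients could collapse to a single value per time step. The stride of 2 is an assumption, not something stated on the slide.

```python
def trace_formant_shapes(time_frames=64, coeffs=32, layers=5, stride=2):
    """Return the (time, coefficient) shape after each assumed strided layer."""
    shapes = [(time_frames, coeffs)]
    for _ in range(layers):
        # Assumption: every layer downsamples only the coefficient axis.
        coeffs = max(1, coeffs // stride)
        shapes.append((time_frames, coeffs))
    return shapes


for shape in trace_formant_shapes():
    print(shape)
```

Under this assumption the coefficient axis goes 32, 16, 8, 4, 2, 1, leaving 64 time steps that each carry the abstract feature maps.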

  13. CNN Layout • Articulation network • Analyze temporal evolution • 5 layers as well • Emotion vector concatenated

  14. Working with emotions • Speech is highly ambiguous • Consider silence: what does it look like?

  15. Representing Emotions • Emotional state stored as a "meaningless" E-dimensional vector • Learned together with the network • Vector concatenated to convolution layers in the articulation network • Concatenated to every layer: significantly better result • Gives early layers nuanced control over details such as coarticulation • Later layers have more control over the output pose
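
Concatenating the emotion vector to a layer's activations can be pictured as appending the same E values to the channels at every time step. The sketch below uses nested lists and made-up sizes (3 channels, E=2); `concat_emotion` is a hypothetical helper, not code from the paper.

```python
def concat_emotion(feature_maps, emotion):
    """Append the emotion vector to the channels of every time step.

    feature_maps: list over time steps, each a list of channel activations.
    emotion: list of E learned emotion components (same for all time steps).
    """
    return [channels + list(emotion) for channels in feature_maps]


# Illustrative sizes only: 4 time steps, 3 channels, E = 2.
features = [[0.1, 0.2, 0.3] for _ in range(4)]
emotion = [0.7, -0.4]
combined = concat_emotion(features, emotion)
print(len(combined), len(combined[0]))  # 4 5
```

Repeating this before every articulation layer, rather than only the first, is what the slide reports as giving significantly better results.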

  16. Training

  17. Training Target • Use 9 cameras to get an unstructured mesh and optical flow • Project the template mesh onto the unstructured mesh • Link optical flow to the template • Template mesh is then tracked across the performance • Use some vertices to stabilize the head • Limitation: no tongue

  18. Training Data • Pangrams and in-character material • 3-5 minutes per actor (trade-off between quality and time/cost) • Pangrams: sentences designed to contain as many sounds of a language as possible • In-character: capture emotions based on the character narrative • Time-shifting data augmentation

  19. Loss Function • Loss function with three terms: • Position term • Motion term • Regularization term • A normalization scheme balances these terms

  20. Position Term • Ensure correct vertex location • V: number of vertices • y: desired position • ŷ: position output by the network
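
A minimal sketch of the position term, assuming it is a mean squared error over all 3V vertex coordinates (x, y, z for each of the V vertices). The slide does not show the exact formula, so the averaging is an assumption.

```python
def position_term(desired, predicted):
    """Mean squared difference between desired and predicted coordinates.

    Both arguments are flat lists of 3V numbers (x, y, z per vertex).
    """
    if len(desired) != len(predicted):
        raise ValueError("desired and predicted must have the same length")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(desired, predicted)]
    return sum(squared) / len(desired)


# Two vertices (V = 2), so 6 coordinates.
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
y_hat = [0.0, 0.0, 1.0, 1.0, 1.0, 3.0]
print(position_term(y, y_hat))
```

Here two coordinates are off by 1 and 2, so the result is (1 + 4) / 6.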

  21. Motion Term • Ensure correct motion • Compares paired (adjacent) frames • m[·]: difference between paired frames
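
The motion term can be sketched as the same squared error applied to frame-to-frame differences instead of positions. The `scale` parameter stands in for whatever constant the paper uses, which the slide does not give; `frame_difference` plays the role of the difference operator m.

```python
def frame_difference(frame_a, frame_b):
    """Difference between two paired frames (the operator m on the slide)."""
    return [b - a for a, b in zip(frame_a, frame_b)]


def motion_term(desired_pair, predicted_pair, scale=1.0):
    """Squared error between desired and predicted frame-to-frame motion.

    Each pair is (frame_t, frame_t_plus_1) as flat coordinate lists.
    scale is a placeholder constant; its true value is not on the slide.
    """
    m_y = frame_difference(*desired_pair)
    m_y_hat = frame_difference(*predicted_pair)
    squared = [(d - p) ** 2 for d, p in zip(m_y, m_y_hat)]
    return scale * sum(squared) / len(m_y)


desired_pair = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
predicted_pair = ([0.0, 0.0, 0.0], [1.0, 0.0, 1.0])
print(motion_term(desired_pair, predicted_pair))
```

In the example one coordinate fails to move when it should, so one of three squared differences is 1 and the term evaluates to 1/3.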

  22. Regularization Term • Ensure the emotion vector does not change erratically • Normalized so that the term does not become ineffective • E: number of emotion components • e^(i): i-th component of the emotion vector for sample x
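
One way to read this slide: penalize changes of the emotion vector between paired frames, then divide by the typical magnitude of the emotion vectors so that the network cannot cheat by shrinking them. The sketch below follows that reading; the exact normalization in the paper is not on the slide, so the division by the mean squared component (and the `eps` guard) are assumptions.

```python
def raw_regularization(emotion_pair):
    """Mean squared change of the emotion vector between paired frames."""
    e_a, e_b = emotion_pair
    return sum((b - a) ** 2 for a, b in zip(e_a, e_b)) / len(e_a)


def normalized_regularization(emotion_pair, batch_emotions, eps=1e-8):
    """Raw term divided by the mean squared emotion component in the batch.

    Assumed normalization: it makes the term insensitive to the overall
    scale of the emotion vectors.
    """
    count = sum(len(vec) for vec in batch_emotions)
    mean_sq = sum(e * e for vec in batch_emotions for e in vec) / count
    return raw_regularization(emotion_pair) / (mean_sq + eps)


pair = ([1.0, 2.0], [2.0, 4.0])
scaled_pair = ([10.0, 20.0], [20.0, 40.0])
small = normalized_regularization(pair, list(pair))
large = normalized_regularization(scaled_pair, list(scaled_pair))
print(small, large)
```

Scaling every emotion vector by 10 leaves the normalized value essentially unchanged (about 0.4 in both cases), which is the property the normalization is meant to provide.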

  23. Inference

  24. Inferring emotion • Step 1: Cull “wrong” emotion vectors • Bilabials -> closed mouth • Vowels -> opened mouth • Step 2: Visually inspect animation • Remove short-term effects • Step 3: Use voice from a different actor • Unnatural -> lack of generalization • Manually assign semantic meaning • Interpolate emotion vectors for transitions and complex emotions
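
Interpolating between two emotion vectors, as the last bullet suggests, could be as simple as a linear blend. The slide does not say which interpolation is used, so linear interpolation is an assumption here, and `lerp_emotion` is a hypothetical helper.

```python
def lerp_emotion(e_from, e_to, t):
    """Linearly blend two emotion vectors; t=0 gives e_from, t=1 gives e_to."""
    if len(e_from) != len(e_to):
        raise ValueError("emotion vectors must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(e_from, e_to)]


# Two made-up 2-dimensional emotion vectors.
calm = [0.0, 2.0]
angry = [1.0, 0.0]
print(lerp_emotion(calm, angry, 0.25))  # [0.25, 1.5]
```

Sweeping `t` from 0 to 1 over several frames gives a transition; holding it at an intermediate value gives a mixed emotion.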

  25. Results

  26. Results

  27. User Study Setup • Blind user study with 20 participants • Users were asked to choose which of two animations was more realistic • Two sets of experiments • Comparing against other methods • DM vs PC vs Ours • Audio from validation set not used in training • 13 clips, each 3-8 seconds long • Generalization over language and gender • 14 clips from several languages • From an online database, without checking output

  28. User Study Results

  29. User Study Results

  30. User Study Results • Clearly better than DM • Still quite a bit worse than PC • Generalizes quite well over languages • Even compared to the linguistic method

  31. Critical Review

  32. Drawbacks of solution • No residual motion • No blinking • No head movement • Assumes a higher-level system handles these • Problems with similar-looking sounds • E.g. confusing B and G • Fast languages are a problem • Novel data needs to be somewhat similar to training data • Misses detail compared to PC • Emotions have no defined meaning

  33. Questions?

  34. Discussion

  35. Discussion • Is it useful to gather emotions like this paper describes? • Why not just tell what the emotion is?

  36. Discussion • Should they have used more participants?

  37. Discussion • Why do you think blinking and eye/head motion are not covered by the network?
