End-to-End Speech-Driven Facial Animation with Temporal GANs

Patrick Groot Koerkamp (6628478)

High-level overview • Generating videos of a talking head • Audio-synchronized lip movements • Natural facial expressions (blinks and eyebrow movements) • Temporal GAN

Generative Adversarial Networks (GAN) • Generator • Discriminator
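For reference, the interplay between generator and discriminator is usually written as the standard GAN minimax objective. This is the generic formulation, not the specific loss of the model presented here; the symbols (x for a real sample, z for a noise vector, G for the generator, D for the discriminator) are the conventional ones and do not come from the slides.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator tries to maximize V by telling real samples from generated ones, while the generator tries to minimize it by producing samples the discriminator accepts as real.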

Motivation • Simplify film animation process • Better lip-syncing • Generate parts of occluded faces • Improve band-limited visual telecommunications

Background • Generate realistic faces • Mapping audio features (MFCC) • Computer graphics: overhead • Transforming audio features to video frames: neglects facial expressions • Generating from present information only: no facial dynamics • Challenging

Proposal / Contributions • GAN capable of generating videos from an audio signal and a single still image • Subject-independent • No handcrafted audio features • No reliance on visual features • No post-processing • Comprehensive assessment of method performance • Image quality • Lip-reading verification • Identity preservation • Realism (Turing test)

Related work • Speech-Driven Facial Animation • Acoustics, Vocal-tract, Facial motion • Hidden Markov Models (HMM) • Deep neural networks • Convolutional neural networks • GAN-Based Video Synthesis • Image/Video generation • MoCoGAN • Cross-modal applications

End-to-End Speech-Driven Facial Synthesis • 1 Generator • ReLU > TanH • 2 Discriminators • ReLU > Sigmoid

Generator • Identity Encoder • Context Encoder • Audio Encoder • RNN • Frame Decoder • Noise Generator

Audio Encoder & Context Encoder • Audio Encoder • 7-layer CNN • Extracts 256-dimensional features • Passed to RNN • Context Encoder • Audio Encoder • 2-layer GRU (Gated Recurrent Unit)

Identity Encoder & Frame Decoder • Identity Encoder • 6-layer CNN • Produces identity encoding • Frame Decoder • 6-layer CNN • Generates a frame of the sequence

Discriminators • Frame Discriminator • 6-layer CNN • Is the frame real or not? • Sequence Discriminator

Training • Loss (formula omitted) • L1 loss (formula omitted) • Obtain optimal generator G* • Adam optimizer • Learning rates • Generator: 2 × 10^-4 • Frame Discriminator: 10^-3 • Decay after epoch 20 (10% rate) • Sequence Discriminator: 5 × 10^-5
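The learning-rate settings above can be sketched as a small schedule function. This is a minimal sketch under stated assumptions, not the presenter's training code: it reads "10% rate" as multiplying the learning rate by 0.9 for every epoch past epoch 20, and it applies the decay to the generator and frame discriminator while keeping the sequence discriminator fixed. Neither detail is spelled out on the slide, and the names `BASE_LR`, `DECAYING` and `learning_rate` are hypothetical.

```python
# Base learning rates as listed on the slide.
BASE_LR = {
    "generator": 2e-4,
    "frame_discriminator": 1e-3,
    "sequence_discriminator": 5e-5,
}

# Assumption: which networks decay is not stated unambiguously on the slide.
DECAYING = {"generator", "frame_discriminator"}
DECAY_START_EPOCH = 20
DECAY_RATE = 0.10  # assumption: lose 10% of the current rate per epoch


def learning_rate(network, epoch):
    """Return the learning rate of `network` at a given epoch."""
    base = BASE_LR[network]
    if network not in DECAYING or epoch <= DECAY_START_EPOCH:
        return base
    steps = epoch - DECAY_START_EPOCH
    return base * (1.0 - DECAY_RATE) ** steps


if __name__ == "__main__":
    for name in BASE_LR:
        rates = [learning_rate(name, e) for e in (1, 20, 21, 30)]
        print(name, ["%.2e" % r for r in rates])
```

In a framework such as PyTorch, each value would be passed to the corresponding network's Adam optimizer at the start of the epoch.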

Experiments • PyTorch • Nvidia GTX 1080 Ti • Takes a week to train • Avg. generation time: 7ms • 75 sequential frames synthesized in 0.5s • CPU • Avg. generation time: 1s • 75 sequential frames synthesized in 15s
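The timing figures above can be cross-checked with a little arithmetic. The sketch below only uses the sequence timings quoted on the slide (75 frames in 0.5 s and in 15 s); the helper names are hypothetical.

```python
def per_frame_ms(num_frames, total_seconds):
    """Amortized time per frame in milliseconds."""
    return total_seconds / num_frames * 1000.0


def frames_per_second(num_frames, total_seconds):
    """Throughput in frames per second."""
    return num_frames / total_seconds


gpu_ms = per_frame_ms(75, 0.5)
gpu_fps = frames_per_second(75, 0.5)
cpu_ms = per_frame_ms(75, 15.0)
cpu_fps = frames_per_second(75, 15.0)

print("GPU: %.1f ms/frame, %.0f frames/s" % (gpu_ms, gpu_fps))
print("CPU: %.1f ms/frame, %.0f frames/s" % (cpu_ms, cpu_fps))
```

On the GPU the amortized cost is about 6.7 ms per frame, in line with the quoted 7 ms average. On the CPU the amortized cost is about 200 ms per frame; the slide does not say how the 1 s CPU average was measured, so those two figures are not directly comparable.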

Experiments (2) • Datasets • GRID • TCD • Increased training data by mirroring • Metrics • Generated video: PSNR & SSIM • Frame sharpness: FDBM & CPBD • Content: ACD • Accuracy of spoken message: WER
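Two of the metrics above have simple textbook definitions that can be sketched in plain Python: PSNR (peak signal-to-noise ratio) between a reference and a generated frame, and WER (word error rate) between a reference sentence and a transcription. This is an illustrative sketch, not the evaluation code used for the presented results. Frames are assumed to be grayscale lists of rows with values in 0..255, the example sentences are made up, and obtaining a transcription from a generated video (for instance with a lip-reading model) is outside the sketch.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    ref = [p for row in reference for p in row]
    gen = [p for row in generated for p in row]
    if not ref or len(ref) != len(gen):
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - g) ** 2 for r, g in zip(ref, gen)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    frame_a = [[10, 20], [30, 40]]
    frame_b = [[12, 18], [30, 44]]
    print("PSNR: %.2f dB" % psnr(frame_a, frame_b))
    print("WER: %.3f" % wer("bin blue at f two now", "bin blue by f two now"))
```

Higher PSNR means the generated frame is closer to the reference; lower WER means the spoken message is recovered more accurately.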

Qualitative Results • Produces realistic videos • Also works on previously unseen faces • Characteristic human expressions • Frowns • Blinks

Qualitative Results (2) • GAN-based method • L1 loss and adversarial loss • Baseline for quantitative assessment • Failures of static baseline • Opening mouth when silent • Neglecting previous face

Quantitative Results • Performance measure • GRID & TCD datasets • Compare to static baseline • 30-person survey • Turing test • 10 videos • 153 responses • Avg. 63% correct

Quiz

Future work • Different architectures • More natural sequences • Expressions are generated randomly • Natural extension • Capture mood • Reflect mood in facial expressions

Questions
