End-to-End Speech-Driven Facial Animation with Temporal GANs Patrick Groot Koerkamp (6628478)
High level overview • Generating videos of a talking head • Audio synchronized lip movements • Natural facial expressions (Blinks and Eyebrow movements) • Temporal GAN
Generative Adversarial Networks (GAN) • Generator • Discriminator
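For reference, the generator/discriminator game named on this slide is usually written as the standard minimax objective below. This is the generic GAN formulation, not necessarily the exact loss used by the presented method; here G is the generator, D the discriminator, x a real sample and z the generator input.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```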
Motivation • Simplify film animation process • Better lip-syncing • Generate parts of occluded faces • Improve band-limited visual telecommunications
Background • Generate realistic faces • Mapping Audio Features (MFCC) • Computer Graphics • Overhead • Transform Audio Features to Video Frames • Neglect facial expressions • Generate on present information • No facial dynamics • Challenging
Proposal / Contributions • GAN capable of generating videos • Audio signal • Single still image • Subject independent • No handcrafted audio features • No reliance on visual features • No post-processing • Comprehensive assessment • Method performance • Image quality • Lip-reading verification • Identity preservation • Realism (Turing test)
Related work • Speech-Driven Facial Animation • Acoustics, Vocal-tract, Facial motion • Hidden Markov Models (HMM) • Deep neural networks • Convolutional neural networks • GAN-Based Video Synthesis • Image/Video generation • MoCoGAN • Cross-modal applications
End-to-End Speech-Driven Facial Synthesis • 1 Generator • ReLU > TanH • 2 Discriminators • ReLU > Sigmoid
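One plausible reading of "ReLU > TanH" and "ReLU > Sigmoid" is that hidden layers use ReLU while the generator ends in TanH and the discriminators end in Sigmoid. The sketch below only illustrates why those output activations fit their roles: TanH gives values in [-1, 1] that can be rescaled to pixels, and Sigmoid gives a value in (0, 1) that can be read as a "real" probability. The helper `to_pixel` and the 8-bit rescaling are assumptions for illustration, not taken from the slides.

```python
import math


def relu(x: float) -> float:
    """Hidden-layer activation: passes positives, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Discriminator output: squashes any score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # numerically stable branch for negative inputs
    return e / (1.0 + e)


def to_pixel(tanh_output: float) -> int:
    """Hypothetical helper: rescale a TanH output in [-1, 1] to 0..255."""
    return round((tanh_output + 1.0) / 2.0 * 255)


if __name__ == "__main__":
    for score in (-4.0, 0.0, 4.0):
        print(score, relu(score), to_pixel(math.tanh(score)), sigmoid(score))
```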
Generator • Identity Encoder • Context Encoder • Audio Encoder • RNN • Frame Decoder • Noise Generator
Audio Encoder & Context Encoder • Audio Encoder • 7 Layer CNN • Extracts 256 dimensional features • Passed to RNN • Context Encoder • Audio Encoder • 2 Layer GRU (Gated Recurrent Unit)
Identity Encoder & Frame Decoder • Identity Encoder • 6 Layer CNN • Produces identity encoding • Frame Decoder • 6 Layer CNN • Generates a frame of the sequence
Discriminators • Frame Discriminator • 6 Layer CNN • Is frame real or not? • Sequence Discriminator
Training • Loss formula • L1 formula • Obtain optimal generator G* • Adam • Learning rate • Generator: 2 × 10^-4 • Frame Discriminator: 10^-3 • Decay after epoch 20 (10% rate) • Sequence Discriminator: 5 × 10^-5
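The learning-rate settings on this slide can be sketched as a small schedule function. Two points are assumptions, not stated on the slide: that the decay applies to the generator and the frame discriminator while the sequence discriminator stays fixed, and that "10% rate" means the rate is multiplied by 0.9 for every epoch past epoch 20. The names `BASE_LR` and `learning_rate` are made up for this illustration.

```python
# Base learning rates as listed on the slide.
BASE_LR = {
    "generator": 2e-4,
    "frame_discriminator": 1e-3,
    "sequence_discriminator": 5e-5,
}

# Assumption: only these two networks decay; the sequence discriminator is fixed.
DECAYING = {"generator", "frame_discriminator"}
DECAY_START_EPOCH = 20
DECAY_FACTOR = 0.9  # assumption: "10% rate" read as a 10% reduction per epoch


def learning_rate(network: str, epoch: int) -> float:
    """Return the learning rate of `network` at a given (1-based) epoch."""
    base = BASE_LR[network]
    if network not in DECAYING or epoch <= DECAY_START_EPOCH:
        return base
    return base * DECAY_FACTOR ** (epoch - DECAY_START_EPOCH)


if __name__ == "__main__":
    for epoch in (1, 20, 21, 30):
        print(epoch, {name: learning_rate(name, epoch) for name in BASE_LR})
```

In a PyTorch training loop the same values would typically be passed to one Adam optimizer per network; the plain function above just makes the schedule explicit.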
Experiments • PyTorch • Nvidia GTX 1080 Ti • Takes a week to train • Avg. generation time per frame: 7 ms • 75 sequential frames synthesized in 0.5 s • CPU • Avg. generation time per frame: 1 s • 75 sequential frames synthesized in 15 s
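As a quick sanity check on the GPU figures, 75 frames at roughly 7 ms each comes to about 0.5 s. The sketch below does that arithmetic and compares the per-frame cost against a playback budget; the 25 fps playback rate is an assumption for illustration and does not appear on the slide.

```python
def sequence_seconds(frames: int, ms_per_frame: float) -> float:
    """Total synthesis time in seconds if every frame costs `ms_per_frame`."""
    return frames * ms_per_frame / 1000


def is_real_time(ms_per_frame: float, playback_fps: float) -> bool:
    """True if one frame is generated within one playback frame interval."""
    return ms_per_frame <= 1000 / playback_fps


ASSUMED_FPS = 25  # assumption: playback frame rate, not given on the slide

if __name__ == "__main__":
    print("GPU, 75 frames:", sequence_seconds(75, 7), "s")
    print("GPU real time at", ASSUMED_FPS, "fps:", is_real_time(7, ASSUMED_FPS))
```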
Experiments (2) • Datasets • GRID • TCD TIMIT • Increased training data by mirroring • Metrics • Generated video quality: PSNR & SSIM • Frame sharpness: FDBM & CPBD • Content: ACD • Accuracy of spoken message: WER
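Two of the listed metrics are simple enough to sketch in plain Python: PSNR (peak signal-to-noise ratio between a reference and a generated frame) and WER (word error rate between a reference sentence and a transcription). These are the textbook definitions, not the evaluation code behind the slides. Frames are represented here as flat lists of 8-bit pixel values, and the sentences are made-up examples; in an actual evaluation the hypothesis would come from a lip-reading model applied to the generated video.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equal-length flat sequences of pixel values."""
    if not reference or len(reference) != len(generated):
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(psnr([10, 20, 30, 40], [12, 18, 30, 44]))
    print(wer("bin blue at f two now", "bin green at f two"))
```

Higher PSNR means the generated frame is closer to the reference, while lower WER means the spoken message is easier to recover.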
Qualitative Results • Produces realistic videos • Also works on previously unseen faces • Characteristic human expressions • Frowns • Blinks
Qualitative Results (2) • GAN-based method • L1 loss and adversarial loss • Baseline for quantitative assessment • Failures of static baseline • Opening mouth when silent • Neglecting previous face
Quantitative Results • Performance measure • GRID & TCD datasets • Compare to static baseline • 30-person survey • Turing test • 10 videos • 153 responses • Avg. 63% correct
Future work • Different architectures • More natural sequences • Expressions are generated randomly • Natural extension • Capture mood • Reflect mood in facial expressions