Towards a VQA Suite: Architecture Tweaks, Learning Rate Schedules, and Ensembling. Tina Jiang*, Vivek Natarajan*, Xinlei Chen*, Marcus Rohrbach, Dhruv Batra, Devi Parikh. June 18th, 2018.
VQA architecture • Visual Feature Extraction • Question Encoding • Multimodal Fusion • Classifier • Example: Q: What color are the cat's eyes? A: Green
VQA Baseline Architecture: CNN + LSTM • Question: word embedding, LSTM, FC • Image: CNN, FC • Fusion: element-wise product • Classifier: FC, softmax • Cross-entropy loss • Agrawal et al. 2016
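The fusion and classification steps of this baseline can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `fuse` and `classify`, the toy vector sizes, and the weight matrix are all assumptions made for the example, and the question and image vectors are assumed to have already been projected to the same size by their FC layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(question_vec, image_vec):
    """Element-wise product of two equally sized feature vectors."""
    if len(question_vec) != len(image_vec):
        raise ValueError("question and image features must have the same size")
    return [q * v for q, v in zip(question_vec, image_vec)]


def classify(fused, class_weights):
    """Linear layer (one weight row per answer, no bias) followed by softmax."""
    logits = [sum(w * f for w, f in zip(row, fused)) for row in class_weights]
    return softmax(logits)


# Toy example: 2-d features and 2 candidate answers (hypothetical numbers).
fused = fuse([1.0, 2.0], [3.0, 4.0])
probs = classify(fused, [[1.0, 0.0], [0.0, 1.0]])
```

The element-wise product requires both modalities to share one dimensionality, which is why each branch ends in an FC layer before fusion.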
2016 VQA winner: Multimodal Compact Bilinear Pooling (MCB) • Question: word embedding, LSTM • Image: CNN • Attention: MCB, conv, ReLU, conv, softmax • Answer prediction: MCB, FC, softmax • KL-divergence loss • Fukui et al. 2016
2017 VQA winner: Bottom-up and Top-down Attention • Visual Feature Extraction: image, Faster-RCNN • Question Encoding: word embedding, GRU • Multimodal Fusion: concat, gated tanh, FC, softmax (attention); gated tanh on each modality, element-wise product • Classifier: gated tanh, FC branches, summed (+) • Teney et al. 2017
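The gated tanh unit used throughout this model computes a tanh candidate and a sigmoid gate from the same input and multiplies them element-wise: y = tanh(Wx + b) * sigmoid(W'x + b'). A minimal sketch in plain Python follows; the helper names and the list-of-lists weight layout are assumptions chosen for readability, not the original code.

```python
import math


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_tanh(x, w, b, w_gate, b_gate):
    """Gated tanh: tanh(Wx + b) scaled element-wise by sigmoid(W'x + b')."""
    candidate = [math.tanh(v) for v in linear(w, b, x)]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in linear(w_gate, b_gate, x)]
    return [c * g for c, g in zip(candidate, gate)]


# Toy call with a 1-d input and a gate that is always half open.
out = gated_tanh([1.0], [[1.0]], [0.0], [[0.0]], [0.0])
```

The gate lets the layer suppress individual output dimensions, which a plain tanh or ReLU layer cannot do on a per-input basis.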
Multi-modal Factorized Bilinear Pooling with Co-Attention • Visual Feature Extraction: image, CNN • Question Encoding: LSTM; question attention (conv, ReLU, conv, softmax) • Multimodal Fusion: MFB/MFH, image attention (conv, ReLU, conv, softmax), concat, MFB/MFH • Classifier: FC, softmax • KL-divergence loss • Yu et al. 2017
VQA-suite: Architecture Adaptation • Visual Feature Extraction: image, Faster-RCNN • Question Encoding: LSTM; question attention (conv, ReLU, conv, softmax) • Multimodal Fusion: ReLU+norm, FC, softmax (attention); ReLU+norm on each modality, element-wise product • Classifier: ReLU+norm, FC branches, summed (+) • https://github.com/hengyuan-hu/bottom-up-attention-vqa
Architecture Adaptation • Accuracy: increased 1.6%
Techniques to Improve Performance • Adjust learning schedule • Finetuning image features • Data augmentation • Diversified model ensemble
Learning Schedule • Warm-up • [Plot labels: performance, learning rate, batch size, iters] • Batch size: 512 • Learning rate: 0.002, 0.003 • NaN • Goyal et al. 2017
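A gradual warm-up in the spirit of Goyal et al. 2017 ramps the learning rate linearly from a small value up to the target rate over the first iterations, then holds it. The sketch below is illustrative only: `warmup_iters` and `start_factor` are assumed values, since the slide gives only the batch size and the target learning rates.

```python
def warmup_lr(iteration, base_lr, warmup_iters, start_factor=0.1):
    """Linear warm-up: start at start_factor * base_lr, reach base_lr at warmup_iters."""
    if warmup_iters <= 0 or iteration >= warmup_iters:
        return base_lr
    progress = iteration / warmup_iters
    return base_lr * (start_factor + (1.0 - start_factor) * progress)


# Hypothetical schedule: target rate 0.002 reached after 1000 iterations.
schedule = [warmup_lr(i, base_lr=0.002, warmup_iters=1000) for i in (0, 500, 1000)]
```

Starting low avoids the large early updates that can make training diverge at higher learning rates with large batches.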
Techniques to Improve Performance • Adjust learning schedule • Accuracy: increased 0.9% • Finetuning image features • Data augmentation • Diversified model ensemble
Fine-tuning Image Features • Faster-RCNN: ROI projection (7x7x1024), res5 (7x7x2048), average pooling, 2048-d feature; outputs: classes, box, attributes • Faster-RCNN with FPN: ROI projection (7x7x512), FC-6 (FC, ReLU, 2048), FC-7 (FC, ReLU, 2048); outputs: classes, box, attributes
Techniques to Improve Performance • Adjust learning schedule • Accuracy: 66.91% --> 67.83% • Finetuning image features • Accuracy: increased 0.4% • Data augmentation • Diversified model ensemble
Data Augmentation: Visual Genome • 108,249 images from the intersection of MS-COCO and YFCC • Remove questions whose answers are not in the answer space • ~682k questions • Repeat each answer 10 times • Example: Q: What color is the clock? A: Green • Krishna et al. 2016
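The filtering and answer-repetition steps can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function name, the lowercase normalization, and the output dictionary layout are assumptions. Repeating the single answer ten times presumably matches the ten-annotation format of VQA questions.

```python
def to_vqa_format(qa_pairs, answer_space, repeats=10):
    """Keep pairs whose answer is in the answer space; repeat that answer `repeats` times."""
    converted = []
    for question, answer in qa_pairs:
        normalized = answer.strip().lower()  # assumed normalization
        if normalized not in answer_space:
            continue  # answer cannot be predicted by the classifier, so drop it
        converted.append({"question": question, "answers": [normalized] * repeats})
    return converted


# Toy data: the second pair is dropped because its answer is out of vocabulary.
pairs = [("What color is the clock?", "Green"), ("Who is this?", "my uncle bob")]
augmented = to_vqa_format(pairs, answer_space={"green", "red"})
```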
Data Augmentation: Visual Dialog • Use COCO images • Change 10 turns of dialog to 10 questions • Repeat each answer 10 times • ~423k questions Das et al. 2017
Data Augmentation: Mirrored Image • Interchange the tokens “left” and “right” in questions and answers • Example: Q: What direction is the plane pointed? A: left (original image), A: right (mirrored image)
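The text side of this augmentation amounts to a token swap applied to both the question and the answer of a mirrored image. A minimal sketch, assuming whole-word matching and case preservation (both choices are assumptions, as is the function name):

```python
import re

_SWAP = {"left": "right", "right": "left"}
_PATTERN = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def mirror_text(text):
    """Swap whole-word 'left' and 'right' in one pass, preserving capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return _PATTERN.sub(replace, text)


question = mirror_text("What is to the left of the plane?")
answer = mirror_text("left")
```

Doing the swap in a single regex pass avoids the bug where replacing "left" with "right" first and then "right" with "left" turns every token into "left". Note that this simple rule also flips "right" when it means "correct", so a real pipeline may need extra filtering.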
Techniques to Improve Performance • Adjust learning schedule • Accuracy: 66.91% --> 67.83% • Finetuning image features • Accuracy: 67.83% --> 68.31% • Data augmentation • Faster-RCNN: 67.83% --> 68.52% • Finetune: 68.31% --> 68.86% • Diversified model ensemble
Model Ensemble • Strategy 1: best models with different seeds (same models) • Strategy 2: diversified models: different training datasets, different image features • [Plot: performance vs. number of models; 72.23 marked]
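One common way to combine such models is to average their per-answer probabilities and pick the highest-scoring answer. The slides do not state which combination rule was used, so the averaging below is an assumption, as are the function names and toy numbers.

```python
def ensemble_average(per_model_probs):
    """Average answer probabilities across models (one probability list per model)."""
    if not per_model_probs:
        raise ValueError("need at least one model")
    n_answers = len(per_model_probs[0])
    if any(len(p) != n_answers for p in per_model_probs):
        raise ValueError("all models must score the same answer space")
    n_models = len(per_model_probs)
    return [sum(p[i] for p in per_model_probs) / n_models for i in range(n_answers)]


def predict(per_model_probs, answers):
    """Return the answer with the highest averaged probability."""
    averaged = ensemble_average(per_model_probs)
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return answers[best]


# Three hypothetical models; the first disagrees with the other two.
choice = predict([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]], ["green", "blue"])
```

Averaging helps most when the models make different mistakes, which is the motivation for diversifying training data and image features rather than only changing seeds.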
Performance Improvement • VQA Challenge: • test-dev : 72.12 • test-standard : 72.25 • test-challenge: 72.41
Summary • Architecture adaptation, learning schedule adjustment, image feature fine-tuning, and data augmentation improved single-model performance • Diversified models significantly improve ensemble performance • VQA-suite enables all of these functionalities • We will open-source our codebase
Acknowledgments • Poster here • Vivek Natarajan, Dhruv Batra, Devi Parikh, Xinlei Chen, Marcus Rohrbach, Peter Anderson, Abhishek Das, Stefan Lee, Jiasen Lu, Jianwei Yang, Deshraj Yadav