This paper explores how captions can be used to enhance Visual Question Answering (VQA) systems. It proposes a joint VQA and captioning model that generates question-relevant captions to improve VQA performance. Experimental results show that the generated relevant captions significantly improve VQA compared to question-agnostic captions.
Jointly Generating Captions to Aid Visual Question Answering • Raymond Mooney, Department of Computer Science, University of Texas at Austin • with Jialin Wu
VQA • Image credits to the VQA website
VQA Architectures • Most systems are DNNs using both CNNs and RNNs.
VQA with BUTD • We use a recent state-of-the-art VQA system, BUTD (Bottom-Up Top-Down; Anderson et al., 2018). • BUTD first detects a wide range of objects and attributes with detectors trained on Visual Genome data, and attends to them when computing an answer (see the sketch below).
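A minimal sketch of this kind of top-down attention over detected region features, assuming precomputed detector features; all layer sizes and names are illustrative, not taken from the BUTD code.

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Minimal BUTD-style attention: score each detected region against the
    question encoding, then pool the region features with those scores."""
    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)
        self.proj_q = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim) from the object detector
        # question: (batch, question_dim) from a question encoder (e.g., a GRU)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        alpha = torch.softmax(self.score(joint), dim=1)   # attention weights over regions
        return (alpha * regions).sum(dim=1)               # attended image feature
```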
Using Visual Segmentations • We use recent methods for using detailed image segmentations for VQA (VQS; Gan et al., 2017). • This provides more precise visual information than BUTD's bounding boxes.
How can captions help VQA? • Captions + detections as inputs • Captions can provide useful information for the VQA model
Multitask VQA and Image Captioning • There are lots of datasets with image captions. • COCO data used in VQA comes with captions • Captioning and VQA both need knowledge of image content and language. • Should benefit from multitask learning (Caruana, 1997).
Question relevant captions • For a particular question, some of the captions are relevant and some are not.
How to generate question-relevant captions • Input feature side • We need to bias the input features to encode the information necessary for the question. • We use the VQA joint representation for simplicity. • Supervision side • We need relevant captions as training targets so the model learns to generate them.
How to obtain relevant training captions • Directly collecting captions for each question? • Over 1.1 million questions in the dataset (not scalable). • The caption has to be in line with the VQA reasoning process. • Choosing the most relevant caption from an existing dataset? • How to measure relevance? • What if there is no relevant caption for an image-question pair?
Quantifying the relevance • Intuition • Generating relevant captions should share an optimization goal with answering the visual question. • The two objectives should share some descent directions. • Relevance is measured as the inner product of the gradients from the caption-generation loss and the VQA answer-prediction loss. • A positive inner product means the two objective functions share descent directions during optimization, and therefore indicates that the corresponding caption helps the VQA process.
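In symbols (notation ours, with θ the shared model parameters), the relevance of a caption c to a question-answer pair (q, a) can be written as:

```latex
% Relevance as gradient agreement between the two training objectives
\mathrm{rel}(c) \;=\; \Big\langle \nabla_{\theta}\,\mathcal{L}_{\mathrm{cap}}(c),\;
                        \nabla_{\theta}\,\mathcal{L}_{\mathrm{vqa}}(q,a) \Big\rangle ,
\qquad \mathrm{rel}(c) > 0 \;\Rightarrow\; \text{the two losses share a descent direction.}
```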
Quantifying the relevance • Selecting the most relevant human caption
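A minimal sketch of how that selection could be implemented in PyTorch; `vqa_loss`, `caption_loss`, and `shared_parameters` are hypothetical helpers, not names from the paper's code.

```python
import torch

def most_relevant_caption(model, image, question, answer, captions):
    """Pick the human caption whose captioning gradient best aligns with the
    VQA gradient (inner product of gradients w.r.t. the shared parameters)."""
    shared = list(model.shared_parameters())          # hypothetical accessor
    g_vqa = torch.autograd.grad(vqa_loss(model, image, question, answer), shared)
    best, best_score = None, float("-inf")
    for c in captions:
        g_cap = torch.autograd.grad(caption_loss(model, image, c), shared)
        # gradient inner product accumulated over all shared parameters
        score = sum((gv * gc).sum() for gv, gc in zip(g_vqa, g_cap))
        if score > best_score:
            best, best_score = c, score.item()
    return best
```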
How to use the captions • A Word GRU to identify important words for the question and image • A Caption GRU to encode the sequential information from the attended words (see the sketch below).
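A sketch of how such a two-GRU caption encoder might look, assuming a fused question/image context vector; all names and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Sketch: a Word GRU scores each caption word against the question/image
    context, and a Caption GRU encodes the attention-weighted word sequence."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, ctx_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.word_score = nn.Linear(hidden_dim + ctx_dim, 1)
        self.cap_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, caption_tokens, context):
        # caption_tokens: (batch, T) word ids; context: (batch, ctx_dim) question/image feature
        emb = self.embed(caption_tokens)                       # (batch, T, embed_dim)
        word_states, _ = self.word_gru(emb)                    # (batch, T, hidden_dim)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        gate = torch.sigmoid(self.word_score(torch.cat([word_states, ctx], dim=-1)))
        _, cap_state = self.cap_gru(gate * emb)                # encode the attended words
        return cap_state.squeeze(0)                            # (batch, hidden_dim) caption feature
```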
VQA 2.0 Data • Training • 443,757 questions • 82,783 images • Validation • 214,354 questions • 40,504 images • Test • 447,793 questions • 81,434 images • All images come with 5 human-generated captions
Experimental Results • Compare with the state-of-the-art
Experimental Results • Comparing different types of captions • Generated relevant captions help VQA more than the question-agnostic captions from BUTD.
Improving Image Captioning Using an Image-Conditioned Auto-Encoder
Aiding Training by Using an Easier Task • Use an easier task that first encodes the human captions and the image, and then generates the captions back. • Example captions (used as both encoder input and decoder target, ENC → DEC): C1: several doughnuts are in a cardboard box. C2: a box holds four pairs of mini doughnuts. C3: a variety of doughnuts sit in a box. C4: several different donuts are placed in the box. C5: a fresh box of twelve assorted glazed pastries.
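A minimal sketch of such an image-conditioned caption auto-encoder, assuming a single pooled image feature; names and sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CaptionAutoEncoder(nn.Module):
    """Sketch of the easier task: a GRU encoder reads a human caption, its state
    is fused with the image feature, and a GRU decoder reproduces the caption.
    The decoder hidden states can later serve as oracle targets for the captioner."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(hidden_dim + img_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, caption, image_feat):
        emb = self.embed(caption)                              # (batch, T, embed_dim)
        _, enc_state = self.encoder(emb)                       # (1, batch, hidden_dim)
        init = torch.tanh(self.fuse(torch.cat([enc_state.squeeze(0), image_feat], -1)))
        dec_states, _ = self.decoder(emb[:, :-1], init.unsqueeze(0))
        logits = self.out(dec_states)          # predict caption[:, 1:] back (teacher forcing)
        return logits, dec_states              # dec_states = candidate oracle hidden states
```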
Training for Image Captioning • Maximum likelihood principle • REINFORCE algorithms
Hidden State Supervision • Both of these training approaches provide supervision only on the output word probabilities; therefore, the hidden states receive no direct supervision. • Supervising the hidden states requires oracle hidden states that contain richer information. • An easier task that first encodes the human captions and the image, and then generates the caption back, can provide such oracle states. • Hidden-state loss at each time step t (a sketched form follows).
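The slide's equation is not reproduced above; one plausible form, assuming the captioner's hidden state h_t is pulled toward the oracle auto-encoder's hidden state ĥ_t with a squared L2 penalty:

```latex
% Hidden-state loss at time step t (sketched form; \hat{h}_t is the oracle
% auto-encoder state, h_t the captioner's state)
\mathcal{L}_{h}^{\,t} \;=\; \big\lVert\, h_{t} - \hat{h}_{t} \,\big\rVert_{2}^{2}
```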
Training with Maximum Likelihood • Jointly optimizes the log-likelihood and the hidden-state loss at each time step t (a sketched objective follows).
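Written out under the same sketched notation, where λ weights the hidden-state term, w*_t are the ground-truth caption words, and I is the image:

```latex
% Joint objective: word cross-entropy plus hidden-state supervision
\mathcal{L}_{\mathrm{MLE}} \;=\;
  -\sum_{t}\log p_{\theta}\!\big(w_{t}^{*}\mid w_{1:t-1}^{*},\, I\big)
  \;+\; \lambda \sum_{t}\mathcal{L}_{h}^{\,t}
```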
Training with REINFORCE • Objective • Gradient (both sketched below) • Problem • Every sampled word receives the same reward, no matter how appropriate it is.
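For reference, the standard REINFORCE objective and its single-sample gradient estimate, where w^s is a sampled caption, r(·) a sentence-level reward such as CIDEr, and b a baseline (e.g., the greedy-decoding reward in self-critical training):

```latex
% REINFORCE objective and gradient estimate with a baseline
\mathcal{L}_{\mathrm{RL}}(\theta) \;=\; -\,\mathbb{E}_{w^{s}\sim p_{\theta}}\big[r(w^{s})\big],
\qquad
\nabla_{\theta}\mathcal{L}_{\mathrm{RL}} \;\approx\; -\big(r(w^{s}) - b\big)\,\nabla_{\theta}\log p_{\theta}(w^{s})
```

The factor (r(w^s) − b) multiplies the log-probability of every word in the sampled caption, which is exactly the per-word uniformity problem noted above.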
Hidden State Loss as a Reward Bias • Motivation • A word should receive more reward when its hidden state matches that of a high-performing oracle encoder. • Reward bias (a sketched form follows).
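One illustrative way to realize this, not necessarily the paper's exact formulation: subtract a scaled hidden-state loss from the sentence-level advantage at each step, so words whose states match the oracle keep more of the reward:

```latex
% Per-word reward with a hidden-state bias (illustrative form; \beta is a scale)
r_{t} \;=\; \big(r(w^{s}) - b\big) \;-\; \beta\,\mathcal{L}_{h}^{\,t}
```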
Experimental Data • COCO (Chen et al., 2015) • Each image with 5 human captions • “Karpathy split” • 110,000 training images • 5,000 validation images • 5,000 test images
Baseline Systems • FC (Rennie et al., 2017) • With and without “self critical sequence training” • Up-Down (aka BUTD) (Anderson et al., 2018) • With and without “self critical sequence training”
Evaluation Metrics • BLEU-4 (B-4) • METEOR (M) • ROUGE-L (R-L) • CIDEr (C) • SPICE (S)
Experimental Results for REINFORCE • Training with different reward metrics
Conclusions • Jointly generating “question relevant” captions can improve Visual Question Answering. • First training an image-conditioned caption auto-encoder can help supervise a captioner to create better hidden state representations that improve final captioning performance.