This research paper introduces a method for generating question-relevant captions to aid visual question answering (VQA). The method leverages textual captions to adjust visual attention, and experimental results demonstrate that question-relevant captions improve VQA accuracy.
Generating Question Relevant Captions to Aid Visual Question Answering
Jialin Wu, Zeyuan Hu, Raymond J. Mooney
Department of Computer Science, The University of Texas at Austin
Visual Question Answering
• A complicated multimodal task involving high-level knowledge of objects, attributes, relations, etc.
• Example: "Does this boy have a full wetsuit on?" → VQA System → "Yes"
• Necessary knowledge:
  • Object level: [person, wetsuit]
  • Attribute level: [wetsuit is full, person is young]
  • Relational level: [wetsuit on person]
More Sophisticated Visual Features
• CNN image features: one feature vector per image
• Detecting common objects: N object vectors per image
• Mining attributes on objects: N object vectors per image
• Detecting relations: potentially N(N-1) relation vectors per image
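To make these feature granularities concrete, here is a minimal sketch of their shapes in PyTorch; the object count N and dimensionality d are illustrative assumptions, not values from the paper.

```python
import torch

N, d = 36, 2048                       # assumed: 36 detected objects, 2048-d vectors

cnn_feat = torch.randn(d)             # one CNN feature vector per image
object_feats = torch.randn(N, d)      # one vector per detected object
attribute_feats = torch.randn(N, d)   # one attribute-augmented vector per object

# Relations: one vector per ordered object pair, N*(N-1) in total.
pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
relation_feats = torch.randn(len(pairs), d)

print(cnn_feat.shape, object_feats.shape, relation_feats.shape)
```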
A Typical Current Method (Anderson et al., 2018)
• Use detected features as "bottom-up attention"
• Learn a "top-down attention" over these high-level features
• Pipeline: the question ("Does this boy have a full wetsuit on?") is word-embedded and encoded by a GRU; CNN image features are weighted by visual attention; an answer is predicted ("Yes")
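The sketch below illustrates the top-down attention idea: a question encoding scores each bottom-up object feature, and a softmax-weighted sum gives the attended visual feature. The concatenation-based scoring and all layer sizes are assumptions for illustration, not the exact architecture of Anderson et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Question-guided ("top-down") attention over detected-object
    ("bottom-up") features. Dimensions are illustrative."""
    def __init__(self, v_dim=2048, q_dim=512, hid=512):
        super().__init__()
        self.proj = nn.Linear(v_dim + q_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, v, q):
        # v: (N, v_dim) object features; q: (q_dim,) GRU question encoding
        q_tiled = q.unsqueeze(0).expand(v.size(0), -1)
        logits = self.score(torch.tanh(self.proj(torch.cat([v, q_tiled], 1))))
        alpha = F.softmax(logits, dim=0)         # one weight per object
        return (alpha * v).sum(0), alpha         # attended visual feature

v, q = torch.randn(36, 2048), torch.randn(512)
attended, alpha = TopDownAttention()(v, q)
```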
Leveraging Textual Captions for VQA
• Less structured than visual features. Example: "A young man is on his surf board with someone in the background"
  • Objects: [man, surfboard, background]
  • Attributes: [man is young]
  • Relationships: [surfer on surfboard, man with someone, someone in background]
• More diverse visual concepts: man, young surfer, young man, boy
• Sample captions for one image:
  • "A man on a blue surfboard on top of some rough water."
  • "A young surfer in a wetsuit surfs a small wave."
  • "A young man rides a surf board on a small wave while a man swims in the background."
  • "A young man is on his surf board with someone in the background."
  • "A boy riding waves on his surf board in the ocean."
Generating Relevant Captions
• For a visual question, some captions are relevant and some are not.
• Relevant captions are more likely to encode the knowledge necessary for VQA.
Collecting Relevant Training Captions
• Intuition: generated captions should focus on the objects that are influential for the question. We use a gradient-based method to model this influence.
• We collect the captions that maximize the inner product of the objects' influence scores between the VQA and captioning objectives (see the sketch below).
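A minimal sketch of this selection step, assuming influence is measured as the gradient norm of each objective with respect to the object features (a simplification of the paper's gradient-based influence); the losses here are toy stand-ins:

```python
import torch

N, d = 36, 64
object_feats = torch.randn(N, d, requires_grad=True)

def influence(loss):
    """One influence score per object: the gradient norm of the
    objective w.r.t. that object's features."""
    grads, = torch.autograd.grad(loss, object_feats, retain_graph=True)
    return grads.norm(dim=1)                     # (N,)

# Toy stand-ins for the real VQA loss and five per-caption losses.
vqa_loss = (object_feats * torch.randn(N, d)).sum()
caption_losses = [(object_feats * torch.randn(N, d)).sum() for _ in range(5)]

vqa_inf = influence(vqa_loss)
cap_infs = [influence(l) for l in caption_losses]

# Keep the caption whose influence scores have maximum inner product
# with the VQA objective's influence scores.
best = max(range(len(cap_infs)),
           key=lambda i: torch.dot(vqa_inf, cap_infs[i]).item())
print("selected caption index:", best)
```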
How to Use the Captions
• Caption embedding:
  • A Word GRU combined with an attention mechanism identifies the words that are important for the question and image.
  • A Caption GRU encodes sequential information from the attended words.
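A minimal sketch of this two-stage encoder, assuming per-word sigmoid gates conditioned on a fused question-and-image context vector; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CaptionEmbedding(nn.Module):
    """Word GRU -> word attention -> Caption GRU. Sizes are illustrative."""
    def __init__(self, emb=300, hid=512, ctx=512):
        super().__init__()
        self.word_gru = nn.GRU(emb, hid, batch_first=True)
        self.attn = nn.Linear(hid + ctx, 1)
        self.cap_gru = nn.GRU(hid, hid, batch_first=True)

    def forward(self, word_embs, context):
        # word_embs: (1, T, emb); context: (ctx,) fused question+image feature
        h, _ = self.word_gru(word_embs)                        # (1, T, hid)
        ctx = context.expand(h.size(1), -1).unsqueeze(0)       # (1, T, ctx)
        a = torch.sigmoid(self.attn(torch.cat([h, ctx], -1)))  # per-word gates
        _, cap = self.cap_gru(a * h)                           # encode attended words
        return cap.squeeze(0).squeeze(0)                       # (hid,)

word_embs = torch.randn(1, 12, 300)        # a 12-word caption
cap_feat = CaptionEmbedding()(word_embs, torch.randn(512))
```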
How to Use the Captions
• Captions are used to adjust visual attention.
• As with visual attention, caption features are used to compute an attention adjustment that is added to the original visual attention.
Model Overview
• The model combines visual attention with a caption attention adjustment.
• ⊗ and ⊕ denote element-wise multiplication and addition; blue arrows denote fc layers and yellow arrows denote attention embedding.
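Following the model overview, here is a minimal sketch of the caption-adjusted attention: the caption feature produces an additive adjustment to the visual attention logits before the softmax. The concatenation-based scoring and all sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjustedAttention(nn.Module):
    """Visual attention plus a caption-based additive adjustment."""
    def __init__(self, v_dim=2048, q_dim=512, c_dim=512):
        super().__init__()
        self.base = nn.Linear(v_dim + q_dim, 1)    # original visual attention
        self.adjust = nn.Linear(v_dim + c_dim, 1)  # caption-based adjustment

    def forward(self, v, q, c):
        # v: (N, v_dim) objects; q: (q_dim,) question; c: (c_dim,) caption
        qv = torch.cat([v, q.expand(v.size(0), -1)], 1)
        cv = torch.cat([v, c.expand(v.size(0), -1)], 1)
        logits = self.base(qv) + self.adjust(cv)   # the ⊕ in the figure
        alpha = F.softmax(logits, dim=0)
        return (alpha * v).sum(0)                  # attended visual feature

v = torch.randn(36, 2048)
attended = AdjustedAttention()(v, torch.randn(512), torch.randn(512))
```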
Training
• Phase 1: train the VQA module with human captions.
  • Use all 5 captions per image from the COCO dataset.
  • Use the standard binary cross-entropy loss with soft scores as the VQA loss (sketched below).
• Phase 2: jointly train the VQA and captioning modules with the selected relevant human captions.
  • Use maximum likelihood estimation to train the captioning module.
  • Jointly minimize the VQA loss and the captioning loss.
• Phase 3: fine-tune the VQA module using generated captions.
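A minimal sketch of the two objectives referenced above: binary cross entropy against soft answer scores for VQA, and a maximum-likelihood (cross-entropy) loss for captioning. The answer vocabulary size, caption vocabulary, and equal loss weighting are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

num_answers, vocab = 3129, 10000              # assumed vocabulary sizes
logits = torch.randn(2, num_answers)          # VQA answer logits for a batch of 2
soft_scores = torch.zeros(2, num_answers)     # soft scores from annotator agreement
soft_scores[0, 42] = 1.0
soft_scores[1, 7] = 0.6

# Phases 1 and 3: binary cross entropy with soft scores as the VQA loss.
vqa_loss = F.binary_cross_entropy_with_logits(logits, soft_scores)

# Phase 2 adds a maximum-likelihood captioning loss and minimizes the sum.
cap_logits = torch.randn(2, 12, vocab)        # (batch, caption length, vocab)
cap_tokens = torch.randint(vocab, (2, 12))
cap_loss = F.cross_entropy(cap_logits.reshape(-1, vocab), cap_tokens.reshape(-1))

joint_loss = vqa_loss + cap_loss              # phase-2 joint objective
```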
Overall Results
• VQA performance on the VQA v2 test-standard split.
Comparison Between Different Captions
• Human captions help VQA the most.
• Generated question-relevant captions help VQA more than question-agnostic captions.
Effects of Adjusting Visual Attention
• Better VQA performance
• More human-like attention
Conclusions
• General captions, even generated ones, can be used to improve VQA.
• Question-relevant captions improve VQA even more.