
Generating Question Relevant Captions to Aid Visual Question Answering

This research paper introduces a method for generating question-relevant captions to enhance visual question answering systems. The method leverages textual captions to adjust visual attention in order to improve VQA performance. Experimental results demonstrate the effectiveness of question-relevant captions in improving VQA accuracy.



Presentation Transcript


  1. Generating Question Relevant Captions to Aid Visual Question Answering • Jialin Wu, Zeyuan Hu, Raymond J. Mooney • Department of Computer Science, The University of Texas at Austin

  2. Visual Question Answering • A complicated multimodal task involving high-level knowledge of objects, attributes, relations, etc. • Example: given the question "Does this boy have a full wetsuit on?", the VQA system answers "Yes". • Necessary knowledge: Object level: [person, wetsuit]; Attribute level: [wetsuit is full, person is young]; Relational level: [wetsuit on person]

  3. More Sophisticated Visual Features • CNN image features: one feature vector per image • Detecting common objects: N object vectors per image • Mining attributes on objects: N object vectors per image • Detecting relations: potentially N(N-1) relation vectors per image

  4. A Typical Current Method (Anderson et al., 2018) • Use detected features as "bottom-up attention" • Learn a "top-down attention" over these high-level features • [Diagram: the question "Does this boy have a full wetsuit on?" passes through a word embedding and a GRU; the image passes through a CNN; visual attention over the image features feeds answer prediction ("Yes").]
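
The top-down attention step on this slide can be illustrated with a minimal sketch. It is not the Anderson et al. implementation: the dot-product scoring, the function names (`softmax`, `top_down_attention`), and the toy vectors are all assumptions made for illustration, whereas the real model scores each object with a learned network.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_down_attention(question_vec, object_feats):
    """Weight each detected object by its relevance to the question.

    Assumption: relevance is a plain dot product between the question
    vector and the object feature; a trained model would learn this score.
    """
    scores = [sum(q * f for q, f in zip(question_vec, feat))
              for feat in object_feats]
    weights = softmax(scores)
    dim = len(object_feats[0])
    # Attended image feature: attention-weighted sum of the object vectors.
    attended = [sum(w * feat[d] for w, feat in zip(weights, object_feats))
                for d in range(dim)]
    return weights, attended

# Toy data (hypothetical): one question vector, N = 3 object vectors.
question_vec = [1.0, 0.0]
object_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, attended = top_down_attention(question_vec, object_feats)
```

In this toy case the first object aligns best with the question vector, so it receives the largest attention weight.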

  5. Leveraging Textual Captions for VQA • Less structured than visual features • Example caption: "A young man is on his surf board with someone in the background" • Objects: [man, surfboard, background] • Attributes: [man is young] • Relationships: [surfer on surfboard, man with someone, someone in background] • More diverse visual concepts • Man, young surfer, young man, boy • Example captions for the image: "A man on a blue surfboard on top of some rough water." "A young surfer in a wetsuit surfs a small wave." "A young man rides a surf board on a small wave while a man swims in the background." "A young man is on his surf board with someone in the background." "A boy riding waves on his surf board in the ocean."

  6. Generating Relevant Captions • For a visual question, some captions are relevant and some are not. • Relevant captions are more likely to encode necessary knowledge for VQA.

  7. Collecting Relevant Training Captions • Intuition: Generated captions should focus more on the objects that are influential for the question. We use a gradient-based method to model the influence. • We collect the captions with the maximum inner product of the objects' influence scores between the VQA and captioning objectives.
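
The selection rule on this slide can be sketched as follows. The per-object influence scores are assumed to be precomputed (the slide says they come from a gradient-based method but does not give the formula), and the function names and numbers are hypothetical.

```python
def inner_product(a, b):
    """Inner product of two equal-length score vectors."""
    return sum(x * y for x, y in zip(a, b))

def select_relevant_caption(vqa_influence, caption_influences):
    """Pick the caption whose object influences best match the VQA ones.

    vqa_influence: one score per detected object for the VQA objective.
    caption_influences: one such score vector per candidate caption.
    Returns (index of the best caption, all inner-product scores).
    """
    scores = [inner_product(vqa_influence, c) for c in caption_influences]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical influence scores over N = 3 objects.
vqa_influence = [0.7, 0.2, 0.1]
caption_influences = [
    [0.1, 0.8, 0.1],  # caption 0 focuses on object 1
    [0.6, 0.3, 0.1],  # caption 1 focuses on object 0, like the question
    [0.2, 0.2, 0.6],  # caption 2 focuses on object 2
]
best, scores = select_relevant_caption(vqa_influence, caption_influences)
```

Here caption 1 is selected because it attends to the same object that most influences the answer.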

  8. How to Use the Captions • Caption Embedding • A Word GRU combined with an attention mechanism to identify important words for the question and images. • A Caption GRU to encode sequential information from the attended words.

  9. How to Use the Captions • Captions used to adjust visual attention. • Similar to visual attention, use caption features to compute an attention adjustment to add to the original visual attention.
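
A minimal sketch of the adjustment described on this slide, under stated assumptions: the caption-derived adjustment is added to the unnormalized visual attention scores and the sum is renormalized with a softmax. The slide only says the adjustment is added to the original visual attention, so the placement of the normalization, the function names, and the numbers are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adjust_attention(visual_scores, caption_adjustment):
    """Add a per-object caption adjustment to the visual attention scores.

    Assumption: both inputs are unnormalized scores, one per object, and
    the adjusted scores are renormalized afterwards.
    """
    adjusted = [v + c for v, c in zip(visual_scores, caption_adjustment)]
    return softmax(adjusted)

# Hypothetical scores: objects 0 and 1 tie on visual evidence alone,
# and the caption mentions object 1.
visual_scores = [1.0, 1.0, 0.0]
caption_adjustment = [0.0, 1.5, 0.0]
before = softmax(visual_scores)
after = adjust_attention(visual_scores, caption_adjustment)
```

The caption breaks the tie: after adjustment, object 1 receives more attention than object 0.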

  10. Model Overview • [Diagram: visual attention combined with a caption attention adjustment.] • ⊗ and ⊕ denote element-wise multiplication and addition. Blue arrows denote fc layers and yellow arrows denote attention embedding.

  11. Training • Phase 1: Train the VQA module with human captions. • Use all 5 captions from the COCO dataset • Use the standard binary cross-entropy loss with soft scores as the VQA loss • Phase 2: Jointly train the VQA and caption modules with selected relevant human captions. • Use maximum likelihood estimation for training the captioning module • Jointly minimize the VQA loss and captioning loss • Phase 3: Fine-tune the VQA module using generated captions.
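
The losses named on this slide can be sketched as below. Several details are assumptions because the slide does not specify them: the soft score is taken to be the commonly used min(count / 3, 1) over annotator agreement, the two losses are combined as a weighted sum with a hypothetical `caption_weight`, and all probabilities are toy values.

```python
import math

def soft_score(num_annotators_agreeing):
    """Soft target for one candidate answer (assumed min(count / 3, 1))."""
    return min(num_annotators_agreeing / 3.0, 1.0)

def bce_with_soft_targets(probs, targets, eps=1e-12):
    """Binary cross-entropy summed over candidate answers."""
    loss = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss

def caption_nll(token_probs):
    """Maximum-likelihood captioning loss: negative log-likelihood of
    the reference caption's tokens under the model."""
    return -sum(math.log(p) for p in token_probs)

def joint_loss(vqa_loss, caption_loss, caption_weight=1.0):
    """Phase 2 objective, assumed here to be a weighted sum."""
    return vqa_loss + caption_weight * caption_loss

# Toy example: three candidate answers chosen by 10, 2 and 0 annotators.
targets = [soft_score(n) for n in (10, 2, 0)]
probs = [0.9, 0.5, 0.1]                 # predicted answer probabilities
vqa = bce_with_soft_targets(probs, targets)
cap = caption_nll([0.5, 0.25])          # probabilities of two caption tokens
total = joint_loss(vqa, cap)
```

In Phase 1 only the VQA loss would be minimized; the joint objective corresponds to Phase 2.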

  12. Overall Results • VQA performance on the VQA v2 Test-standard split

  13. Some Examples

  14. Some Examples

  15. Some Examples

  16. Comparison Between Different Captions • Human captions help VQA the most. • Generated question-relevant captions help VQA more than question-agnostic captions.

  17. Effects of Adjusting Visual Attention • Better VQA performance • More human-like attention

  18. Conclusions • General captions, even generated ones, can be used to improve VQA. • Question relevant captions improve VQA even more.
