This research paper introduces a method for generating question-relevant captions to aid visual question answering (VQA). The method leverages textual captions to adjust visual attention, and experimental results demonstrate that question-relevant captions improve VQA accuracy.
Generating Question Relevant Captions to Aid Visual Question Answering
Jialin Wu, Zeyuan Hu, Raymond J. Mooney
Department of Computer Science, The University of Texas at Austin
Visual Question Answering
• A complicated multimodal task involving high-level knowledge of objects, attributes, relations, etc.
• Example: "Does this boy have a full wetsuit on?" → VQA System → "Yes"
• Necessary knowledge:
  • Object level: [person, wetsuit]
  • Attribute level: [wetsuit is full, person is young]
  • Relational level: [wetsuit on person]
More Sophisticated Visual Features
• CNN image features: one feature vector per image
• Detecting common objects: N object vectors per image
• Mining attributes on objects: N object vectors per image
• Detecting relations: potentially N(N-1) relation vectors per image
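To make these feature granularities concrete, here is a minimal sketch of their shapes in PyTorch; the object count N and dimensionality d are illustrative assumptions, not values from the paper.

```python
import torch

N, d = 36, 2048                       # assumed: 36 detected objects, 2048-d vectors

cnn_feat = torch.randn(d)             # one CNN feature vector per image
object_feats = torch.randn(N, d)      # one vector per detected object
attribute_feats = torch.randn(N, d)   # one attribute-augmented vector per object

# Relations: one vector per ordered object pair, N*(N-1) in total.
pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
relation_feats = torch.randn(len(pairs), d)

print(cnn_feat.shape, object_feats.shape, relation_feats.shape)
```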
A Typical Current Method (Anderson et al., 2018)
• Use detected features as "bottom-up attention"
• Learn a "top-down attention" over these high-level features
• Pipeline: the question ("Does this boy have a full wetsuit on?") is word-embedded and encoded by a GRU; CNN image features are weighted by visual attention; an answer is predicted ("Yes")
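The sketch below illustrates the top-down attention idea: a question encoding scores each bottom-up object feature, and a softmax-weighted sum gives the attended visual feature. The concatenation-based scoring and all layer sizes are assumptions for illustration, not the exact architecture of Anderson et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Question-guided ("top-down") attention over detected-object
    ("bottom-up") features. Dimensions are illustrative."""
    def __init__(self, v_dim=2048, q_dim=512, hid=512):
        super().__init__()
        self.proj = nn.Linear(v_dim + q_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, v, q):
        # v: (N, v_dim) object features; q: (q_dim,) GRU question encoding
        q_tiled = q.unsqueeze(0).expand(v.size(0), -1)
        logits = self.score(torch.tanh(self.proj(torch.cat([v, q_tiled], 1))))
        alpha = F.softmax(logits, dim=0)         # one weight per object
        return (alpha * v).sum(0), alpha         # attended visual feature

v, q = torch.randn(36, 2048), torch.randn(512)
attended, alpha = TopDownAttention()(v, q)
```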
Leveraging Textual Captions for VQA
• Less structured than visual features. Example: "A young man is on his surf board with someone in the background"
  • Objects: [man, surfboard, background]
  • Attributes: [man is young]
  • Relationships: [surfer on surfboard, man with someone, someone in background]
• More diverse visual concepts: man, young surfer, young man, boy
• Sample captions for one image:
  • "A man on a blue surfboard on top of some rough water."
  • "A young surfer in a wetsuit surfs a small wave."
  • "A young man rides a surf board on a small wave while a man swims in the background."
  • "A young man is on his surf board with someone in the background."
  • "A boy riding waves on his surf board in the ocean."
Generating Relevant Captions
• For a visual question, some captions are relevant and some are not.
• Relevant captions are more likely to encode the knowledge necessary for VQA.
Collecting Relevant Training Captions
• Intuition: generated captions should focus on the objects that are influential for the question. We use a gradient-based method to model this influence.
• We collect the captions that maximize the inner product of the objects' influence scores between the VQA and captioning objectives (see the sketch below).
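A minimal sketch of this selection step, assuming influence is measured as the gradient norm of each objective with respect to the object features (a simplification of the paper's gradient-based influence); the losses here are toy stand-ins:

```python
import torch

N, d = 36, 64
object_feats = torch.randn(N, d, requires_grad=True)

def influence(loss):
    """One influence score per object: the gradient norm of the
    objective w.r.t. that object's features."""
    grads, = torch.autograd.grad(loss, object_feats, retain_graph=True)
    return grads.norm(dim=1)                     # (N,)

# Toy stand-ins for the real VQA loss and five per-caption losses.
vqa_loss = (object_feats * torch.randn(N, d)).sum()
caption_losses = [(object_feats * torch.randn(N, d)).sum() for _ in range(5)]

vqa_inf = influence(vqa_loss)
cap_infs = [influence(l) for l in caption_losses]

# Keep the caption whose influence scores have maximum inner product
# with the VQA objective's influence scores.
best = max(range(len(cap_infs)),
           key=lambda i: torch.dot(vqa_inf, cap_infs[i]).item())
print("selected caption index:", best)
```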
How to Use the Captions
• Caption embedding:
  • A Word GRU combined with an attention mechanism identifies the words that are important for the question and image.
  • A Caption GRU encodes sequential information from the attended words.
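A minimal sketch of this two-stage encoder, assuming per-word sigmoid gates conditioned on a fused question-and-image context vector; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CaptionEmbedding(nn.Module):
    """Word GRU -> word attention -> Caption GRU. Sizes are illustrative."""
    def __init__(self, emb=300, hid=512, ctx=512):
        super().__init__()
        self.word_gru = nn.GRU(emb, hid, batch_first=True)
        self.attn = nn.Linear(hid + ctx, 1)
        self.cap_gru = nn.GRU(hid, hid, batch_first=True)

    def forward(self, word_embs, context):
        # word_embs: (1, T, emb); context: (ctx,) fused question+image feature
        h, _ = self.word_gru(word_embs)                        # (1, T, hid)
        ctx = context.expand(h.size(1), -1).unsqueeze(0)       # (1, T, ctx)
        a = torch.sigmoid(self.attn(torch.cat([h, ctx], -1)))  # per-word gates
        _, cap = self.cap_gru(a * h)                           # encode attended words
        return cap.squeeze(0).squeeze(0)                       # (hid,)

word_embs = torch.randn(1, 12, 300)        # a 12-word caption
cap_feat = CaptionEmbedding()(word_embs, torch.randn(512))
```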
How to Use the Captions
• Captions are used to adjust visual attention.
• As with visual attention, caption features are used to compute an attention adjustment that is added to the original visual attention.
Model Overview
• The model combines visual attention with a caption attention adjustment.
• ⊗ and ⊕ denote element-wise multiplication and addition; blue arrows denote fc layers and yellow arrows denote attention embedding.
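Following the model overview, here is a minimal sketch of the caption-adjusted attention: the caption feature produces an additive adjustment to the visual attention logits before the softmax. The concatenation-based scoring and all sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjustedAttention(nn.Module):
    """Visual attention plus a caption-based additive adjustment."""
    def __init__(self, v_dim=2048, q_dim=512, c_dim=512):
        super().__init__()
        self.base = nn.Linear(v_dim + q_dim, 1)    # original visual attention
        self.adjust = nn.Linear(v_dim + c_dim, 1)  # caption-based adjustment

    def forward(self, v, q, c):
        # v: (N, v_dim) objects; q: (q_dim,) question; c: (c_dim,) caption
        qv = torch.cat([v, q.expand(v.size(0), -1)], 1)
        cv = torch.cat([v, c.expand(v.size(0), -1)], 1)
        logits = self.base(qv) + self.adjust(cv)   # the ⊕ in the figure
        alpha = F.softmax(logits, dim=0)
        return (alpha * v).sum(0)                  # attended visual feature

v = torch.randn(36, 2048)
attended = AdjustedAttention()(v, torch.randn(512), torch.randn(512))
```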
Training
• Phase 1: train the VQA module with human captions.
  • Use all 5 captions per image from the COCO dataset.
  • Use the standard binary cross-entropy loss with soft scores as the VQA loss (sketched below).
• Phase 2: jointly train the VQA and captioning modules with the selected relevant human captions.
  • Use maximum likelihood estimation to train the captioning module.
  • Jointly minimize the VQA loss and the captioning loss.
• Phase 3: fine-tune the VQA module using generated captions.
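A minimal sketch of the two objectives referenced above: binary cross entropy against soft answer scores for VQA, and a maximum-likelihood (cross-entropy) loss for captioning. The answer vocabulary size, caption vocabulary, and equal loss weighting are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

num_answers, vocab = 3129, 10000              # assumed vocabulary sizes
logits = torch.randn(2, num_answers)          # VQA answer logits for a batch of 2
soft_scores = torch.zeros(2, num_answers)     # soft scores from annotator agreement
soft_scores[0, 42] = 1.0
soft_scores[1, 7] = 0.6

# Phases 1 and 3: binary cross entropy with soft scores as the VQA loss.
vqa_loss = F.binary_cross_entropy_with_logits(logits, soft_scores)

# Phase 2 adds a maximum-likelihood captioning loss and minimizes the sum.
cap_logits = torch.randn(2, 12, vocab)        # (batch, caption length, vocab)
cap_tokens = torch.randint(vocab, (2, 12))
cap_loss = F.cross_entropy(cap_logits.reshape(-1, vocab), cap_tokens.reshape(-1))

joint_loss = vqa_loss + cap_loss              # phase-2 joint objective
```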
Overall Results
• VQA performance on the VQA v2 test-standard split.
Comparison Between Different Captions
• Human captions help VQA the most.
• Generated question-relevant captions help VQA more than question-agnostic captions.
Effects of Adjusting Visual Attention
• Better VQA performance
• More human-like attention
Conclusions
• General captions, even generated ones, can be used to improve VQA.
• Question-relevant captions improve VQA even more.