CVPR 2019 Poster
Task
Grounding referring expressions is typically formulated as the task of identifying, from a set of proposals in an image, the proposal referred to by the expression.
Summarize
• visual features of single objects, global visual contexts, pairwise visual differences (CNN; see the sketch after this slide)
• object pair context
• global language contexts, language features of the decomposed phrases (LSTM)
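The slide only names the feature types. As a concrete illustration of the "pairwise visual differences" used in earlier grounding work (the restriction to same-category proposals is exactly what the Problem slide below criticizes), here is a minimal sketch; shapes, names, and the averaging scheme are illustrative assumptions, not this paper's exact design.

```python
import torch

def pairwise_visual_diff(feats, categories):
    """Sketch of instance-level pairwise visual differences.

    feats:      (N, D) CNN features of the N proposals in an image.
    categories: (N,)   predicted category id of each proposal.
    Returns an (N, D) difference feature: for each proposal, the mean of
    normalized differences to other proposals of the SAME category
    (zero if it has no same-category neighbour).
    """
    diff_feats = torch.zeros_like(feats)
    for i in range(feats.size(0)):
        same = (categories == categories[i]).nonzero(as_tuple=True)[0]
        same = same[same != i]                       # exclude the proposal itself
        if len(same) == 0:
            continue
        diffs = feats[i].unsqueeze(0) - feats[same]                 # (K, D)
        diffs = diffs / (diffs.norm(dim=1, keepdim=True) + 1e-8)    # normalize each difference
        diff_feats[i] = diffs.mean(dim=0)
    return diff_feats
```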
Problem
• Existing work on global language context modeling and global visual context modeling introduces noisy information and makes it hard to match these two types of contexts.
• The pairwise visual differences computed in existing work can only represent instance-level visual differences among objects of the same category.
• Existing work on context modeling for object pairs considers only first-order relationships, not multi-order relationships.
• Multi-order relationships are structured information, and the context encoders adopted by existing work on grounding referring expressions are simply incapable of modeling them.
Language Context
• word type
• which vertex each word refers to
• vertex language context
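Only keywords survive on this slide. The sketch below shows one plausible reading: each word gets a soft type distribution and a soft assignment to graph vertices, and each vertex gathers the words assigned to it as its language context. The layer names, sizes, and the exact attention form are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageContext(nn.Module):
    """Sketch: soft word-to-vertex assignment and per-vertex language context."""
    def __init__(self, dw, dv, num_types=4):
        super().__init__()
        self.type_head = nn.Linear(dw, num_types)   # word-type distribution (assumed)
        self.w_proj = nn.Linear(dw, dv)             # project words into the vertex space

    def forward(self, word_feats, vert_feats):
        # word_feats: (L, dw) from a bi-LSTM over the expression
        # vert_feats: (V, dv) visual features of the graph vertices
        word_types = F.softmax(self.type_head(word_feats), dim=-1)   # (L, T)  word type
        scores = self.w_proj(word_feats) @ vert_feats.t()            # (L, V)
        word2vert = F.softmax(scores, dim=-1)                        # which vertex each word refers to
        vert_ctx = word2vert.t() @ word_feats                        # (V, dw)  vertex language context
        return word_types, word2vert, vert_ctx
```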
Language-Guided Visual Relation Graph: edges and vertices
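Again only "edge" and "vertex" remain on the slide. Below is a hedged sketch of how edges of a visual relation graph could be gated by the vertex language context from the previous step; the edge-weighting scheme and the top-k sparsification are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def language_guided_graph(vert_feats, vert_ctx, topk=5):
    """Sketch: keep, for every vertex, the top-k edges whose weight mixes
    visual similarity with relevance to the expression (illustrative only).

    vert_feats: (V, D) vertex visual features.
    vert_ctx:   (V, D) vertex language context, assumed projected to dim D.
    Returns a (V, V) adjacency matrix of edge weights.
    """
    vis_sim = F.normalize(vert_feats, dim=1) @ F.normalize(vert_feats, dim=1).t()
    lang_gate = torch.sigmoid((vert_feats * vert_ctx).sum(dim=1))    # (V,) relevance to the expression
    weights = vis_sim * lang_gate.unsqueeze(0) * lang_gate.unsqueeze(1)
    # sparsify: keep only the top-k outgoing edges per vertex
    k = min(topk, weights.size(0))
    topv, topi = weights.topk(k, dim=1)
    adj = torch.zeros_like(weights).scatter_(1, topi, topv)
    adj.fill_diagonal_(0.0)
    return adj
```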
Language-Vision Feature
Loss Function
Semantic Context Modeling
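The slide does not spell out the loss function. Shown below, purely as an assumption about a common choice for grounding, is a loss that scores every proposal against the expression and applies cross-entropy on the ground-truth proposal.

```python
import torch
import torch.nn.functional as F

def grounding_loss(scores, gt_index):
    """Sketch of a typical grounding loss (not necessarily the paper's exact one).

    scores:   (B, N) language-vision matching scores for the N proposals.
    gt_index: (B,)   index of the proposal that matches the expression.
    """
    return F.cross_entropy(scores, gt_index)
```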
Problem
• Almost all existing approaches for referring expression comprehension either do not introduce reasoning or support only single-step reasoning.
• The models trained with those approaches have poor interpretability.
Language-Guided Visual Reasoning Process
q: the concatenation of the last hidden states of the forward and backward LSTMs
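A minimal PyTorch sketch of the q described here (all dimensions are illustrative): run a bidirectional LSTM over the word embeddings and concatenate the last hidden states of the two directions.

```python
import torch
import torch.nn as nn

# Build q: concatenate the last hidden states of the forward and backward LSTMs.
emb_dim, hid_dim, seq_len = 300, 512, 12
words = torch.randn(1, seq_len, emb_dim)                 # (batch, L, emb) word embeddings
bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
_, (h_n, _) = bilstm(words)                              # h_n: (2, batch, hid)
q = torch.cat([h_n[0], h_n[1]], dim=-1)                  # (batch, 2 * hid)
```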
Motivation
• When we feed an unseen image scene into the framework, we usually get a simple and trivial caption about the salient objects, such as "there is a dog on the floor", which is no better than a bare list of detected objects.
• Once we abstract the scene into symbols, the generation will be almost disentangled from the visual perception.
Inductive Bias
• Everyday practice makes us perform better than machines in high-level reasoning.
• Template/rule-based caption models are well known to be ineffective compared to encoder-decoder ones, due to the large gap between visual perception and language composition.
• Scene graph --> bridge the gap between the two worlds.
• We can embed the graph structure into vector representations; these vector representations are expected to transfer the inductive bias from the pure language domain to the vision-language domain.
Auto-Encoding Scene Graphs: Dictionary
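The slide names only the dictionary. The sketch below shows the general idea of re-encoding a feature as an attention-weighted mixture of learned dictionary entries; the sizes and the exact attention form are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryReEncoder(nn.Module):
    """Sketch: re-encode features through a learned dictionary D, replacing
    each input by an attention-weighted combination of dictionary entries.
    """
    def __init__(self, dim=512, num_entries=1024):   # illustrative sizes
        super().__init__()
        self.D = nn.Parameter(torch.randn(num_entries, dim) * 0.01)

    def forward(self, x):                            # x: (N, dim) scene-graph features
        attn = F.softmax(x @ self.D.t(), dim=-1)     # (N, num_entries) attention over entries
        return attn @ self.D                         # (N, dim) re-encoded features
```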
Overall Model: SGAE-based Encoder-Decoder
• object detector + relation detector + attribute classifier
• multi-modal graph convolution network (see the sketch after this list)
• pre-train D; cross-entropy loss; RL-based loss
• two decoders:
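As a rough illustration of the graph-convolution ingredient listed above: the paper's multi-modal GCN also mixes in relation and attribute embeddings, while this sketch keeps only the neighbour-aggregation core, and all shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvLayer(nn.Module):
    """Sketch of one graph-convolution step over the detected scene graph:
    each node averages its neighbours' features and passes them through a
    learned linear map with a residual connection.
    """
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim) node features; adj: (N, N) adjacency from the
        # object / relation / attribute detections.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj / deg) @ node_feats                # average over neighbours
        return F.relu(self.proj(agg) + node_feats)    # residual update
```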
Motivation
• Unlike a visual concept in ImageNet, which has 650 training images on average, a specific sentence in MS-COCO has only one single image, which is extremely scarce in the conventional view of supervised training.
• Given a sentence pattern as in Figure 1b, your descriptions for the three images in Figure 1a should be much more constrained.
• Studies in cognitive science show that we humans do not speak an entire sentence word by word from scratch; instead, we compose a pattern first, then fill the pattern with concepts, and we repeat this process until the whole sentence is finished.
Relation Module, Object Module, Function Module, Attribute Module
Controller
Multi-step Reasoning: repeat the soft fusion and language decoding M times (sketched below).
Linguistic Loss:
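A hedged sketch of the multi-step soft fusion described here: at each of the M steps the controller weights the module outputs, fuses them into a single feature for the language decoder, and updates its state. Module internals, the state-update rule, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftFusionController(nn.Module):
    """Sketch of a controller that repeats soft fusion over the four module
    outputs M times, producing one fused feature per reasoning step.
    """
    def __init__(self, dim, num_modules=4, steps=3):
        super().__init__()
        self.steps = steps                              # M reasoning steps
        self.weight_head = nn.Linear(dim, num_modules)  # fusion weights per step

    def forward(self, h, module_outs):
        # h:           (B, dim)              current controller / decoder state
        # module_outs: (B, num_modules, dim) outputs of the four modules
        fused_all = []
        for _ in range(self.steps):                        # repeat M times
            w = F.softmax(self.weight_head(h), dim=-1)     # (B, num_modules)
            fused = (w.unsqueeze(-1) * module_outs).sum(dim=1)   # soft fusion
            h = h + fused                                  # feed back for the next step
            fused_all.append(fused)
        return fused_all                                   # one fused feature per step
```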