GuessWhat?! Visual object discovery through multi-modal dialogue

Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

Motivation
• Learn to acquire natural language through interaction on a visual task
• First large-scale dataset involving images and dialogue
• Requires high-level image understanding, such as spatial reasoning and language grounding

Oracle Model

Guesser Model

Question Generator

Critique
- All information about image is useless in the guesser model
- Using two trained models to evaluate Question Generator
- Access to object list
- When to guess?
- More baselines
- Unseen object categories
- Unrealistic task
