How Many Words is a Picture Worth? Automatic Caption Generation for News Images
Yansong Feng and Mirella Lapata
Presented by Ashish Bagate
What this paper is about • Explores the feasibility of automatic caption generation for images in the news domain • Why the news domain in particular: training data is easily and abundantly available
Why • Lots of digital images are available on the web • Improved image search • Image analysis • Keyword-only searches are ambiguous • Captions enable targeted queries with longer search strings • Web accessibility
General Approach • Two-step process • Analyze the image and build a representation of its content • Run a text generation engine over that image representation to produce a natural language caption
Related Work • Hede et al. – not practical because of its controlled data set and manually created database • Yao et al. – generates descriptions from the image alone • Elzer et al. – infers what a graphic depicts, with little emphasis on generation • These methods all rely on background information or hand-built terminologies
Problem Formulation • Given an image I and its accompanying document D, generate a caption C • Training data consists of document–image–caption tuples • Caption generation is a difficult task even for humans • A good caption must be succinct and informative, clearly identify the subject of the picture, and draw the reader to the article
Overview of the method • Similar to the headline generation task • Obtain training data (inherently noisy) • Two-stage approach • Extract keywords from the image (image annotation model) • Generate the caption from those keywords • Image features help keep the description faithful and meaningful
Image Annotation • Probabilistic model – well suited to noisy data • Compute SIFT descriptors for each image • Quantize the descriptors into visual words via k-means clustering • Extract description keywords with an LDA-based topic model • dMix – a single bag of words jointly representing the image, document, and caption
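The visual-word step above can be sketched as follows. This is a toy illustration, not the paper's code: synthetic random vectors stand in for real SIFT descriptors, and a minimal hand-rolled k-means quantizes them into a visual-word vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SIFT: 200 synthetic 128-dimensional descriptors for one "image".
# In the paper these would come from a SIFT detector run on the actual pixels.
descriptors = rng.normal(size=(200, 128))

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns centroids and the cluster id of each row of X."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance of every descriptor to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(descriptors, k=10)

# Each descriptor now maps to a discrete visual word (its cluster id),
# so the image becomes a bag of visual words, analogous to a bag of text words.
bag_of_visual_words = np.bincount(labels, minlength=10)
print(bag_of_visual_words.sum())  # 200: one visual word per descriptor
```

Once images are bags of visual words, they can be fed into a topic model alongside the document's text words, which is what makes the joint dMix representation possible.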
Extractive Caption Generation • Requires little linguistic analysis • The caption is the document sentence that is maximally similar to the description keywords
Types of Similarities • Word Overlap • Cosine Similarity • Probabilistic Similarity • KL divergence – similarity between an image and a sentence is measured by the extent to which they share the same topic distributions
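The three similarity measures above, and the extractive selection step that uses them, can be sketched in a few lines. The keyword list and sentences below are invented for illustration; note that KL divergence is a distance, so lower values mean more similar.

```python
import math
from collections import Counter

def word_overlap(a, b):
    """Dice-style overlap of word types between two token lists."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two topic distributions; 0 means identical."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Extractive selection: pick the document sentence most similar to the
# description keywords produced by the image annotation model (toy data).
keywords = "president visit summit".split()
sentences = [
    "the president arrived at the summit".split(),
    "stock markets fell sharply today".split(),
]
best = max(sentences, key=lambda s: word_overlap(keywords, s))
print(" ".join(best))  # the president arrived at the summit
```

Swapping `word_overlap` for `cosine`, or for negated KL divergence over topic distributions, gives the other two variants described above.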
Issues with Extractive Caption Generation • No single document sentence may fully represent the image • Selected sentences are often longer than the average caption • The result may not be catchy
Abstractive Caption Generation • Word-based model • Adapted from headline generation • Caption = the word sequence w1 … wn that maximizes the product of an attachment probability P(wi | image keywords) and a language-model probability P(wi | wi-1)
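The word-based model can be sketched with a greedy left-to-right decoder (the paper searches the full space; greedy selection is a simplification here). All probability tables below are invented for illustration.

```python
# Word-based abstractive captioning, greedy sketch:
# at each step pick the word maximizing
#   P(word | image keywords) * P(word | previous word).
vocab = ["president", "visits", "summit"]
attach = {"president": 0.4, "visits": 0.3, "summit": 0.3}  # P(w | image), invented
bigram = {                                                  # P(w | prev), invented
    ("<s>", "president"): 0.6,
    ("president", "visits"): 0.7,
    ("visits", "summit"): 0.8,
}

def greedy_caption(length=3):
    prev, caption = "<s>", []
    for _ in range(length):
        # unseen bigrams get a tiny smoothing probability
        w = max(vocab, key=lambda w: attach[w] * bigram.get((prev, w), 1e-6))
        caption.append(w)
        prev = w
    return caption

print(" ".join(greedy_caption()))  # president visits summit
```

The attachment term keeps the caption about the image; the bigram term keeps it fluent.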
Abstractive Caption Generation • Phrase-based model • Caption = the phrase sequence that maximizes the product of phrase attachment probabilities and phrase-to-phrase transition probabilities • Phrases yield more fluent output than isolated words
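The phrase-based variant is the same search with phrases as the atomic units, which avoids the disfluencies of stitching together individual words. Again a greedy sketch over invented tables, not the paper's implementation:

```python
# Phrase-based abstractive captioning, greedy sketch: the decoder now
# chains whole phrases instead of single words.
phrases = ["the president", "visits the summit", "stock markets"]
attach = {"the president": 0.5, "visits the summit": 0.4, "stock markets": 0.1}  # invented
trans = {  # invented phrase-transition probabilities
    ("<s>", "the president"): 0.6,
    ("the president", "visits the summit"): 0.7,
}

def greedy_phrase_caption(n_phrases=2):
    prev, out = "<s>", []
    for _ in range(n_phrases):
        p = max(phrases, key=lambda p: attach[p] * trans.get((prev, p), 1e-6))
        out.append(p)
        prev = p
    return out

print(" ".join(greedy_phrase_caption()))  # the president visits the summit
```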