Unsupervised Learning of Visual Sense Models for Polysemous Words
Kate Saenko, Trevor Darrell, Deepak
Polysemy
• Ambiguity of an individual word or phrase that can be used in different contexts to express two or more meanings
• E.g.:
• Present: right now
• Present: a gift
• Visual polysemy refers to different meanings that are visually distinct
Unsupervised learning of object classifiers suffers because of polysemy
• Existing approaches try to filter out unrelated images, either by bootstrapping an object classifier or by clustering the images into coherent components.
• But they do not take the polysemy of words into consideration.
• This paper proposes an unsupervised method that takes word sense into account.
Idea behind the paper
• Input: a list of words and their dictionary meanings
• Learn a text model of each word sense
• Use that model to retrieve images of a specific sense
• Use these re-ranked images as training data for an object classifier
• This classifier can predict the correct sense of the word depicted in an image
Model
• Three main steps:
• Discovering latent dimensions
• Learning probabilistic models of the dictionary senses
• Using the above sense models to construct sense-specific image classifiers
Latent Text Space
• Bag-of-words approach using the words near the image link
• A bag-of-words approach is used because this text does not follow any grammar, so features like part-of-speech (POS) tags cannot be used
• Uses LDA (Latent Dirichlet Allocation) to discover hidden topics in the data
• Create a dataset of text-only webpages returned by a regular web search
• The LDA model is learnt on this dataset
• This is done because using only the words directly surrounding the images causes overfitting
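The latent-topic step above can be sketched as follows; this is a minimal illustration assuming scikit-learn, and the toy corpus and topic count are made up, not the data or settings from the paper.

```python
# Minimal sketch of the latent text space: bag-of-words counts + LDA.
# The documents below are toy stand-ins for the text-only webpages
# returned by a regular web search.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bass guitar amp string pickup music band",
    "bass fishing lake lure rod reel catch",
    "fish bass river boat fishing tackle",
    "music bass player band guitar concert",
]

# Bag-of-words counts (no grammar assumed, so no POS tagging is used).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit LDA to discover hidden topics in the corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # P(z|d): one topic mixture per document

print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` is a document's distribution over the hidden topics, which is the quantity the sense model builds on in the next step.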
Dictionary Sense Model
• Relates the dictionary senses to the topics formed in the previous step
• Dictionary senses are obtained from WordNet
• From the senses and the topics, the likelihood of a particular sense given a topic is computed: P(s|z=j)
• Using this, the probability of a particular sense in a document is computed: P(s|d)
• This gives us the sense for each image
• From this, images can be grouped according to sense
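The combination of the two probabilities above can be sketched as a marginalization over topics, P(s|d) = Σ_j P(s|z=j) P(z=j|d); the numbers below are hypothetical, not values from the paper.

```python
import numpy as np

# Hypothetical quantities: 2 senses of "bass", 3 latent topics, one document.
# P(s|z=j): how well each WordNet sense definition matches each topic.
p_s_given_z = np.array([[0.9, 0.2, 0.5],   # sense "fish"
                        [0.1, 0.8, 0.5]])  # sense "instrument"
# P(z|d): the document's topic mixture, e.g. from the LDA model.
p_z_given_d = np.array([0.7, 0.2, 0.1])

# Marginalize over topics: P(s|d) = sum_j P(s|z=j) * P(z=j|d)
p_s_given_d = p_s_given_z @ p_z_given_d
p_s_given_d /= p_s_given_d.sum()  # normalize over senses

predicted_sense = int(np.argmax(p_s_given_d))
print(p_s_given_d)        # [0.72 0.28]
print(predicted_sense)    # 0, i.e. the "fish" sense for this document
```

Images whose surrounding text forms document d can then be grouped under the sense with the highest P(s|d).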
Visual Sense Model
• Uses the sense model obtained in the above two steps to generate training data for an image-based classifier
• Uses a discriminative classifier (SVM)
• Re-ranks the images according to the probability of the target sense
• From the re-ranked images, selects the N highest-ranked examples as positive training data for the SVM
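The re-ranking and training step can be sketched as below, assuming scikit-learn; the sense probabilities and image features are random stand-ins for the real P(s|d) values and image descriptors, and N is arbitrary.

```python
# Sketch: re-rank images by sense probability, take the top N as
# positives, and train a discriminative classifier (SVM).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_images, n_features, N = 200, 64, 50

features = rng.normal(size=(n_images, n_features))  # stand-in image features
p_sense = rng.uniform(size=n_images)                # stand-in P(s|d) per image

# Re-rank images by the probability of the target sense (highest first).
order = np.argsort(-p_sense)

# N highest-ranked images become positives; the rest serve as negatives.
labels = np.zeros(n_images, dtype=int)
labels[order[:N]] = 1

# Train the sense-specific image classifier on this re-ranked data.
clf = SVC(kernel="rbf").fit(features, labels)
```

The trained classifier can then assign the sense to new images directly from their visual features.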
Datasets
• Three datasets used:
• Images for bass, face, mouse, speaker and watch from Yahoo Image Search; the returned images were annotated as unrelated, partial or good
• A second dataset was collected using sense-specific search terms
• A third, text-only dataset was collected using a regular web search
Features
• Words in webpages are taken with HTML tags removed, then tokenized, stop words removed and stemmed
• For words related to images, the words surrounding the image link are extracted
• A window size of 100 words is used for the above step
• Image features are obtained by first resizing the images to 300 pixels, converting them to grayscale, and then extracting edge features and scale-invariant salient points
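The text-feature pipeline above (strip HTML, tokenize, remove stop words, stem, keep a 100-word window around the image link) can be sketched roughly as follows; the helper names, the abbreviated stop list, and the crude suffix-stripping stemmer are my stand-ins (the paper would use a real stemmer such as Porter's).

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}  # abbreviated list

def crude_stem(token):
    # Very crude stand-in for a real stemmer (e.g. Porter).
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(html_text):
    text = re.sub(r"<[^>]+>", " ", html_text)      # remove HTML tags
    words = re.findall(r"[a-z]+", text.lower())    # tokenize
    words = [w for w in words if w not in STOP_WORDS]
    return [crude_stem(w) for w in words]

def window_around(tokens, link_index, size=100):
    # Keep `size` words surrounding the image-link position.
    half = size // 2
    return tokens[max(0, link_index - half): link_index + half]

tokens = preprocess("<p>The bass player is tuning the strings</p>")
print(tokens)  # ['bass', 'player', 'tun', 'string']
```

The resulting stemmed tokens within the window form the bag-of-words representation fed to the LDA model.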