Unsupervised Learning of Visual Sense Models for Polysemous Words
Kate Saenko, Trevor Darrell, Deepak
Polysemy
• Ambiguity of an individual word or phrase that can be used in different contexts to express two or more meanings
• E.g.:
• Present: right now
• Present: a gift
• Visual polysemy refers to different meanings that are visually distinct
Unsupervised learning of object classifiers suffers because of polysemy
• Existing approaches try to filter out unrelated images, either by bootstrapping an object classifier or by clustering the images into coherent components.
• But they do not take the polysemy of words into consideration.
• This paper proposes an unsupervised method that takes word sense into account.
Idea behind the paper
• Input: a list of words and their dictionary meanings
• Learn a text model of each word sense
• Use that model to retrieve images of a specific sense
• Use these re-ranked images as training data for an object classifier
• This classifier can predict the correct sense of the word depicted in an image
Model
• Three main steps:
• Discovering latent dimensions
• Learning probabilistic models of the dictionary senses
• Using the above sense models to construct sense-specific image classifiers
Latent Text Space
• Bag-of-words approach using the words near the image link
• A bag-of-words approach is used because this text does not follow any grammar, so features like part-of-speech (POS) tags cannot be used
• Uses LDA (Latent Dirichlet Allocation) to discover hidden topics in the data
• Create a dataset of text-only webpages returned by a regular web search
• The LDA model is learnt on this dataset
• This is done because using only the words directly surrounding the images causes overfitting
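The latent-topic step above can be sketched as follows; this is a minimal illustration assuming scikit-learn, and the toy corpus and topic count are made up, not the data or settings from the paper.

```python
# Minimal sketch of the latent text space: bag-of-words counts + LDA.
# The documents below are toy stand-ins for the text-only webpages
# returned by a regular web search.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bass guitar amp string pickup music band",
    "bass fishing lake lure rod reel catch",
    "fish bass river boat fishing tackle",
    "music bass player band guitar concert",
]

# Bag-of-words counts (no grammar assumed, so no POS tagging is used).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit LDA to discover hidden topics in the corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # P(z|d): one topic mixture per document

print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` is a document's distribution over the hidden topics, which is the quantity the sense model builds on in the next step.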
Dictionary Sense Model
• Relates the dictionary senses to the topics formed in the previous step
• Dictionary senses are obtained from WordNet
• From the senses and the topics, the likelihood of a particular sense given a topic is computed: P(s|z=j)
• Using this, the probability of a particular sense in a document is computed: P(s|d)
• This gives us the sense for each image
• From this, images can be grouped according to sense
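The combination of the two probabilities above can be sketched as a marginalization over topics, P(s|d) = Σ_j P(s|z=j) P(z=j|d); the numbers below are hypothetical, not values from the paper.

```python
import numpy as np

# Hypothetical quantities: 2 senses of "bass", 3 latent topics, one document.
# P(s|z=j): how well each WordNet sense definition matches each topic.
p_s_given_z = np.array([[0.9, 0.2, 0.5],   # sense "fish"
                        [0.1, 0.8, 0.5]])  # sense "instrument"
# P(z|d): the document's topic mixture, e.g. from the LDA model.
p_z_given_d = np.array([0.7, 0.2, 0.1])

# Marginalize over topics: P(s|d) = sum_j P(s|z=j) * P(z=j|d)
p_s_given_d = p_s_given_z @ p_z_given_d
p_s_given_d /= p_s_given_d.sum()  # normalize over senses

predicted_sense = int(np.argmax(p_s_given_d))
print(p_s_given_d)        # [0.72 0.28]
print(predicted_sense)    # 0, i.e. the "fish" sense for this document
```

Images whose surrounding text forms document d can then be grouped under the sense with the highest P(s|d).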
Visual Sense Model
• Uses the sense model obtained in the above two steps to generate training data for an image-based classifier
• Uses a discriminative classifier (SVM)
• Re-ranks the images according to the probability of the target sense
• From the re-ranked images, selects the N highest-ranked examples as positive training data for the SVM
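The re-ranking and training step can be sketched as below, assuming scikit-learn; the sense probabilities and image features are random stand-ins for the real P(s|d) values and image descriptors, and N is arbitrary.

```python
# Sketch: re-rank images by sense probability, take the top N as
# positives, and train a discriminative classifier (SVM).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_images, n_features, N = 200, 64, 50

features = rng.normal(size=(n_images, n_features))  # stand-in image features
p_sense = rng.uniform(size=n_images)                # stand-in P(s|d) per image

# Re-rank images by the probability of the target sense (highest first).
order = np.argsort(-p_sense)

# N highest-ranked images become positives; the rest serve as negatives.
labels = np.zeros(n_images, dtype=int)
labels[order[:N]] = 1

# Train the sense-specific image classifier on this re-ranked data.
clf = SVC(kernel="rbf").fit(features, labels)
```

The trained classifier can then assign the sense to new images directly from their visual features.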
Datasets
• Three datasets used:
• Images for bass, face, mouse, speaker and watch from Yahoo Image Search; the returned images were annotated as unrelated, partial or good
• A second dataset was collected using sense-specific search terms
• A third, text-only dataset was collected using a regular web search
Features
• Words in webpages are taken with HTML tags removed, then tokenized, stop words removed and stemmed
• For words related to images, the words surrounding the image link are extracted
• A window size of 100 words is used for the above step
• Image features are obtained by first resizing the images to 300 pixels, converting them to grayscale, and then extracting edge features and scale-invariant salient points
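The text-feature pipeline above (strip HTML, tokenize, remove stop words, stem, keep a 100-word window around the image link) can be sketched roughly as follows; the helper names, the abbreviated stop list, and the crude suffix-stripping stemmer are my stand-ins (the paper would use a real stemmer such as Porter's).

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}  # abbreviated list

def crude_stem(token):
    # Very crude stand-in for a real stemmer (e.g. Porter).
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(html_text):
    text = re.sub(r"<[^>]+>", " ", html_text)      # remove HTML tags
    words = re.findall(r"[a-z]+", text.lower())    # tokenize
    words = [w for w in words if w not in STOP_WORDS]
    return [crude_stem(w) for w in words]

def window_around(tokens, link_index, size=100):
    # Keep `size` words surrounding the image-link position.
    half = size // 2
    return tokens[max(0, link_index - half): link_index + half]

tokens = preprocess("<p>The bass player is tuning the strings</p>")
print(tokens)  # ['bass', 'player', 'tun', 'string']
```

The resulting stemmed tokens within the window form the bag-of-words representation fed to the LDA model.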