Learning Semantics of Words and Pictures Tejaswi Devarapalli
Content • Introduction • Modeling Image Dataset Statistics • Hierarchical Model • Testing and Using the Basic Model • Auto Illustration • Auto Annotation • Results • Discussion
Semantics • Language uses a system of linguistic signs, each of which is a combination of meaning and phonological and/or orthographic forms. • Semantics is traditionally defined as the study of meaning in language.
Abstract • A statistical model for organizing image collections. • Integrates semantic information provided by associated text with visual information provided by image features. • A promising model for information retrieval tasks such as database browsing and searching for images. • Also supports novel applications.
Introduction • A method for organizing image databases. • Integrates two kinds of information during model construction. • Learns links between image features and semantics. • The learned links are useful for • Better browsing • Better search • Novel applications
Introduction (continued) • Models statistics about the occurrence and co-occurrence of words and image features. • Hierarchical structure. • A generative model, which implicitly contains processes for predicting • Image components • Words and features
Comparison • This model supports browsing for image retrieval purposes. • Typical systems for searching image databases instead rely on search by query: • Text • Image feature similarity • Segment features • Image sketch
Modeling Image Dataset Statistics • Generative hierarchical model • A combination of • Asymmetric clustering model (maps documents into clusters) • Symmetric clustering model (models the joint distribution of documents and features) • Data are modeled as a fixed hierarchy of nodes. • Nodes generate words and image segments (sketched below).
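The following is a minimal Python sketch of that generative structure, assuming toy sizes and randomly initialized distributions (the real model is fitted with EM): a document picks a cluster, the cluster fixes a path of nodes, and every word or blob is emitted by a node at some level on that path.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: n_clusters paths through the hierarchy, n_levels nodes per path.
n_levels, n_clusters, vocab_size, blob_dim = 3, 4, 50, 8

# Per (cluster, level) word-emission distributions and Gaussian blob means.
word_probs = rng.dirichlet(np.ones(vocab_size), size=(n_clusters, n_levels))
blob_means = rng.normal(size=(n_clusters, n_levels, blob_dim))

def generate_document(cluster_prior, level_probs, n_words=4, n_blobs=3):
    """Sample one document: pick a cluster (a root-to-leaf path), then emit
    each word and blob from a node (level) on that path."""
    c = rng.choice(n_clusters, p=cluster_prior)
    words = [rng.choice(vocab_size, p=word_probs[c, rng.choice(n_levels, p=level_probs[c])])
             for _ in range(n_words)]
    blobs = [rng.normal(blob_means[c, rng.choice(n_levels, p=level_probs[c])], 1.0)
             for _ in range(n_blobs)]
    return c, words, blobs

cluster_prior = np.full(n_clusters, 1 / n_clusters)
level_probs = rng.dirichlet(np.ones(n_levels), size=n_clusters)
print(generate_document(cluster_prior, level_probs))
```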
Illustration • Documents are modeled as a sequence of words and a sequence of segments using the Blobworld representation. • The "Blobworld" representation is created by clustering pixels in a joint color-texture-position feature space. • A document is modeled by a sum over the clusters, taking all clusters into consideration.
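To illustrate the pixel-clustering idea, here is a simplified Python sketch that clusters pixels in a joint color-position space with k-means; the actual Blobworld system also uses texture features and fits a mixture model with EM rather than k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def blobworld_like_segments(image, n_segments=5):
    """Cluster pixels in a joint color-position feature space; a simplified
    stand-in for Blobworld's color-texture-position clustering."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([
        image.reshape(-1, 3),            # color values per pixel
        xs.ravel() / w, ys.ravel() / h,  # normalized pixel position
    ])
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(feats)
    return labels.reshape(h, w)          # one segment id per pixel

demo = np.random.rand(32, 32, 3)         # stand-in for a real image
print(np.unique(blobworld_like_segments(demo)))
```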
Hierarchical Model • Higher-level nodes emit more general words and blobs (e.g. sky). • Middle-level nodes emit moderately general words and blobs (e.g. sun, sea). • Lower-level nodes emit more specific words and blobs (e.g. waves). • Each node has a probability of generating a word or image segment w.r.t. the document under consideration. • The cluster defines the path; cluster and level together identify the node.
Mathematical Model • The process for generating the set of observations D associated with a document d is described by: p(D | d) = Σ_c p(c) ∏_{i ∈ D} Σ_l p(i | l, c) p(l | c, d) • c – clusters, i – items (words and blobs), l – levels.
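A small Python sketch of evaluating this quantity, assuming the per-node item probabilities, per-cluster level probabilities, and cluster prior are already available as arrays (filled with toy values here):

```python
import numpy as np

def doc_log_likelihood(item_probs, level_probs, cluster_prior):
    """log p(D|d) = log sum_c p(c) * prod_i sum_l p(i|l,c) p(l|c,d).

    item_probs[c, l, i]  p(item i | level l, cluster c), for the observed items
    level_probs[c, l]    p(level l | cluster c, document d)
    cluster_prior[c]     p(cluster c)
    """
    per_item = np.einsum('cli,cl->ci', item_probs, level_probs)   # sum over levels
    log_per_cluster = np.log(per_item).sum(axis=1) + np.log(cluster_prior)
    m = log_per_cluster.max()                                     # log-sum-exp over clusters
    return m + np.log(np.exp(log_per_cluster - m).sum())

C, L, I = 4, 3, 6                                                 # toy sizes
rng = np.random.default_rng(1)
print(doc_log_likelihood(rng.random((C, L, I)),
                         rng.dirichlet(np.ones(L), size=C),
                         np.full(C, 1 / C)))
```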
Gaussian Distributions • Features describing aspects of size, position, color, texture and shape are collected into a feature vector X. • The probability distribution over image segments is given by the usual Gaussian density, with a mean and covariance per node: p(x) = exp(-(x - μ)ᵀ Σ⁻¹ (x - μ) / 2) / ((2π)^(d/2) |Σ|^(1/2))
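For reference, a short Python function that evaluates this Gaussian density in log space; the feature ordering and the identity covariance in the demo are only illustrative assumptions.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """log p(x) for the multivariate Gaussian
    exp(-(x - mu)^T Sigma^-1 (x - mu) / 2) / ((2*pi)^(d/2) |Sigma|^(1/2))."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Toy feature vector standing in for size, position, color, texture and shape.
x = np.array([0.3, 0.5, 0.5, 0.2, 0.7, 0.1])
print(gaussian_log_density(x, np.zeros(6), np.eye(6)))
```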
Modeling Image Dataset Statistics • A hierarchical model is used because it best supports • Browsing of large collections of images • A compact representation • Implementation details are provided for avoiding over-training. • The training procedure clusters a few thousand images in a few hours on a state-of-the-art PC.
Modeling Image Dataset Statistics • Resource requirements such as memory increase rapidly with the number of images, so extra care is needed. • There are different approaches for avoiding over-training and excessive resource usage.
First Approach • Train on a randomly selected subset of the images until the log likelihood for held-out data, randomly selected from the remaining data, begins to drop. • The model so found is used as the starting point for the next training round, which uses a second random set of images.
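A toy sketch of this early-stopping loop, using scikit-learn's GaussianMixture as a stand-in for the hierarchical model (warm_start carries each round's model into the next); the data, subset size, and component count are arbitrary placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 8))        # stand-in for document features
held_out, pool = data[:500], data[500:]  # randomly selected held-out set

model = GaussianMixture(n_components=10, warm_start=True, max_iter=20)

best_ll = -np.inf
for round_no in range(20):
    subset = pool[rng.choice(len(pool), size=1000, replace=False)]
    model.fit(subset)                    # one training round on a random subset
    ll = model.score(held_out)           # mean held-out log likelihood
    if ll < best_ll:                     # stop once held-out likelihood drops
        break
    best_ll = ll
print(round_no, best_ll)
```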
Second Approach • The second method for reducing resource usage is to limit cluster membership. • First compute an approximate clustering by training on a subset. • Then cluster the entire dataset, maintaining the membership probabilities of each point only for its top twenty clusters. • The remaining membership probabilities are assumed to be zero for the next few iterations (see the sketch below).
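A minimal sketch of this membership-limiting step, assuming the EM responsibilities are available as a points-by-clusters array; only each point's top twenty probabilities are kept and renormalized, and the zeros are then held fixed for a few iterations.

```python
import numpy as np

def sparsify_memberships(resp, keep=20):
    """Zero all but the `keep` largest cluster-membership probabilities
    of each point, then renormalize the rows."""
    resp = resp.copy()
    order = np.argsort(resp, axis=1)          # ascending per point
    drop = order[:, :-keep]                   # everything outside the top `keep`
    np.put_along_axis(resp, drop, 0.0, axis=1)
    return resp / resp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
resp = rng.dirichlet(np.ones(64), size=5)     # toy memberships over 64 clusters
print((sparsify_memberships(resp) > 0).sum(axis=1))   # at most 20 nonzero per point
```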
Testing and Using the Basic Model • Stability is tested by running the fitting process on the same data several times with different initial conditions, since the Expectation Maximization (EM) process is sensitive to the starting point. • The clustering found depends more on the starting point than on the exact images chosen for training. • The second test is to verify whether clustering on both image and text has an advantage.
Testing and Using the Basic Model This figure shows 16 images from a cluster found using text only
Testing and Using the Basic Model This figure shows 16 images from a cluster found using only image features
Browsing • Most image retrieval systems do not support browsing. • They force user to specify a Query. • The issue is whether the clusters found through browsing make sense to the user. • If the user finds the clusters coherent then they can begin to internalize the kind of structure they represent.
Browsing • User study • Generate 64 clusters from 3000 images. • Generate 64 random clusters from the same images. • Present a cluster to the user and ask whether it is coherent (yes/no). • 94% accuracy
Image Search • Supply a combination of text and image features. • Approach: compute, for each candidate image, the probability of emitting the query items (a ranking sketch follows). • Q = set of query items, d = candidate document.
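A hedged Python sketch of that ranking step, reusing the cluster/level structure of the generative model; the candidate names (img_001 and so on), array shapes, and toy parameters are purely illustrative.

```python
import numpy as np

def query_score(query_items, item_probs, level_probs, cluster_weights):
    """p(Q|d): how probable it is that candidate document d emits the query items.

    item_probs[c, l, i]   p(item i | level l, cluster c)
    level_probs[c, l]     p(level l | cluster c, d)
    cluster_weights[c]    cluster weight for document d
    """
    per_item = np.einsum('cli,cl->ci', item_probs[:, :, query_items], level_probs)
    return float(np.sum(cluster_weights * np.prod(per_item, axis=1)))

def rank_images(query_items, candidates):
    """Sort candidate documents by how probable the query is under each."""
    return sorted(((query_score(query_items, *params), name)
                   for name, params in candidates), reverse=True)

rng = np.random.default_rng(2)
C, L, V = 4, 3, 30
item_probs = rng.random((C, L, V))                        # shared emission model
candidates = [(name, (item_probs,
                      rng.dirichlet(np.ones(L), size=C),  # document-specific levels
                      rng.dirichlet(np.ones(C))))         # document-specific clusters
              for name in ["img_001", "img_002", "img_003"]]
print(rank_images([5, 17], candidates))                   # e.g. indices of "river", "tiger"
```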
Image Search The figure shows the results of the “river” and “tiger” query.
Image Search • Second approach • Find the probability that each cluster generates the query, then sample images according to the weighted clusters. • Because cluster membership plays an important role in generating documents, the clusters can be regarded as coherent.
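A small sketch of this cluster-weighted sampling, assuming the per-cluster query probabilities have already been computed; the cluster contents and weights are toy values.

```python
import numpy as np

def sample_by_cluster(query_prob_per_cluster, images_per_cluster, n_results=4, seed=0):
    """Weight each cluster by the probability that it generates the query,
    then draw result images from clusters in proportion to those weights."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(query_prob_per_cluster, dtype=float)
    weights /= weights.sum()
    results = []
    for _ in range(n_results):
        c = rng.choice(len(weights), p=weights)    # pick a cluster by weight
        results.append(rng.choice(images_per_cluster[c]))
    return results

clusters = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]   # hypothetical image ids
print(sample_by_cluster([0.6, 0.3, 0.1], clusters))
```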
Image Search • Providing a more flexible method of specifying image features is an important next step, as explored in many "query by example" image retrieval systems. • Example: we can query for a dog with the word DOG, and if we also want blue sky we can add an image segment feature to the query.
Pictures from Words and Words from Pictures • There are two approaches for linking words to pictures and pictures to words: • Auto Illustration • Auto Annotation
Auto Illustration • "Auto illustration" – the process of finding pictures to illustrate given words. • Given a set of query items Q and a candidate document d, the probability that the document produces the query is: p(Q | d) = Σ_c p(c | d) ∏_{i ∈ Q} Σ_l p(i | l, c) p(l | c, d)
Auto Annotation • Generate words for a given image. • Consider the probability of the image belonging to the current cluster. • Consider the probability of the items in the image being generated by the nodes at the various levels on the path associated with that cluster. • Work the above out for all clusters and combine.
Auto Annotation • We compute the probability that an image emits a proposed word w, given the observed segments B: p(w | B) ∝ Σ_c p(c | B) Σ_l p(w | l, c) p(l | c), where p(c | B) ∝ p(c) ∏_{b ∈ B} Σ_l p(b | l, c) p(l | c).
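A Python sketch of this computation, assuming per-node word distributions, a per-node blob density function, level probabilities, and a cluster prior are given; the demo values are toy numbers only.

```python
import numpy as np

def annotate(blob_feats, word_probs, blob_density, level_probs, cluster_prior, top_k=5):
    """Rank words by p(w|B): weight each cluster by how well it explains the
    observed blobs B, then average that cluster's word emissions over levels.

    word_probs[c, l, w]    p(word w | level l, cluster c)
    blob_density(c, l, x)  p(blob features x | level l, cluster c)
    """
    C, L, _ = word_probs.shape
    log_pc = np.log(cluster_prior).copy()          # p(c|B) up to a constant
    for x in blob_feats:
        per_cluster = np.array([sum(level_probs[c, l] * blob_density(c, l, x)
                                    for l in range(L)) for c in range(C)])
        log_pc += np.log(per_cluster)
    p_c = np.exp(log_pc - log_pc.max())
    p_c /= p_c.sum()
    p_w = np.einsum('c,clw,cl->w', p_c, word_probs, level_probs)  # p(w|B)
    return np.argsort(p_w)[::-1][:top_k]           # indices of the top-ranked words

rng = np.random.default_rng(3)
C, L, V, D = 3, 2, 20, 4
word_probs = rng.dirichlet(np.ones(V), size=(C, L))
level_probs = rng.dirichlet(np.ones(L), size=C)
means = rng.normal(size=(C, L, D))
density = lambda c, l, x: float(np.exp(-0.5 * np.sum((x - means[c, l]) ** 2)))
print(annotate(rng.normal(size=(3, D)), word_probs, density, level_probs, np.full(C, 1 / C)))
```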
Auto Annotation • The figure shows some annotation results: the original image, the Blobworld segmentation, the Corel keywords, and the predicted words in rank order.
Auto Annotation • The test images were not in the training set, but they come from the same set of CDs used for training. • The keywords in upper case are in the vocabulary.
Auto Annotation • Testing the annotation procedure: • Use the model to predict an image's words based only on its segments, then compare the predicted words with the actual keywords. • Perform the test on the training data and on two different test sets: • 1st set – a randomly selected held-out set from the proposed training data, coming from the same Corel CDs. • 2nd set – images from other CDs.
Auto Annotation • Quantitative performance • Use 160 Corel CDs, each with 100 images (grouped by theme). • Select 80 of the CDs and split them into training (75%) and test (25%) sets. • The remaining 80 CDs form a 'harder' test set. • Model scoring: n = number of words for the image, r = number of words predicted correctly.
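As a rough illustration of the n and r bookkeeping only (the paper's three scoring methods, including the comparison against sampling from the word prior, differ in detail), a minimal hypothetical scoring function:

```python
def annotation_score(predicted, actual):
    """Toy score: fraction of the image's n actual keywords that appear
    among the predicted words (r correct out of n)."""
    n = len(actual)
    r = len(set(predicted) & set(actual))
    return r / n if n else 0.0

print(annotation_score(["sky", "water", "tiger"], ["tiger", "grass", "water", "trees"]))
```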
Results Annotation results on three kinds of test data, with three different scoring methods.
Results • The table above summarizes the annotation results using the three scoring methods and the three held-out sets. • The results of 5 separate runs with different held-out sets are averaged. • Using the comparison against sampling from the word prior, the model scores 3.14 on the training data, 2.70 on non-training data from the same CDs as the training data, and 1.65 on test data taken from a completely different set of CDs.
Discussion • The performance of the system can be measured by taking advantage of its predictive capabilities. • Words with no relevance to visual content act as random noise, taking probability away from more relevant words. • Such words can be removed by observing that their emission probabilities are spread out over the nodes. • How well this automatic vocabulary reduction works depends on the nature of the data set.
References • Kobus Barnard and David Forsyth, "Learning the Semantics of Words and Pictures", Computer Science Division, University of California, Berkeley. http://www.wisdom.weizmann.ac.il/~vision/courses/2003_2/barnard00learning.pdf • C. Carson, S. Belongie, H. Greenspan and J. Malik, "Blobworld: Image segmentation using Expectation-Maximization and its application to image querying", in review. http://www.cs.berkeley.edu/~malik/papers/CBGM-blobworld.pdf