Learning Semantics of Words and Pictures Tejaswi Devarapalli
Content • Introduction • Modeling Image Dataset Statistics • Hierarchical Model • Testing and Using the Basic Model • Auto Illustration • Auto Annotation • Results • Discussion
Semantics • Language uses a system of linguistic signs, each of which is a combination of meaning and phonological and/or orthographic forms. • Semantics is traditionally defined as the study of meaning in language.
Abstract • A statistical model for organizing image collections. • Integrates semantic information provided by associated text with visual information provided by image features. • A promising model for information retrieval tasks such as database browsing and searching for images. • Also enables novel applications.
Introduction • A method for organizing image databases. • Integrates two kinds of information (text and image features) during model construction. • Learns links between image features and semantics. • What it learns is useful for: • Better browsing • Better search • Novel applications
Introduction (continued) • Models statistics about the occurrence and co-occurrence of words and image features. • Has a hierarchical structure. • Is a generative model, so it implicitly contains processes for predicting image components (words and features).
Comparison • This model supports browsing for image retrieval. • Other systems for searching image databases rely on a query, which may use: • Text • Image feature similarity • Segment features • An image sketch
Modeling Image Dataset Statistics • Generative hierarchical model • A combination of: • an asymmetric clustering model (maps documents into clusters) • a symmetric clustering model (models the joint distribution of documents and features). • The data are modeled as a fixed hierarchy of nodes. • Nodes generate words and image segments.
Illustration • Documents are modeled as a sequence of words and a sequence of image segments, using the Blobworld representation. • The Blobworld representation is created by clustering pixels in a joint color-texture-position feature space. • A document is modeled by a sum over the clusters, so every cluster is taken into consideration.
Hierarchical Model • Higher-level nodes emit more general words and blobs (e.g., sky). • Mid-level nodes emit moderately general words and blobs (e.g., sun, sea). • Lower-level nodes emit more specific words and blobs (e.g., waves). • Each node has a probability of generating each word or blob, weighted for the document under consideration. • The cluster defines the path from the root to a leaf. • The cluster and the level together identify a node. [Figure: example hierarchy with nodes labeled Sky, Sun, Sea, Waves]
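To make "the cluster defines the path, cluster and level identify the node" concrete, here is a minimal sketch. It assumes a binary tree, which the slides do not specify, with clusters as leaves numbered 0 to 2^(depth-1) - 1. `path_nodes` and `shared_nodes` are hypothetical helper names.

```python
def path_nodes(cluster: int, depth: int) -> list[tuple[int, int]]:
    """Return (level, node_index) pairs from the root (level 0) down to the
    leaf for `cluster` in a hypothetical binary hierarchy with `depth` levels."""
    n_leaves = 2 ** (depth - 1)
    if not 0 <= cluster < n_leaves:
        raise ValueError(f"cluster must be in [0, {n_leaves})")
    # Level l has 2**l nodes. Dropping the low bits of the leaf index
    # gives that leaf's ancestor at level l.
    return [(level, cluster >> (depth - 1 - level)) for level in range(depth)]


def shared_nodes(c1: int, c2: int, depth: int) -> list[tuple[int, int]]:
    """Nodes on both clusters' paths. The root is always shared, which is why
    general words such as 'sky' can be emitted by many clusters."""
    return [a for a, b in zip(path_nodes(c1, depth), path_nodes(c2, depth)) if a == b]


print(path_nodes(5, 4))       # [(0, 0), (1, 1), (2, 2), (3, 5)]
print(shared_nodes(4, 5, 4))  # [(0, 0), (1, 1), (2, 2)]
```

Neighbouring clusters share every node except the leaf. This matches the idea that specific words (waves) live low in the tree and general ones (sky) live near the root.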
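Mathematical Process • The process for generating the set of observations D associated with a document d is described by the following, where c indexes clusters, i items and l levels: $P(D \mid d) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c)\, P(l \mid c, d)$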
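The generative process can be evaluated directly. This is a minimal sketch assuming our reading of the model: P(D|d) = sum over c of P(c) times the product over items i in D of sum over l of P(i|l,c) P(l|c,d). Items are treated as discrete ids for simplicity. The dictionary layout and the name `doc_likelihood` are hypothetical, and a node shared by several clusters (such as the root) simply repeats its values under each cluster key.

```python
def doc_likelihood(items, p_cluster, p_item_given_node, p_level_given_doc):
    """P(D|d) = sum_c P(c) * prod_{i in D} sum_l P(i|l,c) * P(l|c,d).

    items:              observed items (words or blob ids) of document d
    p_cluster[c]:       prior P(c)
    p_item_given_node:  {(item, level, cluster): P(i|l,c)}  (hypothetical layout)
    p_level_given_doc:  p_level_given_doc[c][l] = P(l|c,d) for this document
    """
    total = 0.0
    for c, pc in enumerate(p_cluster):
        prod = 1.0
        for item in items:
            # Each item is emitted by some node on cluster c's path,
            # so mix the nodes using the document's level weights.
            prod *= sum(p_item_given_node.get((item, l, c), 0.0) * pl
                        for l, pl in enumerate(p_level_given_doc[c]))
        total += pc * prod
    return total


p_cluster = [0.5, 0.5]
p_node = {("sky", 0, 0): 0.6, ("sea", 0, 0): 0.4, ("sky", 1, 0): 0.2, ("sea", 1, 0): 0.8,
          ("sky", 0, 1): 0.6, ("sea", 0, 1): 0.4, ("sky", 1, 1): 0.9, ("sea", 1, 1): 0.1}
levels = [[0.5, 0.5], [0.5, 0.5]]
print(doc_likelihood(["sky", "sea"], p_cluster, p_node, levels))
```

The outer sum over clusters is what "the document is modeled by a sum over the clusters" means in practice. For long documents the product underflows quickly, so a real implementation would work in log space.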
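Gaussian Distributions • Features describing aspects of size, position, color, texture and shape together form the feature vector X. • The probability distribution over image segments at a node (l, c) is given by the usual Gaussian formula: $P(X \mid l, c) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(X - \mu)^{\top} \Sigma^{-1} (X - \mu)\right)$, where k is the number of features and μ, Σ are the node's mean and covariance.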
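As an illustration of the blob density, here is a sketch that assumes the features are independent, so the covariance is diagonal. This simplification is ours, not something the slide states. The function name `gaussian_log_density` is hypothetical.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log N(x; mean, diag(var)) for one blob feature vector x.

    Assumes independent features (diagonal covariance); `var` holds one
    variance per feature (e.g. size, position, color, texture, shape terms).
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # With a diagonal covariance the density factorizes, so the
        # per-feature 1-D Gaussian log-densities simply add up.
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return log_p


print(math.exp(gaussian_log_density([0.0], [0.0], [1.0])))  # standard normal at 0
```

Returning a log-density avoids underflow when a blob has many features or a document has many blobs.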
Modeling Image Dataset Statistics • The hierarchical model is used because it best supports: • Browsing of large collections of images • A compact representation • Implementation details are provided for avoiding over-training. • The training procedure clusters a few thousand images in a few hours on a state-of-the-art PC.
Modeling Image Dataset Statistics • Resource requirements such as memory increase rapidly with the number of images, so extra care is needed. • There are different approaches for avoiding over-training and reducing resource usage.
First Approach • Train on a randomly selected subset of images until the log likelihood of held-out data, randomly selected from the remaining data, begins to drop. • The model found this way is used as the starting point for the next training round, which uses a second random set of images.
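The training schedule above can be sketched generically. The model interface here is hypothetical: `fit_step(model, subset)` performs one EM iteration and returns a *new* model, and `held_out_loglik(model, held_out)` scores held-out data. The sketch stops a round as soon as held-out likelihood stops improving, which is a slightly stricter reading of "begins to drop". `max_iters` is an added safety guard.

```python
import random


def train_on_subsets(model, data, fit_step, held_out_loglik,
                     subset_size, held_out_size, rounds,
                     max_iters=100, seed=0):
    """Train on successive random subsets with held-out early stopping.

    Hypothetical interface (not from the paper):
      fit_step(model, subset) -> new model        (one EM iteration)
      held_out_loglik(model, held_out) -> float   (log likelihood)
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        shuffled = list(data)
        rng.shuffle(shuffled)
        subset = shuffled[:subset_size]
        # Held-out images come from the data not used for training this round.
        held_out = shuffled[subset_size:subset_size + held_out_size]
        best = held_out_loglik(model, held_out)
        for _ in range(max_iters):
            candidate = fit_step(model, subset)
            score = held_out_loglik(candidate, held_out)
            if score <= best:  # held-out likelihood no longer rises: stop
                break
            model, best = candidate, score
        # The last model before the drop seeds the next round.
    return model
```

A toy usage: `train_on_subsets(0, range(100), lambda m, s: m + 1, lambda m, h: -(m - 5) ** 2, 50, 10, 3)` climbs to 5 and then stops, because the next step would lower the "held-out" score. Because `fit_step` returns a new model, the pre-drop model can be kept without copying.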
Second Approach • A second method for reducing resource usage is to limit cluster membership. • First compute an approximate clustering by training on a subset. • Then cluster the entire dataset, maintaining the probability that a point is in a cluster only for its top twenty clusters. • The remaining membership probabilities are assumed to be zero for the next few iterations.
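Limiting cluster membership can be sketched as follows. For each image, keep only the k most probable clusters (k = 20 on the slide) and treat the rest as zero. Renormalizing the kept probabilities is our assumption. The slide only says that the others are set to zero. `truncate_memberships` is a hypothetical name.

```python
import heapq


def truncate_memberships(probs, k=20):
    """Keep the k largest cluster-membership probabilities of one image,
    drop the rest (treated as zero), and renormalize the kept ones.

    probs: list of P(c | image) over all clusters.
    Returns a sparse dict {cluster_index: probability}.
    """
    top = heapq.nlargest(k, range(len(probs)), key=lambda c: probs[c])
    mass = sum(probs[c] for c in top)
    if mass == 0:
        raise ValueError("all kept memberships are zero")
    # Renormalizing (an assumption) keeps this a proper distribution.
    return {c: probs[c] / mass for c in top}


print(truncate_memberships([0.1, 0.4, 0.2, 0.3], k=2))
```

Storing a small dict per image instead of one entry per cluster is what saves memory. The E-step then only needs to evaluate the kept clusters.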
Testing and Using the Basic Model • Method stability is tested by running the fitting process on the same data several times with different initial conditions, because the Expectation Maximization (EM) process is sensitive to its starting point. • The clustering found depends more on the starting point than on the exact images chosen for training. • The second test verifies whether clustering on both image and text has an advantage over clustering on either one alone.
Testing and Using the Basic Model This figure shows 16 images from a cluster found using text only
Testing and Using the Basic Model This figure shows 16 images from a cluster found using only image features
Browsing • Most image retrieval systems do not support browsing. • They force the user to specify a query. • The key question is whether the clusters found by the model make sense to the user. • If the user finds the clusters coherent, they can begin to internalize the kind of structure the clusters represent.
Browsing • User study • Generate 64 clusters for 3000 images. • Generate 64 random clusters from the same images. • Present the clusters to a user in random order and ask them to rate each for coherence (yes/no). • 94% accuracy
Image Search • Supply a combination of text and image features. • Approach: for each candidate image, compute the probability that it emits the query items. • Q = set of query items, d = candidate document.
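The search step can be sketched for a query that mixes words and image features. The sketch assumes words are scored with discrete node probabilities and blobs with diagonal Gaussians. Candidates are ranked by P(Q|d) = sum over c of P(c) times the product over q in Q of sum over l of P(q|l,c) P(l|c,d). The data layout and all function names are hypothetical.

```python
import math


def item_prob(item, level, cluster, word_probs, blob_gaussians):
    """Probability (words) or density (blobs) of one query item at node (level, cluster).

    item is ("word", w) or ("blob", feature_vector).
    word_probs[(w, l, c)]   : P(w|l,c)
    blob_gaussians[(l, c)]  : (means, variances) of an assumed diagonal Gaussian
    """
    kind, value = item
    if kind == "word":
        return word_probs.get((value, level, cluster), 0.0)
    means, variances = blob_gaussians[(level, cluster)]
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                for x, m, v in zip(value, means, variances))
    return math.exp(log_p)


def query_prob(query, p_cluster, level_weights, word_probs, blob_gaussians):
    """P(Q|d) = sum_c P(c) prod_{q in Q} sum_l P(q|l,c) P(l|c,d)."""
    total = 0.0
    for c, pc in enumerate(p_cluster):
        prod = 1.0
        for q in query:
            prod *= sum(item_prob(q, l, c, word_probs, blob_gaussians) * pl
                        for l, pl in enumerate(level_weights[c]))
        total += pc * prod
    return total


def rank_images(query, candidates, p_cluster, word_probs, blob_gaussians):
    """candidates: {image_id: level_weights}, level_weights[c][l] = P(l|c,d).
    Returns image ids sorted by decreasing P(Q|d)."""
    scores = {img: query_prob(query, p_cluster, lw, word_probs, blob_gaussians)
              for img, lw in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A query such as `[("word", "tiger"), ("blob", [0.0])]` combines a keyword with an image-segment feature in the same product. This is how text and image features can be supplied together.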
Image Search The figure shows the results of the “river” and “tiger” query.
Image Search • Second approach • Find the probability that each cluster generates the query, then sample images according to the weighted clusters. • Because cluster membership plays an important role in generating documents, we can say the clusters are coherent.
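The second approach can be sketched in two steps. First, weight each cluster by P(Q|c). Then draw results by picking a cluster according to its weight and an image from that cluster. The per-cluster level weights P(l|c), the member lists, and uniform sampling within a cluster are assumptions for illustration. A cluster prior P(c) could also be multiplied in. Function names are hypothetical.

```python
import random


def cluster_weights(query, clusters, p_item_given_node, p_level_given_cluster):
    """Weight clusters by P(Q|c) = prod_{q in Q} sum_l P(q|l,c) P(l|c),
    normalized so the weights sum to 1."""
    raw = {}
    for c in clusters:
        prod = 1.0
        for q in query:
            prod *= sum(p_item_given_node.get((q, l, c), 0.0) * pl
                        for l, pl in enumerate(p_level_given_cluster[c]))
        raw[c] = prod
    z = sum(raw.values())
    if z == 0:
        raise ValueError("no cluster can generate this query")
    return {c: w / z for c, w in raw.items()}


def sample_results(weights, members, n, seed=0):
    """Draw n images: choose a cluster with probability equal to its weight,
    then choose uniformly among that cluster's (hypothetical) member images."""
    rng = random.Random(seed)
    clusters = list(weights)
    picks = rng.choices(clusters, weights=[weights[c] for c in clusters], k=n)
    return [rng.choice(members[c]) for c in picks]
```

Sampling by cluster, rather than sorting individual images, returns groups of images that the model considers related. This is where cluster coherence pays off.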
Image Search • Providing a more flexible method of specifying image features is an important next step. • This has been explored in many "query by example" image retrieval systems. Example: we can query for a dog with the word DOG, and if we also want blue sky, we can add an image segment feature to the query.
Pictures from Words and Words from Pictures • There are two approaches for linking words to pictures and pictures to words: • Auto Illustration • Auto Annotation
Auto Illustration • “Auto illustration” is the process of linking pictures to words. • Given a set of query items Q and a candidate document d, we can express the probability that the document produces the query by: $P(Q \mid d) = \sum_c P(c) \prod_{q \in Q} \sum_l P(q \mid l, c)\, P(l \mid c, d)$
Auto Annotation • Generate words for a given image. • Consider the probability of the image belonging to each cluster. • Consider the probability of the items in the image being generated by the nodes at various levels on the path associated with that cluster. • Sum these contributions over all clusters.
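Auto Annotation • We compute the probability that an image emits a proposed word w, given the observed segments B: $P(w \mid B) = \sum_c P(w \mid c, B)\, P(c \mid B)$, where $P(c \mid B) \propto P(B \mid c)\, P(c)$ and $P(w \mid c, B) = \sum_l P(w \mid l, c)\, P(l \mid c, B)$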
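The annotation procedure can be sketched as scoring every vocabulary word by P(w|B), which is proportional to the sum over c of P(w|c,B) P(B|c) P(c), and then sorting. Two simplifications are ours. Blobs are treated as discrete ids instead of Gaussian feature vectors. A fixed per-cluster level weight stands in for P(l|c,B). All names and the data layout are hypothetical.

```python
def annotate(blobs, vocab, p_cluster, p_word, p_blob, p_level):
    """Rank words by P(w|B) ∝ sum_c P(w|c,B) P(B|c) P(c).

    p_word[(w, l, c)] : P(w|l,c)
    p_blob[(b, l, c)] : P(b|l,c)   (blobs as discrete ids, a simplification)
    p_level[c][l]     : level weights for cluster c (stand-in for P(l|c,B))
    Returns (words sorted by probability, {word: normalized probability}).
    """
    scores = {w: 0.0 for w in vocab}
    for c, pc in enumerate(p_cluster):
        # P(B|c): each observed blob is emitted by some node on c's path.
        p_b_given_c = 1.0
        for b in blobs:
            p_b_given_c *= sum(p_blob.get((b, l, c), 0.0) * pl
                               for l, pl in enumerate(p_level[c]))
        for w in vocab:
            p_w_given_c = sum(p_word.get((w, l, c), 0.0) * pl
                              for l, pl in enumerate(p_level[c]))
            scores[w] += p_w_given_c * p_b_given_c * pc
    z = sum(scores.values())
    if z > 0:
        # Dividing by the total turns the proportional scores into P(w|B).
        scores = {w: s / z for w, s in scores.items()}
    return sorted(vocab, key=lambda w: scores[w], reverse=True), scores
```

The predicted words in rank order are the first element of the returned pair. Clusters whose nodes explain the observed blobs well dominate the ranking.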
Auto Annotation The figure shows some annotation results: the original image, the Blobworld segmentation, the Corel keywords, and the predicted words in rank order.
Auto Annotation • The test images were not in the training set, but they come from the same set of CDs used for training. • Keywords in upper case are in the vocabulary.
Auto Annotation • Testing the annotation procedure: • We use the model to predict the image words based only on the segments, then compare the predicted words with the actual keywords. • We perform the test on the training data and on two different test sets: • 1st set: a randomly selected held-out set from the proposed training data, coming from the same Corel CDs. • 2nd set: images from other CDs.
Auto Annotation • Quantitative performance • Use 160 Corel CDs, each with 100 images (grouped by theme). • Select 80 of the CDs and split their images into training (75%) and test (25%) sets. • The remaining 80 CDs form a 'harder' test set. • Model scoring: n = number of words for the image, r = number of words predicted correctly.
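Using the definitions of n and r above, one simple score predicts as many words as the image has (the top n), counts how many are correct (r), and averages r/n over a test set. This is a hypothetical illustration built from those definitions. It is not necessarily any of the three scoring methods reported in the results table. Function names are hypothetical.

```python
def top_n_hits(predicted_ranked, true_words):
    """Take the top n predicted words, where n = number of true words for the
    image, and return (r, n), with r = how many of them are correct."""
    n = len(true_words)
    r = len(set(predicted_ranked[:n]) & set(true_words))
    return r, n


def average_fraction_correct(results):
    """Mean of r/n over a test set; results is a list of
    (ranked_predictions, true_words) pairs. Images with no words are skipped."""
    fractions = []
    for ranked, truth in results:
        r, n = top_n_hits(ranked, truth)
        if n:
            fractions.append(r / n)
    return sum(fractions) / len(fractions)
```

For example, predictions `["sky", "water", "tiger"]` against keywords `["tiger", "sky"]` give n = 2 and r = 1, since only "sky" is among the top two.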
Results Annotation results on three kinds of test data, with three different scoring methods.
Results • The table above summarizes the annotation results using the three scoring methods and the three held-out sets. • We average the results of 5 separate runs with different held-out sets. • Using the comparison with sampling from the word prior, we score 3.14 on the training data, 2.70 on non-training data from the same CD set as the training data, and 1.65 on test data taken from a completely different set of CDs.
Discussion • Performance of the system can be measured by taking advantage of its predictive capabilities. • Words with no relevance to visual content act as random noise, taking probability away from more relevant words. • Such words can be removed by observing that their emission probabilities are spread out over the nodes. • How well this automatic vocabulary reduction method works depends on the nature of the data set.
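One way to quantify "emission probabilities spread out over the nodes" is the normalized entropy of each word's emission probabilities across nodes. Words near 1.0 are spread almost uniformly and become removal candidates. Entropy as the spread measure and the 0.9 threshold are our assumptions, not the method stated in the slides. The function names are hypothetical.

```python
import math


def spread(node_probs):
    """Normalized entropy (0..1) of a word's emission probabilities across
    nodes: 0 = concentrated in one node, 1 = spread evenly over all nodes."""
    total = sum(node_probs)
    if total == 0 or len(node_probs) < 2:
        return 0.0
    h = -sum((p / total) * math.log(p / total) for p in node_probs if p > 0)
    return h / math.log(len(node_probs))


def non_visual_words(emissions, threshold=0.9):
    """emissions: {word: [P(w|node) for each node]}. Returns, sorted, the words
    whose emission is spread out enough to be treated as removal candidates."""
    return sorted(w for w, probs in emissions.items() if spread(probs) >= threshold)


print(non_visual_words({"the": [0.25, 0.25, 0.25, 0.25],
                        "tiger": [0.97, 0.01, 0.01, 0.01]}))  # ['the']
```

Normalizing by log(number of nodes) makes the threshold independent of the hierarchy size. Whether a given threshold works well still depends on the data set, as noted above.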
References • K. Barnard and D. Forsyth, "Learning Semantics of Words and Pictures", Computer Science Division, University of California, Berkeley. http://www.wisdom.weizmann.ac.il/~vision/courses/2003_2/barnard00learning.pdf • C. Carson, S. Belongie, H. Greenspan and J. Malik, "Blobworld: Image segmentation using Expectation-Maximization and its application to image querying", in review. http://www.cs.berkeley.edu/~malik/papers/CBGM-blobworld.pdf