Matching Words and Pictures Rose Zhang Ryan Westerman
Why do we care? • Users make requests based on image semantics, but most technology at the time fails to categorize images by the objects they contain • Semantics of images are requested in different ways • Requests by object kind (princess) and by identity (Princess of Wales) • Requests for things that are visible in images and for what images are about • Users don't really care about histograms or textures • Useful in practice, e.g., newspaper archiving
Proposed Applications • Automated Image Annotation • Allows categorization of images in large image archives • Browsing Support • Facilitates organizing collections of similar images for easier browsing • Auto-Illustration • Automatically provide an image based on descriptive text
Hierarchical Aspect Model • A generative approach to clustering documents • Clusters appear as boxes at the bottom of this figure. • A path from root to leaf above a cluster signifies the words most likely to be found in a document belonging to that cluster. • The words at the leaf node are likely unique to these documents, compared to words at the root node which are shared across all clusters. Image from T. Hofmann, Learning and representing topic. A hierarchical mixture model for word occurrence in document databases.
Multi-Modal Hierarchical Aspect Model • Generates words to cluster images instead of documents • Higher level nodes emit more generally applicable words • Clusters represent groupings of annotations for images • Model is trained using Expectation Maximization
Generating words from pictures Legend: c = cluster indices; w = words in document (image) d; b = image region (blob) indices in d; l = abstraction level; D = set of observations for d; B = set of blobs for d; W = set of words for d, where D = B ∪ W; exponents normalize the differing numbers of words and blobs in each image. • This model generates a set of observations D, associated with document d • Relies on documents specific to the training set • Good for search, bad for documents not in training set • Would have to refit model for new document
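A hedged sketch of the document-conditional likelihood described above, using the slide's legend; the normalizing exponents are written generically as α_W and α_B since their exact values are not shown on the slide:

```latex
% Sketch of p(D|d) for the multi-modal hierarchical model (form approximate).
% alpha_W and alpha_B stand in for the exponents that normalize word vs. blob counts.
p(D \mid d) \;\approx\; \sum_{c} p(c)
  \Bigl[ \prod_{w \in W} \sum_{l} p(w \mid l, c)\, p(l \mid d) \Bigr]^{\alpha_W}
  \Bigl[ \prod_{b \in B} \sum_{l} p(b \mid l, c)\, p(l \mid d) \Bigr]^{\alpha_B}
```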
What about documents not in the training set? • Make the model fully generative by removing the dependence on d from the equation • Replacing d with c does not significantly decrease the quality of results • Makes the equation simpler: compute a cluster-dependent average during training rather than calculating a level distribution for each document • Saves memory when the number of documents is large
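Under the same hedged notation, replacing the document-specific level distribution p(l|d) with a cluster-dependent one gives a model that applies to unseen documents:

```latex
% Same sketch with p(l|d) replaced by the cluster-dependent average p(l|c).
p(D) \;\approx\; \sum_{c} p(c)
  \Bigl[ \prod_{w \in W} \sum_{l} p(w \mid l, c)\, p(l \mid c) \Bigr]^{\alpha_W}
  \Bigl[ \prod_{b \in B} \sum_{l} p(b \mid l, c)\, p(l \mid c) \Bigr]^{\alpha_B}
```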
Image based word prediction • Assume a new document with blob set B • Because this applies to documents outside the training set, the prediction equation is based on the cluster-dependent model from the previous slide (see the sketch below)
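One plausible form of the word-prediction rule for a new image with blob set B, consistent with the cluster-conditional model above (a sketch, not necessarily the paper's exact equation):

```latex
% Predict words by averaging cluster-conditional word distributions,
% weighting each cluster by how well it explains the observed blobs.
p(w \mid B) \;\approx\; \sum_{c} \Bigl[ \sum_{l} p(w \mid l, c)\, p(l \mid c) \Bigr]\, p(c \mid B),
\qquad
p(c \mid B) \;\propto\; p(c) \prod_{b \in B} \sum_{l} p(b \mid l, c)\, p(l \mid c)
```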
Multi-Modal Dirichlet Allocation Process • Choose one of J mixture components c ∼ Multinomial(η). • Conditioned on c, choose a mixture over K factors, θ ∼ Dir(αc). • For each of the N words: • Choose one of K factors zn ∼ Multinomial(θ). • Choose one of V words wn from p(wn | zn, c). • For each of the M blobs: • Choose a factor sm ∼ Multinomial(θ). • Choose a blob bm from a Gaussian distribution conditioned on sm and c.
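A minimal Python sketch of the generative process above. All dimensions and parameter values are made up for illustration, and the blob covariance is assumed to be the identity, which the slide does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): J clusters, K factors, V words, blob feature size
J, K, V, FEAT_DIM = 3, 4, 155, 6
N_WORDS, M_BLOBS = 5, 4

# Hypothetical, randomly initialized parameters standing in for a trained MoM-LDA
eta = rng.dirichlet(np.ones(J))                       # p(c): mixture over components
alpha = rng.gamma(2.0, 1.0, size=(J, K))              # Dirichlet parameters alpha_c
word_probs = rng.dirichlet(np.ones(V), size=(J, K))   # p(w | z, c)
blob_means = rng.normal(size=(J, K, FEAT_DIM))        # Gaussian means per (c, factor)

def sample_image():
    c = rng.choice(J, p=eta)                # 1. choose a mixture component
    theta = rng.dirichlet(alpha[c])         # 2. choose a mixture over the K factors
    words = []
    for _ in range(N_WORDS):                # 3. words: pick factor z_n, then word w_n
        z = rng.choice(K, p=theta)
        words.append(rng.choice(V, p=word_probs[c, z]))
    blobs = []
    for _ in range(M_BLOBS):                # 4. blobs: pick factor s_m, then Gaussian draw
        s = rng.choice(K, p=theta)
        blobs.append(rng.normal(blob_means[c, s], 1.0))  # identity covariance assumed
    return c, words, blobs

print(sample_image())
```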
Predictions using MoM-LDA • Given an image and a MoM-LDA, we can • Compute an approximate posterior over mixture components, φ • Compute an approximate Dirichlet over factors, γ • Using the formula below, we calculate the distribution over words given an image, where J is the number of mixture components and K the number of factors
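A hedged rendering of the kind of formula the slide refers to, with φ_j the approximate posterior on mixture component j and γ_j the approximate Dirichlet over factors for that component (this notation and the exact averaging are assumptions):

```latex
% Word distribution for an image under MoM-LDA (sketch):
% average p(w | z=k, c=j) over factors via the mean of the variational Dirichlet gamma_j,
% then over mixture components via the variational posterior phi.
p(w \mid \text{image}) \;\approx\; \sum_{j=1}^{J} \phi_j
   \sum_{k=1}^{K} \frac{\gamma_{jk}}{\sum_{k'} \gamma_{jk'}}\; p(w \mid z = k, c = j)
```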
Simple Correspondence Models Remember: • Predict words for specific regions instead of the entire image • Discrete translation: match a word to a blob using a joint probability table • Hierarchical clustering • Use the hierarchical model, but applied to individual blobs instead of whole images • See the equation sketched below • But: discrete translation purposely ignores potentially useful training data, and hierarchical clustering uses data that the model was not trained to represent
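The per-region analogue of the image-level prediction, which is presumably what "see equation" points to (a sketch under the same notation as before, conditioning on a single blob b instead of the whole set B):

```latex
% Word prediction for a single region/blob b (sketch):
p(w \mid b) \;\approx\; \sum_{c} \Bigl[ \sum_{l} p(w \mid l, c)\, p(l \mid c) \Bigr]\, p(c \mid b),
\qquad
p(c \mid b) \;\propto\; p(c) \sum_{l} p(b \mid l, c)\, p(l \mid c)
```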
Integrated Correspondence and Hierarchical Clustering • Strategy 1: if a node contributes little to the image region, then it also contributes little to the word • Change the p(D|d) equation (from the beginning) to account for how blobs affect words (one possible form is sketched below) • Can alter p(l|d) to p(l|c) • Apply the simple correspondence equation
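One plausible reading of Strategy 1 (an assumption on my part, not an equation taken from the slide): emit words using a level distribution averaged over the blobs' level responsibilities, so a level that explains little of the image regions also contributes little to the words:

```latex
% Hypothetical form for Strategy 1: word emission driven by blob-induced levels.
p(w \mid B, c) \;\approx\; \sum_{l} p(w \mid l, c)\, \bar{p}(l \mid B, c),
\qquad
\bar{p}(l \mid B, c) \;=\; \frac{1}{|B|} \sum_{b \in B} p(l \mid b, c)
```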
Integrated Correspondence: Strategy 2 • Strategy 2: pair each word with a region • Need to change the training algorithm to pair w and b for the p(w,b) calculation • Changing the training algorithm: add graph matching • Create a bipartite graph with words on one side and image regions on the other, with edges weighted by the negative log probabilities from the equation • Find the min-cost assignment by graph matching (sketched below) • Resume the EM algorithm
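A small Python sketch of the min-cost bipartite matching step, using SciPy's Hungarian-style solver; the probability table and its values are hypothetical stand-ins for quantities produced during EM.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_words_with_regions(word_region_probs):
    """Min-cost bipartite matching between words (rows) and image regions (columns).

    word_region_probs[i, j] is a hypothetical p(word_i, blob_j) from the current
    EM iterate; edge weights are the negative log probabilities, as on the slide.
    """
    cost = -np.log(np.clip(word_region_probs, 1e-12, None))  # clip to avoid log(0)
    word_idx, region_idx = linear_sum_assignment(cost)        # min-cost assignment
    return list(zip(word_idx, region_idx))

# Toy usage: 3 words vs. 3 regions with made-up probabilities
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
print(pair_words_with_regions(probs))   # -> [(0, 0), (1, 1), (2, 2)]
```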
Integrated Correspondence: NULL • Sometimes a region has no words, or the numbers of words and regions differ • Assign NULL when even the highest-probability annotation is too low (but what about outliers?) • Tendency for errors where the NULL image region generates every word, or every image region generates the NULL word
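A minimal sketch of the NULL rule described above; the threshold value and word probabilities are assumed, not taken from the paper.

```python
NULL = "<null>"

def label_region(word_probs, threshold=0.1):
    """Return the most probable word for a region, or NULL if even that is too unlikely.

    word_probs: dict mapping word -> p(word | region); threshold is an assumed value.
    """
    best = max(word_probs, key=word_probs.get)
    return best if word_probs[best] >= threshold else NULL

# Toy usage with made-up probabilities
print(label_region({"tiger": 0.04, "grass": 0.05, "water": 0.03}))  # -> "<null>"
print(label_region({"tiger": 0.62, "grass": 0.25, "water": 0.13}))  # -> "tiger"
```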
Experiment • 160 CDs, each with 100 images on a specific subject • Excluded words which occurred <20 times in the test set • Vocabulary of about 155 words • Data split (from the slide's diagram): the 160 CDs are divided into two groups of 80; one group is split into 75% training images and 25% standard held-out images, and the other 80 CDs form the novel held-out set
Evaluating the model Legend: N = # documents; q(w|B) = computed predictive model; p(w) = target distribution; K = # words for the image • Annotation models are evaluated on both well-represented and poorly represented data • Correspondence models assume that poor annotation means poor correspondence; otherwise results would have to be graded manually • Remember: the simple correspondence model is the annotation model applied to individual blobs • Equation 1: negative E_KL means the model is worse than the empirical distribution, positive means better (see the sketch below)
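A hedged sketch of the kind of KL-based measure the slide describes, where the model's prediction q(w|B) is compared against the empirical word distribution q_emp as a baseline; the exact form in the paper may differ:

```latex
% Sketch: average over N documents of how much closer the model's prediction is
% to the target p than the empirical word-frequency baseline is.
E_{KL} \;\approx\; \frac{1}{N} \sum_{d=1}^{N}
  \Bigl[ D_{KL}\bigl(p_d \,\|\, q_{\mathrm{emp}}\bigr)
       - D_{KL}\bigl(p_d \,\|\, q_d(\cdot \mid B)\bigr) \Bigr],
\qquad
D_{KL}(p \,\|\, q) \;=\; \sum_{w} p(w) \log \frac{p(w)}{q(w)}
```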
Evaluating word prediction Legend: N = vocabulary size; n = # words for the image; r = words predicted correctly; w = words predicted incorrectly • Equation 1 • returns 0 if everything or nothing is predicted • 1 for predicting exactly the actual word set • -1 for predicting the complement of the word set • Equation 2: larger values = better
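A formula with exactly the properties listed for Equation 1 (0 for all-or-nothing prediction, 1 for the exact word set, -1 for its complement) is the normalized score below; treat it as a reconstruction rather than a quote from the paper:

```latex
% Normalized word-prediction score (reconstruction):
% r correct out of n actual words, w incorrect out of the N - n remaining words.
E_{NS} \;=\; \frac{r}{n} \;-\; \frac{w}{N - n}
```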
Annotation results • Train model using a subset of training data, then use model as starting point for next set • Held out set: most benefit after 10 iterations • Novel held out data shows inability to generalize • Better to simultaneously learn models for blobs and their linkage to words
Normalized word prediction • "Refuse to predict" level • Designed to handle situations where an annotation does not mention an object • Requires a minimum probability to predict words: P = 10^(-X/10) • Extremes result in predicting everything or predicting nothing
Correspondence Results • Discrete translation did the worst • Paired word-blob emissions did better than annotation-based methods • Dependence of words on blobs performed the best [Example images on the slide illustrate: good annotation but bad correspondence; good results; complete failure]
Experimental Decisions • Using only ⅜ of the available data for training, and separating ½ of the total data for novel testing • Approximates correspondence performance by annotation performance • No absolute scale to compare errors between models or to future results • No true evaluation of correspondence results: the authors did not actually evaluate how well each image region was labeled • Small vocabulary of 155 words means limited applications even with good results
Questionable Evaluation • p(w), the target distribution, is unknown • So the paper assumes p(w) = 1/K for observed words • p(w) = 0 for all other words under this assumption • What is log(0)? Undefined: the corresponding KL term diverges • Could have solved this issue by smoothing p(w) (see the sketch below) Image from: https://courses.lumenlearning.com/waymakercollegealgebra/chapter/characteristics-of-logarithmic-functions/
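A minimal sketch of the smoothing fix suggested above: give every vocabulary word a small amount of probability mass so no term in the KL sum hits log(0). The epsilon value and the toy vocabulary are arbitrary choices for illustration.

```python
import numpy as np

def smoothed_target(observed_words, vocab, eps=1e-3):
    """Smooth the target distribution p(w) instead of using 1/K on observed words only.

    observed_words: words actually annotating the image; vocab: the full word list.
    eps is an assumed smoothing constant, not a value from the paper.
    """
    p = np.full(len(vocab), eps)          # small mass everywhere avoids zeros
    for w in observed_words:
        p[vocab.index(w)] += 1.0          # extra mass on the observed words
    return p / p.sum()                    # renormalize to a proper distribution

# Toy usage
vocab = ["tiger", "grass", "water", "sky"]
print(smoothed_target(["tiger", "grass"], vocab))  # no zero entries -> log is finite
```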
Future Research • Moving from unsupervised input to a semi-supervised model • Research into evaluation methods which don't require manual checking of labeled images • More robust datasets for word/image matching