Semantics of words and images Presented by Gal Zehavi & Ilan Gendelman
What is semantics? • Semantics – “a branch of philosophy dealing with the relations between signs and what they refer to (their meaning)” (Webster)
Content 1) Motivation
Motivation • Better segment aggregation • Image reconstruction • Object recognition (example image labels: waterfall, bear, fish) Motivation
More motivation… • Data mining: Content Based Image Retrieval (CBIR) higher precision / higher recall / quicker search • Applications • Biomedicine (X-ray, Pathology, CT, MRI, …) • Government (radar, aerial, trademark, …) • Commercial (fashion catalog, journalism, …) • Cultural (Museums, art galleries, …) • Education and training • Entertainment, WWW (100 billions?!), … Motivation
Even more motivation… • Auto-annotator (e.g. labelling an image: Ocean, Helicopter, Shark, Man, Bridge, Mountain) • Auto-illustrator (e.g. illustrating Moby Dick) Motivation
Content • Motivation • Introduction to Semantics
Different aspects of semantics… Introduction to Semantics
Specific Objects • The elephant played chess Introduction to Semantics
Family of Objects • The tree stood alone on the hill. • This car is as fast as the wind. Introduction to Semantics
Scenarios • A couple on the beach Easy to imagine (statistically clear) but variety is large… Introduction to Semantics
Semantics from Context The captain was on the bridge. Introduction to Semantics
Abstract Semantics • Vacation • Strength ? • Experience Introduction to Semantics
Content • Motivation • Introduction to Semantics • Difficulties
Difficulties in Text-Image Semantics • Ambiguity (example captions: corals / couple, beach, sunset) • Level of abstraction: one image, captions at many levels, e.g. “men, bottle, liquid” / “drivers, car race, champagne” / “competitors, race, alcohol” / “M. Schumacher, Formula 1, Dom-Perignon” / “celebration, winning, happiness” Difficulties
Content • Motivation • Introduction to Semantics • Difficulties • Possible Approaches
Possible Approaches • Searching images by text only + use existing text search infrastructure + no specific processing - images must be in textual context - missing real image semantics and features • Query by example + relating to user’s visualisation - missing real image semantics • Semantics through user interaction + refined visualisation + higher level of abstraction - a complex user interface is required • Search by image feature similarity + low complexity - missing real image semantics • Search by segment features Possible approaches
Content • Motivation • Introduction to Semantics • Difficulties • Possible Approaches • Models
Models • Object recognition as machine translation • “Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary”, P. Duygulu, K. Barnard, N. de Freitas and D. Forsyth (2002) • Learning semantics by hierarchical clustering of words and image regions • “Learning the Semantics of Words and Pictures”, K. Barnard, D. Forsyth (2001) • “Clustering Art”, K. Barnard, P. Duygulu, D. Forsyth (2001) Models
Content • Motivation • Introduction to Semantics • Difficulties • Possible Approaches • Models • Model #1 : • Object recognition as machine translation
Object recognition as machine translation • Description: learning a lexicon for a fixed image vocabulary Model #1
Our goal is to describe a scene using the lexicon we learned (e.g. region labels: sky, rock, mountain/forest, sea, sand) Model #1
How do we do it? • By applying a method similar to a statistical language translator • Learn word correspondences from many aligned sentence pairs and build statistics from them (e.g. English–Hebrew: “I am a great man” ↔ “אני איש גדול/נהדר”, literally “I am a big/great man”) Model #1
The blob notion • First we segment the image into regions using the Normalized Cuts method Model #1
Assigning Regions to Blobs • How do we assign a region to a blob? 1) Eliminate regions smaller than a threshold 2) Define a set of features 3) Discretise each feature distribution 4) Cluster the finite-dimensional feature vectors 5) Each cluster = a blob Model #1
In the article's experiment • We discretise the 33 different features using the k-means algorithm • The following features are included: region colour, convexity, standard deviation, first moment, region orientation energy, region size, location Model #1
For applying the method to images we need to discretise the image data • Using the Corel database: 371 words and 4500 images, each with 5 to 10 segments • ~35,000 segments clustered into 500 blobs Model #1
The K-means algorithm • Fits the best k centre values to a continuous distribution • Given a distribution of n-dimensional vectors, it creates k clusters Model #1
K-means: input is continuous data, output is discrete (clustered) data Model #1 Acknowledgments to Andrew W. Moore, Carnegie Mellon University
K-means • An iterative algorithm: 1) Choose k (e.g. 5) 2) Randomly guess k centre locations 3) Iterate: each data point “belongs” to its nearest centre; each centre finds the centroid of its points; the centroid defines the new centre Model #1 Acknowledgments to Andrew W. Moore, Carnegie Mellon University
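The k-means iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the article; the function name, iteration count, and seed are our own choices.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centre,
    then move each centre to the centroid of its points."""
    rng = np.random.default_rng(seed)
    # Randomly guess k centre locations by sampling data points
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Each data point "belongs" to its nearest centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Each centre moves to the centroid of its points
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return centres, labels
```

In the article's setting, `points` would be the ~35,000 segment feature vectors and `k` the 500 blobs.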
K-means • An example Model #1 Acknowledgments to Andrew W. Moore, Carnegie Mellon University
Now for every image we have a set of blobs • Match with a set of words • So the translation can begin Model #1
Notation • The blob string of the n-th image: b^n = (b^n_1 b^n_2 … b^n_l), where l is the length of the blob string • The word string: w^n = (w^n_1 w^n_2 … w^n_m), where m is the length of the word string • The alignment: a^n = (a^n_1 a^n_2 … a^n_m) • The event a^n_j = i means that the j-th word in the possible translation translates the i-th blob Model #1
More about the alignment space A(w,b) • For b = (b1 b2 … bl) and w = (w1 w2 … wm), a possible alignment a = (a1 a2 … am) is a sequence taking values in 0 … l, representing a discrete function from word positions to blob indices (0 stands for the NULL blob) Model #1
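The alignment space can be enumerated directly for a toy example (the blob and word names here are illustrative only):

```python
from itertools import product

blobs = ["b1", "b2", "b3"]        # l = 3
words = ["cloud", "sky", "sun"]   # m = 3
l, m = len(blobs), len(words)

# Each alignment a = (a1 ... am) maps word position j to a blob index
# in 0..l, with 0 standing for the NULL (unaligned) blob.
alignments = list(product(range(l + 1), repeat=m))
print(len(alignments))  # (l+1)^m = 64 possible alignments
```

The exponential size of this space is what makes the simplifying assumptions on the next slides necessary.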
The likelihood function We call the conditional probability p(w|b) the likelihood function, since it gives the probability distribution over word strings that translate a given set of blobs. How do we generate it? Model #1
Example: set of blobs given b = (b1 b2 b3); possible translation w = (cloud sky sun); string size m; lexicon: (book) (chair) (sky) (tree) (sun) (fish) (ship) (ring) (sea) (cloud) … By the chain rule: P(w,a|b) = P(m|b) P(a1|b,m) P(w1|a1,b,m) P(a2|a1,w1,b,m) P(w2|a2,w1,b,m) P(a3|a2,w2,b,m) P(w3|a3,w2,b,m) Model #1
Finally, summing over all alignments, we get p(w|b) = Σ a∈A(w,b) P(w,a|b). The problem with this formulation is the enormous number of system parameters Model #1
A simpler model. Assumptions: 1) Disregard the context of the blobs and the words 2) Assume the alignment is affected only by the position of the translating word 3) Assume translated strings can have any length Model #1
Under the assumptions the factors simplify: P(m|b) = const; P(a3|a2,w2,b,m) = P(a3|3,b,m), the alignment depends only on the positions; P(w3|a3,w2,b,m) = t(w3|b_a3), a translation table. Example: set of blobs given b1 b2 b3; possible word translation: sun; lexicon: (book) (chair) (sky) (tree) (sun) (fish) (ship) (ring) (sea) (cloud) … Model #1
Our mathematical goal is to produce a probability table that gives, for each blob, a distribution over all its possible word translations Model #1
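Under the simplifying assumptions the sum over alignments factorises per word, so p(w|b) ∝ Π_j Σ_i t(w_j|b_i). A toy computation of this factorised likelihood (the table values are made up purely for illustration):

```python
# A toy translation table t(word | blob); values are invented.
t = {
    ("sky", "B1"): 0.7, ("sun", "B1"): 0.3,
    ("sky", "B2"): 0.2, ("sun", "B2"): 0.8,
}

def likelihood(words, blobs, table):
    """p(w|b) up to a constant: each word independently sums its
    translation probability over all blobs in the image."""
    p = 1.0
    for w in words:
        p *= sum(table.get((w, b), 0.0) for b in blobs)
    return p

print(likelihood(["sky", "sun"], ["B1", "B2"], t))  # (0.7+0.2)*(0.3+0.8) = 0.99
```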
E-step: defining the expectation of the complete-data log likelihood, and computing it Model #1
M-step: maximising the expectation we computed. Taking into consideration the normalisation constraints (Σw t(w|b) = 1 for every blob b), we can use Lagrange multipliers to maximise the likelihood function. Setting the derivatives of the obtained Lagrangian to zero gives the equations that need to be solved for the maximisation Model #1
Solving this set of equations yields a new set of parameter estimates; iterating the two steps converges Model #1
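The resulting iteration is the classic EM loop for a translation table, in the style of IBM Model 1. A minimal sketch, with our own toy corpus and function name; the NULL blob and the alignment-position term are omitted for brevity:

```python
from collections import defaultdict

def train_translation_table(corpus, iters=10):
    """EM for a blob->word table t(word|blob): the E-step collects
    expected word-blob co-occurrence counts under the current table,
    the M-step renormalises the counts per blob."""
    vocab = {w for ws, _ in corpus for w in ws}
    t = defaultdict(lambda: 1.0 / len(vocab))  # uniform start
    for _ in range(iters):
        counts = defaultdict(float)   # expected count(word, blob)
        totals = defaultdict(float)   # expected count(blob)
        for ws, bs in corpus:
            for w in ws:
                z = sum(t[(w, b)] for b in bs)   # normalise over this image's blobs
                for b in bs:
                    p = t[(w, b)] / z            # E-step: P(w aligns to b)
                    counts[(w, b)] += p
                    totals[b] += p
        for (w, b), c in counts.items():         # M-step: renormalise
            t[(w, b)] = c / totals[b]
    return t

# Toy corpus: each image is (keywords, blob tokens)
corpus = [
    (["sky", "sun"], ["B1", "B2"]),
    (["sky", "sea"], ["B1", "B3"]),
]
t = train_translation_table(corpus)
```

Because "sky" co-occurs with B1 in both images, the iteration concentrates B1's distribution on "sky".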
Further refinements • Some words may never be predicted with the highest probability for any blob • Assign the NULL word when the best P(word|blob) is below a threshold • Rerun the process • Choose a smaller lexicon Model #1
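The NULL refinement can be sketched as follows (the threshold value, names, and toy table are illustrative, not from the article):

```python
def predict_word(blob, table, vocab, threshold=0.1):
    """Pick the most probable word for a blob; predict NULL when even
    the best word's probability falls below the (hypothetical) threshold."""
    best = max(vocab, key=lambda w: table.get((w, blob), 0.0))
    return best if table.get((best, blob), 0.0) >= threshold else "NULL"

t = {("sky", "B1"): 0.8, ("sun", "B1"): 0.2}
vocab = ["sky", "sun"]
print(predict_word("B1", t, vocab))  # sky
print(predict_word("B9", t, vocab))  # NULL (unseen blob, no confident word)
```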
Indistinguishable words • Visually indistinguishable • Practically indistinguishable • Entangled correspondence (e.g. polar – bear, mare/foals – horse) • Remedy: rerun the process, clustering similar words Model #1
Experimental Results Settings: • 4500 Corel images, with 4-5 keywords each • 371 words vocabulary • typically 5-10 regions each image • 500 blobs • 33 features for each region Model #1 - Results
Annotation Recall / Precision • 500 test images • Only 80 words predicted • Table: precision and recall for original words, refitted words, and clustered words Model #1 - Results
Correspondence • 100 test images • NULL threshold 0.2 • Prediction rates shown for original words and clustered words • Light blue – average # of times a blob predicts the word correctly in the right place • Dark blue – total # of times a blob predicts the word, which is one of the image keywords Model #1 - Results
Some Results • Successful results Model #1 - Results