XRCE at ImageCLEF 07



1. XRCE at ImageCLEF 07
Stephane Clinchant, Jean-Michel Renders and Gabriela Csurka
Xerox Research Centre Europe, France

2. Outline
• Problem statement
• Image Similarity
• Text Similarity
• Fusion between text and image
• Cross-Media Similarities
• Experimental results
• Conclusion

3. Problem Statement
[Diagram: query image(s) + query text are matched against a database of images with text via (1) image similarity, (2) text similarity and (3) cross-media similarity, producing ranked documents.]
• Problem: retrieve relevant images from a given cross-media database (images with text), given a set of query images and a query text.
• Proposed solution: rank the images in the database based on image similarity (1), text similarity (2) and cross-media similarities (3).

4. Image Similarity
• The goal is to define an image similarity measure that "best" reflects a "semantic" similarity between images.
• E.g. two images of the same kind of scene should score higher than two unrelated images. [Example images not reproduced in the transcript.]
• Our proposed solution (detailed in the next slides) is to:
• consider both local color and local texture features
• build a generative model (GMM) in the low-level feature space
• represent the image based on the Fisher Kernel principle
• define a similarity measure between Fisher Vectors

5. Fisher Vector
• Given a generative model with parameters λ (a GMM), the gradient vector ∇_λ log p(X|λ), normalized by the Fisher information matrix F_λ = E_X[∇_λ log p(X|λ) ∇_λ log p(X|λ)^T], leads to a unique "model-dependent" representation of the image, called the Fisher Vector:
f_X = F_λ^{−1/2} ∇_λ log p(X|λ)
• As similarity between Fisher Vectors, the L1 norm was used (smaller distance means more similar):
d(f_X, f_Y) = ‖f_X − f_Y‖_1
Fisher Kernels on Visual Vocabularies for Image Categorization, F. Perronnin and C. Dance, CVPR 2007.
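A minimal sketch of how such a representation can be computed, assuming a diagonal-covariance GMM and keeping only the gradient with respect to the Gaussian means, with the closed-form Fisher-information normalization of Perronnin and Dance; the function names and the use of scikit-learn are illustrative, not XRCE's actual pipeline:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """descriptors: (T, D) array of local features for one image;
    gmm: fitted GaussianMixture with covariance_type='diag'."""
    gamma = gmm.predict_proba(descriptors)          # (T, K) soft assignments
    T = descriptors.shape[0]
    parts = []
    for k in range(gmm.n_components):
        # Whitened residuals of the descriptors w.r.t. Gaussian k.
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g_mu = (gamma[:, [k]] * diff).sum(axis=0)   # gradient w.r.t. mean_k
        parts.append(g_mu / (T * np.sqrt(gmm.weights_[k])))  # FIM normalization
    return np.concatenate(parts)

def l1_distance(fv_x, fv_y):
    return np.abs(fv_x - fv_y).sum()                # smaller = more similar

# Usage: fit the GMM on descriptors pooled over many images, then
# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(pool)
# and rank database images by l1_distance to the query's Fisher Vector.
```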

6. Text similarity
• The text is first pre-processed: tokenization, lemmatization, word decompounding and stop-word removal.
• The text is modeled by a multinomial language model and smoothed via the Jelinek-Mercer method:
p(w|d) = λ p_ML(w|d) + (1 − λ) p_ML(w|C)
where p_ML(w|d) ∝ #(w, d) and p_ML(w|C) ∝ Σ_d #(w, d).
• The textual similarity between two documents is defined by the cross-entropy function:
sim_TXT(q, d) = Σ_w p_ML(w|q) log p(w|d)
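A small sketch of the smoothed language model and the cross-entropy score under the stated Jelinek-Mercer interpolation; the value of the mixing weight here is an arbitrary placeholder:

```python
import math
from collections import Counter

LAMBDA = 0.5  # Jelinek-Mercer mixing weight (placeholder value)

def ml_estimate(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}  # p_ML(w|.) from counts

def smoothed_prob(w, doc_lm, corpus_lm):
    # p(w|d) = LAMBDA * p_ML(w|d) + (1 - LAMBDA) * p_ML(w|C)
    return LAMBDA * doc_lm.get(w, 0.0) + (1 - LAMBDA) * corpus_lm.get(w, 1e-9)

def cross_entropy_sim(query_tokens, doc_lm, corpus_lm):
    # Higher is more similar: sum_w p_ML(w|q) log p(w|d)
    q_lm = ml_estimate(query_tokens)
    return sum(p * math.log(smoothed_prob(w, doc_lm, corpus_lm))
               for w, p in q_lm.items())
```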

7. Enriching the text using an external corpus
• Reason: the texts related to the images in the corpus are poor (title only).
• How: each "text" in the corpus was enriched as follows: for each term in the document, we add related terms based on a clustered usage analysis of an external corpus.
• The external corpus was the Flickr image database.
• The relationship between terms was based on the frequency of their co-occurrence as "tags" for the same image in Flickr. [Top-5 related-tag examples not reproduced in the transcript.]
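A hedged sketch of the tag co-occurrence idea: expand each document term with the Flickr tags it most often co-occurs with on the same image. The top-5 cutoff mirrors the slide's examples; the simple counting here stands in for the clustered usage analysis actually used:

```python
from collections import Counter, defaultdict

def build_cooccurrence(images_tags):
    """images_tags: iterable of tag sets, one per Flickr image."""
    co = defaultdict(Counter)
    for tags in images_tags:
        for t in tags:
            for u in tags:
                if u != t:
                    co[t][u] += 1   # t and u tagged the same image
    return co

def enrich(text_terms, co, top_k=5):
    enriched = list(text_terms)
    for t in text_terms:
        enriched.extend(u for u, _ in co[t].most_common(top_k))
    return enriched

# e.g. enrich(["beach"], co) might append tags like "sea", "sand", "ocean".
```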

8. Fusion between image and text
• Early fusion:
• simple concatenation of image and text features (e.g. bag-of-words and bag-of-visual-words)
• estimating their co-occurrences or joint probabilities (Mori et al., Vinokourov et al., Duygulu et al., Blei et al., Jeon et al., etc.)
• Late fusion:
• simply combining the scores of mono-media searches (Maillot et al., Clinchant et al.)
• Intermediate-level fusion:
• relevance models (Jeon et al.)
• trans-media (or inter-media) feedback (Maillot et al., Chang et al.)
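To make the early/late distinction concrete, a generic illustration (not XRCE's implementation): early fusion merges features before scoring, late fusion merges mono-media scores afterwards:

```python
import numpy as np

def early_fusion(bow_text, bow_visual):
    # One joint feature vector, scored by a single retrieval model.
    return np.concatenate([bow_text, bow_visual])

def late_fusion(score_txt, score_img, w=0.5):
    # Combine the scores of independent mono-media searches.
    return w * score_txt + (1 - w) * score_img
```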

9. Intermediate-level fusion
• Compute mono-media similarities between an aggregate of objects coming from a first retrieval step and a multimodal object.
• Use the duality of the data to switch media during the feedback process.
[Diagram: query → pseudo-feedback: top-N ranked images based on image similarity → aggregate textual information → final rank: documents re-ranked based on textual similarity.]
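A sketch of this media-switching loop, in the "direct concatenation" flavor of the next slide; sim_img and sim_txt stand for the Fisher-Vector and language-model similarities defined earlier, and all names are placeholders:

```python
def transmedia_search(query_img, query_txt, corpus, sim_img, sim_txt, n=20):
    # Step 1: mono-media (visual) retrieval over the whole corpus.
    by_image = sorted(corpus, key=lambda d: sim_img(query_img, d.image),
                      reverse=True)
    feedback = by_image[:n]                 # pseudo-relevant set N_img(q)
    # Step 2: switch media -- aggregate the textual side of the feedback set
    # (the query text can be folded into the aggregate as well).
    aggregate_text = " ".join([query_txt] + [d.text for d in feedback])
    # Step 3: final rank by textual similarity to the aggregate.
    return sorted(corpus, key=lambda d: sim_txt(aggregate_text, d.text),
                  reverse=True)
```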

10. Aggregate information from pseudo-feedback
• Aim: compute similarities between an aggregate of objects N_img(q), corresponding to a first retrieval for query q, and a new multimodal object u in the corpus, where N_img(q) = {T(I_1), T(I_2), …, T(I_N)} and T(I_k) is the textual part of the k-th image I_k in the (pseudo-)feedback group based on image similarity.
• Possible solutions:
• Direct concatenation: aggregate (concatenate) the T(I_k), k = 1…N, to form a single object and compute the text similarity between it and T(u).
• Trans-media document re-ranking: aggregate all similarity measures between pairs of objects.
• Complementary (or inter-media) feedback: use a pseudo-feedback algorithm to extract relevant features from N_img(q) and use them to compute the similarity with T(u).

11. Trans-media document re-ranking
• We define a similarity measure between an aggregate of objects N_img and a multimodal object u that combines the textual similarities sim_TXT(T(u), T(v)) over the feedback objects v ∈ N_img, weighted by their image-retrieval scores. [Exact formula not reproduced in the transcript.]
• Notes:
• This approach can be seen as a document re-ranking method rather than a query-expansion mechanism.
• The values sim_TXT(T(u), T(v)) can be pre-computed offline if the corpus is of reasonable size.
• By duality, we can inverse the roles of images and text.
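A sketch of the re-ranking score under the assumption, consistent with the notes above but not shown verbatim on the slide, that the aggregate is an image-score-weighted sum of text similarities:

```python
def transmedia_rerank_score(query_img, u, feedback, sim_img, sim_txt):
    """feedback: the multimodal objects v in N_img(q); u: candidate document."""
    return sum(
        sim_img(query_img, v.image) * sim_txt(u.text, v.text)  # weight * text sim
        for v in feedback
    )

# Only the sim_txt(u, v) pairs over the corpus are query-independent, so they
# can be pre-computed offline, as the slide notes; a new query only changes
# the image weights.
```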

12. Complementary Feedback
• We derive a language model θ_F for the "relevance concept" from the text set F = N_img(q):
• θ_F is assumed to be multinomial (peaked at relevant terms), estimated by EM from the mixture log-likelihood
log p(F|θ_F) = Σ_w #(w, F) log( (1 − λ) p(w|θ_F) + λ p(w|C) )   (1)
where p(w|C) is the word probability built upon the corpus, and λ (= 0.5) a fixed parameter.
• The similarity between N_img and T(u) is given by the cross-entropy similarity between θ_F and T(u), or we can first interpolate θ_F with the query-text model:
θ'_F = α θ_F + (1 − α) θ_QT
• Notes:
• α (= 0.5 in our experiments) can be seen as a mixing weight between image and text.
• Unlike the trans-media re-ranking method, it needs a second retrieval step.
• We can inverse the roles of images and text if we use Rocchio's method instead of (1).
A Study of Smoothing Methods for Language Models Applied to Information Retrieval, Zhai and Lafferty, SIGIR 2001.
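A minimal EM sketch for (1), estimating θ_F as the component of the feedback texts that the corpus model cannot explain; this follows standard model-based feedback, and the exact XRCE variant may differ:

```python
from collections import Counter

def estimate_theta_f(feedback_tokens, corpus_lm, lam=0.5, iters=30):
    counts = Counter(feedback_tokens)             # #(w, F), pooled over the set
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialization
    for _ in range(iters):
        # E-step: posterior that an occurrence of w was generated by theta_F.
        post = {w: (1 - lam) * theta[w]
                   / ((1 - lam) * theta[w] + lam * corpus_lm.get(w, 1e-9))
                for w in vocab}
        # M-step: re-estimate theta_F from the expected counts.
        z = sum(counts[w] * post[w] for w in vocab)
        theta = {w: counts[w] * post[w] / z for w in vocab}
    return theta  # peaked at terms specific to the feedback set
```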

13. XRCE's ImageCLEF Runs
• LM: language model with cross-entropy
• FV+L1: Fisher Vector with L1 norm
• FLR: text enriched by Flickr tags
• TR: trans-media re-ranking
• CF: complementary feedback
• Ri: run i
• QT: query translation
[Results table not reproduced in the transcript.]

14. Conclusion
• Our image similarity measure (L1 norm on Fisher Vectors) seems to be quite suitable for CBIR.
• It was the second-best "Visual Only" system and, unlike the first-ranked system, it did not use any query expansion (nor feedback).
• Combining it with text similarity within an "intermediate-level fusion" allowed for a significant improvement.
• Mixing the modalities increased performance by about 50% (relative) over mono-media (pure text or pure image) systems.
• Three out of the six proposed cross-media systems were the best three "Automatic Mixed Runs".
• The system performed well even when the query and the corpus were in different languages (English versus German).

  15. Thank you for your attention!
