A Higher-Level Visual Representation For Semantic Learning In Image Databases
Ismail EL SAYAD
18/07/2011
Introduction Related works Our approach Experiments Conclusion and perspectives
Overview
• Introduction
• Related works
• Our approach
  • Enhanced Bag of Visual Words (E-BOW)
  • Multilayer Semantically Significant Analysis Model (MSSA)
  • Semantically Significant Invariant Visual Glossary (SSIVG)
• Experiments
  • Image retrieval
  • Image classification
  • Object recognition
• Conclusion and perspectives
Motivation
• Digital content grows rapidly
  • Personal acquisition devices
  • Broadcast TV
  • Surveillance
• Relatively easy to store, but useless without automatic processing, classification, and retrieval
• The usual way to solve this problem is to describe images by keywords
• This method suffers from subjectivity, text ambiguity, and the lack of automatic annotation
Visual representations
• Two families of visual representations:
  • Image-based representations
  • Part-based representations
• Image-based representations are based on global visual features extracted over the whole image, such as color, color moments, shape, or texture
Visual representations
• The main drawbacks of image-based representations:
  • High sensitivity to:
    • Scale
    • Pose
    • Lighting condition changes
    • Occlusions
  • Cannot capture the local information of an image
• Part-based representations:
  • Based on the statistics of features extracted from segmented image regions
Visual representations: part-based representations (bag of visual words)
[Figure: bag-of-visual-words pipeline: compute local descriptors, cluster the features in the feature space into a visual word vocabulary (VW1, VW2, VW3, VW4, ...), then describe the image by the frequency of each visual word (2, 1, 1, 1, ...)]
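The pipeline in the figure can be sketched in a few lines. This is a minimal illustration, assuming the vocabulary has already been built by clustering and that descriptors are compared with the Euclidean distance; the function names `nearest_word` and `bow_histogram` and the toy 2-D data are hypothetical, not taken from the slides.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))

def bow_histogram(descriptors, vocabulary):
    """Count how many local descriptors fall on each visual word."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts[i] for i in range(len(vocabulary))]

# Toy 2-D "descriptors" and a 4-word vocabulary (illustrative values only).
vocabulary = [(0, 0), (10, 0), (0, 10), (10, 10)]
descriptors = [(1, 1), (0, 2), (9, 1), (1, 9), (11, 10)]
histogram = bow_histogram(descriptors, vocabulary)
```

The histogram keeps only occurrence counts, so the position of each descriptor in the image is discarded.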
Visual representations: bag of visual words (BOW) drawbacks
• Spatial information loss:
  • Records only the number of occurrences
  • Ignores the position
• Uses only keypoint-based intensity descriptors:
  • Neither shape nor color information is used
• Feature quantization noise:
  • Unnecessary and insignificant visual words are generated
Visual representations: bag of visual words (BOW) drawbacks
[Figure: example image patches labeled VW1364, VW1364, VW330, VW480, VW263, VW148]
• Low discrimination power:
  • Different image semantics are represented by the same visual words
• Low invariance to visual diversity:
  • One image semantic is represented by different visual words
Objectives
• Enhanced BOW representation
  • Different local information (intensity, color, shape…)
  • Spatial constitution of the image
  • Efficient visual word vocabulary structure
• Higher-level visual representation
  • Less noisy
  • More discriminative
  • More invariant to visual diversity
Overview of the proposed higher-level visual representation
[Figure: processing pipeline in three stages]
• E-BOW: set of images, visual word vocabulary building, E-BOW representation
• MSSA model: learning the MSSA model
• SSIVG: SSIVWs & SSIVPs generation, SSIVG representation
• Introduction
• Related works
  • Spatial Pyramid Matching Kernel (SPM) & sparse coding
  • Visual phrase & descriptive visual phrase
  • Visual phrase pattern & visual synset
• Our approach
• Experiments
• Conclusion and perspectives
Spatial Pyramid Matching Kernel (SPM) & sparse coding
• Lazebnik et al. [CVPR06]
  • Spatial Pyramid Matching Kernel (SPM): exploits the spatial information of local regions
• Yang et al. [CVPR09]
  • SPM + sparse coding: replaces the k-means quantization in SPM with sparse coding
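The idea behind SPM can be sketched as a concatenation of visual word histograms computed over increasingly fine grids. This is a minimal illustration, assuming keypoints are given as hypothetical `(x, y, word_index)` triples and using the level weights of the pyramid match kernel (1/2^L for level 0, 1/2^(L-l+1) for level l); the function name and the toy data are not from the slides.

```python
def spatial_pyramid(keypoints, vocab_size, width, height, levels=2):
    """Concatenate weighted per-cell visual word histograms for levels 0..levels."""
    features = []
    for level in range(levels + 1):
        cells = 2 ** level  # the grid is cells x cells at this level
        if level == 0:
            weight = 1 / 2 ** levels
        else:
            weight = 1 / 2 ** (levels - level + 1)
        histograms = [[0.0] * vocab_size for _ in range(cells * cells)]
        for x, y, word in keypoints:
            # Clamp so that points on the right/bottom border stay in the last cell.
            cell_x = min(int(x / width * cells), cells - 1)
            cell_y = min(int(y / height * cells), cells - 1)
            histograms[cell_y * cells + cell_x][word] += weight
        for cell_histogram in histograms:
            features.extend(cell_histogram)
    return features

# Three keypoints in a 100 x 100 image, vocabulary of 2 words, pyramid with levels 0 and 1.
pyramid = spatial_pyramid([(10, 10, 0), (90, 10, 1), (90, 90, 0)],
                          vocab_size=2, width=100, height=100, levels=1)
```

Unlike the plain BOW histogram, two images with the same word counts but different layouts get different pyramid vectors.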
Visual phrase & descriptive visual phrase
• Zheng and Gao [TOMCCAP08]
  • Visual phrase: pair of spatially adjacent local image patches
• Zhang et al. [ACM MM09]
  • Descriptive visual phrase: selected according to the frequencies of its constituent visual word pairs
Visual phrase pattern & visual synset
• Yuan et al. [CVPR07]
  • Visual phrase pattern: spatially co-occurring group of visual words
• Zheng et al. [CVPR08]
  • Visual synset: relevance-consistent group of visual words or phrases, in the spirit of the text synset
Comparison of the different enhancements of the BOW
• Introduction
• Related works
• Our approach
  • Enhanced Bag of Visual Words (E-BOW)
  • Multilayer Semantically Significant Analysis Model (MSSA)
  • Semantically Significant Invariant Visual Glossary (SSIVG)
• Experiments
• Conclusion and perspectives
Enhanced Bag of Visual Words (E-BOW)
[Figure: pipeline with the E-BOW stage highlighted (followed by the MSSA model and SSIVG stages): set of images, SURF & Edge Context extraction, features fusion, hierarchical features quantization, E-BOW representation]
Enhanced Bag of Visual Words (E-BOW): feature extraction
[Figure: feature extraction flowchart]
• Interest points detection and edge points detection
• Color filtering using vector median filter (VMF)
• Color feature extraction at each interest and edge point
• Color and position vector clustering using a Gaussian mixture model (components Σ1, µ1, Pi1; Σ2, µ2, Pi2; Σ3, µ3, Pi3)
• SURF feature vector extraction at each interest point
• Edge Context feature vector extraction at each interest point
• Fusion of the SURF and Edge Context feature vectors
• Collection of all vectors for the whole image set
• HAC and Divisive Hierarchical K-Means clustering
• VW vocabulary
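The color filtering step can be illustrated with the standard definition of the vector median filter: each pixel is replaced by the color vector in its window whose summed distance to the other vectors in the window is smallest. This is a minimal sketch, assuming a square window and the Euclidean distance; the window radius and the function names are hypothetical choices, not values given in the slides.

```python
import math

def vector_median(vectors):
    """Return the vector with the smallest summed Euclidean distance to all others."""
    def total_distance(candidate):
        return sum(math.dist(candidate, other) for other in vectors)
    return min(vectors, key=total_distance)

def vector_median_filter(image, radius=1):
    """Apply the vector median to every pixel of an image stored as rows of color tuples."""
    height, width = len(image), len(image[0])
    filtered = []
    for y in range(height):
        row = []
        for x in range(width):
            # The window is clipped at the image borders.
            window = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(height, y + radius + 1))
                      for xx in range(max(0, x - radius), min(width, x + radius + 1))]
            row.append(vector_median(window))
        filtered.append(row)
    return filtered

# A flat gray 3 x 3 image with one red outlier pixel in the center.
gray, red = (10, 10, 10), (255, 0, 0)
noisy = [[gray, gray, gray], [gray, red, gray], [gray, gray, gray]]
smoothed = vector_median_filter(noisy)
```

Because the output is always one of the input vectors, the filter removes the outlier without introducing colors that were not present in the window.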
Enhanced Bag of Visual Words (E-BOW): feature extraction (SURF)
• SURF is a low-level feature descriptor
  • Describes how the pixel intensities are distributed within a scale-dependent neighborhood of each interest point
• Good at:
  • Handling serious blurring
  • Handling image rotation
• Poor at:
  • Handling illumination change
• Efficient
Enhanced Bag of Visual Words (E-BOW): feature extraction (Edge Context descriptor)
• The Edge Context descriptor is represented at each interest point as a histogram:
  • 6 bins for the magnitude of the vectors drawn to the edge points
  • 4 bins for the orientation angle
Enhanced Bag of Visual Words (E-BOW): feature extraction (Edge Context descriptor)
• This descriptor is invariant to:
  • Translation:
    • The distribution of the edge points is measured with respect to fixed points
  • Scale:
    • The radial distance is normalized by a mean distance between the whole set of points within the same Gaussian
  • Rotation:
    • All angles are measured relative to the tangent angle of each interest point
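The histogram construction and its invariances can be sketched as follows. This is only an illustration under stated assumptions: the 6 magnitude bins and 4 orientation bins are combined into one joint 24-bin histogram, the bins are uniform, the distance is normalized by the mean distance from the interest point to the given edge points (a simplification of the normalization described above), and `max_radius` is a hypothetical cut-off. None of these binning details are specified in the slides.

```python
import math

def edge_context(interest_point, tangent_angle, edge_points,
                 n_radial=6, n_angular=4, max_radius=2.0):
    """Joint (distance, angle) histogram of edge points around one interest point."""
    histogram = [0.0] * (n_radial * n_angular)
    if not edge_points:
        return histogram
    px, py = interest_point
    offsets = [(ex - px, ey - py) for ex, ey in edge_points]
    distances = [math.hypot(dx, dy) for dx, dy in offsets]
    mean_distance = sum(distances) / len(distances)
    if mean_distance == 0:
        return histogram
    for (dx, dy), distance in zip(offsets, distances):
        # Scale invariance: distances are expressed relative to the mean distance.
        radius = distance / mean_distance
        r_bin = min(int(radius / max_radius * n_radial), n_radial - 1)
        # Rotation invariance: angles are measured relative to the tangent angle.
        angle = (math.atan2(dy, dx) - tangent_angle) % (2 * math.pi)
        a_bin = min(int(angle / (2 * math.pi) * n_angular), n_angular - 1)
        histogram[r_bin * n_angular + a_bin] += 1.0
    total = sum(histogram)
    return [count / total for count in histogram]

edges = [(1, 1), (-2, 1), (-1, -3), (2, -1)]
reference = edge_context((0, 0), 0.0, edges)
# Same configuration scaled by 3 and shifted by (10, -5).
moved = edge_context((10, -5), 0.0, [(3 * x + 10, 3 * y - 5) for x, y in edges])
# Same configuration rotated by 90 degrees, tangent angle rotated accordingly.
rotated = edge_context((0, 0), math.pi / 2, [(-y, x) for x, y in edges])
```

Translation invariance comes from using offsets relative to the interest point, so all three calls describe the same local shape.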
Enhanced Bag of Visual Words (E-BOW): hierarchical feature quantization
• The visual word vocabulary is created by clustering the observed merged features (SURF + Edge Context, 88-D) in 2 clustering steps:
  • Hierarchical Agglomerative Clustering (HAC): clustering stops at the desired level k
  • Divisive Hierarchical K-Means Clustering: starting from the k clusters from HAC, the tree is determined level by level, down to some maximum number of levels L, with each division into k parts
[Figure: merged features in the feature space; a cluster at k = 4]
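The divisive step can be sketched as a recursive k-means that splits every cluster into k parts until a maximum depth is reached, with the leaves acting as visual words. This is a minimal sketch under stated assumptions: the HAC step is omitted, k-means is initialized deterministically from evenly spaced samples, and the toy data are 2-D instead of 88-D; the function names are hypothetical.

```python
import math

def mean_point(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def kmeans(points, k, iterations=20):
    """Plain k-means; returns the groups of points found in the last assignment."""
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]  # deterministic seeding
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centers[c]))
            groups[nearest].append(point)
        centers = [mean_point(group) if group else centers[i]
                   for i, group in enumerate(groups)]
    return groups

def build_tree(points, k, max_levels):
    """Split the points into k parts, level by level, down to max_levels."""
    node = {"center": mean_point(points), "children": []}
    if max_levels == 0 or len(points) < k:
        return node
    for group in kmeans(points, k):
        if group:
            node["children"].append(build_tree(group, k, max_levels - 1))
    return node

def leaf_centers(node):
    """Collect the centers of the leaves, i.e. the visual words."""
    if not node["children"]:
        return [node["center"]]
    return [center for child in node["children"] for center in leaf_centers(child)]

# Four well separated blobs of 2-D points.
corners = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(cx + dx, cy + dy) for cx, cy in corners for dx in (0, 1) for dy in (0, 1)]
tree = build_tree(points, k=2, max_levels=2)
words = sorted(leaf_centers(tree))
```

A new feature can then be quantized by descending the tree, comparing it with only k centers per level instead of with every visual word.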
Multilayer Semantically Significant Analysis (MSSA) model
[Figure: pipeline with the MSSA model stage highlighted]
• E-BOW: set of images, SURF & Edge Context extraction, features fusion, hierarchical features quantization, E-BOW representation
• MSSA model: generative process, parameters estimation, number of latent topics estimation, VWs semantic inference estimation
• SSIVG
Multilayer Semantically Significant Analysis (MSSA) model: generative process
[Figure: a higher-level aspect (People) composed of different visual aspects]
• A topic model that considers this hierarchical structure is needed
Multilayer Semantically Significant Analysis (MSSA) model: generative process
[Figure: graphical model with the symbols Θ, φ, Ψ, im, h, v, VW, M, N]
• In the MSSA model, there are two different kinds of latent (hidden) topics:
  • High latent topics, which represent the high-level aspects
  • Visual latent topics, which represent the visual aspects
Multilayer Semantically Significant Analysis (MSSA) model: parameter estimation
• Probability distribution function: [equation]
• Log-likelihood function: [equation]
• Gaussier et al. [ACM SIGIR05]: maximizing the likelihood can be seen as a Nonnegative Matrix Factorization (NMF) problem under the generalized KL divergence
• Objective function: [equation]
Multilayer Semantically Significant Analysis (MSSA) model: parameter estimation
• KKT conditions are used to derive the multiplicative update rules for minimizing the objective function
• This leads to the following multiplicative update rules: [equations]
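To show what multiplicative update rules under the generalized KL divergence look like, here is the classical two-factor case (Lee and Seung's updates for V ≈ WH). This is not the MSSA update itself, whose equations are not reproduced here; it is a sketch of the same family of rules, with hypothetical function names and toy matrices.

```python
import math

EPS = 1e-12  # guards against division by zero and log(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def generalized_kl(v, approx):
    """Generalized KL divergence D(V || approx), summed over all entries."""
    total = 0.0
    for v_row, a_row in zip(v, approx):
        for x, y in zip(v_row, a_row):
            if x > 0:
                total += x * math.log(x / (y + EPS))
            total += y - x
    return total

def ratio_matrix(v, w, h):
    """Element-wise V / (WH), the quantity shared by both update rules."""
    wh = matmul(w, h)
    return [[x / (y + EPS) for x, y in zip(v_row, wh_row)]
            for v_row, wh_row in zip(v, wh)]

def nmf_kl_step(v, w, h):
    """One multiplicative update of H, then of W, keeping all entries nonnegative."""
    n, k, m = len(v), len(h), len(v[0])
    ratio = ratio_matrix(v, w, h)
    column_sums = [sum(w[i][a] for i in range(n)) for a in range(k)]
    h = [[h[a][j] * sum(w[i][a] * ratio[i][j] for i in range(n)) / (column_sums[a] + EPS)
          for j in range(m)] for a in range(k)]
    ratio = ratio_matrix(v, w, h)
    row_sums = [sum(h[a]) for a in range(k)]
    w = [[w[i][a] * sum(ratio[i][j] * h[a][j] for j in range(m)) / (row_sums[a] + EPS)
          for a in range(k)] for i in range(n)]
    return w, h

# Toy co-occurrence matrix (rows: images, columns: visual words) and a rank-2 model.
v = [[4.0, 2.0, 0.0], [2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
w = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
h = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
initial = generalized_kl(v, matmul(w, h))
for _ in range(50):
    w, h = nmf_kl_step(v, w, h)
final = generalized_kl(v, matmul(w, h))
```

Each factor is only ever multiplied by a nonnegative ratio, which is why no explicit projection onto the nonnegative orthant is needed.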
Multilayer Semantically Significant Analysis (MSSA) model: number of latent topics estimation
• Minimum Description Length (MDL) is used as the model selection criterion for:
  • The number of high latent topics (L)
  • The number of visual latent topics (K)
• The criterion [equation] depends on:
  • The log-likelihood
  • The number of free parameters: [equation]
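Model selection with MDL can be sketched as follows. The exact criterion and the free-parameter count used for MSSA are not reproduced here, so this sketch assumes the common two-part form, MDL = -log-likelihood + (number of free parameters / 2) × log(number of observations); the candidate values below are made up purely for illustration.

```python
import math

def mdl_score(log_likelihood, n_free_parameters, n_observations):
    """Two-part MDL score (assumed form): data fit plus model complexity penalty."""
    return -log_likelihood + 0.5 * n_free_parameters * math.log(n_observations)

def select_topic_counts(candidates, n_observations):
    """Pick the (L, K) pair with the smallest MDL score.

    candidates maps (L, K) to (log_likelihood, n_free_parameters) of the fitted model.
    """
    return min(candidates,
               key=lambda lk: mdl_score(candidates[lk][0], candidates[lk][1],
                                        n_observations))

# Hypothetical fitted models: richer models fit better but cost more parameters.
candidates = {
    (2, 4): (-1200.0, 50),
    (3, 6): (-1000.0, 90),
    (5, 10): (-980.0, 200),
}
best = select_topic_counts(candidates, n_observations=1000)
```

The penalty term stops the selection from always preferring the largest number of topics, which would otherwise maximize the likelihood.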
Semantically Significant Invariant Visual Glossary (SSIVG) representation
[Figure: pipeline with the SSIVG stage highlighted]
• E-BOW: set of images, SURF & Edge Context extraction, features fusion, hierarchical features quantization, E-BOW representation
• MSSA model: generative process, parameters estimation, number of latent topics estimation, VWs semantic inference estimation
• SSIVG: SSVWs selection, SSVW representation, SSVPs generation, SSVP representation, divisive theoretic clustering, SSIVW representation, SSIVP representation, SSIVG representation
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Word (SSVW)
[Figure: set of VWs, estimating using MSSA, set of relevant visual topics, estimating using MSSA, set of SSVWs]
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)
• SSVP: higher-level and more discriminative representation
  • SSVWs + their inter-relationships
• SSVPs are formed from SSVW sets that satisfy all the following conditions:
  • Occur in the same spatial context
  • Involved in strong association rules
    • High support and confidence
  • Have the same semantic meaning
    • High probability related to at least one common visual latent topic
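The "strong association rules" condition can be illustrated with the usual support and confidence measures over pairs of words. This is a minimal sketch, assuming each transaction is the set of SSVWs found in one spatial context (for example, one local neighborhood); the thresholds, word labels, and function name are hypothetical, and the semantic condition on the common visual latent topic is not modeled.

```python
from itertools import combinations

def strong_pair_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for strong pair rules."""
    n = len(transactions)
    item_count, pair_count = {}, {}
    for items in transactions:
        unique = sorted(set(items))
        for item in unique:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(unique, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # fraction of contexts containing both words
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            # Fraction of contexts with the antecedent that also contain the consequent.
            confidence = count / item_count[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules

# Each list holds the words observed together in one spatial context (toy data).
contexts = [["w1", "w2", "w3"], ["w1", "w2"], ["w1", "w2", "w4"],
            ["w3", "w4"], ["w1", "w5"]]
rules = strong_pair_rules(contexts, min_support=0.5, min_confidence=0.8)
```

Here w1 and w2 co-occur in 3 of 5 contexts; the rule w2 → w1 passes both thresholds, while w1 → w2 has a confidence of only 0.75.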
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)
[Figure: example images annotated with the phrases SSIVP126, SSIVP326, and SSIVP304, each shown twice]
Semantically Significant Invariant Visual Glossary (SSIVG) representation: invariance problem
• Studying the co-occurrence and spatial scatter information makes the image representation more discriminative
• The invariance power of SSVWs and SSVPs is still low
• Text documents:
  • Synonymous words can be clustered into one synonym set to improve the document categorization performance
Semantically Significant Invariant Visual Glossary (SSIVG) representation
[Figure: set of SSVWs and SSVPs, estimating using MSSA, set of relevant visual topics, estimating using MSSA, divisive theoretic clustering, set of SSIVWs and set of SSIVPs, set of SSIVGs]
• SSIVG: higher-level visual representation composed of two different layers of representation
  • Semantically Significant Invariant Visual Word (SSIVW)
    • Re-indexed SSVWs after a distributional clustering
  • Semantically Significant Invariant Visual Phrase (SSIVP)
    • Re-indexed SSVPs after a distributional clustering
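The re-indexing idea can be sketched with a simple distributional clustering: words whose distributions over the visual latent topics are close (in KL divergence) receive the same cluster index, in the same way synonyms share a synset. This is only an illustration under stated assumptions: uniform word weights, a deterministic round-robin initialization, and made-up topic distributions; it is not claimed to be the exact clustering algorithm used for SSIVG.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def distributional_clustering(word_topics, n_clusters, iterations=10):
    """Group words by the similarity of their topic distributions."""
    words = sorted(word_topics)
    assignment = {word: i % n_clusters for i, word in enumerate(words)}
    for _ in range(iterations):
        # Each centroid is the mean topic distribution of the cluster members.
        centroids = {}
        for cluster in range(n_clusters):
            members = [word_topics[w] for w in words if assignment[w] == cluster]
            if members:
                centroids[cluster] = [sum(col) / len(members) for col in zip(*members)]
        updated = {
            word: min(centroids,
                      key=lambda c: kl_divergence(word_topics[word], centroids[c]))
            for word in words
        }
        if updated == assignment:
            break
        assignment = updated
    return assignment

# Hypothetical P(visual topic | word) for five words over two topics.
word_topics = {
    "a": [0.9, 0.1], "b": [0.7, 0.3], "c": [0.1, 0.9],
    "d": [0.2, 0.8], "e": [0.8, 0.2],
}
clusters = distributional_clustering(word_topics, n_clusters=2)
```

After clustering, every word is replaced by its cluster index, so visually different words with the same topical behavior become one invariant entry.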
Experiments
• Introduction
• Related works
• Our approach
• Experiments
  • Image retrieval
  • Image classification
  • Object recognition
• Conclusion and perspectives
Assessment of the SSIVG representation performance in image retrieval
• Evaluation criterion:
  • Mean Average Precision (MAP)
• The traditional Vector Space Model of Information Retrieval is adapted:
  • The weighting for the SSIVP
  • Spatial weighting for the SSIVW
  • The inverted file structure
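Mean Average Precision can be computed as sketched below, using its standard definition: for each query, precision is averaged over the ranks at which relevant images appear, and the result is averaged over all queries. The image identifiers and function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values measured at each relevant hit in the ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant images that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)

runs = [
    (["a", "x", "b"], {"a", "b"}),  # hits at ranks 1 and 3
    (["x", "c"], {"c"}),            # hit at rank 2
]
map_score = mean_average_precision(runs)
```

For the first query the average precision is (1/1 + 2/3) / 2 and for the second it is 1/2, so rankings that place relevant images earlier score higher.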
Evaluation of the SSIVG representation in image classification
• Evaluation criterion:
  • Classification Average Precision over each class
• Classifiers:
  • SVM with linear kernel
  • Multiclass Vote-Based Classifier (MVBC)
Evaluation of the SSIVG representation in image classification: Multiclass Vote-Based Classifier (MVBC)
• For each [symbol], we detect the high latent topic that maximizes: [equation]
• The final voting score for a high latent topic: [equation]
• Each image is categorized according to the dominant high latent topic
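The voting scheme can be sketched as follows. The scoring equations are not reproduced here, so this sketch makes two loudly stated assumptions: each glossary item found in the image votes for the high latent topic with the largest conditional probability given that item, and the final voting score of a topic is simply its number of votes. The probability table, item labels, and function name are hypothetical.

```python
from collections import Counter

def mvbc_classify(image_items, topic_given_item):
    """Return the dominant high latent topic of an image and the vote counts."""
    if not image_items:
        raise ValueError("the image contains no glossary items to vote with")
    votes = Counter()
    for item in image_items:
        scores = topic_given_item[item]
        # Assumed rule: the item votes for its most probable high latent topic.
        best_topic = max(scores, key=scores.get)
        votes[best_topic] += 1
    dominant_topic = votes.most_common(1)[0][0]
    return dominant_topic, votes

# Hypothetical P(high latent topic | glossary item).
topic_given_item = {
    "g1": {"people": 0.7, "beach": 0.3},
    "g2": {"people": 0.2, "beach": 0.8},
    "g3": {"people": 0.6, "beach": 0.4},
}
label, votes = mvbc_classify(["g1", "g2", "g3", "g1"], topic_given_item)
```

Because the decision only needs the learned topic probabilities, no separate classifier has to be trained on top of the representation.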
Assessment of the SSIVG representation performance in object recognition
• Each test image is recognized by predicting the object class using the SSIVG representation and the MVBC
• Evaluation criterion:
  • Classification Average Precision (AP) over each object class
Conclusion
• Enhanced BOW (E-BOW) representation
  • Modeling the spatial-color image constitution using GMM
  • New local feature descriptor (Edge Context)
  • Efficient visual word vocabulary structure
• New Multilayer Semantically Significant Analysis (MSSA) model
  • Semantic inferences of different layers of representation
• Semantically Significant Invariant Visual Glossary (SSIVG)
  • More discriminative
  • More invariant to visual diversity
• Experimental validation
  • Outperforms other state-of-the-art works
Perspectives
• MSSA parameters update
  • On-line algorithms to continuously (re-)learn the parameters
• Invariance issue
  • Context of large-scale databases, where large intra-class variations can occur
• Cross-modality extension to video content
  • Cross-modal data (visual content and textual closed captions)
  • New generic framework for video summarization
  • Study the semantic coherence between visual content and textual captions
Questions? Thank you for your attention!
ismail.elsayad@lifl.fr
Parameter Settings