A Thousand Words in a Scene
P. Quelhas, F. Monay, J. Odobez, D. Gatica-Perez and T. Tuytelaars
PAMI, Sept. 2006
Outline
• Introduction
• Image Representation
  • Bag-of-Visterms (BOV) Representation
  • Probabilistic Latent Semantic Analysis (PLSA)
• Scene Classification
• Experiments
  • Classification
  • Image Ranking
• Conclusion
Introduction
• Main work
  • Scene modeling and classification
• What’s new?
  • Combine text modeling methods and local invariant features to represent an image:
    • A text-like bag-of-visterms representation (histogram of quantized local visual features)
    • Probabilistic Latent Semantic Analysis (PLSA)
  • Scene classification is based on the image representation.
  • Scenes can be ranked via PLSA.
Introduction
• Framework
  • Feature extraction (low-level feature extraction): an image → interest point detector → local descriptors
  • Approach to text-like representation: quantization
  • Text-modeling methods: BOV, PLSA
  • Classification / ranking: classification (SVM)
Image Representation
• Local invariant features
  • Interest point detection
    • Extract characteristic points, and more generally regions, from the images.
    • Invariant to geometric and photometric transformations: given an image and its transformed versions, the same points are extracted.
    • Employ the Difference of Gaussians (DOG) point detector:
      • Compare each point with its eight neighbors in the same scale, and with the nine neighbors in each adjacent scale, to find local minima/maxima.
      • Invariant to translation, scale, rotation and illumination variations.
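The neighbor comparison at the heart of the DOG detector can be sketched as follows. This is a minimal single-scale illustration, not the detector used in the paper: it assumes the DOG response is already available as a 2D list of numbers, checks only the eight in-plane neighbors, and the function name `local_extrema` is hypothetical. A full detector would also compare against the adjacent scales.

```python
def local_extrema(dog):
    """Return (row, col, kind) for interior pixels that are strict
    extrema among their eight in-plane neighbours.

    `dog` is assumed to be one Difference-of-Gaussians response map,
    given as a rectangular list of lists of numbers.
    """
    found = []
    rows, cols = len(dog), len(dog[0])
    # Border pixels are skipped because they lack a full 3x3 neighbourhood.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = dog[r][c]
            neighbours = [
                dog[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            if centre > max(neighbours):
                found.append((r, c, "max"))
            elif centre < min(neighbours):
                found.append((r, c, "min"))
    return found
```

Using strict inequalities means flat regions produce no interest points, which is the usual choice when the goal is a sparse set of distinctive locations.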
Image Representation
• Local descriptors
  • Compute the descriptor on the region around each interest point.
  • Use the Scale Invariant Feature Transform (SIFT) feature as the local descriptor.
• Low-level feature extraction example
  • Each point has a 128-D feature vector.
Image Representation
• Quantization
  • Quantize each local descriptor into a symbol (visterm) via K-means.
• Bag-of-visterms representation
  • Histogram of the visterms
  • Cons: no spatial information between visterms.
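The quantization and histogram steps can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's implementation: descriptors are plain tuples of numbers (real SIFT descriptors would be 128-D), K-means is initialized deterministically from the first `k` points, and the names `kmeans`, `nearest` and `bov_histogram` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (the visterm label)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))


def kmeans(points, k, iters=10):
    """Plain Lloyd iterations; returns the k centroids (the vocabulary)."""
    # Assumption: deterministic initialisation from the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def bov_histogram(descriptors, centroids):
    """Count how many descriptors of one image fall on each visterm."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest(d, centroids)] += 1
    return hist
```

The vocabulary is learned once from descriptors pooled over many images; each image is then reduced to one histogram whose length equals the vocabulary size, which is why the positions of the visterms in the image are lost.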
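Image Representation
• Probabilistic Latent Semantic Analysis (PLSA)
  • Introduce latent variables $z_l$, called aspects, and associate one $z_l$ with each observation (visterm).
  • Build a joint probability model over images $d_i$ and visterms $v_j$: $P(v_j, d_i) = P(d_i) \sum_l P(z_l \mid d_i)\, P(v_j \mid z_l)$
  • Likelihood of the model parameters: $\mathcal{L} = \prod_i \prod_j P(v_j, d_i)^{n(d_i, v_j)}$, where $n(d_i, v_j)$ is the number of occurrences of visterm $v_j$ in image $d_i$.
  • Image representation: the aspect distribution $P(z_l \mid d_i)$ of each image.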
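PLSA parameters are typically fitted with expectation-maximization (EM). The sketch below is a minimal standard-library version of those updates, offered as an illustration rather than the authors' code: it assumes the input is an image-by-visterm count matrix, uses a seeded random initialization, and the names `plsa` and `normalise` are hypothetical.

```python
import random


def normalise(row):
    """Scale a non-negative row to sum to 1 (uniform if the row is all zeros)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_aspects, iters=50, seed=0):
    """Fit P(z|d) and P(v|z) by EM on a count matrix counts[d][v]."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalise([rng.random() + 1e-3 for _ in range(n_aspects)])
             for _ in range(n_docs)]
    p_v_z = [normalise([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_aspects)]
    for _ in range(iters):
        new_z_d = [[0.0] * n_aspects for _ in range(n_docs)]
        new_v_z = [[0.0] * n_words for _ in range(n_aspects)]
        for i in range(n_docs):
            for j in range(n_words):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior over aspects for this (image, visterm) pair.
                joint = [p_z_d[i][l] * p_v_z[l][j] for l in range(n_aspects)]
                total = sum(joint)
                if total == 0:
                    continue
                # M-step accumulation: expected counts per aspect.
                for l in range(n_aspects):
                    resp = counts[i][j] * joint[l] / total
                    new_z_d[i][l] += resp
                    new_v_z[l][j] += resp
        p_z_d = [normalise(row) for row in new_z_d]
        p_v_z = [normalise(row) for row in new_v_z]
    return p_z_d, p_v_z
```

Each row of the returned `p_z_d` is the low-dimensional aspect representation of one image. EM only reaches a local optimum, so results depend on the initialization.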
Image Representation
• Polysemy and synonymy with visterms
  • Polysemy: a single visterm may represent different scene content.
  • Synonymy: several visterms may characterize the same image content.
• Example
  • Samples from 3 randomly selected visterms from a vocabulary of size 1000.
  • Not all visterms have a clear semantic interpretation.
• Pros of PLSA
  • Introduces aspects to capture visterm co-occurrence, and can thus handle polysemy and synonymy.
Experiments
• Classification
  • BOV classification (three-class)
    • Dataset: indoor, city, landscape
    • Training and testing: the whole dataset is split into 10 parts, one for training and the other 9 for testing.
    • Baseline methods: histograms on low-level features
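The ten-part protocol can be sketched with a small generator. This is a hedged illustration: the slide does not say how the parts are formed, so the interleaved assignment below is an assumption, and the name `ten_way_splits` is hypothetical. Each yielded pair is (the single part, the remaining nine parts merged), leaving it to the caller to decide which side is used for training.

```python
def ten_way_splits(items, n_parts=10):
    """Yield (single_part, remaining_parts) for each of the n_parts parts.

    Assumption: items are dealt out round-robin, so the parts differ in
    size by at most one element.
    """
    parts = [items[i::n_parts] for i in range(n_parts)]
    for i, single in enumerate(parts):
        rest = [x for j, part in enumerate(parts) if j != i for x in part]
        yield single, rest
```

Following the description on the slide, `single` would be the training set and `rest` the test set; averaging accuracy over all ten pairs gives one overall figure.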
Experiments
• PLSA classification (three-class)
  • PLSA-I: use the same part of the data both to train the SVM and to learn the aspect models.
  • PLSA-O: use an auxiliary dataset to learn the aspect models.
Experiments
• Aspect-based image ranking
  • Given an aspect $z$, images can be ranked according to $P(d \mid z) \propto P(z \mid d)$ (taking $P(d)$ as uniform).
  • Dataset: landscape/city
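Aspect-based ranking reduces to sorting images by one column of the aspect distributions. The sketch below assumes the score is $P(z \mid d)$, which is proportional to $P(d \mid z)$ when all images are equally likely a priori; the function name `rank_images_by_aspect` and the input layout (one row of aspect probabilities per image) are hypothetical.

```python
def rank_images_by_aspect(p_z_given_d, aspect):
    """Return image indices sorted by decreasing P(aspect | image).

    `p_z_given_d[i][l]` is assumed to hold P(z_l | d_i) for image i.
    """
    return sorted(
        range(len(p_z_given_d)),
        key=lambda i: p_z_given_d[i][aspect],
        reverse=True,
    )
```

The images at the top of the list for a given aspect are those it explains best, which makes the ranking a convenient way to inspect what each aspect has captured.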
Conclusion
• The proposed scene modeling method is effective for scene classification.
• In PLSA modeling, a visual scene is represented as a mixture of aspects.