Visual Object Recognition
Bastian Leibe, Computer Vision Laboratory, ETH Zurich
Kristen Grauman, Department of Computer Sciences, University of Texas at Austin
Chicago, 14.07.2008
Outline
• Detection with Global Appearance & Sliding Windows
• Local Invariant Features: Detection & Description
• Specific Object Recognition with Local Features
― Coffee Break ―
• Visual Words: Indexing, Bags of Words Categorization
• Matching Local Feature Sets
• Part-Based Models for Categorization
• Current Challenges and Research Directions
K. Grauman, B. Leibe
Global representations: limitations
• Success may rely on alignment -> sensitive to viewpoint
• All parts of the image or window impact the description -> sensitive to occlusion and clutter
Local representations
• Describe component regions or patches separately.
• Many options for detection & description: SIFT [Lowe 99], Spin images [Johnson 99], Salient regions [Kadir 01], Shape context [Belongie 02], Maximally Stable Extremal Regions [Matas 02], Harris-Affine [Mikolajczyk 04], Geometric Blur [Berg 05], Superpixels [Ren et al.]
Recall: Invariant local features
Subset of local feature types designed to be invariant to:
• Scale
• Translation
• Rotation
• Affine transformations
• Illumination
Pipeline: detect interest points, then extract descriptors (each a d-dimensional vector, e.g. x = (x1, …, xd))
[Mikolajczyk 01, Matas 02, Tuytelaars 04, Lowe 99, Kadir 01, …]
Recognition with local feature sets
• Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.
• Now, we'll use local features for:
• Indexing-based recognition
• Bags of words representations
• Correspondence / matching kernels
(Figure: Aachen Cathedral)
Basic flow
1. Detect or sample features -> list of positions, scales, orientations
2. Describe features -> associated list of d-dimensional descriptors
3. Index each one into a pool of descriptors from previously seen images
Indexing local features
• Each patch / region has a descriptor: a point in some high-dimensional feature space (e.g., SIFT).

Indexing local features
• Close points in feature space have similar descriptors, which indicates similar local content.
Figure credit: A. Zisserman
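The nearest-neighbor idea can be sketched in a few lines of NumPy; the descriptors below are made-up 2-D stand-ins for real high-dimensional descriptors such as SIFT:

```python
import numpy as np

# Toy stand-ins for descriptors (real SIFT vectors are 128-dimensional).
db = np.array([[0.0, 0.0],
               [1.0, 1.0],
               [5.0, 5.0]])
query = np.array([0.9, 1.1])

# Squared Euclidean distance from the query to every stored descriptor.
dists = ((db - query) ** 2).sum(axis=1)
nearest = int(np.argmin(dists))          # most similar stored descriptor

# Lowe-style ratio test: accept the match only if the best distance is
# clearly smaller than the second best.
d1, d2 = np.sort(dists)[:2]
is_distinctive = d1 / d2 < 0.8 ** 2      # threshold on squared distances
```

Here the query descriptor is closest to the second stored point, and the large gap to the runner-up makes the match distinctive.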
Indexing local features
• We saw in the previous section how to use voting and pose clustering to identify objects using local features.
Figure credit: David Lowe
Indexing local features
• With potentially thousands of features per image, and hundreds to millions of images to search, how can we efficiently find those that are relevant to a new image?
• Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search
• High-dimensional descriptors: approximate nearest neighbor search methods are more practical
• Inverted file indexing schemes
Indexing local features: approximate nearest neighbor search
• Best-Bin First (BBF): a variant of k-d trees that uses a priority queue to examine the most promising branches first [Beis & Lowe, CVPR 1997]
• Locality-Sensitive Hashing (LSH): a randomized hashing technique using hash functions that map similar points to the same bin with high probability [Indyk & Motwani, 1998]
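As a concrete illustration, here is a minimal sketch of the random-hyperplane flavor of LSH (a widely used variant for cosine similarity; the original Indyk & Motwani scheme used different hash families). Points on the same side of every random hyperplane land in the same bucket; all data here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 8, 16

# Each hash bit records which side of a random hyperplane a point falls
# on, so the full key encodes the point's direction in feature space.
planes = rng.standard_normal((n_bits, d))

def lsh_key(x):
    return (planes @ x > 0).tobytes()    # hashable bucket key

a = rng.standard_normal(d)               # toy descriptor
b = 2.0 * a                              # same direction -> same key

table = {}
table.setdefault(lsh_key(a), []).append("a")
table.setdefault(lsh_key(b), []).append("b")
```

Because `b` points in the same direction as `a`, the two descriptors collide in one bucket, so a query only needs to examine that bucket rather than the whole database.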
Indexing local features: inverted file index
• For text documents, an efficient way to find all pages on which a word occurs is to use an index…
• We want to find all images in which a feature occurs.
• To use this idea, we'll need to map our features to "visual words".
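The inverted file idea carries over directly once features are mapped to word ids; a small sketch with made-up word assignments:

```python
from collections import defaultdict

# Made-up visual word ids observed in each image.
image_words = {
    "img0": [3, 7, 7, 12],
    "img1": [7, 12, 40],
    "img2": [3, 40],
}

# Inverted file: word id -> set of images containing that word.
inverted = defaultdict(set)
for image, words in image_words.items():
    for w in words:
        inverted[w].add(image)

# Query: retrieve every image sharing at least one word with the query.
query_words = [7, 40]
candidates = set().union(*(inverted[w] for w in query_words))
```

Lookup cost depends only on the query's words and their posting lists, not on the total number of images.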
Visual words: main idea
• Extract some local features from a number of images (e.g., SIFT descriptor space: each point is 128-dimensional)
Slide credit: D. Nister
Visual words: main idea
• Map high-dimensional descriptors to tokens/words by quantizing the feature space.
• Quantize via clustering; let the cluster centers be the prototype "words".
• Determine which word to assign to each new image region by finding the closest cluster center.
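The quantization step can be sketched with a plain NumPy k-means (Lloyd's algorithm); the 2-D blobs below are made-up stand-ins for descriptor clusters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D "descriptors" around two made-up cluster centers
# (real descriptors would be high-dimensional, e.g. 128-D SIFT).
descriptors = np.vstack([
    rng.normal([0, 0], 0.1, (20, 2)),
    rng.normal([5, 5], 0.1, (20, 2)),
])

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm; the centers become the visual 'words'."""
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):                 # move centers to cluster means
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

vocab = kmeans(descriptors, k=2)

def assign_word(x, centers):
    """Quantize a new descriptor to its closest cluster center."""
    return int(((centers - x) ** 2).sum(1).argmin())
```

New regions from the same visual structure quantize to the same word id, while regions from the other blob get the other word.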
Visual words
• Example: each group of patches belongs to the same visual word.
Figure from Sivic & Zisserman, ICCV 2003
Visual words
• First explored for texture and material representations.
• Texton = cluster center of filter responses over a collection of images.
• Describe textures and materials based on the distribution of prototypical texture elements.
Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003
Visual words
• More recently used for describing scenes and objects for the sake of indexing or classification.
Sivic & Zisserman 2003; Csurka, Bray, Dance & Fan 2004; many others.
Inverted file index for images comprised of visual words
(Figure: index mapping each word number to the list of image numbers in which it appears)
Image credit: A. Zisserman
Bags of visual words
• Summarize the entire image based on its distribution (histogram) of word occurrences.
• Analogous to the bag of words representation commonly used for documents.
Image credit: Fei-Fei Li
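Building the histogram is essentially one line with NumPy; the word ids below are made up:

```python
import numpy as np

vocab_size = 6
# Made-up visual word ids for all local features found in one image.
word_ids = np.array([2, 2, 5, 0, 2, 5])

# Bag of words: count occurrences of each word, then L1-normalize
# so images with different numbers of features are comparable.
bow = np.bincount(word_ids, minlength=vocab_size).astype(float)
bow /= bow.sum()
```

The result is a fixed-length vector regardless of how many features the image produced, which is what makes the representation convenient for indexing and learning.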
Video Google System [Sivic & Zisserman, ICCV 2003]
1. Collect all words within the query region
2. Use the inverted file index to find relevant frames
3. Compare word counts
4. Spatial verification
(Figures: query region, retrieved frames)
Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
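A sketch of that query flow (candidate retrieval through an inverted file, then ranking by histogram similarity) with made-up counts; the real system also applies tf-idf weighting and spatial verification, both omitted here:

```python
import numpy as np
from collections import defaultdict

# Made-up word histograms for three database frames (5-word vocabulary).
frames = {
    "f0": np.array([2, 0, 1, 0, 0]),
    "f1": np.array([0, 3, 0, 1, 0]),
    "f2": np.array([2, 0, 1, 1, 0]),
}

# Inverted file: word -> frames whose histogram contains it.
inv = defaultdict(set)
for f, h in frames.items():
    for w in np.nonzero(h)[0]:
        inv[w].add(f)

query = np.array([1, 0, 1, 0, 0])   # words collected in the query region

# 1) Candidate frames sharing at least one word with the query.
cands = set().union(*(inv[w] for w in np.nonzero(query)[0]))

# 2) Rank candidates by normalized histogram similarity (plain cosine).
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

ranked = sorted(cands, key=lambda f: -cosine(query, frames[f]))
```

Frame f1 shares no words with the query, so the inverted file never even scores it.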
Basic flow
1. Detect or sample features -> list of positions, scales, orientations
2. Describe features -> associated list of d-dimensional descriptors
3. Index each one into a pool of descriptors from previously seen images, or quantize to form a bag of words vector for the image
Visual vocabulary formation
Issues:
• Sampling strategy
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words
Sampling strategies
(Figure panels: sparse, at interest points; dense, uniformly; randomly; multiple interest operators)
• To find specific, textured objects, sparse sampling from interest points is often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage.
[See Nowak, Jurie & Triggs, ECCV 2006]
Image credits: F-F. Li, E. Nowak, J. Sivic
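Dense and random sampling are simple to sketch (sparse interest-point sampling needs an actual detector, e.g. DoG, and is omitted); the image size and grid step below are arbitrary:

```python
import numpy as np

h, w = 60, 80                      # toy image size
rng = np.random.default_rng(0)

# Dense sampling: patch centers on a uniform grid with a fixed step.
step = 10
ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
dense = np.stack([ys.ravel(), xs.ravel()], axis=1)

# Random sampling: the same number of patch centers, drawn uniformly.
n = len(dense)
random_pts = np.stack([rng.integers(0, h, n), rng.integers(0, w, n)], axis=1)
```

Dense sampling guarantees coverage of untextured regions that an interest operator would skip, at the price of many more patches to describe and quantize.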
Clustering / quantization methods
• k-means (typical choice), agglomerative clustering, mean-shift, …
• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies
• Vocabulary tree [Nister & Stewenius, CVPR 2006]
Example: Recognition with Vocabulary Tree
• Tree construction [Nister & Stewenius, CVPR'06]
Slide credit: David Nister
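Tree construction is hierarchical k-means: cluster the descriptors, then recurse within each branch. A compact sketch with toy 2-D data (real systems use large branch factors and SIFT descriptors); filling the tree and looking up a word both reuse the same root-to-leaf descent:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=15):
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

def build_tree(X, branch=2, depth=2):
    """Hierarchical k-means: split the data, recurse within each split."""
    if depth == 0 or len(X) < branch:
        return None
    centers, labels = kmeans(X, branch)
    children = [build_tree(X[labels == j], branch, depth - 1)
                for j in range(branch)]
    return {"centers": centers, "children": children}

def word_of(tree, x, path=()):
    """Descend from the root; the leaf path is the descriptor's word."""
    if tree is None:
        return path
    j = int(((tree["centers"] - x) ** 2).sum(1).argmin())
    return word_of(tree["children"][j], x, path + (j,))

# Four tight toy clusters -> a 2-way tree of depth 2 (4 leaf words).
X = np.vstack([rng.normal(c, 0.05, (10, 2))
               for c in ([0, 0], [0, 5], [5, 0], [5, 5])])
tree = build_tree(X, branch=2, depth=2)
```

Word assignment costs only branch × depth distance comparisons instead of one comparison per vocabulary word, which is what makes very large vocabularies practical.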
Vocabulary Tree
• Training: filling the tree [Nister & Stewenius, CVPR'06]
Vocabulary Tree
• Recognition: RANSAC verification [Nister & Stewenius, CVPR'06]
Vocabulary Tree: Performance
• Evaluated on large databases: indexing with up to 1M images
• Online recognition for a database of 50,000 CD covers; retrieval in ~1 s
• Found experimentally that large vocabularies can be beneficial for recognition
[Nister & Stewenius, CVPR'06]
Vocabulary formation
• Ensembles of trees provide additional robustness.
Moosmann, Jurie & Triggs 2006; Yeh, Lee & Darrell 2007; Bosch, Zisserman & Munoz 2007; …
Figure credit: F. Jurie
Supervised vocabulary formation
• Recent work considers how to leverage labeled images when constructing the vocabulary.
Perronnin, Dance, Csurka & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006

Supervised vocabulary formation
• Merge words that don't aid in discriminability.
Winn, Criminisi & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005

Supervised vocabulary formation
• Consider vocabulary and classifier construction jointly.
Yang, Jin, Sukthankar & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008
Learning and recognition with bag of words histograms
• The bag of words representation makes it possible to describe an unordered point set with a single vector of fixed dimension across image examples.
• Provides an easy way to use the distribution of feature types with various learning algorithms requiring vector input.
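For instance, once every image is a fixed-length histogram, any vector-based learner applies. A minimal nearest-centroid classifier over made-up bag of words histograms (a stand-in for the more powerful learners, e.g. SVMs, typically used in practice):

```python
import numpy as np

# Made-up bag of words histograms (4-word vocabulary) with class labels.
X = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.7, 0.3, 0.0, 0.0],
              [0.0, 0.1, 0.5, 0.4],
              [0.0, 0.0, 0.6, 0.4]])
y = np.array([0, 0, 1, 1])

# Nearest-centroid classifier: one mean histogram per class.
centroids = np.stack([X[y == c].mean(0) for c in np.unique(y)])

def predict(h):
    """Label a new image by its closest class-mean histogram."""
    return int(((centroids - h) ** 2).sum(1).argmin())
```

Any distance or kernel on histograms (chi-squared, histogram intersection, …) could be swapped in for the squared Euclidean distance used here.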
Learning and recognition with bag of words histograms
• …including unsupervised topic models designed for documents:
• Hierarchical Bayesian text models (pLSA and LDA) [Hofmann 2001; Blei, Ng & Jordan 2003]
• For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005
Figure credit: Fei-Fei Li
Learning and recognition with bag of words histograms
• …including unsupervised topic models designed for documents.
• Probabilistic Latent Semantic Analysis (pLSA), used to discover object categories such as "face" [Sivic et al. ICCV 2005]
(Figure: pLSA plate notation with documents d, topics z, and words w, repeated N times per document over D documents)
pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/
Figure credit: Fei-Fei Li
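The linked VGG code is a full implementation; purely to illustrate the EM updates, here is a compact pLSA sketch on a made-up image-by-word count matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up counts n(d, w): 4 "images" over a 6-word visual vocabulary.
# Images 0-1 mostly use words 0-2; images 2-3 mostly use words 3-5.
n_dw = np.array([[5, 4, 3, 0, 0, 1],
                 [4, 5, 2, 1, 0, 0],
                 [0, 1, 0, 4, 5, 3],
                 [1, 0, 0, 5, 4, 4]], dtype=float)
D, W = n_dw.shape
Z = 2                                    # number of latent topics

# Random initialization of P(w|z) and P(z|d), each row-normalized.
p_w_z = rng.random((Z, W))
p_w_z /= p_w_z.sum(1, keepdims=True)
p_z_d = rng.random((D, Z))
p_z_d /= p_z_d.sum(1, keepdims=True)

for _ in range(100):
    # E-step: responsibilities P(z|d,w) proportional to P(z|d) P(w|z).
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, Z, W)
    p_z_dw = joint / joint.sum(1, keepdims=True)
    # M-step: expected counts re-estimate both distributions.
    c = n_dw[:, None, :] * p_z_dw                        # (D, Z, W)
    p_w_z = c.sum(0)
    p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = c.sum(2)
    p_z_d /= p_z_d.sum(1, keepdims=True)

topic_of = p_z_d.argmax(1)               # dominant topic per image
```

On this toy data the two discovered topics align with the two word blocks, so the images pair up by dominant topic without any labels.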
Bags of words: pros and cons
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ has yielded good recognition results in practice
- basic model ignores geometry: must verify afterwards, or encode via features
- background and foreground mixed when the bag covers the whole image
- interest points or sampling: no guarantee to capture object-level parts
- optimal vocabulary formation remains unclear
Outline
• Detection with Global Appearance & Sliding Windows
• Local Invariant Features: Detection & Description
• Specific Object Recognition with Local Features
― Coffee Break ―
• Visual Words: Indexing, Bags of Words Categorization
• Matching Local Feature Sets
• Part-Based Models for Categorization
• Current Challenges and Research Directions