Learning Local Affine Representations for Texture and Object Recognition. Svetlana Lazebnik, Beckman Institute, University of Illinois at Urbana-Champaign (joint work with Cordelia Schmid, Jean Ponce).
Overview • Goal: • Recognition of 3D textured surfaces, object classes • Our contribution: • Texture and object representations based on local affine regions • Advantages of proposed approach: • Distinctive, repeatable primitives • Robustness to clutter and occlusion • Ability to approximate 3D geometric transformations
The Scope • Recognition of single-texture images (CVPR 2003) • Recognition of individual texture regions in multi-texture images (ICCV 2003) • Recognition of object classes (BMVC 2004, work in progress)
Affine Region Detectors Harris detector (H) Laplacian detector (L) Mikolajczyk & Schmid (2002), Gårding & Lindeberg (1996)
Affine Rectification Process Patch 1 Patch 2 Rectified patches (rotational ambiguity)
Rotation-Invariant Descriptors 1: Spin Images • Based on range spin images (Johnson & Hebert 1998) • Two-dimensional histogram: distance from center × intensity value
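The intensity-domain spin image can be sketched as follows. This is a minimal illustration, assuming a square, already rectified patch with intensities scaled to [0, 1] and hard histogram binning; the function name `spin_image` and the bin counts are hypothetical choices, not values taken from the slides.

```python
import numpy as np

def spin_image(patch, n_dist=10, n_int=10):
    """2D histogram over (distance from center, intensity) for a square patch.

    Assumes intensities in [0, 1]. Only pixels inside the inscribed disc are
    counted, so the result does not change when the patch is rotated.
    """
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    radius = min(cy, cx)
    inside = dist <= radius
    hist, _, _ = np.histogram2d(
        dist[inside] / radius,      # distance normalized to [0, 1]
        patch[inside],              # intensity value
        bins=(n_dist, n_int),
        range=((0.0, 1.0), (0.0, 1.0)),
    )
    return hist / hist.sum()        # normalize to a distribution
```

Because each pixel contributes only through its distance from the center and its intensity, rotating the patch permutes the pixels without changing the histogram, which is how the descriptor sidesteps the rotational ambiguity left by affine rectification.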
Rotation-Invariant Descriptors 2: RIFT • Based on SIFT (Lowe 1999) • Two-dimensional histogram: distance from center × gradient orientation • Gradient orientation is measured relative to the direction pointing outward from the center of the patch
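A rough sketch of a RIFT-style histogram is given below. It assumes a square rectified patch, finite-difference gradients, hard binning, and votes weighted by gradient magnitude; the weighting, the name `rift`, and the default bin counts are assumptions for illustration rather than details stated on the slide.

```python
import numpy as np

def rift(patch, n_rings=4, n_orient=8):
    """Histogram over (distance from center, gradient orientation).

    Orientation is measured relative to the outward radial direction at each
    pixel, so the descriptor is unchanged by rotation about the patch center.
    """
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    gy, gx = np.gradient(patch)                 # derivatives along rows, columns
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    dist = np.sqrt(dy ** 2 + dx ** 2)
    radius = min(cy, cx)
    mag = np.sqrt(gy ** 2 + gx ** 2)
    inside = (dist <= radius) & (dist > 0)      # radial direction undefined at center
    # Angle between the gradient and the outward direction, wrapped to [0, 2*pi).
    rel = (np.arctan2(gy, gx) - np.arctan2(dy, dx)) % (2 * np.pi)
    hist, _, _ = np.histogram2d(
        dist[inside] / radius,
        rel[inside],
        bins=(n_rings, n_orient),
        range=((0.0, 1.0), (0.0, 2 * np.pi)),
        weights=mag[inside],                    # assumption: magnitude-weighted votes
    )
    total = hist.sum()
    return hist / total if total > 0 else hist
```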
Signatures and EMD • Signatures: S = {(m1, w1), …, (mk, wk)}, where mi is a cluster center and wi is its relative weight • Earth Mover’s Distance (Rubner et al. 1998) • Computed from ground distances d(mi, m'j) • Can compare signatures of different sizes • Insensitive to the number of clusters
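For reference, the Earth Mover's Distance as defined by Rubner et al. can be written as below, with m_i, w_i standing for the slide's mi, wi. The flow variables f_ij are not named on the slide and are introduced here only to state the definition.

```latex
\mathrm{EMD}(S, S') =
  \frac{\sum_{i}\sum_{j} f_{ij}\, d(m_i, m'_j)}{\sum_{i}\sum_{j} f_{ij}},
\qquad
\text{where } \{f_{ij}\} \text{ minimizes } \sum_{i}\sum_{j} f_{ij}\, d(m_i, m'_j)
\text{ subject to}
\]
\[
f_{ij} \ge 0, \quad
\sum_{j} f_{ij} \le w_i, \quad
\sum_{i} f_{ij} \le w'_j, \quad
\sum_{i}\sum_{j} f_{ij} = \min\Bigl(\sum_{i} w_i,\ \sum_{j} w'_j\Bigr).
```

The constraints involve only the weights, never the number of clusters, which is why signatures of different sizes can be compared directly.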
Database: Textured Surfaces 25 textures, 40 sample images each (640x480)
Evaluation • Channels: HS, HR, LS, LR • Combined through addition of EMD matrices • Classification results • 10 training images per class, rates averaged over 200 random training subsets
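The channel combination can be sketched as follows, assuming each channel (HS, HR, LS, LR) yields a test-by-training matrix of EMD values. The slide does not name the classifier; nearest-neighbor classification is assumed here, and `classify_nn` and its arguments are hypothetical.

```python
import numpy as np

def classify_nn(channel_dists, train_labels):
    """Add per-channel EMD matrices, then label each test image by its nearest
    training image under the combined distance.

    channel_dists: list of (n_test, n_train) arrays, one per channel.
    train_labels:  sequence of n_train class labels.
    """
    combined = np.sum(channel_dists, axis=0)    # element-wise sum over channels
    nearest = np.argmin(combined, axis=1)       # closest training image per test image
    return np.asarray(train_labels)[nearest]
```

For example, with two channels and two training images labeled "brick" and "wood", a test image whose summed distance to the first training image is smallest is labeled "brick".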
Results of Evaluation: Classification rate vs. number of training samples [figure: curves for (H+L)(S+R), VZ-Joint, VZ-MRF] • Conclusion: an intrinsically invariant representation is necessary to deal with intra-class variations when they are not adequately represented in the training set
Summary • A sparse texture representation based on local affine regions • Two novel descriptors (spin images, RIFT) • Successful recognition in the presence of viewpoint changes, non-rigidity, non-homogeneity • A flexible approach to invariance
2. Recognition of Individual Regions in Multi-Texture Images • A two-layer architecture: • Local appearance + neighborhood relations • Learning: • Represent the local appearance of each texture class using a mixture-of-Gaussians model • Compute co-occurrence statistics of sub-class labels over affinely adapted neighborhoods • Recognition: • Obtain initial class membership probabilities from the generative model • Use relaxation to refine these probabilities
Two Learning Scenarios • Fully supervised: every region in the training image is labeled with its texture class • Weakly supervised: each training image is labeled with the classes occurring in it [figure: example training images labeled "brick" and "brick, marble, carpet"]
Neighborhood Statistics • Estimate, for pairs of sub-class labels c and c' occurring in neighboring regions: • probability p(c,c') • correlation r(c,c') [figure: neighborhood definition]
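One way to estimate these statistics is sketched below, assuming the training data has already been reduced to a list of (c, c') sub-class label pairs, one per region and neighbor. The correlation is computed as the standard correlation coefficient of the two label-indicator variables, which is an assumption about how r(c,c') is defined; `neighborhood_stats` is a hypothetical name.

```python
import numpy as np

def neighborhood_stats(pairs, n_labels):
    """Estimate joint probability p(c, c') and correlation r(c, c').

    pairs: iterable of (c, c') integer labels, one per (region, neighbor).
    """
    counts = np.zeros((n_labels, n_labels))
    for c, c2 in pairs:
        counts[c, c2] += 1
    p_joint = counts / counts.sum()
    p_c = p_joint.sum(axis=1)                   # marginal of the region label
    p_c2 = p_joint.sum(axis=0)                  # marginal of the neighbor label
    cov = p_joint - np.outer(p_c, p_c2)
    std = np.sqrt(np.outer(p_c * (1 - p_c), p_c2 * (1 - p_c2)))
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(std > 0, cov / std, 0.0)   # undefined entries set to 0
    return p_joint, r
```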
Relaxation (Rosenfeld et al. 1976) • Iterative process: • Initialized with posterior probabilities p(c|xi) obtained from the generative model • For each region i and each sub-class label c, update the probability pi(c) based on neighbor probabilities pj(c') and correlations r(c,c') • Shortcomings: • No formal guarantee of convergence • After the initialization, the updates to the probability values do not depend on the image data
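The iterative process can be sketched with one common form of the relaxation labeling update. The slides do not spell out the exact rule or the neighbor weights, so the multiplicative update and the uniform averaging over neighbors below are assumptions, and `relax` is a hypothetical name.

```python
import numpy as np

def relax(p, neighbors, r, n_iter=10):
    """Refine label probabilities using neighbor support.

    p:         (n_regions, n_labels) initial probabilities p(c | x_i).
    neighbors: list where neighbors[i] holds the indices of region i's neighbors.
    r:         (n_labels, n_labels) correlations r(c, c') in [-1, 1].
    """
    p = np.array(p, dtype=float)
    r = np.asarray(r, dtype=float)
    for _ in range(n_iter):
        q = np.zeros_like(p)
        for i, nbrs in enumerate(neighbors):
            if len(nbrs) > 0:
                # Support for label c: average over neighbors j of
                # sum_{c'} r(c, c') * p_j(c').
                q[i] = (p[nbrs] @ r.T).mean(axis=0)
        p = p * (1.0 + q)                       # boost supported labels
        p /= p.sum(axis=1, keepdims=True)       # renormalize each region
    return p
```

Note that the loop reads only `p`, `neighbors`, and `r`; once initialized, the image data never re-enters the update, which is the second shortcoming listed above.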
Experiment 1: 3D Textured Surfaces Single-texture images T1 (brick) T2 (carpet) T3 (chair) T4 (floor 1) T5 (floor 2) T6 (marble) T7 (wood) Multi-texture images 10 single-texture training images per class, 13 two-texture training images, 45 multi-texture test images
Effect of Relaxation on Labeling Original image Top: before relaxation, bottom: after relaxation
Retrieval (single-texture training images) T1 (brick) T2 (carpet) T3 (chair) T4 (floor 1) T5 (floor 2) T6 (marble) T7 (wood)
Experiment 2: Animals • No manual segmentation • Training data: 10 sample images per class • Test data: 20 samples per class + 20 negative images [figure: training images labeled "cheetah, background", "zebra, background", "giraffe, background"]
Summary • A two-level representation (local appearance + neighborhood relations) • Weakly supervised learning of texture models
Future Work • Design an improved representation using a random field framework, e.g., conditional random fields (Lafferty 2001, Kumar & Hebert 2003) • Develop a procedure for weakly supervised learning of random field parameters • Apply method to recognition of natural texture categories
3. Recognition of Object Classes The approach: • Represent objects using multiple composite semi-local affine parts • More expressive than individual regions • Not globally rigid • Correspondence search is key to learning and detection
Correspondence Search • Basic operation: a two-image matching procedure for finding collections of affine regions that can be mapped onto each other using a single affine transformation • Implementation: greedy search based on geometric and photometric consistency constraints • Returns multiple correspondence hypotheses • Automatically determines number of regions in correspondence • Works on unsegmented, cluttered images (weakly supervised learning)
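The geometric side of the consistency check can be illustrated by fitting a single affine transformation to matched region centers and inspecting the residuals. This is only a sketch of that one ingredient, assuming point correspondences are already given; it is not the greedy search itself, and `fit_affine` is a hypothetical helper.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map with dst ≈ src @ A.T + t.

    src, dst: (n, 2) arrays of matched points, n >= 3 and not collinear.
    Returns A (2x2), t (2,), and the per-point residual distances.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])        # homogeneous coordinates
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)    # shape (3, 2)
    A, t = params[:2].T, params[2]
    residuals = np.linalg.norm(X @ params - dst, axis=1)
    return A, t, residuals
```

A candidate group of matches would be kept only if every residual falls below some tolerance; a group that needs two different transformations to explain it produces large residuals under any single fit.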
Matching: 3D Objects closeup closeup
Matching: Faces spurious match ???
Learning Object Models for Recognition • Match multiple pairs of training images to produce a set of candidate parts • Use additional validation images to evaluate repeatability of parts and individual regions • Retain a fixed number of parts having the best repeatability score
Recognition Experiment: Butterflies Admiral Swallowtail Machaon Monarch 1 Monarch 2 Peacock Zebra • 16 training images (8 pairs) per class • 10 validation images per class • 437 test images • 619 images total
Recognition • Top 10 parts per class used for recognition • Relative repeatability score: (total number of regions detected) / (total part size) • Classification results: [table residue: column "Total part size (smallest/largest)"]
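The score can be illustrated with a short sketch. It assumes that "total part size" means the summed region count of the parts kept for a class and that a test image is assigned to the class with the highest score; both readings are assumptions, and the function names are hypothetical.

```python
def relative_repeatability(detected, part_sizes):
    """Ratio of regions detected in a test image to the total part size.

    detected:   number of regions found in the image, one entry per part.
    part_sizes: number of regions in each part of the class model.
    """
    total_size = sum(part_sizes)
    if total_size == 0:
        raise ValueError("class model has no regions")
    return sum(detected) / total_size

def classify(scores):
    """Pick the class whose model has the highest relative repeatability."""
    return max(scores, key=scores.get)
```

Dividing by the total part size keeps scores comparable across classes whose models contain very different numbers of regions.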
Detection Results (ROC Curves) Circles: reference relative repeatability rates. Red square: ROC equal error rate (in parentheses)
Successful Detection Examples Training images Test images (blue: occluded regions) All ellipses found in the test images
Unsuccessful Detection Examples Training images Test images (blue: occluded regions) All ellipses found in the test image
Summary • Semi-local affine parts for describing structure of 3D objects • Finding a part vocabulary: • Correspondence search between pairs of images • Validation • Additional application: • Finding symmetry and repetition
Future Work • Find a better affine region detector • Represent, learn inter-part relations • Evaluation: CalTech database, harder classes, etc.
Birds Egret Puffin Snowy Owl Mandarin Duck Wood Duck
Birds: Candidate Parts Mandarin Duck Puffin
Objects without Characteristic Texture (LeCun’04)
Summary of Talk • Recognition of single-texture images • Distribution of local appearance descriptors • Recognition of individual regions in multi-texture images • Local appearance + loose statistical neighborhood relations • Recognition of object categories • Local appearance + strong geometric relations For more information: http://www-cvr.ai.uiuc.edu/ponce_grp
Issues, Extensions • Weakly supervised learning • Evaluation methods? • Learning from contaminated data? • Probabilistic vs. geometric approaches to invariance • EM vs. direct correspondence search • Training set size • Background modeling • Strengthening the representation • Heterogeneous local features • Automatic feature selection • Inter-part relations