550 likes | 791 Views
NIPS 2003 Tutorial Real-time Object Recognition using Invariant Local Image Features. David Lowe Computer Science Department University of British Columbia. Object Recognition. Definition: Identify an object and determine its pose and model parameters Commercial object recognition
E N D
NIPS 2003 TutorialReal-time Object Recognition using Invariant Local Image Features David Lowe Computer Science Department University of British Columbia
Object Recognition • Definition: Identify an object and determine its pose and model parameters • Commercial object recognition • Currently a $4 billion/year industry for inspection and assembly • Almost entirely based on template matching • Upcoming applications • Mobile robots, toys, user interfaces • Location recognition • Digital camera panoramas, 3D scene modeling
Invariant Local Features • Image content is transformed into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters SIFT Features
Advantages of invariant local features • Locality: features are local, so robust to occlusion and clutter (no prior segmentation) • Distinctiveness: individual features can be matched to a large database of objects • Quantity: many features can be generated for even small objects • Efficiency: close to real-time performance • Extensibility: can easily be extended to wide range of differing feature types, with each adding robustness
Zhang, Deriche, Faugeras, Luong (95) • Apply Harris corner detector • Match points by correlating only at corner points • Derive epipolar alignment using robust least-squares
Cordelia Schmid & Roger Mohr (97) • Apply Harris corner detector • Use rotational invariants at corner points • However, not scale invariant. Sensitive to viewpoint and illumination change.
Scale invariance Requires a method to repeatably select points in location and scale: • The only reasonable scale-space kernel is a Gaussian (Koenderink, 1984; Lindeberg, 1994) • An efficient choice is to detect peaks in the difference of Gaussian pyramid (Burt & Adelson, 1983; Crowley & Parker, 1984 – but examining more scales) • Difference-of-Gaussian with constant ratio of scales is a close approximation to Lindeberg’s scale-normalized Laplacian (can be shown from the heat diffusion equation)
Key point localization • Detect maxima and minima of difference-of-Gaussian in scale space • Fit a quadratic to surrounding values for sub-pixel and sub-scale interpolation (Brown & Lowe, 2002) • Taylor expansion around point: • Offset of extremum (use finite differences for derivatives):
Sampling frequency for scale More points are found as sampling frequency increases, but accuracy of matching decreases after 3 scales/octave
Eliminating unstable keypoints • Discard points with DOG value below threshold (low contrast) • However, points along edges may have high contrast in one direction but low in another • Compute principal curvatures from eigenvalues of 2x2 Hessian matrix, and limit ratio (Harris approach):
Select canonical orientation • Create histogram of local gradient directions computed at selected scale • Assign canonical orientation at peak of smoothed histogram • Each key specifies stable 2D coordinates (x, y, scale, orientation)
Example of keypoint detection Threshold on value at DOG peak and on ratio of principle curvatures (Harris approach) • (a) 233x189 image • (b) 832 DOG extrema • (c) 729 left after peak • value threshold • (d) 536 left after testing • ratio of principle • curvatures
Edelman, Intrator & Poggio (97) showed that complex cell outputs are better for 3D recognition than simple correlation Creating features stable to viewpoint change
Classification of rotated 3D models (Edelman 97): Complex cells: 94% vs simple cells: 35% Stability to viewpoint change
SIFT vector formation • Thresholded image gradients are sampled over 16x16 array of locations in scale space • Create array of orientation histograms • 8 orientations x 4x4 histogram array = 128 dimensions
Feature stability to noise • Match features after random change in image scale & orientation, with differing levels of image noise • Find nearest neighbor in database of 30,000 features
Feature stability to affine change • Match features after random change in image scale & orientation, with 2% image noise, and affine distortion • Find nearest neighbor in database of 30,000 features
Distinctiveness of features • Vary size of database of features, with 30 degree affine change, 2% image noise • Measure % correct for single nearest neighbor match
Nearest-neighbor matching to feature database • Hypotheses are generated by matching each feature to nearest neighbor vectors in database • No fast method exists for always finding 128-element vector to nearest neighbor in a large database • Therefore, use approximate nearest neighbor: • We use best-bin-first (Beis & Lowe, 97) modification to k-d tree algorithm • Use heap data structure to identify bins in order by their distance from query point • Result: Can give speedup by factor of 1000 while finding nearest neighbor (of interest) 95% of the time
Detecting 0.1% inliers among 99.9% outliers • Need to recognize clusters of just 3 consistent features among 3000 feature match hypotheses • LMS or RANSAC would be hopeless! • Generalized Hough transform • Vote for each potential match according to model ID and pose • Insert into multiple bins to allow for error in similarity approximation • Using a hash table instead of an array avoids need to form empty bins or predict array size
Probability of correct match • Compare distance of nearest neighbor to second nearest neighbor (from different object) • Threshold of 0.8 provides excellent separation
Model verification • Examine all clusters in Hough transform with at least 3 features • Perform least-squares affine fit to model. • Discard outliers and perform top-down check for additional features. • Evaluate probability that match is correct • Use Bayesian model, with probability that features would arise by chance if object was not present • Takes account of object size in image, textured regions, model feature count in database, accuracy of fit (Lowe, CVPR 01)
Solution for affine parameters • Affine transform of [x,y] to [u,v]: • Rewrite to solve for transform parameters:
Models for planar surfaces with SIFT keys Planar texture models
Planar recognition • Planar surfaces can be reliably recognized at a rotation of 60° away from the camera • Affine fit approximates perspective projection • Only 3 points are needed for recognition
3D Object Recognition • Extract outlines with background subtraction
3D Object Recognition • Only 3 keys are needed for recognition, so extra keys provide robustness • Affine model is no longer as accurate
Test of illumination invariance • Same image under differing illumination 273 keys verified in final match
Robot Localization • Joint work with Stephen Se, Jim Little
Recognizing Panoramas Matthew Brown and David Lowe (ICCV 2003) • Recognize overlap from an unordered set of images and automatically stitch together • SIFT features provide initial feature matching • Image blending at multiple scales hides the seams Panorama of our lab automatically assembled from 143 images
Why Panoramas? • Are you getting the whole picture? • Compact Camera FOV = 50 x 35°
Why Panoramas? • Are you getting the whole picture? • Compact Camera FOV = 50 x 35° • Human FOV = 200 x 135°
Why Panoramas? • Are you getting the whole picture? • Compact Camera FOV = 50 x 35° • Human FOV = 200 x 135° • Panoramic Mosaic = 360 x 180°
Bundle Adjustment • New images initialised with rotation, focal length of best matching image
Bundle Adjustment • New images initialised with rotation, focal length of best matching image
Multi-band Blending • Burt & Adelson 1983 • Blend frequency bands over range l
2-band Blending Low frequency (l > 2 pixels) High frequency (l < 2 pixels)