Distinctive Image Features from Scale-Invariant Keypoints

Ronnie Bajwa, Sameer Pawar (adapted from slides found online by Michael Kowalski, Lehigh University)

Goal • Extract distinctive invariant features from an image that can be used to perform reliable object recognition and many other applications.

Key requirements for a good feature • Highly distinctive, with a low probability of mismatch • Should be easy to extract • Invariance to: - Scale and rotation - Change in illumination - Slight change in viewing angle - Image noise - Clutter and occlusion

Brief History • Moravec (1981): • Corner detector. • Harris (1988): • Selects locations that have large gradients in all directions (corners). • Invariant to image rotation and to slight affine intensity change, but has problems with scale change. • Zhang (1995): • Introduced a correlation window to match images over a large range of motion. • Schmid (1997): • Suggested the use of local feature matching for image recognition. • Used a rotationally invariant descriptor of the local image region. • Multiple feature matches accomplish recognition under occlusion and clutter.

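Harris Corner Detector: • λ1 and λ2 are the eigenvalues of the local gradient (second-moment) matrix; the corresponding ellipse has axis length (λmax)^(-1/2) along the direction of the fastest change and (λmin)^(-1/2) along the direction of the slowest change • "Edge": λ2 >> λ1 (or λ1 >> λ2) • "Corner": λ1 and λ2 are both large, λ1 ~ λ2 • "Flat": λ1 and λ2 are both small, λ1 ~ λ2 ~ 0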

SIFT Algorithm Overview Filtered approach • Scale-space extrema detection • Identify potential points that are invariant to scale and orientation. • Uses a difference-of-Gaussian function. • Keypoint localization • Improve the location estimate by fitting a quadratic. • Extrema are thresholded to filter out insignificant points. • Orientation assignment • Orientation assigned to each keypoint, based on the local gradients of neighboring pixels. • Keypoint descriptor construction • Feature vector based on gradients of the local neighborhood.

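Keypoint Selection: Scale space • We express the image at different scales by filtering it with a Gaussian kernel: L(x, y, σ) = G(x, y, σ) * I(x, y), where * denotes convolution, I is the input image, and G is a Gaussian with standard deviation σ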

Keypoint Selection: why DoGs? • Lindeberg (1994) and Mikolajczyk (2002) found that the maxima and minima of the scale-normalized Laplacian of Gaussian provide the most stable scale-invariant features • We can use the difference of adjacent smoothed images to approximate this: D(x, y, σ) = L(x, y, kσ) - L(x, y, σ) • Efficient to compute • The smoothed images L are needed later anyway, so D can be computed by simple image subtraction
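
A minimal sketch of the difference-of-Gaussian idea, shown in 1D so it stays short. The function names, the truncation of the kernel at about 3σ, and the border clamping are illustrative assumptions, not part of the original slides:

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel, truncated at about 3 sigma (assumption)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma):
    """L(x, sigma): convolve the signal with a Gaussian, clamping at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def difference_of_gaussian(signal, sigma, k):
    """D(x, sigma) = L(x, k * sigma) - L(x, sigma)."""
    coarse = smooth(signal, k * sigma)
    fine = smooth(signal, sigma)
    return [c - f for c, f in zip(coarse, fine)]

# Hypothetical 1D "blob": a short bright segment on a dark background.
signal = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
dog = difference_of_gaussian(signal, sigma=1.6, k=2 ** (1 / 3))
```

The response is negative at the centre of the bright blob and zero on constant regions, which is why extrema of D mark blob-like structures.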

Back to picking keypoints • 1. Supersample the original image • 2. Compute smoothed images at different scales σ for the entire octave • 3. Compute DoG images from adjacent scales for the entire octave • 4. Isolate keypoints in each octave by detecting extrema in the DoG images compared to neighboring pixels • 5. Subsample the image that has scale 2σ in the current octave and repeat steps 2-4 for the next octave
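
The two core pieces of this loop can be sketched as follows. This assumes, following Lowe's paper rather than the slides, that an octave with s intervals uses k = 2^(1/s) and s + 3 smoothed images, and that a candidate is compared with its 26 neighbours (8 in its own DoG image, 9 in each adjacent scale). The function names and the nested-list layout `dog[scale][row][col]` are hypothetical:

```python
def octave_sigmas(sigma0=1.6, intervals=3):
    """Blur levels for one octave: intervals + 3 images, sigma_i = sigma0 * k**i."""
    k = 2 ** (1.0 / intervals)
    return [sigma0 * k ** i for i in range(intervals + 3)]

def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbours."""
    value = dog[s][y][x]
    neighbours = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return value > max(neighbours) or value < min(neighbours)

# Tiny hypothetical DoG stack: 3 scales of 3x3 pixels with one bright centre.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = 1.0
sigmas = octave_sigmas()
```

With the default of 3 intervals, the fourth image in the list has twice the starting blur, which is the image that gets subsampled to start the next octave.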

Visual representation of process

Visual representation of process (cont.)

A More Visual Representation Original Image Starting Image

A More Visual Representation (cont.) First Octave of L images

A More Visual Representation (cont.) Second Octave Third Octave Fourth Octave

A More Visual Representation (cont.) First Octave difference-of-Gaussians

A More Visual Representation (cont.)

Scale space sampling • How many fine scales in every octave? • Extrema can be arbitrarily close together, but very close ones are unstable. • After subsampling and before computing the scaled images of the octave, a prior smoothing of σ = 1.6 is applied • To compensate for the loss of higher spatial frequencies, the original image is doubled in size

Accurate Keypoint Localization • From difference-of-Gaussian local extrema detection we obtain approximate values for keypoints • Originally these approximations were used directly • To improve matching and stability, a 3D quadratic function is fitted to the local sample points

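The Taylor Series Expansion • Take the Taylor series expansion of the scale-space function D(x, y, σ) • Use terms up to the quadratic • D and its derivatives are evaluated at the sample point (the origin is shifted to the sample point) • x = (x, y, σ)^T is the offset from this sample point • To find the location of the extremum, take the derivative with respect to x and set it to 0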
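
The equations for this slide did not survive extraction. As given in Lowe's paper, with D and its derivatives evaluated at the sample point and x = (x, y, σ)^T the offset from it, the expansion and the resulting extremum location are:

```latex
D(\mathbf{x}) = D
  + \frac{\partial D}{\partial \mathbf{x}}^{T} \mathbf{x}
  + \frac{1}{2}\, \mathbf{x}^{T} \frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\, \mathbf{x}

\hat{\mathbf{x}} = -\left( \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}}
```

The second line follows from differentiating the first with respect to x and setting the result to zero.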

Thresholding Keypoints (part 1) • The function value at the extremum is used to reject unstable extrema • Low contrast • Evaluate D at the interpolated extremum location • If the absolute value is less than 0.03 at the extremum location (with pixel values in the range [0, 1]), the extremum is discarded
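
A minimal sketch of the contrast check, assuming the gradient and Hessian of D at the sample point have already been estimated (for example by finite differences). The function names are hypothetical, and Cramer's rule is used only to keep the 3x3 solve dependency-free:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def refine_extremum(value, gradient, hessian):
    """Return (offset, refined value) of the quadratic fit, or None if H is singular.

    offset solves H * offset = -gradient; refined = value + 0.5 * gradient . offset
    """
    d = det3(hessian)
    if d == 0:
        return None
    offset = []
    for col in range(3):
        # Cramer's rule: replace column `col` of H with -gradient.
        m = [row[:] for row in hessian]
        for r in range(3):
            m[r][col] = -gradient[r]
        offset.append(det3(m) / d)
    refined = value + 0.5 * sum(g * o for g, o in zip(gradient, offset))
    return offset, refined

def has_enough_contrast(refined_value, threshold=0.03):
    """Keep the extremum only if |D| at the refined location reaches the threshold."""
    return abs(refined_value) >= threshold

# Hypothetical sample: value 0.05, small gradient in x, negative-definite Hessian.
offset, refined = refine_extremum(
    0.05, [0.2, 0.0, 0.0],
    [[-2.0, 0.0, 0.0], [0.0, -2.0, 0.0], [0.0, 0.0, -2.0]])
```

In this example the extremum moves 0.1 pixels in x and the interpolated value rises from 0.05 to 0.06, so the point passes the 0.03 threshold.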

Thresholding Keypoints (part 2) • The difference-of-Gaussian function has a strong response along edges • Some locations along edges are poorly determined and become unstable when even small amounts of noise are added • These locations have a large principal curvature across the edge but a small principal curvature along the edge • Therefore we need to compute the principal curvatures at the location and compare the two

Computing the Principal Curvatures • Hessian matrix H = [Dxx Dxy; Dxy Dyy], computed at the location and scale of the keypoint • The eigenvalues of H are proportional to the principal curvatures • We are not concerned with the actual values of the eigenvalues, just the ratio of the two
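
A sketch of how the ratio can be tested without computing the eigenvalues. This follows the trace/determinant test from Lowe's paper, which also suggests r = 10 as the maximum allowed ratio; neither detail appears on the slide, and the function name is hypothetical:

```python
def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Keep a keypoint only if Tr(H)^2 / Det(H) < (r + 1)^2 / r.

    Tr(H) is the sum and Det(H) the product of the eigenvalues, so the ratio
    depends only on the ratio of the two principal curvatures.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures have different signs (or one is zero): not a stable extremum.
        return False
    return trace * trace / det < (r + 1) ** 2 / r

corner_like = passes_edge_test(1.0, 1.0, 0.0)   # equal curvatures
edge_like = passes_edge_test(20.0, 1.0, 0.0)    # curvature ratio of 20
```

Equal curvatures give the minimum possible value of the ratio, 4, while a curvature ratio of 20 exceeds the bound for r = 10 and is rejected.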

Stages of keypoint selection

Assigning an Orientation • We finally have a keypoint that we are going to keep • The next step is assigning an orientation for the keypoint • Used in making the matching technique invariant to rotation

Assigning an Orientation (cont.) • The Gaussian-smoothed image L with the closest scale is chosen (scale invariance) • Points in a region around the keypoint are selected, and the magnitude and orientation of the gradient are calculated at each • An orientation histogram is formed with 36 bins covering 360 degrees. Each sample is added to the appropriate bin, weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ of 1.5 times the scale of the keypoint
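
A minimal sketch of the histogram construction, assuming the smoothed image is a nested list `L[row][col]` and gradients are taken by pixel differences as in Lowe's paper. The function name, the square sampling window given by `radius`, and the convention that bin i covers [10i, 10(i+1)) degrees are illustrative assumptions:

```python
import math

def orientation_histogram(L, cx, cy, scale, radius, bins=36):
    """Accumulate gradient orientations around (cx, cy) into a weighted histogram."""
    hist = [0.0] * bins
    sigma = 1.5 * scale                      # Gaussian window from the slide
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Skip samples whose pixel differences would fall outside the image.
            if not (1 <= y < len(L) - 1 and 1 <= x < len(L[0]) - 1):
                continue
            dx = L[y][x + 1] - L[y][x - 1]
            dy = L[y + 1][x] - L[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            weight = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            hist[int(angle // (360.0 / bins)) % bins] += weight * magnitude
    return hist

# Hypothetical 9x9 image whose intensity increases from left to right.
ramp = [[float(x) for x in range(9)] for _ in range(9)]
hist = orientation_histogram(ramp, cx=4, cy=4, scale=1.0, radius=3)
```

For the horizontal ramp every gradient points along +x, so all of the weight lands in the first bin.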

Assigning an Orientation (cont.) • The highest peak in the orientation histogram is found, along with any other peaks within 80% of the highest peak • The 3 histogram values closest to each peak are used to interpolate (fit a parabola) a more accurate peak position • At this point each keypoint has 4 dimensions: x location, y location, scale, and orientation
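
The peak selection and parabola fit can be sketched as below. The function names are hypothetical, and the code assumes bin i is centred at (i + 0.5) times the bin width:

```python
def interpolate_peak(left, center, right):
    """Offset, in bins, of the vertex of the parabola through three neighbouring bins."""
    denom = left - 2.0 * center + right
    if denom == 0:
        return 0.0
    return 0.5 * (left - right) / denom

def dominant_orientations(hist, ratio=0.8):
    """Angles (degrees) of local peaks whose height is at least ratio * highest peak."""
    bins = len(hist)
    width = 360.0 / bins
    highest = max(hist)
    result = []
    for i, h in enumerate(hist):
        left, right = hist[i - 1], hist[(i + 1) % bins]   # histogram is circular
        if h > left and h > right and h >= ratio * highest:
            offset = interpolate_peak(left, h, right)
            result.append(((i + 0.5 + offset) * width) % 360.0)
    return result

# Hypothetical histogram: a main peak, a second peak at 90% and a third at 50%.
hist = [0.0] * 36
hist[8], hist[9], hist[10] = 0.5, 1.0, 0.5
hist[27] = 0.9
hist[18] = 0.5
angles = dominant_orientations(hist)
```

Each returned angle yields its own keypoint at the same location and scale, so here the location would produce two keypoints and the 50% peak is ignored.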

Keypoint Descriptor • Calculated using a region around the keypoint as opposed to directly from the keypoint for robustness • Like before, magnitudes and orientations are calculated for points in region around keypoint using L of nearest scale • To ensure orientation invariance the gradient orientations and coordinates of descriptor are rotated relative to orientation of keypoint • Provides invariance to changes in illumination and 3D camera viewpoint

Visual Representation of Keypoint Descriptor

Keypoint Descriptor (cont.) • The magnitude of each sample is weighted by a Gaussian with σ equal to one half the width of the descriptor window • Stops sudden changes in the descriptor due to small changes in the position of the window • Gives less weight to gradients far from the keypoint • Samples are divided into 4x4 subregions around the keypoint. A sample can shift by up to 4 positions and still contribute to the same histogram

Keypoint Descriptor (cont.) • Avoiding boundary effects between histograms • Trilinear interpolation is used to distribute the gradient value of each sample into adjacent histogram bins • Weight equal to 1 - d, where d is the distance of the sample from the center of a bin, measured in units of bin spacing • Vector normalization • Done at the end to ensure invariance to (affine) illumination change • Entire vector normalized to unit length • To combat non-linear illumination changes (camera saturation), values in the feature vector are thresholded to be no larger than 0.2 and then the vector is renormalized.
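
The normalization step can be sketched in a few lines. The function name and the handling of an all-zero vector are assumptions:

```python
import math

def normalize_descriptor(vec, clip=0.2):
    """Normalize to unit length, clamp entries at `clip`, then renormalize."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    clipped = [min(x, clip) for x in unit(vec)]
    return unit(clipped)

# Hypothetical 128-entry descriptor with one dominant (saturated) gradient bin.
raw = [10.0] + [1.0] * 127
desc = normalize_descriptor(raw)
```

Multiplying `raw` by a constant (a contrast change) leaves `desc` unchanged, and clamping reduces the influence of the single large entry before the final renormalization.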

We now have features • Up to this point we have: • Found rough approximations for features by looking at the difference-of-Gaussians • Localized the keypoints more accurately • Removed poor keypoints • Determined the orientation of each keypoint • Calculated a 128-dimensional feature vector for each keypoint • What do we do now?

Matching Keypoints between images • Tale of two images (or more) • One image is the training sample of what we are looking for • The other image is the world picture that might contain instances of the training sample • Both images have features associated with them across different octaves • How do we match features between the two?

Matching Keypoints between images (cont.) • Nearest-neighbor algorithm (Euclidean distance) • Independently match all keypoints in all octaves in one image with all keypoints in all octaves in the other image • How do we handle features that have no correct match in the other image (e.g. background noise, clutter)? • Global threshold on distance (did not perform well) • Ratio of the distance to the closest neighbor to the distance to the second-closest neighbor.

Matching Keypoints between images (cont.) • Threshold at 0.8 • Eliminates 90% of false matches • Eliminates less than 5% of correct matches
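
A brute-force sketch of nearest-neighbor matching with the ratio test. The function name and the tuple-based descriptors are hypothetical, and `math.dist` requires Python 3.8 or later:

```python
import math

def match_keypoints(query, train, ratio=0.8):
    """Return (query index, train index) pairs that pass the distance ratio test."""
    matches = []
    for qi, q in enumerate(query):
        # Euclidean distance from this descriptor to every candidate, closest first.
        dists = sorted((math.dist(q, t), ti) for ti, t in enumerate(train))
        if len(dists) < 2:
            continue
        (d1, best), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((qi, best))
    return matches

# Hypothetical 2D "descriptors" (real SIFT descriptors have 128 entries).
query = [(0.0, 0.0), (5.0, 5.0)]
train = [(0.1, 0.0), (10.0, 10.0), (5.0, 4.0), (5.0, 6.0)]
matches = match_keypoints(query, train)
```

The first query descriptor has one clearly closest candidate and is kept; the second is equally close to two candidates, so it is rejected as ambiguous.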

Efficient Nearest Neighbor indexing • In high dimensions, no known algorithm finds the exact nearest neighbor more efficiently than exhaustive search • k-dimensional (k-d) tree: O(2^k log n).

Efficient Nearest Neighbor indexing (cont.) • Use an approximate algorithm called Best-Bin-First (BBF) • Bins in feature space are searched in order of their closest distance from the query location. • Uses a heap-based priority queue. • Returns the closest neighbor with high probability • Search is cut off after checking the first 200 nearest-neighbor candidates • Works well since we only consider matches where the distance to the nearest neighbor is less than 0.8 times the distance to the second-nearest neighbor.

Clustering - Hough Transform • Small or highly occluded objects: • 3 feature matches are sufficient. • Up to 99% of matches may be outliers. • RANSAC doesn't perform well with such a large fraction of outliers. • Hough Transform • Uses each feature to vote for all object poses consistent with the feature. • Predicts an approximate model for a similarity transformation. • Bin sizes must be large to account for large error bounds: • Orientation: 30 degrees; scale: factor of 2; location: 0.25 times the maximum projected training image dimension. • Each keypoint votes for the two closest bins in each dimension. • 16 entries for each hypothesis (feature).
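
The count of 16 entries comes from voting for 2 bins in each of the 4 pose dimensions (2^4 = 16). A simplified sketch is below; the function name and parameters are hypothetical, and `max_dim` stands in for the maximum projected training image dimension, which in Lowe's paper depends on the predicted scale:

```python
import itertools
import math

def hough_bins(orientation_deg, log2_scale, x, y, max_dim):
    """Return the 16 (orientation, scale, x, y) bin indices one match votes for."""
    coords = (
        orientation_deg / 30.0,      # 30-degree orientation bins
        log2_scale,                  # factor-of-2 scale bins are unit bins in log2
        x / (0.25 * max_dim),        # location bins of 0.25 * max_dim
        y / (0.25 * max_dim),
    )
    choices = []
    for c in coords:
        lower = math.floor(c)
        # The two closest bins: the one containing c and its nearer neighbour.
        neighbour = lower + 1 if c - lower >= 0.5 else lower - 1
        choices.append((lower, neighbour))
    # Orientation is circular: 360 / 30 = 12 bins, so wrap the index.
    return [(o % 12, s, bx, by) for o, s, bx, by in itertools.product(*choices)]

# Hypothetical pose prediction from a single feature match.
votes = hough_bins(orientation_deg=350.0, log2_scale=0.3,
                   x=40.0, y=90.0, max_dim=200.0)
```

Clusters of at least 3 votes in the same bin would then be passed on to geometric verification.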

Model Verification-Affine Transformation • Geometric Verification of clusters

Results (Object Recognition)

Results (cont.)

Questions?
