Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification
Andrea Frome, Yoram Singer, Fei Sha, Jitendra Malik
Goal
• Nearest-neighbor classification using a learned distance D(·, ·) between a query image and each training exemplar
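To make the goal concrete, here is a minimal sketch of nearest-neighbor classification when each training exemplar carries its own learned distance function. The function and variable names are illustrative, not from the paper.

```python
import numpy as np

def nearest_neighbor_label(query_feats, exemplars):
    """Classify a query by the exemplar whose learned distance to it is smallest.

    `exemplars` is assumed to be a list of (label, distance_fn) pairs, where
    distance_fn maps the query's features to a non-negative distance.
    """
    best_label, best_dist = None, np.inf
    for label, distance_fn in exemplars:
        d = distance_fn(query_feats)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```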
Learning a Distance Metric from Relative Comparisons [Schulz & Joachims, NIPS ’03] D ( , ) D ( , ) = D ( , ) ( - )T ( - )
Approach
• For each feature m of image j, compute an elementary distance d_ji,m from that feature to image i
• Combine these into an image-to-image distance D_ji = Σ_m w_j,m d_ji,m, with one learned weight per feature of image j
• For an image k from a different class than i, require D_ji < D_ki: the same-class image j should be closer to i than the different-class image k
• Core question: how do we learn the weights w_j,m? (see the sketch below)
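A short sketch of the two quantities just introduced: the per-image weighted distance and the triplet condition it must satisfy. Variable names are illustrative; d_ji is assumed to be the vector of elementary distances from image j's features to image i.

```python
import numpy as np

def image_distance(w_j, d_ji):
    """D_ji = sum_m w_{j,m} * d_{ji,m}: weighted sum of elementary
    feature-to-image distances, one weight per feature of image j."""
    return float(np.dot(w_j, d_ji))

def triplet_satisfied(w_j, d_ji, w_k, d_ki, margin=1.0):
    """A triplet (i, j, k) -- j in the same class as i, k in a different class --
    is satisfied when the dissimilar distance exceeds the similar one by a margin."""
    return image_distance(w_k, d_ki) - image_distance(w_j, d_ji) >= margin
```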
Derivations
• Notation
• Large-margin formulation
• Dual problem
• Solution
Notation for triplet (i, j, k)
• D_ji = Σ_m w_j,m d_ji,m, written compactly as D_ji = w_j · d_ji
• The constraint D_ki > D_ji becomes w_k · d_ki > w_j · d_ji, enforced with a unit margin: w_k · d_ki − w_j · d_ji ≥ 1
• Stack all per-image weight vectors into W = (w_1, w_2, …, w_k, …, w_j, …) and define X_ijk = (0, 0, …, d_ki, …, −d_ji, …), nonzero only in the blocks belonging to images k and j
• Then W · X_ijk = w_k · d_ki − w_j · d_ji, so each triplet constraint is simply W · X_ijk ≥ 1
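The stacked-vector notation can be made concrete with a small sketch. For simplicity it assumes every image contributes the same number of features (the real setting allows varying counts); the point is only that W · X_ijk reproduces w_k · d_ki − w_j · d_ji.

```python
import numpy as np

def build_X_ijk(d_ji, d_ki, j, k, n_images, n_feats):
    """Construct X_ijk: zeros everywhere except +d_ki in image k's block and
    -d_ji in image j's block, so that W @ X_ijk = w_k . d_ki - w_j . d_ji."""
    X = np.zeros(n_images * n_feats)
    X[k * n_feats:(k + 1) * n_feats] = d_ki
    X[j * n_feats:(j + 1) * n_feats] = -d_ji
    return X

# With W the concatenation of all per-image weight vectors,
# the margin constraint for the triplet is simply:  W @ X_ijk >= 1
```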
Details – Features and descriptors
• Find ~400 features per image
• Compute a geometric blur descriptor for each feature
Descriptors
• Geometric blur at two patch sizes (42 pixels and 70 pixels)
• Each geometric blur descriptor is 204-dimensional (4 orientations × 51 samples)
• HSV color histograms of 42-pixel patches
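The slides do not spell out how the elementary distance d_ji,m is computed from these descriptors; a plausible illustrative choice, sketched below, is the L2 distance from descriptor m of image j to the best-matching descriptor of the same type in image i. Names and the matching rule are assumptions, not the paper's exact definition.

```python
import numpy as np

def elementary_distance(desc_jm, descs_i):
    """d_{ji,m}: distance from descriptor m of image j to image i, taken here
    (illustratively) as the L2 distance to the best-matching descriptor of the
    same type in image i.  `descs_i` is an (n_i, dim) array, e.g. dim = 204
    for geometric blur descriptors."""
    diffs = descs_i - desc_jm
    return float(np.sqrt((diffs ** 2).sum(axis=1)).min())
```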
Choosing triplets
• Caltech-101 at 15 training images per class yields 31.8 million possible triplets
• Many are easy to satisfy, so the set is pruned:
• For each image j, for each of its features, find the N images I with the closest matching features
• For each negative example i in I, form triplets (j, k, i)
• This eliminates roughly half of the triplets (see the sketch below)
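A hedged sketch of the pruning idea just described: keep as "hard" negatives only those different-class images that have a descriptor among the N closest matches to some descriptor of the focal image, and build triplets from them. The data layout (`features[j]` as an (n_j, dim) array) and function names are assumptions for illustration.

```python
import numpy as np

def hard_negatives(features, labels, j, n_closest=5):
    """For focal image j, return different-class images with a descriptor among
    the n_closest matches to at least one of j's descriptors.  Triplets built
    from these hard negatives are kept; the rest are assumed easy to satisfy."""
    negatives = set()
    for f in features[j]:  # each descriptor of image j
        # best-match distance from this descriptor to every other image
        best = [
            np.sqrt(((features[i] - f) ** 2).sum(axis=1)).min() if i != j else np.inf
            for i in range(len(features))
        ]
        for i in np.argsort(best)[:n_closest]:
            if labels[i] != labels[j]:
                negatives.add(int(i))
    return negatives
```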
Choosing C
• Train with multiple values of C, testing on a held-out part of the training set, and choose whichever value gives the best results
• For each candidate C, run an online version of the training algorithm:
• Make one sweep through the training triplets
• For each misclassified triplet (i, j, k), update the weights for the three images
• Choose the C that gets the most held-out triplets right (see the sketch below)
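A sketch of this model-selection loop, assuming two hypothetical helpers: `online_train`, which performs one sweep over the training triplets updating weights on each violated triplet, and `count_satisfied`, which scores a weight vector on held-out triplets. Both are placeholders standing in for the paper's actual training procedure.

```python
def select_C(candidate_Cs, train_triplets, heldout_triplets, online_train, count_satisfied):
    """Pick the C whose one-sweep online model satisfies the most held-out triplets."""
    best_C, best_score = None, -1
    for C in candidate_Cs:
        weights = online_train(train_triplets, C)        # one online sweep at this C
        score = count_satisfied(weights, heldout_triplets)
        if score > best_score:
            best_C, best_score = C, score
    return best_C
```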
Results
• At 15 training examples per class: 63.2% accuracy (~3% improvement)
• At 20 training examples per class: 66.6% accuracy (~5% improvement)
Results
• Confusion matrix (figure)
• Hardest categories: crocodile, cougar_body, cannon, bass
Questions
• Is there any disadvantage to a non-metric distance function?
• Could the images be embedded in a metric space?
• Why not learn everything?
• Include a feature for each image pixel
• Include multiple types of descriptors
• Could this be used to do unsupervised learning for sets of tagged images (e.g., for image segmentation)?
• Can you learn a single distance function per class?