
Learning to Match Images in Large-Scale Collections



Presentation Transcript


  1. Learning to Match Images in Large-Scale Collections • Song Cao and Noah Snavely • Cornell University • Workshop on Web-scale Vision and Social Media, ECCV 2012

  2. A key problem in Web-scale vision is to discover visual connectivity among a large set of images. Trafalgar Dataset: 6,981 images

  3.–5. A key problem in Web-scale vision is to discover visual connectivity among a large set of images [image-only slides illustrating the dataset]

  6. An example connectivity graph: http://landmark.cs.cornell.edu/Landmarks3/0001/3951.0/graphview.html

  7. Background • This task requires determining whether any two images overlap or not - image matching
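To make the matching step concrete, here is a minimal sketch of the SIFT + RANSAC pair verification summarized on the next slide, written against OpenCV. This is not the authors' code; the ratio-test threshold (0.8) and inlier cutoff (`min_inliers=16`) are illustrative assumptions.

```python
import cv2
import numpy as np

def images_overlap(path_a, path_b, min_inliers=16):
    """Decide whether two images view the same scene (illustrative thresholds)."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return False
    # Lowe's ratio test on 2-nearest-neighbor matches
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
    if len(good) < min_inliers:
        return False
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    # Geometric verification: count RANSAC inliers to a fundamental matrix
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.99)
    return mask is not None and int(mask.sum()) >= min_inliers
```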

  8. Background • Image matching: • SIFT feature extraction, finding nearest-neighbor features, and applying RANSAC methods for all pairs of images • High accuracy, but high computational cost • Brute-force (O(n²)) approach (20 pairs / sec): 250,000 images ~ 31 billion image pairs; 1 year on 50 machines. 1,000,000 images ~ 500 billion pairs; 15 years on 50 machines • However, only a small fraction of all possible image pairs actually match (e.g. < 0.1% for city-sized datasets)
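The runtime figures above follow from straightforward arithmetic; here is a quick sanity check, assuming 20 verified pairs per second per machine across 50 machines, as the slide states:

```python
def brute_force_cost(n_images, pairs_per_sec=20, machines=50):
    pairs = n_images * (n_images - 1) // 2        # all unordered pairs
    seconds = pairs / (pairs_per_sec * machines)  # at aggregate throughput
    return pairs, seconds / (365 * 24 * 3600)     # (pair count, years)

for n in (250_000, 1_000_000):
    pairs, years = brute_force_cost(n)
    print(f"{n:>9,} images: {pairs / 1e9:6.1f} billion pairs, ~{years:.1f} years")
# 250,000 images ->  31.2 billion pairs, ~1.0 years
# 1,000,000 images -> 500.0 billion pairs, ~15.9 years
```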

  9. Goal • How can we classify image pairs into matching and non-matching, both quickly and accurately? [Example images: matching vs. non-matching pairs]

  10. Bag-of-words Model • Widely used in image retrieval, serving as an approximate image similarity measure • Efficient and scalable in retrieval thanks to quantization and inverted files • Useful for choosing promising (similar) image candidates before matching, to increase efficiency [1][2] • Usually uses tf-idf weighting, as in text retrieval • Inverse Document Frequency (IDF) of word j: idf_j = log(N / N_j), where N is the total number of images and N_j is the number of images containing word j [1] Agarwal, S., Snavely, N., Simon, I., Seitz, S., Szeliski, R.: Building Rome in a day. In: ICCV. (2009) [2] Philbin, J., Sivic, J., Zisserman, A.: Geometric latent dirichlet allocation on a matching graph for large-scale image datasets. IJCV (2010)
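As a concrete reading of this slide, here is a minimal sketch of tf-idf-weighted BoW similarity, using dense numpy arrays for clarity; a production system would use sparse vectors and inverted files, as noted above. `bow` is an assumed (n_images x vocab_size) matrix of visual-word counts.

```python
import numpy as np

def tfidf_similarity_matrix(bow):
    n_images = bow.shape[0]
    doc_freq = np.count_nonzero(bow, axis=0)          # N_j: images containing word j
    idf = np.log(n_images / np.maximum(doc_freq, 1))  # idf_j = log(N / N_j)
    weighted = bow * idf                              # tf-idf weighting
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    weighted = weighted / np.maximum(norms, 1e-12)    # L2-normalize each image
    return weighted @ weighted.T                      # cosine similarity matrix
```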

  11. Bag-of-words Model • However, BoW similarity measure can be noisy • [Example] • TateModern dataset • ~120K randomly chosen testing image pairs • Average Precision: 0.458
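The Average Precision quoted here is the standard ranking metric: sort test pairs by similarity score and compare against verified match/non-match labels. A sketch with illustrative inputs, using scikit-learn:

```python
from sklearn.metrics import average_precision_score

scores = [0.9, 0.8, 0.35, 0.3, 0.1]  # BoW similarity for five test pairs
labels = [1, 0, 1, 0, 0]             # 1 = verified matching pair
print(average_precision_score(labels, scores))  # -> 0.8333...
```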

  12. Main Idea • Some visual words are more reliable than others for a given dataset • Better weights on visual words may increase prediction accuracy • Our approach: • Apply discriminative learning techniques to improve prediction accuracy and hence matching efficiency • Training data comes from matching itself (iterative learning and matching)

  13. Weighting in BoW Model • Unsupervised approaches • tf-idf weighting • Burstiness [1] • Co-occurring set (“co-ocset”) [2] • Supervised approaches • Learning a fine vocabulary [3] • Selecting important features by matching [4] [1] Jegou, H., Douze, M., Schmid, C.: On the burstiness of visual elements. In: CVPR. (2009) [2] Chum, O., Matas, J.: Unsupervised discovery of co-occurrence in sparse high dimensional data. In: CVPR. (2010) [3] Mikulik, A., Perdoch, M., Chum, O., Matas, J.: Learning a fine vocabulary. In: ECCV. (2010) [4] Turcot, P., Lowe, D.: Better matching with fewer features: The selection of useful features in large database recognition problems. In: Workshop on Emergent Issues in Large Amounts of Visual Data, ICCV. (2009)

  14. Our Approach Learn an SVM classifier for weighting with positive (matching) and negative (non-matching) image pairs

  15. Our Approach Where does training data come from?

  16. Iterative Learning & Matching • Given a collection of images (represented as BoW histogram vectors): • 1. Find a number of image pairs with high similarities (tf-idf similarities in the initial round; learned similarities in later rounds) • 2. Apply image matching to them • 3. Perform learning using the matching results and obtain a new similarity measure • 4. Repeat from 1 until done
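A schematic of this loop is sketched below. The helpers `verify` (expensive geometric verification, e.g. the SIFT + RANSAC sketch earlier) and `train_weights` (the SVM described on the following slides) are hypothetical callables passed in by the caller; `vectors` is an assumed (n_images x vocab_size) float matrix of BoW histograms and `w` a per-visual-word weight vector.

```python
import numpy as np

def learn_and_match(vectors, w, verify, train_weights, rounds=4, k=5000):
    verified, pos, neg = set(), [], []
    for _ in range(rounds):
        # Similarity under the current weights: sim(a, b) = sum_j w_j a_j b_j
        sims = (vectors * w) @ vectors.T
        np.fill_diagonal(sims, -np.inf)
        # 1. pick the k most similar pairs not yet verified
        flat = np.argsort(-sims, axis=None)
        rows, cols = np.unravel_index(flat, sims.shape)
        candidates = [(int(a), int(b)) for a, b in zip(rows, cols)
                      if a < b and (int(a), int(b)) not in verified][:k]
        # 2. run expensive matching on just those pairs
        for a, b in candidates:
            verified.add((a, b))
            (pos if verify(a, b) else neg).append((a, b))
        # 3. learn a new similarity measure from the match results
        w = train_weights(vectors, pos, neg)
    return pos, w
```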

  17. Our Approach [Example image pairs labeled non-matching and matching, used as training data]

  18. Learning Formulation • For a pair of images (a, b), define their similarity as sim(a, b) = a^T W b (W is a diagonal matrix of per-visual-word weights) • Goal: learn a weighting W that best separates matching pairs from non-matching pairs • Label y = +1 for matching pairs (a, b); y = -1 for non-matching pairs (a', b') • Feature vector: for all (a, b), x_ab = a ∘ b (element-wise product), so sim(a, b) = w^T x_ab with w = diag(W) • S: set of training pairs (a, b) • We use L2-regularized L2-loss SVMs, which optimize min_w (1/2) w^T w + C Σ_{(a,b) ∈ S} max(0, 1 - y_ab w^T x_ab)²
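Under this formulation, the learning step reduces to a standard linear SVM on element-wise product features. A minimal sketch: scikit-learn's LinearSVC with squared hinge loss is an L2-regularized L2-loss linear SVM; the value of C here is illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_weights(vectors, pos_pairs, neg_pairs, C=1.0):
    # Feature for a pair (a, b): element-wise product of their BoW vectors
    X = np.array([vectors[a] * vectors[b] for a, b in pos_pairs + neg_pairs])
    y = np.array([1] * len(pos_pairs) + [-1] * len(neg_pairs))
    svm = LinearSVC(loss="squared_hinge", penalty="l2", C=C, fit_intercept=False)
    svm.fit(X, y)
    return svm.coef_.ravel()  # w = diag(W): one learned weight per visual word
```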

  19. Learning Formulation • We learn a linear classifier, but interpret its score as a similarity measure • Score histograms of matching vs. non-matching pairs are better separated (e.g. TateModern Dataset) • Our model is learned with ~100K example pairs; ~120K randomly chosen testing image pairs [Score histograms; AP: 0.458, 0.704, 0.966]

  20. Two Extensions • Insufficient training data (during the early stage) might cause over-fitting; we propose two extensions: • 1. Negative examples from other datasets help • 2. A modified regularization for SVMs that uses the tf-idf weights as a prior • Intuition: regularize s.t. the weights stay close to a set of “safe” prior weights (e.g. tf-idf) • Recall the SVM formulation above • Substitute w with w - w0, where w0 denotes a prior weight vector such as the tf-idf weights
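One way to read the substitution: the regularizer becomes (1/2)||w - w0||², pulling the solution toward the tf-idf prior. Below is a minimal sketch that optimizes this modified objective directly by gradient descent (the L2-loss objective is differentiable); the step size and iteration count are illustrative assumptions, not from the paper.

```python
import numpy as np

def train_with_prior(X, y, w0, C=1.0, lr=1e-3, iters=500):
    """Minimize 0.5 * ||w - w0||^2 + C * sum_i max(0, 1 - y_i * w.x_i)^2."""
    w = w0.copy()                                   # start at the prior weights
    for _ in range(iters):
        slack = np.maximum(0.0, 1.0 - y * (X @ w))  # active L2-loss hinge terms
        grad = (w - w0) - 2.0 * C * (X.T @ (y * slack))
        w -= lr * grad
    return w
```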

  21. Datasets • 5 Flickr image sets (several thousand images each) + Oxford5K and Paris • Trafalgar: 6,981 images • LondonEye: 7,047 images • TateModern: 4,813 images • SanMarco: 7,792 images • TimeSquare: 6,426 images

  22. Experiments • Experiment 1: test how well similarity learning works, measured by mAP scores of ranking other images in the set. • 50 test images from each dataset • Test images don't appear in training image pairs

  23.–27. Experiments • Experiment 1: test how well similarity learning works, measured by mAP scores of ranking other images in the set. • Example: SanMarco dataset [image-only slides showing retrieval examples]

  28. Experiments • Experiment 1: test how well similarity learning works, measured by mAP scores of ranking other images in the set. • ~50 test images for each dataset • Test images don't appear in training image pairs • Because Oxford5K and Paris each encompass several disparate landmarks, they require more training data; hence the modified regularization is essential

  29. Experiments • Experiment 2: test how much efficiency improves in matching images. • Efficiency measured by match success rate: the percentage of the image pairs we try to match that turn out to be true matches [Plot: match success rate (%) vs. iteration number]
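For clarity, the metric as code; the numbers below are purely illustrative:

```python
def match_success_rate(n_true_matches, n_pairs_attempted):
    """Percentage of attempted image pairs that geometrically verify."""
    return 100.0 * n_true_matches / n_pairs_attempted

print(match_success_rate(1200, 5000))  # -> 24.0 (%)
```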

  30. Experiments • Experiment 2: system evaluation. Test how much efficiency improves in matching images. • Efficiency is measured by match success rate: the percentage of the image pairs we try to match that turn out to be true matches

  31.–32. Experiments • Number of true matches found as a function of time [plot slides]

  33. Conclusions • Even with small amounts of training data, our approach can predict matching and non-matching image pairs significantly better than tf-idf and co-ocset methods • Overall matching efficiency improved by more than a factor of two • Positive examples are quite specific to different datasets; negative examples could be shared across datasets

  34. Limitations • Good classification for canonical images in a dataset, but worse results for rarer ones (due to uneven amounts of training data for different images)

  35. Thank you! Questions? http://www.cs.cornell.edu/projects/matchlearn/
