Indexing and Retrieval
James Hill, Ozcan Ilikhan, Mark Lenz
{jshill4, ilikhan, mlenz}@cs.wisc.edu
Presentation Outline
1- Introduction
2- Common methods used in the papers
  * SIFT descriptor
  * k-means clustering
  * TF-IDF weight
3- Video Google
4- Scalable Recognition with a Vocabulary Tree
5- City-Scale Location Recognition
Introduction
Goal: find identical objects in multiple images.
Difficulties arise from changes in:
• Scale
• Orientation
• Viewpoint
• Lighting
Search time and storage space also grow with the size of the image database.
Indexing and Retrieval
Common solutions:
• Invariant features (e.g., SIFT)
• kd-trees
• Best Bin First
SIFT - Scale-Invariant Feature Transform
Key steps:
• Compute differences of Gaussians in scale space
• Take the maxima and minima as candidate feature points
• Remove low-contrast points and non-robust edge points
• Assign each point an orientation
• Create a descriptor from the windowed region around each point
SIFT - Scale-Invariant Feature Transform
Key benefits:
• Feature points invariant to scale and translation
• Orientations provide invariance to rotation
• Distinctive descriptors are partially invariant to changes in illumination and viewpoint
• Robust to background clutter and occlusion
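The detection stage above can be illustrated with a small sketch: blur at several scales, subtract adjacent blurs, and keep high-contrast local extrema. This uses scipy with made-up scales and a made-up contrast threshold; it is not Lowe's full pipeline (no scale-space pyramid, edge rejection, orientations, or descriptors).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(image, sigmas=(1.0, 1.6, 2.56, 4.1), contrast_thresh=0.03):
    """Sketch of SIFT's first stage: difference-of-Gaussians extrema.

    Blurs the image at increasing scales, subtracts adjacent blurred
    versions, and keeps pixels that are local maxima or minima of the
    DoG response with sufficient contrast.
    """
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    dogs = [b1 - b0 for b0, b1 in zip(blurred, blurred[1:])]
    keypoints = []
    for level, dog in enumerate(dogs):
        is_max = dog == maximum_filter(dog, size=3)
        is_min = dog == minimum_filter(dog, size=3)
        strong = np.abs(dog) > contrast_thresh
        ys, xs = np.nonzero((is_max | is_min) & strong)
        keypoints.extend((int(y), int(x), level) for y, x in zip(ys, xs))
    return keypoints

# A bright blob on a dark background yields keypoints near its center.
img = np.zeros((32, 32))
img[14:18, 14:18] = 1.0
kps = dog_keypoints(img)
```

In a real detector the extrema would also be interpolated to sub-pixel, sub-scale accuracy before descriptors are computed.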
k-means clustering
Motivation: we want a method for finding the centers of the different clusters in a set of data.
k-means clustering
How do we find these means? We minimize the within-cluster sum of squared distances:
min over clusters S of Σ_{j=1}^{k} Σ_{x_i ∈ S_j} ‖x_i − μ_j‖²
where μ_j is the mean of the points assigned to cluster S_j.
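A minimal numpy sketch of Lloyd's algorithm, which locally minimizes this within-cluster sum of squares by alternating an assignment step and an update step (the toy data and the evenly-spaced initialization are illustrative, not from the papers):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm: alternate between assigning each point
    to its nearest mean and recomputing each mean from its cluster."""
    # Illustrative deterministic init: k points spread across the data.
    means = points[np.linspace(0, len(points) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        # Assignment step: index of the nearest mean for every point.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean moves to the centroid of its cluster.
        for j in range(k):
            if np.any(labels == j):
                means[j] = points[labels == j].mean(axis=0)
    return means, labels

# Two well-separated blobs: the means should land near (0, 0) and (10, 10).
pts = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
                 np.random.default_rng(2).normal(10, 0.5, (50, 2))])
means, labels = kmeans(pts, k=2)
```

Lloyd's algorithm only finds a local minimum, so real systems run it with several initializations or with k-means++ seeding.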
k-means clustering
How do we extend this? With hierarchical k-means clustering!
k-means clustering
Now that we can cluster our data, how can we use this information to quickly find the closest vector in our data given some test vector?
k-means clustering
We will build a vocabulary tree using this clustering method. Each vector in our data (including the means) will be considered a “word” in our vocabulary. We will build a tree using the means of our data.
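The tree-building idea can be sketched as follows, using scipy's `kmeans2` for the per-node clustering; the branch factor, depth, and toy descriptors are illustrative choices, not the parameters used in the papers:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def build_tree(descriptors, branch=3, depth=2):
    """Hierarchically quantize descriptors: cluster them into `branch`
    groups, then recurse on each group until `depth` is exhausted.
    Every node's mean acts as a visual word."""
    node = {"mean": descriptors.mean(axis=0), "children": []}
    if depth == 0 or len(descriptors) < branch:
        return node
    means, labels = kmeans2(descriptors, branch, minit="points")
    for j in range(branch):
        group = descriptors[labels == j]
        if len(group):
            node["children"].append(build_tree(group, branch, depth - 1))
    return node

def quantize(tree, vec):
    """Descend the tree, at each level following the child with the
    nearest mean; the path taken identifies the visual word."""
    path, node = [], tree
    while node["children"]:
        dists = [np.linalg.norm(vec - c["mean"]) for c in node["children"]]
        best = int(np.argmin(dists))
        path.append(best)
        node = node["children"][best]
    return tuple(path)

# Toy data: 120 random 8-D "descriptors" quantized through a small tree.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(120, 8))
tree = build_tree(descriptors)
word = quantize(tree, descriptors[0])
```

The payoff is lookup cost: each query descriptor is compared against only `branch × depth` means instead of every word in a flat vocabulary.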
TF-IDF
Term frequency-inverse document frequency (tf-idf) is a statistical measure of how important a word is to a document in a collection or corpus. It is a standard weighting scheme in information retrieval and text mining.
TF-IDF
• n_id: the number of occurrences of word i in document d
• n_d: the total number of words in document d
• N_i: the number of documents containing word i
• N: the total number of documents in the whole database
TF-IDF
Each word i in document d is weighted by word frequency × inverse document frequency:
t_i = (n_id / n_d) · log(N / N_i)
Each document is represented by a vector of these weights, and the vectors are organized as an inverted file.
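A small sketch of this weighting over toy word lists (the documents here are made up for illustration):

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Weight each word i in document d by (n_id / n_d) * log(N / N_i):
    n_id counts word i in document d, n_d is the document length,
    N_i is the number of documents containing word i, N the corpus size."""
    N = len(documents)
    df = Counter()                      # N_i: document frequency of each word
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)           # n_id for every word in this doc
        n_d = len(doc)
        vectors.append({w: (c / n_d) * math.log(N / df[w])
                        for w, c in counts.items()})
    return vectors

docs = [["car", "road", "car"], ["road", "map"], ["map", "car"]]
vecs = tfidf_vectors(docs)
```

Note that a word occurring in every document gets weight log(N/N) = 0, which is exactly the behavior that makes common words uninformative for retrieval.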
TF-IDF
[Figure: tf-idf illustration. Image credit: http://www.lovdata.no/litt/hand/hand-1991-2.html]
Video Google: A Text Retrieval Approach to Object Matching in Videos
Josef Sivic and Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford, United Kingdom
Proceedings of the International Conference on Computer Vision (2003)
Video Google
Efficient Visual Search of Videos Cast as Text Retrieval
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, Number 4, pages 591-606, 2009
Fundamental idea of the paper: retrieve the key frames and shots of a video containing a particular object with the ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.
Video Google
Recall text retrieval (preprocessing):
• Parse documents into words
• Stemming: “walk” = {“walk,” “walking,” “walks,” …}
• Use a stop list to reject very common words, such as “the” and “an”
• Represent each document by a vector whose components are the frequencies of the words the document contains
• Store the vectors in an inverted file
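These steps can be sketched as follows; the crude suffix-stripping stemmer and the tiny stop list are illustrative stand-ins for a real stemmer (e.g., Porter) and a real stop list:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "an", "a", "and", "of", "in"}

def crude_stem(word):
    """Toy stemmer: strips common suffixes so "walks", "walking", and
    "walk" all map to the same term."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_term_vector(text):
    """Parse into words, stem, drop stop words, count term frequencies."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(terms)

vec = to_term_vector("The dog walks while an owl is walking")
```

The resulting counts are exactly the per-document word frequencies that the inverted file and tf-idf weighting operate on.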
Video Google Can we treat video the same way? What and where are the words of a video?
Video Google
The Video Google algorithm, a) Preprocessing (off-line):
• Detect affine covariant regions in each key frame of the video
• Reject unstable regions
• Build the visual vocabulary
• Remove stop-listed words
• Compute weighted document frequencies
• Build the index (inverted file)
Video Google
Building a Visual Vocabulary
Step 1. Calculate viewpoint invariant regions:
• Shape Adapted (SA) regions: centered on corner-like features
• Maximally Stable (MS) regions: correspond to blobs of high contrast with respect to their surroundings, such as a dark window on a gray wall
A 720 x 576 pixel video frame yields ≈ 1200 regions. Each region is represented by a 128-dimensional vector using the SIFT descriptor.
Video Google
Step 2. Reject unstable regions: any region that does not survive for more than 3 frames is rejected. This “stability check” significantly reduces the number of regions, to about 600 per frame.
Video Google
Step 3. Build the visual vocabulary: use k-means clustering to vector quantize the descriptors into clusters, comparing descriptors with the Mahalanobis distance:
d(x_1, x_2) = sqrt( (x_1 − x_2)ᵀ Σ⁻¹ (x_1 − x_2) )
where Σ is the covariance matrix of the descriptors.
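A small numpy sketch of the distance; the covariance matrices below are illustrative, whereas in practice Σ would be estimated from the descriptor data:

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    """Mahalanobis distance: Euclidean distance after whitening by the
    covariance, so high-variance descriptor dimensions count for less."""
    diff = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# With an identity covariance this reduces to plain Euclidean distance.
d = mahalanobis([0, 0], [3, 4], np.eye(2))   # 5.0
```

With a non-identity covariance, e.g. `np.diag([100.0, 1.0])`, a gap of 10 along the noisy first dimension scores the same as a gap of 1 along the second, which is the point of using Mahalanobis rather than Euclidean distance for clustering descriptors.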
Video Google
Step 4. Remove stop-listed visual words: the most frequent visual words, which occur in almost all images (such as highlights, which appear in many frames), are rejected.
Video Google
Step 5. Compute the tf-idf weighted document frequency vector (variations of tf-idf may be used).
Step 6. Build the inverted-file indexing structure.
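An inverted file can be sketched as a map from each visual word to the list of documents (frames) containing it, so a query only touches documents that share at least one word with it. The dot-product scoring and the frame/word names below are illustrative, not the paper's exact normalization:

```python
from collections import defaultdict

class InvertedFile:
    """Maps each visual word to a postings list of (doc_id, weight)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, weighted_words):
        """Index one document given its {word: tf-idf weight} vector."""
        for word, weight in weighted_words.items():
            self.postings[word].append((doc_id, weight))

    def query(self, weighted_words, top=5):
        """Score only documents sharing words with the query."""
        scores = defaultdict(float)
        for word, q_weight in weighted_words.items():
            for doc_id, d_weight in self.postings.get(word, []):
                scores[doc_id] += q_weight * d_weight   # dot-product score
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

index = InvertedFile()
index.add("frame1", {"w3": 0.5, "w7": 0.5})
index.add("frame2", {"w7": 0.2, "w9": 0.8})
results = index.query({"w7": 1.0})
```

Because most documents contain none of a query's words, this touches far fewer documents than comparing the query vector against every frame in the database.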
Video Google
The Video Google algorithm, b) Run-time (on-line):
• Determine the set of visual words within the query region
• Retrieve key frames based on visual word frequencies
• Re-rank the top key frames using spatial consistency
Video Google
Spatial consistency: matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the regions in the outlined query image.
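One simple way to score spatial consistency, in the spirit of the paper's neighborhood-based check; this k-nearest-neighbor formulation and the toy points are an illustrative assumption, not the paper's exact measure:

```python
import numpy as np

def spatial_consistency_score(query_pts, frame_pts, k=3):
    """For each pair of matched regions (query_pts[i] <-> frame_pts[i]),
    count how many of the region's k nearest neighbours in the query
    are also among its k nearest neighbours in the retrieved frame.
    Matches in a similar spatial arrangement therefore score higher."""
    q = np.asarray(query_pts, float)
    f = np.asarray(frame_pts, float)
    score = 0
    for i in range(len(q)):
        # argsort puts the point itself (distance 0) first; skip it.
        q_nbrs = set(np.argsort(np.linalg.norm(q - q[i], axis=1))[1 : k + 1])
        f_nbrs = set(np.argsort(np.linalg.norm(f - f[i], axis=1))[1 : k + 1])
        score += len(q_nbrs & f_nbrs)
    return score

pts = [(0, 0), (1, 0), (2, 0), (0, 5), (1, 5), (2, 5)]
translated = [(x + 10, y + 10) for x, y in pts]       # same arrangement
scrambled = [(0, 0), (0, 5), (1, 0), (1, 5), (2, 0), (2, 5)]
s_same = spatial_consistency_score(pts, translated)
s_scrambled = spatial_consistency_score(pts, scrambled)
```

A translated copy preserves every neighborhood and gets the maximum score, while rearranged matches lose it, which is what lets the re-ranking step demote frames whose word matches are spatially incoherent.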
Video Google How it works: Query region and its close-up.
Video Google How it works: Original matches based on visual words
Video Google How it works: Matches after using the stop-list
Video Google How it works: Final set of matches after filtering on spatial consistency
Video Google Real-time demo
Scalable Recognition With a Vocabulary Tree
James Hill, Ozcan Ilikhan, Mark Lenz
{jshill4, ilikhan, mlenz}@cs.wisc.edu
The Paper
Scalable Recognition with a Vocabulary Tree
David Nister and Henrik Stewenius
Center for Visualization and Virtual Environments, Department of Computer Science, University of Kentucky
Published in 2006; appeared in the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
What are we trying to do?
Provide an indexing scheme that:
• Scales to large image databases (on the order of 1 million images)
• Retrieves images in an acceptable amount of time
Inspiration
Sivic and Zisserman (what you just saw):
• Used k-means to partition the descriptors from several pictures
• Used tf-idf to score images and find a close match