Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. CVPR 2008 James Philbin Ondřej Chum Michael Isard Josef Sivic Andrew Zisserman.
[7] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. ICCV, 2007.
Outline • Introduction • Methods in this paper • Experiment & Result • Conclusion
Introduction • Goal • Retrieve a specific object from a large image database • Achieved by systems inspired by text retrieval (bag of visual words)
Flow • Get features • SIFT • Cluster • Approximate k-means • Feature quantization • Visual words • Soft-assignment (query) • Re-ranking • RANSAC spatial verification • Query expansion • Average query expansion
Outline • Introduction • Methods in this paper • Experiment & Result • Conclusion
Feature • SIFT
Quantization (visual word) • Point List = [(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)] • Sorted List = [(2,3), (4,7), (5,4), (7,2), (8,1),(9,6)]
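The toy 2-D point list above stands in for descriptors being quantized. A minimal numpy sketch of the hard-assignment step: each descriptor is mapped to its single nearest cluster center (its visual word). The brute-force distance computation here is illustrative; the paper uses approximate k-means with randomized kd-trees to scale to a 1M-word vocabulary.

```python
import numpy as np

def hard_assign(descriptors, centers):
    """Map each descriptor to the index of its nearest cluster center
    (the 'visual word'). Brute-force search for illustration only."""
    # (n, k) matrix of squared Euclidean distances to every center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# The slide's toy point list, with two hypothetical cluster centers
points = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], float)
centers = np.array([(3, 4), (8, 2)], float)
words = hard_assign(points, centers)
```

Under hard assignment, two features match only if they land in the same word, which is exactly the brittleness the soft-assignment scheme below addresses.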
Soft-assignment of visual words • Matching two image features with bag-of-visual-words under hard-assignment • Yes if assigned to the same visual word • No otherwise • Soft-assignment • Each descriptor becomes a weighted combination of visual words
Soft-assignment of visual words A–E represent cluster centers (visual words); points 1–4 are features
Soft-assignment of visual words • Weights follow exp(−d² / 2σ²), where d is the distance from the cluster center to the descriptor • In practice σ is chosen so that a substantial weight is assigned to only a few cells • The essential parameters • σ, the spatial scale • r, the number of nearest neighbors considered
Soft-assignment of visual words • Assigning weights to the r nearest neighbors, the descriptor is represented by an r-vector, which is then L1-normalized
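The weighting scheme above can be sketched in numpy as follows: one descriptor gets exp(−d²/2σ²) weights over its r nearest visual words, L1-normalized. The σ and r values here are illustrative placeholders, not the paper's tuned settings.

```python
import numpy as np

def soft_assign(desc, centers, r=3, sigma=1.0):
    """Soft-assign one descriptor to its r nearest visual words with
    weights exp(-d^2 / (2 sigma^2)), then L1-normalize.
    r and sigma are illustrative values, not the paper's choices."""
    d2 = ((centers - desc) ** 2).sum(axis=1)   # squared distance to every center
    nearest = np.argsort(d2)[:r]               # indices of the r nearest words
    w = np.exp(-d2[nearest] / (2 * sigma ** 2))
    w /= w.sum()                               # L1 normalization
    return nearest, w
```

With this representation, two features that fall near a cluster boundary can still share probability mass on the same word instead of matching all-or-nothing.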
TF–IDF weighting • Standard index architecture
TF–IDF weighting • tf • a document contains 100 words and ‘a’ occurs 3 times • tf = 0.03 (3/100) • idf • 1,000 of 10,000,000 documents contain ‘a’ • idf ≈ 9.21 ( ln(10,000,000 / 1,000) ) • tf–idf ≈ 0.28 (0.03 × 9.21)
TF–IDF weighting • In this paper • For the term frequency (tf) • the normalized soft-assignment weight of each visual word is used • For the inverse document frequency (idf) • counting an occurrence of a visual word as one, no matter how small its weight, gave the best results
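The scoring described above can be written down directly: tf is the (soft-assignment) weight of the word in the image, while idf counts any occurrence as one regardless of weight. A small sketch using the slide's worked numbers:

```python
import math

def tfidf_score(word_weight, docs_containing, n_docs):
    """tf-idf as described above: tf is the normalized weight of the
    visual word in the image; idf counts a document as containing the
    word whenever its weight is nonzero, however small."""
    tf = word_weight                       # already-normalized weight
    idf = math.log(n_docs / docs_containing)
    return tf * idf

# The worked example from the previous slide: tf = 3/100, idf = ln(1e7/1e3)
score = tfidf_score(3 / 100, 1_000, 10_000_000)
```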
Re-ranking • RANSAC • Affine transform Θ : y = Ax + b • Algorithm • 1. Randomly choose n correspondences • 2. Fit Θ to the n points • 3. Apply Θ to the remaining N − n points • 4. Count the inliers • Repeat 1–4 K times • Keep the best Θ
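The steps above can be sketched as a minimal numpy RANSAC for a full 2-D affine map (three correspondences determine it). This is an illustrative sketch, not the paper's implementation, which generates restricted affine hypotheses more efficiently from single region correspondences.

```python
import numpy as np

def ransac_affine(X, Y, n_iters=200, thresh=1.0, seed=None):
    """RANSAC for an affine map y = A x + b between matched 2-D points:
    sample 3 correspondences, fit, count inliers, keep the best model."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = 0, None
    Xh = np.hstack([X, np.ones((len(X), 1))])          # homogeneous coords
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=3, replace=False)  # 1. sample 3 points
        # 2. solve [x 1] M = y in least squares; M stacks A^T over b
        M, *_ = np.linalg.lstsq(Xh[idx], Y[idx], rcond=None)
        # 3-4. apply the model to all points and count inliers
        residuals = np.linalg.norm(Xh @ M - Y, axis=1)
        inliers = int((residuals < thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_model = inliers, M
    return best_model, best_inliers
```

Synthetic usage: generate matches related by a pure translation, corrupt a few, and the inlier count recovers the clean correspondences.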
Re-ranking • In this paper • Not only is the number of inlier correspondences counted; a scoring function (cosine similarity between the tf–idf vectors) is also used
Average query expansion • Obtain the top (m &lt; 50) spatially verified results of the original query • Construct a new query from their average: d_avg = (1 / (m + 1)) ( d0 + Σ_{i=1..m} d_i ) • where d0 is the normalized tf vector of the query region • d_i is the normalized tf vector of the i-th result • Requery once
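The averaging step is a one-liner over tf vectors; a small numpy sketch of the formula above:

```python
import numpy as np

def average_query_expansion(d0, verified):
    """Expanded query: the average of the original tf vector d0 and the
    tf vectors of the top m spatially verified results (m < 50)."""
    m = len(verified)
    return (d0 + np.sum(verified, axis=0)) / (m + 1)
```

The expanded vector is then issued as a single new query, so the extra cost is one additional index lookup.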
Outline • Introduction • Methods in this paper • Experiment & Result • Conclusion
Dataset • All images crawled from Flickr at high resolution (1024×768) • Oxford buildings • 5,062 images • 11 landmarks used as queries • Paris • 6,300 images • used for building an independent vocabulary (quantization) • Flickr1 • 99,782 images from the 145 most popular tags
Dataset • Query • 55 queries: 5 queries for each of 11 landmarks
Baseline • Follows the architecture of previous work [15] • A visual vocabulary of 1M words is generated using approximate k-means [15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007.
Precision–Recall Evaluation • Compute the Average Precision (AP) score for each of the 5 queries for a landmark • AP = area under the precision–recall curve • Precision = (retrieved positive images) / (total images retrieved) • Recall = (retrieved positive images) / (total positives in the corpus) • Average over all queries to obtain the Mean Average Precision (MAP)
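The AP/MAP computation above can be sketched in a few lines: given a ranked list of 0/1 relevance labels, AP is the mean of the precision values at the ranks where positives occur (one common form of the area under the precision–recall curve; the paper may use a slightly different interpolation).

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP from a ranked 0/1 relevance list: mean of precision@k over
    the ranks k where a relevant image is retrieved."""
    rel = np.asarray(ranked_relevance, float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(queries):
    """MAP: mean of the per-query AP scores."""
    return float(np.mean([average_precision(q) for q in queries]))
```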
Evaluation • Dataset • Only the Oxford (D1) 5,062 images • Oxford (D1) + Flickr1 (D2) 104,844 images • Vector quantizers • Oxford or Paris
Result • Parameter variation • Comparison with other methods [15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007. [14] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. CVPR, 2006. [18] T. Tuytelaars and C. Schmid. Vector quantizing feature space with a regular lattice. In Proc. ICCV, 2007.
Result • Spatial verification • Effect of vocabulary size
Result • Query expansion • Scaling up to 100K images
Result • ashmolean_3 goes from 0.626 AP to 0.874 AP • christ_church_5 increases from 0.333 AP to 0.813 AP
Outline • Introduction • Methods in this paper • Experiment & Result • Conclusion
Conclusion • A new method of visual word assignment was introduced: • descriptor-space soft-assignment • It recovers descriptor information that is lost in the quantization step of previously published methods.