Efficient Visual Search of Videos Cast as Text Retrieval Josef Sivic and Andrew Zisserman PAMI 2009 Presented by: John Paisley, Duke University
Outline • Introduction • Text retrieval review • Object retrieval in video • Experiments • Conclusion
Introduction • Goal: Retrieve objects in a video database that are similar to a queried object. • This work casts the problem as a text retrieval problem. • In text retrieval, each document is an object and each word is given an index; each document is then represented by a vector of word counts. • Can we treat video the same way? Each frame is treated as a document. Multiple feature vectors are extracted from a single frame and quantized, with each quantized value treated as a word. • Text retrieval algorithms can then be used.
Text Retrieval • As mentioned, each document is represented by a vector. The standard weighting for this vector is “term frequency-inverse document frequency” (tf-idf). • Retrieval then scores each document by the similarity of its tf-idf vector to the query vector and returns documents sorted in descending order of score. • If the vectors are normalized to unit length, ranking by this similarity is equivalent to ranking by Euclidean distance.
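As an illustration (not from the paper), here is a minimal NumPy sketch of tf-idf weighting and descending-similarity ranking; the function names and the count-matrix layout are assumptions made for this example.

```python
import numpy as np

def tfidf_vectors(counts):
    """Turn a document-by-word count matrix into tf-idf vectors.

    counts: (n_docs, n_words) array of raw word counts.
    """
    n_docs = counts.shape[0]
    # term frequency: word count normalized by document length
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    # inverse document frequency: log of (total docs / docs containing the word)
    df = np.maximum((counts > 0).sum(axis=0), 1)
    idf = np.log(n_docs / df)
    return tf * idf

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)
```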
Object Retrieval in Video: Viewpoint Invariant Description • Goal: Extract a description of an object that is unaffected by viewpoint, scale, illumination, etc. • To do this, for each frame, use region detection algorithms (two are used here) to define regions of interest. Roughly 1,200 regions are computed per frame, and each region is represented as a 128-dimensional vector using the SIFT descriptor. • To discard unstable regions, regions are tracked over a few frames and only those that persist, and are therefore potentially interesting, are kept. This reduces the number of feature vectors to about 600 per frame.
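A rough sketch of the per-frame description step, assuming OpenCV is available. Note the hedge: the paper uses two affine-covariant region detectors, whereas OpenCV's default DoG-based SIFT detector is used below only as a stand-in that produces the same kind of 128-dimensional descriptors.

```python
import cv2

def frame_descriptors(frame_bgr):
    """Detect interest regions in one frame and describe each with a 128-D SIFT vector.

    Stand-in for the paper's two affine-covariant region detectors:
    OpenCV's DoG-based SIFT detector is used here for illustration only.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: (n_regions, 128) float32
```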
Object Retrieval in Video: Building a Visual Vocabulary • Each frame is now represented by roughly a 128 x 600 matrix of descriptors. • To go from images to words, build a global dictionary using vector quantization (K-means) and quantize the feature vectors. In this paper, K-means is run with the Mahalanobis distance. • Clusters are found separately for each region detector; in all, the authors use 16,000 clusters (visual words). • Each frame is then represented as a 16,000-dimensional vector of counts of how many descriptors fall in each cluster. Words that occur too frequently across documents are thrown out as stop words.
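A minimal sketch of vocabulary building and quantization, assuming scikit-learn and NumPy; plain Euclidean K-means is used here instead of the paper's Mahalanobis-distance variant, and the function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=16000):
    """Cluster pooled SIFT descriptors into a visual vocabulary.

    all_descriptors: (n_descriptors, 128) array pooled across training frames.
    For large data, MiniBatchKMeans or a smaller vocabulary is more practical.
    """
    kmeans = KMeans(n_clusters=n_words, n_init=1, random_state=0)
    kmeans.fit(all_descriptors)
    return kmeans

def frame_to_bow(kmeans, frame_descriptors, n_words=16000):
    """Quantize one frame's descriptors and return its bag-of-visual-words histogram."""
    words = kmeans.predict(frame_descriptors)
    return np.bincount(words, minlength=n_words)
```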
Object Retrieval in Video: Spatial Consistency • Given a queried object, there is information in the spatial layout of its regions that can help the ranking. • This is done by first returning results using the text retrieval algorithm discussed before, and then re-ranking: a matched region casts votes for a frame when other matches fall among the K nearest spatial neighbors of both the query region and its counterpart in the retrieved frame.
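A simplified sketch of this re-ranking score, assuming NumPy; the exact neighborhood definition differs from the paper's, and the data layout (region positions plus a list of visual-word matches) is an assumption for this example.

```python
import numpy as np

def spatial_consistency_score(query_xy, frame_xy, matches, k=15):
    """Score a retrieved frame by the spatial consistency of its region matches.

    query_xy, frame_xy: (n, 2) region positions in the query and the retrieved frame.
    matches: list of (i_query, j_frame) index pairs matched via their visual word.
    Each match earns one vote per other match that lies within the k nearest
    spatial neighbors on both sides (a simplified version of the paper's scheme).
    """
    score = 0
    for (qi, fj) in matches:
        # k nearest neighbors of the matched regions, excluding the region itself
        q_nn = np.argsort(np.linalg.norm(query_xy - query_xy[qi], axis=1))[1:k + 1]
        f_nn = np.argsort(np.linalg.norm(frame_xy - frame_xy[fj], axis=1))[1:k + 1]
        for (qi2, fj2) in matches:
            if (qi2, fj2) != (qi, fj) and qi2 in q_nn and fj2 in f_nn:
                score += 1
    return score
```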
Object Retrieval Process • A feature-length film usually has 100K-150K frames; keeping one frame per second reduces this to 4K-6K keyframes. • Features are extracted and quantized as discussed. • The user selects a query region; its “words” are extracted along with their spatial relationships. • A desired number of frames is returned using the text retrieval algorithm and re-ranked using the spatial consistency method.
Experiments • Results using the movies “Groundhog Day,” “Run Lola Run” and “Casablanca” • Six objects of interest were selected and searched for. • An additional benefit of the proposed method is speed.
Experiments • Fig. 16 shows the effect of vocabulary size. • Table 2 shows the effect of building the dictionary from the movie being searched, from a different movie, and from both movies together. • Table 3 shows the combination of the two.
Conclusion • Vector quantization does not appear to degrade performance, while retrieval becomes significantly faster. • Re-ranking with spatial consistency was shown to significantly improve results. • This could be extended to temporal information as well.