Video Fingerprinting: Features for Duplicate and Similar Video Detection and Query-based Video Retrieval
Anindya Sarkar, Pratim Ghosh, Emily Moxley and B. S. Manjunath
Presented by: Anindya Sarkar
Vision Research Lab, Department of Electrical & Computer Engineering, University of California, Santa Barbara
January 30, 2008
Problem Definition:
• Duplicate and similar video detection
• We represent each video compactly as a fingerprint, for efficient storage and faster search without compromising retrieval accuracy
• Query-based video retrieval
• Input: a short query video (1-2% of the full video's length)
• Output: the actual "big" video from which the query was taken
Generation of Duplicate Videos
• Dataset: BBC rushes dataset, provided for the TRECVID-2007 video summarization task
• Operations performed (a generation sketch follows below):
• Image processing (per frame) based:
• Blurring using 3x3 and 5x5 windows
• Gamma correction by +20% and -20%
• Gaussian noise addition at SNRs of -20, 0, 10, 20, 30 and 40 dB
• JPEG compression at quality factors (QF) of 10, 30, 50, 70 and 90
• Frame drop based errors:
• Frame drops of 20%, 40% and 60% of the original video, for both random and bursty cases
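A minimal per-frame sketch of these operations, assuming OpenCV and NumPy; the kernel type for the blur, the gamma parameterization, and all function names are our assumptions, not the authors' code.

```python
import cv2
import numpy as np

def blur(frame, k=3):
    # k x k box blur (k = 3 or 5); the slide does not specify the kernel type
    return cv2.blur(frame, (k, k))

def gamma_correct(frame, pct=20):
    # +/-20% gamma correction, read here as gamma = 1 +/- 0.20 (our assumption)
    gamma = 1.0 + pct / 100.0
    return np.uint8(255.0 * (frame / 255.0) ** gamma)

def add_gaussian_noise(frame, snr_db=20):
    # Additive Gaussian noise at a target SNR (in dB)
    signal_power = np.mean(frame.astype(np.float64) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), frame.shape)
    return np.uint8(np.clip(frame + noise, 0, 255))

def jpeg_compress(frame, qf=50):
    # JPEG compression at quality factor qf, decoded back to pixels
    ok, buf = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, qf])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```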
Interpretation of Similar Videos
• Different takes of the same scene are considered "similar" videos
• These videos are similar in content
• However, human variability at both the cameraman and actor level (camera angles, cuts, and actor performance) means such videos look similar yet remain different
• The BBC rushes dataset contains unedited footage of the different retakes – hence, it is ideally suited for generating similar videos
Keyframe-based Video Fingerprint
[Pipeline: N frames in the actual video → video summarization and key-frame extraction → K key-frames → d-dimensional signature computed per key-frame → K x d video fingerprint]
• Features used for fingerprint creation:
• 1. Compact Fourier-Mellin Transform (CFMT)
• 2. Scale Invariant Feature Transform (SIFT)
Log-Polar Transformation
Any 2-D matrix is resampled on a log-polar grid: a grid point (m, n), with m = 0, ..., M-1 and n = 0, ..., N-1, maps to the Cartesian coordinates
$x = e^{m\Delta r}\cos(n\Delta\theta), \qquad y = e^{m\Delta r}\sin(n\Delta\theta)$
First fix the values of M and N; then $\Delta r = \log(R)/M$ and $\Delta\theta = 2\pi/N$, where
• M is the number of concentric circles
• N is the number of diverging radial lines
• R is the maximum radius of the in-circle
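A minimal NumPy sketch of this resampling about the image center; the function name and the nearest-neighbor interpolation are our choices.

```python
import numpy as np

def log_polar_transform(img, M=32, N=32):
    """Resample a 2-D grayscale array onto an M x N log-polar grid.
    M: number of concentric circles, N: number of radial lines."""
    rows, cols = img.shape
    cy, cx = rows / 2.0, cols / 2.0
    R = min(cy, cx)                      # maximum radius of the in-circle
    dr = np.log(R) / M                   # delta_r = log(R) / M
    dt = 2.0 * np.pi / N                 # delta_theta = 2*pi / N
    m = np.arange(M)[:, None]            # radial index, shape (M, 1)
    n = np.arange(N)[None, :]            # angular index, shape (1, N)
    x = np.exp(m * dr) * np.cos(n * dt)  # x = e^{m dr} cos(n dt)
    y = np.exp(m * dr) * np.sin(n * dt)  # y = e^{m dr} sin(n dt)
    # Shift to image coordinates and sample with nearest neighbor
    xi = np.clip(np.round(cx + x).astype(int), 0, cols - 1)
    yi = np.clip(np.round(cy + y).astype(int), 0, rows - 1)
    return img[yi, xi]
```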
CFMT Feature Extraction
[Pipeline: log-polar transform of the image (indices m, n = 0, ..., M-1 and N-1) → |FFT| → retain the central block of coefficients, indexed -(K-1), ..., K-1 and -(V-1), ..., V-1, holding 50% of the A.C. energy → normalization & vectorization → PCA → quantization]
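A simplified sketch of this pipeline under stated assumptions: the retained (2K-1) x (2V-1) frequency block is fixed offline (so that it holds roughly 50% of the A.C. energy on training data), and pca_basis is a hypothetical projection matrix learned offline; none of these choices are confirmed by the slide.

```python
import numpy as np

def cfmt_feature(img, K=8, V=8, pca_basis=None, levels=256):
    """Simplified CFMT-style descriptor for a grayscale image:
    log-polar -> |FFT| -> central (2K-1) x (2V-1) A.C. block ->
    normalize & vectorize -> optional PCA -> uniform quantization.
    pca_basis: hypothetical (d, (2K-1)*(2V-1)) matrix learned offline."""
    lp = log_polar_transform(img)                 # from the earlier sketch
    F = np.abs(np.fft.fftshift(np.fft.fft2(lp)))  # magnitude spectrum
    cy, cx = F.shape[0] // 2, F.shape[1] // 2
    F[cy, cx] = 0.0                               # drop the D.C. term
    block = F[cy - (K - 1):cy + K, cx - (V - 1):cx + V]
    feat = block.ravel()                          # vectorization
    feat = feat / (np.linalg.norm(feat) + 1e-12)  # normalization
    if pca_basis is not None:
        feat = pca_basis @ feat                   # PCA dimension reduction
    return np.round(feat * levels) / levels       # uniform quantization
```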
SIFT Feature
• Generally used for object recognition – hence, it can serve as an image similarity measure
• Distance between SIFT features: the number of descriptor comparisons makes it computationally prohibitive
• Speed-up: quantize the descriptors to a finite vocabulary (consisting of words)
• Each image is then a weighted vector of the word frequencies
• Straight vocabulary: created by flat clustering of image descriptors into words – e.g. a 12-dimensional word-frequency feature needs 12 clusters
• Vocabulary tree: created using hierarchical k-means on SIFT features – e.g. M=1 at the root, M=3 more general words at the next level and M=9 most specific words at the leaves, for a final vocabulary size of 3+9=12
• Each feature belongs to one "word" at each level of the tree
Straight Vocabulary vs. Vocabulary Tree
• Straight vocabulary:
• Does not consider the relationships between words
• That is, it ignores that certain words are closer to each other than others
• At a very coarse level (dictionary size ~10-20), additional words are more descriptive than the relationships among words; therefore, the straight vocabulary outperforms the vocabulary tree
• In our experiments, low-dimensional SIFT features obtained using a straight vocabulary perform much better as "fingerprints" than tree-based features (a bag-of-words sketch follows below)
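A minimal sketch of the straight-vocabulary bag-of-words step, assuming scikit-learn's k-means; the function names are ours, and the tree variant is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_straight_vocabulary(train_descriptors, vocab_size=12):
    """Flat k-means over training SIFT descriptors; each cluster
    center is one 'word' of the vocabulary."""
    return KMeans(n_clusters=vocab_size, n_init=10).fit(train_descriptors)

def bag_of_words(image_descriptors, vocab):
    """Assign each descriptor of an image to its nearest word and
    return the normalized word-frequency vector."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```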
Non-keyframe-based Video Fingerprint
• Feature used for fingerprint creation: YCbCr histogram based feature
[Pipeline: the N frames of the video are split into K windows of P = N/K consecutive frames each; a 125-dimensional YCbCr histogram is computed per window, giving a K x 125 video fingerprint]
• The whole color space is quantized into 125 bins (5 bins for each of Y, Cb and Cr)
• Computing the histogram over P consecutive frames avoids key-frame extraction
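A sketch of this fingerprint, assuming RGB input frames; the BT.601 color conversion is standard, but the uniform 5-bin quantization per channel is our reading of the slide.

```python
import numpy as np

def ycbcr_histogram(frames):
    """125-bin YCbCr histogram over a window of consecutive RGB frames.
    frames: uint8 array of shape (P, H, W, 3).
    Returns a normalized 125-dim vector (5 bins per channel, 5^3 = 125)."""
    f = frames.astype(np.float64)
    R, G, B = f[..., 0], f[..., 1], f[..., 2]
    Y  = 0.299 * R + 0.587 * G + 0.114 * B          # BT.601 conversion
    Cb = 128.0 - 0.168736 * R - 0.331264 * G + 0.5 * B
    Cr = 128.0 + 0.5 * R - 0.418688 * G - 0.081312 * B
    # Quantize each channel into 5 uniform bins over [0, 256)
    q = lambda c: np.clip((c / 256.0 * 5).astype(int), 0, 4)
    idx = 25 * q(Y) + 5 * q(Cb) + q(Cr)             # joint bin in [0, 125)
    hist = np.bincount(idx.ravel(), minlength=125).astype(float)
    return hist / hist.sum()

def video_fingerprint(all_frames, K):
    """Split N frames into K windows of P = N // K frames and stack
    the per-window histograms into a K x 125 fingerprint."""
    P = len(all_frames) // K
    return np.stack([ycbcr_histogram(all_frames[i * P:(i + 1) * P])
                     for i in range(K)])
```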
Signature Distance Computation
• For two (K x d) fingerprints X and Y, where X(i) = i-th feature vector of X:
$d(X, Y) = \frac{1}{K}\sum_{j=1}^{K}\min_{1 \le i \le K}\left\|X(i) - Y(j)\right\|$
• Properties of this distance function:
• $d(X, Y) = 0$ is possible even if $X \ne Y$
• $d(X, Y) \ne d(Y, X)$, in general
• Such a distance relation is called a "quasi-distance"
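A direct NumPy transcription of this quasi-distance (function name ours):

```python
import numpy as np

def quasi_distance(X, Y):
    """Closest-overlap distance between two (K x d) fingerprints:
    each vector of Y is matched to its nearest vector in X, and the
    match distances are averaged. Asymmetric in general."""
    # pair[i, j] = ||X(i) - Y(j)||, shape (K, K)
    pair = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return pair.min(axis=0).mean()  # min over i, mean over j
```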
Motivation Behind the Distance Function
This closest-overlap based distance is robust to:
• Frame reordering: the temporal sequence need not match between two signatures – e.g. a video consisting of reordered scenes from the same video is still regarded as a duplicate
• Frame drops: if frames are dropped or some frames are corrupted by noise, the distance between duplicate videos should still be small
Experiments and Results
• We present precision-recall plots for both similarity and duplicate detection, over 3888 videos
• CFMT for dimensions 36/24/20/12/4
• SIFT for dimensions 781/341/33/21/12
• CFMT vs. the best-performing SIFT for duplicate detection
• SIFT vs. the best-performing CFMT for similarity detection
• CFMT performs better for duplicate detection
• SIFT performs better for similarity detection
[Figure] Precision-recall curves for different-dimensional CFMT: duplicate detection and similarity detection
[Figure] Precision-recall curves for different-dimensional SIFT: duplicate detection and similarity detection
[Figure] Precision-recall curves comparing the different descriptors: duplicate detection and similarity detection
Full-length Video Retrieval with Clip Querying
• Generation of the short query:
• We put together 4 different scenes from a full-length video to create the input query
• Each individual scene is represented by 8 keyframes
• For a single query, we therefore have 4 x 8 = 32 keyframes
• We experiment with different features for the query representation
• The repository consists of full-length video signatures (65 videos):
• The number of keyframes used to create the signature of each "large" video is varied from 1% to 4% of the video length
Algorithm
• Step 1: The input query signature $X_{query}$ is a (32 x d) matrix
• Step 2: Its distance from each stored "large video" signature $X_{large}$ is computed as:
$\Delta(i) = \min_{j}\left\|X_{query}(i) - X_{large}(j)\right\|, \quad 1 \le i \le 32$
$D(X_{query}, X_{large}) = \frac{1}{32}\sum_{i=1}^{32}\Delta(i)$
• Step 3: The best-matched video (smallest D) is returned
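A minimal sketch of this retrieval loop over an in-memory repository (variable and function names ours):

```python
import numpy as np

def query_distance(X_query, X_large):
    """D(X_query, X_large): each of the 32 query keyframe vectors is
    matched to its closest vector in the large-video signature, and the
    match distances are averaged (the Delta(i) and D of the slide)."""
    pair = np.linalg.norm(X_query[:, None, :] - X_large[None, :, :], axis=2)
    return pair.min(axis=1).mean()

def retrieve(X_query, repository):
    """Return the index of the best-matched large video, where
    repository is a list of (K_v x d) signature matrices."""
    dists = [query_distance(X_query, X_large) for X_large in repository]
    return int(np.argmin(dists))
```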
[Figure] Retrieval results for 1% summary lengths for "large" videos
[Figure] Retrieval results for 4% summary lengths for "large" videos
Conclusions
• CFMT features provide quick and accurate retrieval of duplicate videos
• SIFT features perform better for similar video detection
• Future work:
• Expanding the domain of "similar" videos (non-retakes that are still similar?)
• Importance of an efficient summary in creating the video signature (strategic keyframes vs. random keyframes?)
Thanks for your patience. Questions?