Visual Event Recognition in Videos by Learning from Web Data Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶ †Nanyang Technological University, Singapore ¶Kodak Research Labs, Rochester, NY, USA
Outline • Overview of the Event Recognition System • Similarity between Videos • Aligned Space-Time Pyramid Matching • Cross-Domain Problem • Adaptive Multiple Kernel Learning • Experiments • Conclusion
Overview • GOAL: Recognize consumer videos • Large intra-class variability; limited labeled videos • [Example video frames: "wedding", "sports", "picnic"]
Overview • GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube) • [Figure: a few labeled consumer videos vs. a large number of web videos for "wedding", "sports", "picnic"]
Overview • Flowchart of the system • [Flowchart blocks: video database, classifier, test video, output]
Similarity between Videos • Pyramid matching methods • Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1] • Unaligned space-time pyramid matching, I. Laptev [2] • [Illustration: time axis, space axes, space-time axes]
Similarity between Videos • Aligned Space-Time Pyramid Matching • Each video is divided into non-overlapping space-time volumes at each pyramid level • Allows greater variability in the spatial and temporal positions of an event • Two-step approach • Step 1: Distances between space-time volumes, computed with existing methods such as the bag-of-words model, I. Laptev [2]
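A minimal sketch of the volume-partitioning step, assuming the video is already loaded as a T × H × W (× C) array; `split_volumes` and the 2 × 2 × 2 split are illustrative choices, and feature extraction inside each volume is not shown:

```python
import numpy as np

def split_volumes(video, splits=(2, 2, 2)):
    """Divide a video array into non-overlapping space-time volumes.

    `splits` gives the number of partitions along the temporal axis and the
    two spatial axes (2 x 2 x 2 = 8 volumes). This only illustrates the
    partitioning; each volume would then be described, e.g., by a
    bag-of-words histogram.
    """
    volumes = []
    for t in np.array_split(video, splits[0], axis=0):        # temporal cuts
        for h in np.array_split(t, splits[1], axis=1):        # vertical cuts
            volumes.extend(np.array_split(h, splits[2], axis=2))  # horizontal cuts
    return volumes
```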
Similarity between Videos • Aligned Space-Time Pyramid Matching • Step 2: Level-1 distance between two videos, obtained by aligning their space-time volumes
Similarity between Videos • Integer-flow Earth Mover's Distance (EMD), Y. Rubner [3] • Distance between two videos at level 1: $D = \min_{\{f_{rc}\}} \frac{\sum_{r}\sum_{c} f_{rc}\, d_{rc}}{\sum_{r}\sum_{c} f_{rc}}$ s.t. $f_{rc} \in \{0,1\}$, $\sum_{c} f_{rc} = 1$, $\sum_{r} f_{rc} = 1$, where $d_{rc}$ is the distance between volume $r$ of one video and volume $c$ of the other, and $f_{rc}$ is the binary flow between them
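With equal-weight volumes, an integer-flow EMD of this form reduces to a one-to-one assignment problem, so a short sketch can lean on SciPy's Hungarian solver; `aligned_distance` and `volume_dist` (e.g. a χ² distance between bag-of-words histograms) are placeholder names, not the authors' code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_distance(volumes_a, volumes_b, volume_dist):
    """Level-1 aligned distance between two videos.

    volumes_a, volumes_b: lists of per-volume descriptors (e.g. histograms).
    volume_dist: callable returning the distance between two volumes.

    With equal-weight volumes, the binary flows f_rc form a one-to-one
    matching, found here with the Hungarian algorithm.
    """
    D = np.array([[volume_dist(a, b) for b in volumes_b] for a in volumes_a])
    rows, cols = linear_sum_assignment(D)      # optimal binary flows f_rc
    return D[rows, cols].sum() / len(rows)     # normalised matching cost
```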
Cross-Domain Problem • Data distribution mismatch between consumer videos and web videos • Consumer videos: naturally captured • Web videos: edited; selected • Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]: $\mathrm{DIST}(\mathcal{D}^A, \mathcal{D}^T) = \big\| \frac{1}{n_A} \sum_{i=1}^{n_A} \phi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \phi(x_i^T) \big\|$, where $\mathcal{D}^A$ and $\mathcal{D}^T$ are the auxiliary (web) and target (consumer) domains with $n_A$ and $n_T$ samples, and $\phi(\cdot)$ is the kernel-induced feature map
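A small sketch of the squared MMD in kernel form, assuming the full kernel matrix is ordered with the auxiliary (web) samples first; `mmd` is a hypothetical helper, not part of the paper's code:

```python
import numpy as np

def mmd(K, n_aux, n_tgt):
    """Squared MMD between auxiliary and target samples.

    K: (n_aux + n_tgt) x (n_aux + n_tgt) kernel matrix, auxiliary rows first.
    Uses the identity MMD^2 = s^T K s with s_i = 1/n_aux for auxiliary
    samples and s_i = -1/n_tgt for target samples.
    """
    s = np.concatenate([np.full(n_aux, 1.0 / n_aux),
                        np.full(n_tgt, -1.0 / n_tgt)])
    return float(s @ K @ s)
```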
Cross-Domain Problem • Suppose there are $P$ pre-learned classifiers $f_p(x)$, $p = 1, \dots, P$ • Each $f_p$ is learned by SVM with the labeled training data from both domains • Proposed target decision function: $f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \Delta f(x)$ • The first term carries the prior information, where $\beta_p$ is the linear combination coefficient and $\Delta f(x)$ is the perturbation function
Cross-Domain Problem • Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), the perturbation function is built on a linear combination of base kernels • MKL: $k(x_i, x_j) = \sum_{m=1}^{M} d_m k_m(x_i, x_j)$, where $d_m \geq 0$ and $\sum_{m=1}^{M} d_m = 1$ • MMD with the combined kernel: $\mathrm{DIST}_k^2(\mathcal{D}^A, \mathcal{D}^T) = \sum_{m=1}^{M} d_m\, s^\top K_m s$, where $K_m$ is the $m$-th base kernel matrix and $s$ has entries $1/n_A$ for auxiliary samples and $-1/n_T$ for target samples
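A hedged sketch of evaluating the target decision function above, with the perturbation written in its kernel-expansion form; every name here (`target_decision`, `alpha_y`, the callables in `prelearned`) is illustrative rather than the paper's implementation:

```python
import numpy as np

def target_decision(x, prelearned, beta, d, base_kernels, alpha_y, X_train, b):
    """Sketch of f^T(x) = sum_p beta_p * f_p(x) + Delta f(x), where Delta f
    is expanded over the labeled training samples with the combined kernel
    k = sum_m d_m * k_m.

    prelearned  : list of P pre-learned classifiers f_p (callables)
    beta        : length-P combination coefficients
    d           : length-M kernel weights (d_m >= 0, sum d_m = 1)
    base_kernels: list of M kernel functions k_m(x1, x2)
    alpha_y     : per-training-sample dual weights alpha_i * y_i
    X_train     : labeled training samples from both domains
    b           : bias term
    """
    prior = sum(b_p * f_p(x) for b_p, f_p in zip(beta, prelearned))
    k_combined = np.array([sum(d_m * k_m(xi, x)
                               for d_m, k_m in zip(d, base_kernels))
                           for xi in X_train])
    return prior + float(alpha_y @ k_combined) + b
```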
Cross-Domain Problem • Adaptive Multiple Kernel Learning (A-MKL) • Jointly minimize the MMD criterion and the structural risk functional: $\min_d\ G(d) = \tfrac{1}{2}\,\Omega^2(d) + \theta\, J(d)$, where $\Omega(d)$ measures the MMD between the two domains under the combined kernel, $J(d)$ is the SVM-style structural risk on the labeled data, and $\theta > 0$ is a trade-off parameter
Cross-Domain Problem • Dual form of the structural risk functional $J(d)$ • A-MKL algorithm • Iteratively solve for the linear combination coefficients and the dual variables in the dual form of $J(d)$
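The alternating idea can be sketched as below, assuming precomputed base kernel matrices; this is an illustrative gradient/SVM loop, not the authors' reduced-gradient procedure, and the pre-learned classifiers are omitted for brevity:

```python
import numpy as np
from sklearn.svm import SVC

def a_mkl_sketch(base_K, s, y, theta=1.0, C=1.0, lr=0.05, n_iter=20):
    """Simplified alternating optimization of G(d) = 0.5*Omega(d)^2 + theta*J(d).

    base_K : list of M precomputed base kernel matrices (n x n)
    s      : MMD coefficient vector (+1/n_aux, -1/n_tgt entries)
    y      : labels in {-1, +1}
    Fix d, solve an SVM on the combined kernel to get the dual variables,
    then take a gradient step on d and project back onto the simplex.
    """
    M = len(base_K)
    d = np.full(M, 1.0 / M)                                   # uniform start
    for _ in range(n_iter):
        K = sum(dm * Km for dm, Km in zip(d, base_K))
        svc = SVC(C=C, kernel="precomputed").fit(K, y)
        ay = np.zeros(len(y))                                 # alpha_i * y_i
        ay[svc.support_] = svc.dual_coef_.ravel()
        omega_m = np.array([s @ Km @ s for Km in base_K])     # per-kernel MMD
        # d/dd_m of 0.5*Omega^2 is Omega*omega_m; of the dual risk, -0.5*ay'K_m ay
        grad = (d @ omega_m) * omega_m + theta * np.array(
            [-0.5 * ay @ Km @ ay for Km in base_K])
        d = np.maximum(d - lr * grad, 0.0)
        d /= d.sum()                                          # back to the simplex
    return d
```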
Cross-Domain Problem • Related cross-domain learning methods • Feature Replication (FR), H. Daumé III [6]: augments the features of source and target samples (sketch below) • Domain Transfer SVM (DTSVM), L. Duan [7]: uses no prior information from pre-learned classifiers • Adaptive SVM (A-SVM), J. Yang [8]: the combination of the auxiliary classifiers is pre-defined and the perturbation function is modeled by a standard SVM
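For concreteness, the FR augmentation maps a source sample x to (x, x, 0) and a target sample to (x, 0, x); a minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def feature_replication(X, is_target):
    """Feature augmentation from Daume III's domain adaptation [6]:
    source samples become (x, x, 0) and target samples become (x, 0, x),
    so one classifier can learn shared and domain-specific components.
    """
    zeros = np.zeros_like(X)
    src = np.hstack([X, X, zeros])
    tgt = np.hstack([X, zeros, X])
    return np.where(np.asarray(is_target)[:, None], tgt, src)
```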
Experiments • Data set • 195 consumer videos and 906 web videos, collected by ourselves and taken from the Kodak Consumer Video Benchmark Data Set • 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports” • Training data: 3 videos per event from the consumer videos, plus all web videos • Test data: the remaining consumer videos
Experiments • Two types of features • Space-time (ST) feature, Laptev et al. [2] • SIFT feature, Lowe [9] • Four types of base kernels built from a pairwise distance $d_{ij}$ (see the sketch after this list) • Gaussian: $\exp(-\gamma d_{ij}^2)$ • Laplacian: $\exp(-\sqrt{\gamma}\, d_{ij})$ • Inverse Square Distance: $\frac{1}{\gamma d_{ij}^2 + 1}$ • Inverse Distance: $\frac{1}{\sqrt{\gamma}\, d_{ij} + 1}$
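A short sketch of building the base kernels listed above from a pairwise distance matrix; the parameter grid `gammas` is a placeholder standing in for the 5 kernel parameters used in the talk:

```python
import numpy as np

def base_kernels(D, gammas):
    """Build the four kernel types from a pairwise distance matrix D,
    one kernel matrix per (kernel type, gamma) pair."""
    kernels = []
    for g in gammas:
        kernels.append(np.exp(-g * D ** 2))           # Gaussian
        kernels.append(np.exp(-np.sqrt(g) * D))       # Laplacian
        kernels.append(1.0 / (g * D ** 2 + 1.0))      # inverse square distance
        kernels.append(1.0 / (np.sqrt(g) * D + 1.0))  # inverse distance
    return kernels
```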
Experiments • Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM) • [Table: per-event results for unaligned vs. aligned matching] • ASTPM is better than USTPM at Level 1
Experiments • 80 base kernels in total: 2 pyramid levels, 2 types of features, 5 kernel parameters and 4 types of kernels • Average classifiers at each level: for each pyramid level and each feature type, average the outputs of the 20 base classifiers (5 parameters × 4 kernel types) learned by SVM (sketch below) • Pre-learned classifiers $f_p$: 4 average classifiers (2 levels × 2 feature types)
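A hedged sketch of one such average classifier, assuming precomputed train/test kernel matrices for the 20 base kernels that share a pyramid level and feature type; `average_classifier` is an illustrative helper, not the authors' exact setup:

```python
import numpy as np
from sklearn.svm import SVC

def average_classifier(kernels_train, kernels_test, y):
    """Train one SVM per base kernel in the group and average their
    decision values on the test kernels.

    kernels_train: list of (n_train x n_train) kernel matrices
    kernels_test : list of (n_test x n_train) kernel matrices
    y            : training labels
    """
    scores = []
    for K_tr, K_te in zip(kernels_train, kernels_test):
        svc = SVC(kernel="precomputed").fit(K_tr, y)
        scores.append(svc.decision_function(K_te))
    return np.mean(scores, axis=0)
```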
Experiments • Comparisons of cross-domain learning methods • [Figure: per-event results using (a) SIFT features, (b) ST features, (c) SIFT and ST features] • “parade”: 75.7% (A-MKL) vs. 62.2% (FR)
Experiments • Comparisons of cross-domain learning methods • Relative improvements of A-MKL over the baselines • SVM_T: 36.9% • SVM_AT: 8.6% • Feature Replication (FR) [6]: 7.6% • Adaptive SVM (A-SVM) [8]: 49.6% • Domain Transfer SVM (DTSVM) [7]: 9.9% • MKL-based methods • Fuse SIFT and ST features more effectively • Are more robust to noise in the loose labels
Conclusion • We propose a new event recognition framework for consumer videos that leverages a large number of loosely labeled web videos. • We develop a new aligned space-time pyramid matching method. • We present a new cross-domain learning method, A-MKL, which handles the mismatch between the data distributions of the consumer video domain and the web video domain.
References [1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008. [2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. [3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000. [4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.
References [5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004. [6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007. [7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009. [8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007. [9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.