Visual Event Recognition in Videos by Learning from Web Data Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶ †Nanyang Technological University, Singapore ¶Kodak Research Labs, Rochester, NY, USA
Outline • Overview of the Event Recognition System • Similarity between Videos • Aligned Space-Time Pyramid Matching • Cross-Domain Problem • Adaptive Multiple Kernel Learning • Experiments • Conclusion
Overview • GOAL: Recognize consumer videos • Large intra-class variability; limited labeled videos • [Example video frames: "wedding", "sports", "picnic"]
Overview • GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube) • [Figure: a few labeled consumer videos vs. a large number of web videos for "wedding", "sports", "picnic"]
Overview • Flowchart of the system • [Flowchart blocks: video database, classifier, test video, output]
Similarity between Videos • Pyramid matching methods • Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1] • Unaligned space-time pyramid matching, I. Laptev [2] • [Illustration: time axis, space axes, space-time axes]
Similarity between Videos • Aligned Space-Time Pyramid Matching • Each video is divided into non-overlapping space-time volumes at each pyramid level • Allows greater variability in the spatial and temporal positions of an event • Two-step approach • Step 1: Distances between space-time volumes, computed with existing methods such as the bag-of-words model, I. Laptev [2]
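A minimal sketch of the volume-partitioning step, assuming the video is already loaded as a T × H × W (× C) array; `split_volumes` and the 2 × 2 × 2 split are illustrative choices, and feature extraction inside each volume is not shown:

```python
import numpy as np

def split_volumes(video, splits=(2, 2, 2)):
    """Divide a video array into non-overlapping space-time volumes.

    `splits` gives the number of partitions along the temporal axis and the
    two spatial axes (2 x 2 x 2 = 8 volumes). This only illustrates the
    partitioning; each volume would then be described, e.g., by a
    bag-of-words histogram.
    """
    volumes = []
    for t in np.array_split(video, splits[0], axis=0):        # temporal cuts
        for h in np.array_split(t, splits[1], axis=1):        # vertical cuts
            volumes.extend(np.array_split(h, splits[2], axis=2))  # horizontal cuts
    return volumes
```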
Similarity between Videos • Aligned Space-Time Pyramid Matching • Step 2: Level-1 distance between two videos, obtained by aligning their space-time volumes
Similarity between Videos • Integer-flow Earth Mover's Distance (EMD), Y. Rubner [3] • Distance between two videos at level 1: $D = \min_{\{f_{rc}\}} \frac{\sum_{r}\sum_{c} f_{rc}\, d_{rc}}{\sum_{r}\sum_{c} f_{rc}}$ s.t. $f_{rc} \in \{0,1\}$, $\sum_{c} f_{rc} = 1$, $\sum_{r} f_{rc} = 1$, where $d_{rc}$ is the distance between volume $r$ of one video and volume $c$ of the other, and $f_{rc}$ is the binary flow between them
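With equal-weight volumes, an integer-flow EMD of this form reduces to a one-to-one assignment problem, so a short sketch can lean on SciPy's Hungarian solver; `aligned_distance` and `volume_dist` (e.g. a χ² distance between bag-of-words histograms) are placeholder names, not the authors' code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_distance(volumes_a, volumes_b, volume_dist):
    """Level-1 aligned distance between two videos.

    volumes_a, volumes_b: lists of per-volume descriptors (e.g. histograms).
    volume_dist: callable returning the distance between two volumes.

    With equal-weight volumes, the binary flows f_rc form a one-to-one
    matching, found here with the Hungarian algorithm.
    """
    D = np.array([[volume_dist(a, b) for b in volumes_b] for a in volumes_a])
    rows, cols = linear_sum_assignment(D)      # optimal binary flows f_rc
    return D[rows, cols].sum() / len(rows)     # normalised matching cost
```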
Cross-Domain Problem • Data distribution mismatch between consumer videos and web videos • Consumer videos: naturally captured • Web videos: edited; selected • Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]: $\mathrm{DIST}(\mathcal{D}^A, \mathcal{D}^T) = \big\| \frac{1}{n_A} \sum_{i=1}^{n_A} \phi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \phi(x_i^T) \big\|$, where $\mathcal{D}^A$ and $\mathcal{D}^T$ are the auxiliary (web) and target (consumer) domains with $n_A$ and $n_T$ samples, and $\phi(\cdot)$ is the kernel-induced feature map
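A small sketch of the squared MMD in kernel form, assuming the full kernel matrix is ordered with the auxiliary (web) samples first; `mmd` is a hypothetical helper, not part of the paper's code:

```python
import numpy as np

def mmd(K, n_aux, n_tgt):
    """Squared MMD between auxiliary and target samples.

    K: (n_aux + n_tgt) x (n_aux + n_tgt) kernel matrix, auxiliary rows first.
    Uses the identity MMD^2 = s^T K s with s_i = 1/n_aux for auxiliary
    samples and s_i = -1/n_tgt for target samples.
    """
    s = np.concatenate([np.full(n_aux, 1.0 / n_aux),
                        np.full(n_tgt, -1.0 / n_tgt)])
    return float(s @ K @ s)
```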
Cross-Domain Problem • Suppose there are $P$ pre-learned classifiers $f_p(x)$, $p = 1, \dots, P$ • Each $f_p$ is learned by SVM with the labeled training data from both domains • Proposed target decision function: $f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \Delta f(x)$ • The first term carries the prior information, where $\beta_p$ is the linear combination coefficient and $\Delta f(x)$ is the perturbation function
Cross-Domain Problem • Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), the perturbation function is built on a linear combination of base kernels • MKL: $k(x_i, x_j) = \sum_{m=1}^{M} d_m k_m(x_i, x_j)$, where $d_m \geq 0$ and $\sum_{m=1}^{M} d_m = 1$ • MMD with the combined kernel: $\mathrm{DIST}_k^2(\mathcal{D}^A, \mathcal{D}^T) = \sum_{m=1}^{M} d_m\, s^\top K_m s$, where $K_m$ is the $m$-th base kernel matrix and $s$ has entries $1/n_A$ for auxiliary samples and $-1/n_T$ for target samples
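A hedged sketch of evaluating the target decision function above, with the perturbation written in its kernel-expansion form; every name here (`target_decision`, `alpha_y`, the callables in `prelearned`) is illustrative rather than the paper's implementation:

```python
import numpy as np

def target_decision(x, prelearned, beta, d, base_kernels, alpha_y, X_train, b):
    """Sketch of f^T(x) = sum_p beta_p * f_p(x) + Delta f(x), where Delta f
    is expanded over the labeled training samples with the combined kernel
    k = sum_m d_m * k_m.

    prelearned  : list of P pre-learned classifiers f_p (callables)
    beta        : length-P combination coefficients
    d           : length-M kernel weights (d_m >= 0, sum d_m = 1)
    base_kernels: list of M kernel functions k_m(x1, x2)
    alpha_y     : per-training-sample dual weights alpha_i * y_i
    X_train     : labeled training samples from both domains
    b           : bias term
    """
    prior = sum(b_p * f_p(x) for b_p, f_p in zip(beta, prelearned))
    k_combined = np.array([sum(d_m * k_m(xi, x)
                               for d_m, k_m in zip(d, base_kernels))
                           for xi in X_train])
    return prior + float(alpha_y @ k_combined) + b
```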
Cross-Domain Problem • Adaptive Multiple Kernel Learning (A-MKL) • Jointly minimize the MMD criterion and the structural risk functional: $\min_d\ G(d) = \tfrac{1}{2}\,\Omega^2(d) + \theta\, J(d)$, where $\Omega(d)$ measures the MMD between the two domains under the combined kernel, $J(d)$ is the SVM-style structural risk on the labeled data, and $\theta > 0$ is a trade-off parameter
Cross-Domain Problem • Dual form of the structural risk functional $J(d)$ • A-MKL algorithm • Iteratively solve for the linear combination coefficients and the dual variables in the dual form of $J(d)$
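The alternating idea can be sketched as below, assuming precomputed base kernel matrices; this is an illustrative gradient/SVM loop, not the authors' reduced-gradient procedure, and the pre-learned classifiers are omitted for brevity:

```python
import numpy as np
from sklearn.svm import SVC

def a_mkl_sketch(base_K, s, y, theta=1.0, C=1.0, lr=0.05, n_iter=20):
    """Simplified alternating optimization of G(d) = 0.5*Omega(d)^2 + theta*J(d).

    base_K : list of M precomputed base kernel matrices (n x n)
    s      : MMD coefficient vector (+1/n_aux, -1/n_tgt entries)
    y      : labels in {-1, +1}
    Fix d, solve an SVM on the combined kernel to get the dual variables,
    then take a gradient step on d and project back onto the simplex.
    """
    M = len(base_K)
    d = np.full(M, 1.0 / M)                                   # uniform start
    for _ in range(n_iter):
        K = sum(dm * Km for dm, Km in zip(d, base_K))
        svc = SVC(C=C, kernel="precomputed").fit(K, y)
        ay = np.zeros(len(y))                                 # alpha_i * y_i
        ay[svc.support_] = svc.dual_coef_.ravel()
        omega_m = np.array([s @ Km @ s for Km in base_K])     # per-kernel MMD
        # d/dd_m of 0.5*Omega^2 is Omega*omega_m; of the dual risk, -0.5*ay'K_m ay
        grad = (d @ omega_m) * omega_m + theta * np.array(
            [-0.5 * ay @ Km @ ay for Km in base_K])
        d = np.maximum(d - lr * grad, 0.0)
        d /= d.sum()                                          # back to the simplex
    return d
```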
Cross-Domain Problem • Related cross-domain learning methods • Feature Replication (FR), H. Daumé III [6]: augments the features of source and target samples (sketch below) • Domain Transfer SVM (DTSVM), L. Duan [7]: uses no prior information from pre-learned classifiers • Adaptive SVM (A-SVM), J. Yang [8]: the combination of the auxiliary classifiers is pre-defined and the perturbation function is modeled by a standard SVM
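For concreteness, the FR augmentation maps a source sample x to (x, x, 0) and a target sample to (x, 0, x); a minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def feature_replication(X, is_target):
    """Feature augmentation from Daume III's domain adaptation [6]:
    source samples become (x, x, 0) and target samples become (x, 0, x),
    so one classifier can learn shared and domain-specific components.
    """
    zeros = np.zeros_like(X)
    src = np.hstack([X, X, zeros])
    tgt = np.hstack([X, zeros, X])
    return np.where(np.asarray(is_target)[:, None], tgt, src)
```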
Experiments • Data set • 195 consumer videos and 906 web videos, collected by ourselves and taken from the Kodak Consumer Video Benchmark Data Set • 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports” • Training data: 3 videos per event from the consumer videos, plus all web videos • Test data: the remaining consumer videos
Experiments • Two types of features • Space-time (ST) feature, Laptev et al. [2] • SIFT feature, Lowe [9] • Four types of base kernels built from a pairwise distance $d_{ij}$ (see the sketch after this list) • Gaussian: $\exp(-\gamma d_{ij}^2)$ • Laplacian: $\exp(-\sqrt{\gamma}\, d_{ij})$ • Inverse Square Distance: $\frac{1}{\gamma d_{ij}^2 + 1}$ • Inverse Distance: $\frac{1}{\sqrt{\gamma}\, d_{ij} + 1}$
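A short sketch of building the base kernels listed above from a pairwise distance matrix; the parameter grid `gammas` is a placeholder standing in for the 5 kernel parameters used in the talk:

```python
import numpy as np

def base_kernels(D, gammas):
    """Build the four kernel types from a pairwise distance matrix D,
    one kernel matrix per (kernel type, gamma) pair."""
    kernels = []
    for g in gammas:
        kernels.append(np.exp(-g * D ** 2))           # Gaussian
        kernels.append(np.exp(-np.sqrt(g) * D))       # Laplacian
        kernels.append(1.0 / (g * D ** 2 + 1.0))      # inverse square distance
        kernels.append(1.0 / (np.sqrt(g) * D + 1.0))  # inverse distance
    return kernels
```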
Experiments • Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM) • [Table: per-event results for unaligned vs. aligned matching] • ASTPM is better than USTPM at Level 1
Experiments • 80 base kernels in total: 2 pyramid levels, 2 types of features, 5 kernel parameters and 4 types of kernels • Average classifiers at each level: for each pyramid level and each feature type, average the outputs of the 20 base classifiers (5 parameters × 4 kernel types) learned by SVM (sketch below) • Pre-learned classifiers $f_p$: 4 average classifiers (2 levels × 2 feature types)
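A hedged sketch of one such average classifier, assuming precomputed train/test kernel matrices for the 20 base kernels that share a pyramid level and feature type; `average_classifier` is an illustrative helper, not the authors' exact setup:

```python
import numpy as np
from sklearn.svm import SVC

def average_classifier(kernels_train, kernels_test, y):
    """Train one SVM per base kernel in the group and average their
    decision values on the test kernels.

    kernels_train: list of (n_train x n_train) kernel matrices
    kernels_test : list of (n_test x n_train) kernel matrices
    y            : training labels
    """
    scores = []
    for K_tr, K_te in zip(kernels_train, kernels_test):
        svc = SVC(kernel="precomputed").fit(K_tr, y)
        scores.append(svc.decision_function(K_te))
    return np.mean(scores, axis=0)
```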
Experiments • Comparisons of cross-domain learning methods • [Figure: per-event results using (a) SIFT features, (b) ST features, (c) SIFT and ST features] • “parade”: 75.7% (A-MKL) vs. 62.2% (FR)
Experiments • Comparisons of cross-domain learning methods • Relative improvements of A-MKL over the baselines • SVM_T: 36.9% • SVM_AT: 8.6% • Feature Replication (FR) [6]: 7.6% • Adaptive SVM (A-SVM) [8]: 49.6% • Domain Transfer SVM (DTSVM) [7]: 9.9% • MKL-based methods • Fuse SIFT and ST features more effectively • Are more robust to noise in the loose labels
Conclusion • We propose a new event recognition framework for consumer videos that leverages a large number of loosely labeled web videos. • We develop a new aligned space-time pyramid matching method. • We present a new cross-domain learning method, A-MKL, which handles the mismatch between the data distributions of the consumer video domain and the web video domain.
References [1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008. [2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. [3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000. [4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.
References [5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004. [6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007. [7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009. [8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007. [9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.