An In-Depth Evaluation of Multimodal Video Genre Categorization
11th International Workshop on Content-Based Multimedia Indexing, CBMI 2013, Veszprém, Hungary, June 17-19, 2013
University POLITEHNICA of Bucharest
Presentation outline • Introduction • Video Content Description • Fusion Techniques • Experimental Results • Conclusions
Problem Statement • Content-Based Video Retrieval: a query (e.g., concepts) is run against a video database and returns a ranked list of results • Genre Retrieval: the query is a genre label [diagram: genre query → database → results]
Global Approach > challenge: find a way to assign (genre) tags to unknown videos; > approach: machine learning paradigm: train a classifier on labeled data (a tagged video database) and use it to label unlabeled data. [diagram: tagged video database → train classifier → label unknown videos (e.g., web, food, autos)]
Global Approach • the entire process relies on the concept of "similarity" computed between content annotations (numeric features); • we focus on three objectives: objective 1: go (truly) multimodal: visual, audio, text & metadata; objective 2: test a broad range of classifiers; objective 3: test a broad range of fusion techniques
Video Content Description - audio • Standard audio features (audio frame-based) [B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]: Zero-Crossing Rate, Linear Predictive Coefficients, Line Spectral Pairs, Mel-Frequency Cepstral Coefficients, spectral centroid, flux, rolloff, and kurtosis, … • global feature = mean & variance of each frame-based feature computed over a certain window
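As an illustration of the pooling step, a minimal sketch assuming the frame-level features have already been extracted (e.g., with the Yaafe toolbox); the function name and input layout are illustrative, not from the paper:

```python
import numpy as np

def audio_global_descriptor(frames):
    """frames: (n_frames, n_dims) array of frame-level audio features
    (ZCR, LPC, LSP, MFCC, spectral centroid/flux/rolloff/kurtosis, ...).
    Global descriptor = per-dimension mean and variance over the window."""
    return np.concatenate([frames.mean(axis=0), frames.var(axis=0)])
```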
Video Content Description - visual • MPEG-7 & color/texture descriptors (visual frame-based) [OpenCV toolbox, http://opencv.willowgarage.com]: Local Binary Pattern, Autocorrelogram, Color Coherence Vector, Color Layout Pattern, Edge Histogram, classic color histogram, Structure Color Descriptor, Color moments, … • global feature = mean & dispersion & skewness & kurtosis & median & root mean square of each frame feature over time
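The same kind of temporal pooling, now with the six statistics named above; a sketch that treats "dispersion" as the standard deviation and assumes the frame descriptors are already computed:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def visual_global_descriptor(frames):
    """frames: (n_frames, n_dims) frame-level visual descriptors
    (e.g., MPEG-7 color/texture features). Pools each dimension with
    mean, dispersion (std. dev.), skewness, kurtosis, median, and RMS."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),                   # dispersion
        skew(frames, axis=0),
        kurtosis(frames, axis=0),
        np.median(frames, axis=0),
        np.sqrt((frames ** 2).mean(axis=0)),  # root mean square
    ])
```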
Video Content Description - visual • Bag-of-Visual-Words framework [CIVR 2009, J. Uijlings et al.]: detect interest points → compute feature descriptors (rgbSIFT with 2x2 spatial pyramids) → build a codeword dictionary (we train the model with 4,096 words) → generate BoW histograms → train classifier
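A compact sketch of this pipeline; it substitutes plain grayscale SIFT for rgbSIFT and omits the 2x2 spatial pyramid for brevity, so it illustrates the framework rather than reproducing the paper's exact descriptor:

```python
import numpy as np
import cv2
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def local_descriptors(image_bgr):
    """Detect interest points and compute SIFT descriptors for one frame."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = sift.detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def train_codebook(training_descriptors, n_words=4096):
    """Cluster pooled training descriptors into the codeword dictionary."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    km.fit(np.vstack(training_descriptors))
    return km

def bow_histogram(image_bgr, codebook):
    """Represent a frame as a normalized histogram of codeword hits."""
    desc = local_descriptors(image_bgr)
    hist = np.zeros(codebook.n_clusters)
    if len(desc):
        words = codebook.predict(desc)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```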
Video Content Description - visual • Histogram of oriented Gradients (HoG) [CITS 2009, O. Ludwig et al.]: divides the image into 3x3 cells and, for each of them, builds a pixel-wise histogram of edge orientations
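A sketch of such a coarse-cell HoG using scikit-image; only the 3x3 cell layout comes from the slide, while the 9 orientation bins are an assumption:

```python
from skimage.feature import hog
from skimage.color import rgb2gray

def hog_descriptor(image_rgb, n_cells=3, n_bins=9):
    """Split the frame into n_cells x n_cells cells and build a histogram
    of edge orientations per cell (n_bins is illustrative)."""
    gray = rgb2gray(image_rgb)
    h, w = gray.shape
    return hog(gray,
               orientations=n_bins,
               pixels_per_cell=(h // n_cells, w // n_cells),
               cells_per_block=(1, 1))
```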
Video Content Description - visual • Structural descriptors [IJCV, C. Rasche'10]. Objective: describe structural information in terms of contours and their relations. Contour properties: degree of curvature (proportional to the maximum amplitude of the bowness space): straight vs. bow; degree of circularity: ½ circle vs. full circle; edginess parameter: zig-zag vs. sinusoid; symmetry parameter: irregular vs. "even". Appearance parameters: mean and std. dev. of intensity along the contour; fuzziness, obtained from a blob (DoG) filter: I * DoG
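Of these parameters, the fuzziness term is the easiest to illustrate: a difference-of-Gaussians (blob) filter applied to the image, then sampled along the contour. A sketch with assumed sigma values (the full descriptor is defined in [Rasche'10]):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_response(image, sigma1=1.0, sigma2=2.0):
    """I * DoG, computed as the difference of two Gaussian-smoothed copies
    of the image; the sigmas are illustrative, not from the paper."""
    return gaussian_filter(image, sigma1) - gaussian_filter(image, sigma2)

def contour_fuzziness(image, contour_points):
    """Mean |DoG| response sampled along a contour given as (row, col) points."""
    dog = dog_response(image.astype(float))
    rows, cols = contour_points[:, 0], contour_points[:, 1]
    return np.abs(dog[rows, cols]).mean()
```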
Video Content Description - text • TF-IDF descriptors (Term Frequency-Inverse Document Frequency); text sources: ASR and metadata. 1. remove XML markup, 2. remove terms below the 5%-percentile of the frequency distribution, 3. select the term corpus: retain for each genre class the m terms (e.g., m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently than in the complement classes, 4. represent each document by its TF-IDF values
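Steps 2-4 could be approximated with scikit-learn as below; this is a sketch rather than the paper's exact pipeline: `documents` and `genre_labels` are assumed inputs, `min_df=0.05` only approximates the percentile cut, and SelectKBest picks terms globally rather than per genre class:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# documents: list of transcript/metadata strings with XML already stripped
# genre_labels: one genre id per document
vectorizer = TfidfVectorizer(min_df=0.05)       # drop rare terms (step 2)
tfidf = vectorizer.fit_transform(documents)     # TF-IDF values (step 4)
selector = SelectKBest(chi2, k=150)             # chi2 term selection (step 3)
features = selector.fit_transform(tfidf, genre_labels)
```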
Classifiers We test a broad range of classifiers: • SVM with linear, RBF, and χ2 kernels • 5-NN • Random Trees (Random Forests) and Extremely Random Trees (ERF)
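A minimal sketch of this classifier bank using scikit-learn equivalents (the paper's implementations may differ); `X_train` and `y_train` are assumed descriptor matrices and genre labels:

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics.pairwise import chi2_kernel

classifiers = {
    "SVM-linear": SVC(kernel="linear", probability=True),
    "SVM-RBF":    SVC(kernel="rbf", probability=True),
    "SVM-chi2":   SVC(kernel=chi2_kernel, probability=True),  # needs X >= 0
    "5-NN":       KNeighborsClassifier(n_neighbors=5),
    "RF":         RandomForestClassifier(),
    "ERF":        ExtraTreesClassifier(),  # extremely randomized trees
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
```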
Fusion Techniques Early Fusion: feature extraction (Descriptor 1 … Descriptor n) → feature normalization (Descriptor 1 normalized … Descriptor n normalized) → feature concatenation into a global descriptor → classification step (single classifier) → obtain the global confidence score → decision
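A minimal sketch of these early-fusion steps; min-max normalization and the SVM are assumptions, since the slide does not fix the normalization scheme or classifier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# descriptors: assumed list of (n_videos, dim_i) arrays, one per modality
scalers = [MinMaxScaler().fit(d) for d in descriptors]        # normalize each
global_descriptor = np.hstack(                                # then concatenate
    [s.transform(d) for s, d in zip(scalers, descriptors)])
classifier = SVC(probability=True).fit(global_descriptor, labels)
scores = classifier.predict_proba(global_descriptor)          # global confidence
```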
Fusion Techniques Late Fusion: feature extraction (Descriptor 1 … Descriptor n) → classification step (Classifier 1 … Classifier n, one per descriptor) → confidence-score normalization (confidence value 1 … confidence value n, normalized) → aggregation into a global confidence score → decision
Fusion Techniques Late Fusion • the normalized confidence values are aggregated, e.g., with a weighted sum: $\sum_{i=1}^{N} \theta_i \cdot cv_i^q(d)$, or with a rank-based variant in which $cv_i^q(d)$ is replaced by the rank assigned by classifier i; where $cv_i^q(d)$ is the confidence value of classifier i for class q, d is the current video, $\theta_i$ are some weights, N is the number of classifiers to be aggregated, and rank() represents the rank of classifier i.
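A sketch of score-level aggregation along these lines; CombSUM and CombMNZ follow their standard definitions (CombMNZ is the scheme highlighted in the results), and the `confidences` array layout is an assumption:

```python
import numpy as np

def comb_sum(confidences, weights=None):
    """Weighted CombSUM. confidences: (N_classifiers, n_classes) array of
    normalized confidence values for one video; weights default to 1."""
    c = np.asarray(confidences)
    w = np.ones(len(c)) if weights is None else np.asarray(weights)
    return w @ c

def comb_mnz(confidences):
    """CombMNZ: CombSUM multiplied, per class, by the number of classifiers
    that give that class a non-zero confidence."""
    c = np.asarray(confidences)
    return c.sum(axis=0) * (c > 0).sum(axis=0)

# decision for one video: genre = np.argmax(comb_mnz(confidences))
```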
Experimental Setup MediaEval 2012 Dataset - Tagging Task • 14,838 episodes from 2,249 shows, ~3,260 hours of data • split into development and test sets: 5,288 videos for development / 9,550 for testing • focuses on semi-professional video on the Internet
Experimental Setup MediaEval 2012 Dataset • 26 genre labels: 1000 art, 1001 autos_and_vehicles, 1002 business, 1003 citizen_journalism, 1004 comedy, 1005 conferences_and_other_events, 1006 default_category, 1007 documentary, 1008 educational, 1009 food_and_drink, 1010 gaming, 1011 health, 1012 literature, 1013 movies_and_television, 1014 music_and_entertainment, 1015 personal_or_auto-biographical, 1016 politics, 1017 religion, 1018 school_and_education, 1019 sports, 1020 technology, 1021 the_environment, 1022 the_mainstream_media, 1023 travel, 1024 videoblogging, 1025 web_development_and_sites
Experimental Setup • Mean Average Precision (MAP) summarizes rankings from multiple queries by averaging the per-query average precision • classifier parameters and late-fusion weights were optimized on the development (training) set
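For reference, a small sketch of how MAP could be computed from binary relevance rankings; the function names and input layout are illustrative:

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a 0/1 array, ordered by the
    classifier's confidence (highest score first)."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return (precision_at_k * rel).sum() / rel.sum()

def mean_average_precision(rankings):
    """MAP: mean of the per-query (here, per-genre) average precisions."""
    return float(np.mean([average_precision(r) for r in rankings]))
```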
Evaluation (1) Classification performance on individual modalities (MAP values)
Evaluation (1) Classification performance on individual modalities (visual) (MAP values) Visual performance: - best performance with MPEG-7 (ERF) and HoG (SVM-RBF) - Bag-of-Visual-Words does not perform very well
Evaluation (1) Classification performance on individual modalities (audio) (MAP values) Audio performance: - best performance with Extremely Random Forests (42.33%) - audio features provide higher discriminative power than visual features
Evaluation (1) Classification performance on individual modalities (text) (MAP values) Text performance: - best performance with metadata and Random Forests (58.66%) - ASR provides lower performance than audio - metadata features outperform all the other features
Evaluation (2) Performance of multimodal integration (MAP values) Fusion technique performance: - late fusion provides higher performance than early fusion - CombMNZ tends to provide the most accurate results
Evaluation (3) Comparison to MediaEval 2012 Tagging Task results (MAP values)
Conclusions > we provided an in-depth evaluation of truly multimodal video description in the context of a real-world genre-categorization scenario; > we demonstrated the potential of appropriate late fusion for genre categorization, achieving very high categorization performance; > we showed that late fusion can boost the performance of automated content descriptors to come close to that of metadata-based ones; > we set up a new baseline for the Genre Tagging Task by outperforming the other participants. Acknowledgement: we would like to thank Prof. Nicu Sebe and Dr. J. Uijlings from the University of Trento for their support. We also acknowledge the 2012 Genre Tagging Task of the MediaEval Multimedia Benchmark for the dataset (http://www.multimediaeval.org/).
Thank you! Questions?