Unsupervised and weakly-supervised discovery of events in video (and audio) • Fernando De la Torre
Outline • Introduction • CMU-Multimodal Activity database • Unsupervised discovery of video events • Aligned Cluster Analysis (ACA) • Weakly-supervised discovery of video events • Detection-Segmentation SVMs • Conclusions
Multimodal data collection • 40 subjects, 5 recipes • www.kitchen.cs.cmu.edu
Time series analysis • Anomaly detection formulated as detecting outliers in multimodal time series. • Supervised • Unsupervised • Semi-supervised or weakly supervised
Motivation • Mining facial expressions for one subject (example events: looking forward, sleeping, waking up, smiling, looking up) • Summarization • Visualization • Embedding • Indexing
Related work in time series • Change point detection (e.g. Page ‘54, Stephens ‘94, Lai ‘95, Ge and Smyth ‘00, Steyvers & Brown ‘05, Murphy et al. ‘07, Harchaoui et al. ‘08) • Segmental HMMs (e.g. Ge and Smyth ‘00, Kohlmorgen et al. ‘01, Ding & Fan ‘07) • Mixtures of HMMs (e.g. Fine et al. ‘98, Murphy & Paskin ‘01, Oliver et al. ‘02, Alon et al. ‘03) • Switching LDS (e.g. Pavlovic et al. ‘00, Oh et al. ‘08, Turaga et al. ‘09) • Hierarchical Dirichlet Process (e.g. Beal et al. ‘02, Fox et al. ‘08) • Aligned Cluster Analysis (ACA)
Kernel k-means and spectral clustering (Ding et al. ‘02, Dhillon et al. ‘04, Zass and Shashua ‘05, De la Torre ‘06) [Figure: ten numbered samples in the x-y plane]
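As a rough illustration of kernel k-means, the sketch below clusters samples using only their Gram matrix. It is a minimal sketch, not the formulation from the cited papers: the function name `kernel_kmeans` and the `init` and `seed` parameters are assumptions made for this example. Distances to cluster means are evaluated in feature space through the kernel alone.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, init=None, n_iter=50, seed=0):
    """Cluster n samples given only their n x n kernel (Gram) matrix K.

    `init` is an optional array of starting labels (hypothetical parameter);
    when omitted, labels are drawn at random.
    """
    n = K.shape[0]
    if init is None:
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    else:
        labels = np.asarray(init).copy()
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster cannot attract points this round
            # ||phi(x_i) - m_c||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels

# Usage: two groups on a line, linear kernel, deliberately imperfect start.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
labels = kernel_kmeans(np.outer(x, x), 2, init=np.array([0, 0, 1, 1, 1, 0]))
```

With a linear kernel this reduces to ordinary k-means; swapping in another positive semi-definite kernel changes only how `K` is built.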
Problem formulation for ACA • Unknowns: labels (G) and start and end of the segments (h) • Segment similarity: Dynamic Time Alignment Kernel (Shimodaira et al. ‘01) [Figure: segments X[Si, Si+1) compared with mc]
Matrix formulation for ACA • Example: 23 frames, 3 clusters [Figure: clusters × segments and segments × samples matrices] • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01)
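The Dynamic Time Alignment Kernel compares two segments of different lengths by searching for the best monotonic alignment between their frames. The sketch below is a minimal version under stated assumptions: a Gaussian frame-level kernel with bandwidth `sigma`, step weights of 1 for horizontal and vertical moves and 2 for diagonal moves, and normalization by the sum of the two lengths. The function name `dtak` is chosen for this example, and the details may differ from those used in the talk.

```python
import numpy as np

def dtak(X, Y, sigma=1.0):
    """Dynamic time alignment kernel between sequences X (n x d) and Y (m x d)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)
    # Frame-level Gaussian kernel: kappa[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    kappa = np.exp(-sq / (2.0 * sigma ** 2))
    # P[i, j]: best accumulated similarity of any alignment ending at (i, j)
    P = np.full((n, m), -np.inf)
    P[0, 0] = 2.0 * kappa[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = -np.inf
            if i > 0:
                best = max(best, P[i - 1, j] + kappa[i, j])
            if j > 0:
                best = max(best, P[i, j - 1] + kappa[i, j])
            if i > 0 and j > 0:
                best = max(best, P[i - 1, j - 1] + 2.0 * kappa[i, j])
            P[i, j] = best
    # Every alignment path has total step weight n + m, so the value lies in (0, 1]
    return P[n - 1, m - 1] / (n + m)

# Usage: a sequence and a time-stretched copy of it align perfectly.
X = [[0.0], [1.0], [2.0], [3.0]]
X_slow = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]
similarity = dtak(X, X_slow)
```

Because the score is normalized by the path weight, a segment and a slowed-down copy of it receive the same similarity as two identical segments, which is the property that lets segments of different durations share a cluster.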
Facial image features • Active Appearance Models (Baker and Matthews ‘04): shape and appearance • Image features: upper face, lower face
Facial event discovery across subjects • Cohn-Kanade: 30 people and five different expressions (surprise, joy, sadness, fear, anger) • 10 sets of 30 people
Honey bee dance (Oh et al. ‘08) • Three behaviors: (1) waggling, (2) turning left, (3) turning right
Similarity of these problems? • Global statistics are not distinctive enough! • Better understanding of the discriminative regions or events
Image → bag of ‘regions’ • Positive image: at least one positive region • Negative image: all regions negative
Learning formulation • Standard SVM (Andrews et al. ‘03, Felzenszwalb et al. ‘08) [Figure: example scores -1, -3, -2, 3, -1, 0.5]
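The bag-of-regions idea can be made concrete with a small sketch. Assuming the common multiple-instance convention in which an image is scored by its best region (the exact formulation on the slide is not recoverable from the transcript), the image-level hinge loss looks as follows. The function names and the region scores in the usage lines are hypothetical.

```python
import numpy as np

def bag_score(region_scores):
    """Score of an image (bag): the score of its highest-scoring region."""
    return float(np.max(region_scores))

def bag_hinge_loss(region_scores, label):
    """Hinge loss on the bag score; label is +1 (positive image) or -1 (negative)."""
    return max(0.0, 1.0 - label * bag_score(region_scores))

# Usage with made-up region scores for one image.
scores = [-1.0, -2.5, 3.0, 0.5]
loss_if_positive = bag_hinge_loss(scores, +1)  # one region scores above the margin
loss_if_negative = bag_hinge_loss(scores, -1)  # penalized: a region looks positive
```

Taking the maximum encodes both conditions at once: a positive image needs only one region above the margin, while a negative image is penalized unless every region stays below it.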
Optimization • 1) Search over all possible subwindows: 100 ms/image (480×640 pixels) (Lampert et al. CVPR08) • 2) [step shown only as a figure] • 3) SVM with QP
Discriminative patterns in time series • We name it: k-segmentation • At most k disjoint intervals • Efficient search: 10 ms/sequence (15,000 frames) • Global optimum guaranteed!
Representation of signals • Training data → compute frame-level feature vectors → clustering → visual dictionary → IDs of visual words
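The last two stages of this pipeline can be sketched in a few lines. The example assumes a visual dictionary has already been learned (for instance by k-means on the training features) and only shows how frames are mapped to word IDs and summarized as a histogram. The function names and toy values are illustrative.

```python
import numpy as np

def assign_words(features, dictionary):
    """Map each frame-level feature vector to the ID of its nearest dictionary entry."""
    features = np.asarray(features, dtype=float)
    dictionary = np.asarray(dictionary, dtype=float)
    # Squared Euclidean distance from every frame to every visual word
    d2 = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def word_histogram(word_ids, vocab_size):
    """Count how often each visual word occurs in a sequence of word IDs."""
    return np.bincount(word_ids, minlength=vocab_size)

# Usage: a toy dictionary of three 2-D visual words and four frames.
dictionary = [[0.0, 0.0], [10.0, 10.0], [-10.0, 5.0]]
frames = [[0.1, 0.0], [9.0, 10.0], [0.0, 0.5], [10.0, 11.0]]
ids = assign_words(frames, dictionary)
hist = word_histogram(ids, vocab_size=3)
```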
K-segmentation • Original signal → IDs of visual words → histogram of visual words • We need: [expression not recovered]
What is [symbol not recovered]? • Inputs: original signal (x), IDs of visual words, SVM parameters • Consider m-segmentation: from the m-segmentation to the (m+1)-segmentation • Situation 1, Situation 2 [shown only as figures]
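To make the search problem concrete: if the classifier is linear and the histogram is unnormalized, the score of a set of intervals is the sum of per-frame weights, where each frame contributes the SVM weight of its visual word. Finding at most k disjoint intervals with maximal total score can then be solved exactly. The sketch below uses a standard dynamic program for that problem. It is not the m-segmentation to (m+1)-segmentation procedure from the slides, and the function name `k_segmentation` is an assumption for this example.

```python
def k_segmentation(frame_scores, k):
    """Best total score and intervals [start, end) for at most k disjoint intervals."""
    n = len(frame_scores)
    neg = float("-inf")
    # closed[j][t]: best total over the first t frames using at most j intervals.
    # open_[j][t]: same, but frame t-1 must lie inside the last interval used.
    closed = [[0.0] * (n + 1) for _ in range(k + 1)]
    open_ = [[neg] * (n + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        for t in range(1, n + 1):
            a = frame_scores[t - 1]
            # Either extend the current interval or start a new one at frame t-1
            open_[j][t] = max(open_[j][t - 1], closed[j - 1][t - 1]) + a
            closed[j][t] = max(closed[j][t - 1], open_[j][t])
    # Backtrack through the tables to recover the chosen intervals
    intervals, j, t, inside, end = [], k, n, False, n
    while t > 0 and j > 0:
        if not inside:
            if closed[j][t] == closed[j][t - 1]:
                t -= 1                      # frame t-1 is not selected
            else:
                inside, end = True, t       # an interval ends at frame t-1
        else:
            a = frame_scores[t - 1]
            if open_[j][t] == open_[j][t - 1] + a:
                t -= 1                      # interval extends further left
            else:
                intervals.append((t - 1, end))
                j, t, inside = j - 1, t - 1, False
    intervals.reverse()
    return closed[k][n], intervals

# Usage: per-frame scores, e.g. SVM weights looked up by visual word ID.
scores = [1, -2, 3, 4, -10, 5, -1, 2]
total, intervals = k_segmentation(scores, k=2)
```

The dynamic program runs in O(nk) time and returns a global optimum. An all-negative signal yields an empty selection with total 0, which matches the "at most k" wording.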
Experiment 1 – glasses vs. no glasses • 624 images, 20 people under different expressions and poses • Training: 8 people (126 sunglasses, 128 no glasses) • Testing: 12 people (185 sunglasses, 185 no glasses)
Experiment 2 – car vs. no car • 400 images; half contain cars and the other half do not. • Each image: 10,000 SIFT descriptors and a vocabulary of 1,000 visual words.
Classification performance • Compared: discriminative regions, whole image, human labels • Our method outperforms SVM with human labels!
Experiment 3 – synthetic data • [Figure: positive class and negative class examples; accuracy results] • k: maximum number of disjoint intervals.
Experiment 4 – mouse activity • Mouse activities: drinking, eating, exploring, grooming, sleeping
Conclusions • CMU Multimodal Activity database • Unsupervised discovery of events in time series: Aligned Cluster Analysis for summarization, indexing and visualization of time series; code online (www.humansensing.cs.cmu.edu); open problem: automatic selection of the number of clusters • Weakly-supervised discovery of events in time series: DS-SVM; novel and efficient algorithm for time series; outperforms methods trained with human-labeled data • Kernel methods are a fundamental framework for multimodal data fusion.