Indexing of video sequences: a generic approach for handling multiple specific features
Nicolas Moënne-Loccoz, Eric Bruno and Stéphane Marchand-Maillet
Viper group, Computer Vision and Multimedia Lab, University of Geneva, CH
23/4/2006 – Dagstuhl seminar – Dagstuhl, DE
Outline
• Content-based video indexing
• Event-based (specific) video indexing
  • Bags of trajectories
  • Interactive classification
  • Results
• Interactive (generic) indexing process
  • Multimodal fusion
  • Dissimilarity representation
  • Some results
Content-based video indexing
Multimodal content abstraction:
• From the raw signal, infer high-level properties
  • Raw signal = (audio, video, text, metadata, …)^n
  • High-level properties
    • Semantic labels at various levels (event, object, story, …)
    • Data characteristics (up, down, slow, exciting, …)
• Essentially two strategies
  • Supervised classification (e.g. activity recognition)
  • Interactive retrieval (e.g. query-by-example)
Event characterization
• Event: long-term spatio-temporal object [Irani 99]
• Issues:
  • Object localization
  • Object supporting region
• Assumptions made (e.g. motion history):
  • Single event
  • Static camera
Ideal versus practical
• Ideal case: one event (walking) and no camera motion → well-defined signature
• Real case: crowd, possibly with camera motion → problems, essentially due to temporal projection
Unconstrained event modeling
• Create a robust event representation
  • In terms of scale
  • In terms of lighting conditions
  • In terms of contents
  • …
• Create a sparse event representation
  • Concentrate on salient content
• Use the bag-of-trajectories strategy to index events
Robust, sparse event representation
• Based on local features in the spatial domain
• Based on the notion of saliency
  • Entropy, cornerness, frequency
• For "every" video frame F_t: local features W_t = {w_t}
  • w_t = (position, orientation, scale) = (v_t, θ_t, s_t)
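As a minimal sketch of this representation: a container for one local feature w_t = (v_t, θ_t, s_t), plus patch entropy as one possible saliency measure (the names `LocalFeature` and `patch_entropy`, and the use of entropy over a grey-level histogram, are illustrative choices, not the authors' exact detector).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LocalFeature:
    position: tuple      # v_t = (x, y)
    orientation: float   # theta_t, in radians
    scale: float         # s_t

def patch_entropy(patch, bins=16):
    """Shannon entropy of a grey-level patch (values in [0, 1]);
    high entropy is one possible notion of saliency."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

A flat patch scores zero; a textured one scores high, so thresholding this value keeps only salient features per frame.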
Temporal handling
• Match features frame by frame
  • Best bipartite match between feature sets W_t and W_{t+δt}
  • Greedy match (approximation)
  • Hungarian algorithm
• "Motion field" for frame F_t
• Trajectory from t to t+k: z_[t,t+k] = {w_t, w_{t+1}, …, w_{t+k}}
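A sketch of the greedy approximation mentioned above: repeatedly link the globally closest unmatched feature pair between two frames, rejecting links beyond a distance gate (the `max_dist` gate and the plain positional cost are assumptions; the optimal bipartite match would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`).

```python
import numpy as np

def greedy_match(pos_t, pos_next, max_dist=30.0):
    """Greedy bipartite matching of feature positions between
    consecutive frames: cheapest remaining pair first."""
    # All pairwise Euclidean distances between the two feature sets.
    cost = np.linalg.norm(pos_t[:, None, :] - pos_next[None, :, :], axis=2)
    taken_i, taken_j, matches = set(), set(), []
    for flat in np.argsort(cost, axis=None):
        i, j = np.unravel_index(flat, cost.shape)
        if i in taken_i or j in taken_j or cost[i, j] > max_dist:
            continue
        taken_i.add(i)
        taken_j.add(j)
        matches.append((int(i), int(j)))
    return matches
```

Chaining these per-frame matches yields the trajectories z_[t,t+k] = {w_t, …, w_{t+k}}.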
Temporal persistency
• Sufficient for a rough global motion model estimation
  • Affine model estimation
  • Global motion (camera motion) compensation
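The affine estimation step can be sketched as an ordinary least-squares fit over the matched feature positions, followed by subtracting the predicted camera motion (a plain least-squares fit is assumed here; a robust variant such as RANSAC would normally be used in the presence of foreground outliers).

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine model dst ≈ src @ A.T + b,
    estimated from matched feature positions of two frames."""
    X = np.hstack([src, np.ones((len(src), 1))])   # N x 3 design matrix
    P, *_ = np.linalg.lstsq(X, dst, rcond=None)    # 3 x 2 parameter block
    return P[:2].T, P[2]                           # A (2x2), b (2,)

def compensate(src, dst, A, b):
    """Residual (foreground) motion after removing the
    estimated global camera motion."""
    return dst - (src @ A.T + b)
```

Features whose residual after compensation is near zero follow the camera; the rest carry the event motion.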
Event representation
• Issues:
  • Variable number of features
  • Variable size of trajectories
• Bag-of-features representation
  • Trajectory quantization
    • Polar coordinates of the motion vector
    • Normalised scale parameter
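One way to realise the quantization above: map each trajectory step to a discrete symbol built from the motion vector in polar coordinates (angle and magnitude bins) plus a normalised scale bin. The bin counts and ranges below are illustrative, not the authors' settings.

```python
import numpy as np

def quantize_step(v, scale, n_angle=8, n_mag=4, mag_max=20.0,
                  n_scale=4, scale_max=4.0):
    """Quantize one trajectory step (motion vector v, feature scale)
    into a discrete (angle_bin, magnitude_bin, scale_bin) symbol."""
    ang = np.arctan2(v[1], v[0]) % (2 * np.pi)     # polar angle in [0, 2pi)
    mag = float(np.hypot(v[0], v[1]))              # polar magnitude
    a = int(ang / (2 * np.pi) * n_angle) % n_angle
    m = min(int(mag / mag_max * n_mag), n_mag - 1) # clamp to top bin
    s = min(int(scale / scale_max * n_scale), n_scale - 1)
    return a, m, s
```

The resulting symbol stream is what the bag-of-trajectories histograms are counted over, making trajectories of different lengths comparable.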
Multiscale histograms of trajectories
• Multiscale histograms [Chen 05]
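A plausible reading of the multiscale histogram idea, sketched under the assumption of a dyadic temporal pyramid (whole sequence, halves, quarters, …) over the quantized trajectory symbols; the exact construction in [Chen 05] may differ.

```python
import numpy as np

def multiscale_histograms(symbols, n_symbols, n_scales=3):
    """Concatenated, normalised histograms of trajectory symbols
    over a dyadic temporal pyramid, giving a fixed-length
    descriptor for variable-length trajectories."""
    hists = []
    for level in range(n_scales):
        for part in np.array_split(np.asarray(symbols), 2 ** level):
            h = np.bincount(part, minlength=n_symbols).astype(float)
            hists.append(h / max(h.sum(), 1.0))  # guard empty windows
    return np.concatenate(hists)
```

The coarse levels capture the overall motion distribution, the fine levels its temporal ordering.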
Learning events
• State-of-the-art classifier: SVM
• Histogram-compliant kernel function (kernel formula given on the slide)
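The slide does not survive with its kernel formula, so as an illustration here is histogram intersection, one standard histogram-compliant (positive-definite) kernel; χ² is another common choice, and the authors' actual kernel may differ.

```python
import numpy as np

def hist_intersection_kernel(H1, H2):
    """Histogram intersection kernel matrix,
    K[i, j] = sum_k min(H1[i, k], H2[j, k]);
    rows of H1 and H2 are (normalised) histograms."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)
```

Such a precomputed Gram matrix can be passed directly to an SVM implementation that accepts custom kernels.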
Results
• LFT: local feature trajectories
• Corpora
  • Laptev corpus for specific event detection
  • CAVIAR corpus for specific event detection
  • TRECVid corpus for generic event classification
• Baselines
  • HMH: motion histograms – Hu moments [Bobick 2001], SVM with RBF kernel
  • MGH: histograms of multiscale spatio-temporal gradients [Irani 99], SVM with a histogram-compliant kernel
Corpus I
• I. Laptev's human activity datasets
• > 2000 videos (25 persons, 5 scenarios, 6 activities)
Examples
• Running
• Jogging
• Handclapping
Results I
Corpus II
• CAVIAR Shop Monitor (Context Aware Vision using Image-based Active Recognition)
• Multiple occurrences (5 events)
Results II
Corpus III
• TRECVid news broadcast corpus
• > 2000 shots (6 events)
Results III
From abstraction to indexing
• Robust temporal content representation based on salient content
• This strategy leads to event classification
  • May be a soft indicator for these events
• How to combine this information with other features?
  • Multimodal fusion…
  • … at query time: interactive multimodal fusion
External video indexing
• Hundreds of thousands of high-dimensional descriptors associated with various modalities
• Video segments need to be compared to each other according to their features
• The index consists of dissimilarity matrices computed off-line
  • Fast retrieval
  • Homogeneity of the index, whatever the features used
  • But initial feature values are lost!
  • How to efficiently store updatable matrices?
Interactive multimodal retrieval
• Query-by-example (QBE) paradigm associated with relevance feedback
  • Set of positive & negative examples
• Multiple feature spaces available through dissimilarity matrices
  • Set of M feature-space distances
• Constraint: real-time interaction
  • Dissimilarity-based learning
Dissimilarity space
• Pair-wise dissimilarities replace features
• If R = {p_1, …, p_N} is the representation set, the dissimilarity space maps each object z to the vector (d(z, p_1), …, d(z, p_N))
• If R = S+:
  • Low-dimensional space (size N)
  • 1+x to 1+1 classification
(Figure: the same data plotted in the feature space, axes x_1 and x_2, and in the dissimilarity space, axes d_1^+ and d_2^+)
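The mapping above amounts to selecting columns of the off-line dissimilarity matrix; a minimal sketch (Euclidean distance is assumed here, matching the evaluation setup later in the talk):

```python
import numpy as np

def pairwise_distances(X):
    """Off-line dissimilarity matrix D[i, j] = ||x_i - x_j||."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def to_dissimilarity_space(D, rep_idx):
    """Represent every object by its dissimilarities to the
    representation set R, i.e. the columns rep_idx of D."""
    return D[:, rep_idx]
```

When R = S+, `rep_idx` is simply the indices of the positive examples, so the embedding dimension equals the number of positives and stays small regardless of the original feature dimension.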
Kernel Discriminant Analysis
• Relevance feedback: the user gives
  • S+ positive examples p_i^+ → d_i^+
  • S- negative examples p_i^- → d_i^-
• Estimate a ranking function that places positives at the top and pushes negatives to the end
• The solution is an expansion of kernel functions centred on the training vectors
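As an illustration of such a ranking function, here is regularised kernel least squares on ±1 feedback labels. This is a stand-in for, not the exact KDA of the slides, but it shares the stated form of the solution: a discriminant f(z) = Σ_i α_i k(z, x_i) expanded over the training vectors. The RBF kernel and the γ, λ values are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_ranking(X, y, gamma=0.5, lam=1e-3):
    """Regularised kernel least squares on +1/-1 labels;
    returns the expansion coefficients alpha."""
    K = rbf(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def rank_scores(X_train, alpha, Z, gamma=0.5):
    """f(z) = sum_i alpha_i k(z, x_i); sort descending to rank."""
    return rbf(Z, X_train, gamma) @ alpha
```

Refitting after each feedback round only requires solving a small (|S+| + |S-|)-sized linear system, which is what makes real-time interaction feasible.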
Multimodal analysis
• Dissimilarities are known for M features
• Multiple dissimilarity spaces
  • Concatenation
  • Multimodal RBF kernel
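A sketch of the multimodal RBF fusion, under the assumption that the combined kernel is K = exp(−Σ_m γ_m D_m²), i.e. the product of per-modality RBF kernels with one bandwidth γ_m weighting each modality:

```python
import numpy as np

def multimodal_rbf(dist_mats, gammas):
    """Fuse M per-modality dissimilarity matrices into one kernel,
    K = exp(-sum_m gamma_m * D_m**2)."""
    acc = np.zeros_like(dist_mats[0], dtype=float)
    for D, g in zip(dist_mats, gammas):
        acc += g * D ** 2  # weighted squared dissimilarity per modality
    return np.exp(-acc)
```

Because it operates on the precomputed dissimilarity matrices, adding or reweighting a modality at query time never touches the original feature values.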
Example
(Figure: a retrieval example plotted in the dissimilarity space spanned by d(z, p_1^+) and d(z, p_2^+))
Evaluation
• TRECVid 2003 corpus: around 120 hrs of annotated videos
• 37,000 shots indexed by low-level features
  • Global color histogram
  • Global motion histogram (MPEG motion vectors)
  • ASR histogram (word occurrences and co-occurrences)
• Euclidean distance between histograms
Adding modalities
(Figure: average precision over 100 instances of the query "Basketball")

Varying the size of the training set
(Figure: average precision over 100 instances of the query "Basketball")

Cont'd
(Figure: average precision over 100 instances of the query "Basketball")
Conclusion
• Some specific features may be designed as an instantiation of domain/expert knowledge
  • Robustness should be preserved
  • They may not solve all interesting problems
• They should be part of a more generic framework
  • Dissimilarity representation for homogeneous features
  • Interactive learning for online relevance feedback
Vicode 2.0
Summary
• Dissimilarity spaces allow tractable computation on large collections represented in high-dimensional spaces
  • Multimodal dissimilarity (MD) space
  • Real-time user interaction
• KFD seems able to learn in the MD space, but kernel selection and tuning remain problematic
  • Combination of kernels (e.g. linear with RBF)?
  • Interactive learning of kernel parameters?
Interfaces
• Relevance feedback acquisition for temporal audiovisual data?
• Visualization for multimodal temporal data?
• Visualization of collections at several levels (frame, shot, story, …)?
Visual QBE of video shots
• Interface for efficient relevance-feedback interaction
Visual exploration of retrieval results
• Multimodal visualization (audio, ASR, motion)…
Visual exploration of video documents
• Document visualization at multiple granularities (frame, atom, shot, story, …)