Indexing of video sequences: a generic approach for handling multiple specific features
Nicolas Moënne-Loccoz, Eric Bruno and Stéphane Marchand-Maillet
Viper group, Computer Vision and Multimedia Lab, University of Geneva, CH
23/4/2006 – Dagstuhl seminar – Dagstuhl, DE
Outline
• Content-based video indexing
• Event-based (specific) video indexing
  • Bags of trajectories
  • Interactive classification
  • Results
• Interactive (generic) indexing process
  • Multimodal fusion
  • Dissimilarity representation
  • Some results
Content-based video indexing
Multimodal content abstraction:
• From the raw signal, infer high-level properties
  • Raw signal = (audio, video, text, metadata, …)^n
  • High-level properties
    • Semantic labels at various levels (event, object, story, …)
    • Data characteristics (up, down, slow, exciting, …)
• Essentially two strategies
  • Supervised classification (e.g. activity recognition)
  • Interactive retrieval (e.g. query-by-example)
Event characterization
• Event: long-term spatio-temporal object [Irani 99]
• Issues:
  • Object localization
  • Object supporting region
• Assumptions made (e.g. motion history):
  • Single event
  • Static camera
Ideal versus practical
• Ideal case: one event (walking) and no camera motion → well-defined signature
• Real case: crowd, possibly with camera motion → problems, essentially due to temporal projection
Unconstrained event modeling
• Create a robust event representation
  • In terms of scale
  • In terms of lighting conditions
  • In terms of contents
  • …
• Create a sparse event representation
  • Concentrate on salient content
• Use the bag-of-trajectories strategy to index events
Robust, sparse event representation
• Based on local features in the spatial domain
• Based on the notion of saliency
  • Entropy, cornerness, frequency
• For "every" video frame F_t: local features W_t = {w_t}
  • w_t = (position, orientation, scale) = (v_t, θ_t, s_t)
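As a minimal sketch of this representation: a container for one local feature w_t = (v_t, θ_t, s_t), plus patch entropy as one possible saliency measure (the names `LocalFeature` and `patch_entropy`, and the use of entropy over a grey-level histogram, are illustrative choices, not the authors' exact detector).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LocalFeature:
    position: tuple      # v_t = (x, y)
    orientation: float   # theta_t, in radians
    scale: float         # s_t

def patch_entropy(patch, bins=16):
    """Shannon entropy of a grey-level patch (values in [0, 1]);
    high entropy is one possible notion of saliency."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

A flat patch scores zero; a textured one scores high, so thresholding this value keeps only salient features per frame.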
Temporal handling
• Match features frame by frame
  • Best bipartite match between feature sets W_t and W_{t+δt}
  • Greedy match (approximation)
  • Hungarian algorithm
• "Motion field" for frame F_t
• Trajectory from t to t+k: z_[t,t+k] = {w_t, w_{t+1}, …, w_{t+k}}
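A sketch of the greedy approximation mentioned above: repeatedly link the globally closest unmatched feature pair between two frames, rejecting links beyond a distance gate (the `max_dist` gate and the plain positional cost are assumptions; the optimal bipartite match would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`).

```python
import numpy as np

def greedy_match(pos_t, pos_next, max_dist=30.0):
    """Greedy bipartite matching of feature positions between
    consecutive frames: cheapest remaining pair first."""
    # All pairwise Euclidean distances between the two feature sets.
    cost = np.linalg.norm(pos_t[:, None, :] - pos_next[None, :, :], axis=2)
    taken_i, taken_j, matches = set(), set(), []
    for flat in np.argsort(cost, axis=None):
        i, j = np.unravel_index(flat, cost.shape)
        if i in taken_i or j in taken_j or cost[i, j] > max_dist:
            continue
        taken_i.add(i)
        taken_j.add(j)
        matches.append((int(i), int(j)))
    return matches
```

Chaining these per-frame matches yields the trajectories z_[t,t+k] = {w_t, …, w_{t+k}}.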
Temporal persistency
• Sufficient for a rough global motion model estimation
  • Affine model estimation
  • Global motion (camera motion) compensation
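The affine estimation step can be sketched as an ordinary least-squares fit over the matched feature positions, followed by subtracting the predicted camera motion (a plain least-squares fit is assumed here; a robust variant such as RANSAC would normally be used in the presence of foreground outliers).

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine model dst ≈ src @ A.T + b,
    estimated from matched feature positions of two frames."""
    X = np.hstack([src, np.ones((len(src), 1))])   # N x 3 design matrix
    P, *_ = np.linalg.lstsq(X, dst, rcond=None)    # 3 x 2 parameter block
    return P[:2].T, P[2]                           # A (2x2), b (2,)

def compensate(src, dst, A, b):
    """Residual (foreground) motion after removing the
    estimated global camera motion."""
    return dst - (src @ A.T + b)
```

Features whose residual after compensation is near zero follow the camera; the rest carry the event motion.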
Event representation
• Issues:
  • Variable number of features
  • Variable size of trajectories
• Bag-of-features representation
  • Trajectory quantization
    • Polar coordinates of the motion vector
    • Normalised scale parameter
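One way to realise the quantization above: map each trajectory step to a discrete symbol built from the motion vector in polar coordinates (angle and magnitude bins) plus a normalised scale bin. The bin counts and ranges below are illustrative, not the authors' settings.

```python
import numpy as np

def quantize_step(v, scale, n_angle=8, n_mag=4, mag_max=20.0,
                  n_scale=4, scale_max=4.0):
    """Quantize one trajectory step (motion vector v, feature scale)
    into a discrete (angle_bin, magnitude_bin, scale_bin) symbol."""
    ang = np.arctan2(v[1], v[0]) % (2 * np.pi)     # polar angle in [0, 2pi)
    mag = float(np.hypot(v[0], v[1]))              # polar magnitude
    a = int(ang / (2 * np.pi) * n_angle) % n_angle
    m = min(int(mag / mag_max * n_mag), n_mag - 1) # clamp to top bin
    s = min(int(scale / scale_max * n_scale), n_scale - 1)
    return a, m, s
```

The resulting symbol stream is what the bag-of-trajectories histograms are counted over, making trajectories of different lengths comparable.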
Multiscale histograms of trajectories
• Multiscale histograms [Chen 05]
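A plausible reading of the multiscale histogram idea, sketched under the assumption of a dyadic temporal pyramid (whole sequence, halves, quarters, …) over the quantized trajectory symbols; the exact construction in [Chen 05] may differ.

```python
import numpy as np

def multiscale_histograms(symbols, n_symbols, n_scales=3):
    """Concatenated, normalised histograms of trajectory symbols
    over a dyadic temporal pyramid, giving a fixed-length
    descriptor for variable-length trajectories."""
    hists = []
    for level in range(n_scales):
        for part in np.array_split(np.asarray(symbols), 2 ** level):
            h = np.bincount(part, minlength=n_symbols).astype(float)
            hists.append(h / max(h.sum(), 1.0))  # guard empty windows
    return np.concatenate(hists)
```

The coarse levels capture the overall motion distribution, the fine levels its temporal ordering.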
Learning events
• State-of-the-art classifier: SVM
• Histogram-compliant kernel function (kernel formula given on the slide)
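The slide does not survive with its kernel formula, so as an illustration here is histogram intersection, one standard histogram-compliant (positive-definite) kernel; χ² is another common choice, and the authors' actual kernel may differ.

```python
import numpy as np

def hist_intersection_kernel(H1, H2):
    """Histogram intersection kernel matrix,
    K[i, j] = sum_k min(H1[i, k], H2[j, k]);
    rows of H1 and H2 are (normalised) histograms."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)
```

Such a precomputed Gram matrix can be passed directly to an SVM implementation that accepts custom kernels.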
Results
• LFT: local feature trajectories
• Corpora
  • Laptev corpus for specific event detection
  • CAVIAR corpus for specific event detection
  • TRECVid corpus for generic event classification
• Baselines
  • HMH: motion histograms – Hu moments [Bobick 2001], SVM with RBF kernel
  • MGH: histograms of multiscale spatio-temporal gradients [Irani 99], SVM with a histogram-compliant kernel
Corpus I
• I. Laptev's human activity datasets
• > 2000 videos (25 persons, 5 scenarios, 6 activities)
Examples
• Running
• Jogging
• Handclapping
Results I
Corpus II
• CAVIAR Shop Monitor (Context Aware Vision using Image-based Active Recognition)
• Multiple occurrences (5 events)
Results II
Corpus III
• TRECVid news broadcast corpus
• > 2000 shots (6 events)
Results III
From abstraction to indexing
• Robust temporal content representation based on salient content
• This strategy leads to event classification
  • May be a soft indicator for these events
• How to combine this information with other features?
  • Multimodal fusion…
  • … at query time: interactive multimodal fusion
External video indexing
• Hundreds of thousands of high-dimensional descriptors associated with various modalities
• Video segments need to be compared to each other according to their features
• The index consists of dissimilarity matrices computed off-line
  • Fast retrieval
  • Homogeneity of the index, whatever the features used
  • But initial feature values are lost!
  • How to efficiently store updatable matrices?
Interactive multimodal retrieval
• Query-by-example (QBE) paradigm associated with relevance feedback
  • Set of positive & negative examples
• Multiple feature spaces available through dissimilarity matrices
  • Set of M feature-space distances
• Constraint: real-time interaction
  • Dissimilarity-based learning
Dissimilarity space
• Pair-wise dissimilarities replace features
• If R = {p_1, …, p_N} is the representation set, the dissimilarity space maps each object z to the vector (d(z, p_1), …, d(z, p_N))
• If R = S+:
  • Low-dimensional space (size N)
  • 1+x to 1+1 classification
(Figure: the same data plotted in the feature space, axes x_1 and x_2, and in the dissimilarity space, axes d_1^+ and d_2^+)
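The mapping above amounts to selecting columns of the off-line dissimilarity matrix; a minimal sketch (Euclidean distance is assumed here, matching the evaluation setup later in the talk):

```python
import numpy as np

def pairwise_distances(X):
    """Off-line dissimilarity matrix D[i, j] = ||x_i - x_j||."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def to_dissimilarity_space(D, rep_idx):
    """Represent every object by its dissimilarities to the
    representation set R, i.e. the columns rep_idx of D."""
    return D[:, rep_idx]
```

When R = S+, `rep_idx` is simply the indices of the positive examples, so the embedding dimension equals the number of positives and stays small regardless of the original feature dimension.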
Kernel Discriminant Analysis
• Relevance feedback: the user gives
  • S+ positive examples p_i^+ → d_i^+
  • S- negative examples p_i^- → d_i^-
• Estimate a ranking function that places positives at the top and pushes negatives to the end
• The solution is an expansion of kernel functions centred on the training vectors
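As an illustration of such a ranking function, here is regularised kernel least squares on ±1 feedback labels. This is a stand-in for, not the exact KDA of the slides, but it shares the stated form of the solution: a discriminant f(z) = Σ_i α_i k(z, x_i) expanded over the training vectors. The RBF kernel and the γ, λ values are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_ranking(X, y, gamma=0.5, lam=1e-3):
    """Regularised kernel least squares on +1/-1 labels;
    returns the expansion coefficients alpha."""
    K = rbf(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def rank_scores(X_train, alpha, Z, gamma=0.5):
    """f(z) = sum_i alpha_i k(z, x_i); sort descending to rank."""
    return rbf(Z, X_train, gamma) @ alpha
```

Refitting after each feedback round only requires solving a small (|S+| + |S-|)-sized linear system, which is what makes real-time interaction feasible.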
Multimodal analysis
• Dissimilarities are known for M features
• Multiple dissimilarity spaces
  • Concatenation
  • Multimodal RBF kernel
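A sketch of the multimodal RBF fusion, under the assumption that the combined kernel is K = exp(−Σ_m γ_m D_m²), i.e. the product of per-modality RBF kernels with one bandwidth γ_m weighting each modality:

```python
import numpy as np

def multimodal_rbf(dist_mats, gammas):
    """Fuse M per-modality dissimilarity matrices into one kernel,
    K = exp(-sum_m gamma_m * D_m**2)."""
    acc = np.zeros_like(dist_mats[0], dtype=float)
    for D, g in zip(dist_mats, gammas):
        acc += g * D ** 2  # weighted squared dissimilarity per modality
    return np.exp(-acc)
```

Because it operates on the precomputed dissimilarity matrices, adding or reweighting a modality at query time never touches the original feature values.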
Example
(Figure: a retrieval example plotted in the dissimilarity space spanned by d(z, p_1^+) and d(z, p_2^+))
Evaluation
• TRECVid 2003 corpus: around 120 hrs of annotated videos
• 37,000 shots indexed by low-level features
  • Global color histogram
  • Global motion histogram (MPEG motion vectors)
  • ASR histogram (word occurrences and co-occurrences)
• Euclidean distance between histograms
Adding modalities
(Figure: average precision over 100 instances of the query "Basketball")

Varying the size of the training set
(Figure: average precision over 100 instances of the query "Basketball")

Cont'd
(Figure: average precision over 100 instances of the query "Basketball")
Conclusion
• Some specific features may be designed as an instantiation of domain/expert knowledge
  • Robustness should be preserved
  • They may not solve all interesting problems
• They should be part of a more generic framework
  • Dissimilarity representation for homogeneous features
  • Interactive learning for online relevance feedback
Vicode 2.0
Summary
• Dissimilarity spaces allow tractable computation on large collections represented in high-dimensional spaces
  • Multimodal dissimilarity (MD) space
  • Real-time user interaction
• KFD seems able to learn in the MD space, but kernel selection and tuning remain problematic
  • Combination of kernels (e.g. linear with RBF)?
  • Interactive learning of kernel parameters?
Interfaces
• Relevance feedback acquisition for temporal audiovisual data?
• Visualization for multimodal temporal data?
• Visualization of collections at several levels (frame, shot, story, …)?
Visual QBE of video shots
• Interface for efficient relevance-feedback interaction
Visual exploration of retrieval results
• Multimodal visualization (audio, ASR, motion)…
Visual exploration of video documents
• Document visualization at multiple granularities (frame, atom, shot, story, …)