Content-based Video Indexing, Classification & Retrieval

Content-based Video Indexing, Classification & Retrieval Presented by HOI, Chu Hong Nov. 27, 2002

Outline • Motivation • Introduction • Two approaches for semantic analysis • A probabilistic framework (Naphade, Huang ’01) • Object-based abstraction and modeling [Lee, Kim, Hwang ’01] • A multimodal framework for video content interpretation • Conclusion

Motivation • The amount of digital video data has grown rapidly in recent years. • Tools for classifying and retrieving video content are lacking. • A gap exists between low-level features and high-level semantic content. • Enabling machines to understand video is important and challenging.

Introduction • Content-based Video Indexing • the process of attaching content-based labels to video shots • essential for content-based classification and retrieval • Using automatic analysis techniques - shot detection, video segmentation - key frame selection - object segmentation and recognition - visual/audio feature extraction - speech recognition, video text, VOCR
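Shot detection, the first technique listed above, is often illustrated with a frame-to-frame histogram comparison. The sketch below is a minimal illustration of that idea and not the method used in the talk. It assumes each frame is a flat list of 8-bit gray values; the function names and the threshold of 0.5 are hypothetical choices made for the example.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized histogram of 8-bit gray values (0..255)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = float(len(frame))
    return [c / total for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two normalized histograms (between 0 and 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: Sequence[Sequence[int]],
                           threshold: float = 0.5) -> List[int]:
    """Return every index i where a cut is declared between frame i-1 and i."""
    boundaries = []
    previous = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        current = gray_histogram(frames[i])
        if histogram_distance(previous, current) > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Toy sequence: two dark frames, two bright frames, one dark frame.
dark = [10] * 16
bright = [240] * 16
print(detect_shot_boundaries([dark, dark, bright, bright, dark]))  # [2, 4]
```

A fixed threshold only catches hard cuts; gradual transitions such as fades and dissolves generally need a different test.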

Introduction • Content-based Video Classification • Segment & classify videos into meaningful categories • Classify videos based on predefined topics • Useful for browsing and searching by topic • Multimodal method • Visual features • Audio features • Motion features • Textual features • Domain-specific knowledge

Introduction • Content-based Video Retrieval • Simple visual feature query • Retrieve video with key-frame: Color-R(80%),G(10%),B(10%) • Feature combination query • Retrieve video with high upward motion (70%), Blue (30%) • Query by example (QBE) • Retrieve video that is similar to an example • Localized feature query • Retrieve video with a car moving toward the right • Object relationship query • Retrieve video with a girl watching the sunset • Concept query (query by keyword) • Retrieve explosion, White Christmas
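A feature combination query such as "upward motion (70%), blue (30%)" can be read as a weighted sum of per-feature match scores. The sketch below shows one way to rank shots under that reading. The shot index, the feature names, and the assumption that every match score already lies in [0, 1] are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def combined_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature match scores; a missing feature scores 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank_shots(index: Dict[str, Dict[str, float]],
               weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (shot_id, score) pairs sorted from best to worst match."""
    scored = [(shot_id, combined_score(f, weights)) for shot_id, f in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical index: per-shot match scores for two features.
index = {
    "shot_a": {"upward_motion": 0.9, "blue": 0.2},
    "shot_b": {"upward_motion": 0.3, "blue": 0.9},
    "shot_c": {"upward_motion": 0.1, "blue": 0.1},
}
query = {"upward_motion": 0.7, "blue": 0.3}

ranking = rank_shots(index, query)
for shot_id, score in ranking:
    print(shot_id, round(score, 2))
```

Query by example fits the same pattern: the per-feature scores would then come from comparing each shot with the example clip instead of with fixed targets.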

Introduction • Feature Extraction • Color features • Texture features • Shape features • Sketch features • Audio features • Camera motion features • Object motion features

Semantic Indexing & Querying • Limitations of QBE • Similarity is measured using only low-level features • Does not reflect the user’s perception • High-level features are difficult to annotate • Syntactic to Semantic • Bridge the gap between low-level features and semantic content • Semantic indexing, Query By Keyword (QBK) • Semantic description scheme – MPEG-7 • Semantic interaction between concepts • no scheme to learn the model for individual concepts

Semantic Modeling & Indexing • Two approaches • Probabilistic framework, ‘Multiject’ (Naphade, Huang ’01) • Object-based abstraction and indexing [Lee, Kim, Hwang ’01]

A probabilistic approach (‘Multiject’ & ‘Multinet’) (Naphade, Huang ’01) • Multiject: a probabilistic multimedia object • 3 categories of semantic concepts • Objects • Face, car, animal, building • Sites • Sky, mountain, outdoor, cityscape • Events • Explosion, waterfall, gunshot, dancing

Multiject for semantic concept • Example: P(Outdoor = Present | features, other multijects) = 0.7 • [Figure: the Outdoor multiject, with inputs labeled Visual features, Audio features, Text features, and Other multijects]

How to create a Multiject • Shot-boundary detection • Spatio-temporal segmentation of within-shot frames • Feature extraction (color, texture, edge direction, etc.) • Modeling • Sites: mixture of Gaussians • Events: hidden Markov models (HMMs) with Gaussian mixtures as observation densities • All audio events: modeled using HMMs • Each segment is tested for each concept, and the information is then composed at frame level
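To make the "sites: mixture of Gaussians" step concrete, the sketch below scores a single scalar feature under two Gaussian mixtures, one for "concept present" and one for "concept absent", and combines them with Bayes' rule. It is an illustration under stated assumptions and not the cited authors' implementation: the feature is one-dimensional, the mixture parameters are made up, and the prior of 0.5 is arbitrary.

```python
import math
from typing import List, Tuple

# One mixture component: (weight, mean, variance). Weights should sum to 1.
Component = Tuple[float, float, float]


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_likelihood(x: float, components: List[Component]) -> float:
    """Likelihood of x under a 1-D Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def concept_posterior(x: float,
                      present_gmm: List[Component],
                      absent_gmm: List[Component],
                      prior_present: float = 0.5) -> float:
    """P(concept = present | x) by Bayes' rule over the two mixtures."""
    p = mixture_likelihood(x, present_gmm) * prior_present
    a = mixture_likelihood(x, absent_gmm) * (1.0 - prior_present)
    return p / (p + a)


# Hypothetical models for an "outdoor" site, over a scalar feature such as
# the fraction of sky-colored pixels in a region.
outdoor_present = [(0.6, 0.8, 0.01), (0.4, 0.6, 0.02)]
outdoor_absent = [(1.0, 0.2, 0.02)]

print(concept_posterior(0.75, outdoor_present, outdoor_absent))
print(concept_posterior(0.20, outdoor_present, outdoor_absent))
```

In practice the features are multi-dimensional and the mixture parameters are learned from labeled segments; the structure of the computation stays the same.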

Multiject: Hierarchical HMM • ss1 - ssm: state sequence for supervisor HMM • sa1 - sam: state sequence for audio HMM • xa1 - xam: audio observations • sv1 - svm: state sequence for video HMM • xv1 - xvm: video observations

Multinet: Concept Building based on Multiject • A network of multijects modeling interaction between them • + / - : positive/negative interaction between multijects

Bayesian Multinet • Nodes: binary random variables (presence/absence of multiject) • Layer 0: frame-level multiject-based semantic features • Layer 1: inference from layer 0 • Layer 2: higher level for performance improvement
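The effect of positive and negative interaction between binary concept nodes can be shown on a toy example. The sketch below treats the layer-0 detector outputs as local evidence, multiplies in a 2x2 compatibility table per interacting pair, and recomputes each marginal by brute-force enumeration. This pairwise factor model is a stand-in chosen for brevity and is not the inference scheme of the cited work; the concept names and table values are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Table = List[List[float]]  # indexed as table[state_i][state_j], states 0/1


def refine_marginals(local: Dict[str, float],
                     pairwise: Dict[Tuple[str, str], Table]) -> Dict[str, float]:
    """Recompute P(present) for every concept given pairwise interactions."""
    names = sorted(local)
    totals = {n: 0.0 for n in names}
    z = 0.0
    for states in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, states))
        weight = 1.0
        # Local evidence from the frame-level (layer 0) detectors.
        for n in names:
            weight *= local[n] if assignment[n] else 1.0 - local[n]
        # Interaction between concept pairs.
        for (i, j), table in pairwise.items():
            weight *= table[assignment[i]][assignment[j]]
        z += weight
        for n in names:
            if assignment[n]:
                totals[n] += weight
    return {n: totals[n] / z for n in names}


POSITIVE = [[2.0, 1.0], [1.0, 2.0]]  # agreeing states are favored
NEGATIVE = [[1.0, 2.0], [2.0, 1.0]]  # disagreeing states are favored

layer0 = {"outdoor": 0.5, "sky": 0.9}
print(refine_marginals(layer0, {("outdoor", "sky"): POSITIVE}))
print(refine_marginals(layer0, {("outdoor", "sky"): NEGATIVE}))
```

With the positive table, a confident "sky" detection pulls the uncertain "outdoor" estimate above 0.5; with the negative table it pushes it below. Enumeration is exponential in the number of concepts, so it is only workable for a handful of nodes.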

Object-based Semantic Video Modeling • [Figure: block diagram with blocks labeled Video Sequence, VO Extraction, Object-based Video Abstraction, Object-based Low-Level Feature Extraction, Semantic Features Modeling, and Indexing/Retrieving]

Object Extraction based on Object Tracking [Kim, Hwang ’00] • [Figure: block diagram with frames I_n and I_{n-1}, video objects vo_{n-1} and vo_n, and blocks labeled Motion Projection, Model Update (Histogram Backprojection), delay, and Object Post-processing]

Semantic Feature Modeling • Modeling based on temporal variation of object features • Boundary shape and motion statistics of object area • [Figure: blocks labeled Pre-processing, Abstracted frame sequence, Object Features, and HMM Training]

HMM Modeling • 1. Observation sequence O1 … OT (object features) • 2. Left-right 1-D HMM modeling • [Figure: chain of states S1, S2, …, ST]
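A left-right HMM only allows a state to repeat or move forward, which shows up as an upper-triangular transition matrix. The sketch below evaluates an observation sequence with the standard forward algorithm. It assumes the object features have been quantized into three discrete symbols; the talk does not say whether discrete or continuous observations were used, and all probabilities here are made up for illustration.

```python
from typing import List, Sequence


def forward_likelihood(observations: Sequence[int],
                       initial: Sequence[float],
                       transition: Sequence[Sequence[float]],
                       emission: Sequence[Sequence[float]]) -> float:
    """P(observations | model) for a discrete HMM via the forward algorithm."""
    n_states = len(initial)
    alpha: List[float] = [initial[s] * emission[s][observations[0]]
                          for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Left-right topology: always start in state 0, then stay or advance by one.
initial = [1.0, 0.0, 0.0]
transition = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# emission[s][k]: probability that state s emits quantized feature symbol k.
emission = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

in_order = forward_likelihood([0, 1, 2], initial, transition, emission)
reversed_order = forward_likelihood([2, 1, 0], initial, transition, emission)
print(in_order, reversed_order)
```

A sequence that follows the expected temporal order scores much higher than the same symbols in reverse, which is what lets one HMM per event class act as a classifier: evaluate the sequence under each model and pick the most likely one.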

Video Modeling: Three Layer Structure • Three-layer structure of video modeling, compared to NLP • Video Understanding: Content Interpretation; Semantic Video Modeling; Frame-based Structural Modeling; Object-based Structural Modeling; Audio-Visual Feature Extraction • Natural Language Processing: Interpretation; Sentence Structure & Grammar; Word Recognition

A Multimodal Framework for Video Content Interpretation • Long-term goal • Application to an automatic TV program scout • Allow users to request programs at the topic level • Integrate multiple modalities: visual, audio, and text information • Multi-level concepts • Low: low-level features • Mid: object detection, event modeling • High: classification result of semantic content • Probabilistic model using a Bayesian network for classification (causal relationships, domain knowledge)

How to work with the framework? • Preprocessing • Story segmentation (shot detection) • VOCR, speech recognition • Key frame selection • Feature Extraction • Visual features based on key frames • Color, texture, shape, sketch, etc. • Audio features • average energy, bandwidth, pitch, mel-frequency cepstral coefficients, etc. • Textual features (transcript) • Knowledge tree with many keyword categories: politics, entertainment, stock, art, war, etc. • Word spotting, vote histogram • Motion features • Camera operation: panning, tilting, zooming, tracking, booming, dollying • Motion trajectories (moving objects) • Object abstraction, recognition • Building and training the Bayesian network
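The textual features above (word spotting and a vote histogram over keyword categories) can be sketched in a few lines. The category names follow the slide, but the keyword lists and the sample transcript are hypothetical, and counting one vote per matched word is only one possible voting rule.

```python
import re
from typing import Dict, Set

# Hypothetical keyword lists; a real knowledge tree would be far larger.
KEYWORD_CATEGORIES: Dict[str, Set[str]] = {
    "politics": {"election", "senate", "president"},
    "stock": {"shares", "market", "nasdaq"},
    "war": {"troops", "missile", "ceasefire"},
}


def vote_histogram(transcript: str,
                   categories: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count one vote per transcript word that matches a category keyword."""
    votes = {name: 0 for name in categories}
    for word in re.findall(r"[a-z]+", transcript.lower()):
        for name, keywords in categories.items():
            if word in keywords:
                votes[name] += 1
    return votes


transcript = ("The president addressed the senate before the election, "
              "while the market dipped.")
histogram = vote_histogram(transcript, KEYWORD_CATEGORIES)
print(histogram)  # {'politics': 3, 'stock': 1, 'war': 0}
```

The resulting histogram can serve as the text-modality input to the classifier, alongside the visual, audio, and motion features.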

Challenging points • Preprocessing is significant in the framework. • Accuracy of key-frame selection • Accuracy of speech recognition & VOCR • Good feature extraction is important for classification performance. • Modeling semantic video objects and events • How to integrate multiple modalities still needs careful consideration.

Conclusion • Introduced several basic concepts • Reviewed semantic video modeling and indexing • Proposed a multimodal framework for topic classification of video • Discussed challenging problems

Q & A Thank you!
