Content Based Multimedia Signal Processing Yu Hen Hu University of Wisconsin – Madison
Outline • Multimedia Content Description Interface (MPEG-7) • Video content features • Spoken content features • Multimedia indexing and retrieval • Multimedia summary and filtering • Other applications
MPEG-7 Overview • Large amounts of digital content are available: it is easy to create, digitize, and distribute audio-visual content (the "family album syndrome"), so content must be organized, indexed, and retrieved • Information overload creates the need for filtering • MPEG-7 objective: provide interoperability among systems and applications used in the generation, management, distribution, and consumption of audio-visual content descriptions, and help users identify, retrieve, or filter audio-visual information
Potential Applications of MPEG-7 • Summary: generation of multimedia program guides or content summaries; generation of content descriptions of A/V archives to allow seamless exchange among content creators, aggregators, and consumers • Filtering: filter and transform multimedia streams in resource-limited environments by matching user preferences, available resources, and content descriptions • Retrieval: recall music using sample tunes; recall pictures using sketches of shape, color, or movement, or a description of the scenario • Recommendation: recommend program material by matching user preferences (profiles) to program content • Indexing: create family photo or video library indices
Content Descriptions • Descriptors (Ds): MPEG-7 contains standardized descriptors for audio, visual, and generic content; it standardizes how these content features are characterized, but not how they are extracted; different levels of syntactic and semantic description are available • Description Schemes (DSs): specify the structure of and relations among different A/V descriptors • Description Definition Language (DDL): a standardized language based on XML (eXtensible Markup Language) for defining new Ds and DSs and for extending or modifying existing Ds and DSs
Visual Color Descriptors • Color space: HSV (hue-saturation-value) • Scalable color descriptor (SCD): a color histogram (uniform 256 bins) of an image in HSV space, encoded with a Haar transform • Color layout descriptor: spatial distribution of color in an arbitrarily shaped region • Dominant color descriptor (DCD): colors are clustered first, and the cluster representatives are reported • Color structure descriptor (CSD): scan an 8x8 structuring element over the image in a sliding window and count the windows in which each particular color appears • Group of Frames/Group of Pictures (GoF/GoP) color descriptor
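To make the first step concrete, here is a minimal sketch (Python/NumPy) of a uniformly quantized HSV histogram in the spirit of the scalable color descriptor. It assumes the image has already been converted to HSV with all channels scaled to [0, 1], and it omits the Haar transform and nonuniform quantization that the actual MPEG-7 descriptor applies on top of the histogram.

```python
import numpy as np

def hsv_histogram(hsv_img, h_bins=16, s_bins=4, v_bins=4):
    """Quantized HSV color histogram (simplified sketch of the first step
    of the scalable color descriptor; the standard additionally applies a
    Haar transform and nonuniform quantization).

    hsv_img: float array of shape (H, W, 3) with channels in [0, 1].
    Returns a normalized histogram with h_bins * s_bins * v_bins entries.
    """
    h = np.clip((hsv_img[..., 0] * h_bins).astype(int), 0, h_bins - 1)
    s = np.clip((hsv_img[..., 1] * s_bins).astype(int), 0, s_bins - 1)
    v = np.clip((hsv_img[..., 2] * v_bins).astype(int), 0, v_bins - 1)
    # Combine the three quantized channels into a single bin index.
    bin_idx = (h * s_bins + s) * v_bins + v
    hist = np.bincount(bin_idx.ravel(), minlength=h_bins * s_bins * v_bins)
    return hist / hist.sum()
```

With the default 16 x 4 x 4 quantization this yields the 256-bin histogram mentioned above.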
Visual Texture Descriptors • Texture browsing descriptor: regularity (0: irregular, 3: periodic); directionality (up to 2 dominant directions, quantized 1-6 in 30° increments); coarseness (0: fine, 3: coarse) • Edge histogram descriptor: the image is divided into 16 sub-images, with 5 edge-direction bins per sub-image • Homogeneous texture descriptor (HTD): divide the spatial-frequency plane into 30 channels (5 radial x 6 angular); apply a 2D Gabor filter bank, one filter per channel; the energy and energy deviation in each channel form the descriptor
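The edge histogram is the easiest of these to sketch. The version below (Python/NumPy) is a simplified stand-in: it quantizes gradient orientation into five direction bins per sub-image rather than applying the standard's five 2x2 edge filters, and the magnitude threshold is an arbitrary illustrative value.

```python
import numpy as np

def edge_histogram(gray, grid=4, n_bins=5, mag_thresh=0.1):
    """Simplified edge-direction histogram (in the spirit of MPEG-7's edge
    histogram descriptor, but using gradient orientation instead of the
    standard's five 2x2 edge filters).

    gray: 2D float array (grayscale image) with values in [0, 1].
    Returns grid*grid*n_bins values: one orientation histogram per sub-image.
    """
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    # Orientation folded into [0, pi), quantized into n_bins direction bins.
    ori = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)

    H, W = gray.shape
    hist = np.zeros((grid, grid, n_bins))
    for i in range(grid):
        for j in range(grid):
            rows = slice(i * H // grid, (i + 1) * H // grid)
            cols = slice(j * W // grid, (j + 1) * W // grid)
            m, b = mag[rows, cols], bins[rows, cols]
            strong = m > mag_thresh            # ignore weak (non-edge) pixels
            for k in range(n_bins):
                hist[i, j, k] = np.count_nonzero(strong & (b == k))
    total = hist.sum()
    return (hist / total).ravel() if total > 0 else hist.ravel()
```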
Visual Shape Descriptors • 3D shape descriptor (shape spectrum): a histogram (100 bins, 12 bits/bin) of a shape index computed over a 3D surface; each shape index measures local convexity • Region-based descriptor: angular radial transform (ART), a moment-based shape analysis; the ART basis functions are V_nm(ρ, θ) = exp(jmθ) R_n(ρ), with R_n(ρ) = 2 cos(πnρ) for n ≠ 0 and R_n(ρ) = 1 for n = 0 • Contour-based shape descriptor: curvature scale space (CSS); a contour of N points is successively smoothed with the kernel [0.25 0.5 0.25] until it becomes convex; the curvature at each point forms the curvature function at that scale, and the peaks across scales are used as features • 2D/3D descriptor: uses multiple 2D descriptors to describe a 3D shape
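A rough sketch of the CSS computation described above (Python/NumPy): the contour is repeatedly smoothed with the [0.25, 0.5, 0.25] kernel and the curvature zero crossings are recorded at each scale. Building the final peak-based descriptor from these records is omitted, and the stopping criterion is simplified.

```python
import numpy as np

def css_zero_crossings(x, y, max_iters=200):
    """Curvature-scale-space sketch: smooth a closed contour with the
    [0.25, 0.5, 0.25] kernel and record where the curvature changes sign at
    each smoothing level.  (The MPEG-7 CSS descriptor then keeps the peaks
    of these zero-crossing positions across scales.)

    x, y: 1D arrays of contour coordinates (closed curve).
    Returns a list of (scale, zero_crossing_indices) pairs.
    """
    kernel = np.array([0.25, 0.5, 0.25])

    def smooth(v):
        # Circular convolution, since the contour is closed.
        padded = np.concatenate([v[-1:], v, v[:1]])
        return np.convolve(padded, kernel, mode="valid")

    def curvature(cx, cy):
        dx, dy = np.gradient(cx), np.gradient(cy)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        return (dx * ddy - dy * ddx) / np.maximum((dx**2 + dy**2) ** 1.5, 1e-12)

    css = []
    for scale in range(1, max_iters + 1):
        x, y = smooth(x), smooth(y)
        k = curvature(x, y)
        zc = np.where(np.sign(k[:-1]) != np.sign(k[1:]))[0]
        if len(zc) == 0:          # curve has become convex: stop
            break
        css.append((scale, zc))
    return css
```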
Visual Motion Descriptors • Motion activity descriptor: intensity of activity, direction of activity, spatial distribution of activity, temporal distribution of activity • Camera motion descriptor: panning, booming (lifting up), tracking, tilting, zooming, rolling (around the image center), dollying (moving backward/forward) • Motion trajectory descriptor: trajectory of a moving region within a video segment • Parametric motion / warping parameters: motion of a region with respect to a mosaic • (Diagram: the motion descriptors organized by scope: camera motion and motion activity for a video segment, motion trajectory for a moving region, parametric motion and warping parameters with respect to a mosaic)
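For the intensity attribute of motion activity, here is a minimal sketch (Python/NumPy) that quantizes the standard deviation of motion-vector magnitudes into five levels; the threshold values used below are illustrative placeholders, not the normative MPEG-7 values.

```python
import numpy as np

def motion_activity_intensity(motion_vectors, thresholds=(4.0, 10.0, 17.0, 32.0)):
    """Intensity-of-activity sketch: quantize the standard deviation of
    motion-vector magnitudes in a frame into 5 levels.  The thresholds here
    are placeholders, not the values specified by the standard.

    motion_vectors: array of shape (N, 2) with (dx, dy) per macroblock.
    Returns an activity level in 1..5 (1 = very low, 5 = very high).
    """
    magnitudes = np.hypot(motion_vectors[:, 0], motion_vectors[:, 1])
    sigma = magnitudes.std()
    return 1 + int(np.searchsorted(thresholds, sigma))
```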
MPEG-7 Audio Content Descriptors • Four classes of audio signals: pure music, pure speech, pure sound effects, and arbitrary soundtracks • Silence Ds: silence type • Sound effect Ds: audio spectrum and other sound-effect features • Spoken content Ds: speaker type, link type, extraction info type, confusion info type • Timbre Ds: instrument (harmonic or percussive) • Melody contour Ds: contour, meter, beat
Spoken Content Description • Goal: support retrieval that is robust to potentially erroneous transcriptions extracted by an automatic speech recognition (ASR) system • Spoken content header: word lexicon (vocabulary); phone lexicon, e.g. IPA (International Phonetic Association alphabet) or SAMPA (Speech Assessment Methods Phonetic Alphabet); phone confusion statistics; speaker information • Spoken content lattice (word or phone): lattice nodes connected by word and phone links • (Diagram: speech waveform → audio processing / ASR → MPEG-7 encoder → header + lattice; an example lattice holds competing word hypotheses with probabilities, e.g. "HIS" P=0.3, "IS" P=0.7, "BORE" P=0.6)
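To make the lattice idea concrete, here is a small sketch (Python) of a word lattice holding competing hypotheses with probabilities, plus a query that matches against all hypotheses rather than only the 1-best transcription; the data layout and numbers are illustrative and do not follow the MPEG-7 DS syntax.

```python
# Each lattice link carries alternative word hypotheses with posterior
# probabilities from the recognizer (illustrative values, not MPEG-7 syntax).
lattice = [
    {("HIS", 0.3), ("IS", 0.7)},    # link between node 0 and node 1
    {("BORE", 0.6), ("BOAR", 0.4)},  # link between node 1 and node 2
]

def lattice_contains(lattice, query_word, min_prob=0.5):
    """Return True if any link holds the query word above the threshold.
    Searching all hypotheses (not just the 1-best transcript) is what makes
    retrieval robust to recognition errors."""
    return any(prob >= min_prob
               for link in lattice
               for word, prob in link
               if word == query_word.upper())

print(lattice_contains(lattice, "bore"))   # True
print(lattice_contains(lattice, "his"))    # False (below threshold)
```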
Use of Content Features • Multimedia information retrieval: create searchable archives of A/V materials, e.g. albums and digital libraries • Real-world examples: call routing, technical support, on-line manuals, shopping, multimedia on demand • Filtering: automated e-mail sorters, personalized information portals • Enhanced low-level signal processing: coding and transcoding, post-processing
Content-Based Retrieval System (Block diagram: an input module extracts features from multimedia data in an image database and stores them in a feature database; a query module supports interactive query formation; a retrieval module performs feature extraction and feature comparison; the user browses the output and provides feedback)
Multimedia CBR System Design Issues • Requirement analysis • How the multimedia materials are to be used • Determines what set of features are needed • Archiving • How should individual objects be stored? Granularity? • Indexing (query) and retrieving • With multi-dimensional indices, what is an effective and efficient retrieval method? • What is a suitable perceptually-consistent similarity measure? • User interface • Modality? Text, spoken language, or others? • Interactive or batch? Will dialogue be available?
Multimedia Archiving • Facts: • A/V material is often in compressed format and requires large storage space • The content index also occupies storage space • Issues: • Granularity must match the underlying file system • Logical versus physical segmentation • File allocation on the file system must support multiple stream access and low latency
Indexing and Retrieving • Index: a very high-dimensional binary vector encoding the content features; text-based content can be represented with term vectors, and A/V content features can be encoded as either Boolean vectors or term vectors • Retrieval: a pattern classification problem; use the index vector as the feature vector and classify each object as relevant or irrelevant to a query vector (template); a perceptually consistent similarity measure is essential
Term Vector Query • Each document is represented by a term vector • A term is a keyword or a phrase • A term vector is a vector of terms; each dimension of the vector corresponds to a term • Dimension of a term vector = total number of distinct terms • Example: set of terms = [tree, cake, happy, cry, mother, father, big, small]; documents = "Father gives me a big cake. I am so happy", "Mother planted a small tree"; term vectors: [0, 1, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0, 0, 1]
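A minimal sketch (Python/NumPy) that builds the binary term vectors from this example and ranks documents against a query by cosine similarity, in line with the similarity-based retrieval described on the previous slide; the tokenization (lower-casing and whitespace splitting) is a simplification.

```python
import numpy as np

TERMS = ["tree", "cake", "happy", "cry", "mother", "father", "big", "small"]

def term_vector(text, terms=TERMS):
    """Binary term vector: 1 if the term occurs in the text, else 0."""
    words = text.lower().split()
    return np.array([1 if t in words else 0 for t in terms])

docs = ["father gives me a big cake i am so happy",
        "mother planted a small tree"]
doc_vectors = np.array([term_vector(d) for d in docs])

def retrieve(query, doc_vectors):
    """Rank documents by cosine similarity between query and index vectors."""
    q = term_vector(query)
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims

order, sims = retrieve("big happy cake", doc_vectors)
print(order, sims)   # document 0 ranks first
```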
Inverse Term Frequency Vector • A probabilistic (tf-idf style) term vector representation • Relative term frequency (within a document): tf(t, d) = count of term t in document d / total number of terms in document d • Inverse document frequency: idf(t) = log [ total number of documents / number of documents containing t ] • Weighted term frequency: dt = tf(t, d) · idf(t) • Inverse document frequency term vector: D = [d1, d2, … ]
ITF Vector Example • Document 1: "The weather is great these days." • Document 2: "These are great ideas." • Document 3: "You look great." • Stop words eliminated: the, is, these, are, you
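Applying the definitions from the previous slide to these three documents gives the sketch below (Python; tokenization and stop-word removal are simplified). Note that "great", which occurs in every document, receives weight zero.

```python
import math
from collections import Counter

docs = ["the weather is great these days",
        "these are great ideas",
        "you look great"]
stop_words = {"the", "is", "these", "are", "you"}

# Tokenize and drop stop words.
tokens = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
n_docs = len(docs)

def tf_idf_vector(doc_tokens):
    """Weighted term vector: dt = tf(t, d) * log(N / number of docs containing t)."""
    counts = Counter(doc_tokens)
    vec = []
    for t in vocab:
        tf = counts[t] / len(doc_tokens)
        n_containing = sum(1 for dt in tokens if t in dt)
        vec.append(tf * math.log(n_docs / n_containing))
    return vec

for i, dt in enumerate(tokens, 1):
    print(f"Document {i}:", [round(v, 3) for v in tf_idf_vector(dt)])
```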
Human-Computer Interface • HCI is a match-maker: it matches the needs of humans and computers • (Diagram: commands flow from the human to the computer via voice, gesture, push buttons/keys, facial expression, and eye movement; data flow back to the human as sensations, visual, audio, pressure, and smell, as in a virtual environment)
Basic HCI Design Principles • Consistency: the same command always means the same thing • Intuition: use metaphors that are familiar to the user • Adaptability: adapt to the user's skill and style • Economy: achieve a goal with minimum effort • Non-intrusive: do not decide for the user without asking • Structure: present only relevant information to the user, in a simple manner
User Models • User Profiles: • Categorize users using features relevant to tasks • Static features: age, sex, etc. • Dynamic features: activity logs, etc. • Derived features: skill levels, preferences, etc. • Use of Profiles for HCI • Adaptation: customize the HCI for different categories of users • Better understanding of the user's needs
Principles of Dialogue Design • Feedback: always acknowledge the user's input • Status: always inform users where they are in the system • Escape: provide a graceful way to exit halfway through a task • Minimal Work: minimize the amount of input the user must provide • Default: provide default values to minimize work • Help: provide context-sensitive help • Undo: allow the user to make an unintentional mistake and correct it • Consistency: the same command and conventions behave the same way throughout the dialogue
Performance Evaluation • The document retrieval problem is a hypothesis testing problem: H0: di is relevant to q (r = 1); H1: di is irrelevant to q (r = 0) • Type I error (Pe1 = P{r = 0 | H0}): relevant but not retrieved • Type II error (Pe2 = P{r = 1 | H1}): irrelevant but retrieved • Contingency table for evaluating retrieval: w = relevant and retrieved, x = relevant but not retrieved, y = irrelevant but retrieved, z = irrelevant and not retrieved • Precision-recall curve: P(recision) = w/(w+y) measures the specificity of the result; R(ecall) = w/(w+x) indicates the completeness of the result • Operating curve: Pe1 = x/(w+x) = 1 − R; Pe2 = y/(y+z) = F(allout) • Expected search length: the average number of documents that must be examined to retrieve a given number of relevant documents • Subjective criteria
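A small sketch (Python, with a toy example) computing precision, recall, and fallout directly from the contingency counts w, x, y, z defined above.

```python
def retrieval_metrics(relevant, retrieved, collection_size):
    """Precision, recall, and fallout from the contingency counts above.
    relevant / retrieved: sets of document ids; collection_size: total N."""
    w = len(relevant & retrieved)     # relevant and retrieved
    x = len(relevant - retrieved)     # relevant, not retrieved
    y = len(retrieved - relevant)     # irrelevant, retrieved
    z = collection_size - w - x - y   # irrelevant, not retrieved
    precision = w / (w + y) if (w + y) else 0.0
    recall    = w / (w + x) if (w + x) else 0.0
    fallout   = y / (y + z) if (y + z) else 0.0
    return precision, recall, fallout

# Toy example: 10-document collection, 4 relevant, 5 retrieved, 3 of them correct.
print(retrieval_metrics({1, 2, 3, 4}, {2, 3, 4, 7, 9}, collection_size=10))
# -> (0.6, 0.75, 0.333...)
```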
Example: MetaSEEk • MetaSEEk: a meta-search engine • Purpose: retrieving images • Method: select and interface with multiple on-line image search engines • Search principle: track the performance of the search engines and their search options for different classes of queries • Reference: A. B. Benitez, M. Beigi, and S.-F. Chang, "Using Relevance Feedback in Content-Based Image Metasearch," IEEE Internet Computing, Vol. 2, No. 4, pp. 59-69, July/August 1998
Basic Idea of MetaSEEk • Classify user queries into different clusters by their visual content • Rank the different search engines according to their performance on the different classes of user queries • Select the search engines and search options according to their rank for the specific query cluster • Display the search results to the user • Update the performance scores according to the user's feedback
Content-Based Visual Query (1) • Advantage: ease of creating, capturing, and collecting digital imagery • Approach: • Extract significant features (color, texture, shape, structure) • Organize the feature vectors • Compute the closeness of the feature vectors • Retrieve matched or most similar images
Content-Based Visual Query (2): Improving Efficiency • Keyword-based search: match images with particular subjects and narrow down the search scope • Clustering: classify images into various categories based on their contents • Indexing: applied to the image feature vectors to support efficient access to the database
Clustering the Visual Data • K-means algorithm: simple, with reduced computation • Tamura features (for texture) • For color, feature vectors are calculated from the color histogram • Distances are computed using the Euclidean distance
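A minimal k-means sketch (Python/NumPy) over color-histogram feature vectors using Euclidean distance; the histograms in the example are random stand-ins for real descriptors such as the SCD sketched earlier.

```python
import numpy as np

def kmeans(features, k, n_iters=50, seed=0):
    """Plain k-means over feature vectors (e.g. color histograms), using
    Euclidean distance.  A minimal sketch, not an optimized library routine.
    features: array of shape (n_samples, n_dims).  Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each feature vector to the nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned vectors.
        new_centroids = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: cluster 100 random 256-bin "color histograms" into 4 groups.
hists = np.random.default_rng(1).random((100, 256))
hists /= hists.sum(axis=1, keepdims=True)
labels, centroids = kmeans(hists, k=4)
```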
Multimedia Summary and Filtering • Summary: text (e.g. e-mail reading), image (caption generation), video (highlights, storyboards) • Issues: segmentation, clustering of segments, labeling clusters, and associating them with syntactic and semantic labels • Filtering: same as retrieval; filter out irrelevant objects based on a given criterion (query); often needs to be performed based on content features, e.g. filtering traffic accidents or law violations from traffic monitoring videos
Content-Based Coding and Post-Processing • Coding decisions based on low-level content features: coding mode (inter/intra selection), motion estimation • Object-based coding: encode different regions (video object planes, VOPs) separately, using different coders for different types of regions • Multiple-abstraction-layer coding: an analysis/synthesis approach that synthesizes low-level content from a higher-level abstraction, e.g. texture synthesis • Content-based post-processing: identify content types and then synthesize the low-level content
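As one concrete example of a content-driven coding decision, here is a toy inter/intra mode selection rule (Python/NumPy): code a macroblock intra when its own variance is smaller than the motion-compensated residual energy. This is a hypothetical rule in the spirit of classic encoders, not a normative algorithm from any standard.

```python
import numpy as np

def choose_coding_mode(block, mc_prediction):
    """Toy inter/intra mode decision for one macroblock: code intra if the
    block's own variance is smaller than the energy of the motion-compensated
    prediction residual; otherwise code inter.  Illustrative only.

    block, mc_prediction: 2D float arrays, e.g. 16x16 luminance macroblocks."""
    intra_cost = float(np.var(block))                          # spatial activity
    inter_cost = float(np.mean((block - mc_prediction) ** 2))  # residual energy
    return "intra" if intra_cost < inter_cost else "inter"
```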