Video Indexing and Summarization using Combinations of the MPEG-7 Motion Activity Descriptor with other MPEG-7 audio-visual descriptors Ajay Divakaran MERL - Mitsubishi Electric Research Labs Murray Hill, NJ
Outline • Introduction • MPEG-7 Standard • Motivation for proposed techniques • Video Summarization using Motion Activity • Audio Assisted Video Summarization • Principal Cast Detection with MPEG-7 Audio Features • Automatic generation of Sports Highlights • Target Applications • Personal Video Recorder • Demonstration • Initial work on Video Mining • Conclusion
Team • Yours Truly • Kadir A. Peker – Colleague and Ex-Doctoral Student • Regunathan Radhakrishnan – Current Doctoral Student • Romain Cabasson – Summer Intern • Ziyou Xiong – Summer Intern and Current Collaborator • Padma Akella – Initial Demo designer and developer • Pradubkiat Bouklee – Initial Software developer
MPEG-7 Objectives • To develop a standard to identify and describe multimedia content • Formal name: Multimedia Content Description Interface • Enable quick access to desired content, whether local or remote
MPEG-7: Key Technologies and Scope [Diagram: the scope of MPEG-7 spans description production through description consumption]
MPEG-7 and other Standards [Diagram: MPEG-7 descriptors positioned relative to MPEG-1, MPEG-2 (studio, DTV), MPEG-4 SNHC/object-based coding, H.263, JPEG, and JPEG-2000, along axes running from emphasis on subjective representation and rate toward emphasis on semantic conveyance and functionality: indexing, retrieving, and browsing for hybrid content, interactive TV, video conferencing, visualization, abstract representation, and virtual reality]
MPEG-7 framework • MPEG-7 standardizes: • Descriptors (Ds): representations of features • to describe various types of features of multimedia information • to define the syntax and the semantics of each feature representation • Description Schemes (DSs) • to specify pre-defined structures and semantics of descriptors and their relationships • Description Definition Language (DDL) • to allow the creation of new DSs (and, possibly, Ds) and to allow the extension and modification of existing DSs – XML MPEG-7 Schema
MPEG-7 Motion Activity Descriptor • Feature Extraction from Video • Uncompressed Domain • Color Histograms - Zhang et al • Motion Estimation - Kanade et al • Compressed Domain • DC Images - Yeo et al, Kobla et al • Motion Vector Based - Zhang et al • Bit Allocation - Feng et al, Divakaran et al
Motivation for Compressed Domain Extraction • Compressed domain feature extraction is fast. • Block-matched motion vectors are sufficient for gross description. • Motion vector based calculation can be easily normalized w.r.t. encoding parameters.
Motivation for Descriptor • Need to capture “pace” or Intensity of activity • For example, draw distinction between • “High Action” segments such as chase scenes. • “Low Action” segments such as talking heads • Emphasize simple extraction and matching • Use Gross Motion Characteristics thus avoiding object segmentation, tracking etc. • Compressed domain extraction is important
Proposed Motion Activity Descriptor • Attributes of Motion Activity Descriptor • Intensity/Magnitude - 3 bits • Spatial Characteristics - 16 bits • Temporal Characteristics - 30 bits • Directional Characteristics - 3 bits
MPEG-7 Intensity of Motion Activity • Expresses “pace” or Intensity of Action • Uses scale of 1-5, very low - low - medium - high - very high • Extracted by suitably quantizing variance of motion vector magnitude • Motion Vectors extracted from compressed bitstream • Successfully tested with subjectively constructed Ground Truth
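Below is a minimal sketch of how such an intensity value could be computed, assuming block-matched motion vectors for a frame are already available. The quantization thresholds here are illustrative placeholders, not the normative values of the standard.

```python
import numpy as np

def motion_activity_intensity(motion_vectors, thresholds=(4.0, 11.0, 17.0, 32.0)):
    """Quantize the spread of motion-vector magnitudes into the 5-level scale.

    motion_vectors: (N, 2) array of block-matched (dx, dy) vectors for one frame.
    thresholds: illustrative cut points on the variance, not the standard's values.
    """
    magnitudes = np.linalg.norm(motion_vectors, axis=1)
    variance = magnitudes.var()                 # variance of motion-vector magnitudes
    # searchsorted maps the variance onto levels 1 (very low) .. 5 (very high)
    return int(np.searchsorted(thresholds, variance)) + 1

# Example: a near-static "talking head" frame vs. a high-action frame
static = np.random.normal(0.0, 0.5, size=(100, 2))
action = np.random.normal(0.0, 10.0, size=(100, 2))
print(motion_activity_intensity(static), motion_activity_intensity(action))
```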
Video Summarization using Motion Activity • Video sequence V = {f1, f2, …, fN}: a set of temporally ordered frames • Any temporally ordered subset of V is a summary • Previous work: Color dominant • Cluster frames based on image similarity • Select representative frames from clusters
Motion Activity as Summarizability • Hypothesis: • Motion activity measures intensity of motion • hence it measures change in the video • Therefore it indicates Summarizability • Test of the Hypothesis • Examine relationship between Fidelity of Summary and motion activity • Results show close correlation and motivate novel summarization strategy
Test of Hypothesis • Segment the test sequence into shots • Use the first frame of each shot as its Key-Frame (KF) • Compute the fidelity of each key-frame as described • Compute the motion activity of each shot • For each MPEG-7 motion activity threshold • Identify shots that have the same or lower motion activity • Find the percentage p of shots with unacceptable fidelity (>0.2) • Plot p vs the MPEG-7 motion activity thresholds
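The test above can be written down compactly as a hedged sketch: it assumes each shot is given as a sequence of per-frame feature vectors (e.g. color histograms), that key-frame fidelity is measured as the maximum distance from the key-frame to any frame in the shot, and that the per-shot MPEG-7 intensity level (1-5) has already been computed; only the 0.2 cutoff and the overall procedure come from the slide.

```python
import numpy as np

def key_frame_fidelity(shot_features, key_index=0):
    """Worst-case distance from the key-frame (first frame) to any frame in the shot."""
    key = shot_features[key_index]
    return max(np.linalg.norm(frame - key) for frame in shot_features)

def unacceptable_percentage(shots, activity_levels, max_level, cutoff=0.2):
    """Percentage p of shots at or below max_level whose key-frame fidelity exceeds cutoff."""
    selected = [s for s, a in zip(shots, activity_levels) if a <= max_level]
    if not selected:
        return 0.0
    bad = sum(key_frame_fidelity(s) > cutoff for s in selected)
    return 100.0 * bad / len(selected)

# p is then plotted against each MPEG-7 intensity threshold:
# p_curve = [unacceptable_percentage(shots, activity_levels, lvl) for lvl in range(1, 6)]
```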
Conclusions from Experiment • The percentage of shots with unacceptable fidelity grows monotonically with motion activity • In other words, as motion activity grows, the shots become increasingly difficult to summarize • Hence, motion activity is a direct indicator of summarizability • Question: Is the first frame the best choice as a key-frame?
Optimal Key-Frame Selection Using Motion Activity • Summarizability is an indication of change in the shot • The cumulative motion activity is therefore an indication of the cumulative change in the shot
Optimal Key-Frame Selection Based on Cumulative Motion Activity
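One plausible reading of this idea (a hedged sketch, not necessarily the exact rule used in the talk) is to pick the frame at which the cumulative motion activity reaches half of the shot's total, i.e. the frame that splits the accumulated change evenly:

```python
import numpy as np

def cumulative_activity_key_frame(per_frame_activity):
    """Index of the frame where cumulative motion activity reaches half the shot total."""
    cumulative = np.cumsum(per_frame_activity)
    return int(np.searchsorted(cumulative, cumulative[-1] / 2.0))

# Example: when most of the change happens late in the shot,
# the selected key-frame moves correspondingly later.
print(cumulative_activity_key_frame([0.1, 0.1, 0.2, 2.0, 3.0]))
```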
Audio Assisted Video Browsing: Motivation • Baseline MHL visual summarization works well only when semantic segment boundaries are well defined • Semantic segment boundaries cannot be located easily using visual features alone • Audio is a rich source of content semantics • Should use audio features to locate semantic segment boundaries
Past Work • Principal Cast Identification using Audio – Wang et al • Topic Detection using Speech Recognition – Hanjalic et al • Semantic Scene Segmentation using Audio – Sundaram et al • Past work has emphasized classification of audio into crisp categories • We would like both a crisp categorization and a feature vector that allows softer classification • Generalized Sound Recognition Framework – Casey et al • Casey's work provides a rich audio-semantic framework for our research
Our Approach to Principal Cast Detection [Block diagram: MPEG-7 Generalized Sound Recognition → State Duration Histograms (our enhancement) → Principal Cast]
MHL application of Casey’s approach to News Video Browsing • Classify the audio segments of the news video into speech and non-speech categories in first pass • Classify the speech segments into male and female speech • Using K-means clustering find the “principal” speakers in each category • The occurrence of each of the principal speakers provides a natural semantic boundary • Apply baseline visual summarization technique to semantic segments obtained above • There is thus a two-level summarization of the news video
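A hedged sketch of the clustering step in this pipeline is shown below. It assumes each speech segment has already been reduced to a fixed-length audio feature vector (for instance a state-duration histogram from the sound-recognition framework); the cluster count and the rule that the largest clusters correspond to the principal speakers are illustrative assumptions, not the exact settings of the system described here.

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_speakers(speech_features, n_clusters=4, n_principal=2):
    """Cluster speech-segment features; return per-segment labels and the largest clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(speech_features)
    counts = np.bincount(labels, minlength=n_clusters)
    principal = np.argsort(counts)[::-1][:n_principal]      # most frequent speakers
    return labels, principal

# Points in time where the active principal speaker changes can then serve as the
# semantic segment boundaries fed to the baseline visual summarizer.
```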
Results and Challenges • Moderate accuracy so far • Results are thus promising but not satisfactory • Lack of noise robustness and the content dependence of the training process are the major hurdles • Currently working on eliminating such problems through extensive training • Feature extraction is too complex – currently investigating compressed-domain audio feature extraction • Also examining alternative architectures that preserve the basic spirit of the framework
Automatic Extraction of Sports Highlights • Rapid Sports Highlights extraction is critical • Past work has made use of color, camera motion etc. • MPEG-7 Motion Activity Descriptor is simple • Can use it to extract high action segments for example • Should be useful in highlight extraction
Essential Strategy • Sports are governed by a set of rules • Key events lead to surges and dips in motion activity (perceived motion) • Thus, for a given sport, we can look for certain temporal patterns of motion activity that would indicate an interesting event • In sports highlights, the emphasis is on key-events and not on key-frames
Motion Activity Curve • Shot detection is not meaningful for our purpose • Compute motion activity (average magnitude of motion vectors) for each P-frame • Smooth the values using a 10-point moving-average filter followed by a median filter • Quantize into binary levels of high and low motion using a threshold • Low threshold for golf, high for soccer
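A minimal sketch of that curve computation follows; it assumes the per-P-frame motion vectors have already been parsed from the compressed bitstream, and the median-filter width is an illustrative choice (the 10-point window and the binary thresholding come from the slide).

```python
import numpy as np
from scipy.signal import medfilt

def motion_activity_curve(p_frame_mvs, threshold, ma_window=10, med_window=5):
    """Binary high/low activity curve from per-P-frame motion vectors.

    p_frame_mvs: list of (N_i, 2) arrays, one array of motion vectors per P-frame.
    """
    raw = np.array([np.linalg.norm(mvs, axis=1).mean() for mvs in p_frame_mvs])
    smoothed = np.convolve(raw, np.ones(ma_window) / ma_window, mode="same")  # moving average
    smoothed = medfilt(smoothed, kernel_size=med_window)                      # median filter
    return (smoothed > threshold).astype(int)     # 1 = high activity, 0 = low activity
```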
Highlights extraction : Golf • Play consists of long stretches of low activity interspersed with bursts of interesting high activity • Look for rising edges in the quantized motion activity curve • Concatenate ten second segments beginning at each of the points of interest marked above • The concatenation forms the desired summary
Highlights Extraction: Soccer • Play consists of long stretches of high activity • Interesting events lead to non-trivial stops in play leading to a short stretch of low MA • Thus we look for falling edges followed by a non-trivially long stretch of low motion activity • We are able to find the interesting events this way but have many false alarms • With our interface false alarms are easy to skip
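The edge detection on that binary curve can be sketched as below (hedged: the minimum length of the low-activity stretch after a soccer stop is an illustrative parameter, and indices are in P-frames rather than seconds).

```python
import numpy as np

def golf_events(binary_curve):
    """Rising edges (low -> high): candidate start points of golf highlights."""
    return np.where(np.diff(binary_curve) == 1)[0] + 1

def soccer_events(binary_curve, min_low_run=30):
    """Falling edges followed by a sufficiently long stretch of low activity."""
    events = []
    for i in np.where(np.diff(binary_curve) == -1)[0] + 1:
        run = np.asarray(binary_curve[i:i + min_low_run])
        if len(run) == min_low_run and not run.any():     # non-trivial stop in play
            events.append(int(i))
    return events

# For golf, a ten-second clip starting at each detected event is concatenated
# into the highlight summary, as described above.
```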
Strengths and Limitations of Our Approach • The extraction is rapid and can be done in real time • We use an adaptively computed threshold that is suited to the content • An interface such as ours helps skip false alarms easily • There are too many false alarms
Summary of Sports Highlights Generation • Motion Activity provides a quick way to generate sports highlights • We use a different strategy with each sport • The simplicity of the technique allows real-time tuning of thresholds to modify highlights • Interactive interfaces enable effective use
PVR: Personal Video Recorder [Block diagram: video codec and local storage feeding feature extraction & MPEG-7 indexing, which drives browsing & summarization through an enhanced user interface] With massive amounts of locally stored content, we need to locate and customize content according to the user
Blind Summarization – A Video Mining Approach to Video Summarization Ajay Divakaran and Kadir A. Peker Mitsubishi Electric Research Laboratories Murray Hill, NJ
Content Mining • What is Data Mining? • It is the discovery of patterns and relationships in data. • Makes heavy use of statistical learning techniques such as regression and classification • Has been successfully applied to numerical data • Application to multimedia content is the next logical step • Most applicable to stored surveillance video and home video since patterns are not known a priori • Should enable anomalous event detection leading to highlight generation • Not applicable at first glance to consumer video
Content Mining vs. Typical Data Mining • Commonalities • Large data sets. Video is well known to produce huge volumes of data • Amenable to statistical analysis – Many of the machine learning tools work well with both kinds of data as can be seen in the literature and our research as well • Differences • Number of features not necessarily as large as conventional data mining data sets • Size of dataset not necessarily as large as conventional data mining data sets • Popular data mining techniques such as CART may not be directly applicable and may need modification • In summary, new mining techniques that retain the basic philosophy while customizing the details will have to be developed
Summarization cast as a Content Mining Problem • The DVD "Auto-Summarization" mode inspires "blind summarization" • Content summarization can be cast as follows: • Classify segments into common and uncommon events without necessarily knowing the domain • Common patterns – what this video is about • Rare patterns – possibly interesting events • May help to categorize video, detect style... • The summary is then a combination of common and rare events • Can hybridize with domain-dependent techniques
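A hedged illustration of the common-versus-rare split: assuming each video segment is represented by a feature vector, the segments can be clustered and small clusters treated as rare, possibly interesting, events. The cluster count and the rarity cutoff are illustrative choices, not the talk's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_common_rare(segment_features, n_clusters=5, rare_fraction=0.1):
    """Boolean masks over segments: (common, rare), where rare = member of a small cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(segment_features)
    counts = np.bincount(labels, minlength=n_clusters)
    rare_clusters = np.where(counts < rare_fraction * len(labels))[0]
    rare = np.isin(labels, rare_clusters)
    return ~rare, rare
```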
Data Mining Basics • Associations • Time series similarity • Sequential patterns • Clustering • "How do regions A and B differ?", "Any anomalies in A?", "What goes with item x?" • Marketing, molecular biology, etc.
Associations • A set of items i1..im; a set of transactions, each containing a subset of the items; a database of transactions • Rule X → Y (X, Y sets of items): • Support s: s% of transactions have X and Y together • Confidence c: c% of the time, buying X implies buying Y • Improvement: ratio of P(X,Y) to P(X)*P(Y) • Find all rules with support, confidence, and improvement larger than specified thresholds • Continuous-valued extension exists
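The three rule measures can be made concrete with a small sketch over a toy transaction database (item names and numbers below are illustrative only):

```python
def rule_measures(transactions, X, Y):
    """Support, confidence, and improvement for the rule X -> Y over a list of item sets."""
    n = len(transactions)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    with_x = sum(1 for t in transactions if X <= t)
    with_y = sum(1 for t in transactions if Y <= t)
    support = both / n
    confidence = both / with_x if with_x else 0.0
    improvement = support / ((with_x / n) * (with_y / n)) if with_x and with_y else 0.0
    return support, confidence, improvement

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(rule_measures(baskets, {"bread"}, {"milk"}))   # support 0.5, confidence ~0.67, improvement ~0.89
```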
Some Basic Aspects • Unsupervised learning • Similar to clustering vs. classification • Estimation of joint probability density • Find values of (i1,i2,…,in) where P(i1, i2,…,in) is high