Video Indexing and Summarization using Combinations of the MPEG-7 Motion Activity Descriptor with other MPEG-7 audio-visual descriptors Ajay Divakaran MERL - Mitsubishi Electric Research Labs Murray Hill, NJ
Outline • Introduction • MPEG-7 Standard • Motivation for proposed techniques • Video Summarization using Motion Activity • Audio Assisted Video Summarization • Principal Cast Detection with MPEG-7 Audio Features • Automatic generation of Sports Highlights • Target Applications • Personal Video Recorder • Demonstration • Initial work on Video Mining • Conclusion
Team • Yours Truly • Kadir A. Peker – Colleague and Ex-Doctoral Student • Regunathan Radhakrishnan – Current Doctoral Student • Romain Cabasson – Summer Intern • Ziyou Xiong – Summer Intern and Current Collaborator • Padma Akella – Initial Demo designer and developer • Pradubkiat Bouklee – Initial Software developer
MPEG-7 Objectives • To develop a standard to identify and describe multimedia content • Formal name: Multimedia Content Description Interface • Enable quick access to desired content, whether local or remote
MPEG-7: Key Technologies and Scope [Diagram: the scope of MPEG-7 spans description production through description consumption]
MPEG-7 and other Standards [Diagram: MPEG-7 descriptors positioned relative to MPEG-1, MPEG-2 (studio, DTV), MPEG-4 SNHC/object-based coding, H.263, JPEG, and JPEG-2000, along axes running from emphasis on subjective representation and rate toward emphasis on semantic conveyance and functionality: indexing, retrieving, and browsing for hybrid content, interactive TV, video conferencing, visualization, abstract representation, and virtual reality]
MPEG-7 framework • MPEG-7 standardizes: • Descriptors (Ds): representations of features • to describe various types of features of multimedia information • to define the syntax and the semantics of each feature representation • Description Schemes (DSs) • to specify pre-defined structures and semantics of descriptors and their relationships • Description Definition Language (DDL) • to allow the creation of new DSs (and, possibly, Ds) and to allow the extension and modification of existing DSs – XML MPEG-7 Schema
MPEG-7 Motion Activity Descriptor • Feature Extraction from Video • Uncompressed Domain • Color Histograms - Zhang et al • Motion Estimation - Kanade et al • Compressed Domain • DC Images - Yeo et al, Kobla et al • Motion Vector Based - Zhang et al • Bit Allocation - Feng et al, Divakaran et al
Motivation for Compressed Domain Extraction • Compressed domain feature extraction is fast. • Block-matched motion vectors are sufficient for gross description. • Motion vector based calculation can be easily normalized w.r.t. encoding parameters.
Motivation for Descriptor • Need to capture “pace” or Intensity of activity • For example, draw distinction between • “High Action” segments such as chase scenes. • “Low Action” segments such as talking heads • Emphasize simple extraction and matching • Use Gross Motion Characteristics thus avoiding object segmentation, tracking etc. • Compressed domain extraction is important
Proposed Motion Activity Descriptor • Attributes of Motion Activity Descriptor • Intensity/Magnitude - 3 bits • Spatial Characteristics - 16 bits • Temporal Characteristics - 30 bits • Directional Characteristics - 3 bits
MPEG-7 Intensity of Motion Activity • Expresses “pace” or Intensity of Action • Uses scale of 1-5, very low - low - medium - high - very high • Extracted by suitably quantizing variance of motion vector magnitude • Motion Vectors extracted from compressed bitstream • Successfully tested with subjectively constructed Ground Truth
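Below is a minimal sketch of how such an intensity value could be computed, assuming block-matched motion vectors for a frame are already available. The quantization thresholds here are illustrative placeholders, not the normative values of the standard.

```python
import numpy as np

def motion_activity_intensity(motion_vectors, thresholds=(4.0, 11.0, 17.0, 32.0)):
    """Quantize the spread of motion-vector magnitudes into the 5-level scale.

    motion_vectors: (N, 2) array of block-matched (dx, dy) vectors for one frame.
    thresholds: illustrative cut points on the variance, not the standard's values.
    """
    magnitudes = np.linalg.norm(motion_vectors, axis=1)
    variance = magnitudes.var()                 # variance of motion-vector magnitudes
    # searchsorted maps the variance onto levels 1 (very low) .. 5 (very high)
    return int(np.searchsorted(thresholds, variance)) + 1

# Example: a near-static "talking head" frame vs. a high-action frame
static = np.random.normal(0.0, 0.5, size=(100, 2))
action = np.random.normal(0.0, 10.0, size=(100, 2))
print(motion_activity_intensity(static), motion_activity_intensity(action))
```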
Video Summarization using Motion Activity • Video sequence V = {f1, f2, …, fN}: a set of temporally ordered frames • Any temporally ordered subset of V is a summary • Previous work: Color dominant • Cluster frames based on image similarity • Select representative frames from clusters
Motion Activity as Summarizability • Hypothesis: • Motion activity measures intensity of motion • hence it measures change in the video • Therefore it indicates Summarizability • Test of the Hypothesis • Examine relationship between Fidelity of Summary and motion activity • Results show close correlation and motivate novel summarization strategy
Test of Hypothesis • Segment the test sequence into shots • Use the first frame of each shot as its Key-Frame (KF) • Compute the fidelity of each key-frame as described • Compute the motion activity of each shot • For each MPEG-7 motion activity threshold • Identify shots that have the same or lower motion activity • Find the percentage p of shots with unacceptable fidelity (>0.2) • Plot p vs the MPEG-7 motion activity thresholds
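The test above can be written down compactly as a hedged sketch: it assumes each shot is given as a sequence of per-frame feature vectors (e.g. color histograms), that key-frame fidelity is measured as the maximum distance from the key-frame to any frame in the shot, and that the per-shot MPEG-7 intensity level (1-5) has already been computed; only the 0.2 cutoff and the overall procedure come from the slide.

```python
import numpy as np

def key_frame_fidelity(shot_features, key_index=0):
    """Worst-case distance from the key-frame (first frame) to any frame in the shot."""
    key = shot_features[key_index]
    return max(np.linalg.norm(frame - key) for frame in shot_features)

def unacceptable_percentage(shots, activity_levels, max_level, cutoff=0.2):
    """Percentage p of shots at or below max_level whose key-frame fidelity exceeds cutoff."""
    selected = [s for s, a in zip(shots, activity_levels) if a <= max_level]
    if not selected:
        return 0.0
    bad = sum(key_frame_fidelity(s) > cutoff for s in selected)
    return 100.0 * bad / len(selected)

# p is then plotted against each MPEG-7 intensity threshold:
# p_curve = [unacceptable_percentage(shots, activity_levels, lvl) for lvl in range(1, 6)]
```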
Conclusions from Experiment • The percentage of shots with unacceptable fidelity grows monotonically with motion activity • In other words, as motion activity grows, the shots become increasingly difficult to summarize • Hence, motion activity is a direct indicator of summarizability • Question: Is the first frame the best choice as a key-frame?
Optimal Key-Frame Selection Using Motion Activity • Summarizability is an indication of change in the shot • The cumulative motion activity is therefore an indication of the cumulative change in the shot
Optimal Key-Frame Selection Based on Cumulative Motion Activity
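One plausible reading of this idea (a hedged sketch, not necessarily the exact rule used in the talk) is to pick the frame at which the cumulative motion activity reaches half of the shot's total, i.e. the frame that splits the accumulated change evenly:

```python
import numpy as np

def cumulative_activity_key_frame(per_frame_activity):
    """Index of the frame where cumulative motion activity reaches half the shot total."""
    cumulative = np.cumsum(per_frame_activity)
    return int(np.searchsorted(cumulative, cumulative[-1] / 2.0))

# Example: when most of the change happens late in the shot,
# the selected key-frame moves correspondingly later.
print(cumulative_activity_key_frame([0.1, 0.1, 0.2, 2.0, 3.0]))
```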
Audio Assisted Video Browsing: Motivation • Baseline MHL visual summarization works well only when semantic segment boundaries are well defined • Semantic segment boundaries cannot be located easily using visual features alone • Audio is a rich source of content semantics • Should use audio features to locate semantic segment boundaries
Past Work • Principal Cast Identification using Audio – Wang et al • Topic Detection using Speech Recognition – Hanjalic et al • Semantic Scene Segmentation using Audio – Sundaram et al • Past work has emphasized classification of audio into crisp categories • We would like both a crisp categorization and a feature vector that allows softer classification • Generalized Sound Recognition Framework – Casey et al • Casey's work provides a rich audio-semantic framework for our research
Our Approach to Principal Cast Detection [Block diagram: MPEG-7 Generalized Sound Recognition → State Duration Histograms (our enhancement) → Principal Cast]
MHL application of Casey’s approach to News Video Browsing • Classify the audio segments of the news video into speech and non-speech categories in first pass • Classify the speech segments into male and female speech • Using K-means clustering find the “principal” speakers in each category • The occurrence of each of the principal speakers provides a natural semantic boundary • Apply baseline visual summarization technique to semantic segments obtained above • There is thus a two-level summarization of the news video
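A hedged sketch of the clustering step in this pipeline is shown below. It assumes each speech segment has already been reduced to a fixed-length audio feature vector (for instance a state-duration histogram from the sound-recognition framework); the cluster count and the rule that the largest clusters correspond to the principal speakers are illustrative assumptions, not the exact settings of the system described here.

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_speakers(speech_features, n_clusters=4, n_principal=2):
    """Cluster speech-segment features; return per-segment labels and the largest clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(speech_features)
    counts = np.bincount(labels, minlength=n_clusters)
    principal = np.argsort(counts)[::-1][:n_principal]      # most frequent speakers
    return labels, principal

# Points in time where the active principal speaker changes can then serve as the
# semantic segment boundaries fed to the baseline visual summarizer.
```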
Results and Challenges • Moderate accuracy so far • Results are thus promising but not satisfactory • Lack of noise robustness and the content dependence of the training process are the major hurdles • Currently working on eliminating such problems through extensive training • Feature extraction is too complex – currently investigating compressed-domain audio feature extraction • Also examining alternative architectures that preserve the basic spirit of the framework
Automatic Extraction of Sports Highlights • Rapid Sports Highlights extraction is critical • Past work has made use of color, camera motion etc. • MPEG-7 Motion Activity Descriptor is simple • Can use it to extract high action segments for example • Should be useful in highlight extraction
Essential Strategy • Sports are governed by a set of rules • Key events lead to surges and dips in motion activity (perceived motion) • Thus, for a given sport, we can look for certain temporal patterns of motion activity that would indicate an interesting event • In sports highlights, the emphasis is on key-events and not on key-frames
Motion Activity Curve • Shot detection is not meaningful for our purpose • Compute motion activity (average magnitude of motion vectors) for each P-frame • Smooth the values using a 10-point moving-average filter followed by a median filter • Quantize into binary levels of high and low motion using a threshold • Low threshold for golf, high for soccer
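A minimal sketch of that curve computation follows; it assumes the per-P-frame motion vectors have already been parsed from the compressed bitstream, and the median-filter width is an illustrative choice (the 10-point window and the binary thresholding come from the slide).

```python
import numpy as np
from scipy.signal import medfilt

def motion_activity_curve(p_frame_mvs, threshold, ma_window=10, med_window=5):
    """Binary high/low activity curve from per-P-frame motion vectors.

    p_frame_mvs: list of (N_i, 2) arrays, one array of motion vectors per P-frame.
    """
    raw = np.array([np.linalg.norm(mvs, axis=1).mean() for mvs in p_frame_mvs])
    smoothed = np.convolve(raw, np.ones(ma_window) / ma_window, mode="same")  # moving average
    smoothed = medfilt(smoothed, kernel_size=med_window)                      # median filter
    return (smoothed > threshold).astype(int)     # 1 = high activity, 0 = low activity
```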
Highlights extraction : Golf • Play consists of long stretches of low activity interspersed with bursts of interesting high activity • Look for rising edges in the quantized motion activity curve • Concatenate ten second segments beginning at each of the points of interest marked above • The concatenation forms the desired summary
Highlights Extraction: Soccer • Play consists of long stretches of high activity • Interesting events lead to non-trivial stops in play leading to a short stretch of low MA • Thus we look for falling edges followed by a non-trivially long stretch of low motion activity • We are able to find the interesting events this way but have many false alarms • With our interface false alarms are easy to skip
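The edge detection on that binary curve can be sketched as below (hedged: the minimum length of the low-activity stretch after a soccer stop is an illustrative parameter, and indices are in P-frames rather than seconds).

```python
import numpy as np

def golf_events(binary_curve):
    """Rising edges (low -> high): candidate start points of golf highlights."""
    return np.where(np.diff(binary_curve) == 1)[0] + 1

def soccer_events(binary_curve, min_low_run=30):
    """Falling edges followed by a sufficiently long stretch of low activity."""
    events = []
    for i in np.where(np.diff(binary_curve) == -1)[0] + 1:
        run = np.asarray(binary_curve[i:i + min_low_run])
        if len(run) == min_low_run and not run.any():     # non-trivial stop in play
            events.append(int(i))
    return events

# For golf, a ten-second clip starting at each detected event is concatenated
# into the highlight summary, as described above.
```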
Strengths and Limitations of Our Approach • The extraction is rapid and can be done in real time • We use an adaptively computed threshold that is suited to the content • An interface such as ours helps skip false alarms easily • There are too many false alarms
Summary of Sports Highlights Generation • Motion Activity provides a quick way to generate sports highlights • We use a different strategy with each sport • The simplicity of the technique allows real-time tuning of thresholds to modify highlights • Interactive interfaces enable effective use
PVR: Personal Video Recorder [Block diagram: video codec and local storage feeding feature extraction & MPEG-7 indexing, which drives browsing & summarization through an enhanced user interface] With massive amounts of locally stored content, we need to locate and customize content according to the user
Blind Summarization – A Video Mining Approach to Video Summarization Ajay Divakaran and Kadir A. Peker Mitsubishi Electric Research Laboratories Murray Hill, NJ
Content Mining • What is Data Mining? • It is the discovery of patterns and relationships in data. • Makes heavy use of statistical learning techniques such as regression and classification • Has been successfully applied to numerical data • Application to multimedia content is the next logical step • Most applicable to stored surveillance video and home video since patterns are not known a priori • Should enable anomalous event detection leading to highlight generation • Not applicable at first glance to consumer video
Content Mining vs. Typical Data Mining • Commonalities • Large data sets. Video is well known to produce huge volumes of data • Amenable to statistical analysis – Many of the machine learning tools work well with both kinds of data as can be seen in the literature and our research as well • Differences • Number of features not necessarily as large as conventional data mining data sets • Size of dataset not necessarily as large as conventional data mining data sets • Popular data mining techniques such as CART may not be directly applicable and may need modification • In summary, new mining techniques that retain the basic philosophy while customizing the details will have to be developed
Summarization cast as a Content Mining Problem • The DVD "Auto-Summarization" mode inspires "blind summarization" • Content summarization can be cast as follows: • Classify segments into common and uncommon events without necessarily knowing the domain • Common patterns – what this video is about • Rare patterns – possibly interesting events • May help to categorize video, detect style... • The summary is then a combination of common and rare events • Can hybridize with domain-dependent techniques
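A hedged illustration of the common-versus-rare split: assuming each video segment is represented by a feature vector, the segments can be clustered and small clusters treated as rare, possibly interesting, events. The cluster count and the rarity cutoff are illustrative choices, not the talk's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_common_rare(segment_features, n_clusters=5, rare_fraction=0.1):
    """Boolean masks over segments: (common, rare), where rare = member of a small cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(segment_features)
    counts = np.bincount(labels, minlength=n_clusters)
    rare_clusters = np.where(counts < rare_fraction * len(labels))[0]
    rare = np.isin(labels, rare_clusters)
    return ~rare, rare
```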
Data Mining Basics • Associations • Time series similarity • Sequential patterns • Clustering • "How do regions A and B differ?", "Any anomalies in A?", "What goes with item x?" • Marketing, molecular biology, etc.
Associations • A set of items i1..im; a set of transactions, each containing a subset of the items; a database of transactions • Rule X → Y (X, Y sets of items): • Support s: s% of transactions have X and Y together • Confidence c: c% of the time, buying X implies buying Y • Improvement: ratio of P(X,Y) to P(X)*P(Y) • Find all rules with support, confidence, and improvement larger than specified thresholds • Continuous-valued extension exists
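The three rule measures can be made concrete with a small sketch over a toy transaction database (item names and numbers below are illustrative only):

```python
def rule_measures(transactions, X, Y):
    """Support, confidence, and improvement for the rule X -> Y over a list of item sets."""
    n = len(transactions)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    with_x = sum(1 for t in transactions if X <= t)
    with_y = sum(1 for t in transactions if Y <= t)
    support = both / n
    confidence = both / with_x if with_x else 0.0
    improvement = support / ((with_x / n) * (with_y / n)) if with_x and with_y else 0.0
    return support, confidence, improvement

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(rule_measures(baskets, {"bread"}, {"milk"}))   # support 0.5, confidence ~0.67, improvement ~0.89
```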
Some Basic Aspects • Unsupervised learning • Similar to clustering vs. classification • Estimation of joint probability density • Find values of (i1,i2,…,in) where P(i1, i2,…,in) is high