  1. Video Indexing and Summarization using Combinations of the MPEG-7 Motion Activity Descriptor with other MPEG-7 audio-visual descriptors Ajay Divakaran MERL - Mitsubishi Electric Research Labs Murray Hill, NJ

  2. Outline • Introduction • MPEG-7 Standard • Motivation for proposed techniques • Video Summarization using Motion Activity • Audio Assisted Video Summarization • Principal Cast Detection with MPEG-7 Audio Features • Automatic generation of Sports Highlights • Target Applications • Personal Video Recorder • Demonstration • Initial work on Video Mining • Conclusion

  3. Team • Yours Truly • Kadir A. Peker – Colleague and Ex-Doctoral Student • Regunathan Radhakrishnan – Current Doctoral Student • Romain Cabasson – Summer Intern • Ziyou Xiong – Summer Intern and Current Collaborator • Padma Akella – Initial Demo designer and developer • Pradubkiat Bouklee – Initial Software developer

  4. MPEG-7 Objectives • To develop a standard to identify and describe multimedia content • Formal name: Multimedia Content Description Interface • Enable quick access to desired content, whether local or remote

  5. MPEG-7: Key Technologies and Scope • [Diagram: the standard's scope spans description production and description consumption]

  6. MPEG-7 and other Standards • [Diagram positioning the standards between emphasis on subjective representation and emphasis on semantic conveyance: MPEG-1, MPEG-2 (studio, DTV), H.263, JPEG, and JPEG-2000 at the representation end; MPEG-4 SNHC object-based hybrid content (interactive TV, video conferencing); MPEG-7 descriptors for indexing, retrieving, and browsing at the semantic end; abstract representation extending toward visualization and virtual reality functionality]

  7. MPEG-7 framework • MPEG-7 standardizes: • Descriptors (Ds): representations of features • to describe various types of features of multimedia information • to define the syntax and the semantics of each feature representation • Description Schemes (DSs) • to specify pre-defined structures and semantics of descriptors and their relationships • Description Definition Language (DDL) • to allow the creation of new DSs and, possibly, Ds, and to allow the extension and modification of existing DSs – XML-based MPEG-7 Schema

  8. MPEG-7 Motion Activity Descriptor • Feature Extraction from Video • Uncompressed Domain • Color Histograms - Zhang et al • Motion Estimation - Kanade et al • Compressed Domain • DC Images - Yeo et al, Kobla et al • Motion Vector Based - Zhang et al • Bit Allocation - Feng et al, Divakaran et al

  9. Motivation for Compressed Domain Extraction • Compressed domain feature extraction is fast. • Block-matched motion vectors are sufficient for gross description. • Motion vector based calculation can be easily normalized w.r.t. encoding parameters.

  10. Motivation for Descriptor • Need to capture “pace” or Intensity of activity • For example, draw distinction between • “High Action” segments such as chase scenes. • “Low Action” segments such as talking heads • Emphasize simple extraction and matching • Use Gross Motion Characteristics thus avoiding object segmentation, tracking etc. • Compressed domain extraction is important

  11. Proposed Motion Activity Descriptor • Attributes of Motion Activity Descriptor • Intensity/Magnitude - 3 bits • Spatial Characteristics - 16 bits • Temporal Characteristics - 30 bits • Directional Characteristics - 3 bits

  12. MPEG-7 Intensity of Motion Activity • Expresses “pace” or intensity of action • Uses a scale of 1-5: very low - low - medium - high - very high • Extracted by suitably quantizing the variance of motion vector magnitudes • Motion vectors extracted from the compressed bitstream • Successfully tested against subjectively constructed ground truth
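The extraction step above can be sketched as follows. The quantizer boundaries below are illustrative placeholders, not the thresholds standardized in MPEG-7:

```python
import numpy as np

def motion_intensity(motion_vectors, thresholds=(0.5, 2.0, 5.0, 10.0)):
    """Map the variance of motion-vector magnitudes to a 1-5 activity level.

    `thresholds` are illustrative quantizer boundaries, not the values
    standardized in MPEG-7.
    """
    mags = np.linalg.norm(np.asarray(motion_vectors, dtype=float), axis=1)
    var = mags.var()
    # searchsorted counts how many boundaries the variance exceeds
    return int(np.searchsorted(thresholds, var)) + 1

# A near-static frame (talking head) vs. a high-action frame (chase scene)
low = motion_intensity([(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1)])
high = motion_intensity([(0, 0), (8, 1), (-6, 5), (12, -9)])
```

Because the descriptor uses only block-matched motion vectors from the bitstream, this runs without any pixel-domain decoding.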

  13. Video Summarization using Motion Activity • Video sequence V: {f1, f2, …, fN}, a set of temporally ordered frames • Any temporally ordered subset of V is a summary • Previous work is color-dominant: • Cluster frames based on image similarity • Select representative frames from clusters

  14. Motion Activity as Summarizability • Hypothesis: • Motion activity measures intensity of motion • hence it measures change in the video • Therefore it indicates Summarizability • Test of the Hypothesis • Examine relationship between Fidelity of Summary and motion activity • Results show close correlation and motivate novel summarization strategy

  15. Fidelity of a Summary
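The fidelity figure from this slide is not reproduced in the transcript. One common formulation, consistent with the 0.2 dissimilarity cut-off used in the experiment, is a semi-Hausdorff-style distance between a shot's frames and its key-frame set; the function and feature representation below are illustrative assumptions:

```python
import numpy as np

def summary_dissimilarity(frames, key_frames):
    """Semi-Hausdorff-style distance between a shot and its key-frame set.

    Each frame is represented by a feature vector (e.g. a color histogram).
    The summary is poor when some frame is far from every key-frame.
    """
    frames = np.asarray(frames, dtype=float)
    key_frames = np.asarray(key_frames, dtype=float)
    # distance from each frame to its nearest key-frame
    d = np.linalg.norm(frames[:, None, :] - key_frames[None, :, :], axis=2)
    return float(d.min(axis=1).max())

d = summary_dissimilarity(frames=[[0, 0], [0.1, 0], [1, 1]],
                          key_frames=[[0, 0], [1, 1]])
```

Under this measure a lower value means higher fidelity: every frame of the shot is well represented by some key-frame.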

  16. Test of Hypothesis • Segment the test sequence into shots • Use the first frame of each shot as its Key-Frame (KF) • Compute the fidelity of each key-frame as described • Compute the motion activity of each shot • For each MPEG-7 motion activity threshold • Identify shots that have the same or lower motion activity • Find the percentage p of shots with unacceptable fidelity (>0.2) • Plot p vs the MPEG-7 motion activity thresholds
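The procedure above can be sketched as follows, taking per-shot (activity level, fidelity distance) pairs as input; the data values are made up for illustration:

```python
def fidelity_vs_activity(shots, activity_thresholds, bad=0.2):
    """For each MPEG-7 activity threshold, the percentage p of shots at or
    below that activity level whose key-frame fidelity is unacceptable.

    `shots` is a list of (activity_level, fidelity_distance) pairs; the
    0.2 cut-off for "unacceptable" comes from the slide above.
    """
    curve = []
    for t in activity_thresholds:
        selected = [f for a, f in shots if a <= t]
        bad_count = sum(1 for f in selected if f > bad)
        curve.append(100.0 * bad_count / len(selected) if selected else 0.0)
    return curve

shots = [(1, 0.05), (2, 0.15), (3, 0.25), (4, 0.30), (5, 0.40)]
p = fidelity_vs_activity(shots, [1, 2, 3, 4, 5])
```

On this toy data the curve rises monotonically with the activity threshold, which is the behavior the experiment reports.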

  17. Motion Activity as a Measure of Summarizability

  18. Conclusions from Experiment • The percentage of shots with unacceptable fidelity grows monotonically with motion activity • In other words, as motion activity grows, the shots become increasingly difficult to summarize • Hence, motion activity is a direct indicator of summarizability • Question: Is the first frame the best choice as a key-frame?

  19. Optimal Key-Frame Selection Using Motion Activity • Summarizability is an indication of change in the shot • The cumulative motion activity is therefore an indication of the cumulative change in the shot
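One natural selection rule consistent with these bullets (an assumption here, not a quote from the slides) picks the frame at which cumulative motion activity reaches half of the shot total, i.e. the frame that splits the shot's change evenly:

```python
import numpy as np

def key_frame_by_cumulative_activity(frame_activity):
    """Pick the frame at which cumulative motion activity reaches half of
    the shot total.  The halfway rule is one plausible reading of the
    slides, not a quoted specification.
    """
    cumulative = np.cumsum(np.asarray(frame_activity, dtype=float))
    return int(np.searchsorted(cumulative, cumulative[-1] / 2.0))

# Activity concentrated at the end pushes the key-frame later in the shot
idx = key_frame_by_cumulative_activity([0.1, 0.1, 0.1, 5.0, 5.0])
```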

  20. Optimal Key-Frame Extraction Using Motion Activity

  21. Comparison with Opt. Fidelity KF

  22. Optimal Key-Frame Selection Based on Cumulative Motion Activity

  23. Audio Assisted Video Browsing: Motivation • Baseline MHL visual summarization works well only when semantic segment boundaries are well defined • Semantic segment boundaries cannot be located easily using visual features alone • Audio is a rich source of content semantics • Should use audio features to locate semantic segment boundaries

  24. Past Work • Principal Cast Identification using Audio – Wang et al • Topic Detection using Speech Recognition – Hanjalic et al • Semantic Scene Segmentation using Audio – Sundaram et al • Past work has emphasized classification of audio into crisp categories • We would like both a crisp categorization and a feature vector that allows softer classification • Generalized Sound Recognition Framework – Casey et al • Casey’s work provides a rich audio-semantic framework for our research

  25. MPEG-7 Feature Extraction for Generalized Sound Recognition

  26. Our approach to Principal Cast Detection MPEG-7 Generalized Sound Recognition State Duration Histograms Our Enhancement Principal Cast

  27. Proposed Audio-Assisted Video Browsing Framework

  28. Audio-Assisted Video Browsing Framework

  29. MHL application of Casey’s approach to News Video Browsing • In the first pass, classify the audio segments of the news video into speech and non-speech categories • Classify the speech segments into male and female speech • Using K-means clustering, find the “principal” speakers in each category • The occurrence of each of the principal speakers provides a natural semantic boundary • Apply the baseline visual summarization technique to the semantic segments obtained above • This yields a two-level summarization of the news video
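Once the speech segments have been clustered, the boundary-detection step can be sketched as follows; the speaker labels and function name are illustrative, not from the slides:

```python
def semantic_boundaries(segment_labels, principal_speakers):
    """Mark a semantic boundary wherever a principal speaker starts talking.

    `segment_labels` holds the per-audio-segment cluster id produced by
    the K-means step; the labels used here are illustrative.
    """
    boundaries = []
    prev = None
    for i, label in enumerate(segment_labels):
        # a boundary is the onset of a principal speaker's turn
        if label in principal_speakers and label != prev:
            boundaries.append(i)
        prev = label
    return boundaries

b = semantic_boundaries(["anchor", "music", "anchor", "guest", "anchor"],
                        {"anchor"})
```

The baseline visual summarizer is then run independently on each stretch between consecutive boundaries, giving the two-level summary.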

  30. Clustering Results for Male Principal Cast

  31. Results and Challenges • Moderate accuracy so far • Results are thus promising but not satisfactory • Lack of noise robustness and the content dependence of the training process are major hurdles • Currently working on eliminating these problems through extensive training • Feature extraction is too complex – currently investigating compressed-domain audio feature extraction • Also examining alternative architectures that preserve the basic spirit of the framework

  32. Automatic Extraction of Sports Highlights • Rapid Sports Highlights extraction is critical • Past work has made use of color, camera motion etc. • MPEG-7 Motion Activity Descriptor is simple • Can use it to extract high action segments for example • Should be useful in highlight extraction

  33. Essential Strategy • Sports are governed by a set of rules • Key events lead to surges and dips in motion activity (perceived motion) • Thus, for a given sport, we can look for certain temporal patterns of motion activity that would indicate an interesting event • In sports highlights, the emphasis is on key-events and not on key-frames

  34. Motion Activity Curve • Shot detection is not meaningful for our purpose • Compute motion activity (average magnitude of motion vectors) for each P-frame • Smooth the values using a 10-point moving-average filter followed by a median filter • Quantize into binary levels of high and low motion using a threshold • Low threshold for golf, high for soccer
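A minimal sketch of this pipeline, assuming a 5-point median window (the slide gives the moving-average length but not the median filter length):

```python
import numpy as np

def binary_activity_curve(p_frame_activity, threshold, window=10, med=5):
    """Smooth per-P-frame average motion-vector magnitudes and quantize
    them to high (1) / low (0) activity, as on the slide.

    The 10-point moving average matches the slide; the 5-point median
    window is an assumed value.
    """
    x = np.asarray(p_frame_activity, dtype=float)
    # 10-point moving-average filter
    kernel = np.ones(window) / window
    x = np.convolve(x, kernel, mode="same")
    # median filter: replace each sample by the median of its neighborhood
    half = med // 2
    padded = np.pad(x, half, mode="edge")
    x = np.array([np.median(padded[i:i + med]) for i in range(len(x))])
    # binary quantization: low threshold for golf, high for soccer
    return (x > threshold).astype(int)

curve = binary_activity_curve([0.1] * 30 + [5.0] * 30, threshold=2.0)
```

All quantities come from the compressed bitstream, so the curve can be computed faster than real time.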

  35. Activity Curves for Golf

  36. Activity Curve for Soccer

  37. Highlights extraction : Golf • Play consists of long stretches of low activity interspersed with bursts of interesting high activity • Look for rising edges in the quantized motion activity curve • Concatenate ten second segments beginning at each of the points of interest marked above • The concatenation forms the desired summary
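The golf rule above can be sketched as follows; `fps` is an assumed encoding parameter, and the indices are P-frame positions on the quantized curve:

```python
def golf_highlights(binary_curve, fps=30, clip_seconds=10):
    """Start a highlight clip at each rising edge (0 -> 1) of the
    quantized motion-activity curve; the ten-second clip length is from
    the slide, the frame rate is an assumption.
    """
    clip_len = fps * clip_seconds
    starts = [i for i in range(1, len(binary_curve))
              if binary_curve[i - 1] == 0 and binary_curve[i] == 1]
    # concatenating these clips yields the desired summary
    return [(s, min(s + clip_len, len(binary_curve))) for s in starts]

clips = golf_highlights([0] * 5 + [1] * 3 + [0] * 4 + [1] * 2,
                        fps=1, clip_seconds=3)
```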

  38. Highlights Extraction: Soccer • Play consists of long stretches of high activity • Interesting events lead to non-trivial stops in play leading to a short stretch of low MA • Thus we look for falling edges followed by a non-trivially long stretch of low motion activity • We are able to find the interesting events this way but have many false alarms • With our interface false alarms are easy to skip
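The soccer rule can be sketched the same way; `min_low_run`, the length that makes a stop in play "non-trivial", is an assumed tuning parameter (the slides describe it as adaptively computed):

```python
def soccer_highlights(binary_curve, min_low_run=5):
    """Mark an event at each falling edge (1 -> 0) of the quantized
    motion-activity curve that is followed by a non-trivially long run
    of low activity.
    """
    events = []
    for i in range(1, len(binary_curve)):
        if binary_curve[i - 1] == 1 and binary_curve[i] == 0:
            # measure the length of the low-activity run that follows
            run, j = 0, i
            while j < len(binary_curve) and binary_curve[j] == 0:
                run += 1
                j += 1
            if run >= min_low_run:
                events.append(i)
    return events

events = soccer_highlights([1] * 4 + [0] * 6 + [1] * 3 + [0] * 2 + [1] * 2,
                           min_low_run=5)
```

False alarms survive this filter, which is why the interactive interface for skipping them matters.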

  39. Strengths and Limitations of Our Approach • The extraction is rapid and can be done in real time • We use an adaptively computed threshold that is suited to the content • An interface such as ours helps skip false alarms easily • There are too many false alarms

  40. Current Approach to Extraction of Soccer Highlights

  41. Summary of Sports Highlights Generation • Motion Activity provides a quick way to generate sports highlights • We use a different strategy with each sport • The simplicity of the technique allows real-time tuning of thresholds to modify highlights • Interactive interfaces enable effective use

  42. PVR: Personal Video Recorder • [Diagram components: Local Storage, Video Codec, Feature Extraction & MPEG-7 Indexing, Browsing & Summarization, Enhanced User Interface] • With massive amounts of locally stored content, need to locate and customize content according to the user

  43. Blind Summarization – A Video Mining Approach to Video Summarization Ajay Divakaran and Kadir A. Peker Mitsubishi Electric Research Laboratories Murray Hill, NJ

  44. Content Mining • What is Data Mining? • It is the discovery of patterns and relationships in data. • Makes heavy use of statistical learning techniques such as regression and classification • Has been successfully applied to numerical data • Application to multimedia content is the next logical step • Most applicable to stored surveillance video and home video since patterns are not known a priori • Should enable anomalous event detection leading to highlight generation • Not applicable at first glance to consumer video

  45. Content Mining vs. Typical Data Mining • Commonalities • Large data sets. Video is well known to produce huge volumes of data • Amenable to statistical analysis – Many of the machine learning tools work well with both kinds of data as can be seen in the literature and our research as well • Differences • Number of features not necessarily as large as conventional data mining data sets • Size of dataset not necessarily as large as conventional data mining data sets • Popular data mining techniques such as CART may not be directly applicable and may need modification • In summary, new mining techniques that retain the basic philosophy while customizing the details will have to be developed

  46. Summarization cast as a Content Mining Problem • DVD “Auto-Summarization” mode inspires “blind Summarization” • Content Summarization can be cast as follows: • Classify segments into common and uncommon events without necessarily knowing the domain • Common patterns – what this video is about • Rare patterns – possibly interesting events • May help to categorize video, detect style... • The Summary is then a combination of common and rare events • Can hybridize with domain-dependent techniques

  47. Data Mining Basics • Associations • Time series similarity • Sequential patterns • Clustering • “How does region A and B differ”, “Any anomaly in A”, “What goes with item x” • Marketing, molecular biology, etc.

  48. Associations • A set of items i1..im; a set of transactions, each containing a subset of the items; a database of transactions • Rule X → Y (X, Y item sets): • Support s: s% of transactions contain X and Y together • Confidence c: c% of the transactions containing X also contain Y • Improvement: ratio of P(X,Y) to P(X)*P(Y) • Find all rules with support, confidence, and improvement larger than specified thresholds • A continuous-valued extension exists
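The three rule metrics can be computed directly from a transaction database; the market-basket data below is a made-up illustration:

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, and improvement (lift) for the rule X -> Y.

    transactions: list of item sets; x, y: item sets.
    """
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_y = sum(1 for t in transactions if y <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    # improvement > 1: X and Y co-occur more often than independence predicts
    improvement = (support / ((n_x / n) * (n_y / n))
                   if n_x and n_y else 0.0)
    return support, confidence, improvement

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"milk"}]
s, c, imp = rule_metrics(baskets, {"bread"}, {"butter"})
```

Mining keeps only rules whose three metrics all exceed the user-specified thresholds.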

  49. Some Basic Aspects • Unsupervised learning • Similar to clustering vs. classification • Estimation of joint probability density • Find values of (i1,i2,…,in) where P(i1, i2,…,in) is high
