
Movie Content Analysis, Indexing and Skimming

This article discusses the objective of content-based video analysis and the limitations of supervised identification. It explores the use of integrated media data for better analysis, indexing, and skimming of video content. The article also covers approaches for event extraction and speaker identification in movies.


Presentation Transcript


  1. Movie Content Analysis, Indexing and Skimming 김덕주(Duck Ju Kim)

  2. Problems • What is the objective of content-based video analysis? • Why is supervised identification limited? • Why should integrated media data be used?

  3. Introduction • Analysis • Structured organization • Embedded semantics • Indexing • Tagging semantic units • Limited machine perception • Skimming • Abstraction & Presentation • Video browsing

  4. Event Detection Approach • Shot detection • Low-level structure • Does not correspond directly to video semantics • Scene extraction • Higher-level context • Includes much unimportant content • Event extraction • Higher semantic level • Better reveals, represents, and abstracts the content

  5. Speaker Identification Approach • Standard speech databases • YOHO, HUB4, SWITCHBOARD • Integration from media cues • Speaker recognition + Facial analysis • Speech cues + Visual cues • Supervised Identification • Fixed speaker models • Insufficient training data • Data collection before processing

  6. Video Skimming Approach • Pre-developed schemes • Discontinuous semantic flow • Embedded audio cues ignored • Computation of six types of features • Importance evaluation • Assembling important events

  7. Content Pre-analysis • Shot detection • Color histogram-based approach • Extract keyframes • The first and last frames • Audio content • Classification • Silence, speech, music, environmental sounds • Visual content • Detect human faces
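The color histogram-based shot detection and first/last keyframe extraction on slide 7 can be sketched roughly as follows. This is an illustration only, not the presentation's actual algorithm: the frame representation (a list of RGB tuples), the bin count, the L1 histogram distance, the threshold value, and all function names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed frame format: a flat list of 8-bit RGB tuples


def color_histogram(frame: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram of one frame."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in frame:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two histograms (an assumed choice of metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: Sequence[Sequence[Pixel]], threshold: float = 0.5) -> List[int]:
    """Return every index i at which frame i is assumed to start a new shot."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]


def keyframes(num_frames: int, boundaries: Sequence[int]) -> List[Tuple[int, int]]:
    """First and last frame index of each shot, as on the slide."""
    if num_frames == 0:
        return []
    starts = [0] + list(boundaries)
    ends = [b - 1 for b in boundaries] + [num_frames - 1]
    return list(zip(starts, ends))
```

A fixed threshold is the simplest possible choice; a real detector would likely need a threshold tuned to the material, and gradual transitions are not handled here.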

  8. Movie Event Extraction • Develop thematic topics • Through actions or dialogs • What to extract? • Two-speaker dialogs • Multiple-speaker dialogs • Hybrid Events

  9. Movie Event Extraction • How to extract? • Shot sink computation • Grouping close and similar shots • Sink clustering and characterization • Periodic, partly-periodic, non-periodic • Event extraction and classification • Post-processing

  10. Shot Sink Computation • Pool of close and similar shots • Using Visual Information • Window-based Sweep Algorithm
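One way to picture a window-based sweep that pools close and similar shots is sketched below. The slide does not give the algorithm's details, so everything here is an assumption: shots are represented by feature vectors (for example keyframe histograms), similarity is an L1 distance under a threshold, "close" means within `window` shots of the last member added, and each shot joins at most one sink.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed visual dissimilarity between two shot feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def compute_shot_sinks(features: Sequence[Sequence[float]],
                       window: int = 5,
                       threshold: float = 0.3) -> List[List[int]]:
    """Group shot indices into sinks of temporally close, visually similar shots."""
    n = len(features)
    assigned = [False] * n
    sinks: List[List[int]] = []
    for i in range(n):
        if assigned[i]:
            continue
        sink = [i]
        assigned[i] = True
        last = i  # most recent member; the window sweeps forward from it
        j = i + 1
        while j < n and j - last <= window:
            if not assigned[j] and l1_distance(features[i], features[j]) <= threshold:
                sink.append(j)
                assigned[j] = True
                last = j
            j += 1
        sinks.append(sink)
    return sinks
```

For an alternating A-B-A-B-A shot pattern, as in a two-speaker dialog, this sketch yields one sink for the A shots and one for the B shots.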

  11. Shot Sink Clustering • Clustering & Characterizing • Periodic, Partly-periodic, Non-periodic • Degree of shot repetition • Determining the sink periodicity • Calculate relative temporal distance • Compute mean μ, standard deviation σ • Grouping with K-means algorithm
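The periodicity steps on slide 11 (relative temporal distances, their mean μ and standard deviation σ, then K-means grouping) can be illustrated with the sketch below. It assumes that a sink is a sorted list of shot indices, that the "relative temporal distance" is the gap between consecutive shots in the sink, and that K-means runs on (μ, σ) pairs from caller-supplied initial centroids; none of these details are stated in the transcript.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sink_features(shot_indices: Sequence[int]) -> Point:
    """Mean and standard deviation of the gaps between consecutive shots in a sink."""
    gaps = [b - a for a, b in zip(shot_indices, shot_indices[1:])]
    if not gaps:
        return (0.0, 0.0)
    return (float(mean(gaps)), float(pstdev(gaps)))


def sq_dist(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: Sequence[Point], centroids: Sequence[Point], iterations: int = 20) -> List[int]:
    """Plain K-means; returns the cluster label assigned to each point."""
    centers = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(mean(dim) for dim in zip(*members))
    return labels
```

A perfectly regular sink such as shots 0, 2, 4, 6, 8 gives σ = 0, which is the intuition behind calling it periodic; which cluster maps to "periodic", "partly-periodic", or "non-periodic" is left to the caller in this sketch.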

  12. Integrating Speech & Face Information • False alarms • Montage presentation -> Spoken dialog • Multiple-speaker dialog -> Two-speaker dialog • Solutions for reducing false alarms • Embedded audio information integration • Speech shot ratio calculation • Facial cue inclusion • Face detection

  13. Adaptive Speaker Identification • Shot detection & Audio classification • Face detection & Mouth tracking • Speech segmentation / clustering • Initial speaker modeling • Audiovisual-based speaker identification • Unsupervised speaker model adaptation

  14. Face Detection & Mouth Tracking • Detection & Recognition of talking faces • Distance between eyes and mouth : dist • Eyes’ position : (x1, y1), (x2, y2) • Mouth center : (x, y)
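The transcript names the quantities on slide 14 but does not preserve the formula for dist. One plausible reading, assumed here and not confirmed by the source, is the Euclidean distance from the mouth center (x, y) to the midpoint between the two eyes (x1, y1) and (x2, y2):

```python
import math
from typing import Tuple

Point2D = Tuple[float, float]


def eye_mouth_distance(eye1: Point2D, eye2: Point2D, mouth: Point2D) -> float:
    """Distance from the mouth center to the midpoint of the eyes (assumed definition)."""
    mid_x = (eye1[0] + eye2[0]) / 2.0
    mid_y = (eye1[1] + eye2[1]) / 2.0
    return math.hypot(mouth[0] - mid_x, mouth[1] - mid_y)
```

A value like this could serve to scale a mouth search region to the face size, though how the presentation actually uses dist is not stated.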

  15. Speech Segmentation

  16. Speech Clustering • Two separate segments X1, X2 • Joined segment X = {X1, X2} • For a cluster C containing n homogeneous speech segments: Dist(X, C) = (formula not preserved) • Negative value -> considered to be from the same speaker

  17. Initial Speaker Modeling • Required for identification process • Exploiting the inter-relations between facial and speech cues • For each target cast member A • Find a speech shot where A is talking • Collect all the speech segments • Build initial model • Gaussian Mixture Model (GMM)

  18. Likelihood-based speaker identification • GMM model notation, j = 1, 2, …, m • For the i-th enrolled speaker • The log likelihood between X and M_i
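Likelihood-based identification with GMMs can be sketched as below. The slide's own notation is not preserved, so this uses the standard definition: the log likelihood of a feature sequence X under a model is the sum over frames of the log of the weighted mixture density. Diagonal covariances, the `(weight, mean, variance)` component layout, and the function names are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, mean vector, diagonal variance vector).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(X: Sequence[Sequence[float]], gmm: Sequence[Component]) -> float:
    """Sum over frames of log(sum_j w_j * N(x; mu_j, var_j)), computed with log-sum-exp."""
    total = 0.0
    for x in X:
        logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total


def identify_speaker(X: Sequence[Sequence[float]], models: Dict[str, List[Component]]) -> str:
    """Pick the enrolled speaker whose model gives X the highest log likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(X, models[name]))
```

The log-sum-exp step avoids numerical underflow when the per-component densities are very small, which is common with high-dimensional speech features.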

  19. Audiovisual integration for speaker identification • Finalizing the speaker identification task • Integration of audio and video cues • Examine the existence of temporal overlap • Overlap ratio > Threshold • Assign face vector to cluster • Otherwise, set face vector to null • Speaker Identity
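The temporal-overlap test on slide 19 might look like the sketch below. The transcript does not define the overlap ratio or the threshold, so two assumptions are made: the ratio is the overlapping duration divided by the speech segment's duration, and the threshold defaults to 0.5. Intervals are `(start, end)` pairs in seconds, and the names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap_ratio(speech: Interval, face: Interval) -> float:
    """Fraction of the speech segment covered by the talking-face interval."""
    duration = speech[1] - speech[0]
    if duration <= 0:
        return 0.0
    overlap = min(speech[1], face[1]) - max(speech[0], face[0])
    return max(0.0, overlap) / duration


def face_vector_for_segment(speech: Interval,
                            face: Interval,
                            face_vector: Sequence[float],
                            threshold: float = 0.5) -> Optional[Sequence[float]]:
    """Keep the face vector only if the overlap is large enough; otherwise null (None)."""
    if overlap_ratio(speech, face) > threshold:
        return face_vector
    return None
```

Returning `None` mirrors the slide's "set face vector to null", so that a later step can fall back on the audio cue alone.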

  20. Unsupervised Speaker Model Adaptation • Updating the speaker model • Three approaches • Average-based model adaptation • MAP-based model adaptation • Viterbi-based model adaptation

  21. Average-based Model Adaptation • Compute BIC distances • Compare between dmin and threshold T • dmin < T : • dmin > T : Initialize new mixture component • Update the weight for each component

  22. MAP-based Model Adaptation • μi : Mean of bid • Li: Occupation likelihood of the adaptation data • μ-bar : Mean of the observed adaptation data
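The update equation for slide 22 is not preserved in the transcript. The sketch below uses the common textbook form of MAP mean adaptation, which may differ from the presentation's exact formula: the adapted mean is a weighted blend of the prior mean μi and the observed mean μ-bar, with the weight driven by the occupation likelihood Li. The prior weight `tau` is an assumption that does not appear on the slide.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   observed_mean: Sequence[float],
                   occupation: float,
                   tau: float = 10.0) -> List[float]:
    """Standard MAP mean update: (L / (L + tau)) * mu_bar + (tau / (L + tau)) * mu_i.

    occupation is L_i; tau (assumed, not from the slide) controls how strongly
    the prior mean resists the adaptation data.
    """
    weight = occupation / (occupation + tau)
    return [weight * obs + (1.0 - weight) * prior
            for prior, obs in zip(prior_mean, observed_mean)]
```

With little adaptation data (small Li) the result stays near the existing model, and with a lot of data it moves toward the observed mean, which is the behavior that makes MAP suitable for incremental speaker model updates.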

  23. Viterbi-based Model Adaptation • Allows different feature vectors from different components • Hard decision • Any vector can either occupy component or not • Indicator function instead of probability function • Mixture component

  24. Event-based Movie Skimming • Event feature extraction • Six types of mid- to high-level features • Evaluation of importance • Movie skim generation • Assemble major events -> final skim

  25. Event Feature Extraction • Music Ratio • Speech Ratio • Sound Loudness • Action Level • Normalized by dividing by the largest value • Present Cast • Theme Topic

  26. Event Feature Extraction • M : # of features extracted • N : # of events • a_{i,j} : value of the j-th feature in the i-th event

  27. Movie Skim Generation • Choosing important events • User’s feature preference • Event importance vector
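Slides 25 to 27 describe an N x M event-feature matrix, a user's feature preference, and an event importance vector. A minimal sketch of how these could fit together is given below. The transcript does not give the scoring formula, so the weighted sum used here is an assumption, as are the function names and the choice to normalize every feature column by its largest value.

```python
from typing import List, Sequence


def normalize_columns(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Divide each feature column by its largest value (all-zero columns are left as-is)."""
    maxima = [max(col) or 1.0 for col in zip(*matrix)]
    return [[v / m for v, m in zip(row, maxima)] for row in matrix]


def event_importance(matrix: Sequence[Sequence[float]],
                     preference: Sequence[float]) -> List[float]:
    """Importance of event i as sum_j a[i][j] * w[j] (assumed weighted-sum scoring)."""
    return [sum(a * w for a, w in zip(row, preference)) for row in matrix]


def select_events(importance: Sequence[float], k: int) -> List[int]:
    """Indices of the k most important events, returned in temporal order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])
```

Returning the selected events in temporal order matters for the skim: assembling them by rank instead would break the semantic flow that the event-based approach is meant to preserve.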

  28. Event Detection Results • Correctness of the event classification • System performance evaluation • Hybrid class excluded

  29. Speaker Identification Results • Evaluation of adaptive speaker identification system • False acceptance (FA) • False rejection (FR) • Identification accuracy (IA)

  30. Average-based, MAP-based, Viterbi-based

  31. Movie Skimming Results • Difficulties of Qualitative evaluation • Quantitative measure based on user study • 5-point scale : 1~5 • Visual comprehension • Audio comprehension • Semantic continuity • Good abstraction • Quick browsing • Video skipping
