Movie Content Analysis, Indexing and Skimming 김덕주(Duck Ju Kim)
Problems • What is the objective of content-based video analysis? • Why does supervised identification have limitations? • Why should we use integrated media data?
Introduction • Analysis • Structured organization • Embedded semantics • Indexing • Tagging semantic units • Limited machine perception • Skimming • Abstraction & Presentation • Video browsing
Event Detection Approach • Shot detection • Low-level structure • Does not correspond directly to video semantics • Scene extraction • Higher-level context • Contains much unimportant content • Event extraction • Higher semantic level • Better reveals, represents, and abstracts the content
Speaker Identification Approach • Standard speech databases • YOHO, HUB4, SWITCHBOARD • Integration of media cues • Speaker recognition + Facial analysis • Speech cues + Visual cues • Supervised identification • Fixed speaker models • Insufficient training data • Data collection required before processing
Video Skimming Approach • Previously developed schemes • Discontinuous semantic flow • Ignored embedded audio cues • Computation of six types of features • Importance evaluation • Assembling important events
Content Pre-analysis • Shot detection • Color histogram-based approach • Extract keyframes • The first and last frames • Audio content • Classification • Silence, speech, music, environmental sounds • Visual content • Detect human faces
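A minimal sketch of the pre-analysis step above, assuming a color-histogram comparison with histogram intersection; the bin counts, threshold, and function names are illustrative, not the paper's exact parameters.

```python
# Color-histogram shot detection; the first and last frame indices of each
# shot serve as its keyframes, as in the pre-analysis step.
# Requires opencv-python and numpy.
import cv2
import numpy as np

def detect_shots(video_path, sim_threshold=0.6):
    """Return [(start_frame, end_frame), ...] for each detected shot."""
    cap = cv2.VideoCapture(video_path)
    shots, start, prev_hist, idx = [], 0, None, -1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 4, 4],
                            [0, 180, 0, 256, 0, 256]).flatten()
        hist /= hist.sum() + 1e-9
        if prev_hist is not None:
            # Histogram intersection in [0, 1]; a low value marks a cut.
            if np.minimum(hist, prev_hist).sum() < sim_threshold:
                shots.append((start, idx - 1))
                start = idx
        prev_hist = hist
    if idx >= start:
        shots.append((start, idx))
    cap.release()
    return shots
```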
Movie Event Extraction • Develop thematic topics • Through actions or dialogs • What to extract? • Two-speaker dialogs • Multiple-speaker dialogs • Hybrid Events
Movie Event Extraction • How to extract? • Shot sink computation • Grouping close and similar shots • Sink clustering and characterization • Periodic, partly-periodic, non-periodic • Event extraction and classification • Post-processing
Shot Sink Computation • Pool of close and similar shots • Using Visual Information • Window-based Sweep Algorithm
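A hedged sketch of how temporally close, visually similar shots could be pooled into sinks with a window-based sweep; the window size, similarity measure, and threshold are assumptions for illustration.

```python
# Window-based sweep over the shot sequence: each unlabeled shot starts a
# sink, and similar shots within the following window are pulled into it.
import numpy as np

def build_shot_sinks(keyframe_hists, window=8, sim_threshold=0.7):
    """keyframe_hists: list of normalized histograms, one per shot.
    Returns a list of sinks, each a list of shot indices."""
    n = len(keyframe_hists)
    sink_id = [-1] * n
    sinks = []
    for i in range(n):
        if sink_id[i] == -1:           # start a new sink at an unlabeled shot
            sink_id[i] = len(sinks)
            sinks.append([i])
        # sweep a window of the following shots and pull similar ones in
        for j in range(i + 1, min(i + 1 + window, n)):
            if sink_id[j] != -1:
                continue
            sim = np.minimum(keyframe_hists[i], keyframe_hists[j]).sum()
            if sim > sim_threshold:
                sink_id[j] = sink_id[i]
                sinks[sink_id[i]].append(j)
    return sinks
```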
Shot Sink Clustering • Clustering & Characterizing • Periodic, Partly-periodic, Non-periodic • Degree of shot repetition • Determining the sink periodicity • Calculate relative temporal distance • Compute mean μ, standard deviation σ • Grouping with K-means algorithm
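A sketch of the characterization step under the stated features: each sink's relative temporal distances are summarized by mean μ and standard deviation σ, then the sinks are grouped into three classes (periodic, partly-periodic, non-periodic) with K-means; the exact feature scaling is an assumption.

```python
# Per-sink periodicity features followed by K-means grouping (k = 3).
# Requires numpy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

def characterize_sinks(sinks, shot_starts):
    """sinks: list of lists of shot indices; shot_starts: start frame of each shot."""
    feats = []
    for sink in sinks:
        times = sorted(shot_starts[i] for i in sink)
        gaps = np.diff(times).astype(float)
        if len(gaps) == 0:                 # single-shot sink
            feats.append([0.0, 0.0])
            continue
        rel = gaps / (gaps.sum() + 1e-9)   # relative temporal distances
        feats.append([rel.mean(), rel.std()])
    # low spread of relative distances suggests a periodic sink
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
```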
Integrating Speech & Face Information • False alarms • Montage presentation -> Spoken dialog • Multiple-speaker dialog -> Two-speaker dialog • Solutions for reducing false alarms • Integration of embedded audio information • Speech shot ratio calculation • Inclusion of facial cues • Face detection
Adaptive Speaker Identification • Shot detection & Audio classification • Face detection & Mouth tracking • Speech segmentation / clustering • Initial speaker modeling • Audiovisual-based speaker identification • Unsupervised speaker model adaptation
Face Detection & Mouth Tracking • Detection & Recognition of talking faces • Distance between eyes and mouth : dist • Eyes’ position : (x1, y1), (x2, y2) • Mouth center : (x, y)
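A rough geometric sketch of locating the mouth center from the detected eye positions; taking the eye-to-mouth distance dist proportional to the inter-eye distance is our assumption, and the paper's exact geometry may differ.

```python
# Estimate the mouth center (x, y) below the midpoint of the eyes, along the
# perpendicular of the eye line, at an assumed distance `dist`.
import math

def estimate_mouth_center(x1, y1, x2, y2, ratio=1.1):
    eye_dx, eye_dy = x2 - x1, y2 - y1
    eye_dist = math.hypot(eye_dx, eye_dy)
    dist = ratio * eye_dist                    # assumed eye-to-mouth distance
    mid_x, mid_y = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # unit vector perpendicular to the eye line, pointing down the face
    # (image y axis grows downward)
    px, py = -eye_dy / eye_dist, eye_dx / eye_dist
    if py < 0:
        px, py = -px, -py
    return mid_x + dist * px, mid_y + dist * py
```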
Speech Clustering • Two separate segments X1, X2 and their joined segment X = {X1, X2} • BIC-based distance: Dist(X1, X2) = (N/2)·log|Σ| − (N1/2)·log|Σ1| − (N2/2)·log|Σ2| − λP, where N = N1 + N2, Σ, Σ1, Σ2 are the covariances of X, X1, X2, and P is the BIC penalty term • The same measure gives Dist(X, C) between a segment X and a cluster C of n homogeneous speech segments • Negative value -> considered from the same speaker
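A sketch of this BIC-style distance on frame-level features; the penalty weight λ is treated as a tuning parameter.

```python
# BIC-based distance between two speech segments; a negative value suggests
# the two segments come from the same speaker.
import numpy as np

def bic_distance(X1, X2, lam=1.0):
    """X1, X2: (num_frames, dim) feature arrays (e.g., MFCCs)."""
    X = np.vstack([X1, X2])
    n1, n2, n = len(X1), len(X2), len(X1) + len(X2)
    d = X.shape[1]

    def logdet_cov(Z):
        # log-determinant of the sample covariance (small ridge for stability)
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(X)
            - 0.5 * n1 * logdet_cov(X1)
            - 0.5 * n2 * logdet_cov(X2)
            - lam * penalty)
```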
Initial Speaker Modeling • Required for the identification process • Exploiting the inter-relations between facial and speech cues • For each target cast member A • Find a speech shot where A is talking • Collect all the speech segments • Build initial model • Gaussian Mixture Model (GMM)
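A minimal sketch of building the initial speaker model, assuming MFCC features and a diagonal-covariance GMM fit with scikit-learn; the component count is illustrative.

```python
# Pool the speech segments collected from shots where the cast member is
# visibly talking and fit one GMM per target speaker.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_initial_model(speech_segments, n_components=16):
    """speech_segments: list of (num_frames, dim) MFCC arrays for one speaker."""
    X = np.vstack(speech_segments)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(X)
    return gmm

# One initial model per target cast member, e.g.
# models = {"A": build_initial_model(segments_of_A)}
```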
Likelihood-based speaker identification • GMM model notation: Mi = {wj, μj, Σj}, j = 1, 2, …, m • Mi : model for the ith enrolled speaker • Log likelihood between X = {x1, …, xT} and Mi : L(X, Mi) = Σt log p(xt | Mi), with p(x | Mi) = Σj wj · N(x; μj, Σj)
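A sketch of the likelihood scoring and the resulting identification decision, reusing GMMs of the kind built in the previous step.

```python
# Score a speech segment against each enrolled speaker's GMM and pick the
# speaker with the highest total log likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def identify_speaker(X, models):
    """X: (num_frames, dim) features; models: {speaker_name: GaussianMixture}."""
    # score_samples returns per-frame log p(x_t | M_i); sum over frames
    scores = {name: gmm.score_samples(X).sum() for name, gmm in models.items()}
    return max(scores, key=scores.get), scores
```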
Audiovisual integration for speaker identification • Finalizing the speaker identification task • Integration of audio and video cues • Examine the existence of temporal overlap • Overlap ratio > Threshold • Assign face vector to cluster • Otherwise, set face vector to null • Speaker Identity
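A sketch of the temporal-overlap test that decides whether a talking-face vector is attached to a speech cluster; the 0.5 overlap-ratio threshold is an assumed value.

```python
# If the talking-face track overlaps the speech segment strongly enough, its
# face feature vector is assigned to the speech cluster; otherwise the face
# cue is set to null.
def assign_face_to_speech(speech_span, face_span, threshold=0.5):
    """Spans are (start_time, end_time) tuples; True means assign the face."""
    s1, e1 = speech_span
    s2, e2 = face_span
    overlap = max(0.0, min(e1, e2) - max(s1, s2))
    ratio = overlap / max(e1 - s1, 1e-9)   # overlap relative to the speech segment
    return ratio > threshold
```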
Unsupervised Speaker Model Adaptation • Updating the speaker model • Three approaches • Average-based model adaptation • MAP-based model adaptation • Viterbi-based model adaptation
Average-based Model Adaptation • Compute BIC distances between the adaptation data and the existing mixture components; take the minimum dmin • Compare dmin with a threshold T • dmin < T : average the adaptation data into the closest mixture component • dmin > T : initialize a new mixture component • Update the weight for each component
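A hedged sketch of this adaptation rule; the distance measure and averaging scheme below are simplified stand-ins for illustration, not the paper's exact update.

```python
# Average-based adaptation: update the closest component if it is near
# enough, otherwise spawn a new component, then renormalize the weights.
import numpy as np

def average_adapt(means, weights, new_mean, new_weight=0.05, T=1.0):
    """means: (k, d) component means; weights: (k,) mixture weights."""
    dists = np.linalg.norm(means - new_mean, axis=1)   # stand-in distance
    i = int(np.argmin(dists))
    if dists[i] < T:
        # pull the closest component toward the adaptation data
        means[i] = 0.5 * (means[i] + new_mean)
    else:
        # start a new mixture component at the adaptation data
        means = np.vstack([means, new_mean])
        weights = np.append(weights, new_weight)
    weights = weights / weights.sum()                  # renormalize weights
    return means, weights
```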
MAP-based Model Adaptation • μi : mean of the ith mixture component bi • Li : occupation likelihood of the adaptation data for component i • μ̄i : mean of the observed adaptation data
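A standard MAP mean update consistent with these symbols; the relevance factor τ is our addition, and the paper's exact weighting may differ.

```latex
\hat{\mu}_i \;=\; \frac{L_i}{L_i + \tau}\,\bar{\mu}_i \;+\; \frac{\tau}{L_i + \tau}\,\mu_i
```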
Viterbi-based Model Adaptation • Allows different feature vectors to be assigned to different mixture components • Hard decision: each vector either occupies a component or it does not • Uses an indicator function instead of a probability function for each mixture component
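A sketch of the hard-assignment idea, contrasting the indicator function with the soft posteriors of a fitted GMM.

```python
# Viterbi-style assignment: each adaptation vector is attributed entirely to
# its most likely mixture component via a 0/1 indicator.
import numpy as np
from sklearn.mixture import GaussianMixture

def hard_assignments(gmm: GaussianMixture, X):
    """Return a (num_frames, n_components) 0/1 indicator matrix."""
    post = gmm.predict_proba(X)                           # soft posteriors
    hard = np.zeros_like(post)
    hard[np.arange(len(X)), post.argmax(axis=1)] = 1.0    # indicator function
    return hard

# Component statistics can then be re-estimated from only the vectors
# assigned to each component (hard counts instead of posterior weights).
```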
Event-based Movie Skimming • Event feature extraction • Six types of mid- to high-level features • Evaluation of importance • Movie skim generation • Assemble major events -> final skim
Event Feature Extraction • Music Ratio • Speech Ratio • Sound Loudness • Action Level • Normalized by dividing by the largest value • Present Cast • Theme Topic
Event Feature Extraction • M : # of features extracted • N : # of events • ai,j : value of jth feature in ith event
Movie Skim Generation • Choosing important events • User’s feature preference • Event importance vector
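A sketch of assembling the skim, assuming the importance of each event is a linear combination of its normalized features a_i,j weighted by the user's feature preferences; the linear weighting is an assumption for illustration.

```python
# Compute the event importance vector from the N x M event-feature matrix
# and the user's preference weights, then keep the top-scoring events in
# temporal order to form the final skim.
import numpy as np

def generate_skim(feature_matrix, preference, skim_size=5):
    """feature_matrix: (N_events, M_features); preference: (M_features,)."""
    A = np.asarray(feature_matrix, dtype=float)
    w = np.asarray(preference, dtype=float)
    importance = A @ w                        # event importance vector
    chosen = np.argsort(importance)[::-1][:skim_size]
    return sorted(chosen.tolist())            # keep temporal order in the skim
```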
Event Detection Results • Correctness of the event classification • System performance evaluation • Hybrid class excluded
Speaker Identification Results • Evaluation of the adaptive speaker identification system • False acceptance (FA) • False rejection (FR) • Identification accuracy (IA)
Movie Skimming Results • Difficulty of qualitative evaluation • Quantitative measures based on a user study • 5-point scale : 1~5 • Visual comprehension • Audio comprehension • Semantic continuity • Good abstraction • Quick browsing • Video skipping