Multimedia Segmentation and Summarization • Dr. Jia-Ching Wang • Honorary Fellow, ECE Department, UW-Madison
Outline • Introduction • Speaker Segmentation • Video Summarization • Conclusion
What is Multimedia? • Image • Video • Speech • Audio • Text
Multimedia Everywhere • Fax machines: transmission of binary images • Digital cameras: still images • iPod / iPhone & MP3 • Digital camcorders: video sequences with audio • Digital television broadcasting • Compact disk (CD), Digital video disk (DVD) • Personal video recorder (PVR, TiVo) • Images on the World Wide Web • Video streaming, video conferencing • Video on cell phones, PDAs • High-definition televisions (HDTV) • Medical imaging: X-ray, MRI, ultrasound • Military imaging: multi-spectral, satellite, microwave
What is Multimedia Content? • Multimedia content: the syntactic and semantic information inherent in a digital material. • Example: text document • Syntactic content: chapters, paragraphs • Semantic content: keywords, subject, type of text document, etc. • Example: video document • Syntactic content: scene cuts, shots • Semantic content: motion, summary, index, caption, etc.
Why Do We Need to Know Multimedia Content? • Information processing, in terms of archiving, indexing, delivering, accessing, and other processing, requires in-depth knowledge of the content to optimize performance.
How to Know Multimedia Content? • Multimedia content analysis • The computerized understanding of the semantic/syntactic content of a multimedia document • Multimedia content analysis usually involves • Segmentation • Segmenting the multimedia document into units • Classification • Classifying each unit into a predefined type • Annotation • Annotating the multimedia document • Summarization • Summarizing the multimedia document
Multimedia Segmentation and Summarization • Multimedia segmentation • Syntactic content • Multimedia summarization • Semantic/syntactic content • The result of temporal segmentation can benefit video summarization
Multimedia Segmentation • Image segmentation • Video segmentation • Scene change, shot change • Audio segmentation • Audio class change • Speech segmentation • Speaker change detection • Text segmentation • Word segmentation, sentence segmentation, topic change detection
Multimedia Summarization • Image summarization • Region of interest • Video summarization • Storyboard, highlight • Audio summarization • Main theme in music, chorus in a song, event sounds in an environmental sound stream • Speech summarization • Speech abstract • Text summarization • Abstract
What is Speaker Segmentation? • Also called speaker change detection (SCD) • Assumption: there is no overlap between any two speaker streams • (figure: timeline of non-overlapping segments from speaker 1, speaker 2, and speaker 3)
Supervised vs. Unsupervised SCD • Supervised manner: the acoustic data are made up of distinct speakers who are known a priori • Recognition-based solution • Unsupervised manner: no prior knowledge about the number or identities of the speakers • Metric-based criterion • Model selection-based criterion
Supervised Speaker Segmentation -- Gaussian Mixture Model • Gaussian mixture modeling (GMM): p(x) = Σ_{i=1}^{M} w_i N(x; μ_i, Σ_i), where x is a d-dimensional random vector, w_i (i = 1, …, M) is the i-th mixture weight, μ_i the i-th mean vector, and Σ_i the i-th covariance matrix • The incoming audio stream is classified into one of D classes in a maximum-likelihood manner at time t
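The maximum-likelihood GMM decision can be sketched as follows. This is a minimal illustration with diagonal covariances and toy single-component "speaker" models, not the actual system implementation:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one frame x under a diagonal-covariance GMM:
    log sum_i w_i * N(x; mu_i, diag(var_i))."""
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    return np.logaddexp.reduce(log_comp)   # stable log-sum-exp over components

def classify(x, class_gmms):
    """Assign the frame to the class whose GMM scores it highest (ML decision)."""
    return int(np.argmax([gmm_loglik(x, *g) for g in class_gmms]))

# Two toy 1-component "speaker" models in a 2-D feature space
g0 = (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
g1 = (np.array([1.0]), np.full((1, 2), 5.0), np.ones((1, 2)))
print(classify(np.array([0.2, -0.1]), [g0, g1]))  # 0
```

In practice each class GMM would be trained (e.g. by EM) on enrollment data from the known speakers.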
Unsupervised Speaker Segmentation-- Sliding Window Strategy & Detection Criterion • Metric-based criterion (The dissimilarities between the acoustic feature vectors are measured) • Kullback-Leibler distance • Mahalanobis distance • Bhattacharyya distance • Model selection-based criterion • Bayesian information criterion (BIC)
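As an illustration of the metric-based criteria above, the Kullback-Leibler distance between two windows modeled as Gaussians has a closed form; a minimal numpy sketch (the symmetric variant is assumed here, a common choice for comparing adjacent windows):

```python
import numpy as np

def kl_gauss(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def kl_distance(mu0, cov0, mu1, cov1):
    """Symmetric KL distance used to compare two adjacent analysis windows;
    a large value suggests the windows come from different speakers."""
    return kl_gauss(mu0, cov0, mu1, cov1) + kl_gauss(mu1, cov1, mu0, cov0)
```

A change point is then hypothesized wherever this distance between adjacent sliding windows peaks above a threshold, which is exactly the threshold-selection difficulty noted later.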
Bayesian Information Criterion • Model selection • Choose one among a set of candidate models M_i, i = 1, 2, …, m, and corresponding model parameters to represent a given data set D = (D_1, D_2, …, D_N) • Model posterior probability • Bayesian information criterion • The maximized log data likelihood for the given model, with a model-complexity penalty: BIC(M_i) = log p(D | θ̂_i, M_i) − (d_i / 2) log N, where d_i is the number of independent parameters in the model parameter set
Unsupervised Segmentation Using Bayesian Information Criterion • First model (no change): x_1, …, x_N ~ N(μ, Σ) • Second model (change at frame i): x_1, …, x_i ~ N(μ_1, Σ_1); x_{i+1}, …, x_N ~ N(μ_2, Σ_2) • Bayesian information criterion: ΔBIC(i) = (N/2) log|Σ| − (i/2) log|Σ_1| − ((N−i)/2) log|Σ_2| − λP, with penalty P = (1/2)(d + d(d+1)/2) log N • A speaker change is declared at i when ΔBIC(i) > 0
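The ΔBIC test can be sketched compactly under the standard single-Gaussian-per-window formulation; the penalty weight `lam` is an assumed tuning parameter, and real systems scan i over the window rather than testing a single point:

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    """Delta-BIC at candidate change frame i for feature matrix X (N x d).
    Positive values favor the two-model hypothesis, i.e. a speaker change."""
    N, d = X.shape
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)   # model-complexity term
    return (0.5 * N * logdet(X)
            - 0.5 * i * logdet(X[:i])
            - 0.5 * (N - i) * logdet(X[i:])
            - lam * penalty)
```

On synthetic data, ΔBIC at a true boundary between well-separated speakers is strongly positive, while homogeneous data scores much lower.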
Disadvantages of Conventional Unsupervised Speaker Change Detection • For metric-based methods, it is not easy to choose a suitable threshold • For BIC, it is not easy to detect speaker segments shorter than 2 seconds
Proposed Method -- Misclassification Error Rate • Sliding window pairs • Feature vector distributions • (figure: feature-vector distributions for the same speaker vs. different speakers)
Discussion • Generative and discriminant classifiers are both applicable • Key point: discriminant classifiers have the benefit that less training data is required • We can use a smaller scanning window • The ability to detect short speaker change segments increases
Speaker Segmentation Using Misclassification Error Rate • Steps • Preprocessing • Framing, feature extraction • Hypothesized speaker change point selection • Forcing 2-class labels • Training a discriminant hyperplane • Inside-data recognition and calculation of the misclassification error rate • Accepting/rejecting the hypothesized speaker change point • Significance • The unsupervised speaker segmentation problem is solved by supervised classification
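The steps above can be sketched as follows. The least-squares hyperplane and the acceptance threshold are illustrative assumptions, not the exact classifier used in the proposed system; the idea is only that forced labels plus a low inside-data error indicate separable (hence different-speaker) windows:

```python
import numpy as np

def misclassification_rate(left, right):
    """Force 2-class labels (-1/+1) on a sliding window pair, fit a linear
    hyperplane by least squares, and measure the error on the same
    (inside) data that trained it."""
    X = np.vstack([left, right])
    X1 = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    y = np.concatenate([-np.ones(len(left)), np.ones(len(right))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares hyperplane
    return float(np.mean(np.sign(X1 @ w) != y))

def is_change_point(left, right, threshold=0.25):
    """Low inside-data error => the forced classes are separable => accept
    the hypothesized change point (threshold is an assumed parameter)."""
    return misclassification_rate(left, right) < threshold
```

Windows drawn from one distribution yield a near-chance error rate, so the hypothesized change point is rejected; well-separated windows yield near-zero error and the change point is accepted.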
Experimental Results
Video Summarization • Dynamic vs. Static Video Summarization • Dynamic video summarization • Sports highlights, movie trailers • Static video summarization • Storyboard • Visual-based approach • Incorporation of semantic information
Static Video Summarization -- Visual-Based Approach • Example • Problem • Is the summarization ratio adjustable? • How to generate an effective storyboard under a given summarization ratio?
How to Generate an Effective Storyboard • Question: assume there are n frames and the summarization ratio is r/n. How do we select the best r frames? • Complexity: there are C(n, r) different choices
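The combinatorial explosion is easy to verify with the standard library; the 3600-frame figure below is an assumed example (one frame per second over an hour), not from the original slides:

```python
from math import comb

n, r = 8, 4
print(comb(n, r))               # 70 candidate storyboards for the 8-frame example
print(comb(n, r) * comb(r, 2))  # 420 pairwise dissimilarity evaluations
# At realistic sizes the count explodes, so exhaustive search is infeasible:
print(comb(3600, 36) > 10**80)  # True
```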
How to Generate an Effective Storyboard • From a visual viewpoint • The most visually distinct frames should be extracted • Dissimilarity between two frames is measured with low-level visual features • How to select the best r frames from n frames • Solution: maximize the overall pairwise dissimilarity • Complexity: C(n, r) × C(r, 2) • Infeasible: C(n, r) is usually huge • Fact • Human beings usually browse a storyboard sequentially • Optimal solution in a sequential sense • Maximize the sum of dissimilarities between sequentially adjacent images in the storyboard
How to Maximize the Dissimilarity Sum of the Extracted Images • Lattice-based representative frame extraction approach • Extract key components from a temporal sequence • Dynamic programming can be applied • Example: how to select the best 4 images from an 8-image sequence
How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images • Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8) • Extracted images: E(1), E(2), E(3), E(4) • E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l), where i < j < k < l • Each legal left-to-right path represents a way to extract images • Each transition contributes an adjacent dissimilarity • In this example, the adjacent dissimilarity sum of the extracted images is D[O(1),O(3)] + D[O(3),O(4)] + D[O(4),O(7)]
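The lattice search can be sketched with dynamic programming; a minimal pure-Python version, assuming the pairwise dissimilarity matrix D has been precomputed from low-level visual features:

```python
def select_frames(D, r):
    """Pick r frame indices (in temporal order) maximizing the sum of
    dissimilarities D[a][b] between sequentially adjacent selected frames,
    via dynamic programming over the extraction lattice."""
    n = len(D)
    NEG = float("-inf")
    score = [[NEG] * n for _ in range(r)]   # score[j][i]: best sum, j-th pick at frame i
    back = [[-1] * n for _ in range(r)]     # backpointers to recover the path
    for i in range(n):
        score[0][i] = 0.0                   # first selected frame: no pair yet
    for j in range(1, r):
        for i in range(j, n):
            for k in range(j - 1, i):       # position of the previous selected frame
                s = score[j - 1][k] + D[k][i]
                if s > score[j][i]:
                    score[j][i] = s
                    back[j][i] = k
    i = max(range(r - 1, n), key=lambda t: score[r - 1][t])
    path = [i]
    for j in range(r - 1, 0, -1):
        i = back[j][i]
        path.append(i)
    return path[::-1]

# Toy 1-D "frames": the outlier at index 3 should anchor the selection
v = [0, 1, 0, 10, 0]
D = [[abs(a - b) for b in v] for a in v]
print(select_frames(D, 3))  # [0, 3, 4]
```

Each lattice cell is filled once from its predecessors, which is the source of the polynomial comparison count quoted on the next slide, versus the exhaustive C(n, r) enumeration.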
Complexity Comparison • Select 4 images from an 8-image sequence • Lattice-based approach • 45 dissimilarity comparisons • Optimal approach • 420 dissimilarity comparisons
Incorporation of Semantic Information • Conventional • The static summarized images are extracted according to low-level visual features • Disadvantage • It is difficult to catch the main story without the support of semantically significant information • We present a semantics-based static video summarization • Each extracted image has an annotation • Related images are connected by edges • 'Who', 'what', 'where', and 'when' are used to list all extracted images
The Proposed Architecture • Shot annotation: mapping visual content to text • Concept expansion: provides an alternative view and dependency information when measuring the relation of two annotations • Relational graph construction
Concept Tree Construction • The concept tree denotes the dependency structure of the expanded words • Meronym • 'Wheel' is a meronym of 'automobile' • Holonym • 'Tree' is a holonym of 'bark', of 'trunk', and of 'limb' • Other relation examples • Pencil used-for Draw • Salesperson location-of Store • Motorist capable-of Drive • Eat breakfast effect-of Full stomach
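Meronym/holonym lookups like those above can be illustrated with a toy relation store; the dictionary below is hypothetical stand-in data for what a lexical resource such as WordNet would provide, not an actual WordNet query:

```python
# Toy part-whole relation store (hypothetical data, for illustration only)
PART_OF = {          # meronym -> holonym
    "wheel": "automobile",
    "bark": "tree",
    "trunk": "tree",
    "limb": "tree",
}

def meronyms(concept):
    """Parts of a concept, e.g. the meronyms of 'tree'."""
    return sorted(part for part, whole in PART_OF.items() if whole == concept)

def holonym(concept):
    """The whole a concept belongs to, if known."""
    return PART_OF.get(concept)

print(meronyms("tree"))   # ['bark', 'limb', 'trunk']
print(holonym("wheel"))   # automobile
```

In the actual system these edges would come from the lexical resource during concept expansion, and they supply the parent-child structure of each concept tree.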
Concept Tree Reorganization • Who: names of people; a subset of 'person' in WordNet • Where: 'social group', 'building', and 'location' in WordNet • What: all other words that do not belong to 'who' or 'where' • When: searching for time-period phrases
Relational Graph Construction -- Relation of Two Concept Trees • The relation of the two concept trees • The relation of the two roots • The relation of the two children
Relational Graph Construction -- Removing Unimportant Vertices and Edges • Remove edges with smaller weights, i.e., weaker relations • Remove vertices with smaller term frequency-inverse document frequency (TF-IDF) scores
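The TF-IDF score used for vertex pruning can be sketched as follows; the three tiny annotation "documents" are invented toy data, and the weighting variant (raw term frequency, natural-log IDF) is an assumption rather than the paper's exact formula:

```python
import math
from collections import Counter

def tfidf(term, doc, docs):
    """TF-IDF of a term inside one annotation against the whole collection:
    terms frequent in one annotation but rare across annotations score high."""
    tf = Counter(doc)[term] / len(doc)      # term frequency in this annotation
    df = sum(term in d for d in docs)       # document frequency in the collection
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

docs = [["car", "road"], ["car", "tree"], ["tree", "bark"]]
# 'bark' is rarer across annotations than 'car', so its vertex survives pruning
print(tfidf("bark", docs[2], docs) > tfidf("car", docs[0], docs))  # True
```

A term appearing in every annotation gets IDF zero, so such vertices are always pruned first.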
The Final Relational Graph • Comparison with conventional storyboard
Conclusion • A novel speaker segmentation criterion is proposed • Misclassification error rate • The unsupervised speaker segmentation problem is solved by supervised classification with label forcing • The discriminant classifier allows the proposed approach to use a smaller scanning window • The ability to detect short speaker change segments increases • Two new static video summarization approaches are proposed • Lattice-based representative frame extraction • Uses only low-level visual features • The summarization ratio is adjustable • Under a given summarization ratio, the dissimilarity sum between sequentially adjacent images is maximized • Concept-organized representative frame extraction • Incorporates semantic information • Mines four kinds of concept entities: who, what, where, and when • People can efficiently grasp the comprehensive structure of the story and understand the main points of the content
Future Work • Multimedia segmentation • Speech segmentation • Audio segmentation • Video segmentation • Multimedia summarization • Video summarization • Static, dynamic • Speech summarization • Audio summarization