
Outline

Multimedia Segmentation and Summarization Dr. Jia-Ching Wang Honorary Fellow, ECE Department, UW-Madison. Outline. Introduction Speaker Segmentation Video Summarization Conclusion. What is Multimedia?. Image Video Speech Audio Text. Multimedia Everywhere.


Presentation Transcript


  1. Multimedia Segmentation and Summarization • Dr. Jia-Ching Wang • Honorary Fellow, ECE Department, UW-Madison

  2. Outline • Introduction • Speaker Segmentation • Video Summarization • Conclusion

  3. What is Multimedia? • Image • Video • Speech • Audio • Text

  4. Multimedia Everywhere • Fax machines: transmission of binary images • Digital cameras: still images • iPod / iPhone & MP3 • Digital camcorders: video sequences with audio • Digital television broadcasting • Compact disk (CD), Digital video disk (DVD) • Personal video recorder (PVR, TiVo) • Images on the World Wide Web • Video streaming, video conferencing • Video on cell phones, PDAs • High-definition televisions (HDTV) • Medical imaging: X-ray, MRI, ultrasound • Military imaging: multi-spectral, satellite, microwave

  5. What is Multimedia Content? • Multimedia content: the syntactic and semantic information inherent in digital material. • Example: text document • Syntactic content: chapter, paragraph • Semantic content: key words, subject, types of text document, etc. • Example: video document • Syntactic content: scene cuts, shots • Semantic content: motion, summary, index, caption, etc.

  6. Why Do We Need to Know Multimedia Content? • Information processing (archiving, indexing, delivering, accessing, and other processing) requires in-depth knowledge of the content to optimize performance.

  7. How to Know Multimedia Content? • Multimedia content analysis • The computerized understanding of the semantic/syntactic content of a multimedia document • Multimedia content analysis usually involves • Segmentation • Segmenting the multimedia document into units • Classification • Classifying each unit into a predefined type • Annotation • Annotating the multimedia document • Summarization • Summarizing the multimedia document

  8. Multimedia Segmentation and Summarization • Multimedia segmentation • Syntactic content • Multimedia summarization • Semantic/syntactic content • The result of the temporal segmentation can benefit the video summarization

  9. Multimedia Segmentation • Image segmentation • Video segmentation • Scene change, shot change • Audio segmentation • Audio class change • Speech segmentation • Speaker change detection • Text Segmentation • word segmentation, sentence segmentation, topic change detection

  10. Multimedia Summarization • Image summarization • Region of interest • Video summarization • Storyboard, highlight • Audio summarization • Main theme in music, chorus in a song, event sound in an environmental sound stream • Speech summarization • Speech abstract • Text summarization • Abstract

  11. What is Speaker Segmentation? • It can also be called speaker change detection (SCD) • Assumption: no two speaker streams overlap • (Figure: audio timeline with segments labeled speaker1, speaker2, speaker3)

  12. Supervised vs. Unsupervised SCD • Supervised manner: the acoustic data come from distinct speakers who are known a priori • Recognition-based solution • Unsupervised manner: no prior knowledge about the number and identities of speakers • Metric-based criterion • Model selection-based criterion

  13. Supervised Speaker Segmentation-- Gaussian Mixture Model • Gaussian mixture modeling (GMM): $p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$, where x is a d-dimensional random vector, $w_i$, i=1,…,M, is the mixture weight, $\mu_i$ the mean vector, and $\Sigma_i$ the covariance matrix • Incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
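The maximum likelihood classification on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes diagonal covariance matrices, and the names `gmm_log_likelihood`, `classify`, `speaker_a`, and `speaker_b` are hypothetical, as are the toy model parameters.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one d-dimensional vector under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)          # shape (M, d)
    variances = np.asarray(variances, dtype=float)  # shape (M, d), diagonal of each covariance
    d = x.shape[0]
    # Log of each Gaussian component density, split into normalizer and exponent
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the M components for numerical stability
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

def classify(x, models):
    """Return the index of the class model with the highest likelihood for x."""
    scores = [gmm_log_likelihood(x, *model) for model in models]
    return int(np.argmax(scores))

# Two hypothetical speaker models: (weights, means, variances), 2 components in 2-D
speaker_a = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2)))
speaker_b = (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [6.0, 6.0]]), np.ones((2, 2)))

label = classify([0.2, 0.1], [speaker_a, speaker_b])
```

In practice the decision is usually made over a block of frames by summing the per-frame log-likelihoods, which smooths out frame-level noise.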

  14. Supervised Speaker Segmentation-- Hidden Markov Model

  15. Unsupervised Speaker Segmentation-- Sliding Window Strategy & Detection Criterion • Metric-based criterion (The dissimilarities between the acoustic feature vectors are measured) • Kullback-Leibler distance • Mahalanobis distance • Bhattacharyya distance • Model selection-based criterion • Bayesian information criterion (BIC)
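As one example of a metric-based criterion, the sketch below computes the Bhattacharyya distance between two adjacent analysis windows, each modeled as a single Gaussian. The single-Gaussian assumption, the function name `bhattacharyya_distance`, and the synthetic data are illustrative choices, not taken from the slides; the windows are assumed to hold at least two feature dimensions.

```python
import numpy as np

def bhattacharyya_distance(window_a, window_b):
    """Bhattacharyya distance between Gaussians fitted to two windows of shape (frames, d)."""
    mu_a, mu_b = window_a.mean(axis=0), window_b.mean(axis=0)
    cov_a = np.cov(window_a, rowvar=False)
    cov_b = np.cov(window_b, rowvar=False)
    cov = 0.5 * (cov_a + cov_b)  # average covariance
    diff = mu_a - mu_b
    # Term driven by the difference of the means
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Term driven by the difference of the covariances
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    term_cov = 0.5 * (logdet - 0.5 * (logdet_a + logdet_b))
    return float(term_mean + term_cov)

rng = np.random.default_rng(0)
left = rng.normal(size=(200, 3))           # synthetic features for the left window
right_shifted = left + 3.0                 # same shape, shifted mean
d_same = bhattacharyya_distance(left, left)
d_diff = bhattacharyya_distance(left, right_shifted)
```

A change point is hypothesized where the distance between the sliding window pair peaks above a threshold, which is exactly the quantity that is hard to tune for metric-based methods.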

  16. Bayesian Information Criterion • Model selection • Choose one among a set of candidate models Mi, i=1,2,...,m, and corresponding model parameters to represent a given data set D = (D1, D2, …, DN). • Model posterior probability • Bayesian information criterion • Maximized log data likelihood for the given model, with a model complexity penalty • Bayesian information criterion of model Mi, where di is the number of independent parameters in the model parameter set
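The slide's own equation did not survive in this transcript. For reference, the commonly used form of the criterion, written with the slide's symbols, is shown below; $\hat{\theta}_i$ (the maximum likelihood parameter estimate for model $M_i$) is notation assumed here, and some formulations scale the penalty by an additional tunable weight.

```latex
\mathrm{BIC}(M_i) = \log P\!\left(D \mid \hat{\theta}_i, M_i\right) - \frac{1}{2}\, d_i \log N
```

The model with the largest BIC value is selected: the first term rewards fit to the N data points, and the second penalizes the $d_i$ free parameters.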

  17. Unsupervised Segmentation Using Bayesian Information Criterion • First model • Second model • Bayesian information criterion
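A sketch of how the two models are usually compared: the first model describes the whole window with one Gaussian, the second describes the two sides of a candidate change point with separate Gaussians, and the difference of their BIC values decides. The full-covariance Gaussian assumption, the function name `delta_bic`, the `penalty_weight` parameter, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def delta_bic(window, split, penalty_weight=1.0):
    """BIC(two Gaussians) minus BIC(one Gaussian) for a change at index `split`.

    window: array of shape (N, d). A positive value favors a speaker change.
    """
    n, d = window.shape
    left, right = window[:split], window[split:]

    def logdet_cov(x):
        # Log-determinant of the maximum likelihood covariance estimate
        _, logdet = np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))
        return logdet

    # Log-likelihood gain of the two-Gaussian model over the one-Gaussian model
    gain = 0.5 * (n * logdet_cov(window)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))
    # The second model has one extra mean vector and one extra covariance matrix
    extra_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * extra_params * np.log(n)
    return float(gain - penalty)

rng = np.random.default_rng(0)
same_speaker = rng.normal(size=(400, 2))
two_speakers = np.vstack([rng.normal(size=(200, 2)),
                          rng.normal(loc=5.0, size=(200, 2))])
score_same = delta_bic(same_speaker, 200)
score_change = delta_bic(two_speakers, 200)
```

Because each side needs enough frames for a stable covariance estimate, this test becomes unreliable on very short segments.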

  18. Disadvantages of Conventional Unsupervised Speaker Change Detection • For metric-based methods, it is not easy to choose a suitable threshold • For BIC, it is not easy to detect speaker segments shorter than 2 seconds

  19. Proposed Method -- Misclassification Error Rate • Sliding window pairs • Feature vector distribution (figure panels: same speaker, different speakers)

  20. Mathematical Analysis

  21. Mathematical Analysis

  22. Discussion • Both generative and discriminant classifiers are applicable • Key point: discriminant classifiers require less training data • A smaller scanning window can therefore be used • The ability to detect short speaker segments increases

  23. Speaker Segmentation Using Misclassification Error Rate • Steps • Preprocessing • Framing, feature extraction • Hypothesized speaker change point selection • Forcing 2-class labels • Training a discriminant hyperplane • Inside data recognition & calculating misclassification error rate • Accept/reject the hypothesized speaker change point • Significance • The unsupervised speaker segmentation problem is solved by supervised classification
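The steps above can be sketched as follows. This is a simplified stand-in, not the presenter's classifier: a least-squares linear hyperplane replaces whatever discriminant classifier the talk used, and the function names, the 0.3 acceptance threshold, and the synthetic features are all hypothetical.

```python
import numpy as np

def misclassification_error_rate(left, right):
    """Force labels -1/+1 on the two windows, fit a hyperplane, and score the training data."""
    x = np.vstack([left, right])
    x = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    y = np.concatenate([-np.ones(len(left)), np.ones(len(right))])
    # Least-squares hyperplane (an assumed substitute for the talk's classifier)
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    # "Inside data recognition": classify the same frames used for training
    predicted = np.where(x @ w >= 0, 1.0, -1.0)
    return float(np.mean(predicted != y))

def is_speaker_change(left, right, threshold=0.3):
    """Accept the hypothesized change point when the two windows separate well."""
    return misclassification_error_rate(left, right) < threshold

rng = np.random.default_rng(0)
window_a = rng.normal(size=(200, 2))
window_b_same = rng.normal(size=(200, 2))             # same distribution as window_a
window_b_other = rng.normal(loc=5.0, size=(200, 2))   # clearly different distribution
err_same = misclassification_error_rate(window_a, window_b_same)
err_other = misclassification_error_rate(window_a, window_b_other)
```

When both windows come from the same speaker the forced labels are arbitrary, so the error rate stays near chance; when the speakers differ, the classes separate and the error rate drops.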

  24. Experimental Results

  25. Video Summarization • Dynamic vs. Static Video Summarization • Dynamic video summarization • Sport highlight, movie trailer • Static video summarization • Storyboard • Visual-based approach • Incorporation of the semantic information

  26. Static Video Summarization-- Visual Based Approach • Example • Problem • Is the summarization ratio adjustable? • How to generate an effective storyboard under a given summarization ratio?

  27. How to Generate Effective Storyboard • Question: Assume there are n frames and the summarization ratio is r/n. How do we select the best r frames? • Complexity: • There are C(n,r) different choices

  28. How to Generate Effective Storyboard • From a visual viewpoint • The most visually distinct frames should be extracted • Dissimilarity between two frames is measured by low level visual features • How to select the best r frames from n frames • Solution: maximize the overall pairwise dissimilarities • Complexity: C(n,r) x C(r,2) • Infeasible: C(n,r) is usually huge • Fact • Human beings usually browse a storyboard in a sequential way • Optimal solution in a sequential sense • Maximize the sum of dissimilarities between sequentially adjacent images in a storyboard

  29. How to Maximize the Dissimilarity Sum of the Extracted Images • Lattice-based representative frame extraction approach • Extract key component from temporal sequence • Dynamic programming can be applied • Example: how to select the best 4 images from an 8-image sequence

  30. How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images • Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8) • Extracted images: E(1), E(2), E(3), E(4) • E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l); where i < j < k < l • Each legal left-to-right path represents a way to extract images • Each transition results in an adjacent dissimilarity • In this example, the adjacent dissimilarity sum of the extracted images is D[ O(1),O(3) ] + D[ O(3),O(4) ] + D[ O(4),O(7) ]
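The left-to-right path search over the lattice can be sketched with dynamic programming. This is a minimal illustration under assumptions: the function name `select_frames`, the precomputed dissimilarity matrix, and the random toy features are hypothetical, and frame indices are 0-based rather than the slide's 1-based O(1)…O(8).

```python
import numpy as np

def select_frames(dissimilarity, r):
    """Pick r frame indices (increasing) that maximize the adjacent dissimilarity sum.

    dissimilarity: (n, n) array where entry [i, j] is D[O(i), O(j)].
    Returns (list of selected indices, maximal sum).
    """
    n = dissimilarity.shape[0]
    # best[k, j]: best sum when the (k+1)-th extracted image is original frame j
    best = np.full((r, n), -np.inf)
    back = np.zeros((r, n), dtype=int)
    best[0, :] = 0.0  # any frame may be the first extracted image
    for k in range(1, r):
        for j in range(k, n):
            # Extend every path ending at an earlier frame i < j by the transition i -> j
            candidates = best[k - 1, :j] + dissimilarity[:j, j]
            back[k, j] = int(np.argmax(candidates))
            best[k, j] = candidates[back[k, j]]
    # Trace the best complete path backwards from its final frame
    j = int(np.argmax(best[r - 1]))
    total = float(best[r - 1, j])
    path = [j]
    for k in range(r - 1, 0, -1):
        j = int(back[k, j])
        path.append(j)
    return path[::-1], total

rng = np.random.default_rng(1)
features = rng.random((8, 3))  # toy low level features for 8 frames
dist = np.abs(features[:, None, :] - features[None, :, :]).sum(axis=2)
selected, score = select_frames(dist, 4)
```

Each lattice transition is evaluated once, so the cost grows polynomially with n and r instead of with C(n,r).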

  31. How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images

  32. Complexity Comparison • Select 4 images from an 8-image sequence • Lattice-based approach • 45 dissimilarity comparisons • Optimal approach • 420 dissimilarity comparisons
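Both counts can be reproduced with a short script. The counting rule for the lattice is an assumption that happens to match the slide's figure of 45: each extracted slot may take n - r + 1 original frames, and one comparison is counted per legal transition between consecutive slots. The function names are hypothetical.

```python
from math import comb

def lattice_comparisons(n, r):
    """Count legal transitions i -> j (i < j) between consecutive slots of the lattice."""
    width = n - r + 1  # number of original frames each extracted slot can take
    count = 0
    for k in range(r - 1):                       # transition from slot k to slot k + 1
        for i in range(k, k + width):            # frames allowed in slot k
            for j in range(k + 1, k + 1 + width):  # frames allowed in slot k + 1
                if i < j:
                    count += 1
    return count

def exhaustive_comparisons(n, r):
    """C(n, r) candidate subsets, each needing C(r, 2) pairwise dissimilarities."""
    return comb(n, r) * comb(r, 2)

lattice = lattice_comparisons(8, 4)        # 45
exhaustive = exhaustive_comparisons(8, 4)  # 420
```

The gap widens quickly for realistic videos, since C(n, r) grows combinatorially while the lattice count grows roughly with r times the square of n - r.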

  33. Segment-Based Solution

  34. Experimental Results

  35. Incorporation of the Semantic Information • Conventional • The static summarized images are extracted in accordance with low level visual features • Disadvantage • It is difficult to catch the main story without the support of semantically significant information • We present a semantic-based static video summarization • Each extracted image has an annotation • Related images are connected by edges • Using ‘who’ ‘what’ ‘where’ ‘when’ to list all extracted images

  36. The Proposed Architecture • Shot annotation: mapping visual content to text • Concept expansion: it provides an alternative view and dependency information when measuring the relation between two annotations • Relational graph construction

  37. Concept Tree Construction • The concept tree denotes the dependent structure of the expanded words • Meronym • 'Wheel' is a meronym of 'automobile' • Holonym • 'Tree' is a holonym of 'bark', of 'trunk' and of 'limb' • 'Pencil' used for 'draw' • 'Salesperson' location of 'store' • 'Motorist' capable of 'drive' • 'Eat breakfast' effect of 'full stomach'

  38. Concept Tree Reorganization • Who: names of people, subset of "person" in WordNet • Where: "social group," "building," and "location" in WordNet • What: all the other words which do not belong to "who" and "where" • When: searching for time-period phrases

  39. Relational Graph Construction -- Relation of Two Concept Trees • The relation of the two concept trees • The relation of the two roots • The relation of the two children

  40. Relational Graph Construction -- Remove Unimportant Vertices and Edges • Remove edges with smaller weights, i.e. weaker relations • Remove vertices with smaller term frequency – inverse document frequency (TF-IDF)
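A rough sketch of this pruning step follows. The slides do not say how vertex scores are derived or which thresholds are used, so everything here is an assumption: vertices are keyed by annotation word, the score is a plain TF-IDF summed over shots, and `tf_idf`, `prune_graph`, the thresholds, and the toy annotations are hypothetical. Every shot annotation is assumed to be non-empty.

```python
import math
from collections import Counter

def tf_idf(annotations):
    """annotations: list of word lists, one per shot. Returns {word: TF-IDF summed over shots}."""
    n_docs = len(annotations)
    # Number of shots in which each word appears at least once
    doc_freq = Counter(word for words in annotations for word in set(words))
    scores = Counter()
    for words in annotations:
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    return dict(scores)

def prune_graph(edges, vertex_scores, min_edge_weight, min_vertex_score):
    """Keep edges that are strong enough and whose endpoints both score high enough."""
    kept = {v for v, s in vertex_scores.items() if s >= min_vertex_score}
    return {(a, b): w for (a, b), w in edges.items()
            if w >= min_edge_weight and a in kept and b in kept}

shots = [["president", "speech"], ["president", "airport"], ["president", "speech"]]
scores = tf_idf(shots)
edges = {("speech", "airport"): 0.8, ("speech", "president"): 0.9, ("airport", "speech"): 0.1}
pruned = prune_graph(edges, scores, min_edge_weight=0.5, min_vertex_score=0.1)
```

In this toy example "president" appears in every shot, so its inverse document frequency is zero and the vertex is dropped along with its edge.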

  41. The Final Relational Graph • Comparison with conventional storyboard

  42. Conclusion • A novel speaker segmentation criterion is proposed • Misclassification error rate • The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing • The discriminant classifier allows the proposed approach to use a smaller scanning window • The ability to detect short speaker segments increases • Two new static video summarization approaches are proposed • Lattice-based representative frame extraction • Merely using low level visual features • The summarization ratio is adjustable • Under a given summarization ratio, the dissimilarity sum between sequentially adjacent images is maximized • Concept-organized representative frame extraction • Incorporating semantic information • Mining the four kinds of concept entities: who, what, where, and when • People can efficiently grasp the comprehensive structure of the story and understand the main points of the contents

  43. Future Work • Multimedia segmentation • Speech segmentation • Audio segmentation • Video segmentation • Multimedia summarization • Video summarization • Static, dynamic • Speech summarization • Audio summarization

  44. Thank all of you for your attendance!
