
Semantic Content-based Modeling



  1. Semantic Content-based Modeling • Video semantics are captured and organized to support video retrieval. • Difficult to automate. • Relies on manual annotation. • Capable of supporting natural-language-like queries.

  2. Video Content Extraction • Other forms of information extraction can be employed: • closed-caption text • speech recognition • descriptive information from the screenplay • key frames that characterize a shot. • This content information can be associated with the video story units.

  3. Existing Semantic-Level Models • Segmentation-based Models • Stratification-based Models • Temporally Coherent Models

  4. Segmentation-based Modeling • A video stream is segmented into temporally continuous segments. • Each segment is associated with a description, which can be natural-language text, keywords, or another kind of annotation. • Disadvantages: lack of flexibility; limited capability of representing semantics. [Timeline figure: a classroom video segmented at frames 0, 10, 30, 35, 70, 90 into "Teacher enters class", "Principal interrupts class", "A student raises hand", "Break", "Empty class"]

  5. Stratification-based Modeling • The contextual information is partitioned into single events. • Each event is associated with a video segment called a stratum. • Strata can overlap or encompass each other. [Timeline figure: Events 1-5 as overlapping strata over frames 0, 5, 15, 20, 30, 35, 60, 70, 85, 90]

  6. Temporally Coherent Modeling • Each event is associated with the set of video segments in which it happens. • More flexible in structuring video semantics. [Figure: Events 1-4 each mapped to several segments along the video time axis]

  7. Stratum • The concept of stratification can be used to assign descriptions to video footage. • Each stratum refers to a sequence of video frames. • Strata may overlap or totally encompass each other. • Advantage: allows easy retrieval by keyword. [Figure: overlapping strata "Car wreck rescue mission", "Medics", "Victim", "In ambulance", "In stretcher", "Pulled free", "Siren", "Ambulance" over the video frames]
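A minimal sketch of the stratification idea in Python (the Stratum class, the sample strata, and the retrieve function are illustrative, not part of the original model):

```python
# Minimal sketch of stratification-based modeling (names are illustrative).
from dataclasses import dataclass

@dataclass
class Stratum:
    start: int          # first frame of the stratum
    end: int            # last frame (inclusive); strata may overlap
    description: str    # free-text or keyword annotation

strata = [
    Stratum(0, 90, "car wreck rescue mission"),
    Stratum(10, 40, "medics pull victim free"),
    Stratum(35, 70, "victim in stretcher"),
    Stratum(60, 90, "victim in ambulance, siren"),
]

def retrieve(keyword: str) -> list[Stratum]:
    """Keyword retrieval: return every stratum whose description mentions the keyword."""
    return [s for s in strata if keyword.lower() in s.description.lower()]

print(retrieve("victim"))   # overlapping strata are returned independently
```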

  8. Video Algebra • Goal: to provide a high-level abstraction that models the complex information associated with digital video data and supports content-based access. • Strategy: the algebraic video data model consists of hierarchical compositions of video expressions with high-level semantic descriptions; the video expressions are constructed using video algebra operations.

  9. Presentation • In the algebraic video data model, the fundamental entity is a presentation. • A presentation is a multiwindow spatial, temporal, and content combination of video segments. • Presentations are described by video expressions. • The most primitive video expression creates a single-window presentation from a raw video segment. • Compound video expressions are constructed from simpler ones using video algebra operations. [Figure: a compound video expression (an algebraic video node) composed of simpler video expressions, bottoming out in primitive video expressions over raw video] Note: An algebraic video node provides a means of abstraction by which video expressions can be named, stored, and manipulated as units.

  10. Video Algebra Operations The video algebra operations fall into four categories: 1. Creation: defines the construction of video expressions from raw video. 2. Composition: defines temporal relationships between component video expressions. 3. Output: defines spatial layout and audio output for component video expressions. 4. Description: associates content attributes with a video expression.

  11. Composition • The composition operations can be combined to produce complex scheduling definitions and constraints. Example (create builds a video presentation from a raw video segment):
C1 = create Cnn.HeadlineNews.rv 10 30
C2 = create Cnn.HeadlineNews.rv 20 40
C3 = create Cnn.HeadlineNews.rv 32 65
D1 = (description C1 "Anchor speaking")
D2 = (description C2 "Professor Smith")
D3 = (description C3 "Economic reform")
In the composed presentation, D3 follows D2, which follows D1, and common footage is not repeated: a non-redundant video stream is created from three overlapping segments.
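The non-redundant composition above amounts to a union of frame intervals. A small Python sketch of that merge (segment values taken from the example; the function is ours, not part of the algebra's definition):

```python
# Sketch of the union composition: merge overlapping frame ranges so that
# common footage is played only once.
def union(segments: list[tuple[int, int]]) -> list[tuple[int, int]]:
    merged: list[tuple[int, int]] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:        # overlaps the previous range
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# C1, C2, C3 from the example: frames 10-30, 20-40, 32-65 of the same raw video
print(union([(10, 30), (20, 40), (32, 65)]))  # -> [(10, 65)]
```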

  12. Composition Operators (1) • E1 E2 (concatenation): defines the presentation in which E2 follows E1. • E1 ∪ E2: defines the presentation in which E2 follows E1 and common footage is not repeated. • E1 ∩ E2: defines the presentation in which only the common footage of E1 and E2 is played. • E1 - E2: defines the presentation in which only the footage of E1 that is not in E2 is played. • E1 || E2: E1 and E2 are played concurrently and terminate simultaneously. • (test) ? E1 : E2 : ... : En: Ei is played if test evaluates to i. • loop E1 time: defines a repetition of E1 for a duration of time. • stretch E1 factor: sets the duration of the presentation to factor times the duration of E1 by changing the playback speed of the video segment. • limit E1 time: sets the duration of the presentation to the minimum of time and the duration of E1; the playback speed is not changed.
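Complementing the union sketch above, a Python sketch of the ∩ and − operators on frame ranges of a single raw video (illustrative; touching ranges are treated as having no common footage):

```python
# Companion sketch for the ∩ and − operators on frame ranges of one raw video.
def intersect(e1: tuple[int, int], e2: tuple[int, int]) -> tuple[int, int] | None:
    """E1 ∩ E2: only common footage is played."""
    start, end = max(e1[0], e2[0]), min(e1[1], e2[1])
    return (start, end) if start < end else None

def difference(e1: tuple[int, int], e2: tuple[int, int]) -> list[tuple[int, int]]:
    """E1 - E2: only footage of E1 that is not in E2 is played."""
    parts = [(e1[0], min(e1[1], e2[0])), (max(e1[0], e2[1]), e1[1])]
    return [(s, e) for s, e in parts if s < e]

print(intersect((10, 30), (20, 40)))   # -> (20, 30)
print(difference((10, 30), (20, 40)))  # -> [(10, 20)]
```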

  13. Composition Operators (2) • transition E1 E2 type time: defines a transition effect of the given type between E1 and E2; time defines the duration of the transition effect. The transition type is one of a set of transition effects, such as dissolve, fade, and wipe. • contains E1 query: defines the presentation that contains the component expressions of E1 that match query. A query is a Boolean combination of attributes. Example: text: smith AND text: question

  14. Descriptions • description E1 content: specifies that E1 is described by content. • Content is a Boolean combination of attributes, each consisting of a field name and a value. • Some field names have predefined semantics (e.g., title), while other fields are user-definable. • Values can assume a variety of types, including strings and video node names. • Field names and values do not have to be unique within a description. • hide-content E1: defines a presentation that hides the content of E1 (i.e., E1 does not carry any description). • This operation provides a method for creating abstraction barriers for content-based access. Example: title = "CNN Headline News"

  15. Output Characteristics • Video expressions include output characteristics that specify the screen layout and audio output for playing back child streams. • Since expressions can be nested, the spatial layout of any particular video expression is defined relative to the parent rectangle. • window E1 (X1, Y1) - (X2, Y2) priority: specifies that E1 will be displayed with the given priority in the window defined by the top-left corner (X1, Y1) and the bottom-right corner (X2, Y2), where Xi ∈ [0, 1] and Yi ∈ [0, 1]. Window priorities are used to resolve overlap conflicts of the screen display. • audio E1 channel force priority: specifies that the audio of E1 will be output to channel with the given priority; if force is true, the audio operation overrides any channel specifications of the component video expressions.
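A small Python sketch of the relative-layout rule: a child window given in [0, 1] coordinates is mapped into its parent rectangle (the function name and pixel screen are illustrative):

```python
# Sketch of relative window layout: each expression's window is given in
# [0, 1] coordinates relative to its parent rectangle.
Rect = tuple[float, float, float, float]   # (x1, y1, x2, y2)

def to_absolute(parent: Rect, rel: Rect) -> Rect:
    """Map a child window in [0,1]^2 coordinates into the parent rectangle."""
    px1, py1, px2, py2 = parent
    w, h = px2 - px1, py2 - py1
    rx1, ry1, rx2, ry2 = rel
    return (px1 + rx1 * w, py1 + ry1 * h, px1 + rx2 * w, py1 + ry2 * h)

screen = (0.0, 0.0, 1280.0, 720.0)
# window E (0.5, 0.5) - (1, 1): the bottom-right quadrant of the screen
print(to_absolute(screen, (0.5, 0.5, 1.0, 1.0)))  # -> (640.0, 360.0, 1280.0, 720.0)
```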

  16. Output Characteristics: An Example
C1 = create MavericksvsBulls.rv 30:0 50:0
P1 = window C1 (0, 0) - (0.5, 0.5) 10
P2 = window C1 (0, 0.5) - (0.5, 1) 20
P3 = window C1 (0.5, 0.5) - (1, 1) 30
P4 = window C1 (0.5, 0) - (1, 0.5) 40
P5 = (P1 || P2 || P4)
P6 = (P1 || P2 || P3 || P4)
(P5 || (window P6 (0.5, 0.5) - (1, 1) 50))
Window corners are given as top-left and bottom-right in [0, 1] coordinates; a larger value means higher priority. The final expression plays the three-window layout P5 with the four-window layout P6 nested, at higher priority, in the bottom-right quadrant.

  17. Scope of a Video Node Description • The scope of a given algebraic video node description is the subgraph that originates from the node. • The components of a video expression inherit descriptions by context. • All the content attributes associated with a parent video node are also associated with all of its descendant nodes.

  18. Content-Based Access • Search query: searches a collection of video nodes for video expressions that match query. • Strategy: matching a query to the attributes of an expression must take into account all of the attributes of that expression, including the attributes of its encompassing expressions. • Example: search text: smith AND text: question. [Figure: the node "Smith on economic reform" is the result of the query; the node "Question from audience" also satisfies the query but is not returned because it is a descendant of a node already in the result set]
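A Python sketch of both ideas, scope inheritance and topmost-only results, under our reading of the slides (the VideoNode class and attribute strings are illustrative):

```python
# Sketch of description scope and content-based search over a node hierarchy:
# descendants inherit ancestor attributes, and search returns only the
# topmost matching nodes (descendants of a hit are suppressed).
from dataclasses import dataclass, field

@dataclass
class VideoNode:
    attrs: set[str]
    children: list["VideoNode"] = field(default_factory=list)

def search(node: VideoNode, query: set[str], inherited: set[str] = frozenset()) -> list[VideoNode]:
    effective = set(inherited) | node.attrs        # attributes inherited by scope
    if query <= effective:                         # all query attributes match
        return [node]                              # stop: descendants are in scope
    hits = []
    for child in node.children:
        hits.extend(search(child, query, effective))
    return hits

question = VideoNode({"text:question"})
smith = VideoNode({"text:smith", "text:question"}, children=[question])
root = VideoNode(set(), children=[VideoNode({"text:anchor"}), smith])
# -> [smith] only; its child also satisfies the query by inheritance
#    but is suppressed as a descendant of a node already in the result set
print(search(root, {"text:smith", "text:question"}))
```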

  19. Browsing and Navigation • playback presentation: plays back the video expression, enabling the user to view the presentation defined by the expression. • display video-expression: displays the video expression, allowing the user to inspect it. • get-parent video-expression: returns the set of nodes that directly point to video-expression. • get-children video-expression: returns the set of nodes that video-expression directly points to.

  20. Algebraic Video System Prototype • The Algebraic Video System is a prototype implementation of the algebraic video data model and its associated operations. • The implementation is built on top of three existing subsystems: • The VuSystem is used for managing raw video data and for its support of Tcl (Tool Command Language) programming. It provides an environment for recording, processing, and playing video. • The Semantic File System is used as a storage subsystem with content-based access to data, for indexing and retrieving the files that represent algebraic video nodes. • The WWW server provides a graphical interface to the system that includes facilities for querying, navigating, video editing and composing, and invoking the video player.

  21. Multimedia Objects in Relational Databases • The most straightforward and fundamental support for multimedia data types in an RDBMS is the ability to declare variable-length fields in tables. • Names of variable-length bit or character string types used in commercial products include: VARCHAR, BLOB, TEXT, IMAGE, CHARACTER VARYING /* SQL92 */, VARGRAPHIC, LONG RAW, BYTE VARYING, BIT VARYING /* SQL92 */. • Some systems limit variable-length fields to as little as 256 bytes; others allow field values as large as 2 GBytes.
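As an illustration of variable-length binary fields (not tied to any product named above), Python's built-in sqlite3 module can store and fetch a BLOB column:

```python
# Illustration: storing a variable-length BLOB field with Python's sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")

frame_data = bytes(range(256)) * 4          # stand-in for binary image/video data
conn.execute("INSERT INTO media (name, content) VALUES (?, ?)", ("clip1", frame_data))

(blob,) = conn.execute("SELECT content FROM media WHERE name = ?", ("clip1",)).fetchone()
print(len(blob))                            # -> 1024 bytes retrieved intact
```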

  22. BLOBs in InterBase • InterBase stores BLOBs in collections of segments. A segment in InterBase can be thought of as a fixed-length "page" or I/O block. • InterBase provides special API calls to retrieve and modify the segments: open-BLOB opens the BLOB for reading; get-segment reads the next segment; create-BLOB opens the BLOB for writes or updates; put-segment saves the changes to the BLOB. • Users can specify the length of each segment.
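A Python sketch of the segment-based read loop; open_blob and get_segment mirror the calls named on the slide but run against an in-memory store, so they are illustrative, not the real InterBase client API:

```python
# Sketch of segment-based BLOB access in the style described above
# (illustrative stand-ins, not the actual InterBase API).
SEGMENT_SIZE = 4096                      # user-specified segment ("page") length
_blob_store: dict[int, bytes] = {1: b"x" * 10_000}

def open_blob(blob_id: int) -> dict:
    """open-BLOB: return a read cursor over the stored BLOB."""
    return {"data": _blob_store[blob_id], "pos": 0}

def get_segment(handle: dict) -> bytes:
    """get-segment: read the next fixed-length segment; b'' at end of BLOB."""
    start = handle["pos"]
    handle["pos"] += SEGMENT_SIZE
    return handle["data"][start:start + SEGMENT_SIZE]

handle = open_blob(1)
segments = []
while (seg := get_segment(handle)):
    segments.append(seg)
print(len(segments), sum(map(len, segments)))   # -> 3 segments, 10000 bytes
```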

  23. IMAGE & TEXT in Sybase's SQL Server • The TEXT and IMAGE data types are supported in Sybase's Transact-SQL, an enhanced version of the SQL standard. • TEXT and IMAGE values can be as large as 2 GBytes. • Internally, TEXT and IMAGE column values contain pointers to the first page of a linked list of pages. • Some of the functions supported: PATINDEX("pattern", column) returns the starting position of the first occurrence of "pattern" in the column; TEXTPTR("column") returns a pointer to the variable-length field.

  24. OODBs and Multimedia Applications Object-oriented databases are more suitable for multimedia application development. • Better complex-object support: by their nature, many multimedia database applications, such as compound documents, need complex-object support. • Extensibility and the ability to add new types (classes): users can add new types and extend the existing class hierarchies to address the specific needs of a multimedia application. • Better concurrency control and transaction-model support: transaction concepts such as long transactions and nested transactions are important for multimedia applications.

  25. Multimedia Data Types in UniSQL/X • UniSQL/X supports a class hierarchy rooted at the generalized large object (GLO) class. • The GLO class serves as the root of the multimedia data type classes and provides a number of built-in attributes and methods. • For the content of GLO objects, the user can create either a Large Object (LO) or a File-Based Object (FBO). • LOs can only be accessed through UniSQL/X. • FBOs are stored in the host file system; the database stores a reference or path for each FBO. • In addition to the base class GLO, UniSQL/X supports subclasses of GLO for specific multimedia data types: • Audio class • Image class
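A Python sketch of such a hierarchy; the class and attribute names are stand-ins for illustration, not the actual UniSQL/X built-ins:

```python
# Illustrative sketch of a GLO-style type hierarchy (not the real UniSQL/X classes).
class GLO:
    """Generalized Large Object: root of the multimedia type hierarchy."""
    def __init__(self, content_ref: str, file_based: bool):
        self.content_ref = content_ref   # LO handle, or host-file path for an FBO
        self.file_based = file_based     # FBO: content lives in the host file system

class Image(GLO):
    def __init__(self, content_ref: str, file_based: bool, width: int, height: int):
        super().__init__(content_ref, file_based)
        self.width, self.height = width, height

class Audio(GLO):
    def __init__(self, content_ref: str, file_based: bool, duration_s: float):
        super().__init__(content_ref, file_based)
        self.duration_s = duration_s

logo = Image("/assets/logo.tiff", file_based=True, width=640, height=480)
print(isinstance(logo, GLO))   # -> True: subclasses extend the GLO root
```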

  26. Programming Multimedia Applications • An application is considered to be a multimedia object. • An application object uses or consists of many Basic Multimedia Objects (BMOs) and Compound Multimedia Objects (CMOs). • The specification of an object includes: • binding information to a file • methods • event-driven processing (e.g., displaying the last image if the video ends before the audio). • The use of methods and events allows the application to create a script that expresses the interactions of the different objects precisely and relatively simply.

  27. A Multimedia-Program Example

  28. Multimedia Information Retrieval (and Indexing) • Multimedia information retrieval: • deals with the storage, retrieval, transport, and presentation of different types of multimedia data (e.g., images, video clips, audio clips, texts, ...) • there is a real need for managing multimedia data, including their retrieval. • Multimedia information retrieval in general: • the retrieval process: • queries • indexing the media • matching media and query representations

  29. MMDBMS and Retrieval: What Is That? A first attempt at a clearer meaning. • Example: an insurance company's accident claim report as a multimedia object; it includes: • images (or video) of the accident • insurance forms with structured data • audio recordings of the parties involved in the accident • a text report by the insurance company's representative. • Multimedia databases store structured data and unstructured data. • Multimedia retrieval systems must retrieve structured and unstructured data.

  30. MMDBMS and Retrieval (cont.) • Retrieval of structured data from databases: • typically handled by a Database Management System (DBMS) • DBMS provides a query language (e.g., Structured Query Language, SQL for the relational data model) • deterministic matching of query and data • Retrieval of unstructured data from databases: • typically handled by Information Retrieval (IR) system • similarity matching of uncertain query and document representations • result: list of documents according to relevance

  31. MMDBMS and Retrieval (cont.) • Multimedia database management systems should combine Database Management System (DBMS) and Information Retrieval (IR) technology: • the data modeling capabilities of DBMSs with the advanced, similarity-based query capabilities of IR systems. • Challenge = finding a data model that ensures: • effective query formulation and document representation • efficient storage • efficient matching • effective delivery

  32. MMDBMS and Retrieval (cont.) • Query formulation: • must accommodate the information needs of users of multimedia systems. • Document representations and their storage: • appropriate modeling of the structure and content of a wide range of data in many different formats (= indexing) -> XML? -> MPEG-7 • cf. dealing with thousands of images, documents, audio and video segments, and free text • at the same time, modeling of physical properties for: • compression/decompression, synchronization, delivery -> MPEG-21

  33. MMDBMS and Retrieval (cont.) • Matching of query and document representations: • taking into account the variety of attributes, and their relationships, in query and document representations • combining exact matching of structured data with uncertain matching of unstructured data. • Delivery of data: • browsing, retrieval • temporal constraints of video and audio presentation • merging of data from different sources (e.g., in medical networks)

  34. MMDBMS Queries 1) As in many retrieval systems, the user can combine querying with browsing and navigating through hyperlinks; this requires: • topic maps • summary descriptions of the multimedia objects. 2) Queries specifying conditions on the objects of interest • the idea of a multimedia query language: • it should provide predicates for expressing conditions on the attributes, structure, and content (semantics) of multimedia objects.

  35. MMDBMS Queries (cont.) • attribute predicates: • concern the attributes of multimedia objects with an exact value (cf. traditional DB attributes): • e.g., date of a picture, name of a show • structural predicates: • temporal predicates to specify temporal synchronization: • for continuous media such as audio and video • for expressing temporal relationships between the frame representations of a single audio or video • e.g., “Find all the objects in which a jingle is playing for the duration of an image display”

  36. MMDBMS Queries (cont.) • Spatial predicates specify spatial layout properties for the presentation of multimedia objects; examples of predicates: contains, is contained in, intersects, is adjacent to. E.g., "Find all the objects containing an image overlapping the associated text". • Temporal and spatial predicates can be combined: e.g., "Find all the objects in which the logo of the car company is displayed and, when it disappears, a graphic (showing the increase in the company's sales) is shown in the same position where the logo was". • Temporal and spatial predicates can refer to whole objects or to subcomponents of objects, given a data model that supports complex object representation.
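A Python sketch of two of these spatial predicates over axis-aligned rectangles (the coordinate convention and example rectangles are ours):

```python
# Sketch of spatial predicates over axis-aligned rectangles (x1, y1, x2, y2).
Rect = tuple[float, float, float, float]

def contains(a: Rect, b: Rect) -> bool:
    """a contains b entirely."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

def intersects(a: Rect, b: Rect) -> bool:
    """a and b share at least one point (e.g., an image overlapping its text)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

logo = (0.0, 0.0, 0.2, 0.1)
text = (0.1, 0.05, 0.6, 0.3)
print(intersects(logo, text), contains(text, logo))   # -> True False
```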

  37. MMDBMS Queries (cont.) • Semantic predicates: • concern the semantic, unstructured content of the data involved • are evaluated on the features that have been extracted and stored for each multimedia object • e.g., "Find all the objects containing the word OFFICE" or "Find all red houses" • uncertainty, proximity, and weights can be expressed in the query. • Multimedia query language: • a structured language • users do not formulate queries in this language directly, but enter query conditions by means of interfaces • natural language queries? • the interface translates the query into the correct query syntax.

  38. MMDBMS Queries 3) Query by example: • e.g., video, audio • the query is composed by picking an example and choosing the features the object must comply with • e.g., in a graphical user interface (GUI), the user chooses the image of a house and domain features for the query: "Retrieve all houses of similar shape and different color" • e.g., music: a recorded melody, or a note sequence entered via the Musical Instrument Digital Interface (MIDI). 4) Question answering? • e.g., questioning video images: "How many helicopters were involved in the attack on Kabul of December 20, 2001?"

  39. MMDBMS Example: Oracle's interMedia • Enables Oracle9i to manage rich content, including images, audio, and video, in an integrated fashion with other traditional business data. • With interMedia, developers can parse, index, and store rich content, develop content-rich Web applications, deploy rich content on the Web, and tune Oracle9i content repositories. • interMedia provides data management services to support the rich data types used in electronic commerce catalogs, corporate repositories, Web publishing, corporate communications and training, media asset management, and other applications for internet, intranet, extranet, and traditional environments. • http://technet.oracle.com

  40. MMDBMS Indexing • Remember: indexing and retrieval systems. • Indexing = assigning or extracting features that will be used for unstructured and structured queries (unfortunately, this often refers only to low-level features). • Often also segmentation: detection of retrieval units. • Two main approaches: • manual: • segmentation • indexing = naming of objects and their relationships with key terms (natural language or a controlled language) • automatic analysis: • identify the mathematical characteristics of the contents • different techniques depending on the type of multimedia source (image, text, video, or audio) • possible manual correction.

  41. Indexing Multimedia and Features • A multimedia object is typically represented as a set of features (e.g., as a vector of features). • Features can be weighted (expressing the uncertainty or significance of their values). • Feature vectors can be stored and searched in an index tree. • Features have to be linked with the semantic content.
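A Python sketch of weighted feature vectors with a brute-force similarity search (a real system would search an index tree; the objects, weights, and cosine measure here are illustrative):

```python
# Sketch of a weighted feature-vector representation with similarity search.
import numpy as np

# each multimedia object: a feature vector
objects = {"img1": np.array([0.9, 0.1, 0.3]),
           "img2": np.array([0.2, 0.8, 0.5])}
weights = np.array([1.0, 0.5, 0.5])      # per-feature significance weights

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Weighted cosine similarity between two feature vectors."""
    aw, bw = a * weights, b * weights
    return float(aw @ bw / (np.linalg.norm(aw) * np.linalg.norm(bw)))

query = np.array([1.0, 0.0, 0.2])
print(max(objects, key=lambda k: similarity(objects[k], query)))  # -> img1
```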

  42. Indexing Images • Automatic indexing of images: • segmentation into homogeneous segments: • a homogeneity predicate defines the conditions for automatically grouping cells • e.g., in a color image, cells that are adjacent to one another and whose pixel values are close are grouped into a segment • indexing: recognition of objects and simple patterns: • recognition of low-level features: color histograms, textures, shapes (e.g., person, house), position • appearance features are often not important in retrieval.
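A Python sketch of one such low-level feature, a quantized 4 x 4 x 4 RGB color histogram (the same kind of feature reappears in the Mori et al. example below); the implementation details are ours:

```python
# Sketch of a quantized 4x4x4 RGB color histogram feature.
import numpy as np

def rgb_histogram(image: np.ndarray, bins: int = 4) -> np.ndarray:
    """image: (H, W, 3) uint8 array -> normalized histogram over bins**3 colors."""
    quantized = (image.astype(np.uint16) * bins) // 256        # 0..bins-1 per channel
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3)
    return hist / hist.sum()

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(rgb_histogram(img).shape)   # -> (64,): one 4x4x4 feature vector per image (part)
```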

  43. Indexing Audio • Automatic indexing of audio: • segmentation into sequences (= the basic units for retrieval): often manual • indexing: • speech recognition and indexing of the resulting transcripts (cf. indexing in written text retrieval) • acoustic analysis (e.g., sounds, music, songs: melody transcription: note encoding, interval and rhythm detection, and chord information), translated into a string • e.g., key melody extraction: Tseng, 1999.

  44. Scene Segmentation Based on Audio Information • Short-Time Energy (STE) is a reliable indicator for silence detection. • Zero-Crossing Rate (ZCR) is a useful feature for characterizing different non-silence audio signals (especially for discerning unvoiced speech). • Pitch (P value) is the fundamental frequency of an audio waveform. • Spectrum Flux (SF) is defined as the average variation of the spectrum between two adjacent frames in a short-time analysis window; it is used to discriminate speech from environmental sound.
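A Python sketch of the first two features, computed per fixed-length frame with NumPy (the frame length and toy signal are illustrative):

```python
# Sketch: per-frame Short-Time Energy (STE) and Zero-Crossing Rate (ZCR).
import numpy as np

def frame_features(signal: np.ndarray, frame_len: int = 512):
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    ste = (frames ** 2).mean(axis=1)                                   # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)  # sign changes
    return ste, zcr

t = np.linspace(0, 1, 16_000)
signal = np.concatenate([np.zeros(4096), np.sin(2 * np.pi * 440 * t)])  # silence, then tone
ste, zcr = frame_features(signal)
print(ste[:2], ste[-2:])   # near-zero energy in the leading frames flags silence
```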

  45. Indexing Video • Automatic indexing of video: • a segment is the basic unit for retrieval • objects and activities identified in each video segment can be used to index the segment • segmentation: • detection of video shot breaks and camera motions • boundaries in the audio material (e.g., a different music tune, changes in speaker) • textual topic segmentation of transcripts of the audio and of closed captions (see below) • heuristic rules based on knowledge of: • the type-specific schematic structure of the video (e.g., documentary, sports) • certain cues: the appearance of the anchor person in news => new topic.

  46. An Example of Indexing • Learning textual descriptions of images from surrounding text (Mori et al., 2000): • training: • images are segmented into image parts of equal size • feature extraction for each image part (by quantization): • 4 x 4 x 4 RGB color histogram • 8 directions x 4 resolutions intensity histogram • the words that accompany the image are inherited by each image part: • words are selected from the text of the document that contains the image by taking nouns and adjectives that occur with a frequency above a threshold • similar image parts are clustered based on their extracted features: • single-pass partitioning algorithm with a minimum similarity threshold value.

  47. An Example of Indexing (cont.) • For each word wi and each cluster cj, the conditional probability is estimated as P(wi|cj) = mji / Mj, where mji = total frequency of word wi in cluster cj and Mj = total frequency of all words in cj. • testing: • an unknown image is divided into parts and image features are extracted • for each part, the nearest cluster is found as the cluster whose centroid is most similar to the part • the average likelihood of all the words of the nearest clusters is computed • the k words with the largest average likelihood are chosen to index the new image (in the example, k = 3).
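A Python sketch of the estimation and annotation steps, following the slide's notation with toy counts and centroids (all values illustrative):

```python
# Sketch of the Mori et al. steps: P(w_i | c_j) = m_ji / M_j, then annotation
# of a new image by the average likelihood over its parts' nearest clusters.
import numpy as np

vocab = ["house", "red", "sky", "blue"]
m = np.array([[4.0, 3.0, 0.0, 1.0],     # m_ji for cluster 0
              [0.0, 1.0, 5.0, 4.0]])    # m_ji for cluster 1
p_w_given_c = m / m.sum(axis=1, keepdims=True)       # P(w_i | c_j) = m_ji / M_j

centroids = np.array([[0.9, 0.1], [0.1, 0.9]])       # cluster feature centroids

def annotate(parts: np.ndarray, k: int = 3) -> list[str]:
    """Each image part votes through its nearest cluster's word distribution."""
    nearest = np.argmin(np.linalg.norm(centroids[None] - parts[:, None], axis=2), axis=1)
    avg_likelihood = p_w_given_c[nearest].mean(axis=0)   # average over all parts
    return [vocab[i] for i in np.argsort(avg_likelihood)[::-1][:k]]

parts = np.array([[0.8, 0.2], [0.85, 0.3], [0.2, 0.8]])  # extracted part features
print(annotate(parts))   # -> k words with the largest average likelihood
```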

  48. [Example figure; source: Mori et al.]

  49. [Example figure; source: Mori et al.]
