The MPEG-7 Multimedia Content Description Interface Anastasia Bolovinou, PhD candidate, Institute of Informatics and Telecommunications, N.C.S.R. "Demokritos"
Outline • MPEG-7 motivation and scope • Visual Descriptors (color, texture, shape) • MPEG-7 retrieval evaluation criterion • Similarity measures and MPEG-7 visual descriptors • Building MPEG-7 Descriptors and Description Schemes with the Description Definition Language • MPEG-7 VXM current state • Towards an MPEG-7 Query Format framework (queries and the visual descriptor tools they employ) • Summary
MPEG-7 motivation and design scenarios (possible queries) Standardize multimedia metadata descriptions (to facilitate content-based retrieval) for the proliferating audiovisual content: sports, news, scientific content, consumer content, digital art galleries, recorded material • Music/audio: play a few notes and return pieces with similar music/audio • Images/graphics: draw a sketch and return images with similar graphics • Text/keywords: find AV material whose subject matches a keyword • Movement: describe movements and return video clips with the specified temporal and spatial relations • Scenario: describe actions and return scenarios where similar actions take place
Scope of the Standard Description production (extraction) -> Standard description -> Description consumption. Only the standard description itself is the normative part of MPEG-7. MPEG-7 does not specify (non-normative parts of MPEG-7): • How to extract descriptions (feature extraction, indexing process, annotation & authoring tools, ...) • How to use descriptions (search engine, filtering tool, retrieval process, browsing device, ...) • The similarity between contents -> The goal is to define the minimum that enables interoperability.
Visual Descriptors (normative) • Localization: Region Locator, Spatio-Temporal Locator • Color: Dominant Color, Scalable Color, Color Layout, Color Structure, GoF/GoP Color • Texture: Homogeneous Texture, Texture Browsing, Edge Histogram • Shape: Region Shape, Contour Shape, 3D Shape • Motion (for video): Camera Motion, Motion Trajectory, Parametric Motion, Motion Activity • Other: Face Recognition
Color Descriptors • Color spaces: R,G,B; Y,Cb,Cr; H,S,V; HMMD; monochrome; linear transformations of R,G,B • Constrained color spaces: the Scalable Color Descriptor uses HSV; the Color Structure Descriptor uses HMMD; the Color Layout Descriptor uses YCbCr; GroupOfFrames/Pictures Color builds on Scalable Color
Scalable Color Descriptor (SCD) • A color histogram in HSV color space • Encoded by a Haar transform • Feature vector: F = {NoCoef, NoBD, Coeff[..], CoeffSign[..]}
SCD extraction: the HSV histogram (at most 256 bins, N bits/bin) is first quantized to 11 bits/bin and then nonlinearly reduced to 4 bits/bin before the Haar-transform encoding.
GoF/GoP Color Descriptor • Extends the Scalable Color Descriptor to a video segment or group of pictures (the joint color histogram is then processed as an SCD, i.e. Haar-transform encoded) • Extraction: histogram aggregation methods: • Average ..but sensitive to outliers (lighting changes, occlusion, text overlays) • Median ..increased computational complexity due to sorting • Intersection ..differs: a "least common" color trait viewpoint
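The three aggregation strategies above can be sketched bin-wise over per-frame histograms. A minimal illustration with toy data; the function name `aggregate` is hypothetical, not from the standard:

```python
# Sketch of GoF/GoP histogram aggregation: combine per-frame histograms
# bin by bin, using one of the three strategies named in the slide.
def aggregate(histograms, method="average"):
    bins = zip(*histograms)  # one tuple of values per bin
    if method == "average":        # simple, but sensitive to outlier frames
        return [sum(b) / len(b) for b in bins]
    if method == "median":         # robust, but requires a sort per bin
        return [sorted(b)[len(b) // 2] for b in bins]
    if method == "intersection":   # "least common" color viewpoint
        return [min(b) for b in bins]
    raise ValueError(method)
```

Intersection keeps only the color mass present in every frame, which is why it behaves so differently from the other two when a color appears briefly (e.g. a text overlay).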
Dominant Color Descriptor (DCD) • Clusters the colors of a region into a small number of representative (salient) colors • F = {{ci, pi, vi}, s} • ci: representative colors • pi: their percentages in the region • vi: color variances • s: spatial coherency
DCD extraction (based on the generalized Lloyd algorithm): colors are iteratively assigned to the nearest cluster centroid ci, where x(n) is the color vector at pixel n and v(n) a per-pixel perceptual weight (the human visual system is more sensitive to smooth regions). Spatial coherency s is the average number of connected pixels of a dominant color, computed with a 3x3 masking window.
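The Lloyd clustering step can be sketched as plain k-means over RGB triples. This toy version omits the perceptual weights v(n), the variances vi and the spatial-coherency term; `dominant_colors` is a hypothetical name for illustration only.

```python
import random

# Toy sketch of dominant-color extraction with the generalized Lloyd
# (k-means) algorithm; percentages pi are relative cluster sizes.
def dominant_colors(pixels, k=2, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(pixels, k)      # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:                   # assign to nearest centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [                      # recompute centroids
            tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    percents = [len(cl) / len(pixels) for cl in clusters]
    return centroids, percents
```

In the real descriptor the number of clusters is kept small (salient colors only) and clusters are merged when their centroids fall within a perceptual distance threshold.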
http://debut.cis.nctu.edu.tw/Demo/ContentBasedVideoRetrieval/CBVR/Dominant/index.html
Color Layout Descriptor (CLD) • Partition the image into 64 (8x8) blocks • Derive the average color of each block (or use the DCD) • Apply an 8x8 DCT and encode the coefficients • Efficient for: • Sketch-based image retrieval • Content filtering using image indexing
CLD extraction • The derived average colors are transformed into a series of coefficients by an 8x8 DCT (spatial-domain data -> frequency-domain data). • A few low-frequency coefficients are selected by zigzag scanning and quantized to form the CLD (a large quantization step for the AC coefficients, a small one for the DC coefficient). • The color space adopted for the CLD is YCbCr. If the spatial data is smooth (little variation), the transform concentrates the energy in the low-frequency coefficients and leaves the high-frequency ones small. F = {CoefPattern, YDCCoef, CbDCCoef, CrDCCoef, YACCoef, CbACCoef, CrACCoef}
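The DCT-plus-zigzag step above can be sketched for a single channel. This is a naive O(N^4) illustration (a real encoder would use a fast DCT); `dct_8x8` and `zigzag` are hypothetical helper names, and quantization is omitted.

```python
import math

# Sketch of CLD coefficient selection for one channel: apply an 8x8
# DCT-II to the grid of block-average colors, then keep the first few
# coefficients in JPEG-style zigzag (low-frequency-first) order.
def dct_8x8(block):
    N = 8
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for x in range(N) for y in range(N))
            out[u][v] = cu * cv * s
    return out

def zigzag(matrix, count):
    # Diagonals in order of u+v; alternate direction on odd/even diagonals.
    order = sorted(((u, v) for u in range(8) for v in range(8)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [matrix[u][v] for u, v in order[:count]]
```

For a smooth (here: constant) channel, all the energy lands in the DC coefficient and the selected AC coefficients are essentially zero, which is exactly the behavior the slide describes.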
Color Structure Descriptor (CSD) • Scan the image with an 8x8 structuring element • For each color, count the number of element positions that contain it • Generate a color histogram (HMMD space, at the four CSQ operating points)
CSD extraction: for images larger than 256x256 pixels, a subsampling factor p is derived from the image dimensions so that the 8x8 structuring element covers an equivalent relative spatial extent. F = {colQuant, Values[m]}
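The structuring-element scan can be sketched directly. A toy version on an already color-quantized image, without the HMMD quantization or the size-dependent subsampling; `color_structure_histogram` is a hypothetical name:

```python
# Sketch of the Color Structure scan: slide an 8x8 structuring element
# over the image and, for each color, count the number of window
# positions in which that color appears at least once. Unlike a plain
# histogram, this captures how spatially spread-out each color is.
def color_structure_histogram(image, n_colors, win=8):
    h = [0] * n_colors
    rows, cols = len(image), len(image[0])
    for r in range(rows - win + 1):
        for c in range(cols - win + 1):
            present = {image[r + i][c + j]
                       for i in range(win) for j in range(win)}
            for color in present:
                h[color] += 1
    return h
```

Two images with identical pixel counts per color can thus produce different CSDs: a scattered color touches many window positions, a compact blob touches few.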
Texture Descriptors • Homogeneous Texture Descriptor • Non-Homogeneous Texture Descriptor (Edge Histogram) • Texture Browsing
Homogeneous Texture Descriptor (HTD) • Partition the frequency domain into 30 channels (each modeled by a 2D Gabor function) • Compute the energy and energy deviation of each channel • Compute the mean and standard deviation of the frequency coefficients -> F = {fDC, fSD, e1,…, e30, d1,…, d30} • An efficient implementation: Radon transform followed by a Fourier transform
HTD extraction – obtaining a 2-D frequency layout that follows the HVS (figure): the 2-D image f(x,y) is Radon-transformed into 1-D projections P(R, θ), which are then Fourier-transformed, F(P(R, θ)); the resulting sampling grid is expressed in polar coordinates.
HTD extraction – data sampling per feature channel: a 2D Gabor function defines the Gabor filter bank • It is a Gaussian-weighted sinusoid • It models an individual channel • Each channel filters a specific type of texture
HTD properties F = {fDC, fSD, e1,…, e30, d1,…, d30} One can perform • Rotation-invariant matching • Intensity-invariant matching (fDC removed from the feature vector) • Scale-invariant matching
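Rotation-invariant matching can be sketched as follows, assuming the 30 channel energies are laid out as 5 radial bands of 6 orientations each (an assumption about the layout; the helper names are hypothetical): rotating the image by a multiple of 30° circularly shifts the orientation index, so matching under the minimum over all shifts removes the rotation.

```python
# Sketch of rotation-invariant HTD matching: try all 6 circular shifts
# of the orientation index and keep the smallest L1 distance.
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def rotate_orientations(energies, shift, bands=5, orients=6):
    """Circularly shift the orientation index within each radial band."""
    out = []
    for band in range(bands):
        row = energies[band * orients:(band + 1) * orients]
        out.extend(row[shift:] + row[:shift])
    return out

def htd_distance_rotation_invariant(e1, e2, orients=6):
    return min(l1(e1, rotate_orientations(e2, s)) for s in range(orients))
```

By the same logic, intensity invariance is obtained by simply dropping fDC from the vector before computing the distance.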
Texture Browsing Descriptor -> Same spatial filtering procedure as the HTD (scale- and orientation-selective band-pass filters), but it characterizes a texture by: • Regularity (periodic to random) • Coarseness (fine grain to coarse) • Directionality (e.g. 30°) E.g. look for textures that are very regular and oriented at 30°. -> The Texture Browsing Descriptor can be used to find a set of candidates with similar perceptual properties; the HTD then yields a precise similarity-match list among the candidate images.
Edge Histogram Descriptor (EHD) • Represents the spatial distribution of five types of edges • vertical, horizontal, 45°, 135°, and non-directional • Divide the image into 16 (4x4) sub-images • Generate a 5-bin histogram for each sub-image • It is scale invariant • Strong edges are retained by thresholding the output of the Canny edge operator • F = {BinCounts[k]}, k = 80
EHD extraction (figure): an edge-map image is produced with the Canny edge operator; the basic descriptor has 80 bins (16 sub-images x 5 edge types), and the extended form with 150 bins adds global and semi-global histograms (13 clusters of sub-images for the semi-global part).
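The per-block edge classification can be sketched with the five 2x2 edge filters. A minimal single-sub-image illustration on a grayscale array; the names `FILTERS` and `edge_histogram` are illustrative, and the filter coefficients follow the commonly published MPEG-7 EHD set.

```python
import math

# Sketch of EHD block classification: score each 2x2 pixel block with
# five edge filters and histogram the winning edge type when its
# response exceeds a threshold.
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag45":          (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag135":         (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def edge_histogram(image, threshold=0.0):
    hist = dict.fromkeys(FILTERS, 0)
    for r in range(0, len(image) - 1, 2):
        for c in range(0, len(image[0]) - 1, 2):
            block = (image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1])
            scores = {k: abs(sum(f * p for f, p in zip(fs, block)))
                      for k, fs in FILTERS.items()}
            best = max(scores, key=scores.get)
            if scores[best] > threshold:   # keep only strong edges
                hist[best] += 1
    return hist
```

Raising the threshold implements the "retain strong edges" step; setting it to 0 on a binary edge map gives the sketch-retrieval variant mentioned on the next slide.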
EHD evaluation • Cannot be used for object-based image retrieval • If the edge threshold Tedge is set to 0, the EHD applies to binary edge images (sketch-based retrieval) • The extended EHD achieves better results but does not exhibit rotation invariance
Shape Descriptors • Region-based Descriptor • Contour-based Shape Descriptor • 2D/3D Shape Descriptor • 3D Shape Descriptor
Region-based Descriptor (RBD) • Expresses the pixel distribution within a 2-D object region • Employs a complex 2D Angular Radial Transformation (ART) • F = {MagnitudeOfART[k]}, k = n x m, with angular order m < 12 and radial order n < 3
Region-based Descriptor (2) • Applicable to figures (a) – (e) • Distinguishes (i) from (g) and (h) • (j), (k), and (l) are similar • Advantages: • Describes complex shapes with disconnected regions • Robust to segmentation noise • Small size • Fast extraction and matching
Contour-Based Descriptor (CBD) • It is based on Curvature Scale-Space representation
Curvature Scale-Space • Finds the curvature zero-crossing points of the shape's contour (key points) • Reduces the number of key points step by step by applying Gaussian smoothing • The positions of the key points are expressed relative to the length of the contour curve
CBD extraction: repetitive smoothing of the X and Y contour coordinates with the low-pass kernel (0.25, 0.5, 0.25) until the contour becomes convex; each filtering pass records yCSS and the locations xCSS of the curvature zero-crossing points. • F = {NofPeaks, GlobalCurv[ecc][circ], PrototypeCurv[ecc][circ], HighestPeakY, peakX[k], peakY[k]}
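One filtering pass of this smoothing can be sketched directly from the kernel given above; only the smoothing itself is shown (zero-crossing detection and the convexity test are omitted), and the function names are illustrative.

```python
# Sketch of the CSS smoothing step: filter the closed contour's X and Y
# coordinate sequences with the (0.25, 0.5, 0.25) low-pass kernel,
# wrapping around at the ends because the contour is closed.
def smooth_pass(coords):
    n = len(coords)
    return [0.25 * coords[i - 1] + 0.5 * coords[i] + 0.25 * coords[(i + 1) % n]
            for i in range(n)]

def smooth_contour(xs, ys, passes=1):
    for _ in range(passes):
        xs, ys = smooth_pass(xs), smooth_pass(ys)
    return xs, ys
```

Repeated passes shrink the contour toward a convex, ultimately point-like shape, which is why the iteration terminates: concavities (and with them the curvature zero crossings) disappear one by one, and the scale at which each vanishes is what the CSS peaks record.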
CBD applicability • Applicable to (a) • Distinguishes the differences in (b) • Finds the similarities in (c)–(e) • Advantages: • Captures the shape very well • Robust to noise, scale, and orientation • Fast and compact
Comparison (RB/CB descriptors) • Blue: Similar shapes by Region-Based • Yellow: Similar shapes by Contour-Based
How does MPEG-7 compare descriptors? ANMRR (average normalized modified retrieval rank): the traditional metric • Normalized measures were defined that take into account the different sizes of the ground-truth sets and the actual ranks obtained from the retrieval -> • Retrievals that miss items are assigned a penalty.
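The metric can be sketched in its commonly published formulation (assumed here, not quoted from this deck): ground-truth items ranked worse than a cutoff K are penalized with rank 1.25*K, the average rank is shifted by its ideal value, and the result is normalized to [0, 1] (0 = perfect retrieval, 1 = nothing found). The per-query default K = 2*NG below is a simplification; the standard formulation uses K = min(4*NG, 2*GTM).

```python
# Sketch of NMRR/ANMRR: `ranks` holds the retrieval rank of each
# ground-truth item for one query, `ng` is the ground-truth set size.
def nmrr(ranks, ng, k=None):
    if k is None:
        k = 2 * ng                       # simplified per-query cutoff
    penalized = [r if r <= k else 1.25 * k for r in ranks]
    avr = sum(penalized) / ng            # average (penalized) rank
    mrr = avr - 0.5 - ng / 2             # subtract the ideal average rank
    return mrr / (1.25 * k - 0.5 * (1 + ng))   # normalize to [0, 1]

def anmrr(per_query):
    """Average NMRR over a list of (ranks, ng) pairs, one per query."""
    return sum(nmrr(r, ng) for r, ng in per_query) / len(per_query)
```

The normalization is what makes scores comparable across queries whose ground-truth sets differ in size, which is the point the slide makes.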
Similarity between features • Descriptors are typically multidimensional vectors (of low-level features) • Similarity of two images in the feature vector space: • Range query: all the points within a hyper-rectangle aligned with the coordinate axes • Nearest-neighbour or within-distance (α-cut) query: a particular metric in the feature space • Dissimilarity between statistical distributions: the same metrics or specific measures
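The first two query types can be sketched over plain descriptor vectors (toy points, hypothetical function names; Euclidean distance is one possible metric choice):

```python
# Sketch of the two query types over descriptor vectors.
def range_query(points, lower, upper):
    """All points inside the axis-aligned hyper-rectangle [lower, upper]."""
    return [p for p in points
            if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))]

def nearest_neighbour(points, query):
    """The point minimizing squared Euclidean distance to the query."""
    return min(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
```

An α-cut (within-distance) query is the obvious variant of `nearest_neighbour` that returns every point whose distance falls below a threshold α rather than only the closest one.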
An example of a CBIR system using the HTD to perform range and NN queries: • http://nayana.ece.ucsb.edu/M7TextureDemo/Demo/client/M7TextureDemo.html
Criticism of the MPEG-7 distance measures • MPEG-7 adopts feature-vector-space distances based on geometric assumptions about the descriptor space (e.g. L1 and L2 norms) ..but these quantitative, low-level measures do not fit human similarity perception ideally -> researchers from other areas have developed alternative predicate-based models (descriptors are assumed to contain binary elements, as opposed to continuous data) which express the existence of properties and thus carry higher-level information. See the "pattern difference" measure: K: number of predicates in the data vectors Xi, Xj; b: number of properties present only in Xi; c: number of properties present only in Xj.
How to build and deploy an MPEG-7 Description: a description = a Description Scheme (structure) + a set of Descriptor values (the instantiation of a Descriptor for a given data set), expressed in the DDL. The MPEG-7 Description Tools are a library of standardized Descriptors and Description Schemes. Adopting XML Schema as the basis for the MPEG-7 DDL, with XML-compliant instances (descriptions in MPEG-7 textual format), eases interoperability through a common, generic, powerful and extensible representation format.
How that works • Description Definition Language: • -> XML Schema (flexibility): • XML Schema structural language components • XML Schema datatype language components • MPEG-7-specific extensions (support for vectors, matrices and typed references) • + -> binary version, BiM (efficiency); the textual (XML) format and the BiM format can also be mixed
Descriptions enabled by the MPEG-7 tools • Content description tools – perceptual descriptions: • the content's spatio-temporal structure • info on low-level features • semantic info related to the reality captured by the content • Content management tools – archival-oriented descriptions: • the content's creation/production • info on using the content • info on storing and representing the content • Organization/Navigation/Access/User interaction tools – additional info for organizing, managing and accessing the content: • how objects are related and gathered into collections • summaries/variations/transcoding to support efficient browsing • user-interaction info
<Mpeg7>
 <Description xsi:type="ContentEntity">
  <MultimediaContent xsi:type="VideoType">
   <Video id="video_example">
    <MediaInformation>...</MediaInformation>
    <TemporalDecomposition gap="false" overlap="false">
     <VideoSegment id="VS1">
      <MediaTime>
       <MediaTimePoint>T00:00:00</MediaTimePoint>
       <MediaDuration>PT2M</MediaDuration>
      </MediaTime>
      <VisualDescriptor xsi:type="GoFGoPColorType" aggregation="average">
       <ScalableColor numOfCoef="8" numOfBitplanesDiscarded="0">
        <Coeff>1 2 3 4 5 6 7 8</Coeff>
       </ScalableColor>
      </VisualDescriptor>
     </VideoSegment>
     ... (further VideoSegments) ...
    </TemporalDecomposition>
   </Video>
  </MultimediaContent>
 </Description>
</Mpeg7>
Which DS to choose? MPEG-7 provides DSs for describing the structure and semantics of AV content, plus content management; content-management info can be attached to individual segments.
Structure description (figure): a Video Segment is decomposed into Video Segments and further into Moving Regions; each node carries features such as Time, Mosaic, Color, Motion, Texture, Shape and Annotation, and relation links (e.g. "above") connect segments.
Segment decomposition (figure): decompositions are characterized by their time extent and connectivity.
Content structural aspects (Segment DS tree) • Annotate the whole image with a StillRegion • Spatial segmentation at different levels • Segment-relationship description tools can be used among the different regions
Content structural aspects (Segment Relationship DS graph) Temporal segments