
Aristotle University of Thessaloniki Department of Informatics



  1. Aristotle University of Thessaloniki Department of Informatics Visual Information Retrieval Constantine Kotropoulos Monday July 8, 2002

  2. Outline Fundamentals Still image segmentation: Comparison of ICM and LVQ techniques Shape retrieval based on Hausdorff distance Video Summarization: Detecting shots, cuts, and fades in video – Selection of key frames MPEG-7: Standard for Multimedia Applications Conclusions

  3. Fundamentals About Toward visual information retrieval Data types associated with images or video First generation systems Second generation systems Content-based interactivity Representation of visual content Similarity models Indexing methods Performance evaluation

  4. Visual information retrieval: To retrieve images or image sequences from a database that are relevant to a query. Extension of traditional information retrieval designed to include visual media. Needs: Tools and interaction paradigms that permit searching for visual data by referring directly to its content. Visual elements (color, texture, shape, spatial relationships) related to perceptual aspects of image content. Higher-level concepts: clues for retrieving images with similar content from a database. Multidisciplinary field: Information retrieval Image/video analysis and processing Visual data modeling and representation Pattern recognition Multimedia database organization Computer vision User behavior modeling Multidimensional indexing Human-computer interaction About

  5. Databases allow a large amount of alphanumeric data to be stored in a local repository and accessed by content through appropriate query languages. Information Retrieval Systems provide access to unstructured text documents: search engines working in the textual domain, either using keywords or full text. The need for Visual Information Retrieval Systems became apparent when digital archives were released. Distribution of image and video data through large-bandwidth computer networks becomes even more prominent as we progress to the wireless era! Toward visual information retrieval

  6. Query by image content using NOKIA 9210 Communicator www.iva.cs.tut.fi/COST211 Iftikhar et al.

  7. Content-independent metadata: Data not directly related to image/video content (e.g., format, author’s name, date, etc.) Content-dependent metadata: Low/intermediate-level features: color, texture, shape, spatial relationship, motion, etc. Data referring to content semantics (content-descriptive metadata) Impact on the internal organization of the retrieval system Data types associated with images or video

  8. Answers to queries: Find All images of paintings of El Greco. All Byzantine icons dated from the 13th century, etc. Content-independent metadata: alphanumeric strings Representation schemes: relational models, frame models, object-oriented Content-dependent metadata: annotated keywords or scripts Retrieval: Search engines working in the textual domain (SQL, full-text retrieval) Examples: PICDMS (1984), PICQUERY (1988), etc. Drawbacks: Difficult for text to capture the distinctive properties of visual features Text not appropriate for modeling perceptual similarity Subjective First generation systems

  9. Supports full retrieval by visual content Conceptual level: keywords Perceptual level: objective measurements at pixel level Other sensory data (speech, sound) might help (e.g. video streams). Image processing, pattern recognition and computer vision are an integral part of architecture and operation Retrieval systems for 2-D still images Video 3-D images and video WWW Second generation systems

  10. Content: Perceptual properties: color, texture, shape, and spatial relationships Semantic primitives: objects, roles, and scenes Impressions, emotions, and meaning associated with the combination of perceptual features Basic retrieval paradigm: For each image a set of descriptive features is pre-computed Queries by visual examples The user selects the features, ranges of model parameters, and chooses a similarity measure The system checks the similarity between the visual content of the user’s query and database images. Objective: To keep the number of misses as low as possible. Number of false alarms? Interaction: Relevance feedback Retrieval systems for 2-D still images (1)

  11. Similarity vs. matching Matching is a binary partition operator: “Does the observed object correspond to a model or not?” Uncertainties are managed during the process Similarity-based retrieval: To re-order the database of images according to how similar they are to a query example. Ranking, not classification The user is in the retrieval loop; need for a flexible interface. Retrieval systems for 2-D still images (2)

  12. Video conveys information from multiple planes of communication How the frames are linked together using editing effects (cuts, fades, dissolves, etc.) What is in the frames (characters, story content, etc.) Each type of video (commercials, news, movies, sport) has its own peculiar characteristics. Basic Terminology Frame: basic unit of information, usually sampled at 1/25 or 1/30 of a second. Shot: A set of frames between a camera turn-on and a camera turn-off Clip: A set of frames with some semantic content Episode: A hierarchy of shots; Scene: A collection of consecutive shots that share simultaneity in space, time, and action (e.g. a dialog scene). Video is accessed through browsing and navigation Retrieval systems for video (1)

  13. Retrieval systems for video (2)

  14. 3-D images and video are available in biomedicine computer-aided design Geographic maps Painting Games and entertainment industry (immersive environments) Expected to flourish in the current decade Retrieval on the WWW: Distributed problem Need for standardization (MPEG-7) Response time is critical (work in the compressed domain, summarization) Retrieval systems for 3-D images and video / WWW

  15. Visual interfaces Standards for content representation Database models Tools for automatic extraction of features from images and video Tools for extraction of semantics Similarity models Effective indexing Web search and retrieval Role of 3-D Research directions

  16. Browsing offers a panoramic view of the visual information space Visualization Content-based interactivity www.virage.com

  17. QBIC color layout http://wwwqbic.almaden.ibm.com/

  18. Querying by content (1) For still images: • To check if the concepts expressed in a query match the concepts of database images: “find all Holy Ikons with a nativity” “find all Holy Ikons with Saint George” (object categories) Treated with free-text or SQL-based retrieval engines (Google) • To verify spatial relations between spatial entities “find all images with a car parked outside a house” • topological queries (disjunction, adjacency, containment, overlapping) • metric queries (distances, directions, angles) Treated with SQL-like spatial query languages

  19. Querying by content (2) • To check the similarity of perceptual features (color, texture, edges, corners, and shapes) • exact queries: “find all images of President Bush” • range queries: “find all images with colors between green and blue” • K-nearest neighbor queries: “find the ten most similar images to the example” For video: • Concepts related to video content • Motion, objects, texture, and color features of video: Shot extraction, dominant colors, etc.

  20. Google

  21. Ark of Refugee Heirloom www.ceti.gr/kivotos

  22. Suited to express perceptual aspects of low/intermediate-level features of visual content. The user provides a prototype image as a reference example Relevance feedback: the user analyzes the responses of the system and indicates, for each item retrieved, the degree of relevance or the exactness of the ranking; the annotated results are fed back into the system to refine the query. Types of querying: Iconic (PN): Suitable for retrieval based on high-level concepts By painting: Employed in color-based retrieval (NETRA) By sketch (PICASSO) By image (NETRA) Querying by visual example

  23. PICASSO/PN http://viplab.dsi.unifi.it/PN/

  24. NETRA http://maya.ece.ucsb.edu/Netra/netra.html

  25. Representation of visual content • Representation of perceptual features of images and video is a fundamental problem in visual information retrieval. • Image analysis and pattern recognition algorithms provide the means to extract numeric descriptors. • Computer vision enables object and motion identification • Representation of perceptual features • Color • Texture • Shape • Structure • Spatial relationships • Motion • Representation of content semantics • Semantic primitives • Semiotics

  26. Representation of perceptual features Color (1)

  27. Representation of perceptual features Color (2) • Human visual system: the cones are responsible for color perception. • From a psychological point of view, the perception of color is related to several factors, e.g., • color attributes (brightness, chromaticity, saturation) • surrounding colors • color spatial organization • observer’s memory/knowledge/experience • Geometric color models (RGB, HSV, Lab, etc.) • Color histogram: to describe the low-level color properties.

  28. Image retrieval by color similarity (1) • Color spaces • Histograms; • Moments of distribution • Quantization of the color space • Similarity measures • L1 and L2 norm of the difference between the query histogram H(IQ) and the histogram of a database image H(ID)

  29. Image retrieval by color similarity (2) • histogram intersection • weighted Euclidean distance
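The measures on this and the previous slide can be sketched in a few lines. The snippet below is an illustration, not the deck's implementation: it assumes RGB images with values in [0, 1] and a uniform quantization of the color space, and all function names are our own.

```python
import numpy as np

def color_histogram(img, bins=8):
    """Quantize an RGB image (values in [0, 1]) into bins^3 colors
    and return a normalized color histogram."""
    q = np.clip((img * bins).astype(int), 0, bins - 1)
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    h = np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)
    return h / h.sum()

def l1_distance(hq, hd):
    """L1 norm of the difference between query and database histograms."""
    return float(np.abs(hq - hd).sum())

def l2_distance(hq, hd):
    """L2 norm of the histogram difference."""
    return float(np.sqrt(((hq - hd) ** 2).sum()))

def histogram_intersection(hq, hd):
    """Similarity in [0, 1] for normalized histograms: 1 means identical."""
    return float(np.minimum(hq, hd).sum())
```

The weighted Euclidean distance additionally multiplies each squared bin difference by a bin weight (e.g. reflecting perceptual color similarity between bins) before summing.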

  30. Representation of perceptual features Texture (1) • Texture: One level of abstraction above pixels. • Perceptual texture dimensions: • Uniformity • Density • Coarseness • Roughness • Regularity • Linearity • Directionality/Direction • Frequency • Phase Brodatz album

  31. Representation of perceptual features Texture (2) • Statistical methods: • Autocorrelation function (coarseness, periodicity) • Frequency content (rings, wedges): coarseness, directionality, isotropic/non-isotropic patterns • Moments • Directional histograms and related features • Run-lengths and related features • Co-occurrence matrices • Structural methods (grammars and production rules)

  32. Representation of perceptual features Shape (1) • Criteria of a good shape representation • Each shape possesses a unique representation invariant to translation, rotation, and scaling. • Similar shapes should have similar representations • Methods to extract shapes and to derive features stem from image processing • Chain codes • Polygonal approximations • Skeletons • Boundary descriptors • contour length/ diameter • shape numbers • Fourier descriptors • Moments
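As an illustration of the first two boundary descriptors, here is a minimal sketch of an 8-directional Freeman chain code. The direction numbering below is one common convention (0 = east, counter-clockwise), coordinates assume y increasing downward, and the helper names are our own:

```python
# Freeman 8-direction codes, keyed by the (dx, dy) step between
# consecutive boundary pixels (y grows downward in image coordinates).
DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
        (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(boundary):
    """Encode a closed boundary (list of (x, y) pixels) as Freeman directions."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes

def normalize(codes):
    """First-difference code: invariant to rotations by multiples of 45 degrees."""
    n = len(codes)
    return [(codes[(i + 1) % n] - codes[i]) % 8 for i in range(n)]
```

The first difference illustrates the invariance criterion stated above: rotating the shape shifts every code by a constant, which the difference cancels.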

  33. Representation of perceptual features Shape (2) Polygonal approximation Chain codes (I. Pitas)

  34. Representation of perceptual features Shape (3) Face segmentation: (a) original color image (b) skin segmentation (c) connected components (d) best-fit ellipses.

  35. Representation of perceptual features Structure/Spatial relationships • Structure • To provide a Gestalt impression of the shapes in the image. • set of edges • corners • To distinguish photographs from drawings. • To classify scenes: portrait, landscape, indoor • Spatial relationships • Spatial entities: points, lines, regions, and objects • Relationships: • Directional (include a distance/angle measure) • Topological (do not include distance but they capture set-theoretical concepts e.g. disjunction) • They are represented symbolically.

  36. Representation of perceptual features Motion • Main characterizing element in a sequence of frames • Related to a change in the relative position of spatial entities or to a camera movement. • Methods: • Detection of temporal changes of gray-level primitives (optical flow) • Extraction of a set of sparse characteristic features of the objects, such as corners or salient points, and their tracking in subsequent frames. • Crucial role in video Salient features (Kanade et al.)

  37. Representation of content semantics Semantic primitives • Identification of objects, roles, actions and events as abstractions of visual signs. • Achieved through recognition and interpretation • Recognition • To select a set of low-level local features and statistical pattern recognition for object classification • Interpretation is based on reasoning. • Domain-dependent e.g. Photobook (www-white.media.mit.edu) • Retrieval systems including interpretation: facial database systems to compare facial expressions

  38. Representation of content semantics Semiotics • Grammar of color usage to formalize effects • Association of color hue, saturation, etc. to psychological behaviors • Semiotics identifies two distinct steps for the production of meaning • Abstract level by narrative structures (e.g. camera breaks, colors, editing effects, rhythm, shot angle) • Concrete level by discourse structures: how the narrative elements create a story.

  39. Similarity models • Pre-attentive: perceived similarity between stimuli • Color/texture/shape; • Models close to human perception • Attentive: • Interpretation • Previous knowledge and a form of reasoning • Domain-specific retrieval applications (mugshots); need for models and similarity criteria definition

  40. Metric model (1) • Distances in a metric psychological space • Properties of a distance function d: • d(x,y) ≥ 0 (non-negativity) • d(x,y) = 0 if and only if x = y • d(x,y) = d(y,x) (symmetry) • d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality) • Commonly used distance functions: • Euclidean • City-block • Minkowski
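The three listed distances belong to one family; a minimal sketch (the helper name is ours) of the Minkowski distance of order p, which reduces to the city-block distance for p = 1 and the Euclidean distance for p = 2:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two feature vectors.
    p=1 gives the city-block distance, p=2 the Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))
```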

  41. Metric model (2) • Inadequacies: shape similarity • Advantages: • similarity judgment of color stimuli • consistent with pattern recognition and computer vision • suitable for creating indices • Other similarity models: • Virtual metric spaces • Tversky’s model: a function of two types of features: those that are common to the two stimuli and those that appear in only one of them. • Transformational distances: elastic graph matching • User subjectivity?

  42. FourEyes approach Self-improving database browser and annotator based on user interaction Similarity is presented with groupings The system chooses, within tree hierarchies, those nodes which most efficiently represent the positive examples. A set-covering algorithm removes all positive examples already covered. Iterations
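The set-covering step can be sketched with the classical greedy heuristic: repeatedly pick the grouping that covers the most still-uncovered positive examples. The data layout below is illustrative, not the FourEyes internals:

```python
def greedy_set_cover(universe, subsets):
    """Greedy set cover: universe is the set of positive examples,
    subsets maps each candidate node to the examples it covers.
    Returns the chosen nodes in selection order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # node covering the most still-uncovered examples
        best = max(subsets, key=lambda s: len(uncovered & subsets[s]))
        if not uncovered & subsets[best]:
            break  # remaining examples cannot be covered by any node
        chosen.append(best)
        uncovered -= subsets[best]  # remove the covered examples
    return chosen
```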

  43. Indexing methods (1) • To avoid sequential scanning • Retrieved images are ranked in order of similarity to a query • Compound measure of similarity between visual features and text attributes • Indexing of string attributes • Commonly used indexing techniques • Hash tables and signatures • Cosine similarity function

  44. Indexing methods (2) Triangle inequality (Barros et al.) The distance d(i,r) of every database item i to a reference item r is pre-computed. When the query item q is presented, d(q,r) is computed. By the triangle inequality, |d(i,r) − d(q,r)| ≤ d(q,i), a lower bound on the true distance. Maximum threshold l = d(q,r); items are examined in order of d(i,r) closest to d(q,r). If an item i with d(q,i) lower than l is found, item i is regarded as the most similar item, and l = d(q,i). The search continues until |d(i,r) − d(q,r)| > l for all remaining items, which can then be safely discarded.
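A sketch of this pruning scheme, assuming a single reference item and an abstract distance function (the function and variable names are ours, and d(i,r) is computed inline here although it would be pre-computed offline in practice):

```python
def triangle_search(query, items, reference, dist):
    """Nearest-neighbor search with triangle-inequality pruning.
    |d(i,r) - d(q,r)| <= d(q,i) lower-bounds the true distance, so items
    whose bound exceeds the best distance found so far can be skipped."""
    d_ir = [dist(i, reference) for i in items]  # pre-computed offline in practice
    d_qr = dist(query, reference)
    # examine items whose d(i, r) is closest to d(q, r) first
    order = sorted(range(len(items)), key=lambda k: abs(d_ir[k] - d_qr))
    best, best_d = None, float("inf")
    for k in order:
        if abs(d_ir[k] - d_qr) >= best_d:
            break  # lower bound already exceeds the best: prune the rest
        d_qi = dist(query, items[k])  # true distance, computed only when needed
        if d_qi < best_d:
            best, best_d = k, d_qi
    return best, best_d
```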

  45. Index structures Fixed grids: non-hierarchical index structure that organizes the space into buckets. Grid files: fixed grids with buckets of unequal size K-d trees: Binary tree; the value of one of the k features is checked at each node. R-trees: partition the feature space into multidimensional rectangles SS-trees: Weighted Euclidean distance; suitable for clustering; ellipsoidal clusters

  46. Performance evaluation

  47. Wrap-up • Visual information retrieval is a research topic at the intersection of digital image processing, pattern recognition, and computer vision (fields of our interest/expertise), but also of information retrieval and databases. • Related to the semantic web • Challenging research topic dealing with many unsolved problems: • segmentation • machine similarity vs. human perception • focused searching

  48. Comparison: Iterated Conditional Modes (ICM) Split and Merge Learning Vector Quantizer (LVQ) Ability to extract meaningful image parts based on the ground truth Evaluation of still image segmentation algorithms Still Image Segmentation: Comparison of ICM and LVQ

  49. The ICM method is based on the maximization of the probability density function of the image model given real image data. The criterion function for assigning pixel s to region i is to maximize P(xs | ys, xN8(s)) ∝ (1 / (√(2π) δi)) exp( −(ys − mi)² / (2δi²) ) exp( −ΣC VC(x) ), where xs is the region assignment and ys is the luminance value of the pixel s; mi and δi are the mean value and the standard deviation of the luminance of the region i; C is a clique containing the pixel s, VC(x) is the potential function of C, and N8(s) is the 8-neighborhood of the pixel s. Iterated Conditional Modes (ICM)

  50. Initial segmentation is obtained using the K-means clustering algorithm. Cluster center initialization is based on the image intensity histogram. At each iteration, the probability (the value of the criterion function) is calculated for each pixel. Pixels are assigned to the cluster-regions with maximum probability. Given the new segmentation, the mean intensity value and the variance of each cluster are re-estimated. The iterative process stops when no change occurs in the clusters. In the resulting segmentation, small regions are merged with their nearest neighbors. The output image contains the large regions, each assigned its mean luminance value. How ICM works
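The steps above can be sketched as follows. This is a simplified version assuming a Gaussian data term plus a Potts-style smoothness prior over the 8-neighborhood; the small-region merging step is omitted and all names and parameter values are our own, not the exact criterion of the slides.

```python
import numpy as np

def kmeans_init(img, k, iters=10):
    """K-means on gray levels; initial centers spread over the intensity range
    (a stand-in for the histogram-based initialization of the slides)."""
    means = np.linspace(img.min(), img.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(img[..., None] - means), axis=-1)
        for i in range(k):
            if np.any(labels == i):
                means[i] = img[labels == i].mean()
    return means, labels

def icm_segment(img, k=2, beta=1.0, max_iter=20):
    """ICM sketch: start from K-means labels, then iteratively relabel each
    pixel to minimize a Gaussian data term plus beta times the number of
    disagreeing neighbors (the 3x3 window, including the pixel itself)."""
    means, labels = kmeans_init(img, k)
    stds = np.array([img[labels == i].std() + 1e-6 if np.any(labels == i)
                     else 1.0 for i in range(k)])
    h, w = img.shape
    for _ in range(max_iter):
        changed = 0
        for y in range(h):
            for x in range(w):
                best, best_cost = labels[y, x], np.inf
                nbr = labels[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
                for i in range(k):
                    data = 0.5 * ((img[y, x] - means[i]) / stds[i]) ** 2 \
                           + np.log(stds[i])
                    prior = beta * np.sum(nbr != i)
                    if data + prior < best_cost:
                        best_cost, best = data + prior, i
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed += 1
        # re-estimate region mean and variance from the new segmentation
        for i in range(k):
            if np.any(labels == i):
                means[i] = img[labels == i].mean()
                stds[i] = img[labels == i].std() + 1e-6
        if changed == 0:
            break  # no change in the clusters: converged
    return labels, means
```

On a synthetic two-level image with mild noise, the loop converges in a few iterations and the two halves receive distinct labels.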
