Alexander Gelbukh

Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 6 (book chapter 12): Multimedia IR: Indexing and Searching. Alexander Gelbukh, www.Gelbukh.com

Previous Chapter: Conclusions • Basically, images are handled like the text that describes them • Namely, as feature vectors (or feature hierarchies) • Context can be used, when available, to determine features • Also, queries by example are common • From the point of view of DBMS, integration with IR and multimedia-specific techniques is needed • Object-oriented technology is adequate

Previous Chapter: Research topics • How can a similarity function be defined? • What features do images (video, sound) have? • How to better specify the importance of individual features? (Give me similar houses: similar = size? color? structure? architectural style?) • How to determine the objects in an image? • Integration with DBMSs and SQL for fast access and rich semantics • Integration with XML • Ranking: by similarity, taking into account history, profile

The problem • Data examples: • 2D/3D color/grayscale images: e.g., brain scans, scientific databases of vector fields • (2D) video • (1D) voice/music; (1D) time series: e.g., financial/marketing time series; DNA/genomic databases • Query examples: • find photographs with the same color distribution as this one • find companies whose stock prices move like this one • find brain scans with the texture of a tumor • Applications: search; data mining

Solution • Reduce the problem to a search for multi-dimensional points (feature vectors, but the vector space model is not used) • Define a distance measure • for time series: e.g., Euclidean distance between vectors • for images: e.g., color distribution (Euclidean distance); another approach: mathematical morphology • Other features as vectors • For search within a distance, the vectors are organized in R-trees • Clustering plays an important role

Types of queries • All within given distance • Find all images that are within 0.05 distance from this one • Nearest-neighbor • Find 5 stocks most similar to IBM • All pairs within given distance • Further: clustering • Whole object vs. sub-pattern match • Find parts of image that are... • E.g., in 512 × 512 brain scans, find pieces similar to the given 16 × 16 typical X-ray of a tumor • Like passage retrieval for text documents
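The three query types can be illustrated with a brute-force baseline. This is only a sketch of what each query returns: it scans all objects sequentially, which is exactly what the indexing methods in this lecture try to avoid. The function names, the dictionary layout and Euclidean distance are assumptions made for the example.

```python
import heapq
import math

def range_query(objects, query, eps):
    """All objects whose distance to the query is at most eps."""
    return [name for name, vec in objects.items()
            if math.dist(vec, query) <= eps]

def nearest_neighbors(objects, query, k):
    """The k objects closest to the query."""
    return heapq.nsmallest(k, objects,
                           key=lambda name: math.dist(objects[name], query))

def all_pairs(objects, eps):
    """All unordered pairs of objects within distance eps of each other."""
    names = sorted(objects)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if math.dist(objects[a], objects[b]) <= eps]

# Hypothetical 2-D feature vectors.
objects = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (5.0, 5.0)}
print(range_query(objects, (0.0, 0.0), 1.0))
print(nearest_neighbors(objects, (0.0, 0.9), 1))
print(all_pairs(objects, 1.0))
```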

Neighbor and pairs types of queries • The objects are organized in R-trees • For neighbor queries: branch-and-bound algorithm • For pairs: recently discovered algorithms • These types of queries are not discussed here

Desiderata for a method • Fast • No sequential search through all objects • Correct • 100% recall • Precision is less important, though false alarms should be kept few; they are easy to discard manually • Little space overhead • Dynamic • easy to insert, delete, update

Types of methods • Linear quadtrees • Complexity is proportional to the hypersurface of the query region • Grows exponentially with dimensionality • Grid files • Complexity grows exponentially with dimensionality • R-tree methods, such as R*-trees • Most used due to lower complexity

R-tree • Objects and parts of images are represented by Minimal Bounding Rectangles (MBRs) • MBRs of different objects can overlap • Larger objects contain smaller objects • MBRs are nested • MBRs are arranged into a tree • In storage, an index of disk blocks is maintained • Disk blocks are fetched at once at hardware level • For better insertion/deletion, tight MBRs are needed • Good clustering is needed

File structure of R-tree • Corresponds to disk blocks • Fanout = 3: number of parts to group

R-tree (figure)

Search in R-tree • Range queries: find objects within distance ε from the query object • = Find MBRs that intersect with the query's MBR • Determine the MBR of the query • Descend the tree • Discard all MBRs that do not intersect with the query's MBR • Many variations of the R-tree method have been proposed
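The descent can be sketched as follows. This is a minimal illustration of range search over an already built tree in 2-D, not a full R-tree: insertion, node splitting and disk-block layout are omitted, and the Node class and its field names are assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                    # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)  # child Nodes (inner node)
    obj: object = None                            # object id (leaf entry)

def intersects(a, b):
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def query_mbr(point, eps):
    """MBR of a range query with radius eps around a 2-D point."""
    x, y = point
    return (x - eps, y - eps, x + eps, y + eps)

def search(node, q):
    """Collect leaf objects whose MBR intersects q, pruning whole subtrees."""
    if not intersects(node.mbr, q):
        return []                 # nothing below this node can match
    if node.obj is not None:
        return [node.obj]
    found = []
    for child in node.children:
        found.extend(search(child, q))
    return found

# Hand-built tree with hypothetical rectangles.
a = Node((0, 0, 1, 1), obj="A")
b = Node((2, 2, 3, 3), obj="B")
c = Node((8, 8, 9, 9), obj="C")
root = Node((0, 0, 9, 9), [Node((0, 0, 3, 3), [a, b]), Node((8, 8, 9, 9), [c])])
print(search(root, query_mbr((2.5, 2.5), 1.0)))
```

The result is a candidate list: an MBR can intersect the query's MBR while the object itself lies farther away than the query radius, so the exact distance still has to be checked for each candidate.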

Indexing • Here we consider only whole-match queries • Given a collection of objects and a distance function • Find objects within given distance ε from given object Q • Problems: • Slow comparison of two objects • Huge database • GEMINI approach • GEneric Multimedia object INdexIng • Attempts to solve both problems

GEMINI indexing • Quick-and-dirty test to quickly discard bad objects • Uses clusters to avoid sequential search • Quick test • Single-valued feature, e.g., average for series. Averages differ much ⇒ objects differ much • Not vice versa. False alarms are OK • Several features, but fewer than all data. E.g., deviation for series
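A quick test with two features, the average and the deviation, might look like the sketch below. The scaling by the series length n is an assumption added here so that the quick distance never exceeds the Euclidean distance between equal-length series (this follows from splitting each series into its mean and the remainder and applying the triangle inequality); the slides do not specify it. The function names are illustrative.

```python
import math
from statistics import fmean, pstdev

def features(series):
    """Two cheap features: the mean and the population standard deviation."""
    return (fmean(series), pstdev(series))

def quick_distance(x, y):
    """Cheap lower bound on the Euclidean distance between x and y."""
    n = len(x)
    (mx, sx), (my, sy) = features(x), features(y)
    return math.sqrt(n * ((mx - my) ** 2 + (sx - sy) ** 2))

x = [1, 2, 3, 4]
y = [2, 2, 5, 7]
z = [4, 3, 2, 1]   # same mean and deviation as x, but a different series

print(quick_distance(x, y), math.dist(x, y))   # quick value is the smaller one
print(quick_distance(x, z), math.dist(x, z))   # quick test sees no difference
```

The pair (x, z) is a false alarm: the quick test cannot tell the two series apart, so the actual distance must be computed to reject it. The opposite error, discarding a series that is actually close, cannot happen as long as the quick distance is a lower bound.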

Algorithm • Map the actual objects into f-dimensional feature space • Use clusters (e.g., R-trees) to search • Retrieve objects, compute the actual distances, and discard false alarms

Feature selection • Features should reflect distances • Allow no misses (100% recall) • Features should make things look closer • Lower Bound lemma: • If distance in feature space ≤ actual distance • then 100% recall • (we speak about whole-match queries) • Holds for distance search, nearest-neighbor, pair search
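The reasoning behind the lemma for a range query can be written out in two lines. The notation is an assumption for this note: F maps an object to its feature vector, D is the actual distance, and D_feature is the distance in feature space.

```latex
% Lower-bounding condition, required for all objects O_1, O_2:
\[
D_{\mathrm{feature}}\bigl(F(O_1), F(O_2)\bigr) \le D(O_1, O_2)
\]
% Consequence: every true answer O to a range query (Q, \varepsilon)
% also qualifies in feature space, so it is never missed:
\[
D(Q, O) \le \varepsilon
\;\Longrightarrow\;
D_{\mathrm{feature}}\bigl(F(Q), F(O)\bigr) \le D(Q, O) \le \varepsilon
\]
```

The converse does not hold, which is why the feature-space search can return false alarms but no false dismissals.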

Algorithm (more detail) • Determine distance • Choose features • Prove that distance in feature space ≤ distance for actual objects • Use quick method (R-tree) to search in feature space • For found objects, compute the actual distances (this can be expensive) • Discard false alarms • objects with greater actual distances, even if in feature space the distance is OK • Example: similar averages, but different series
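The filter-and-refine steps can be sketched end to end on small time series. Several simplifications are assumptions of this example: a single feature (the average, scaled by the square root of the length so that it lower-bounds the Euclidean distance), a linear scan over the features in place of an R-tree, and features recomputed on each call rather than stored in an index.

```python
import math

def feature(series):
    """One-dimensional feature: sqrt(n) * average of the series."""
    return math.sqrt(len(series)) * sum(series) / len(series)

def gemini_range_query(database, query, eps):
    """Return (candidates, hits) for a whole-match range query."""
    fq = feature(query)
    # Filter step: cheap test in feature space (stand-in for an R-tree search).
    candidates = [name for name, s in database.items()
                  if abs(feature(s) - fq) <= eps]
    # Refine step: expensive actual distance, only for the candidates.
    hits = [name for name in candidates
            if math.dist(database[name], query) <= eps]
    return candidates, hits

# Hypothetical data: "zigzag" has the same average as "flat".
database = {
    "flat":   [5, 5, 5, 5],
    "zigzag": [0, 10, 0, 10],
    "high":   [50, 50, 50, 50],
}
candidates, hits = gemini_range_query(database, [5, 5, 5, 6], eps=2.0)
print(candidates)   # "high" is discarded without computing its actual distance
print(hits)         # "zigzag" is a false alarm removed in the refine step
```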

Discussion • The method does NOT improve quality • Provides SAME quality as sequential search, but faster • Distance definition requires domain/application expert • How much do the two images differ? • What is important/unimportant for the specific application? • Feature selection requires a good knowledge engineer • Choose the most characteristic feature: discriminative • If needed, choose the second best, etc. • Good features should be orthogonal: combination adds info

Example: Time series • In yearly stock movements, find ones similar to IBM • Distance: Euclidean (365-D vectors); others exist • Features: • First feature is the average • If needed, Discrete Fourier Transform (DFT) coefficients • Or Discrete Cosine Transform (DCT), wavelet transform (WT), etc. • Lower-bound lemma: • Parseval theorem: DFT preserves distances (DCT, WT too) • So the first several coefficients give a distance ≤ the actual distance • Transforms "concentrate energy" in the first coefficients • Thus the first coefficients give a more realistic estimate of the distance
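The lower-bounding property of DFT features can be checked numerically. The sketch below uses a plain O(n·k) DFT with the orthonormal 1/sqrt(n) scaling, under which Parseval's theorem makes the distance over all n coefficients equal to the Euclidean distance in the time domain; keeping only the first k coefficients drops non-negative terms and so can only shrink it. The function names and the sample series are assumptions for the example.

```python
import cmath
import math

def dft_features(series, k):
    """First k coefficients of the orthonormal DFT of the series."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
            for t, x in enumerate(series)) / math.sqrt(n)
        for f in range(k)
    ]

def feature_distance(fa, fb):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))

x = [1, 3, 2, 5, 4, 6, 5, 8]
y = [2, 2, 3, 4, 5, 5, 7, 7]
actual = math.dist(x, y)

for k in range(1, len(x) + 1):
    d = feature_distance(dft_features(x, k), dft_features(y, k))
    print(k, round(d, 4), d <= actual + 1e-9)
```

The coefficient at f = 0 is sqrt(n) times the average, so "first feature is the average" is the k = 1 case of the same scheme.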

Time series: Applications • Such feature selection is effective for many skewed spectrum distributions • Colored noises: the energy decreases as F^(-b) with frequency F • b = 0: white spectrum: unpredictable. Method useless. • b = 1: pink noise: works of art • b = 2: brown noise: stock movements • b > 2: black noise: river levels, rainfall patterns • The greater b, the better the first coefficients of the transform predict the actual distance • Some other n-D signals show similar properties • JPEG compression ignores higher coefficients

Time series: Performance • Fewer features ⇒ more false alarms ⇒ time lost • More features ⇒ more complex computation • Optimal number of features proves to be about 1 to 3 • for skewed enough distributions • JPEG compression shows that photographs have such distributions

Time series: Sub-pattern search • Use sliding window • Encode each window with few features
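A sliding window can be sketched in a few lines. Here each window of length w is encoded by whatever feature function is passed in; using the average as the feature is an assumption for the example, and a real system would index the resulting feature points rather than keep them in a list.

```python
def window_features(series, w, feature):
    """(offset, feature value) for every window of length w in the series."""
    return [(i, feature(series[i:i + w]))
            for i in range(len(series) - w + 1)]

def average(window):
    return sum(window) / len(window)

print(window_features([1, 2, 3, 4, 5], 3, average))
```

A series of length n yields n - w + 1 windows, so the number of feature points to index grows with the data length rather than with the number of series.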

Example: Color images • Give me images with a texture of tumor like this one • Give me images with blue at top and red at bottom • Handles color, texture, shape, position, dominant edges

Color images: Color representation • Compute color histogram • Distance: use color similarity matrix • Very expensive computationally: cross-talk between features (compare all to all features)
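The cross-talk can be made concrete with the quadratic-form histogram distance, d² = (x − y)ᵀ A (x − y), where A is the color similarity matrix. The sketch below is an illustration under assumptions: three color bins and made-up similarity values in A. Every bin is compared with every other bin, so one comparison costs O(k²) for k bins.

```python
def quadratic_distance_sq(x, y, A):
    """Squared histogram distance (x - y)^T A (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    k = len(d)
    return sum(d[i] * A[i][j] * d[j] for i in range(k) for j in range(k))

identity = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
# Hypothetical similarity matrix: bins 0 and 2 are treated as similar colors.
similar = [[1.0, 0.0, 0.5],
           [0.0, 1.0, 0.0],
           [0.5, 0.0, 1.0]]

x = [0.5, 0.5, 0.0]   # histogram concentrated in bins 0 and 1
y = [0.0, 0.5, 0.5]   # the same mass moved from bin 0 to bin 2

print(quadratic_distance_sq(x, y, identity))  # plain squared Euclidean distance
print(quadratic_distance_sq(x, y, similar))   # smaller: the two bins are similar
```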

Color images: Feature mapping • The GEMINI question again: What single feature is the most representative? • Take average R, G, B • Lower-bound? • Yes: Quadratic Distance Bounding theorem
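Computing the average-color feature is straightforward; a minimal sketch follows, assuming an image is given as a flat list of (r, g, b) tuples. The scaling constant that the Quadratic Distance Bounding theorem requires to turn the distance between average colors into a lower bound of the histogram distance is not shown here, since the slides do not state it.

```python
def average_rgb(pixels):
    """Map an image, given as a list of (r, g, b) tuples, to a 3-D feature vector."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# Hypothetical two-pixel image: one pure red pixel, one pure blue pixel.
print(average_rgb([(255, 0, 0), (0, 0, 255)]))
```

Each image is thus reduced to a point in 3-D, which is cheap to compare and to index, in contrast to the O(k²) histogram comparison.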

Automatic feature selection • Features can be selected automatically • In texts: Latent semantic indexing (LSI) • Many methods • Principal component analysis (= LSI), ... • In fact, they can reduce features, but not define them • Of colors, one can select characteristic combinations • But not classify into faces and flowers • So the description of the objects is still up to human researchers

Research topics • Object detection (pattern and image recognition) • Automatic feature selection • Spatial indexing data structures (more than 1D) • New types of data • What features to select? How to determine them? • Mixed-type data (e.g., webpages, or images with sound and description) • What clustering/IR methods are better suited for what features? (What features for what methods?) • Similar methods in data mining, ...

Conclusions • How to accelerate search? Same results as sequential search • Ideas: • Quick-and-dirty rejection of bad objects, 100% recall • Fast data structure for search (based on clustering) • Careful check of all found candidates • Solution: mapping into fewer-D feature space • Condition: lower-bounding of the distance • Assumption: skewed spectrum distribution • Few coefficients concentrate energy, the rest are less important

Thank you! Till Tuesday 11, 6 pm
