Outline

3D Shape Histograms for Similarity Search and Classification in Spatial Databases. Mihael Ankerst, Gabi Kastenmüller, Hans-Peter Kriegel, Thomas Seidl. Univ. of Munich, Germany

Outline • Introduction • 3D Shape Similarity Model • Quadratic Form Distance Functions • Extensibility of Histogram Models • Query Processing • Experimental Results and Conclusion

Introduction • Classification: the problem of assigning an appropriate class to the query object • Applications: molecular biology, medical imaging, mechanical engineering, astronomy • Objects of the same class have some characteristic properties in common • These can be geometric or thematic properties

Classification in Molecular Databases • Classification schemata are already available • We need a fast filter classification algorithm • Dali system: a sophisticated classification algorithm for proteins • CATH: hierarchical classification of protein domain structures • Four levels: class, architecture, topology, and homologous superfamily

Nearest Neighbor Classification • In general, classification is done after training • An object is assigned to a class if it matches the description of that class • Nearest neighbor classifiers find the nearest neighbor of the query and return its class • k-nearest neighbors: the parameters are the number k of neighbors and the weights of the neighbors
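The k-nearest-neighbor idea on this slide can be sketched in a few lines. This is only an illustration under assumptions: the function name `knn_classify`, the inverse-distance weighting, and the toy training data are hypothetical and not taken from the presentation, and plain Euclidean distance stands in for whatever distance the similarity model provides.

```python
import math
from collections import defaultdict

def knn_classify(query, training, k=3, weighted=False):
    """Return the class voted for by the k nearest training objects.

    training: list of (feature_vector, class_label) pairs.
    weighted: if True, each neighbor votes with weight 1/distance
              (an assumed weighting scheme, chosen for illustration).
    """
    # Sort all training objects by distance to the query and keep the k closest.
    neighbors = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = defaultdict(float)
    for vector, label in neighbors:
        d = math.dist(query, vector)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Toy data: two objects of class "A" near the origin, three of class "B" far away.
training = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
```

With `k=1` the query `(0.2, 0.1)` simply inherits the class of its nearest neighbor. With `k=5` the unweighted vote is dominated by the three distant "B" objects, while the weighted vote still favors "A", which is why both k and the neighbor weights matter.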

Geometry-Based Similarity Search • Spatial objects are transformed into a high-dimensional vector space • In 2D, shapes can be represented as ordered sets of surface points, approximate rectangular coverings, etc. • Section coding technique: each polygon's circumcircle is decomposed into a number of sectors, and each of these sectors is normalized • Similarity is defined in terms of the Euclidean distance between the resulting feature vectors

Invariance Properties • Similarity models need to incorporate invariance against translation, rotation, scaling, etc. • Most methods include a preprocessing step, such as rotating objects to a normalized orientation or translating the center of mass to the origin • Robustness against errors is not considered in most of these models

3D Shape Similarity Model • We extend the concept of section coding technique to 3D. • Shape Histograms – feature vectors • Quadratic Distance Function

Shape Histograms • A feature transform maps a complex object onto a feature vector in a multidimensional space • 3D shape histograms are also feature vectors • They are based on partitioning the space into complete and disjoint cells, which are the bins of the histogram • Any space can be used (geometric, thematic, etc.)

Shell Model • 3D space is decomposed into concentric shells around the center point • Independent of rotation around the center • Radii of the shells are determined from the extension of the objects • Shells of uniform thickness
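A minimal sketch of the shell model, assuming the object is given as an array of 3D surface points and that the outermost radius `max_radius` is supplied by the caller. The function name and the normalization to relative frequencies are assumptions made for illustration.

```python
import numpy as np

def shell_histogram(points, num_shells, max_radius):
    """Histogram of point counts over concentric shells of uniform thickness."""
    points = np.asarray(points, dtype=float)
    # Move the center of mass to the origin so the shells are centered on it.
    centered = points - points.mean(axis=0)
    radii = np.linalg.norm(centered, axis=1)
    # Uniform shell thickness: bin index grows linearly with the radius.
    # Points at or beyond max_radius fall into the outermost shell.
    idx = np.minimum((radii / max_radius * num_shells).astype(int), num_shells - 1)
    hist = np.bincount(idx, minlength=num_shells).astype(float)
    return hist / hist.sum()

def rotate_z(points, angle):
    """Rotate points around the z axis (used to illustrate rotation invariance)."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return np.asarray(points, dtype=float) @ rotation.T

points = [(1.5, 0, 0), (-1.5, 0, 0), (0, 3.5, 0), (0, -3.5, 0)]
hist = shell_histogram(points, num_shells=4, max_radius=4.0)
hist_rotated = shell_histogram(rotate_z(points, 0.7), num_shells=4, max_radius=4.0)
```

Because only the distance from the center enters the bin index, `hist` and `hist_rotated` agree, which is the rotation independence stated on the slide.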

Sector Model • 3D space is decomposed into sectors that emerge from the center point of the model • Points are distributed uniformly on the surface of the sphere • The Voronoi diagram of these points gives an appropriate decomposition of the space
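The sector model can be sketched as follows. The six vertices of an octahedron are used here as uniformly distributed sector centers; this choice, the function name, and the normalization are assumptions for illustration. Assigning a point to the sector center with the largest dot product is the same as locating its Voronoi cell on the sphere.

```python
import numpy as np

# Assumed sector centers: the six vertices of a regular octahedron.
OCTAHEDRON_CENTERS = np.array(
    [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]],
    dtype=float,
)

def sector_histogram(points, centers=OCTAHEDRON_CENTERS):
    """Histogram of point counts over sectors emerging from the center point."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1)
    keep = norms > 0  # a point exactly at the center has no direction
    directions = centered[keep] / norms[keep, None]
    # Nearest center on the unit sphere = largest dot product = Voronoi cell.
    idx = np.argmax(directions @ centers.T, axis=1)
    hist = np.bincount(idx, minlength=len(centers)).astype(float)
    return hist / hist.sum()

points = [(2, 0.1, 0), (-2, -0.1, 0), (0, 0, 3), (0, 0, -3)]
hist = sector_histogram(points)
```

Unlike the shell model, this histogram changes when the object is rotated, so it relies on the objects having been brought into a normalized orientation first.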

Combined Model • Combination of the shell and sector models • Results in a higher dimensionality • We can choose different combinations of shells and sectors for the same dimensionality
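One way to picture the combined model is to give every (shell, sector) pair its own bin, so the dimensionality is the product of the two counts. The sketch below assumes octahedron vertices as sector centers and a caller-supplied `max_radius`; the function name and the bin numbering are hypothetical.

```python
import numpy as np

def combined_histogram(points, centers, num_shells, max_radius):
    """Histogram over all (shell, sector) cells; length = num_shells * len(centers)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    centered = points - points.mean(axis=0)
    radii = np.linalg.norm(centered, axis=1)
    keep = radii > 0  # skip points exactly at the center (no direction)
    centered, radii = centered[keep], radii[keep]
    shell = np.minimum((radii / max_radius * num_shells).astype(int), num_shells - 1)
    sector = np.argmax((centered / radii[:, None]) @ centers.T, axis=1)
    # Assumed numbering: all sectors of shell 0 first, then shell 1, and so on.
    bins = shell * len(centers) + sector
    hist = np.bincount(bins, minlength=num_shells * len(centers)).astype(float)
    return hist / hist.sum()

centers = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
points = [(1, 0, 0), (-1, 0, 0), (0, 0, 3), (0, 0, -3)]
hist = combined_histogram(points, centers, num_shells=2, max_radius=4.0)
```

With 2 shells and 6 sectors the histogram has 12 bins; the same dimensionality could also be reached with, for example, 3 shells and 4 sectors, given a suitable set of 4 sector centers.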

Euclidean Distance • The Euclidean distance between two N-dimensional vectors p and q is given by $d(p, q) = \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$ • The individual components of the feature vectors are assumed to be independent • Relationships between components, such as substitutability and compensability, cannot be taken into account

Euclidean Distance • Consider three objects a, b, and c • We can clearly see that a and b are more closely related than a and c or b and c • However, due to rotation, the peaks of a and b are mapped into different bins, so the Euclidean distance does not reflect the similarity in this case

Quadratic Form Distance Function • The quadratic form distance function is defined in terms of a similarity matrix A: $d_A(p, q) = \sqrt{(p - q) \cdot A \cdot (p - q)^T}$ • The components $a_{ij}$ of A represent the similarity of components i and j in the underlying space • The Euclidean distance is the special case of the quadratic form distance where A = I, the identity matrix
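A small sketch of the quadratic form distance, assuming NumPy and a hand-picked similarity matrix in which neighboring bins are similar (entries 0.5 raised to the index difference). The matrix and the three toy histograms are assumptions chosen only to illustrate the effect of cross-bin similarity.

```python
import numpy as np

def quadratic_form_distance(p, q, A):
    """sqrt((p - q) A (p - q)^T) for a symmetric positive semidefinite matrix A."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(diff @ np.asarray(A, dtype=float) @ diff))

# Toy histograms: the peak of b sits next to the peak of a, the peak of c is far away.
a = [1, 0, 0, 0]
b = [0, 1, 0, 0]
c = [0, 0, 0, 1]

identity = np.eye(4)
# Assumed similarity matrix: similarity decays with the index distance of the bins.
index = np.arange(4)
neighbor_similarity = 0.5 ** np.abs(index[:, None] - index[None, :])

euclid_ab = quadratic_form_distance(a, b, identity)
euclid_ac = quadratic_form_distance(a, c, identity)
quad_ab = quadratic_form_distance(a, b, neighbor_similarity)
quad_ac = quadratic_form_distance(a, c, neighbor_similarity)
```

With the identity matrix both pairs are equally far apart, as with the plain Euclidean distance. With the neighbor-aware matrix, a and b come out closer than a and c, because a peak shifted into an adjacent bin is partly compensated.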

Quadratic Form Distance Functions • The Euclidean distance of two vectors is totally determined • The weighted Euclidean distance is a little more flexible, since it controls the effect of each individual vector component on the overall distance • On top of this, the general quadratic form distance function also specifies cross-dependencies between the dimensions

Quadratic Form Distance Functions • The neighborhood of the bins can be represented as the similarity weights • Let d(i,j) represent the distance of the cells that correspond to bin i and j • For shells the bin distance is the difference in the corresponding radii • For sectors the bin distance is the difference in the angles of sector centers

Quadratic Form Distance Functions • When provided with an appropriate distance function, the similarity matrix can be computed as $a_{ij} = e^{-\sigma \cdot d(i,j)}$, where the parameter σ controls the global shape of the similarity matrix
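For the shell model, where the bin distance is the difference of the corresponding radii, the similarity matrix can be sketched as below. The function name and the example radii are assumptions; the formula is the exponential one from the slide.

```python
import numpy as np

def shell_similarity_matrix(radii, sigma):
    """Similarity weights a_ij = exp(-sigma * d(i, j)) with d(i, j) = |r_i - r_j|."""
    radii = np.asarray(radii, dtype=float)
    bin_distance = np.abs(radii[:, None] - radii[None, :])
    return np.exp(-sigma * bin_distance)

radii = [0.5, 1.5, 2.5]  # assumed representative radii of three shells
A_soft = shell_similarity_matrix(radii, sigma=1.0)
A_sharp = shell_similarity_matrix(radii, sigma=50.0)
```

A small σ makes distant bins count as similar, while a large σ pushes the matrix toward the identity, so the quadratic form distance approaches the Euclidean distance.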

Invariance Properties • During normalization, we perform a translation and a rotation of all objects • The translation maps the center of mass onto the origin • A principal axes transform is then applied • This generally leads to a unique orientation of the object

Principal Axes Transform • Compute the Covariance matrix for a given 3D set of points (x,y,z)

Principal Axes Transform • The eigenvectors of this matrix represent the principal axes of the original 3D point set • The eigenvalues indicate the variance of the points in the respective direction • As a result of the principal axes transform, all covariances of the transformed points vanish
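The principal axes transform described on these two slides can be sketched with NumPy. The function name, the use of the population covariance (division by the number of points), and the random test data are assumptions for illustration.

```python
import numpy as np

def principal_axes_transform(points):
    """Translate the center of mass to the origin and rotate onto the principal axes."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # 3x3 covariance matrix of the x, y, z coordinates.
    covariance = centered.T @ centered / len(centered)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    axes = eigenvectors[:, order]
    return centered @ axes, eigenvalues[order]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 3)) * [5.0, 2.0, 1.0] + [10.0, -3.0, 7.0]
transformed, variances = principal_axes_transform(cloud)
transformed_cov = transformed.T @ transformed / len(transformed)
```

After the transform, `transformed_cov` is diagonal and its diagonal entries equal `variances`. Note that the sign of each eigenvector is not fixed by `eigh`, so this sketch leaves reflections along the axes unresolved.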

Extensibility of Histogram Models • Along with spatial properties, we can also consider thematic properties • A general approach to managing both thematic and spatial properties is to use combined histograms • A combined histogram is the Cartesian product of the individual histograms

Query Processing • For quadratic form distance functions, the evaluation time for a single database object increases quadratically with the dimension

Optimal Multistep k-Nearest Neighbor Search • To achieve good performance, the paradigm of multistep query processing is used • An index-based filter step produces a set of candidates • A refinement step performs the expensive exact evaluation of the candidates • The filter is responsible for completeness and the refinement for correctness

Optimal Multistep k-Nearest Neighbor Search • Based on a multidimensional index structure, the filter step performs an incremental ranking • Objects are reported in order of increasing filter distance to the query • To guarantee that the filter step causes no false dismissals, the filter distance must lower-bound the object distance: $d_f(p, q) \le d_o(p, q)$, where $d_f$ is the filter distance and $d_o$ is the object distance
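The filter and refinement steps can be sketched as below. This is an illustration under assumptions: a sorted list stands in for the incremental ranking of a multidimensional index, and the filter distance is a Euclidean distance on the first two coordinates, which lower-bounds the full Euclidean distance used as the exact distance. All function names are hypothetical.

```python
import heapq
import math

def prefix_distance(q, o):
    """Assumed filter distance: Euclidean distance on the first two coordinates."""
    return math.dist(q[:2], o[:2])

def full_distance(q, o):
    """Assumed exact object distance: Euclidean distance on all coordinates."""
    return math.dist(q, o)

def multistep_knn(query, objects, k, filter_dist, exact_dist):
    """Return the k nearest (distance, index) pairs and the number of refinements."""
    # Stand-in for the index-based incremental ranking by filter distance.
    ranking = sorted((filter_dist(query, o), i) for i, o in enumerate(objects))
    best = []  # max-heap of size <= k, stored as (-exact_distance, index)
    refinements = 0
    for d_f, i in ranking:
        # Stop once the filter distance exceeds the current k-th exact distance:
        # since d_f <= d_o, no remaining object can enter the result.
        if len(best) == k and d_f > -best[0][0]:
            break
        d_o = exact_dist(query, objects[i])  # expensive refinement step
        refinements += 1
        if len(best) < k:
            heapq.heappush(best, (-d_o, i))
        elif d_o < -best[0][0]:
            heapq.heapreplace(best, (-d_o, i))
    return sorted((-neg, i) for neg, i in best), refinements

objects = [(i, (i * 7) % 5, (i * 3) % 11) for i in range(50)]
query = (10, 2, 5)
result, refinements = multistep_knn(query, objects, 3, prefix_distance, full_distance)
brute_force = sorted((full_distance(query, o), i) for i, o in enumerate(objects))[:3]
```

The result matches a brute-force scan, but only a fraction of the objects reach the exact distance evaluation, which is where the quadratic cost of the quadratic form distance would be paid.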

Reduction in Dimensionality of Quadratic Forms • Objects in high dimensional spaces are managed by reducing their dimensionality • Typically this is done by Principal Component Analysis, Discrete Fourier transform, Similarity Matrix decomposition, Feature Subselection etc. • These approaches can also be used in case of Quadratic Form Distance

Reduction in Dimensionality of Quadratic Forms • An algorithm to reduce the similarity matrix from a high-dim. space down to a low-dim. space was developed in the context of multimedia databases. • The method guarantees three things • the reduced distance function is a lower bound of the given high-dimensional distance function. • the reduced distance function again is a quadratic form • the reduced distance function is the greatest of all lower-bounding distance functions in the reduced space.

Experimental Evaluation • Data is taken from the Brookhaven Protein Data Bank • Molecules are represented as surface points for the computation of shape histograms • Reduced feature vectors for the filter step are managed by an X-tree of dimension 10

Experimental Evaluation • Similarity matrices are computed by an adapted formula in which the similarity weights $a_{ij}$ of bins i and j are defined as $a_{ij} = e^{-\sigma \cdot d(i,j)}$ • σ = 10

Basic Similarity Search

Classification by Shape Similarity • Every class has at least two molecules • From preprocessing, 3422 proteins have been classified into 281 classes • Three models have been considered: the pure shell model, the pure sector model, and the combined model • The accuracy of the combined model is the best

Classification by Shape Similarity
