Searching in High-Dimensional Spaces

Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases • Christian Böhm, Stefan Berchtold, Daniel A. Keim • ACM Computing Surveys, 2001

Introduction • Multimedia databases have become increasingly important in many application areas • Content-based retrieval of similar objects • Similarity search • Feature transformation • Multimedia object → high dimensional points (feature vector) • Search of points in the feature space that are close to a given query point

Similarity Queries • Basic idea of feature-based similarity search • [Figure: complex data objects → feature transformation → high-dim. feature vectors → insert into high-dim. index; the index answers ε-search (range query) and NN-search (nearest-neighbor query)]
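
To make the two query types concrete, here is a minimal sketch of both as brute-force scans over feature vectors, with no index involved. The function names `range_query` and `nn_query` and the choice of Euclidean distance are assumptions made for illustration; they do not come from the slides.

```python
import math

def range_query(points, q, eps):
    """ε-search: every feature vector within distance eps of the query point q."""
    return [p for p in points if math.dist(p, q) <= eps]

def nn_query(points, q):
    """NN-search: the feature vector closest to the query point q."""
    return min(points, key=lambda p: math.dist(p, q))
```

An index structure aims to return the same answers while reading only a fraction of the points.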

Effects in High-Dimensional Space • Curse of dimensionality • Can you imagine a 5- or 10-dimensional space? • Intuition from low dimensions: “Every d-dimensional sphere touching (or intersecting) the (d-1)-dimensional boundaries of the data space contains c”, where c is presumably the center point of the data space • What happens if d = 16?
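
The d = 16 question can be checked numerically. The sketch below assumes the data space is the unit cube [0,1]^d and that c is its center (0.5, ..., 0.5). The sphere centered at (0.3, ..., 0.3) with radius 0.7 is an illustrative choice, not something stated on the slide. Exact fractions avoid floating-point rounding in the boundary comparison.

```python
from fractions import Fraction

def touches_all_boundaries(p, r):
    """True if the sphere (center p, radius r) reaches both faces x_i = 0 and x_i = 1 in every dimension."""
    return all(x <= r and 1 - x <= r for x in p)

def contains_center(p, r):
    """True if the center c = (0.5, ..., 0.5) of the unit cube lies inside the sphere."""
    half = Fraction(1, 2)
    return sum((x - half) ** 2 for x in p) <= r * r

r = Fraction(7, 10)
for d in (2, 16):
    p = [Fraction(3, 10)] * d
    print(d, touches_all_boundaries(p, r), contains_center(p, r))
```

Under these assumptions the sphere touches every boundary in both cases, but it contains c only for d = 2. For d = 16 the distance from the sphere's center to c is 0.8, which is larger than the radius 0.7.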

Effects in High-Dimensional Space • Issues • Exponential growth of volume • Space partitioning • The majority of the data pages are located at the surface of the data space rather than in the interior • Coarse partitioning • [Figure: partitioned data space, axis values 0.25, 0.5, 0.917]
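
A short calculation shows why most of the data ends up near the surface. The sketch assumes a unit-cube data space with uniformly distributed points. The 0.1 margin is an arbitrary illustrative value and `surface_fraction` is a hypothetical helper name.

```python
def surface_fraction(d, margin=0.1):
    """Fraction of the unit cube's volume lying within `margin` of its surface."""
    interior = (1.0 - 2.0 * margin) ** d  # volume of the inner cube with side 1 - 2*margin
    return 1.0 - interior

for d in (1, 2, 4, 8, 16):
    print(d, round(surface_fraction(d), 3))
```

The inner cube's volume shrinks exponentially with d, so the fraction near the surface approaches 1.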

Common Principles • Structure & Regions • Hierarchical clustering • Spatially adjacent vectors are likely to reside in the same node

Basic Algorithms • Index construction • Insert, Delete, and Update • Query processing • Exact match query • Range query • Nearest-neighbor query • Ranking query (generalized k-nearest-neighbor query) • Reverse nearest-neighbor query

Nearest-Neighbor Query • No fixed criterion, known a priori, to exclude branches of the indexing structure • The criterion is the nearest-neighbor distance • But it is not known until the algorithm has terminated • Pessimistic estimation • The closest point among all points visited (closest point candidate)

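Nearest-Neighbor Query • RKV algorithm • MINDIST: the minimum distance between the query point and the page region (a lower bound for every point in the region) • MINMAXDIST: a pessimistic estimate (upper bound) of the nearest-neighbor distance within the page region • ‘Depth-first’, ‘branch and bound’ traversal • [Figure: MINDIST and MINMAXDIST of a page region]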
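
The slide does not give formulas for the two bounds. The sketch below uses the definitions commonly given for minimum bounding rectangles with Euclidean distance, which is an assumption about the setting. A page region is represented by its lower and upper corner, `lo` and `hi`.

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to any point of the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def minmaxdist(q, lo, hi):
    """Upper bound on the distance from q to its nearest data point inside a
    minimum bounding rectangle: every face of an MBR touches at least one point."""
    # Nearer and farther rectangle edge per dimension, as seen from q.
    near = [l if x <= (l + h) / 2 else h for x, l, h in zip(q, lo, hi)]
    far = [l if x >= (l + h) / 2 else h for x, l, h in zip(q, lo, hi)]
    far_sq = [(x - f) ** 2 for x, f in zip(q, far)]
    total = sum(far_sq)
    # For each dimension k: sit on the nearer face in k, at the farthest corner elsewhere.
    best = min(total - far_sq[k] + (q[k] - near[k]) ** 2 for k in range(len(q)))
    return math.sqrt(best)
```

A branch can be pruned when its MINDIST exceeds the smallest MINMAXDIST, or the distance to the closest point candidate, seen so far.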

Nearest-Neighbor Query • HS algorithm • Access all pages of the index in the order of increasing distance to the query point • Active page list (APL)
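
The best-first idea can be sketched with a priority queue standing in for the active page list. The dictionary-based tree, the helpers `leaf` and `inner`, and the use of Euclidean MINDIST are assumptions made to keep the example self-contained. This is not the authors' implementation.

```python
import heapq
import itertools
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def leaf(points):
    """Hypothetical data page: its points plus their bounding rectangle."""
    return {"lo": [min(c) for c in zip(*points)],
            "hi": [max(c) for c in zip(*points)],
            "points": points}

def inner(children):
    """Hypothetical directory page covering the rectangles of its children."""
    dims = range(len(children[0]["lo"]))
    return {"lo": [min(c["lo"][i] for c in children) for i in dims],
            "hi": [max(c["hi"][i] for c in children) for i in dims],
            "children": children}

def hs_nearest(root, q):
    """Visit pages in order of increasing MINDIST; stop once no page can beat the candidate."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    apl = [(0.0, next(tie), root)]  # active page list
    best, best_dist = None, math.inf
    while apl and apl[0][0] < best_dist:
        _, _, node = heapq.heappop(apl)
        if "points" in node:
            for p in node["points"]:
                d = math.dist(q, p)
                if d < best_dist:
                    best, best_dist = p, d
        else:
            for child in node["children"]:
                heapq.heappush(apl, (mindist(q, child["lo"], child["hi"]), next(tie), child))
    return best, best_dist
```

Because pages are popped in MINDIST order, the loop can stop as soon as the closest remaining page is farther away than the current candidate.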

Nearest-Neighbor Query • Comparison • RKV • pr1 → pr12 → pr11 →… • HS • pr1 → pr2 → pr21

Index Structures • Minimum bounding rectangles • R-tree family • X-tree • Bounding spheres • SS-tree • TV-tree • Combined regions • SR-tree • Etc. • Space filling curves • Pyramid-tree

R, R*, R+-Tree • Overlap problem • For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point • Existence of such a point becomes less likely as the dimension of the data space increases • R+-tree • An overlap-free variant of the R-tree using a forced-split strategy • High dimensionality leads to many forced-split operations • Storage utilization < 50%
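
The overlap-free split condition can be tested with a simple sweep per dimension. The following is a sketch only. The representation of page regions as `(lo, hi)` corner pairs and the name `overlap_free_split` are assumptions, and a real split algorithm would also weigh balance and storage utilization.

```python
def overlap_free_split(rects):
    """Return (dimension, split value) such that no rectangle straddles the split,
    or None if no dimension admits such a split. rects: list of (lo, hi) corner pairs."""
    for dim in range(len(rects[0][0])):
        intervals = sorted((lo[dim], hi[dim]) for lo, hi in rects)
        reach = intervals[0][1]  # farthest upper bound among the intervals seen so far
        for lo_k, hi_k in intervals[1:]:
            if lo_k >= reach:
                # Every earlier projection ends by `reach`; every later one starts at or after it.
                return dim, reach
            reach = max(reach, hi_k)
    return None
```

The function returns None when the projections overlap everywhere in every dimension, which is the situation a forced split has to resolve.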

X-Tree • Extension of the R*-tree • Designed for the management of high-dimensional objects • Overlap-free split (split history) • Supernodes (unbalanced split tree)

kd-Tree • Advantage • Guarantee of no overlap • Disadvantages • Complete partitioning • Page regions are generally larger than necessary which yields a higher access probability • Unbalanced

kd-Tree • kd-B-tree • Balanced kd-tree • Forced split • hB-tree • Splitting a node based on multiple attributes • Forced split is avoided • LSDh-tree • Coded region description • Reduce space requirement

SS-Tree • Spheres as page regions • Split • Split axis is determined as the dimension yielding the highest variance • Not amenable to an easy overlap-free split
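
Choosing the split axis by highest variance can be sketched in a few lines. The function name `split_axis` and the use of population variance are illustrative assumptions.

```python
import statistics

def split_axis(points):
    """Index of the dimension along which the points have the highest variance."""
    dims = range(len(points[0]))
    return max(dims, key=lambda j: statistics.pvariance([p[j] for p in points]))
```

For example, `split_axis([(0, 0), (1, 10), (2, 20)])` selects dimension 1, where the coordinates are spread out the most.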

Space Filling Curves • Range and nearest-neighbor queries based on distance calculations of page regions • Example, first split: lb = 47 = 101111, ub = 60 = 111100, longest common prefix p = 1, split value s = <p100…000> = 110000 = 48 • Second split: lb = 48 = 110000, ub = 60 = 111100, longest common prefix p = 11, split value s = <p100…000> = 111000 = 56 • [Figure: query point q and intervals I, I1, I2, I21, I22]
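
The split values in the example follow one rule: keep the longest common bit prefix of lb and ub, set the next bit to 1, and clear the remaining bits. Here is a minimal sketch of that rule, where `split_value` is a hypothetical name:

```python
def split_value(lb, ub):
    """Split value <p100...0> for the interval [lb, ub] on the curve, where p is
    the longest common bit prefix of lb and ub. Requires lb < ub."""
    if not lb < ub:
        raise ValueError("need lb < ub")
    k = (lb ^ ub).bit_length() - 1  # position of the highest bit in which lb and ub differ
    # ub has a 1 at position k (since ub > lb); keep the bits from k upward, zero the rest.
    return (ub >> k) << k

print(split_value(47, 60))  # 48 = 110000
print(split_value(48, 60))  # 56 = 111000
```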

Pyramid Tree • Divide the data space such that the resulting partitions are shaped like peels of an onion • Pyramid mapping • Optimized for range queries on high-dim. data • Not affected by the curse of dimensionality
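
The slide names the pyramid mapping without giving it. One common formulation, sketched here under the assumption of a unit-cube data space split into 2d pyramids that meet at the center, maps each point to its pyramid number plus its height within that pyramid. The resulting one-dimensional value can then be stored in an ordinary B+-tree.

```python
def pyramid_value(v):
    """Map a point v in [0,1]^d to a single number: pyramid index + height."""
    d = len(v)
    # The dimension in which v is farthest from the center decides the pyramid.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie on the low side of the center, d..2d-1 on the high side.
    i = j_max if v[j_max] < 0.5 else j_max + d
    height = abs(0.5 - v[j_max])  # distance from the center along that dimension, in [0, 0.5]
    return i + height
```

A range query on the original space then becomes a set of one-dimensional interval queries, one per pyramid the query region intersects.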

Summary & Comparison

Conclusions • Effects occurring in indexing high-dim. spaces • Principal ideas of the index structures that have been proposed to overcome the problems • Research on high-dim. indexing has a major impact on many practical applications and commercial multimedia database systems • Future Research Issues • Real-world data (neither uniform nor independent) • Partitioning strategies that perform well in high-dim. spaces • Approximate processing of NN queries
