Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. Christian Böhm, Stefan Berchtold, Daniel A. Keim. ACM Computing Surveys, 2001.
Introduction • Multimedia databases have become increasingly important in many application areas • Content-based retrieval of similar objects • Similarity search • Feature transformation • Multimedia object → high-dimensional point (feature vector) • Search for points in the feature space that are close to a given query point
Similarity Queries • Basic idea of feature-based similarity search • Complex data objects → (feature transformation) → high-dim. feature vectors → (insert) → high-dim. index → ε-search or NN-search • Range query • Nearest-neighbor query
Effects in High-Dimensional Space • Curse of dimensionality • Can you imagine a 5- or 10-dimensional space? • “Every d-dimensional sphere touching (or intersecting) the (d-1)-dimensional boundaries of the data space contains c” (c: the center point of the data space) • What happens if d = 16?
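The question about d = 16 can be checked numerically. The sketch below assumes the data space is the unit cube [0,1]^d with center c = (0.5, ..., 0.5), and uses a sphere around p = (0.3, ..., 0.3) with radius 0.7; these concrete values and the helper names are illustrative assumptions, not taken from the slide.

```python
import math

def touches_all_boundaries(center, radius):
    """True if the sphere reaches both faces (x_i = 0 and x_i = 1) in every dimension."""
    eps = 1e-12  # tolerance for floating-point rounding
    return all(x <= radius + eps and 1.0 - x <= radius + eps for x in center)

def contains(center, radius, point):
    """True if `point` lies inside or on the sphere."""
    return math.dist(center, point) <= radius

d = 16
p = (0.3,) * d          # assumed sphere center
c = (0.5,) * d          # center point of the unit cube
dist_pc = math.dist(p, c)  # sqrt(16 * 0.2**2), about 0.8

print(touches_all_boundaries(p, 0.7))  # the sphere reaches every face
print(contains(p, 0.7, c))             # yet the center point is outside
print(contains((0.3, 0.3), 0.7, (0.5, 0.5)))  # in 2 dimensions the statement holds
```

In two dimensions the same sphere easily covers the center, which is why the statement feels obviously true; in 16 dimensions the distance from p to c has grown to about 0.8, larger than the radius.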
Effects in High-Dimensional Space • Issues • Exponential growth of volume • Space partitioning • The majority of the data pages are located at the surface of the data space rather than in the interior • Coarse partitioning (figure labels: 0.25, 0.5, 0.917)
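A few lines of arithmetic make these issues concrete. The sketch assumes a unit-cube data space and uniformly distributed data; the function names and the 0.05 margin are illustrative choices, not part of the slides.

```python
def surface_fraction(d, margin=0.05):
    """Fraction of the unit cube's volume lying within `margin` of its surface."""
    return 1.0 - (1.0 - 2.0 * margin) ** d

def edge_length(volume, d):
    """Edge length of a d-dimensional cube with the given volume."""
    return volume ** (1.0 / d)

def split_dimensions(num_pages):
    """Dimensions that can be split once if each split doubles the page count."""
    return num_pages.bit_length() - 1  # floor(log2(num_pages)) for num_pages >= 1

for d in (2, 8, 16):
    print(d, round(surface_fraction(d), 3), round(edge_length(0.5, d), 3))

# Splitting every one of 16 dimensions once would need 2**16 pages;
# with 1024 pages only 10 dimensions can be split at all.
print(2 ** 16, split_dimensions(1024))
```

With 16 dimensions more than 80% of the volume lies within 0.05 of the surface, and a region covering half the volume already has an edge length above 0.95, so page regions necessarily span most of the range in every dimension.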
Common Principles • Structure & Regions • Hierarchical clustering • Spatially adjacent vectors are likely to reside in the same node
Basic Algorithms • Index construction • Insert, Delete, and Update • Query processing • Exact match query • Range query • Nearest-neighbor query • Ranking query (generalized k-nearest-neighbor query) • Reverse nearest-neighbor query
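The query types can be stated precisely as linear scans, which is also the baseline an index has to beat. This is a minimal sketch assuming Euclidean distance and points stored as tuples; the function names are hypothetical.

```python
import math

def exact_match_query(points, q):
    """Is q itself stored in the database?"""
    return q in points

def range_query(points, q, eps):
    """All points within distance eps of q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k=1):
    """The k points closest to q, in order of increasing distance (a ranking)."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

def reverse_nn_query(points, q):
    """Points that have q as their nearest neighbor among all other points."""
    result = []
    for p in points:
        others = [o for o in points if o is not p]
        if all(math.dist(p, q) <= math.dist(p, o) for o in others):
            result.append(p)
    return result

points = [(0, 0), (1, 0), (5, 5), (6, 5)]
q = (0.4, 0)
print(range_query(points, q, 1.0))
print(knn_query(points, q, k=1))
print(reverse_nn_query(points, q))
```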
Nearest-Neighbor Query • No fixed criterion, known a priori, to exclude branches of the indexing structure • The criterion is the nearest-neighbor distance • But it is not known until the algorithm has terminated • Pessimistic estimation • The closest point among all points visited (closest point candidate)
Nearest-Neighbor Query • RKV algorithm • MINDIST: the actual (minimum) distance between the query point and a page region • MINMAXDIST: a pessimistic estimate (upper bound) of the nearest-neighbor distance within a page region • Depth-first, branch-and-bound traversal (figure: MINDIST and MINMAXDIST of a page region)
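Both distance bounds can be computed directly from a rectangular page region. The sketch below assumes a minimum bounding rectangle given by its lower corner `lo` and upper corner `hi`, Euclidean distance, and the usual definition of MINMAXDIST from the R-tree nearest-neighbor literature; the function names are illustrative.

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from q to any point of the rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def minmaxdist(q, lo, hi):
    """Upper bound on the distance from q to the nearest data point in a
    minimum bounding rectangle (every face of an MBR touches a data point)."""
    mid = [(l + h) / 2 for l, h in zip(lo, hi)]
    # Per dimension: the nearer and the farther boundary coordinate.
    near = [l if x <= m else h for x, l, h, m in zip(q, lo, hi, mid)]
    far = [l if x >= m else h for x, l, h, m in zip(q, lo, hi, mid)]
    far_sq = [(x - f) ** 2 for x, f in zip(q, far)]
    total_far = sum(far_sq)
    # Take the nearer face in dimension k and the farthest corner on that face.
    best = min(total_far - far_sq[k] + (q[k] - near[k]) ** 2
               for k in range(len(q)))
    return math.sqrt(best)

q, lo, hi = (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)
print(mindist(q, lo, hi), minmaxdist(q, lo, hi))
```

A branch-and-bound traversal can discard any page whose MINDIST exceeds the smallest MINMAXDIST (or the distance of the closest point candidate) seen so far.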
Nearest-Neighbor Query • HS algorithm • Access the pages of the index in order of increasing distance to the query point • Active page list (APL)
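The traversal order can be sketched with a priority queue playing the role of the active page list. This is a simplified illustration, not the published algorithm verbatim: the `Page` class, the in-memory tree, and the use of MINDIST as the ordering key are assumptions made for the example.

```python
import heapq
import math
from itertools import count

def mindist(q, lo, hi):
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

class Page:
    """Hypothetical index page: a leaf holds points, a directory page holds children."""
    def __init__(self, points=(), children=()):
        self.points, self.children = list(points), list(children)
        corners = self.points or [c for ch in self.children for c in (ch.lo, ch.hi)]
        dim = len(corners[0])
        self.lo = tuple(min(c[i] for c in corners) for i in range(dim))
        self.hi = tuple(max(c[i] for c in corners) for i in range(dim))

def hs_nearest(root, q):
    tie = count()  # tie-breaker so the heap never compares Page objects
    apl = [(mindist(q, root.lo, root.hi), next(tie), root)]
    best, best_dist, pages_read = None, math.inf, 0
    # Stop once the closest unread page is farther than the current candidate.
    while apl and apl[0][0] < best_dist:
        _, _, page = heapq.heappop(apl)
        pages_read += 1
        for p in page.points:
            d = math.dist(q, p)
            if d < best_dist:
                best, best_dist = p, d
        for child in page.children:
            heapq.heappush(apl, (mindist(q, child.lo, child.hi), next(tie), child))
    return best, best_dist, pages_read

leaves = [Page(points=[(0, 0), (1, 1)]),
          Page(points=[(5, 5), (6, 6)]),
          Page(points=[(9, 9), (10, 10)])]
root = Page(children=leaves)
print(hs_nearest(root, (5.2, 5.1)))
```

In this example only the root and one leaf are read; the other two leaves stay in the list because their MINDIST exceeds the distance of the candidate already found.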
Nearest-Neighbor Query • Comparison of page access order • RKV: pr1 → pr12 → pr11 →… • HS: pr1 → pr2 → pr21
Index Structures • Minimum bounding rectangles • R-tree family • X-tree • Bounding spheres • SS-tree • TV-tree • Combined regions • SR-tree • Etc. • Space filling curves • Pyramid-tree
R-, R*-, and R+-Tree • Overlap problem • For an overlap-free split, a dimension is needed in which the projections of the page regions have no overlap at some point • Existence of such a point becomes less likely as the dimension of the data space increases • R+-tree • An overlap-free variant of the R-tree using a forced-split strategy • High dimensionality leads to many forced-split operations • Storage utilization < 50%
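The condition for an overlap-free split can be tested by projecting the page regions onto each dimension and looking for a position that no projection crosses. A minimal sketch, assuming each region is a pair of corner tuples `(lo, hi)`; the function name is hypothetical.

```python
def overlap_free_split(rects):
    """Return (dimension, position) of an overlap-free split, or None.

    A split position is valid if every projected interval lies entirely
    on one side of it and both sides are non-empty.
    """
    dim = len(rects[0][0])
    for d in range(dim):
        intervals = sorted((lo[d], hi[d]) for lo, hi in rects)
        reach = intervals[0][1]  # largest upper bound seen so far
        for lo_d, hi_d in intervals[1:]:
            if lo_d >= reach:
                return d, reach  # all earlier intervals end here, all later start after
            reach = max(reach, hi_d)
    return None

rects = [((0, 0), (2, 5)), ((3, 1), (5, 4)), ((1, 6), (4, 8))]
print(overlap_free_split(rects))
print(overlap_free_split([((0, 0), (2, 2)), ((1, 1), (3, 3))]))
```

In the first example the projections onto dimension 0 overlap pairwise, but dimension 1 admits a split at position 5. When no dimension qualifies, an overlap-free structure has to cut regions (the forced split).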
X-Tree • Extension of the R*-tree • Designed for the management of high-dimensional objects • Overlap-free split (split history) • Supernodes (unbalanced split tree)
kd-Tree • Advantage • Guarantee of no overlap • Disadvantages • Complete partitioning • Page regions are generally larger than necessary, which yields a higher access probability • Unbalanced
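The balance problem is easy to reproduce with a plain insertion-based kd-tree. The sketch below is an in-memory toy (one point per node, split axis cycling with depth), assumed here only to illustrate how insertion order determines the tree shape.

```python
class KdNode:
    def __init__(self, point, axis):
        self.point, self.axis = point, axis
        self.left = self.right = None

def insert(node, point, depth=0):
    """Insert a point; the split axis cycles through the dimensions by depth."""
    if node is None:
        return KdNode(point, depth % len(point))
    if point[node.axis] < node.point[node.axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def build(points):
    root = None
    for p in points:
        root = insert(root, p)
    return root

sorted_order = [(i, i) for i in range(1, 8)]
mixed_order = [(4, 4), (2, 2), (6, 6), (1, 1), (3, 3), (5, 5), (7, 7)]
print(height(build(sorted_order)), height(build(mixed_order)))
```

The same seven points produce a degenerate chain when inserted in sorted order and a compact tree otherwise. Every split plane partitions the whole subspace, so regions never overlap, but they also cover empty space.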
kd-Tree • kd-B-tree • Balanced kd-tree • Forced split • hB-tree • Splitting a node based on multiple attributes • Forced split is avoided • LSDh-tree • Coded region description • Reduce space requirement
SS-Tree • Spheres as page regions • Split • Split axis is determined as the dimension yielding the highest variance • Not amenable to an easy overlap-free split
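Choosing the split axis by variance takes only a few lines. This sketch covers just the axis selection, with a hypothetical function name; how the entries are then distributed along that axis is not shown.

```python
from statistics import pvariance

def split_axis(points):
    """Index of the dimension in which the coordinates have the highest variance."""
    dim = len(points[0])
    return max(range(dim), key=lambda d: pvariance([p[d] for p in points]))

centers = [(0.0, 0.0), (1.0, 10.0), (2.0, 20.0)]
print(split_axis(centers))
```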
Space Filling Curves • Range and nearest-neighbor queries based on distance calculations of page regions • Interval [47, 60]: lb = 47 = 101111, ub = 60 = 111100; longest common prefix p = 1; s = <p100…000> = 110000 = 48 • Interval [48, 60]: lb = 48 = 110000, ub = 60 = 111100; longest common prefix p = 11; s = <p100…000> = 111000 = 56 • (figure: query point q and intervals I, I1, I2, I21, I22)
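The split position s in the two worked examples follows from the bit patterns of lb and ub: keep their longest common prefix, append a 1, and pad with zeros. A minimal sketch with integer bit operations, assuming lb < ub; the function name is hypothetical.

```python
def split_point(lb, ub):
    """Smallest value in (lb, ub] of the form <common prefix> 1 0...0."""
    if not lb < ub:
        raise ValueError("expected lb < ub")
    top = (lb ^ ub).bit_length() - 1  # highest bit in which lb and ub differ
    # At that bit ub has a 1 (since ub > lb); clear everything below it.
    return (ub >> top) << top

for lb, ub in [(47, 60), (48, 60)]:
    s = split_point(lb, ub)
    print(f"lb={lb:06b} ub={ub:06b} s={s:06b} = {s}")
```

Splitting [47, 60] at 48 yields [47, 47] and [48, 60]; splitting [48, 60] again at 56 yields [48, 55] and [56, 60], each of which corresponds to a contiguous run of curve values.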
Pyramid Tree • Divide the data space such that the resulting partitions are shaped like peels of an onion • Pyramid mapping • Optimized for range queries on high-dim. data • Reported not to be affected by the curse of dimensionality for such range queries
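The pyramid mapping turns a d-dimensional point into a single number that can be stored in a one-dimensional index such as a B+-tree. The sketch below follows the commonly described form of the mapping, assuming data normalized to the unit cube [0,1]^d: the space is cut into 2d pyramids meeting at the center, and a point's value is its pyramid number plus its height within that pyramid. The function name is an illustrative choice.

```python
def pyramid_value(v):
    """Map a point of the unit cube to a one-dimensional pyramid value."""
    d = len(v)
    # The dimension with the largest deviation from the center selects the pyramid pair.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    height = abs(0.5 - v[j_max])                   # distance from the center along j_max
    pyramid = j_max if v[j_max] < 0.5 else j_max + d  # lower or upper pyramid
    return pyramid + height

print(pyramid_value((0.1, 0.6)))  # pyramid 0, height 0.4
print(pyramid_value((0.5, 0.9)))  # pyramid 3, height 0.4
```

Points at the same height in the same pyramid share a value, which gives the onion-peel partitions; a range query is translated into at most 2d one-dimensional interval queries on these values.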
Conclusions • Effects occurring in indexing high-dim. spaces • Principal ideas of the index structures that have been proposed to overcome the problems • Research on high-dim. indexing has a major impact on many practical applications and commercial multimedia database systems • Future Research Issues • Real data (neither uniform nor independent) • Partitioning strategies that perform well in high-dim. spaces • Approximate processing of NN queries