Indexing Multidimensional Data
Rui Zhang
http://www.csse.unimelb.edu.au/~rui
The University of Melbourne
Aug 2006
Outline
• Background
  • Multidimensional data and queries
• Approaches
  • Mapping based indexing
    • Z-curve
    • iDistance
  • Hierarchical-tree based indexing
    • R-tree
    • k-d-tree
    • Quad-tree
  • Compression based indexing
    • VA-file
Multidimensional Data
• Spatial data (low-dimensionality)
  • Geographic information: Melbourne (37, 145)
    • Which city is at (30, 140)?
  • Computer-aided design: width and height (40, 50)
    • Is there any part with a width of 40 and a height of 50?
• Records with multiple attributes (medium-dimensionality)
  • Employee (ID, age, score, salary, …)
    • Is there any employee whose age is under 25, whose performance score is greater than 80, and whose salary is between 3000 and 5000?
• Multimedia data (high-dimensionality)
  • Color histograms of images
    • Give me the most similar image to a given query image.
  • Multimedia features: color, shape, texture
Multidimensional Queries
• Point query
  • Return the objects located at Q(x1, x2, …, xd).
  • E.g. Q = (3.4, 6.6)
• Window query
  • Return all the objects enclosed or intersected by the hyper-rectangle W{[L1, U1], [L2, U2], …, [Ld, Ud]}.
  • E.g. W = {[0,4], [2,5]}
• K-Nearest Neighbor query (KNN query)
  • Return k objects whose distances to Q are no larger than any other object's distance to Q.
  • E.g. 3NN of Q = (4, 1)
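The three query types can be pinned down with a brute-force baseline that scans every object. This is only a minimal sketch: the data set and the function names are made up for illustration, objects are treated as points, and Euclidean distance is assumed for the KNN query.

```python
import math

def point_query(points, q):
    """Return the objects located exactly at q."""
    return [p for p in points if p == q]

def window_query(points, window):
    """Return the points inside the hyper-rectangle [(L1, U1), ..., (Ld, Ud)]."""
    return [p for p in points
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, window))]

def knn_query(points, q, k):
    """Return the k points closest to q (Euclidean distance assumed)."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

# Hypothetical data set, chosen only for illustration.
points = [(1.0, 3.0), (3.4, 6.6), (2.0, 4.5), (5.0, 1.0), (4.5, 2.0), (7.0, 7.0)]

print(point_query(points, (3.4, 6.6)))         # the slide's example point Q
print(window_query(points, [(0, 4), (2, 5)]))  # the slide's example window W
print(knn_query(points, (4, 1), 3))            # the slide's 3NN example
```

Every index structure in the rest of the deck aims to return the same answers as these scans while touching far fewer objects.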
Mapping Based Multidimensional Indexing
• Story
  • The CBD: {[0,4], [2,5]}
  • Blocks in the CBD are: [8,15], [32,33] and [36,37]
• General strategy: three steps
  • Data mapping and indexing
  • Query mapping and data retrieval
  • Filtering out false positives
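The three-step strategy can be sketched with any mapping from cells to one-dimensional keys. The example below assumes a simple column-wise scan as a stand-in mapping and a sorted Python list as a stand-in for a B+-tree; the grid size, cell data and function names are hypothetical.

```python
import bisect

SIDE = 8  # assumed grid size: cell coordinates are integers in 0..7

def map_point(x, y):
    """Mapping used in step 1: column-wise scan, a stand-in for any 1-D mapping."""
    return x * SIDE + y

def build_index(cells):
    """Step 1: map every cell to a key and keep the pairs sorted by key."""
    return sorted((map_point(x, y), (x, y)) for x, y in cells)

def window_query(index, x_lo, x_hi, y_lo, y_hi):
    """Answer a window query over inclusive cell ranges."""
    # Step 2: map the window to one key interval and retrieve the candidates.
    key_lo, key_hi = map_point(x_lo, y_lo), map_point(x_hi, y_hi)
    keys = [k for k, _ in index]
    start = bisect.bisect_left(keys, key_lo)
    stop = bisect.bisect_right(keys, key_hi)
    candidates = [cell for _, cell in index[start:stop]]
    # Step 3: filter out false positives by checking the real coordinates.
    answers = [(x, y) for x, y in candidates
               if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
    return answers, candidates

# Hypothetical occupied cells.
index = build_index([(0, 2), (1, 6), (2, 4), (3, 3), (3, 7), (5, 1), (1, 0)])
answers, candidates = window_query(index, 0, 3, 2, 4)
print(candidates)  # includes cells outside the window (false positives)
print(answers)     # only the cells that really fall in the window
```

A single covering interval keeps step 2 simple but retrieves more false positives; mapping the window to several tighter intervals, as in the story's three blocks, trades more index probes for fewer discarded candidates.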
The Z-curve and Other Space-Filling Curves
• The Z-curve
  • Z-value calculation: bit-interleaving
  • Supports efficient window queries
  • Disadvantage: jumps
• Other space-filling curves
  • Hilbert curve
  • Gray-code curve
  • Column-wise scan
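Bit-interleaving can be sketched as follows. Two assumptions are made here: the bit of y is placed above the bit of x in each pair, and the window {[0,4], [2,5]} is read as the unit cells with x in 0..3 and y in 2..4. Under those assumptions the sketch reproduces the blocks [8,15], [32,33] and [36,37] from the story. Enumerating every cell in the window is a brute-force shortcut for illustration, not an efficient decomposition.

```python
def z_value(x, y, bits=3):
    """Interleave the bits of x and y; y takes the higher bit of each pair (assumed)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def window_to_intervals(x_lo, x_hi, y_lo, y_hi, bits=3):
    """Map an inclusive cell window to maximal runs of consecutive Z-values."""
    zs = sorted(z_value(x, y, bits)
                for x in range(x_lo, x_hi + 1)
                for y in range(y_lo, y_hi + 1))
    intervals = []
    for z in zs:
        if intervals and z == intervals[-1][1] + 1:
            intervals[-1][1] = z        # extend the current run
        else:
            intervals.append([z, z])    # start a new run
    return [tuple(iv) for iv in intervals]

print(window_to_intervals(0, 3, 2, 4))  # [(8, 15), (32, 33), (36, 37)]
```

The gap between 15 and 32 is an instance of the "jumps" disadvantage: cells that are adjacent in space can be far apart along the curve, so one window becomes several key intervals.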
Mapping for KNN Queries
• Story continued
  • New factory at Q(4, 1)
  • Find the 3 nearest buildings to Q
• Termination condition
  • K candidates
  • All in the current search circle
[Figure: search circles around Q with radius R = 0.35, 0.70, 1.05, 1.40, 1.75, 2.10; distances ||AQ|| = 3.31, ||BQ|| = 1.81, ||CQ|| = 1.84, ||DQ|| = 2.05, ||EQ|| = 3.00, ||FQ|| = 3.62]
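The termination condition can be sketched as a search circle that grows in fixed steps until it holds K candidates. The building coordinates below are hypothetical: they are placed so that their distances to Q match the values in the figure. The linear scan stands in for retrieving objects through the mapped index.

```python
import math

def knn_expanding(points, q, k, step=0.35):
    """Grow the search circle by `step` until the k nearest candidates lie inside it."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    radius = 0.0
    while True:
        radius += step
        # A real index would fetch only the objects mapped into the circle;
        # a scan over all points stands in for that retrieval here.
        inside = sorted((math.dist(p, q), p) for p in points
                        if math.dist(p, q) <= radius)
        if len(inside) >= k:
            # Any object outside the circle is farther than all k candidates.
            return [p for _, p in inside[:k]], radius

Q = (4.0, 1.0)
# Hypothetical positions reproducing the distances shown in the figure.
buildings = {"A": (4.0, 4.31), "B": (5.81, 1.0), "C": (2.16, 1.0),
             "D": (4.0, 3.05), "E": (7.0, 1.0), "F": (0.38, 1.0)}

nearest, radius = knn_expanding(list(buildings.values()), Q, 3)
print(nearest, round(radius, 2))
```

With these distances no building falls inside the circle until R reaches 2.10, at which point B, C and D are all inside and the search stops.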
The iDistance
• Data partitioned into a number of clusters
  • Streets are concentric circles
• Data mapping
  • Objects mapped to street numbers
• Query mapping
  • Search circle mapped to the streets it intersects
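A minimal sketch of the iDistance mapping follows, using the key i*c + dist(p, O_i), where O_i is the reference point of partition i and c is a constant larger than any partition radius. The reference points, data, and the sorted list standing in for a B+-tree are assumptions for illustration; choosing reference points by clustering is not shown.

```python
import bisect
import math

def build_idistance(points, refs, c):
    """Data mapping: key = i*c + dist(p, O_i) for the nearest reference point O_i."""
    keys, radii = [], [0.0] * len(refs)
    for p in points:
        i = min(range(len(refs)), key=lambda j: math.dist(p, refs[j]))
        d = math.dist(p, refs[i])
        radii[i] = max(radii[i], d)   # radius of partition i
        keys.append((i * c + d, p))
    keys.sort()
    return keys, radii

def range_search(index, refs, c, q, r):
    """Query mapping: turn the search circle (q, r) into one key interval per partition."""
    keys, radii = index
    only_keys = [k for k, _ in keys]
    result = []
    for i, ref in enumerate(refs):
        d = math.dist(q, ref)
        if d - r > radii[i]:
            continue                  # the circle does not reach this partition
        lo = i * c + max(0.0, d - r)
        hi = i * c + min(radii[i], d + r)
        start = bisect.bisect_left(only_keys, lo)
        stop = bisect.bisect_right(only_keys, hi)
        # Filter false positives with the real distance.
        result.extend(p for _, p in keys[start:stop] if math.dist(p, q) <= r)
    return result

# Hypothetical data and reference points.
points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (6.0, 6.0),
          (7.0, 5.5), (6.5, 7.5), (4.0, 1.0)]
refs = [(1.5, 2.0), (6.5, 6.0)]
index = build_idistance(points, refs, c=100.0)
print(range_search(index, refs, 100.0, (4.0, 1.0), 2.5))
```

In the story's terms, each key interval is the set of "streets" (concentric rings around a reference point) that the search circle intersects; a KNN query repeats this range search with a growing radius.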
Hierarchical Tree Structures
• K-d-tree
  • Divides the space recursively
  • Complete and disjoint partitioning
  • In-memory; unbalanced
    • There are algorithms to page and balance the tree, but with more complex manipulations
• R-tree
  • Minimum bounding rectangle (MBR)
  • Incomplete and overlapping partitioning
  • Disk-based; balanced
[Figures: example k-d-tree and R-tree partitionings over points A to G with nodes N1 to N5, annotated "Problem: Overlap" and "Problem: Empty space"]
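Recursive space division in a k-d-tree can be sketched as below. The sketch assumes the tree is bulk-built from a static point set by splitting at the median, which happens to give a balanced tree; incremental insertion, which can leave the tree unbalanced, is not shown. The names and the data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    point: Point                  # splitting point stored at this node
    axis: int                     # dimension used to split at this level
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points, depth=0):
    """Split on one dimension per level, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))

def window_query(node, window, out=None):
    """Collect points inside [(L1, U1), ..., (Ld, Ud)], pruning subtrees the window misses."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(lo <= x <= hi for x, (lo, hi) in zip(node.point, window)):
        out.append(node.point)
    lo, hi = window[node.axis]
    split = node.point[node.axis]
    if lo <= split:               # the window reaches the lower half-space
        window_query(node.left, window, out)
    if split <= hi:               # the window reaches the upper half-space
        window_query(node.right, window, out)
    return out

# Hypothetical points.
tree = build([(1, 3), (3, 6), (2, 4), (5, 1), (4, 2), (7, 7), (0, 5)])
print(sorted(window_query(tree, [(0, 4), (2, 5)])))
```

Because the two half-spaces at each node are disjoint and together cover the whole space, a window query follows only the branches the window touches; an R-tree query must instead descend into every child whose MBR overlaps the window.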
Hierarchical Tree Structures (continued)
• Quad-tree
  • Space divided into 4 rectangles recursively
  • Complete and disjoint partitioning
  • In-memory; unbalanced
    • There are algorithms to page and balance the tree, but with more complex manipulations
• The point quad-tree
[Figure: a point quad-tree over points A to G, each node having NW, NE, SW and SE children]
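A point quad-tree can be sketched in two dimensions as follows: each inserted point becomes a node and splits its region into NW, NE, SW and SE quadrants. The tie-breaking rule (points on a dividing line go north or east) and all names are assumptions made for this illustration.

```python
class QuadNode:
    def __init__(self, point):
        self.point = point
        self.children = {}  # quadrant name ("NW", "NE", "SW", "SE") -> QuadNode

def quadrant(center, p):
    """Name the quadrant of p relative to center; ties go north/east (assumed)."""
    ns = "N" if p[1] >= center[1] else "S"
    ew = "E" if p[0] >= center[0] else "W"
    return ns + ew

def insert(root, p):
    """Insert p and return the (possibly new) root."""
    if root is None:
        return QuadNode(p)
    node = root
    while True:
        q = quadrant(node.point, p)
        if q not in node.children:
            node.children[q] = QuadNode(p)
            return root
        node = node.children[q]

def contains(root, p):
    """Point query: follow one quadrant per level."""
    node = root
    while node is not None:
        if node.point == p:
            return True
        node = node.children.get(quadrant(node.point, p))
    return False

def depth(node):
    """Number of levels below and including this node."""
    if node is None:
        return 0
    return 1 + max((depth(c) for c in node.children.values()), default=0)

# Inserting points in diagonal order makes every point the NE child of the last.
root = None
for p in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    root = insert(root, p)
print(contains(root, (3, 3)), contains(root, (3, 4)), depth(root))
```

The diagonal insertion order shows why the structure is described as unbalanced: the shape of the tree depends on the order in which points arrive.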
Compression Based Indexing
• The dimensionality curse
• The Vector Approximation File (VA-File)
[Figure labels: "VA File", "Skewed data"]
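The idea behind the VA-File can be sketched as a filter-and-refine scan: every vector is replaced by a few bits per dimension, the small approximations are scanned to compute distance lower bounds, and only promising vectors are fetched in full. The sketch below is loosely modelled on that idea rather than on the published algorithm; it assumes coordinates in [0, 1), a uniform grid, and Euclidean distance, and all names are hypothetical.

```python
import math
import random

def approximate(p, bits):
    """Quantize each coordinate in [0, 1) to a `bits`-bit slice number."""
    cells = 1 << bits
    return tuple(min(int(x * cells), cells - 1) for x in p)

def lower_bound(approx, q, bits):
    """Smallest possible distance from q to any point inside the approximated cell."""
    width = 1.0 / (1 << bits)
    total = 0.0
    for a, x in zip(approx, q):
        lo, hi = a * width, (a + 1) * width
        gap = lo - x if x < lo else x - hi if x > hi else 0.0
        total += gap * gap
    return math.sqrt(total)

def va_knn(points, q, k, bits=2):
    """Return the k nearest points and how many full vectors were visited."""
    va = [approximate(p, bits) for p in points]   # the compact approximation file
    bounds = [lower_bound(a, q, bits) for a in va]
    order = sorted(range(len(points)), key=lambda i: bounds[i])
    best, visited = [], 0                         # best holds (distance, point)
    for i in order:
        if len(best) == k and bounds[i] > best[-1][0]:
            break                                 # no remaining cell can hold a closer point
        visited += 1                              # refine: read the full vector
        best.append((math.dist(points[i], q), points[i]))
        best.sort()
        best = best[:k]
    return [p for _, p in best], visited

# Hypothetical 8-dimensional data.
rng = random.Random(0)
points = [tuple(rng.random() for _ in range(8)) for _ in range(200)]
query = tuple(rng.random() for _ in range(8))
neighbors, visited = va_knn(points, query, k=3)
print(visited, "of", len(points), "full vectors visited")
```

A uniform grid wastes bits when the data is skewed, since most points then share a few cells and the lower bounds filter poorly.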
Index Implementations in Major DBMSs
• SQL Server
  • B+-tree data structure
  • Clustered indexes are sparse
  • Indexes maintained as updates/insertions/deletes are performed
• Oracle
  • B+-tree, hash, bitmap, spatial extender for R-tree
  • Clustered index
    • Index organized table (unique/clustered)
    • Clusters used when creating tables
• DB2
  • B+-tree data structure, spatial extender for R-tree
  • Clustered indexes are dense
  • Explicit command for index reorganization
Recommended Readings and References
• Survey on multidimensional indexing techniques
  • Christian Böhm, Stefan Berchtold, Daniel A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 2001.
  • Volker Gaede, Oliver Günther. Multidimensional Access Methods. ACM Computing Surveys, 1998.
• Mapping based indexing
  • Rui Zhang, Panos Kalnis, Beng Chin Ooi, Kian-Lee Tan. Generalized Multi-dimensional Data Mapping and Query Processing. ACM Transactions on Database Systems (TODS), 30(3), 2005.
• Space-filling curves
  • H. V. Jagadish. Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conference (SIGMOD), 1990.
• iDistance
  • H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang. iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search. ACM Transactions on Database Systems (TODS), 30(2), 2005.
• R-tree
  • Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference (SIGMOD), 1984.
• Quad-tree
  • Hanan Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 1984.
• VA-File
  • Roger Weber, Hans-Jörg Schek, Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. International Conference on Very Large Data Bases (VLDB), 1998.