Introduction to The NSP-Tree: A Space-Partitioning Based Indexing Method Gang Qian University of Central Oklahoma November 2006
Summary • Overview • Motivation and Existing Work • NSP-Tree Structure, Algorithms and Performance • Conclusion and Future Work
Overview • The NSP-tree is a disk-based index structure • Similar to B-tree/B+-tree • It is designed to index a large number of vectors with non-ordered discrete components • Domains with discrete values that are not naturally ordered are very common • E.g., gender, profession, genome bases, etc. • It is used to speed up similarity queries over the indexed data • Unlike an exact query, a similarity query searches for data items that are similar to the given query data item
Motivation • Traditional database technology is mature • Data model: Relational Data Model • Design: ER/EER Diagrams • Query: SQL • Data integrity: Transaction Processing • Index: B-tree/B+-tree • Some hard unsolved issues still exist • E.g., Multidimensional Query Optimization
New problems occur with the increasing demand for the management of non-traditional data types • Multimedia data • Scientific data • Spatial data • Temporal data • Biological data, etc. • With the new data types, exact queries are no longer useful • Similarity queries become more and more important
Vector Model • The Vector Model is a very useful tool for supporting these new data types • Many non-traditional data types are vectors or can be easily converted into vectors • E.g., feature vectors for images • Vectors can be regarded as points in a high-dimensional data space • Therefore, the distance between a pair of vectors is a natural quantitative measure of the (dis)similarity between the two data objects that the vectors represent • E.g., Euclidean distance
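The idea of distance as a (dis)similarity measure can be sketched in a few lines of Python. This is only an illustration: the names `euclidean`, `range_query`, and the sample `features` are assumptions made for the example, and the query is a brute-force linear scan rather than an indexed search.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def range_query(data, query, radius):
    """Brute-force similarity query: every vector within `radius` of `query`."""
    return [v for v in data if euclidean(v, query) <= radius]

# Hypothetical 2-D feature vectors (illustrative values only).
features = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
near = range_query(features, (0.0, 0.0), 2.0)
print(near)
```

An index structure aims to return the same answer as `range_query` while reading far fewer vectors from disk.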
The problem of managing non-traditional databases becomes the problem of managing vector databases • Designing index structures to support efficient similarity queries on vectors is an open research area of vector databases • For example, the NSP-tree is designed to index vectors with discrete and non-ordered components • E.g., genome sequence data
Existing Work • A number of index structures are proposed for vectors with continuous numerical components • E.g., R-tree and its variants: • SS-tree • SR-tree • X-tree • Hybrid tree, etc. • Due to the volume of the data, almost all proposed index structures are disk-based
The basic structure of these indices is very similar to that of the B+-tree • Hierarchical tree structure • Each tree node occupies one and only one disk block and has a minimum utilization requirement • Vectors are stored in leaf nodes • Non-leaf nodes contain routing information that is used for tree construction and searching • Routing information is usually represented by a certain type of minimum bounding shape • Minimum Bounding Rectangle (MBR), Minimum Bounding Sphere (MBS), etc.
Example: R-Tree Structure • Figure adapted from “The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries” (SIGMOD 1997).
Such an index tree grows in a bottom-up fashion • Vectors are incrementally inserted into the tree • When a leaf node is full, it is split into two leaves • The split of a child in the tree may cause the split of its parent • A node split may propagate all the way up to the root, in which case the root itself is split and a new root is created • Search works top-down from the root • Search performance is usually measured in terms of the total number of disk blocks/nodes accessed • Search efficiency is derived from pruning branches that are not within the search range • Unlike in a brute-force linear search, vectors in irrelevant branches are not visited
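The top-down search with pruning can be sketched as follows for an R-tree-like index over continuous vectors. This is a minimal in-memory sketch under stated assumptions: the dictionary-based node layout, the function names, and the sample tree are all hypothetical, and each node visit stands in for one disk-block access.

```python
import math

def min_dist_to_mbr(query, mbr):
    """Smallest Euclidean distance from `query` to any point of the rectangle.
    `mbr` is a list of (low, high) ranges, one per dimension."""
    total = 0.0
    for q, (low, high) in zip(query, mbr):
        if q < low:
            total += (low - q) ** 2
        elif q > high:
            total += (q - high) ** 2
    return math.sqrt(total)

def range_search(node, query, radius, stats):
    """Top-down range query; stats["nodes"] counts the nodes accessed."""
    stats["nodes"] += 1
    if node["leaf"]:
        return [v for v in node["vectors"] if math.dist(v, query) <= radius]
    hits = []
    for mbr, child in node["children"]:
        # Prune: no vector inside this MBR can be within the search range.
        if min_dist_to_mbr(query, mbr) <= radius:
            hits.extend(range_search(child, query, radius, stats))
    return hits

# Hypothetical two-leaf tree (illustrative values only).
leaf_a = {"leaf": True, "vectors": [(0.1, 0.1), (0.2, 0.3)]}
leaf_b = {"leaf": True, "vectors": [(0.8, 0.9), (0.9, 0.7)]}
root = {"leaf": False, "children": [
    ([(0.1, 0.2), (0.1, 0.3)], leaf_a),
    ([(0.8, 0.9), (0.7, 0.9)], leaf_b),
]}

stats = {"nodes": 0}
hits = range_search(root, (0.15, 0.2), 0.2, stats)
print(hits, stats)
```

In this example the branch holding `leaf_b` is pruned, so only the root and `leaf_a` are accessed.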
Unfortunately, the index trees mentioned in the previous slides cannot be directly used for vectors with non-ordered discrete components • The ND-tree was proposed to index such vectors • See “The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces” (VLDB 2003)
Discrete Space Concepts • The structure of the ND-tree is very similar to those of the R-tree variants • However, all the underlying geometrical concepts are redefined to accommodate discrete vectors
Example: Discrete Rectangles • Introduced to bound vectors with non-ordered discrete components • A normal rectangle can be regarded as the Cartesian product of ranges, one for every dimension in the data space • E.g., [0.1, 0.2] × [0.7, 0.8] is a two-dimensional rectangle • A discrete rectangle is defined as the Cartesian product of sets of discrete values, one from every dimension • E.g., {a, g} × {t, c, g} is a two-dimensional discrete rectangle that covers vectors such as <a, c>, <g, t> and <g, g> • Discrete Minimum Bounding Rectangles (DMBRs) store the routing information for the ND-tree
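The definition above maps directly onto Python sets. The following sketch represents a discrete rectangle as a list of sets, one per dimension; the function names `dmbr` and `contains` are assumptions chosen for the example, not names from the ND-tree paper.

```python
def dmbr(vectors):
    """Discrete minimum bounding rectangle of a non-empty list of vectors:
    for every dimension, the set of values that actually occur."""
    dims = len(vectors[0])
    return [{v[i] for v in vectors} for i in range(dims)]

def contains(rect, vector):
    """True if `vector` lies inside the discrete rectangle `rect`."""
    return all(value in component for value, component in zip(vector, rect))

# The two-dimensional discrete rectangle {a, g} x {t, c, g} from the slide.
rect = [{"a", "g"}, {"t", "c", "g"}]
print(contains(rect, ("a", "c")), contains(rect, ("t", "a")))

# The DMBR of the three example vectors is the same rectangle.
print(dmbr([("a", "c"), ("g", "t"), ("g", "g")]) == rect)
```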
Problem of The ND-tree • Overlap in an index tree may dramatically affect its search performance • The construction of the ND-tree cannot totally avoid the overlap among DMBRs in the tree • The ND-tree works well when the data is randomly distributed • However, for certain data sets, overlap cannot be avoided • For example, the skewed data set based on the Zipf distribution • To guarantee the minimum disk utilization, the split algorithm may NOT be able to find an overlap-free split for an overflow node
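Overlap between two discrete rectangles follows from the Cartesian-product definition: they share at least one vector exactly when their value sets intersect on every dimension. The sketch below assumes rectangles are stored as lists of sets; the function name `overlap` and the sample rectangles are illustrative.

```python
def overlap(r1, r2):
    """Return the overlapping discrete rectangle of r1 and r2,
    or None if the two rectangles share no vector."""
    inter = [a & b for a, b in zip(r1, r2)]
    # A single empty intersection on any dimension means no shared vector.
    return inter if all(inter) else None

r1 = [{"a", "g"}, {"t", "c"}]
r2 = [{"g", "t"}, {"c", "g"}]
r3 = [{"c", "t"}, {"t", "c"}]

print(overlap(r1, r2))  # the rectangles share the vector <g, c>
print(overlap(r1, r3))  # disjoint on the first dimension
```

A query whose range touches the overlapping region has to follow both children, which is why overlap hurts search performance.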
Basic Idea of The NSP-Tree • There are three factors that affect search performance • Disk utilization • Overlap • Fan-out • Maximum number of children of a tree node • Since overlap cannot be totally avoided when there is a minimum disk utilization requirement, the design of the NSP-tree drops that requirement so that an overlap-free structure can be guaranteed
Space-Partitioning Indexing Methods • The idea of an overlap-free index structure is not new • What makes the NSP-tree new is that it can handle non-ordered discrete data based on an overlap-free structure • There is a category of index trees that have such a feature • KDB-tree • hB-tree • LSD-tree, etc. • They are called space-partitioning indexing methods • R-tree variants are called data-partitioning indexing methods • All previous space-partitioning indices support only vectors with continuous numeric components
[Figure: Partitioned Data Space (unit square, with axis ticks at 0, 0.2, 0.4, 0.6, 0.75, 1 and 0, 0.2, 0.3, 0.6, 1) and the corresponding Space-partitioning Information, a tree of split nodes with branches labeled <= and > • d: Split dimension • v: Split point on the split dimension]
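How (d, v) split information partitions a continuous space without overlap can be sketched as a small binary tree walk. The tree below is hypothetical: its shape and split values are chosen for illustration and are not reconstructed from the figure. Dimensions are numbered from 1, as in the d labels.

```python
def locate(node, point):
    """Follow (d, v) splits from the root down to the region holding `point`.
    Inner nodes are dicts; anything else is treated as a region label."""
    while isinstance(node, dict):
        d, v = node["d"], node["v"]
        # "<=" goes to the left branch, ">" to the right branch.
        node = node["le"] if point[d - 1] <= v else node["gt"]
    return node

# Hypothetical split tree over the unit square (illustrative values only).
tree = {
    "d": 1, "v": 0.6,
    "le": {"d": 2, "v": 0.3, "le": "region A", "gt": "region B"},
    "gt": "region C",
}

for p in [(0.2, 0.1), (0.2, 0.9), (0.9, 0.5)]:
    print(p, "->", locate(tree, p))
```

Because every comparison sends a point down exactly one branch, each point belongs to exactly one region, which is the overlap-free property of space-partitioning methods.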
NSP-Tree Structure • Similar to those of the B+-tree and the R-tree, but with no minimum disk utilization requirement • Each node occupies one disk block • Vectors are stored in leaf nodes • Space-partitioning information is stored in non-leaf nodes • The space concept in the NSP-tree is discrete • A discrete data space is defined as the Cartesian product of the sets of all possible values on every dimension • Due to the non-ordered nature of the values, a split point on a split dimension is no longer enough to describe a split • The split must explicitly record how the values on a dimension are separated into two groups
Conceptually, each node corresponds to a subspace of the discrete data space • A subspace is defined as the Cartesian product of the subsets of values on every dimension • There is no overlap among the subspaces of the children on the same level • The subspace of a parent node contains the subspaces of all its children
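A discrete split can be sketched by recording the group of values that goes to one side; the remaining values of that dimension go to the other side. This is a minimal sketch, assuming subspaces are stored as lists of sets; the name `split_subspace` and the genome-style alphabet are illustrative choices.

```python
def split_subspace(subspace, dim, left_values):
    """Split `subspace` on dimension `dim` (0-based): values in `left_values`
    form the left subspace, all remaining values form the right one."""
    left_values = set(left_values)
    if not left_values or not left_values < subspace[dim]:
        raise ValueError("left group must be a non-empty proper subset")
    left, right = list(subspace), list(subspace)
    left[dim] = left_values
    right[dim] = subspace[dim] - left_values
    return left, right

# Hypothetical 2-D discrete data space over the genome bases.
space = [{"a", "c", "g", "t"}, {"a", "c", "g", "t"}]
left, right = split_subspace(space, 0, {"a", "g"})
print(left)
print(right)
```

The two groups are disjoint and together cover the parent's values, so the child subspaces never overlap and are both contained in the parent subspace.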
Eliminating Dead Space • One disadvantage of a pure space-partitioning approach is that the subspaces do not necessarily minimally bound the vectors in the space • See next slide • To further improve the pruning power, DMBRs are used as additional routing information in the tree • However, the use of DMBRs reduces the fan-out of the tree • More space in a node is needed to store the DMBRs • We found that the benefit of using DMBRs usually outweighs the disadvantage of the reduced fan-out
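[Figure: the partitioned data space from the earlier slide, showing a subspace that is not minimum bounding, the actual minimum bounding rectangle of its vectors, the dead space between the two, and a query labeled Q with r]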
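The effect of dead space on pruning can be sketched for the discrete case with the Hamming distance. The smallest Hamming distance from a query to any vector in a discrete rectangle is the number of dimensions on which the query's value is missing from the rectangle's value set. The subspace, vectors, and query below are hypothetical values chosen for illustration.

```python
def min_hamming_to_rect(query, rect):
    """Lower bound on the Hamming distance from `query` to any vector
    inside the discrete rectangle `rect` (a list of sets)."""
    return sum(1 for q, values in zip(query, rect) if q not in values)

# Hypothetical node: its subspace is much larger than what it actually holds.
subspace = [{"a", "g"}, {"a", "c", "g", "t"}, {"a", "c", "g", "t"}]
vectors = [("a", "c", "c"), ("g", "c", "t")]
node_dmbr = [{v[i] for v in vectors} for i in range(3)]

query, radius = ("a", "g", "a"), 1

# The subspace alone cannot rule the node out ...
print(min_hamming_to_rect(query, subspace) <= radius)
# ... but the tighter DMBR shows that no stored vector can qualify.
print(min_hamming_to_rect(query, node_dmbr) <= radius)
```

Here the query falls into the dead space of the node: it lies inside the subspace but at least two mismatches away from every vector the DMBR can cover, so the node can be skipped.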
Tree Construction Algorithms • An NSP-tree grows incrementally • Vectors are inserted one by one • Insertion starts from the root and goes down the tree until a suitable leaf node is found for the new vector • The tree grows in a bottom-up fashion • There are two important algorithms used in the insertion procedure • ChooseSubtree • SplitNode
ChooseSubtree • Starting from the root, it is invoked on non-leaf nodes • Given the vector to insert, the algorithm decides which child node to follow based on whether the child’s subspace contains the new vector • Due to the overlap-free property, there exists at most one child that can contain the new vector • SplitNode • Splits an overflow node into two nodes • The split is guaranteed to be overlap-free • It also tries to maximize disk utilization by choosing the most balanced split
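The two algorithms can be sketched as follows. This is not the authors' implementation: `choose_subtree` simply tests subspace containment, and `split_node` uses a simple greedy heuristic (an assumption made for this example) to find a balanced grouping of the values on one dimension. Values of the node's subspace that do not occur in any stored vector would also have to be assigned to one of the two groups; that step is omitted here.

```python
from collections import Counter

def choose_subtree(children, vector):
    """Return the child whose subspace contains `vector`, or None.
    `children` is a list of (subspace, child) pairs with disjoint subspaces."""
    for subspace, child in children:
        if all(x in values for x, values in zip(vector, subspace)):
            return child
    return None

def split_node(vectors):
    """Greedy sketch: return (dim, left_values, left_vectors, right_vectors)
    for the most balanced overlap-free split found, or None if none exists."""
    best = None
    for dim in range(len(vectors[0])):
        counts = Counter(v[dim] for v in vectors)
        if len(counts) < 2:
            continue  # a single value on this dimension cannot be separated
        left, left_size = set(), 0
        # Add values, most frequent first, while the left side stays <= half.
        for value, n in counts.most_common():
            if left_size + n <= len(vectors) / 2:
                left.add(value)
                left_size += n
        imbalance = len(vectors) - 2 * left_size
        if best is None or imbalance < best[0]:
            best = (imbalance, dim, left)
    if best is None:
        return None
    _, dim, left = best
    left_vecs = [v for v in vectors if v[dim] in left]
    right_vecs = [v for v in vectors if v[dim] not in left]
    return dim, left, left_vecs, right_vecs

# Hypothetical overflowing leaf (illustrative values only).
overflow = [("a", "c"), ("a", "g"), ("a", "t"), ("g", "c")]
result = split_node(overflow)
print(result)

children = [([{"a", "g"}, {"a", "c", "g", "t"}], "left child"),
            ([{"c", "t"}, {"a", "c", "g", "t"}], "right child")]
print(choose_subtree(children, ("t", "a")))
```

Splitting on the first dimension would give a 3-to-1 split, so the sketch picks the second dimension, where grouping {c} against {g, t} yields two vectors on each side.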
There are other algorithms for the NSP-tree • Generating and maintaining DMBRs • Query • Deletion, etc.
Summary • The NSP-tree is the first indexing method that uses the space-partitioning approach to index vectors with non-ordered discrete components • The benefit of using an overlap-free tree structure is most evident when the data distribution is skewed • With proper heuristics, the disadvantage of removing the minimum disk utilization requirement can be minimized • In general, the benefit of using DMBRs to eliminate dead space (and hence increase the pruning power) outweighs the disadvantage of the reduced fan-out
Future Work • Bulk loading the NSP-tree and the ND-tree • Insert more than one vector at a time • Support approximate similarity queries • Beat the Curse of High Dimensionality • Support queries based on the edit distance • Besides the Hamming distance, the edit distance is another widely used distance measure for discrete vectors • Aggregate all the technology into a viable bioinformatics search engine
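The difference between the two distance measures can be sketched in Python. The Hamming distance counts position-by-position mismatches between equal-length vectors, while the edit (Levenshtein) distance also allows insertions and deletions. The function names are illustrative, and `edit_distance` is the textbook dynamic-programming formulation, not anything specific to the NSP-tree.

```python
def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(a != b for a, b in zip(u, v))

def edit_distance(s, t):
    """Levenshtein distance: fewest insertions, deletions, and substitutions
    that turn `s` into `t` (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]

# A one-position shift mismatches everywhere under the Hamming distance,
# but needs only one deletion and one insertion under the edit distance.
print(hamming("acgt", "cgta"), edit_distance("acgt", "cgta"))
```

This is why shifted genome sequences can look maximally different under the Hamming distance while being close under the edit distance.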