200 likes | 303 Views
Multidimensional Indexes [14]. Motivation. Examined indexes when search keys are in 1-D space Many interesting applications have search keys in multi-dimensional spaces Queries beyond exact match and range queries Partial match Nearest neighbor Where-am-I Drill-down and roll-up queries
E N D
Motivation • Examined indexes when search keys are in 1-D space • Many interesting applications have • search keys in multi-dimensional spaces • Queries beyond exact match and range queries • Partial match • Nearest neighbor • Where-am-I • Drill-down and roll-up queries • Geographic Information Systems (GIS) manage spatial information in 2-D or 3-D
Motivation • 1-D indexes are bad for range and partial match queries in 2-D • Consider • points (x,y) on the plane • B-tree index on x, or y, or <xy>, or <yx>
Outline • We will examine few additional index structures • Grid files • Partioned hashing • Multiple-key indexes • Kd-trees • Quad trees • R-trees • Bitmap indexes
Grid files • Idea is based on hashing • Consider an M-dimensional space • Partition each dimension (axis) into a number of left-closed 1-D intervals • These partitions define grid lines • Partitions on each axiscan be arbitrary and independent from each other • These intervals partition the whole space into grid cells • Associate with each grid cell a bucket of blocks to store records that hash into that grid cell • Grid file consists of the axis intervals together with an M-dimensional array of bucket pointers
Grid files - example Key 2 X1 X2 …… Xn V1 V2 Key 1 Vn To bucket records with key1=V3, key2=X2
Performance of Grid files • Number of grid cells grows exponentially with #dimensions • Lookup specific points • Extra cost of redirection and length of chain in bucket • Partial match and range queries • Proportional to the area of query over whole space • NN queries • Locate bucket containing query point, then search it together with its addjacent buckets • If aspect ratio is very high more buckets will need to be searched • Important to partition axes so that grid cells get about same #records
Partitioned hashing • Consider a hash table with 2b buckets • Allocate the b bits to the attributes in the search key • Value of attribute will determine the value of its assigned bits, using a hash function specific for that attribute • Hash of search key is concatenation of hashes of the attributes • Similar to grid files, except that the axes are not partitioned into contiguous intervals anymore
Grid files vs Partitioned hashing • Partitioned hashing is useless for NN or range queries • Grid files many need large number of buckets to achieve good distribution of records into buckets • For NN and range queries use grid files • For partial match queries use partitioned hashing • Tiny buckets • hold too few records with respect to what fits into a block • Pack tiny buckets into a few blocks • Care is needed when splitting suck buckets etc
Multiple-key Indexes • A tree-like 1-D index whose index entries are themselves multiple-key indexes on the remaining attributes of the search key • A multiple-key index for points (x,y) on the plance • A B-tree on X whose leaves point to B-trees on Y • All points with the same X coordinate are in the same Y B-tree • Performance • Partial-match queries • Good if prefix attributes are always specified • Range queries – works well • NN – guess distance to NN then do a range query
Kd-Trees • A binary tree with each node v associated with an attribute A and a value V that splits the data for the subtree rooted at v into two parts, those whose A attribute is less than V and those whose A attribute is larger than V • Attributes alternate among the nodes on a path from the root to a leaf • Records are stored at buckets associated with the leaves • Can allow multiway splits at nodes to reduce tree height • Pack internal nodes into blocks to reduce #block I/O when traversing a path from the root to a leaf
Operations on kd-trees • Lookup, inserts, and deletes are similar to the those on binary search trees • Partial-match queries • If query specifies value for the attribute of a node, traverse only one of the node’s children; else must traverse both • Range queries • If query range for the node’s attribute straddles the node’s value for the attribute, then traverse both children, else traverse the left or right child depending on whether attribute range is smaller or larger than the node’s value for the attribute
Quad trees • A quad tree for 2-D is a 4-ary tree with • interior node corresponding to squares of the 2D space • The children of a node correspond to a partition of the node’s square into quadrants • Leaf nodes having a pointer to a bucket/block that contains all the records that fall within its square • In K-dimensions, interior nodes correspond to k-D cubes
R-trees • An R-tree captures the spirit of B-trees for higher dimensional spaces • B-trees • The keys stored at interior nodes of the B-tree divide the 1D line into intervals, with each pointer having one associated interval • This is a minimum enclosing interval for the keys stored in the subtree pointed to by that pointer • Thus, can think of each node having an associated minimum enclosing interval • The intervals for all the siblings are disjoint • R-trees • Leaves store data regions/shapes • Each interior node of an R-tree is associated a minimum enclosing regions (eg rectangle) for the shapes within the subtree represented by that node • The regions for the children of a node may not be disjoint
Operations on R-trees • where-am-I queries • Traverse all children of a node whose region contains query point • Upon reaching a leaf, examine each shape there to find those that actually contain the query point • NN queries • Skip all children of a node whose regions have least distance from query point larger than the maximum distance of a child’s region from the query point • Inserting a shape • Choose a path from the root to a leaf whose regions need to be expanded the least if the given shape was to be inserted into that leaf • Path could be chosen to minimize empty space of regions, perimeter of regions, etc • If leaf is full, we split it and proceed in a similar manner as in B-trees to insert the new leaf into the parent of the old leaf, etc
GiST • Generalized Index Structures
Bitmap Indices • Bitmap indices are a special type of index designed for efficient querying on multiple keys • Records are assumed to be numbered sequentially • Given a number n it must be easy to retrieve record n • Particularly easy if records are of fixed size • Useful for attributes with small enumerated domains • E.g. gender, country, state, income ranges, etc…
Bitmap Indices • A bitmap index is simply a 2D array of bits • One dimension for the record • One dimension for all the values of attribute(s) of interest • In its simplest form a bitmap index on an attribute has • a bitmap for each value v of the attribute • Bitmap has as many bits as records • the bit for a record is 1 iff the record has the value v for the attribute • Bitmap for each value of an attribute could be compressed • We can • use B-tree to store all bitmaps for an attribute • have one bitmap for multiple values • compress bitmaps
Operations on bitmap indices • Bitmaps are useful for queries on multiple attributes • not particularly useful for single attribute queries • Queries are answered using bitmap operations • Intersection (and) • Union (or) • Complementation (not) • Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap • E.g. 100110 AND 110011 = 100010 100110 OR 110011 = 110111 NOT 100110 = 011001 • Males with income level L1: 10010 AND 10100 = 10000 • Can then retrieve required tuples. • Counting number of matching tuples is even faster
Operations on bitmap indices • Bitmaps are small compared to the data file size • Bitmaps are packed into words • 1 CPU instruction to do bitmap computing on words (32 or 64 bits) • e.g. two 1Mbit bitmaps can be anded with just 31,250 instructions • Counting number of 1s can be done fast • use map with #ones in each word to accumulate the #ones in bitmap • Deletion needs to be handled properly • Existence bitmapto note if there is a valid record at a record location • Needed for complementation • not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap • Should keep bitmaps for all values, even null value • To correctly handle SQL null semantics for NOT(A=v): • intersect above result with (NOT bitmap-A-Null)