490 likes | 510 Views
Explore the importance of indexing in multidimensional data structures for quick access and efficient search methods. Learn about K-d Trees, Point Quadtrees, MX-Quadtrees, R-Trees, and more for storing and searching multidimensional data effectively.
E N D
Multidimensional Data Structures • Why do we need indexing? • Quick access • Search methods • sequential search • binary search • balanced search tree (B or B+ tree) • Why do we need new indexing structures • Traditional data is one-dimensional • multimedia data is multidimensional • in general, if a given information has k features, it can be represented by a k-dimensional space, where each dimension corresponds to one feature
What kind of queries may be expected? • Given a set of points in k-dimensional space • we may want to see if the point is in the set or not (exact match) • we may want to find the closest points to the given point (similarity based search or nearest neighbor queries) • given a region, we may want to find all the points in the given region (range query) • Approach • divide the space into regions • insert the new object corresponding region • if the region is full, split the region • Query • determine which regions are required to answer the query, and limit the search to these regions
Multidimensional Data Structures • K-d Trees • Point Quadtrees • MX-Quadtrees • R-Trees • Many others exist • We do not discuss them in the class
K-d Trees • Used to store k dimensional point data • Not used to store region data • A 2-d tree (k=2) stores 2-dimensional point data while 3-d tree stores 3-dimensional point data, ..
2-d trees • Node Structure • nodetype = record • INFO: infotype; • XVAL: real; • YVAL: real; • LLINK: nodetype; • RLINK: nodetype; • end • INFO filed is any user-defined type • XVAL and YVAL denote the coordinates of a point associated with the node • LLINK and RLINK fields point to two children INFO XVAL YVAL LLINK RLINK
2-d Trees • 2-d tree is a binary tree satisfying the following properties • If N is node in the tree such that level(N) is even, then • every node M in the subtree rooted at N.LLINK satisfies M.XVAL < N.XVAL • every node P in the subtree rooted at N.RLINK satisfies P.XVAL N.XVAL • If N is node in the tree such that level(N) is odd, then • every node M in the subtree rooted at N.LLINK satisfies M.YVAL < N.YVAL • every node P in the subtree rooted at N.RLINK satisfies P.YVAL N.YVAL
2-d Trees: Insertion/Search • To insert N into a tree pointed by T, • Check if N and T agree on their XVAL and YVAL • IF so, just overwrite T and we are done • Else branch left if N.XVAL < T.XVAL and branch right otherwise • Suppose P is the child. If N and P agree on their XVAL and YVAL, overwrite P and we are done, else branch left if N.YVAL < P.YVAL and branch right otherwise • Repeat this procedure, branching on XVALs when we are at even levels, and YVALs when we are at odd levels in the tree A2 30,50 A1 >= 30 60 A1 < 30 (30,50) 60,10 40 A2 >= 10 A2 < 10 (45,20) 45,20 20 (60,10) A1 >= 45 A1 < 45 A1 20 40 60
Another example of 2-d tree (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
(XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Example of Insertion (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Example of Insertion (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Example of Insertion (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Example of Insertion (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Deletion in 2-d Trees • Suppose we wish to delete (x,y) from a 2-d tree T • search for a node N in T with N.XVAL = x and N.YVAL = y • if N is a leaf node, then set the LLINK and RLINK fields of N’s parent to NIL and return N to appropriate storage • otherwise, either the subtree rooted at N.LLINK (Tl) or the subtree rooted at N.RLINK (Tr) is non-empty • Step1: Find a “candidate replacement” node R that occurs either in Tl or in Tr • Step2: Replace all of N’s non-link fields by those of R • Step3: Recursively delete R from Tl or Tr (whichever is applicable) • the above recursion is guaranteed to terminate because Tl (Tr) has strictly smaller height than the original tree T
Finding Candidate Replacement Node • The candidate replacement node R must bear the same spatial relation to all nodes P in both Tl or Tr that N bore to P • I.e, if P is to the southwest of N, then P must be to the southwest of R, if P is to the northwest of N, then P must be to the northwest of R, …. • This means R must satisfy the following properties: • 1. every node M in Tl is such that: M.XVAL < R.XVAL if level(N) is even and M.YVAL < R.YVAL if level(N) is odd • 2. every node M in Tr is such that: M.XVAL R.XVAL if level(N) is even and M.YVALR.YVAL if level(N) is odd
Finding Candidate Replacement Node • If Tr is not empty, and level(N) is even, then any node in Tr with smallest possible XVAL in Tr is a candidate replacement node • If Tr is empty, then we might not be able to find a candidate replacement node from Tl • In this case, find the node R’ in Tl with the smallest possible XVAL filed. Replace N with this • Set N.RLINK = N.LLINK and set N.LLINK = NIL • Recursively delete R’
Range queries in 2-d trees • A range query with respect to a 2-d tree T is a query that specifies a point (xc,yc) and a distance r • the answer is the set of all points (x,y) in T such that (x,y) lies within distance r of (xc,yc) • I.e., a range query defines a circle of radius r centered at (xc,yc), and expects to find all points in the 2-d tree that lie within the circle • recall that each node N in T implicitly represents a region RN • If the circle specified in a query has no intersection with RN, then there is no point in searching the subtree rooted at N
Example of the Range query (xc,yc) = (35,46) r = 9.5
K-d trees • It is a binary tree • Every node contains a data record, a left pointer and a right pointer • At every level of the tree, a different attribute of the tree is used as the discriminator in a round-robin fashion • All algorithms for 2-d trees generalize in the obvious way to k-d trees
Point Quadtrees • Point quad trees always split regions into 4 parts • In a 2-d tree, node N splits a region into two by drawing one line through the point (N.XVAL,N.YVAL) • In a point quadtree, node N splits the region it represents by drawing both horizontal and vertical line through the point (N.XVAL,N.YVAL) • These 4 parts are called the NW, SW, NE, SE quadrants determined by node N; each of these corresponds to a child of N • Node Structure • qtnodetype = record • INFO: infotype; • XVAL: real; • YVAL: real; • NW,SW,NE,SE: qtnodetype; • end
Insertion into Point Quadtrees (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Insertion into Point Quadtrees (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Insertion into Point Quadtrees (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Insertion into Point Quadtrees (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Insertion into Point Quadtrees (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Insertion into Point Quadtrees (XVAL,YVAL) City Banja Luka (19,45) Derventa (40,50) Toslic (38,38) Tuzla (54,35) Sinj (4,4)
Deletion in Point Quadtrees • If the node being deleted is a leaf node, deletion is trivial: we just set the appropriate link filed of node N’s parent to NIL and return node to storage • otherwise, as in the case of 2-d trees, we need to find an appropriate replacement node
Expanded Node Type • Expand the node structure to the following • qtnodetype = record • INFO: infotype; • XVAL,YVAL: real; • XLB,YLB,XUB,YUB: real {-,+} • NW,SW,NE,SE: qtnodetype; • end • When inserting a node N into T we need to ensure that • If N is the root node, then N.XLB = - , N.YLB = - , N.XUB = +, N.YUB = + • If P is the parent of N (assume w=P.XUB-P.XLB and h=P.YUB-Y.YLB), then Case N.XLB N.XUB N.YLB N.YUB N=P.NW P.XLB P.XLB+w*.5 P.YLB+h*.5 P.YUB N=P.SW P.XLB P.XLB+w*.5 P.YLB P.YLB+h*.5 N=P.NE P.XLB+w*.5 P.XUB P.YLB+h*.5 P.YUB N=P.SE P.XLB+w*.5 P.XUB P.YLB P.YLB+h*.5
Deletion in Point Quadtrees • When deleting an interior node N, we must find a replacement node R in one of the subtrees of N such that • every other node R1 in N.NW is to the northwest of R • every other node R2 in N.SW is to the southwest of R • every other node R3 in N.NE is to the northeast of R • every other node R4 in N.SE is to the southeast of R • In general, it may not always be possible to find such a replacement node • deletion of an interior node N may require reinsertion of all nodes in the subtrees of N • In the worst case, this may require almost all nodes to be reinserted
Range Searches in Point Quadtree • Similar to that of 2-d trees • each node in a point quadtree represents a region • do not search regions that do not intersect the circle defined by the query
The MX-Quadtree • For both 2-d trees and point quadtrees, • the shape of the tree depends on the order in which objects are inserted • the split (into 2 for 2-d and 4 for point quad) may be uneven depending on exactly where the point (N.XVAL,N.YVAL) is located inside the region represented by N • MX-quadtrees (MX stands for matrix) attempt to • ensure that the shape (and height) of the tree are independent of the number of nodes present in the tree as well as the order of insertion of these nodes • provide efficient deletion and search algorithms
The MX-Quadtree • The map being represented is split up into a grid of size (2k2k) of some k • the application developer is free to choose k to reflect the desired granularity, but once chosen, this must be kept fixed • Node structure: exactly same as that of point quadtrees, except that the root represents the region specified by XLB=0, XUB=2k, YLB=0, YUB=2k • When a region gets split, it gets split in the middle • the regions represented by the four children of N (w denotes the width of the region represented by N Child XLB XUB YLB YUB NW N.XLB N.XLB+w/2 N.YLB+w/2 N.YLB+w SW N.XLB N.XLB+w/2 N.YLB N.YLB+w/2 NE N.XLB+w/2 N.XLB+w N.YLB+w/2 N.YLB+w SE N.XLB+w/2 N.XLB+w N.YLB N.YLB+w/2
Deletion in MX-Quadtrees • It is fairly simple because all points are represented at the leaf level • To delete N, • First set the appropriate link of N’s parent (M) to NIL • Check if all the four link fields of M are NIL • If so, examine M’s parent (P), find the link field P.dir1 = M, set P.dir1 = NIL, and see if P’s four link fields are NIL • If so, continue this process • Complexity of deletion is O(k)
Range Queries in MX-Quadtrees • Handled exactly the same way as for point quadtrees, but there are 2 differences • the content of XLB,XUB,YLB,YUB fields are different • as points are stored at the leaf level, checking to see if a point is in the circle defined by the range query needs to be performed only at the leaf level
R-Trees • R-trees are used to store rectangular regions of an image or map • R-trees are particularly useful in storing very large amounts of data on disk • They provide a convenient way of minimizing the number of disk accesses • Each R-tree has an associated order, K • Each non-leaf node contains at most K rectangles and at least K/2 rectangles (except root) (I.e, each non-root node must be at least half full) • This makes R-trees appropriate for disk based retrieval (because each disk access brings back at least K/2 rectangles)
R-Trees Manipulate two kinds of rectangles: “real” and “group”
R-Tree • Node Structure • rtnodetype = record • Rec1, …. RecK: rectangle; • P1,…PK: rtnodetype; • end
Deletion in R-Trees • Deletion may cause a node to “underflow” because an R-Tree must contain at least K/2 rectangles (real or group) (Recall B+-trees) • When we delete a rectangle, we must make sure that the node is not underfull
Comparison of the four data structures • k-d trees • easy to implement • k-d tree with k nodes may have height k • since the trees are binary, search and insertion are expensive • point quadtrees • easy to implement • comparison requires comparison of 2 attributes • deletion is difficult • complexity of range queries O(2n) where n is the number of records in the tree
Comparison of the four data structures • MX-quadtrees • height is at most O(n) where the region represented is composed of (2n2n) cells • insertion, deletion, search: O(n) • range search is very efficient O(N+2h) where N is the number of points in the answer to the query and h is the height of the tree • R-tree • insertion, deletion, search, same as MX-quadtrees • since large number of rectangles are stored in each node, they are appropriate for disk accesses • the bounding rectangles may overlap (I.e., we may have to search via multiple paths)
Other multidimensional data structures • K-d-B trees • hB-tree • PMR-tree • R*-tree • ….
Commercial Systems • Informix’s MapInfo Geocoding datablade: allows assignment of latitudinal and longitudinal elements to records • Informix’s Spatial datablade: employs R-tree • Oracle Universal server provides a spatial data option and is based on quadtree technology • Intergraph’s Land Information System allows integration of survey data, imagery, etc., Allows to create temporal/historical view of landuse • ESRI provides ARC/INFO system • the spatial database engine works with geographic data stored in Oracle, Informix, Sybase, Microsoft SQL server, and DB2 • Interesting to see what data structure do they employ!!