Performance Comparison of xBR -trees and R*-trees for Single Dataset Spatial Queries

Fifteenth East-European Conference on Advances in Databases and Information Systems September 19th–23rd, 2011, Vienna, Austria Performance Comparison of xBR-trees and R*-trees for Single Dataset Spatial Queries George Roumelis (Master in Information Systems, Open University of Cyprus, Cyprus) Michael Vassilakopoulos (*) (Dept. of Computer Science and Biomedical Informatics, University of Central Greece, Greece) Antonio Corral (Department of Languages and Computing, University of Almeria, Spain) * speaker

Outline • Motivation • Contribution and Background • R-tree and R*-tree • XBR-tree, Internal Nodes and Leaf Nodes • Single Dataset query processing on XBR-trees • Experimental Results: Settings,Tree Building, Single Dataset Queries • Conclusions and Future Work

Motivation • In applications, a variety of Spatial Queries arise, • involving one spatial dataset, like Point Location, Window, Distance Range ,Nearest NeighborQueries • involving two spatial datasets, like Distance Join, Closest Pair, All-Nearest Neighbor Queries • Usually, there is no overall performance winner • Many researchers have compared different Access Methods, regarding their I/O and execution time performance, for a variety a Spatial Queries • In this work, we compare the popular R*-tree and the External Balanced Regular (xBR) tree for Single Dataset Spatial Queries

Contribution We implement the xBR-tree and present conclusions arising from the (real data based) experimental comparison of xBR-trees and R*-trees regarding I/O Performance and Execution Time for • Tree building • Point Location Queries (PLQs) • Window Queries (WQs) • Distance Range Queries (DRQs) • K-Nearest Neighbor Queries (K-NNQs) • Constrained K-Nearest Neighbor Queries (CK-NNQs)

Background (1) • Given an index I and a query point q, the PLQ returns true if q belongs to I and false otherwise • Given an index I and a query rectangle r, the result of the WQ is the set of all points in I that are completely inside r • Given an index I, a query point q and a distance threshold delta >= 0, the DRQ returns all points of I that are within the delta distance from q (according to a distance function)

Background (2) • Given an index I, a query point q, and a value K > 0, the K-NNQ returns K points of I which are closest to q (according to a distance function) • Given an index I, a query point q, a value K > 0 and a distance threshold delta >= 0, the CK-NNQ returns K closest points of I which are within delta distance from q (according to a distance function)

Nested MBRs can be organized as a tree (R-tree) R-Tree R10 R11 R12 Clusters of spatial objects can be recursively grouped into Minimum Bounding Rectangles – MBRs R10 R11 R12 R1 R4 R2 R5 R1 R2 R3 R4 R5 R6 R7 R8 R9 R6 R3 R9 R7 R8 Nodes that contain points 7

R*-Tree • TheR*-tree is the most popular R-tree variation • It added two major enhancements to the original R-tree, when a node overflow is caused • First, rather than just considering the area, the node-splitting algorithm in R*-trees also minimizes the perimeter and overlap enlargement of the MBRs • Second, an overflowed node is not split immediately, but a portion of entries of the node is reinserted from the top of the R*-tree (forced reinsertion) 8

xBR-Tree • xBR-trees can be defined for various dimensions • For the ease of exposition, we assume 2 dimensions • For 2 dimensions the hierarchical decomposition of space is that of Quadtrees. • The space indexed by an xBR-tree is a square, expressed in a coordinate system of real numbers • The nodes of xBR-trees are disk pages and are distinguished in two kinds: leaves, which store the actual multidimensional points and internal nodes, which provide a multiway indexing mechanism for these data

Internal Nodes • Each node consists of a non-predefined number of entries of the form <shape, address, REG, pointer> • An addressis formed by directional digits (NW, NE, SW, SE), determines the region of a child node and is accompanied by the pointer to this child • Shape is a flag that determines if the region of the child is a complete square (used widely in queries) • REG stores the coordinates of the region referenced by address (it is more expensive to calculate it) • The region of a child is the subquadrant of its address minus the subquarants of the next addresses in the same node

An xBR-tree example • The region of the root is the original space (a quadrangle) • The * symbol denotes the end of an address • The address of the right child is 0*, since the region of this child is the NW quadrant of the original space • The address of the left child is *: its region is the whole space minus the region of the right child

Processing PLQs and WQs • Queries are processed in a top-down manner on the xBR-tree, like on the R*-tree • During a PLQ for a specified point, the appropriate leaf is determined by descending the tree from the root (unlike the R*-tree, a single path is followed) • During WQs, we examine if the subquadrant of the current internal-node entry and the query window intersect and follow the pointer to the related child • We repeat until we have examined all entries of the internal node, or until the query window is completely inscribed inside the region of the entry that we examine

Processing DRQs • The DRQ follows the same strategy as the WQ • At first, the querying circle is replaced from its MBR (the calculations are faster in this way) and if the answer about the intersection of the subquadrant of the current entry and the query MBR is positive, then we follow the pointer to the related child at the next lower level • If we reach a leaf with a region that intersects the query MBR, we select the points in the leaf that are inside the query circle

Processing K-NNQs • The K-NNQ algorithm follows a DF tree traversal • In an internal node, entries are visited according to their mindist from the query point • The process is repeated recursively until the leaf level is reached, where a potential next NN is found • It is possible to reach a leaf, but the next NN may exist in a neighboring region. Thus, we use a global max K-heap and insert in it every point of this leaf that is nearest to the query point than the root of the heap • When the heap is full, according to a set of conditions*, the search is stopped • *Roumelis et al., Nearest Neighbor Algorithms Using xBR-Trees, PCI 2011

Experimental Results Experimental settings • We used 5 real datasets of different sizes (CSN: 98022 line-segments, NApp: 24493 points, NAcl: 9203 points, Narr: 191637 line-segments, Nard: 569120 line-segments) • To create 2d point datasets from non-point datasets, we used the centroids of the line-segment MBRs • Environment used: Linux machine, Intel core duo 2x2GHz processor, 3 GB of RAM, gcc • Performance measurements: I/O activity (page accesses) and Execution Time • An extended set of experimental results is accessible from: http://delab.csd.auth.gr/~michalis/xBRsys/results

Tree Construction • In all cases, the xBR-tree uses less space (i.e. it is more compact) and time than the R*-tree (the R*-tree creation is slower, partially, due to the use of forced reinsertion that improves searching efficiency) • The difference in creation time is enlarged as the size of node increases

Point Location Query • The xBR-tree needs less read accesses and executes every query faster than the R*-tree • The xBR-tree needs a number of disk accesses equal to its tree height, while the R*-tree needs at least this number of access and, in most cases, even more

Window Query (1) • The xBR-tree needs more page accesses • As the size of node increases the I/O difference between the two trees becomes smaller • In both trees, a linear dependence of the number of accesses to the size of the node appears • This is due to reduction of tree height as the size of node increases

Window Query (2) • Query windows that were inhabited by points (non-empty windows) were used • The xBR-tree becomes clearly faster (execution time) for all sizes of nodes and the I/O efficiency of the two trees iscloser • Mainmemory processing is simpler (and thus faster) for the xBR-tree

Distance Range Query • Τhe xBR-treeneeds less disk accesses and is faster than the R*-tree, in all cases and for alldatasets • The resultswere even better when the DRQs addressed only non-empty regions

K- Nearest Neighbor Query • The xBR-treeneeds more disk accesses than the R*-tree, but thedifference gets smaller when the size of node increases • Regarding the execution time, the xBR-tree shows improvedperformance, in relation to its I/O difference from the R*-tree • The worse time performance of both trees is for larger node sizes • This is due to the fact that as the node size increases, the trees become very wide and very short

Constrained K-NN Query • Studying the results of several CK-NNQ experiments, we see that the xBR-tree is improved for both performance categories • Depending on the dataset, for non-empty regions the CK-NNQ time performance of the xBR-tree is almost the same to, or much better than the R*-tree

Conclusions • We performed an extensive (real data based) experimental comparison of the performance of xBR-trees and R*-trees, for spatial queries where a single index is involved • The conclusions arising from this comparison show that the two structures are competitive • The xBR-tree is smaller and is built faster than the R*-tree • The performance of the xBR-tree is higher for PLQs and DRQs and for WQs when the query window is non-empty • The R*-tree is better for K-NNQs and needs less disk access for CK-NNQs • The execution time winner for CK-NNQs depends on whether the query returns result points, or not

Future Work • To extend the xBR-tree for modeling empty regions too • To use in the xBR-tree internal-node address representation that facilitates main memory computation • To study the relative performance of the two structures using memory consuming BF algorithms • To study the relative performance of the two trees for two dataset (join) queries • To study the relative performance of the two structures in the presence of buffering

Thank you for your attention mvasilako@ucg.gr http://users.ucg.gr/~mvasilako

Performance Comparison of xBR -trees and R*-trees for Single Dataset Spatial Queries