The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree • Lars Arge (1), Mark de Berg (2), Herman Haverkort (3) and Ke Yi (1) • (1) Department of Computer Science, Duke University • (2) Department of Computer Science, TU Eindhoven • (3) Institute of Information and Computing Sciences, Utrecht University
Problem Definition • Input: • N rectangles in the plane • Window query Q • Output: • All rectangles intersecting Q • Applications • Spatial databases • GIS • CAD • Computer vision • Robotics • …
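As a baseline for the problem stated above, here is a minimal sketch of a window query answered by a linear scan. The `Rect` tuple layout `(xmin, ymin, xmax, ymax)` and the names `intersects` and `window_query` are assumptions made for illustration, and rectangles are treated as closed, so touching boundaries count as intersecting.

```python
from typing import List, Tuple

# Assumed layout: (xmin, ymin, xmax, ymax), axis-parallel and closed.
Rect = Tuple[float, float, float, float]


def intersects(r: Rect, q: Rect) -> bool:
    """True if the closed rectangles r and q share at least one point."""
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


def window_query(rects: List[Rect], q: Rect) -> List[Rect]:
    """Brute-force window query: test every rectangle against Q."""
    return [r for r in rects if intersects(r, q)]
```

This scan touches all N rectangles regardless of the output size, which is the cost an R-tree is meant to avoid.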
R-Tree • Fanout: Θ(B), where B is the disk block size • Definition [Guttman84]: • Advantages: • Little redundancy • Multi-purpose • Easy to update • [Figure: example R-tree over rectangles A to I]
How to Build an R-Tree • Repeated insertions • [Guttman84] • R+-tree [Sellis et al. 87] • R*-tree [Beckmann et al. 90] • Bulkloading • Hilbert R-Tree [Kamel and Faloutsos 94] • Top-down Greedy Split [Garcia et al. 98] • Advantages: • Much faster than repeated insertions • Better space utilization • Usually produces R-trees of higher quality
R-Tree Variant: Hilbert R-Tree • To build a Hilbert R-Tree (cost: O((N/B) log_{M/B} N) I/Os) • Sort the rectangles by the Hilbert values of their centers • Build a B-tree on top • 4D Hilbert R-tree • [Figure: Hilbert curve]
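The sort-then-pack step can be sketched as follows. This is an in-memory illustration only, not the external-memory bulk-loading procedure: it assumes rectangle centers fall on an integer grid of side `n` (a power of two), uses the standard iterative conversion from grid coordinates to a Hilbert index, and the names `xy2d`, `mbr`, and `hilbert_leaves` are hypothetical.

```python
def xy2d(n, x, y):
    """Hilbert index of grid cell (x, y) on an n x n grid, n a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level is in canonical position.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def mbr(rects):
    """Minimal bounding box of a non-empty list of (xmin, ymin, xmax, ymax)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def hilbert_leaves(rects, B, n):
    """Sort rectangles by the Hilbert value of their centers, pack B per leaf."""
    def key(r):
        cx = int((r[0] + r[2]) / 2)  # center truncated to a grid cell (assumption)
        cy = int((r[1] + r[3]) / 2)
        return xy2d(n, cx, cy)

    ordered = sorted(rects, key=key)
    return [ordered[i:i + B] for i in range(0, len(ordered), B)]
```

Upper levels would be obtained by applying `mbr` to each leaf and packing the resulting boxes B at a time, in the same order, until a single root remains.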
R-Tree Variant: TGS R-Tree (Top-down Greedy Split) • To build a TGS R-tree • Start from the root and build the tree top-down • To build one node, use binary cuts until the desired fan-out is reached • To make a binary cut, consider 4 orderings of the rectangles: xmin, ymin, xmax, ymax • In each ordering, consider the B cutting positions • Choose the one that minimizes the sum of the areas of the two resulting bounding boxes • Typical bulk-load cost: O((N/B) log_2 N) I/Os
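A single greedy binary cut, as described above, might look like the sketch below. It is an in-memory illustration under stated assumptions: candidate cut positions are taken at multiples of a hypothetical `step` parameter (standing in for the subtree capacity), bounding boxes are recomputed naively for every candidate, and the names `area`, `mbr`, and `best_binary_cut` are not from the source.

```python
def mbr(rects):
    """Minimal bounding box of a non-empty list of (xmin, ymin, xmax, ymax)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def best_binary_cut(rects, step):
    """Try the 4 orderings (xmin, ymin, xmax, ymax) and every cut position at a
    multiple of `step`; return (cost, left, right) for the cut that minimizes
    the summed area of the two bounding boxes, or None if no cut is possible."""
    best = None
    for dim in range(4):
        ordered = sorted(rects, key=lambda r: r[dim])
        for i in range(step, len(ordered), step):
            left, right = ordered[:i], ordered[i:]
            cost = area(mbr(left)) + area(mbr(right))
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best
```

Applying the cut recursively to the larger side until the node has the desired fan-out, and then recursing into each part, gives the top-down construction the slide outlines.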
Our Results • None of the existing R-tree variants has a worst-case query performance guarantee! • In the worst case, a query can visit all nodes in the tree even when the output size is zero • Priority R-Tree • The first R-tree variant that answers a query by visiting O(√(N/B) + T/B) nodes in the worst case • T: Output size • It is optimal! • There exists a dataset such that for any R-tree, there is an empty query that visits Ω(√(N/B)) nodes. [Kanth and Singh 99, Agarwal et al. 02]
Roadmap • Pseudo-PR-Tree • Has the desired worst-case guarantee • Not a real R-tree • Transform a pseudo-PR-Tree into a PR-tree • A real R-tree • Maintain the worst-case guarantee • Experiments • PR-tree • Hilbert R-tree (2D and 4D) • TGS-R-tree
Building a Pseudo-PR-Tree • Step 1: Take out the B extreme rectangles in each direction and put them into priority leaves • [Figure: root with its priority leaves]
Building a Pseudo-PR-Tree • Step 2: Divide by the xmin coordinates and build subtrees recursively. Division is performed using xmin, ymin, xmax, ymax in a round-robin fashion, like a 4D kd-tree • Analysis sketch: • # nodes with at least one priority leaf completely reported: O(T/B) • # nodes with no priority leaf completely reported: O(√(N/B))
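The two construction steps can be put together in a small in-memory sketch. It is not the I/O-efficient bulk-loading algorithm: it sorts repeatedly in main memory, assumes a non-empty input and B ≥ 1, takes the extremes as smallest xmin, smallest ymin, largest xmax, largest ymax (an assumption about what "extreme" means in each direction), and uses hypothetical names `Node`, `build`, and `query`.

```python
def mbr(rects):
    """Minimal bounding box of a non-empty list of (xmin, ymin, xmax, ymax)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class Node:
    def __init__(self, box, rects=None, children=None):
        self.box = box
        self.rects = rects or []        # filled for leaves only
        self.children = children or []  # priority leaves, then kd subtrees


def build(rects, B, depth=0):
    if len(rects) <= B:
        return Node(mbr(rects), rects=list(rects))
    rest = list(rects)
    children = []
    # Step 1: up to four priority leaves holding the B most extreme rectangles.
    for dim, largest in ((0, False), (1, False), (2, True), (3, True)):
        if not rest:
            break
        rest.sort(key=lambda r: r[dim], reverse=largest)
        leaf, rest = rest[:B], rest[B:]
        children.append(Node(mbr(leaf), rects=leaf))
    # Step 2: split the remainder at the median, cycling through the four
    # coordinates with depth like a 4D kd-tree, and recurse.
    if rest:
        rest.sort(key=lambda r: r[depth % 4])
        mid = len(rest) // 2
        for part in (rest[:mid], rest[mid:]):
            if part:
                children.append(build(part, B, depth + 1))
    return Node(mbr(rects), children=children)


def query(node, q):
    """Report all rectangles in the subtree of `node` that intersect q."""
    if not intersects(node.box, q):
        return []
    if not node.children:
        return [r for r in node.rects if intersects(r, q)]
    out = []
    for child in node.children:
        out.extend(query(child, q))
    return out
```

Each internal node ends up with at most six children (four priority leaves and two subtrees), which is why the structure is not a real R-tree with fan-out Θ(B).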
Query Complexity Remains Unchanged Next level: # nodes visited on leaf level
PR-Tree: Bulkload & Updates • Bulkload • O((N/B) log_2 N) I/Os → O((N/B) log_{M/B} N) I/Os, using the "grid method" [Agarwal et al. 01] • The same as the Hilbert R-tree, but with a larger constant • Updates • Can use any previous heuristic to update in O(log_B N) I/Os • Without worst-case query guarantee • Use the logarithmic method • Insert: O(log_B N + (1/B) · log_{M/B} N · log_2(N/M)) I/Os • Delete: O(log_B N) I/Os • Extending to d dimensions • Query bound: O((N/B)^{1-1/d} + T/B), still optimal • Bulkload & update bounds remain the same
Experiments • Implemented with TPIE • Priority R-tree • Hilbert R-tree • 4D Hilbert R-tree • TGS R-tree • Real-life data • TIGER datasets • 16 million rectangles • Synthetic data • Varying from normal to extreme data • 10 million rectangles
Experiments with Real-Life Data • Query performance on the TIGER datasets • Shown: (# I/Os spent in answering a query) / (T/B)
Experiments with Synthetic Data: SIZE Each side of a rectangle is uniformly distributed in [0, max_side] Queries are squares with area 1%
Experiments with Synthetic Data: ASPECT Fix the area, vary aspect ratio
Experiments with Synthetic Data: SKEWED • Randomly place points, then apply y' = y^c to the y-coordinates
Conclusions • In theory • The PR-tree is the first R-tree variant that answers a window query in O(√(N/B) + T/B) I/Os in the worst case, which is optimal • In practice • Roughly the same as the previous best R-trees on real-life and relatively nicely distributed data • Outperforms them significantly on more extreme data • Future work • How previous heuristics may affect the performance of the PR-tree in the dynamic case
Lower Bound Construction • Each bounding box intersects at least queries • N/B bounding boxes • queries • There exists a query that intersects at least bounding boxes
Pseudo-PR-Tree: Query Complexity • Nodes v visited such that all rectangles in at least one of the priority leaves of v's parent are reported: O(T/B) • Let v be a visited node such that none of the priority leaves at its parent is reported completely, and consider v's parent u • [Figure: query Q in 2D and the corresponding 4D view, with boundaries ymin = ymax(Q) and xmax = xmin(Q)]
Pseudo-PR-Tree: Query Complexity • The cell in the 4D kd-tree of u is intersected by two different 3-dimensional hyper-planes • The intersection of each pair of such 3-dimensional hyper-planes is a 2-dimensional hyper-plane • Lemma: the number of cells in a d-dimensional kd-tree that intersect an axis-parallel f-dimensional hyper-plane is O((N/B)^{f/d}) • So, # such cells in a 4D kd-tree: O((N/B)^{2/4}) = O(√(N/B)) • Total # nodes visited: O(√(N/B) + T/B)
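As a worked instance of the lemma, the substitution for the planar case is written out below, followed by the same substitution in d dimensions. The d-dimensional step assumes the argument carries over unchanged, with rectangles mapped to points in 2d dimensions and the relevant hyper-plane having dimension 2d - 2; that assumption is made here for illustration and is consistent with the d-dimensional query bound quoted in the deck.

```latex
% Planar case: 4D kd-tree, hyper-plane of dimension f = 2
O\!\left(\left(\tfrac{N}{B}\right)^{f/d}\right)
  = O\!\left(\left(\tfrac{N}{B}\right)^{2/4}\right)
  = O\!\left(\sqrt{N/B}\right)

% Adding the O(T/B) nodes charged to completely reported priority leaves
O\!\left(\sqrt{N/B} + \tfrac{T}{B}\right)

% d-dimensional rectangles (assumed): 2d-dimensional kd-tree, f = 2d - 2
O\!\left(\left(\tfrac{N}{B}\right)^{(2d-2)/(2d)}\right)
  = O\!\left(\left(\tfrac{N}{B}\right)^{1 - 1/d}\right)
```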
Experiments with Real-Life Data • Datasets: TIGER/Line data • Bulk-loading: