Improving the R*-tree with Outlier Handling Techniques Tian Xia, Donghui Zhang { tianxia, donghui }@ccs.neu.edu College of Computer & Information Science Northeastern University ACM GIS'05, Bremen, Germany
Talk Outline • Background and Our Motivation • The RO-tree: Structure and Operations • Querying the RO-tree • Experimental Results • Conclusions
The R*-tree in Spatial Databases • The R*-tree is a balanced, disk-based tree structure. • Spatial objects are clustered based on the proximity of their locations. • Each sub-tree is bounded by the minimum bounding rectangle (MBR) of all objects in it.
The R*-tree: Example [Figure: example R*-tree over objects o1 to o10 with MBRs R1 to R6]
Our Motivation • Observation: If an object O is far away from the others (or is large), the MBRs containing O are large. • This is inevitable in the R*-tree! Every object has to be stored in some leaf node, and every leaf node has a minimum occupancy requirement.
Our Motivation [Figure: objects o1, o2, o3]
Outlier Objects • Objects that are far away from other objects (clusters), or that have a large extent, are outliers. • Outliers make the MBRs in the R*-tree large and degrade query performance: • They increase the dead space (space inside MBRs that contains no object). • They increase the overlap area between MBRs.
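The effect of a single outlier on an MBR can be made concrete with a small calculation. The sketch below is illustrative only: the `Rect` tuple layout, the helper names `mbr` and `area`, and the sample coordinates are assumptions, not taken from the talk.

```python
from typing import List, Tuple

# Assumed layout: (xmin, ymin, xmax, ymax). A point is a degenerate rectangle.
Rect = Tuple[float, float, float, float]


def mbr(rects: List[Rect]) -> Rect:
    """Smallest axis-aligned rectangle enclosing all input rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(r: Rect) -> float:
    """Area of a rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


# Three nearby points plus one far-away point (hypothetical data).
cluster = [(1.0, 1.0, 1.0, 1.0), (2.0, 2.0, 2.0, 2.0), (1.0, 2.0, 1.0, 2.0)]
outlier = (10.0, 10.0, 10.0, 10.0)

area_without = area(mbr(cluster))            # MBR is (1, 1, 2, 2)
area_with = area(mbr(cluster + [outlier]))   # MBR is (1, 1, 10, 10)
print(area_without, area_with)               # prints 1.0 81.0
```

Because all four objects are points, nearly all of the enlarged rectangle is dead space, and any query touching it must still visit the node.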
Range Query [Figure: range query Q over objects o1, o2, o3]
Our Solution • We treat outliers separately! [Figure: range query Q over objects o1, o2, o3, with outliers handled separately]
Our Goal • To design a structure for which existing query and update algorithms can be adapted easily. • To measure, through extensive experiments, how much performance improvement this idea brings.
Five Popular Queries on the R*-tree • Range query • Aggregation query • Nearest Neighbor query • Skyline query • Spatial Join query
The RO-tree: Outlier Handling Tree • The RO-tree is also a height-balanced, disk-based tree structure, similar to the R*-tree. • In the RO-tree, objects can appear in the index nodes, not only in the leaf nodes. • The tree still maintains the minimum fan-out: each index node contains at least m index entries (m = 40% of the node capacity).
The RO-tree: Structure Overview [Figure: RO-tree with MBRs R1 to R6 and objects o1, o2, o3 stored in index nodes]
The RO-tree: Structure • Each index node contains an index-entry part and an object part. • How should space be allocated between the two parts? • A naïve way is to set a fixed cut-off point, e.g. reserving 2m entries for the index-entry part. Not space efficient! • Dynamic allocation: as long as there is space, either an index entry or an object entry can be put into the node.
Insertion in the RO-tree • To insert an object O into the sub-tree rooted at node N: • If O is contained in the MBR of an index entry E, recursively insert O into the sub-tree rooted at the node referenced by E. • Otherwise, O is stored in N as an “outlier”.
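The insertion rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class and its field names are hypothetical, overflow handling is omitted, and taking the first containing entry when several MBRs contain the object is an assumption (the slide does not say how ties are resolved).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies completely inside outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


@dataclass
class Node:
    # Index entries as (MBR, child) pairs; empty for a leaf node.
    children: List[Tuple[Rect, "Node"]] = field(default_factory=list)
    # Data objects in a leaf, or outlier objects in an index node.
    objects: List[Rect] = field(default_factory=list)


def insert(node: Node, obj: Rect) -> Node:
    """Insert obj below node and return the node that ends up storing it."""
    for entry_mbr, child in node.children:
        if contains(entry_mbr, obj):
            # Assumption: descend into the first entry whose MBR contains obj.
            return insert(child, obj)
    # No entry contains obj (or node is a leaf): store it here.
    node.objects.append(obj)
    return node


leaf_a, leaf_b = Node(), Node()
root = Node(children=[((0.0, 0.0, 4.0, 4.0), leaf_a),
                      ((6.0, 6.0, 9.0, 9.0), leaf_b)])
stored_in_leaf = insert(root, (1.0, 1.0, 2.0, 2.0))      # inside the first MBR
stored_in_root = insert(root, (20.0, 20.0, 20.0, 20.0))  # inside no MBR
```

In this sketch no MBR is ever enlarged during insertion, which is the point of the rule: an object that fits nowhere stays at the current level instead of stretching a child's rectangle.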
Insertion: Example [Figure: RO-tree with MBRs R1 to R6 and objects o1 to o4]
Overflow Treatment (Index nodes) • Two choices: split the node, or demote an object to a lower level. • If the number of index entries is smaller than 2m, demote an object. Here, m is the minimum fan-out. • If the node holds no objects, split it. • Experimental results showed that # index entries = m+M/2 (M is the capacity of a node) is the breaking point. • Split if # index entries ≥ m+M/2.
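The split-or-demote decision can be written as a small function. This sketch reads the threshold literally as m + M/2, exactly as printed on the slide; the function name `overflow_action` and the string return values are hypothetical.

```python
def overflow_action(num_index_entries: int, num_objects: int,
                    m: int, M: int) -> str:
    """Choose how to treat an overflowing index node.

    m is the minimum fan-out and M the node capacity. The threshold
    m + M/2 is taken literally from the slide (an assumption about how
    the expression should be parsed).
    """
    threshold = m + M / 2
    if num_objects == 0:
        # Nothing to demote, so the node has to be split.
        return "split"
    if num_index_entries >= threshold:
        return "split"
    return "demote"


# Hypothetical node capacity M = 50 with m = 40% of M = 20: threshold is 45.
print(overflow_action(30, 21, m=20, M=50))  # prints demote
print(overflow_action(46, 5, m=20, M=50))   # prints split
print(overflow_action(51, 0, m=20, M=50))   # prints split
```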
Overflow Treatment: Example [Figure: RO-tree with MBRs R1 to R6 and objects o1 to o5 (1 of 2)]
Overflow Treatment: Example [Figure: RO-tree with MBRs R1 to R6 and objects o1 to o5 (2 of 2)]
Split of an Index Node • To maintain the minimum fan-out, the split is based on the index entries only. • Each outlier object in the index node is then assigned to whichever of the two new nodes needs the least expansion of its MBR.
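The second step, distributing outlier objects after the index entries have been split, might look like the sketch below. Several details are assumptions: expansion is measured as area increase, each node's MBR is updated after every assignment, ties go to the first node, and the name `assign_outliers` is hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def expansion(mbr: Rect, obj: Rect) -> float:
    """Area the MBR would gain if obj were added to it."""
    return area(union(mbr, obj)) - area(mbr)


def assign_outliers(mbr_a: Rect, mbr_b: Rect, outliers: List[Rect]):
    """Give each outlier to the new node whose MBR expands least."""
    objs_a: List[Rect] = []
    objs_b: List[Rect] = []
    for obj in outliers:
        if expansion(mbr_a, obj) <= expansion(mbr_b, obj):
            objs_a.append(obj)
            mbr_a = union(mbr_a, obj)  # assumption: MBR grows as we assign
        else:
            objs_b.append(obj)
            mbr_b = union(mbr_b, obj)
    return (mbr_a, objs_a), (mbr_b, objs_b)


# Two new nodes after splitting the index entries (hypothetical rectangles).
(node_a_mbr, node_a_objs), (node_b_mbr, node_b_objs) = assign_outliers(
    (0.0, 0.0, 2.0, 2.0), (8.0, 8.0, 10.0, 10.0),
    [(3.0, 3.0, 3.0, 3.0), (7.0, 9.0, 7.0, 9.0)])
```

Here the first outlier costs 5 area units in node A against 45 in node B, so it goes to A; the second costs 54 in A against 2 in B, so it goes to B.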
Re-insertion in the RO-tree • Re-insertion is used to identify outliers in a page; it is a way to promote objects to higher levels. • Hidden outliers can be re-identified! • The RO-tree incorporates the improved re-insertion proposed in our previous work [ZX, GIS’04].
Deletion and Underflow Treatment • Objects can be deleted both at the index levels and at the leaf level. • Re-inserting all entries of an underflowing node may be expensive! • E.g., an index node can underflow while its page is still fully occupied, because many outlier objects may be stored in the node.
Underflow Treatment • If a leaf node underflows, we first try to pull down an outlier object from its parent, if possible, to resolve the underflow. • If an index node underflows, we first insert its outlier objects into its sub-trees. • Chances are that some child page splits as a result, which resolves the underflow of the parent.
In General • Queries on the R*-tree can be easily adapted for the RO-tree by also considering the objects stored in the index nodes. • Query performance on the RO-tree is better: • Smaller MBRs in the RO-tree reduce the dead space and the overlap area. • Some queries can stop before reaching the leaf level.
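As one example of such an adaptation, a range query only needs one extra step per node: test the objects stored in the node itself before descending. The sketch below uses a hypothetical `Node` class and an optional `visited` list (added here purely to show which nodes were read); it is an illustration under these assumptions rather than the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    children: List[Tuple[Rect, "Node"]] = field(default_factory=list)
    objects: List[Rect] = field(default_factory=list)  # data or outliers


def range_query(node: Node, q: Rect,
                visited: Optional[List[Node]] = None) -> List[Rect]:
    """Return all objects below node that intersect q."""
    if visited is not None:
        visited.append(node)
    # Extra step compared with the R*-tree: check objects held in this node.
    hits = [o for o in node.objects if intersects(o, q)]
    for entry_mbr, child in node.children:
        if intersects(entry_mbr, q):
            hits.extend(range_query(child, q, visited))
    return hits


leaf = Node(objects=[(1.0, 1.0, 1.0, 1.0), (3.0, 4.0, 3.0, 4.0)])
root = Node(children=[((1.0, 1.0, 3.0, 4.0), leaf)],
            objects=[(20.0, 20.0, 20.0, 20.0)])  # one outlier in the root

visited: List[Node] = []
hits = range_query(root, (18.0, 18.0, 22.0, 22.0), visited)
```

In this example the query finds the outlier in the root and never reads the leaf, because the leaf's MBR was not stretched to cover the far-away point.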
Aggregation Query in the R*-tree • Aggregate operator: count • Each index entry is augmented with the total number of objects in its sub-tree. [Figure: R*-tree with MBRs R1 to R6]
Aggregation Query in the RO-tree • Aggregate operator: count • Each index entry is augmented with the total number of objects in its sub-tree. [Figure: RO-tree with MBRs R1 to R6 and outlier object o4]
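A count query over such augmented entries can be sketched as follows. The `Entry` and `Node` classes are hypothetical, and the sketch assumes that every object in a sub-tree, outliers included, lies inside the MBR of the entry pointing to that sub-tree, so a fully covered entry contributes its stored count without being opened.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


@dataclass
class Node:
    entries: List["Entry"] = field(default_factory=list)
    objects: List[Rect] = field(default_factory=list)  # data or outliers


@dataclass
class Entry:
    mbr: Rect
    count: int   # total number of objects in the sub-tree below child
    child: Node


def count_query(node: Node, q: Rect) -> int:
    """Count the objects below node that intersect q."""
    # Objects stored in this node are checked one by one.
    total = sum(1 for o in node.objects if intersects(o, q))
    for e in node.entries:
        if contains(q, e.mbr):
            total += e.count          # whole sub-tree qualifies, no descent
        elif intersects(e.mbr, q):
            total += count_query(e.child, q)
    return total


leaf = Node(objects=[(1.0, 1.0, 1.0, 1.0), (2.0, 3.0, 2.0, 3.0),
                     (4.0, 4.0, 4.0, 4.0)])
root = Node(entries=[Entry((1.0, 1.0, 4.0, 4.0), 3, leaf)],
            objects=[(20.0, 20.0, 20.0, 20.0)])

print(count_query(root, (0.0, 0.0, 5.0, 5.0)))    # prints 3
print(count_query(root, (0.0, 0.0, 25.0, 25.0)))  # prints 4
print(count_query(root, (0.0, 0.0, 2.0, 2.0)))    # prints 1
```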
Nearest Neighbor Query in the R*-tree [Figure: query point Q and MBRs R1 to R6]
Nearest Neighbor Query in the RO-tree [Figure: query point Q, MBRs R1 to R6, and outlier object o1]
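The slides do not state which nearest-neighbor algorithm is used. As an illustration, the sketch below applies the well-known best-first search with a priority queue: objects stored in an index node go into the same queue as that node's index entries. The `Node` class, the `mindist` helper, and the sample data are all assumptions.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


def mindist(p: Point, r: Rect) -> float:
    """Smallest Euclidean distance from point p to rectangle r."""
    dx = max(r[0] - p[0], 0.0, p[0] - r[2])
    dy = max(r[1] - p[1], 0.0, p[1] - r[3])
    return math.hypot(dx, dy)


@dataclass
class Node:
    children: List[Tuple[Rect, "Node"]] = field(default_factory=list)
    objects: List[Rect] = field(default_factory=list)  # data or outliers


def nearest(root: Node, p: Point) -> Optional[Tuple[Rect, float]]:
    """Best-first nearest-neighbor search; returns (object, distance)."""
    seq = itertools.count()  # tie-breaker so heap never compares payloads
    heap = [(0.0, next(seq), False, root)]
    while heap:
        dist, _, is_object, item = heapq.heappop(heap)
        if is_object:
            return item, dist  # nothing left in the queue can be closer
        # Objects held in this node compete with its index entries.
        for obj in item.objects:
            heapq.heappush(heap, (mindist(p, obj), next(seq), True, obj))
        for entry_mbr, child in item.children:
            heapq.heappush(heap, (mindist(p, entry_mbr), next(seq), False, child))
    return None


leaf = Node(objects=[(0.0, 0.0, 0.0, 0.0), (2.0, 2.0, 2.0, 2.0)])
root = Node(children=[((0.0, 0.0, 2.0, 2.0), leaf)],
            objects=[(5.0, 5.0, 5.0, 5.0)])  # outlier kept in the root

found, distance = nearest(root, (4.0, 4.0))
```

For the query point (4, 4) the outlier in the root is closer than the leaf's MBR, so the search returns it without opening the leaf.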
Datasets and Setup • NE: 123,593 postal addresses (points). • US: 81,043 railroads (line segments). • CAmix: 62,556 locations (points) and 7,697 poly-lines (large extent objects). • Page size: 1KB, 2KB, 3KB, 4KB. • Minimum fan-out for both the RO-tree and the R*-tree is 40% of the node capacity.
Range Query (NE dataset)
Aggregation Query (NE dataset)
Nearest Neighbor Query
Skyline Query
Spatial Join Query
Performance Comparison
Conclusions • We explored the idea of identifying outlier objects and storing them at higher levels of the spatial tree index. • We proposed a simple but effective index structure, the RO-tree, which handles outlier objects gracefully. • We showed how to adapt existing query algorithms to the RO-tree. • Extensive experiments showed significant query improvements over the R*-tree.
Thank you!