
Improving the R*-tree with Outlier Handling Techniques


Presentation Transcript


  1. Improving the R*-tree with Outlier Handling Techniques Tian Xia, Donghui Zhang { tianxia, donghui }@ccs.neu.edu College of Computer & Information Science Northeastern University ACM GIS'05, Bremen, Germany

  2. Talk Outline • Background and Our Motivation • The RO-tree: Structure and Operations • Querying the RO-tree • Experimental Results • Conclusions

  3. The R*-tree in Spatial Databases • The R*-tree is a balanced, disk-based tree structure. • Spatial objects are clustered based on the proximity of their locations. • Each sub-tree is bounded by the minimum bounding rectangle (MBR) of all objects in it.

  4. The R*-tree: Example [Figure: objects o1 to o10 in the plane, enclosed by MBRs R1 to R6, alongside the corresponding tree of index and leaf nodes.]

  5. Our Motivation • Observation: if an object O is far away from the others (or is large), the MBRs containing O are large. • This is inevitable in the R*-tree! Every object has to be contained in some leaf node, and every leaf node has a minimum content requirement.
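The observation above can be illustrated with a few lines of Python. This is only a sketch with made-up coordinates: the `Rect` tuple layout and the helper names `mbr` and `area` are assumptions for illustration, not part of the talk.

```python
from typing import Iterable, Tuple

# Assumed layout for illustration: (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]

def mbr(rects: Iterable[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty collection of rectangles."""
    rects = list(rects)
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

# Three nearby objects and one distant object (hypothetical data).
cluster = [(0, 0, 1, 1), (1, 1, 2, 2), (2, 0, 3, 1)]
outlier = (50, 50, 51, 51)

tight = mbr(cluster)                   # (0, 0, 3, 2), area 6
inflated = mbr(cluster + [outlier])    # (0, 0, 51, 51), area 2601
print(area(tight), area(inflated))
```

With these numbers, a single distant object grows the bounding rectangle from area 6 to area 2601, almost all of it dead space.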

  6. Our Motivation [Figure: objects o1, o2, and o3.]

  7. Outlier Objects • Objects that are far away from other objects (clusters) or that have a large extent are outliers. • Outliers make the MBRs in the R*-tree large and degrade query performance: • They increase the dead space (space inside MBRs that contains no object). • They increase the overlap area.

  8. Range Query [Figure: range query Q over objects o1, o2, and o3.]

  9. Our Solution • We treat outliers separately! [Figure: range query Q over objects o1, o2, and o3.]

  10. Our Goal • Existing query and update algorithms can be adapted easily to the new structure. • To see, by running extensive experiments, how much performance improvement this idea brings.

  11. Five Popular Queries on the R*-tree • Range query • Aggregation query • Nearest Neighbor query • Skyline query • Spatial Join query

  12. Talk Outline • Background and Our Motivation • The RO-tree: Structure and Operations • Querying the RO-tree • Experimental Results • Conclusions

  13. The RO-tree: Outlier Handling Tree • The RO-tree is also a height-balanced, disk-based tree structure, similar to the R*-tree. • In the RO-tree, objects can appear in index nodes, not only in leaf nodes. • The tree still maintains the minimum fan-out: each index node contains at least m index entries (m = 40% of the node capacity).

  14. The RO-tree: Structure Overview [Figure: RO-tree over MBRs R1 to R6, with objects o1, o2, and o3.]

  15. The RO-tree: Structure • Each index node contains an index entry part and an object part. • How should space be allocated between the two parts? • A naïve way is to set a fixed cut-off point, e.g. reserving 2m entries for the index entry part. Not space efficient! • Dynamic allocation: as long as there is space, either an index entry or an object entry can be put into the node.

  16. Insertion in the RO-tree • To insert an object O into the sub-tree rooted at node N: • If O is contained in the MBR of an index entry E, recursively insert O into the sub-tree rooted at the node referenced by E. • Otherwise, O is stored in N as an “outlier”.
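The insertion rule on this slide can be sketched in Python as follows. The `Node` class, its field names, and the choice of the first containing entry when several entries contain O are assumptions made for illustration; overflow handling is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)

def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

@dataclass
class Node:
    # Index entry part: (MBR, child node) pairs. Empty for a leaf node.
    entries: List[Tuple[Rect, "Node"]] = field(default_factory=list)
    # Object part: objects stored directly in this node.
    objects: List[Rect] = field(default_factory=list)

def insert(node: Node, obj: Rect) -> Node:
    """Insert `obj` below `node` and return the node that ends up holding it."""
    for mbr, child in node.entries:
        if contains(mbr, obj):
            # Assumption: take the first containing entry; the slide does not
            # say how to choose when several entries contain the object.
            return insert(child, obj)
    # No index entry contains the object: keep it here (as an outlier if
    # `node` is an index node, as a regular object if it is a leaf).
    node.objects.append(obj)
    return node

leaf_a, leaf_b = Node(), Node()
root = Node(entries=[((0, 0, 10, 10), leaf_a), ((20, 20, 30, 30), leaf_b)])

holder_near = insert(root, (1, 1, 2, 2))      # falls inside the first MBR
holder_far = insert(root, (50, 50, 51, 51))   # contained in no MBR
print(holder_near is leaf_a, holder_far is root)
```

Note that no MBR is ever enlarged in this sketch: an object that fits nowhere stays in the current node.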

  17. Insertion: Example [Figure: the RO-tree with MBRs R1 to R6 and objects o1 to o3, with object o4 being inserted.]

  18. Overflow Treatment (Index nodes) • Two choices: split the node, or demote an object to a lower level? • If # index entries is smaller than 2m, demote an object. Here, m is the minimum fan-out. • If the node holds no object, split the node. • Experimental results showed that # index entries = m+M/2 (M is the capacity of a node) is a breaking point. • Split if # index entries ≥ m+M/2.
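A minimal sketch of the split-or-demote decision, written in Python. The function name and its parameters are hypothetical. Because the slide mentions both 2m and m+M/2 as cut-off values, the threshold is left as a caller-supplied parameter here and no particular formula is built in.

```python
def overflow_action(num_index_entries: int, num_objects: int,
                    split_threshold: int) -> str:
    """Decide how to treat an overflowing index node.

    `split_threshold` is the number of index entries at or above which the
    node is split. Its value is an assumption left to the caller.
    """
    if num_objects == 0:
        # Nothing can be demoted, so splitting is the only option.
        return "split"
    if num_index_entries >= split_threshold:
        return "split"
    # Otherwise free up space by pushing one outlier object to a lower level.
    return "demote"

print(overflow_action(12, 5, 30))   # few index entries, objects present
print(overflow_action(35, 5, 30))   # many index entries
print(overflow_action(10, 0, 30))   # no object to demote
```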

  19. Overflow Treatment: Example [Figure: the RO-tree with MBRs R1 to R6 and objects o1 to o5.]

  20. Overflow Treatment: Example [Figure: the RO-tree with MBRs R1 to R6 and objects o1 to o5, continued.]

  21. Split of an Index Node • To maintain the minimum fan-out, the split is based on the index entries only. • Each outlier object in the index node is then assigned to whichever of the two new nodes requires the least expansion of its MBR.
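The assignment of outlier objects after a split can be sketched as below. This is an illustration under assumptions: each new node is represented only by the MBR of its index entries, that MBR is held fixed while outliers are distributed, and ties go to the first node. The helper names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr: Rect, obj: Rect) -> float:
    """Extra area needed for `mbr` to also cover `obj`."""
    return area(union(mbr, obj)) - area(mbr)

def assign_outliers(mbr_a: Rect, mbr_b: Rect,
                    outliers: List[Rect]) -> Tuple[List[Rect], List[Rect]]:
    """Give each outlier to the node whose MBR would expand the least."""
    to_a: List[Rect] = []
    to_b: List[Rect] = []
    for obj in outliers:
        if enlargement(mbr_a, obj) <= enlargement(mbr_b, obj):
            to_a.append(obj)
        else:
            to_b.append(obj)
    return to_a, to_b

# Two new nodes produced by splitting on the index entries (hypothetical MBRs).
node_a_mbr = (0, 0, 10, 10)
node_b_mbr = (20, 0, 30, 10)
to_a, to_b = assign_outliers(node_a_mbr, node_b_mbr,
                             [(11, 0, 12, 1), (18, 0, 19, 1)])
print(to_a, to_b)
```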

  22. Re-insertion in the RO-tree • Re-insertion is used to identify outliers in a page; it is a way to promote objects to higher levels. • Hidden outliers can be re-identified! • The RO-tree incorporates the improved re-insertion proposed in our previous work [ZX, GIS’04].

  23. Deletion and Underflow Treatment • Objects can be deleted both at the index levels and at the leaf level. • Re-inserting all entries of an underflowing node may be expensive! • E.g., an index node may underflow while it is still fully occupied, because it may hold many outlier objects.

  24. Underflow Treatment • If a leaf node underflows, we first try to pull down an outlier object from its parent, if possible, to resolve the underflow. • If an index node underflows, we first insert its outlier objects into its sub-trees. • Chances are that some child page will split, which resolves the underflow of the parent.

  25. Talk Outline • Background and Our Motivation • The RO-tree: Structure and Operations • Querying the RO-tree • Experimental Results • Conclusions

  26. In General • Queries on the R*-tree can be easily adapted to the RO-tree by also considering the objects stored in the index nodes. • Query performance on the RO-tree is better: • Smaller MBRs in the RO-tree reduce the dead space and overlap area. • Some queries can stop before reaching the leaf level.
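As an illustration of the adaptation described on this slide, here is a sketch of a range query that also checks the objects stored in index nodes. The `Node` layout, the optional `visited` list used to count node accesses, and the sample data are assumptions, not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    entries: List[Tuple[Rect, "Node"]] = field(default_factory=list)
    objects: List[Rect] = field(default_factory=list)

def range_query(node: Node, q: Rect,
                visited: Optional[List[Node]] = None) -> List[Rect]:
    """Return all objects intersecting `q`; record accessed nodes in `visited`."""
    if visited is not None:
        visited.append(node)
    # The only change from a plain R-tree search: objects may live in any
    # node, so they are checked at every level, not just in the leaves.
    hits = [o for o in node.objects if intersects(o, q)]
    for mbr, child in node.entries:
        if intersects(mbr, q):
            hits.extend(range_query(child, q, visited))
    return hits

leaf_a = Node(objects=[(1, 1, 2, 2), (3, 3, 4, 4)])
leaf_b = Node(objects=[(20, 20, 21, 21), (25, 25, 26, 26)])
root = Node(entries=[((1, 1, 4, 4), leaf_a), ((20, 20, 26, 26), leaf_b)],
            objects=[(50, 50, 51, 51)])   # outlier kept at the root

visited: List[Node] = []
hits = range_query(root, (49, 49, 52, 52), visited)
print(hits, len(visited))
```

In this example the query around the outlier is answered from the root alone, because neither child MBR was enlarged to cover the distant object.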

  27. Aggregation Query in the R*-tree • Aggregate operator: count • Each index entry is augmented with the total number of objects in its sub-tree. [Figure: R*-tree with MBRs R1 to R6.]

  28. Aggregation Query in the RO-tree • Aggregate operator: count • Each index entry is augmented with the total number of objects in its sub-tree. [Figure: RO-tree with MBRs R1 to R6 and object o4.]
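A count query over such augmented entries might look like the sketch below. It assumes the usual aggregate-tree shortcut (add the stored count when the query fully contains an entry's MBR, descend only on partial overlap) plus a per-object check for outliers kept in the visited node. The entry layout `(MBR, count, child)` and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

@dataclass
class Node:
    # Each index entry carries the number of objects in its sub-tree.
    entries: List[Tuple[Rect, int, "Node"]] = field(default_factory=list)
    objects: List[Rect] = field(default_factory=list)

def count_query(node: Node, q: Rect) -> int:
    """Count the objects that intersect the query rectangle `q`."""
    # Objects stored in this node are tested one by one.
    total = sum(1 for o in node.objects if intersects(o, q))
    for mbr, count, child in node.entries:
        if contains(q, mbr):
            total += count                   # whole sub-tree qualifies
        elif intersects(mbr, q):
            total += count_query(child, q)   # partial overlap: descend
    return total

leaf_a = Node(objects=[(1, 1, 2, 2), (3, 3, 4, 4)])
leaf_b = Node(objects=[(20, 20, 21, 21), (25, 25, 26, 26)])
root = Node(entries=[((1, 1, 4, 4), 2, leaf_a), ((20, 20, 26, 26), 2, leaf_b)],
            objects=[(50, 50, 51, 51)])

print(count_query(root, (0, 0, 10, 10)),
      count_query(root, (0, 0, 22, 22)),
      count_query(root, (0, 0, 60, 60)))
```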

  29. Nearest Neighbor Query in the R*-tree [Figure: query point Q and MBRs R1 to R6.]

  30. Nearest Neighbor Query in the RO-tree [Figure: query point Q, MBRs R1 to R6, and object o1.]
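The transcript does not say which nearest neighbor algorithm the talk uses. As one possible illustration, the sketch below applies the well-known best-first search with a priority queue, adapted so that objects stored in an index node enter the queue together with that node's child entries. All names and the sample data are assumptions.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]

def mindist(p: Point, r: Rect) -> float:
    """Smallest Euclidean distance from point `p` to rectangle `r`."""
    dx = max(r[0] - p[0], 0, p[0] - r[2])
    dy = max(r[1] - p[1], 0, p[1] - r[3])
    return (dx * dx + dy * dy) ** 0.5

@dataclass
class Node:
    entries: List[Tuple[Rect, "Node"]] = field(default_factory=list)
    objects: List[Rect] = field(default_factory=list)

def nearest(root: Node, p: Point) -> Optional[Tuple[Rect, float]]:
    """Best-first search: return (object, distance), or None for an empty tree."""
    tie = itertools.count()  # unique tie-breaker so heap items never compare nodes
    heap = [(0.0, next(tie), False, root)]
    while heap:
        dist, _, is_object, item = heapq.heappop(heap)
        if is_object:
            return item, dist   # nothing left in the queue can be closer
        for obj in item.objects:             # outliers (or leaf objects)
            heapq.heappush(heap, (mindist(p, obj), next(tie), True, obj))
        for mbr, child in item.entries:      # sub-trees
            heapq.heappush(heap, (mindist(p, mbr), next(tie), False, child))
    return None

leaf_a = Node(objects=[(0, 0, 0, 0), (2, 2, 2, 2)])
leaf_b = Node(objects=[(10, 10, 10, 10), (12, 12, 12, 12)])
root = Node(entries=[((0, 0, 2, 2), leaf_a), ((10, 10, 12, 12), leaf_b)],
            objects=[(6, 6, 6, 6)])   # outlier kept at the root

print(nearest(root, (6, 7)))
```

For the query point (6, 7), the outlier at the root is returned without opening either leaf, since both child MBRs are farther away than the outlier itself.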

  31. Talk Outline • Background and Our Motivation • The RO-tree: Structure and Operations • Querying the RO-tree • Experimental Results • Conclusions

  32. Datasets and Setup • NE: 123,593 postal addresses (points). • US: 81,043 railroads (line segments). • CAmix: 62,556 locations (points) and 7,697 poly-lines (large-extent objects). • Page size: 1KB, 2KB, 3KB, 4KB. • The minimum fan-out for both the RO-tree and the R*-tree is 40% of the node capacity.

  33. Range Query (NE dataset)

  34. Aggregation Query (NE dataset)

  35. Nearest Neighbor Query

  36. Skyline Query

  37. Spatial Join Query

  38. Performance Comparison

  39. Talk Outline • Background and Our Motivation • The RO-tree: Structure and Operations • Querying the RO-tree • Experimental Results • Conclusions

  40. Conclusions • We explored the idea of identifying outlier objects and storing them at higher levels of the spatial tree index. • We proposed a simple but effective index structure, the RO-tree, which handles outlier objects gracefully. • We showed how to adapt existing query algorithms to the RO-tree. • Extensive experiments showed significant query improvements over the R*-tree.

  41. Thank you!
