Chapter 3: Data Storage and Access Methods

Title: The R* Tree: An Efficient and Robust Access Method for Points and Rectangles • Authors: N. Beckmann, H. Kriegel, R. Schneider and B. Seeger • Pages: 207-216

The R* Tree: An Efficient and Robust Access Method for Points and Rectangles • Problem • Problem Statement • Why is this problem important? • Why is this problem hard? • Approaches • Approach description, key concepts • Contributions (novelty, improved) • Assumptions

Problem Statement – R* Tree • Given • Data containing points and rectangles • Spatial queries (point, range query, insert, delete) • Find • An access method (data structure) • A hierarchical organization of rectangles • Example from Wikipedia • Objectives • Efficiency of spatial queries • Constraints • Balanced tree • Each node is a disk page and holds at least m entries (m = minimum number of entries per node) • Root has at least two children unless it is a leaf • Efficiency metric = number of disk pages accessed
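The efficiency metric above can be made concrete with a small sketch. It assumes a hand-built two-level tree in which every node stands for one disk page; the names `Node`, `range_query`, and `stats` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    """One disk page. Leaf entries are (rect, object_id);
    directory entries are (rect, child_node)."""
    is_leaf: bool
    entries: list = field(default_factory=list)


def range_query(node: Node, query: Rect, stats: dict) -> list:
    """Return ids of all objects intersecting `query`, counting page reads."""
    stats["pages"] += 1  # visiting a node costs one page access
    hits = []
    for rect, item in node.entries:
        if not intersects(rect, query):
            continue  # prune this entry (and its whole subtree)
        if node.is_leaf:
            hits.append(item)
        else:
            hits.extend(range_query(item, query, stats))
    return hits


# Hypothetical data: two leaves under one root.
leaf_a = Node(True, [((0, 0, 1, 1), "a1"), ((2, 2, 3, 3), "a2")])
leaf_b = Node(True, [((6, 6, 7, 7), "b1"), ((8, 8, 9, 9), "b2")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((6, 6, 9, 9), leaf_b)])

stats = {"pages": 0}
hits = range_query(root, (0.5, 0.5, 2.5, 2.5), stats)

empty_stats = {"pages": 0}
no_hits = range_query(root, (4, 4, 5, 5), empty_stats)
```

In this toy tree the first query reads the root and one leaf, while the second reads only the root. If the two directory rectangles overlapped, a query in the shared region would have to descend into both leaves, which is the cost the R*-tree tries to reduce.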

Why is this problem important? • Multi-dimensional applications • Large geographic data, e.g., map objects like countries occupy regions of non-zero size in two dimensions. • Common real-world usage: "Find all museums within 2 miles of my current location". • CAD • … • Many DBMS servers support spatial indices • Oracle, IBM DB2, …

Why is this problem hard? • B-tree split methods are ineffective in two dimensions • e.g., sorting: no single sort order keeps all spatially close rectangles together • Size variation across data rectangles • Large rectangles limit split options! • Non-uniform data distribution over space • Dynamic access method • Insertions and deletions • Overlapping directory rectangles => multiple search paths

Novelty of Contribution • Related Work • Traditional one-dimensional indexing structures (e.g., hash, B-tree) are not appropriate for multi-dimensional range search • B+ tree • Represents sorted data in a way that allows for efficient insertion and removal of elements. • Dynamic, multilevel index with maximum and minimum bounds on the number of keys in each node. • Leaf nodes are linked together as a linked list to make range queries easy. • R-tree • The R-tree is the foundation for spatial access methods • A complex spatial object is represented by its minimum bounding rectangle, which preserves its essential geometric properties • Overlapping regions • Heuristic: minimize the area of each enclosing rectangle in the inner nodes.

Principles of R-tree • Height-balanced tree similar to a B-tree with index records in its leaf nodes containing pointers to data objects. • Heuristic optimization: minimize the area of each enclosing rectangle in the inner nodes. Reference: A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", 1984
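The area heuristic is usually applied when choosing the subtree for a new entry: descend into the child whose bounding rectangle needs the least area enlargement. The sketch below assumes rectangles are `(xmin, ymin, xmax, ymax)` tuples and breaks ties by smaller area; the function names are illustrative.

```python
def area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    """Minimum bounding rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr, new_rect):
    """Extra area the directory rectangle needs in order to cover new_rect."""
    return area(union(mbr, new_rect)) - area(mbr)


def choose_subtree(child_mbrs, new_rect):
    """Index of the child needing least enlargement; ties go to smaller area."""
    return min(
        range(len(child_mbrs)),
        key=lambda i: (enlargement(child_mbrs[i], new_rect), area(child_mbrs[i])),
    )


# Hypothetical directory node with two children and one new rectangle.
children = [(0, 0, 4, 4), (5, 0, 7, 2)]
new_rect = (4, 1, 5, 2)
chosen = choose_subtree(children, new_rect)
```

Here the first child would grow from area 16 to 20 and the second from 4 to 6, so the second child is chosen. This criterion looks only at area and ignores how much the enlarged rectangle overlaps its siblings.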

Performance Parameters beyond R-tree • (Q1) The area covered by a directory rectangle should be minimized. • (Q2) The overlap between directory rectangles should be minimized. • (Q3) The margin of a directory rectangle should be minimized. • (Q4) Storage utilization should be optimized. • Intuitions: • Reduce overlap between sibling nodes. • Reduce traversal of multiple branches for point queries • Reinserting old data moves entries between neighboring nodes and thus decreases overlap. • Due to more restructuring, fewer splits occur
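The three geometric criteria (Q1 to Q3) can be written down directly for two-dimensional rectangles. This is a minimal sketch assuming `(xmin, ymin, xmax, ymax)` tuples and taking the margin to be the sum of the edge lengths, i.e. the perimeter in 2D; the function names are illustrative.

```python
def area(r):
    """Q1: area covered by the rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r):
    """Q3: sum of the edge lengths (the perimeter in two dimensions)."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlap(a, b):
    """Q2: area of the intersection of a and b (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


# Two rectangles that share a 1 x 1 corner region.
a = (0, 0, 4, 2)
b = (3, 1, 6, 5)

# Same area, different shape: the square has the smaller margin.
square = (0, 0, 2, 2)
elongated = (0, 0, 4, 1)
```

The last two rectangles show why margin is a separate criterion: both cover area 4, but the square has margin 8 and the elongated one has margin 10, so minimizing margin pushes directory rectangles toward more square shapes.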

Difference between R-tree and R*-tree • Minimization of area, margin, and overlap is crucial to the performance of the R-tree and the R*-tree. • The R*-tree attempts to reduce all three, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptible to the order in which their entries are inserted, so an insertion-built (rather than bulk-loaded) structure is likely to be sub-optimal. Deletion and reinsertion of entries allows them to "find" a place in the tree that may be more appropriate than their original location. • Result: improved retrieval performance
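One way to picture forced reinsertion is the selection step at node overflow: rank the entries by how far their centers lie from the center of the node's bounding rectangle, then take the farthest ones out for reinsertion. The sketch below covers only that selection step. The 30% default fraction and all names are assumptions for illustration, not taken from the slides.

```python
import math


def center(r):
    """Center point of a rectangle (xmin, ymin, xmax, ymax)."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def bounding_rect(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def pick_for_reinsertion(entries, fraction=0.3):
    """Split an overflowing node's entries into (keep, reinsert).

    `fraction` is an assumed tuning parameter; at least one entry is removed.
    """
    cx, cy = center(bounding_rect(entries))

    def distance(r):
        ex, ey = center(r)
        return math.hypot(ex - cx, ey - cy)

    # Farthest entries first; they are the candidates for reinsertion.
    ordered = sorted(entries, key=distance, reverse=True)
    count = max(1, int(len(entries) * fraction))
    return ordered[count:], ordered[:count]


# Hypothetical overflowing node: five clustered rectangles and one outlier.
entries = [
    (0, 0, 2, 2), (1, 1, 3, 3), (2, 0, 4, 2),
    (3, 1, 5, 3), (4, 0, 6, 2), (9, 0, 10, 1),
]
keep, reinsert = pick_for_reinsertion(entries)
```

With this data the outlier at the right edge is selected, and the bounding rectangle of the remaining entries shrinks from (0, 0, 10, 3) to (0, 0, 6, 3). The removed entry would then go through the normal insertion path again, where it may land in a better-fitting node or make a split unnecessary.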

Example • [Figure: rectangles R1 to R5, showing the grouping preferred by the R-tree and the grouping preferred by the R*-tree]

Validation Methodology • Methodology • Experiments with simulated workloads • Evaluation of design decisions • Results • R*-tree outperforms variants of R-tree and 2-level grid file. • R*-tree is robust against non-uniform data distributions.

Summary • Paper’s focus • R*-tree: implementation and performance • Ideas • Heuristic optimizations (p. 208) • Reduction of area, margin, and overlap of the directory rectangles • Better storage utilization (p. 211) • Forced reinsertion (splits can be prevented) • Experimental comparison • Using many data distributions

Assumptions, Rewrite today • Assumptions • Indexing data in two-dimensional space • Bulk load and bulk reorganization not available • Concurrency control and recovery costs are negligible • Reinserts during split! • Rewrite today • Bulk-load of rectangles • Compare with newer methods • R+ tree (disjoint sibling), Hilbert-R-tree • Analytical results • Formally compare R*-tree with alternatives
