R-Tree: Spatial Representation on a Dynamic-Index Structure

R-Tree: Spatial Representation on a Dynamic-Index Structure Advanced Algorithms & Data Structures Lecture Theme 03 – Part I K. A. Mohamed Summer Semester 2006

Overview • Representing and handling spatial data • The R-Tree indexing approach, style and structure • Properties and notions • Searching and inserting index Entry-records • Deleting and updating • Performance analyses • Node splitting algorithms • Derivatives of the R-Trees • Applications

Spatial Database (Ia) • Consider: Given a city map, ‘index’ all university buildings in an efficient structure for quick topological search.

Spatial Database (Ib) • Consider: Given a city map, ‘index’ all university buildings in an efficient structure for quick topological search. Spatial object: Contour (outline) of the area around the building(s). Minimum bounding region (MBR) of the object.

Spatial Database (Ic) • Consider: Given a city map, ‘index’ all university buildings in an efficient structure for quick relational-topological search. MBR of the city neighbourhoods. MBR of the city defining the overall search region.

Spatial Database (II) Notion: To retrieve data items quickly and efficiently according to their spatial locations. • Involves 2D regions. • Need to support 2D range queries. • Multiple return values desired: Answering a query region by reporting all spatial objects that are fully-contained-in or overlapping the query region (Spatial-Access Method – SAM). In general: • Spatial data objects often cover areas in multidimensional spaces. • Spatial data objects are not well-represented by point-location. • An ‘index’ based on an object’s spatial location is desirable.

M E H P T X B D F G I K L N O Q S V W Y Z The Indexing Approach • A B-Tree (Rosenberg & Snyder, 1981) is an ordered, dynamic, multi-way structure of order m (i.e. each node has at most m children). • The keys and the subtrees are arranged in the fashion of a search tree. • Each node may contain a large number of keys, and the number of subtrees in each node, then, may also be large. • The B-Tree is designed (among other objectives): • to branch out this large number of directions, and • to contain a lot of keys in each node so that the height of the tree is relatively short.

The R-Tree Index Structure • An R-Tree is a height-balanced tree, similar to a B-Tree. • Index records in the leaf nodes contain pointers to the actual spatial-objects they represent. • Leaves in the structure all appear on the same level. • Spatial searching requires visiting only a small number of nodes. • The index is completely dynamic: inserts and deletes can be intermixed with searches. • No periodic reorganisation is required.

The R-Tree Index Structure • A spatial database consists of a collection of tuples representing spatial objects, known as Entries. • Each Entry has a unique identifier that points to one spatial object, and its MBR; i.e. Entry = (MBR, pointer).

R-Tree Index Structure – Leaf Entries • An entryE in a leaf node is defined as (Guttman, 1984): E = (I, tuple-identifier) • Where I refers to the smallest bindingn-dimensional region (MBR) that encompasses the spatial data pointed to by its tuple-identifier. • I is a series of closed-intervals that make up each dimension of the binding region. Example. In 2D, I = (Ix, Iy), where Ix = [xa, xb], and Iy = [ya, yb].

R-Tree Index Structure – Leaf Entries • In general I = (I0, I1, …, In-1) for n-dimensions, and that Ik = [ka, kb]. • If either ka or kb (or both) are equal to , this means that the spatial object extends outward indefinitely along that dimension.

B c I(A) I(a) I(B) I(b) … I(c) I(M) I(d) d b a R-Tree Index Structure – Non-Leaf Entries • An entryE in a non-leaf node is defined as: E = (I, child-pointer) • Where the child-pointer points to the child of this node, and I is the MBR that encompasses all the regions in the child-node’s pointer’s entries.

Properties Let M be the maximum number of entries that will fit in one node. Let m≤ M/2 be a parameter specifying the minimum number of entries in one node. Then an R-Tree must satisfy the following properties: • Every leaf node contains between m and M index records, unless it is the root. • For each index-record Entry (I, tuple-identifier) in a leaf node, I is the MBR that spatially contains the n-dimensional data object represented by the tuple-identifier. • Every non-leaf node has between m and M children, unless it is the root. • For each Entry (I, child-pointer) in a non-leaf node, I is the MBR that spatially contains the regions in the child node. • The root has two children unless it is a leaf. • All leaves appear on the same level.

Node Overflow and Underflow • A Node-Overflow happens when a new Entry is added to a fully packed node, causing the resulting number of entries in the node to exceed the upper-boundM. • The ‘overflow’ node must be split, and all its current entries, as well as the new one, consolidated for local optimum arrangement. • A Node-Underflow happens when one or more Entries are removed from a node, causing the remaining number of entries in that node to fall below the lower-boundm. • The underflow node must be condensed, and its entries dispersed for global optimum arrangement.

Search Typical query: Find and report all university building sites that are within 5km of the city centre. Key: a: FAW b: Uni-Klinikum c: Psychologie d: Biotechnology e: Institutsviertel f: Rektorat g: Uni-zentrum h: Biology / Botanischer i: Sportszentrum Approach: • Build the R-Tree using rectangular regions a, b, … i. • Formulate the query range Q. • Query the R-Tree and report all regions overlappingQ.

Search Strategy Let Q be the query region. Let T be the root of the R-Tree.  Search all entry-records whose regions overlapsQ. Search sub-trees: • If T is not leaf, then apply Search on ever child-node entry E whose I overlaps Q. Search leaf nodes: • If T is leaf, then check each entry E in the leaf and return E if E.I overlaps Q.

@ r6 r2 r5 r8 @ r2 @ r5 @ r8 r0 r1 r7 r3 r4 @ r0 @ r1 @ r7 @ r3 @ r4 a d b f h c g e i Search Typical query: Find and report all university buildings/sites that are within 5km of the city centre. R-Tree settings: M = m =

Search • The search algorithm descends the tree from the root in a manner similar to a B-Tree. • More than one subtree under a node visited may need to be searched. • Cannot guarantee good worst-case performance. • Countered by the algorithms during insertion, deletion, and update that maintain the tree in a form that allows the search algorithm to eliminate irrelevant regions of the indexed space. • So that only data near the search area need to be examined. • Emphasis is on the optimal placement of spatial objects with respect to the spatial location of other objects in the structure.

Insertion • New index entry-records are added to the leaves. • Nodes that overflow are split, and splits propagate up the tree. • A split-propagation may cause the tree to grow in height. The mainInsertroutine • Let E = (I, tuple-identifier) be the new entry to be inserted. • Let T be the root of the R-Tree. • [Ins_1]Locate a leafL starting from T to insert E. • [Ins_2]AddE to L. If L is already full (overflow), splitL into L and L’. • [Ins_3]Propagate MBR changes (enlarged or reduced) upwards. • [Ins_4]Growtreetaller if node split propagation causes T to split.

B @rN A A B C E.I C rN Insertion – Choose Leaf  [Ins_1]Locate a leaf L starting from T to insert E = (I, tuple-identifier). • Notion (i): Select the path that would require the least enlargement to include E.I. • Notion (ii): Resolve ties by choosing the child-node with the smallest MBR. • Invoke: L = ChooseLeaf(E, T).

Insertion – Choose Leaf Algorithm: ChooseLeaf (E, N) Inputs: (i) Entry E = (I, tuple-identifier), (ii) A valid R-Tree node N. Output: The leaf L where E should be inserted. • IfN is leaf ThenReturnN • Let FS be the set of current entries in the node N • Let F =(I, child-pointer)  FS, so that F.I satisfies the Insertion-Notions • ReturnChooseLeaf (E, F.child-pointer)

Insertion – Add and Adjust  [Ins_2]AddE to L. • Notion (i): If L has room for another entry, install E. • Notion (ii): Otherwise splitL to obtain L and L’, which between them, will contain all previous entries in L and the new E (consolidated for local optima).  [Ins_3]Propagate MBR changes upwards by invoking AdjustTree (L, L’). • Notion (i): Ascend from leaf L to the root T while adjusting the covering rectangles MBR. • Notion (ii): If L’ exists, propagate node splits as necessary; i.e. attempt to install a new entry in the parent of L to point to L’.

@G K @K X Y Z @Y a b c Insertion – Propagating Node Splits Example. Found L = @Y to insert new E = e. R-Tree settings: M = 3, m = 1.

Insertion – Adjust Tree (I) Algorithm: AdjustTree (N, N’) Inputs: (i) A node N that has had its contents modified, (ii) The resultant split node N’, if not NULL, that accompanies N. Outputs: (i) N as above, (ii) N’ as above. • IfN is the root ThenReturn {(i) N, (ii) N’} • Let PN be the parent node of N. • Let EN = (I_N, child-pointer_N) in PN, where child-pointer_N points to N. • AdjustI_N so that it tightly encloses all entry regions in N.

Insertion – Adjust Tree (II) • IfN’ is Not NULL Then • If number of entries in PN < M-1 Then • Create a new Entry EN’ = (I_N’, child-pointer_N’) • Install EN’ in PN • ReturnAdjustTree (PN, NULL) • Else • Set {PN, PN’} = SplitNode(PN, EN’) • ReturnAdjustTree (PN, PN’) • End If • Else • ReturnAdjustTree (PN, NULL) • End If

@T (root) A B C @C @C’ E G F H Insertion – Grow Tree Taller  [Ins_4]Grow Tree taller. • Notion: If the recursive node split propagation causes the root to split, then create a new root whose children are the two resulting nodes.

Insertion – Summary • The height of the R-Tree containing n entry-records is at most logmn – 1, because the branching factor of each node is at least m. • The maximum number of nodes is: • Worst case space utilisation for all nodes except the root is: • Nodes will tend to have more than m entries, and this will:

Overview • Representing and handling spatial data • The R-Tree indexing approach, style and structure • Properties and notions • Searching and inserting index Entry-records • Deleting and updating • Performance analyses • Node splitting algorithms • Derivatives of the R-Trees • Applications

Deletion • Current index entry-records are removed from the leaves. • Nodes that underflow are condensed, and its contents redistributed appropriately throughout the tree. • A condense propagation may cause the tree to shorten in height. The main Delete routine • Let E = (I, tuple-identifier) be a current entry to be removed. • Let T be the root of the R-Tree. • [Del_1]Find the leafL starting from T that contains E. • [Ins_2]RemoveE from L, and condense ‘underflow’ nodes. • [Ins_3]Propagate MBR changes upwards. • [Ins_4]Shortentree if T contains only 1 entry after condense propagation.

Deletion – Find Leaf  [Del_1]Find the leafL starting from T that contains E. Algorithm: FindLeaf (E, N) Inputs: (i) Entry E = (I, tuple-identifier), (ii) A valid R-Tree node N. Output: The leaf L containing E. • IfN is leaf Then • IfN contains EThenReturnN • ElseReturn NULL • Else • Let FS be the set of current entries in N. • For each F = (I, child-pointer)  FS where F.I overlaps E.IDo • Set L = FindLeaf (E, F.child-pointer) • IfL is not NULL ThenReturnL • NextF • Return NULL • End If

Deletion – Remove and Adjust  [Del_2]RemoveE from L, and condense ‘underflow’ nodes.  [Del_3]Propagate MBR changes upwards. • Notion (i): Ascend from leaf L to root T while adjusting covering rectangles MBR. • Notion (ii): If after removing the entry E in L and the number of entries in L becomes fewer than m, then the node L has to be eliminated and its remaining contents relocated.

Deletion – Remove and Adjust • Propagate these notions upwards by invoking CondenseTree (N, QS), where N is an R-Tree node whose entries have been modified, and QS is the set of eliminated nodes. • Start the propagation by setting N = L, and QS = . • Re-insert the entries from the eliminated nodes in QS back into the tree. • Entries from eliminated leaf nodes are re-inserted as new entries using the Insert routine discussed earlier. • Entries from higher-level nodes must be placed higher in the tree so that leaves of their dependent subtrees will be on the same level as the leaves on the main tree.

@ r6 @ r7 @ r2 r0 r2 r3 r4 r1 r6 r5 @ r0 @ r3 @ r4 @ r5 @ r1 a k i f c d g b j l h m e n Deletion – Propagating Node Condensation • Example: Delete the index entry-record b. R-Tree settings: M = 4, m = 2. • Spatial constraint: a.I will form smallest MBR with r4.

Deletion – Condense Tree (I) Algorithm: CondenseTree (N, QS) Inputs: (i) A node N whose entries have been modified, (ii) A set of eliminated nodes QS. • IfN is NOT the root Then • Let PN be the parent node of N. • Let EN = (I_N, child-pointer_N) in PN. • IfN.entries < mThen • Delete EN from PN • Add N to QS • Else • Adjust I_N so that it tightly encloses all entry regions in N. • End If • CondenseTree (PN, QS)

Deletion – Condense Tree (II) • Else If N is root AND Q is NOT  Then • For each QQSDo • For each EQDo • IfQ is leaf ThenInsert (E) • ElseInsert (E) as a node entry at the same node level as Q • End If • NextE • NextQ • End If

Deletion – Summary Why ‘re-insert’ orphaned entries? • Alternatively, like the delete routine in B-Tree (Rosenberg & Snyder, 1981), an ‘underflow’ node can be merged with whichever adjacent sibling that will have its area increased the least, or its entries re-distributed among sibling nodes. • Both methods can cause the nodes to split. • Eventually all changes need to be propagated upwards, anyway.  Re-insertion accomplishes the same thing, and: • It is simpler to implement (and at comparable efficiency). • It incrementally refines the spatial structure of the tree. • It prevents gradual deterioration if each entry was located permanently under the same parent node.

Performance with respect to Parameter m • A high value of m, nearer to M, is useful when the underlying database represented by the R-Tree is mostly used for search inquiries with very few updates. • The height of the tree will be kept to a minimum. • High search performance is maintained. • However, the risk of overflow and underflow is also high. • A small value of m is good when frequent updates and modifications of the underlying database is required. • The nodes are less dense. • Maintenance is less costly.

R-Tree: Spatial Representation on a Dynamic-Index Structure