Chapter 9 Multilevel Indexing and B-Trees. CIS 402: File Management Techniques. 9.1 Introduction: The Invention of the B-Tree.
Chapter 9 Multilevel Indexing and B-Trees CIS 402: File Management Techniques
9.1 Introduction: The Invention of the B-Tree • 1972, Acta Informatica: R. Bayer and E. McCreight (at Boeing Corporation), “Organization and Maintenance of Large Ordered Indexes” • 1979: ‘de facto’ standard for database indexes • D. Comer, “The Ubiquitous B-Tree,” ACM Computing Surveys • Why the name B-tree? • Balanced, Bushy, Broad, Boeing, Bayer • Retrieval, insertion, and deletion time = $\log_K I$ (I: number of index entries in the file, K: number of index entries in a page) • Excellent for dynamically changing random access files
9.2 Statement of the Problem • Problems in an index on secondary storage • Searching the index must be faster than binary searching • In binary search: 15 items - 4 seeks, 1,000 items - 9.5 seeks • Insertion and deletion must be as fast as search • In some file structures, inserting a key may involve moving many other keys
9.3 Indexing with Binary Search Trees Binary Search Tree • Advantages • Data need not be physically sorted • Good performance on a balanced tree • Insert cost = search cost • Disadvantages • In an out-of-balance binary tree, more seeks are required
9.3 Indexing with Binary Search Trees Balanced Binary Search Tree • Sorted list of keys • AX, CL, DE, FB, FT, HN, JD, KF, NR, PA, RF, SD, TK, WS, YJ • At most 4 seeks per record
[Figure: binary search tree representation. Level 1: KF. Level 2: FB, SD. Level 3: CL, HN, PA, WS. Level 4: AX, DE, FT, JD, NR, RF, TK, YJ]
9.3 Indexing with Binary Search Trees Unbalanced Binary Tree • At most 9 seeks per record • Worst case: sequential search
[Figure: unbalanced binary search tree holding the keys KF, FB, SD, CL, HN, PA, WS, AX, DE, FT, JD, NR, RF, TK, YJ plus LV, LA, NP, MB, ND, NK]
9.3 Indexing with Binary Search Trees Paged Binary Tree • Page • A unit of disk I/O for handling seek and transfer of disk data • Typically 4 KB, 8 KB, 16 KB, ... • Paged Binary Tree • Divide a binary tree into pages and then store each page in a block of contiguous locations on disk • If every page holds 7 keys, any of 511 nodes (keys) can be reached in only three seeks • Performance: number of seeks • A completely full balanced binary tree: $\log_2(N+1)$ • A completely full paged tree: $\log_{k+1}(N+1)$ • (N: number of keys; k: number of keys held in a single page)
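The seek counts on this slide can be checked with integer arithmetic. The sketch below assumes completely full trees, as the slide does; the function name `seeks_needed` is hypothetical and not from the course text.

```python
def seeks_needed(num_keys, keys_per_page):
    """Levels in a completely full paged tree: smallest d with (k+1)**d - 1 >= N.

    With keys_per_page = 1 this is an ordinary balanced binary tree.
    """
    fanout = keys_per_page + 1
    depth, capacity = 0, 0
    while capacity < num_keys:
        depth += 1
        capacity = fanout ** depth - 1  # keys held by a full tree of this depth
    return depth


print(seeks_needed(511, 1))  # balanced binary tree: 9
print(seeks_needed(511, 7))  # 7 keys per page: 3
```

Integer powers are used instead of floating-point logarithms so that exact powers such as 8 ** 3 are not affected by rounding.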
9.3 Indexing with Binary Search Trees Paged Binary Tree
[Figure: paged binary tree with 7 keys per page. Each leaf of the pages shown points to a page on the level below: 64 more pages, 511 total keys]
9.3 Indexing with Binary Search Trees The Problem with Paged Trees • Only valid when we have the entire set of keys in hand before the tree is built • Main problem: easily gets out of balance • How to select a good separator • How to group keys • How to guarantee the maximum loading • The B-tree provides a solution for the above problems!
9.3 Indexing with Binary Search Trees Paged Binary Tree (Out of balance) • Random input sequence: C S D T A M P I B W N G U R K E H O L J Y Q Z F X V
[Figure: paged binary tree built by inserting the keys in this order; the result is out of balance]
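To see how far out of balance this input order pushes a tree, the sketch below inserts the same sequence into a plain binary search tree and counts the levels. It ignores paging entirely, so it only illustrates the imbalance, not the page layout in the figure; `bst_levels` is a hypothetical helper name.

```python
def bst_levels(keys):
    """Insert keys in the given order into an unbalanced BST; return its level count."""
    root = None          # each node is a list: [key, left, right]
    levels = 0
    for key in keys:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                depth += 1
                side = 1 if key < node[0] else 2   # 1 = left child, 2 = right child
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        levels = max(levels, depth)
    return levels


sequence = "C S D T A M P I B W N G U R K E H O L J Y Q Z F X V".split()
print(bst_levels(sequence))            # 8 levels for this insertion order
print(len(sequence).bit_length())      # 5 levels would suffice for 26 keys
```

The key F ends up on the eighth level, while a perfectly balanced tree of 26 keys needs only five.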
9.4 Multilevel Indexing: A Better Approach to Tree Indexes Multilevel Indexing • Simple index record • limited in the number of keys allowed • Multirecord index • consists of a sequence of simple index records • binary search is too expensive • Multilevel index • reduces the number of records to be searched • speeds up the search • Example: 80 MB file of 8,000,000 records with 10-byte keys
9.4 Multilevel Indexing: A Better Approach to Tree Indexes Example of Multilevel Indexing
• 4th level index: a single index record with 8 keys
• 3rd level index: 8 index records that index the largest keys in the 800 second-level records
• 2nd level index: 800 index records with 80,000 keys
• Lowest level index (1st level): an index to the data file; its reference fields are record addresses in the data file
• At each level, choose one of the keys in each index record as the key of that whole record
• Overhead: 809 records to index 80,000
• Search time: 3 disk reads (keep top level in memory)
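The record counts in this example follow from repeatedly dividing by the number of keys per index record. A minimal sketch, assuming 8,000,000 keys and 100 keys per index record as in the example; `index_levels` is a hypothetical name.

```python
def index_levels(num_keys, keys_per_record):
    """Return the number of index records at each level, lowest level first."""
    levels = []
    n = num_keys
    while n > 1:
        n = -(-n // keys_per_record)   # ceiling division: records needed to hold n keys
        levels.append(n)
    return levels


levels = index_levels(8_000_000, 100)
print(levels)             # [80000, 800, 8, 1]
print(sum(levels[1:]))    # 809 records sit above the first-level index
```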
Multi-level Indexing • How can we insert new keys into the multilevel index? • The index records on some level can get full • Several levels of indexes might need to be rebuilt • Overflow chains may be helpful, but are still ugly • In the example, the index in memory (level 4) has only 8 of 100 slots in use. • In ANSI C++, the STL provides dynamic lists, solving many problems detailed in the text, which predates the STL • Even with the STL, B-trees still provide a perfect data structure to hold the multi-level indexes
9.5 B-Trees: Working up from the bottom • Bayer and McCreight, 1972, Acta Informatica • Build trees upward from the bottom instead of downward from the top • Each node of a B-tree is an index record which consists of “key-reference” pairs • Order: maximum number of key-reference pairs in a node • Each node of a B-Tree represents a page of memory
Formal Definition of B-Tree Properties In a B-Tree of order m, • Every page has a maximum of m descendants • Every page, except for the root and leaves, has at least ⌈m/2⌉ descendants. • The root has at least two descendants (unless it is a leaf). • All the leaves appear on the same level. • The leaf level forms a complete, ordered index of the associated data file.
How do B-Trees work? • Each node of a B-Tree: • is an Index Record. • ideally is stored in a single page of memory • has the same maximum number of key-ref. pairs (order) • has a minimum number of key-ref pairs, typically order/2 • ordinarily has one more link pointer than data slots • for indexing, in the text, each node in any level above the leaves has one pointer for each datum • each datum on these levels represents a reference to another index and represents the largest value in that lower index • We will cover regular B-Trees • Discuss how to work with one less pointer • Rightmost entry in rightmost node in each level is largest • never take rightmost-child
B-Tree Animations • Some interesting animations on the web: http://sky.fit.qut.edu.au/~maire/baobab/baobab.html http://www.cs.tcd.ie/Jeremy.Jones/vivio/trees/B-tree.htm • Internet Explorer only • has tutorial of sorts http://www.geocities.com/SiliconValley/Program/2864/File/btree.html
Storing B-Trees in a File • To store a B-Tree, we must place it in the file in a manner in which the tree can be reconstructed. • Solution: Store each node in a list like this...
[Figure: sample node record "Evans | 3 2 | 1 | 2" with its fields labeled Data, Links, #keys, and Node Height; each link holds the number of the node it points at]
Example: File Representation
Node   Data          Links       #keys   Height
1      Evans         3 2         1       2
2      Mills         -1 -1       1       1
3      Brown Davis   -1 -1 -1    2       1
Actual B-Tree Represented • Note that it is ordered (of course) and of order 3
[Figure: root Evans (Page 1); left child Brown, Davis (Page 3); right child Mills (Page 2)]
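The point of the file representation is that the links alone are enough to rebuild the tree. The sketch below holds the three records from the example in a dictionary and walks the links to recover the keys in order. The assignment of values to fields is an assumed reading of the flattened figure, and `NODES` and `in_order` are hypothetical names.

```python
# Node records keyed by node (page) number; -1 marks "no child".
# A node with n keys carries n + 1 links.
NODES = {
    1: {"keys": ["Evans"], "links": [3, 2]},
    2: {"keys": ["Mills"], "links": [-1, -1]},
    3: {"keys": ["Brown", "Davis"], "links": [-1, -1, -1]},
}


def in_order(node_no, nodes):
    """Yield every key below node_no in sorted order by following the links."""
    if node_no == -1:
        return
    node = nodes[node_no]
    for i, key in enumerate(node["keys"]):
        yield from in_order(node["links"][i], nodes)   # subtree left of this key
        yield key
    yield from in_order(node["links"][len(node["keys"])], nodes)  # rightmost subtree


print(list(in_order(1, NODES)))  # ['Brown', 'Davis', 'Evans', 'Mills']
```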
Why not just use a Binary Search Tree? • Problems: • Unbalanced tree requires too many looks • Nodes can’t be made to match page size in any case • Two pointers per node; can’t do multi-level indexing
9.8 B-Tree Methods Search, Insert, and Others Algorithm for Search • Searching procedure • iterative • works in two stages, operating alternately on entire pages (Class BTree) and then within pages (Class BTreeNode) • Step 1: Loading a page into memory • Step 2: Searching through a page, looking for the key along the tree until it reaches the leaf level
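The two alternating stages can be sketched as a loop. This is not the text's BTree/BTreeNode code: it is a minimal Python illustration in which a dictionary lookup stands in for a disk read, the page numbers and contents in `PAGES` are made up, and each key in a non-leaf page is assumed to be the largest key of the child it refers to.

```python
from bisect import bisect_left

# Hypothetical pages: page number -> (sorted keys, child page numbers or None for a leaf).
PAGES = {
    2: ([4, 16, 20], [0, 3, 1]),
    0: ([1, 3, 4], None),
    3: ([9, 13, 16], None),
    1: ([19, 20], None),
}


def search(key, root_page, pages):
    """Return (found, pages_loaded) for an iterative top-down search."""
    loaded = 0
    page_no = root_page
    while True:
        keys, children = pages[page_no]   # stage 1: load a page into memory
        loaded += 1
        i = bisect_left(keys, key)        # stage 2: search within the page
        if i == len(keys):                # key is larger than every key in this page
            return False, loaded
        if children is None:              # leaf level reached
            return keys[i] == key, loaded
        page_no = children[i]             # follow the reference to the next level


print(search(13, 2, PAGES))  # (True, 2)
print(search(10, 2, PAGES))  # (False, 2)
```

The number of pages loaded equals the number of levels visited, which is why the height of the tree determines the search cost.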
B-Tree Insertion: Overview • When inserting a new key into an index record that is not full, update that record • When inserting a new key into an index record that is full, split the record in two, with half of the keys in each half. • The largest key of the split record is promoted • this may cause a new recursive split. • With indexes, promotion is of a key, not an index record • the promoted key thus appears on at least 2 levels
9.6 Example of Creating a B-Tree B-Tree Insertion: Splitting & Promoting • Splitting • Creation of two nodes out of one because the original node becomes overfull • Results in the need to promote a key to a higher-level node to provide an index separating the two new nodes • Promotion of a key • Movement of a key from one node into a higher-level node when a split occurs • Again, the key rises, but not the direct reference
B-tree Insertion : Basic Facts • Major components of insertion • Split the node • Promote the middle key • Increase the height of the B-tree • bottom-up • Insertion may touch no more than 2 nodes per level • Insertion cost is strictly linear in the height of the tree
9.8 B-Tree Methods Search, Insert, and Others Algorithm for Insertion: Prelim • Observations on Insertion, Splitting, and Promotion • proceed all the way down to the leaf level • after finding the insertion location at the leaf level, the work proceeds upward from the bottom • An iterative procedure with three phases • Search to the leaf level, using the FindLeaf method • Insertion, overflow detection, and splitting on the upward path (recursive, working up) • Creation of a new root node, if the current root was split
9.8 B-Tree Methods Search, Insert, and Others Algorithm for Insertion • With no redistribution • (Step 1) Locate the node on the bottommost level in which to insert the record. The location is determined by key search. • (Step 2) If a vacant record slot is available, insert the record so that key sequencing is maintained. Then update the pointer associated with the record (the pointer is null for level 0 records). Then Stop! • (Step 3) If no vacant record slot exists, identify the median record. All records and pointers to the left of the median record are stored in one node (the original) and those to the right are stored in another node (the new node).
9.8 B-Tree Methods Search, Insert, and Others Algorithm for Insertion (con’t) • (Step 4) If the topmost node was split, create a new topmost node which contains the median record identified in Step 3, filled with pointers to the original and split nodes. Update the root node to point to the new topmost node. Then Stop! • (Step 5) If topmost node was not split, prepare to insert median record identified in Step 3 and a pointer to the new node (created in Step 3). Then Goto Step 2. • Step 4 increases height of B-tree by 1 level
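9.8 B-Tree Methods Search, Insert, and Others B-Tree for File Indexing Insertion Example
• Insert 1, then 3: a single node holds 1 3
• Insert 19, 4, 20: the node overflows and splits into 1 3 4 and 19 20; a new root (node 2) holds 4 20. 4 is on two levels!
• Insert 13, 16: root 4 20; leaves 1 3 4 and 13 16 19 20
• Insert 9: the second leaf overflows and splits into 9 13 16 and 19 20 (new node 3); the root becomes 4 16 20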
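The insertion example (keys 1, 3, 19, 4, 20, 13, 16, 9) can be replayed with a short sketch. It assumes the indexing variant used in the example: nodes hold at most 4 keys, every key in a non-leaf node is the largest key of the child it refers to, and a split leaves the extra key in the left half. Record references are omitted, and `Node`, `insert`, and `_insert` are hypothetical names rather than the text's classes.

```python
import bisect


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys if keys is not None else []   # sorted keys
        self.children = children                       # None for a leaf, else one child per key


def insert(root, key, order):
    """Insert key and return the (possibly new) root."""
    new_sibling = _insert(root, key, order)
    if new_sibling is not None:                        # root was split: tree grows upward
        root = Node([root.keys[-1], new_sibling.keys[-1]], [root, new_sibling])
    return root


def _insert(node, key, order):
    """Insert below node; return the new right sibling if node was split, else None."""
    if node.children is None:                          # leaf level
        if key not in node.keys:
            bisect.insort(node.keys, key)
    else:
        i = min(bisect.bisect_left(node.keys, key), len(node.keys) - 1)
        child = node.children[i]
        new_sibling = _insert(child, key, order)
        node.keys[i] = child.keys[-1]                  # child's largest key may have changed
        if new_sibling is not None:                    # promote the new node's largest key
            node.keys.insert(i + 1, new_sibling.keys[-1])
            node.children.insert(i + 1, new_sibling)
    if len(node.keys) <= order:
        return None
    mid = (len(node.keys) + 1) // 2                    # overflow: split the node in two
    right = Node(node.keys[mid:],
                 node.children[mid:] if node.children is not None else None)
    node.keys = node.keys[:mid]
    if node.children is not None:
        node.children = node.children[:mid]
    return right


root = Node()
for k in [1, 3, 19, 4, 20, 13, 16, 9]:
    root = insert(root, k, order=4)
print(root.keys)                                # [4, 16, 20]
print([child.keys for child in root.children])  # [[1, 3, 4], [9, 13, 16], [19, 20]]
```

Because promoted keys are copies of the largest key in each child, 4, 16, and 20 each appear on two levels, as the example notes for 4.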
9.9 B-Tree Nomenclature B-Tree Nomenclature • Be aware that terms are not uniform in the literature • Definitions are also quite different • In fact, there are a number of B-tree variations • The course text uses “B tree” to describe what is a B+ tree in other books • The course text’s “B+ tree” is a B+ tree with a linked list of sorted data blocks
Other Book: B-Tree • Course Text: N/A
[Figure: Root: C G. Next level: A B | E F | H I. Keys at every level refer to data blocks]
Other Book: B+-Tree • Course Text: B-Tree
[Figure: Root: C G I. Leaf level: A B C | E F G | H I. Only the leaf level refers to data blocks]
Other Book: B+-Tree with Linked List • Our Book: B+-Tree
[Figure: Root: C G I. Leaf level: A B C | E F G | H I. The leaf level refers to data blocks, which are linked together in sorted order]
Another aspect (node structures) Homogeneous Trees :B-Tree in other text • Homogeneous trees -leaf nodes and interior nodes have same structures; Each contains both data pointers and tree pointers • Average search length less for homogeneous trees, because some searches may conclude before reaching a leaf node
B-Tree in other texts • 23 pointers to 23 records in data file (4 of 23 records shown)
[Figure: B-tree holding 23 keys. Root: 37 64. Second level: 8 23 | 45 53 | 85 91. Third level keys: 1 7 14 20 27 36 38 40 50 52 60 70 80 88 95]
Another Aspect (node structures) Heterogeneous Trees :B+-Tree in other text • Heterogeneous trees - leaf nodes and interior nodes have different structures
B+-Tree in other text • 23 pointers to 23 records in data file
[Figure: B+-tree over the same 23 keys. Root: 37 64. Second level: 14 23 | 45 53 | 85 91. Leaf level (all 23 keys): 1 7 8 14 20 23 27 36 37 38 40 45 50 52 53 60 64 70 80 85 88 91 95]
Comparison of B-Tree and B+-Tree in other text
Topic                                  | B-Tree                                        | B+-Tree
Algorithm complexity for insertion     | rather complex                                | simpler
Retrieval efficiency                   | less efficient (B-tree is tall & spindly)     | more efficient (B+-tree is short & bushy)
Storage efficiency                     | slightly more efficient (less space)          | less efficient (more space)
1-pass structure creation algorithms   | rather complex                                | simple
9.10 Formal Definition of B-Tree Properties ** Properties of a B-tree of order m (page == node) • Every page has a maximum of m descendants • Every page, except for the root and the leaves, has at least ⌈m/2⌉ descendants • The root has at least two descendants (unless it is a leaf) • All leaves appear on the same level • The leaf level forms a complete, ordered index of the associated data file
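The first four properties are structural and can be checked mechanically. The sketch below is one possible checker, not code from the text: a page is modeled as a dictionary with a "children" list (empty for a leaf), keys are left out because only descendant counts matter here, and `leaf_level` is a hypothetical name.

```python
def leaf_level(page, order, is_root=True):
    """Return the number of levels below and including page.

    Raises ValueError if a descendant count is out of range or the
    leaves are not all on the same level.
    """
    children = page["children"]
    if not children:                                   # a leaf has no descendants
        return 1
    minimum = 2 if is_root else -(-order // 2)         # ceil(order / 2) for interior pages
    if not minimum <= len(children) <= order:
        raise ValueError(
            f"page has {len(children)} descendants, expected {minimum}..{order}")
    depths = {leaf_level(child, order, is_root=False) for child in children}
    if len(depths) != 1:
        raise ValueError("leaves are not all on the same level")
    return 1 + depths.pop()


leaf = {"children": []}
root = {"children": [leaf, leaf, leaf]}
print(leaf_level(root, order=4))  # 2
```

The fifth property concerns the contents of the leaf level and would need the keys themselves to verify.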
9.11 Worst-Case Search Depth Worst-case Search Depth • Search depth : full depth of the tree • Worst case occurs • When every page of the tree has only the minimum # of descendants • Maximal height with a minimum breadth
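A worked bound makes the worst case concrete. If the root has its minimum of 2 descendants and every other page has the minimum $\lceil m/2 \rceil$, then level $d$ has at least $2\lceil m/2 \rceil^{d-1}$ descendants, so a tree whose leaf level indexes $N$ keys satisfies $N \ge 2\lceil m/2 \rceil^{d-1}$, that is, $d \le 1 + \log_{\lceil m/2 \rceil}(N/2)$. The sketch below evaluates this bound with integers; `max_depth` is a hypothetical name, and the sample values (order 512, 1,000,000 keys) are illustrative choices.

```python
def max_depth(num_keys, order):
    """Largest depth d with 2 * ceil(order/2) ** (d - 1) <= num_keys."""
    if order < 3 or num_keys < 2:
        raise ValueError("needs order >= 3 and at least 2 keys")
    min_fanout = -(-order // 2)            # ceil(order / 2)
    depth = 1
    while 2 * min_fanout ** depth <= num_keys:
        depth += 1
    return depth


print(max_depth(1_000_000, 512))  # 3
```

Even with every page at minimum occupancy, a million keys indexed by pages of order 512 are never more than three levels deep.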
9.12 Deletion, Merging, and Redistribution Deletion, Redistribution, and Concatenation • Must ensure that the B-tree properties are maintained after a deletion • Algorithm (with redistribution and concatenation) • (Step 1) If the key to be deleted is not in a leaf, swap it with its immediate successor, which is in a leaf (that leaf might then need redistribution or concatenation!) • (Step 2) Delete the key
9.12 Deletion, Merging, and Redistribution Deletion Algorithm (Cont’d) • (Step 3) If underflow occurs (the leaf now contains one too few keys): • If the left or right sibling has more than the minimum number of keys, redistribute • Otherwise, concatenate the two leaves and the median key from the parent into one leaf • (Step 4) Apply step 3 to the parent as if a key had been deleted from it
9.12 Deletion, Merging, and Redistribution Redistribution • Occurs when a sibling has more than the minimum # of keys • Idea: Move keys between siblings to balance tree • Results in change of the key in the parent page • Does not propagate : strictly local effects • How many keys should be moved? • Not necessarily fixed • Even distribution is desired
Redistribution (con’t) • Must fix a shortage. • The possibilities: • Merge sibling nodes and move the center element up • Move an element up from the left sibling into the root, and an element from the root on down • Move an element up from the right sibling into the root, and an element from the root on down
9.12 Deletion, Merging, and Redistribution Concatenation (merge) • Occurs in case of underflow (page less than ½ full) • Combine the two partial pages and the correct key from the parent page to make a single full page • Reverses splitting • Concatenation must involve demotion of keys: • may cause underflow in the parent page • The effects propagate upward • The depth of the tree may be reduced by one.
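The choice between redistribution and concatenation can be sketched for two adjacent sibling leaves. The code assumes the rule stated in the deletion algorithm (redistribute when the sibling has keys to spare, otherwise concatenate) and the indexing variant in which the parent stores the largest key of each child, so no separate parent key is pulled down in the merge. `fix_underflow` and the sample keys are hypothetical.

```python
def fix_underflow(left, right, min_keys):
    """Repair two adjacent sibling leaves, one of which has fewer than min_keys keys.

    left and right are sorted key lists with every key in left below every key in right.
    Returns (leaves, parent_keys): two leaves after redistribution or one after
    concatenation, plus the largest key of each leaf for the parent to store.
    """
    combined = left + right
    if len(combined) >= 2 * min_keys:          # sibling can spare keys: redistribute evenly
        mid = (len(combined) + 1) // 2
        leaves = [combined[:mid], combined[mid:]]
    else:                                      # both are at or below minimum: concatenate
        leaves = [combined]
    return leaves, [leaf[-1] for leaf in leaves]


print(fix_underflow([1], [5, 6, 7], min_keys=2))  # ([[1, 5], [6, 7]], [5, 7])
print(fix_underflow([1], [5, 6], min_keys=2))     # ([[1, 5, 6]], [6])
```

Redistribution changes only one key in the parent, while concatenation removes a key from the parent and can therefore make the parent underflow in turn.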
9.12 Deletion, Merging, and Redistribution Deletion Examples
Figure A:
Root: I P Z
Second level: D G I | M P | T X Z
Leaves: A B C D | E F G | H I | J K L M | N O P | Q R S T | U V W X | Y Z
9.12 Deletion, Merging, and Redistribution Deletion Example 1 • Removal of key C from Figure A: change occurs only in the leaf node
[Figure: the leaf A B C D becomes A B D; the rest of Figure A is unchanged]
9.12 Deletion, Merging, and Redistribution Deletion Example 2 • Result of deleting P from Figure A: P changes to O in the second level and the root
[Figure: the leaf N O P becomes N O, and P is replaced by O in the second-level node and in the root]
9.12 Deletion, Merging, and Redistribution Deletion Example 3 • Result of deleting H from Figure A: removal of H caused an underflow, and two leaf nodes were merged
[Figure: the leaves E F G and H I are replaced by the single leaf E F G I; the second-level node D G I becomes D I]
Redistribution during Deletion • A way to improve storage utilization • need to do this when rightmost key is deleted • rightmost key occurs in parent node, too! • Tends to make an efficient B-tree in terms of space utilization • Worst case : around 50% • Average case : 67 ~ 69%