Explore the significance of B-Trees in database operations and learn how to minimize disk access time for faster data retrieval. Discover key properties and benefits of B-Trees.
CSE 326: Killer Bee-Trees
("Where was that root?")
David Kaplan
Dept of Computer Science & Engineering
Autumn 2001
Beyond Binary Trees
• One of the most important applications for search trees is databases
• If the DB is small enough to fit into RAM, almost any scheme for balanced trees (e.g. AVL) is okay
• 1980: RAM – 1 MB; DB – 100 MB
• 2000 (WalMart): RAM – 1,000 MB (gigabyte); DB – 1,000,000 MB (terabyte)
• The gap between disk and main memory is growing!
Time Gap
• For many corporate and scientific databases, the search tree must mostly be on disk
• Accessing disk is 200,000x slower than RAM!
• Visiting a node = accessing disk
• Even perfectly balanced binary trees are a disaster! log_2(10,000,000) ≈ 24 disk accesses
• What matters here? The relative speeds of memory and disk operations
Goal: Decrease Height of Tree
Food Pyramid for Silicon-Based Lifeforms
[Figure: pyramid with three tiers, top to bottom: cache (on-chip silicon), memory (off-chip silicon), disk (ferro-magnetic). Side labels: speed, "seams", space.]
Interpreting the Food Pyramid
• Speed becomes plentiful as we move up the pyramid
• Space gets (dirt) cheap as we move down the pyramid
Physical seams
• Disk access ~10^5 x slower than memory access
• Memory access ~10^2 x slower than cache access
Computational seam
• Serialization/de-serialization can impose ~10x penalty
Seams argue for data structures like B-Trees, because seams make the constants matter!
M-ary Search Tree
• Maximum branching factor of M
• Complete tree has depth = log_M N
• Each internal node in a complete tree has M - 1 keys
runtime:
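The depth claim can be checked with a few lines of integer arithmetic. This is a minimal sketch, not part of the original slides; the function name `mary_depth` is made up for illustration, and it rounds log_M N up to a whole number of levels.

```python
def mary_depth(n: int, m: int) -> int:
    """Smallest depth d with m**d >= n, i.e. ceil(log_m(n)), using integers only."""
    if n < 1 or m < 2:
        raise ValueError("need n >= 1 and m >= 2")
    depth, capacity = 0, 1
    while capacity < n:      # each extra level multiplies the capacity by m
        capacity *= m
        depth += 1
    return depth

if __name__ == "__main__":
    for m in (2, 10, 100):
        print(f"M = {m:3d}: depth for 10,000,000 items =", mary_depth(10_000_000, m))
```

Integer multiplication is used instead of `math.log` so that exact powers of M do not suffer from floating-point rounding.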
B-Trees
• B-Trees are specialized M-ary search trees
• Each node has many keys
• the subtree between two keys x and y contains values v such that x ≤ v < y
• binary search within a node to find the correct subtree
• Each node takes one full {page, block, line} of memory
[Figure: a node with keys 3, 7, 12, 21 and five subtrees: x < 3, 3 ≤ x < 7, 7 ≤ x < 12, 12 ≤ x < 21, 21 ≤ x]
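The "binary search within a node" step can be sketched with Python's standard `bisect` module. The helper name `child_index` is hypothetical; the keys are the ones in the slide's figure, and the rule assumed is that subtree i holds values v with keys[i-1] ≤ v < keys[i].

```python
from bisect import bisect_right

def child_index(keys, v):
    """Index of the subtree that may contain v, given the node's sorted search keys."""
    # bisect_right returns how many keys are <= v, which is exactly the
    # subtree number under the rule keys[i-1] <= v < keys[i].
    return bisect_right(keys, v)

if __name__ == "__main__":
    keys = [3, 7, 12, 21]
    for v in (2, 3, 6, 7, 21, 100):
        print(v, "-> subtree", child_index(keys, v))
```

A value equal to a search key goes to the subtree on that key's right, which matches the B+-Tree rule that the smallest datum between keys x and y equals x.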
B-Tree Properties‡
‡ Technically, B+-Trees
Properties
• maximum branching factor of M
• the root has between 2 and M children or at most L keys
• other internal nodes have between ⌈M/2⌉ and M children
• internal nodes contain only search keys (no data)
• smallest datum between search keys x and y equals x
• each (non-root) leaf contains between ⌈L/2⌉ and L keys
• all leaves are at the same depth
Result
• tree is Θ(log n) deep
• all operations run in Θ(log n) time
• operations pull in about M or L items at a time
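One way to make these properties concrete is a small checker. The sketch below is an illustration only: it assumes a toy representation in which a leaf is a sorted Python list and an internal node is a `(keys, children)` tuple, and it assumes the lower bounds are rounded up (⌈M/2⌉ and ⌈L/2⌉). The names `check` and `smallest` are made up.

```python
from math import ceil

def smallest(node):
    """Smallest datum stored below node."""
    while isinstance(node, tuple):
        node = node[1][0]            # follow the leftmost child
    return node[0]

def check(node, M, L, is_root=True):
    """Return the height of node; raise ValueError if a listed property fails."""
    def require(ok, msg):
        if not ok:
            raise ValueError(msg)

    if isinstance(node, list):                        # leaf: data keys only
        low = 0 if is_root else ceil(L / 2)
        require(low <= len(node) <= L, f"leaf holds {len(node)} keys")
        require(node == sorted(node), "leaf keys not sorted")
        return 0
    keys, children = node                             # internal: search keys only
    low = 2 if is_root else ceil(M / 2)
    require(low <= len(children) <= M, f"node has {len(children)} children")
    require(len(keys) == len(children) - 1, "need one key fewer than children")
    heights = {check(c, M, L, is_root=False) for c in children}
    require(len(heights) == 1, "leaves at different depths")
    for key, child in zip(keys, children[1:]):
        require(key == smallest(child), f"key {key} is not its subtree's minimum")
    return heights.pop() + 1

if __name__ == "__main__":
    tree = ([14], [([5], [[1, 3], [5]]), ([59, 89], [[14, 26], [59, 79], [89]])])
    print("height:", check(tree, M=3, L=2))
```

The children are validated before the search keys so that `smallest` is only called on subtrees already known to be non-empty.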
When constants matter, or … When Big-O is Not Enough
• B-Tree is about log_{M/2}(n/(L/2)) deep
  = log_{M/2} n - log_{M/2}(L/2)
  = O(log_{M/2} n)
  = O(log n) steps per operation (same as BST!)
• Where's the beef?!
  log_2(10,000,000) ≈ 24 disk accesses
  log_{200/2}(10,000,000) < 4 disk accesses
• Recall: disk access ~200,000x slower than RAM access, so 4 vs. 24 disk accesses is huge!
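The two figures on this slide can be reproduced numerically. This is a rough sketch using floating-point logarithms; the function name `btree_depth_estimate` is an assumption for illustration, and the formula is the slide's half-full estimate log_{M/2}(n/(L/2)).

```python
import math

def btree_depth_estimate(n, M, L):
    """Slide's estimate log_{M/2}(n / (L/2)): depth when every node is half full."""
    return math.log(n / (L / 2), M / 2)

if __name__ == "__main__":
    n = 10_000_000
    print("binary tree levels:      ", math.ceil(math.log2(n)))
    print("log base 200/2 of n:     ", round(math.log(n, 200 / 2), 2))
    print("with half-full leaves too:", round(btree_depth_estimate(n, 200, 200), 2))
```

Both trees are O(log n) deep; the difference is entirely in the base of the logarithm, which is the constant that Big-O hides.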
B-Tree Node Structure
Internal node
• i search keys; i+1 subtrees; M - i - 1 inactive entries
[Figure: key slots 1 … M - 1 holding k1, k2, …, ki, followed by empty slots]
Leaf
• j data keys; L - j inactive entries
[Figure: key slots 1 … L holding k1, k2, …, kj, followed by empty slots]
Example B-Tree
B-Tree with M = 4 and L = 4
[Figure: three-level tree]
• Root: 10 40
• Internal nodes: 3 | 15 20 30 | 50
• Leaves: 1 2 | 3 5 6 9 | 10 11 12 | 15 17 | 20 25 26 | 30 32 33 36 | 40 42 | 50 60 70
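A find on this example can be sketched as follows. The tree layout is read off the slide's figure as best it can be recovered (root 10 40, internal nodes 3, 15 20 30 and 50), and the representation (tuples for internal nodes, lists for leaves) and the name `find` are assumptions made for illustration.

```python
from bisect import bisect_right

# Leaves are sorted lists; internal nodes are (keys, children) tuples.
leaves = [[1, 2], [3, 5, 6, 9], [10, 11, 12], [15, 17],
          [20, 25, 26], [30, 32, 33, 36], [40, 42], [50, 60, 70]]
root = ([10, 40], [
    ([3], leaves[0:2]),
    ([15, 20, 30], leaves[2:6]),
    ([50], leaves[6:8]),
])

def find(node, key):
    """Return (found, nodes_visited) for a top-down search."""
    visited = 1
    while isinstance(node, tuple):
        keys, children = node
        node = children[bisect_right(keys, key)]   # pick the subtree for key
        visited += 1
    return key in node, visited

if __name__ == "__main__":
    for k in (25, 4, 40):
        print(k, find(root, k))
```

Every search visits exactly three nodes here, one per level, which is the number of disk accesses if each node is one page.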
Making a B-Tree (M = 3, L = 2)
• The empty B-Tree
• Insert(3): leaf [3]
• Insert(14): leaf [3 14]
• Now, Insert(1)?
Splitting the Root
• Insert(1): leaf [1 3 14]. Too many keys in a leaf!
• So, split the leaf: [1 3] and [14]
• And create a new root: [14] over leaves [1 3] and [14]
Insertions and Split Ends
• Insert(59): root [14] over leaves [1 3] and [14 59]
• Insert(26): leaf [14 26 59]. Too many keys in a leaf!
• So, split the leaf: [14 26] and [59]
• And add a new child: root [14 59] over leaves [1 3], [14 26], [59]
Propagating Splits
• Insert(5): leaf [1 3 5] overflows and splits into [1 3] and [5]
• Add new child: the node now has keys [5 14 59] over leaves [1 3], [5], [14 26], [59]. Too many keys in an internal node!
• So, split the node: [5] over leaves [1 3], [5] and [59] over leaves [14 26], [59]
• Create a new root: [14]
B-Tree Insertion
• Insert the key in its leaf
• If the leaf ends up with L+1 items, overflow!
  • Split the leaf into two nodes: original with ⌈(L+1)/2⌉ items, new one with ⌊(L+1)/2⌋ items
  • Add the new child to the parent
  • If the parent ends up with M+1 items, overflow!
• If an internal node ends up with M+1 items, overflow!
  • Split the node into two nodes: original with ⌈(M+1)/2⌉ items, new one with ⌊(M+1)/2⌋ items
  • Add the new child to the parent
  • If the parent ends up with M+1 items, overflow!
• If the root overflows, split it in two and hang the new nodes under a new root (the only way tree height can increase)
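The insertion steps above can be sketched in Python. This is one possible in-memory rendering, not the course's reference code: the class names `Leaf`, `Internal`, `BTree` and the helper `leaf_keys` are made up, keys are assumed distinct, and the original node is assumed to keep the larger half (rounded up) on a split.

```python
from bisect import bisect_right, insort

class Leaf:
    def __init__(self, keys=None):
        self.keys = keys or []            # sorted data keys

class Internal:
    def __init__(self, keys, children):
        self.keys = keys                  # len(children) - 1 search keys
        self.children = children

class BTree:
    def __init__(self, M, L):
        self.M, self.L = M, L
        self.root = Leaf()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:             # root overflowed: the tree grows by one level
            sep, right = split
            self.root = Internal([sep], [self.root, right])

    def _insert(self, node, key):
        """Insert below node; return (separator, new_sibling) if node split, else None."""
        if isinstance(node, Leaf):
            insort(node.keys, key)
            if len(node.keys) <= self.L:
                return None
            keep = (self.L + 2) // 2      # ceil((L+1)/2) keys stay in the original
            right = Leaf(node.keys[keep:])
            node.keys = node.keys[:keep]
            return right.keys[0], right   # separator = smallest key on the right
        i = bisect_right(node.keys, key)
        split = self._insert(node.children[i], key)
        if split is None:
            return None
        sep, right = split                # add the new child to this parent
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
        if len(node.children) <= self.M:
            return None
        keep = (self.M + 2) // 2          # ceil((M+1)/2) children stay in the original
        new = Internal(node.keys[keep:], node.children[keep:])
        up = node.keys[keep - 1]          # middle key moves up to the parent
        node.keys = node.keys[:keep - 1]
        node.children = node.children[:keep]
        return up, new

def leaf_keys(node):
    """All leaves under node, left to right, as lists of keys."""
    if isinstance(node, Leaf):
        return [node.keys]
    return [ks for child in node.children for ks in leaf_keys(child)]

if __name__ == "__main__":
    t = BTree(M=3, L=2)
    for k in (3, 14, 1, 59, 26, 5, 89, 79):
        t.insert(k)
    print(t.root.keys, [c.keys for c in t.root.children], leaf_keys(t.root))
```

Splits are reported upward as return values, so the only place the height can change is in `insert`, where a new root is created.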
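After More Routine Inserts
• Before: root [14]; internal nodes [5] and [59]; leaves [1 3], [5], [14 26], [59]
• After Insert(89) and Insert(79): root [14]; internal nodes [5] and [59 89]; leaves [1 3], [5], [14 26], [59 79], [89]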
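Deletion
• Before: root [14]; internal nodes [5] and [59 89]; leaves [1 3], [5], [14 26], [59 79], [89]
• Delete(59): leaf [59 79] becomes [79]; the parent key 59 becomes 79, giving internal node [79 89]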
Deletion and Adoption
• Delete(5): leaf [5] is left empty. A leaf has too few keys!
• So, borrow from a neighbor: leaves [1 3] and [5] become [1] and [3], and the parent key becomes 3
Deletion with Propagation
• Delete(3): leaf [3] is left empty. A leaf has too few keys!
• And no neighbor with surplus!
• So, delete the leaf
• But now a node has too few subtrees! (its only remaining child is leaf [1])
Finishing the Propagation (More Adoption)
• Adopt a neighbor: the underfull node takes leaf [14 26] from its sibling [79 89]
• Result: root [79]; internal nodes [14] over leaves [1], [14 26] and [89] over leaves [79], [89]
A Bit More Adoption
• Delete(1): leaf [1] is left empty (adopt a neighbor)
• Result: root [79]; internal nodes [26] over leaves [14], [26] and [89] over leaves [79], [89]
Pulling out the Root
• Delete(26): leaf [26] is left empty. A leaf has too few keys!
• And no neighbor with surplus! So, delete the leaf
• A node has too few subtrees and no neighbor with surplus! Delete the node; its remaining leaf [14] moves under the neighbor, giving [79 89] over leaves [14], [79], [89]
• But now the root has just one subtree!
Pulling out the Root (cont.)
• The root has just one subtree! But that's silly!
• Just make the one child the new root: [79 89] over leaves [14], [79], [89]
B-Tree Deletion
• Remove the key from its leaf
• If the leaf ends up with fewer than ⌈L/2⌉ items, underflow!
  • Adopt data from a neighbor; update the parent
  • If borrowing won't work, delete node and divide keys between neighbors
  • If the parent ends up with fewer than ⌈M/2⌉ items, underflow!
Why will dumping keys always work if borrowing doesn't?
B-Tree Deletion, part Bee
• If a node ends up with fewer than ⌈M/2⌉ items, underflow!
  • Adopt subtrees from a neighbor; update the parent
  • If borrowing won't work, delete node and divide subtrees between neighbors
  • If the parent ends up with fewer than ⌈M/2⌉ items, underflow!
• If the root ends up with only one child, make the child the new root of the tree (the only way tree height can decrease)
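The leaf-level half of deletion (borrow, else merge, then report underflow to the parent) can be sketched for a single internal node and its leaf children. This is a deliberately limited illustration: the function name `delete_key` and the list-of-lists representation are assumptions, the minimum occupancy is taken as ⌈L/2⌉ and ⌈M/2⌉, and propagation above this one parent is left to the caller.

```python
from bisect import bisect_right

def delete_key(parent_keys, leaves, key, L, M):
    """Delete key from the leaves under one internal node (both lists edited in place).

    parent_keys[i] is the smallest key in leaves[i + 1].
    Returns True if the parent is left with too few children (underflow to propagate).
    """
    i = bisect_right(parent_keys, key)
    leaf = leaves[i]
    leaf.remove(key)                            # raises ValueError if key is absent
    min_keys = (L + 1) // 2                     # ceil(L/2)
    if len(leaf) < min_keys:                    # underflow!
        left = leaves[i - 1] if i > 0 else None
        right = leaves[i + 1] if i + 1 < len(leaves) else None
        if left is not None and len(left) > min_keys:
            leaf.insert(0, left.pop())          # adopt the left neighbor's largest key
        elif right is not None and len(right) > min_keys:
            leaf.append(right.pop(0))           # adopt the right neighbor's smallest key
        elif left is not None:
            left.extend(leaf)                   # no surplus anywhere: dump keys left
            del leaves[i]
        elif right is not None:
            right[:0] = leaf                    # or dump them right
            del leaves[i]
        # else: an only child; nothing to merge with, the parent is reported below
    parent_keys[:] = [lf[0] for lf in leaves[1:]]   # update the parent's search keys
    return len(leaves) < (M + 1) // 2           # ceil(M/2) children required

if __name__ == "__main__":
    keys, leaves = [5], [[1, 3], [5]]
    print(delete_key(keys, leaves, 5, L=2, M=3), keys, leaves)
    print(delete_key(keys, leaves, 3, L=2, M=3), keys, leaves)
```

Dumping always fits when borrowing fails: the underfull leaf holds at most ⌈L/2⌉ - 1 keys and a neighbor without surplus holds exactly ⌈L/2⌉, so together they hold at most 2⌈L/2⌉ - 1 ≤ L keys.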
Thinking about B-Trees
• B-Tree insertion can cause (expensive) splitting and propagation
• Deletion can cause (cheap) borrowing or (expensive) propagation
• Repeated insertions and deletion can cause thrashing
• Propagation is rare if M and L are large (Why?)
• Disk reads/writes occur on:
  • Find, but finds will often get filled from virtual memory
  • Insert or Delete, but usually affects only the leaf node (unless we over/underflow)
  • Overflow or underflow, but we size M and L to make these rare
Here's the beef!
• M = L = 128, h = 4: 128^5 = 30+ billion items!
• M = L = 128, h = 5: 128^6 = 4+ trillion items!
… and, disk accesses are bounded by height!
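The capacity figures follow from multiplying out the branching factors. In this small sketch the function names are hypothetical, h counts the levels of internal nodes above the leaves, and the lower bound assumes the minimum occupancies ⌈M/2⌉ and ⌈L/2⌉ with a two-child root.

```python
def max_items(M, L, h):
    """Most items a tree of height h can hold: every node completely full."""
    return M ** h * L

def min_items(M, L, h):
    """Fewest items at height h >= 1: root with 2 children, everything else half full."""
    half_m, half_l = (M + 1) // 2, (L + 1) // 2
    return 2 * half_m ** (h - 1) * half_l

if __name__ == "__main__":
    for h in (4, 5):
        print(f"h = {h}: between {min_items(128, 128, h):,} and {max_items(128, 128, h):,} items")
```

Even the worst case grows by a factor of 64 per level here, which is why a handful of disk accesses covers billions of items.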
Flavors of B-Trees
• B-Trees with M = 3, L = x are called 2-3 trees
• B-Trees with M = 4, L = x are called 2-3-4 trees (internal nodes have 2, 3 or 4 children)
• Useful for achieving log_M(n) bound, prototyping
• Disk-based systems typically require much larger M, L
  • Base M, L on {page, block, line} and key sizes
  • Branching factors from 50 – 2000 are common
("Where was that root?" "Paged out to disk, I think …")
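Basing M and L on page and key sizes is simple arithmetic. The sketch below assumes a deliberately simplified page layout (no page header, fixed-size keys, pointers and records); the function names and the example sizes are made up for illustration.

```python
def branching_factor(page_bytes, key_bytes, pointer_bytes):
    """Largest M such that M child pointers and M - 1 keys fit in one page."""
    # M * pointer + (M - 1) * key <= page   <=>   M <= (page + key) / (pointer + key)
    return (page_bytes + key_bytes) // (pointer_bytes + key_bytes)

def leaf_capacity(page_bytes, record_bytes):
    """Largest L such that L data records fit in one page."""
    return page_bytes // record_bytes

if __name__ == "__main__":
    print("M =", branching_factor(page_bytes=4096, key_bytes=8, pointer_bytes=8))
    print("L =", leaf_capacity(page_bytes=4096, record_bytes=32))
```

With these assumed sizes the branching factor lands inside the 50 – 2000 range quoted on the slide; real systems lose some of each page to headers and bookkeeping.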
Dictionary Alternatives
• BST: fast finds, inserts, and deletes, O(log n) on average (if data is random!)
• AVL trees: guaranteed O(log n) operations
• Splay trees: guaranteed amortized O(log n) operations
• B-Trees: guaranteed O(log_M n)
  • Shallower depth better for disk-based databases
  • Hairy¹ in practice: many cases, intricate code
• What would be even better?
  • How about O(1) finds and inserts?
¹ hairy - Slang. Fraught with difficulties; hazardous: a hairy escape; hairy problems.