Indexing Techniques

Indexing Techniques

The Problem • What can we introduce to make search more efficient? • Indices! • What is an index? … … Indexing Techniques

Anna Paul Tim Page i Page i+1 Definitions • Index: an auxiliary data structure to speed up record retrieval • Search key: the field/s of a table which is/are indexed • Storage: index files that contain index records • Each entry storing • Actual data record • or, search key value k and record ID <k,rid> • or, search key value k and list of records IDs <k,rid list> • Types: ordered and unordered (hash) indices Indexing Techniques

00112233 00112233 Paul Anna 00112235 00112234 Anna Carol 00112236 00112235 Matt Paul 00112238 00112236 Tim Tim 00112237 Carol 00112238 Rob Types of Ordered Indices (1/3) • Assuming ordered data files • Depending on which field is indexed • Primary index: search key is ordering key field • Pointer for each page • Secondary index: search key is non ordering field secondary primary Indexing Techniques

00112233 00112234 00112235 00112233 00112236 00112235 00112237 00112236 00112238 00112238 00112233 Paul 00112234 Anna 00112235 Matt 00112236 Tim 00112237 Carol 00112238 Rob Types of Ordered Indices (2/3) • Depending on the density of index records • Dense index: an index record for each distinct search key value, ie every record • Sparse index: index records for only some search key values • search key value for first record in page • pointer to page dense sparse Indexing Techniques

Anna Carol Matt 00112233 Paul 00112234 Anna 00112234 Rob 00112237 Carol 00112235 Tim 00112235 Matt 00112236 00112237 00112233 Paul 00112238 01112233 Paul 01112233 00112238 Rob 01112236 00112236 Tim 02112236 01112236 Tim 02112236 Tim Types of Ordered Indices (3/3) • Ordering field is nonkey (may have duplicates) • Clustered index • Unclustered index clustered unclustered Indexing Techniques

Indices Exercise • 215 records • 128 bytes/record • 210 bytes/page • ordered file equality search on ordering field, unspanned organization • without an index • with a primary index • on field of size 12 bytes • assume pointer 4 bytes long Indexing Techniques

00112233 00112233 00112234 00112235 00112235 00112236 00112236 00112237 00112233 Paul 00112238 00112234 Anna 00112235 Matt 00112236 Tim 00112237 Carol 00112238 Rob Multi-level Indices (1/2) • If access using first-level index is still expensive • Build a sparse index on the first-level index • Multi-level Index • Fan-out: index blocking factor first-level index second-level index Indexing Techniques

Multi-level Indices (2/2) • 26 index records/page (fan-out) • 215 index records • 1st-level • 29 pages • 2nd-level • 29 index records • 23 pages • 3rd-level • 23 index records • 1 page • 1 <= 215 / (26)t • t = ceil(log26 215 ) = 3 • t = ceil(logfo#index-records) Indexing Techniques

Dynamic multi-level indices • So far assumed indices are physically ordered files • expensive insertions and deletions • Dynamic multi-level indices • B trees • B+ trees Indexing Techniques

Tree-structured Indices • For each node: K1 < K2 < … Kq-1 • For each value X in subtree pointed to by Pi • Ki-1< X < Ki, 1<i<q • X < Ki, i=1 • Ki-1< X, i=q P1 K1 … Ki-1 Pi Ki … Kq-1 Pq X X X Indexing Techniques

… … … … … … … … … B tree • Problems: empty nodes, unbalanced trees • solution: B trees Indexing Techniques

B tree: Definition • Each node: <P1,<K1, Pr1>, P2,…,<Kq-1, Prq-1>, Pq> • Pi tree pointer, Ki search value, Pri data pointer • For each node: K1 < K2 < … Kq-1 • For each value X in subtree pointed to by Pi • Ki-1< X < Ki, 1<i<q • X < Ki, i=1 • Ki-1< X, i=q • Each node at most q pointers • B tree is order q • Each node at least ceil(q/2) tree pointers • except from root • Internal node with p pointers has p-1 values • All leaves at the same level • balanced tree Indexing Techniques

B tree: Example 5 8 ø 1 ø 3 ø ø 6 ø 7 ø ø 9 ø 12 ø tree pointer data pointer ø null pointer Indexing Techniques

B+ tree • Most implementations of B tree are B+ tree • Data pointers only in leaves • more entries in internal nodes than regular B trees • less internal nodes • less levels • faster access Indexing Techniques

150 100 156 101 179 110 B+ tree: Definition • Internal nodes: <P1,K1, P2,…, Pq-1, Kq-1, Pq> • Leaf nodes: <<K1, Pr1>, <K2, Pr2>,…,<Kp-1, Prp-1>, Pnext> • Pri points a data records or block of pointers of such records • leaf order 120 150 180 120 130 180 200 Indexing Techniques

120 150 180 3 100 150 101 156 5 179 11 110 B+ tree: Search • At each level, find smallest Ki larger than search key • Follow associated pointer Pi 100 30 30 35 120 130 180 200 Indexing Techniques

B+ tree: Insert • Nodes may overflow or underflow • Ignoring overflow or underflow • Inserting data record with with search key value k • find leaf node • if k found • add record to file, create indirect block if there isn’t one • add record pointer to indirect block • if k not found • add data record to file • insert record pointer in leaf node (all search keys in order) Indexing Techniques

B+ tree: Delete • Ignoring overflow or underflow • Find leaf node with search key value k • Find data record pointer, delete record • delete index record • and indirect block, if any, if empty Indexing Techniques

100 k < 100 120 150 180 30 30 35 120 130 180 200 100 150 3 156 5 101 11 110 179 B+ tree: Simple Insert • Insert 42 42 Indexing Techniques

100 k < 100 120 150 180 30 30 35 42 120 130 180 200 3 150 100 5 156 101 11 110 179 B+ tree: Leaf Overflow (1/2) • Insert 9 Indexing Techniques

120 150 180 150 100 101 156 110 179 B+ tree: Leaf Overflow (2/2) • first ceil(n/2) in existing node, rest in new leaf node • n=3+1=4 100 k < 100 9 30 3 5 9 11 30 35 42 120 130 180 200 Indexing Techniques

120 150 180 100 150 156 101 179 110 B+ tree: Internal Node Overflow (1/3) • Insert 210, insert 205 100 k < 100 9 30 3 5 9 11 30 35 42 120 130 180 200 210 Indexing Techniques

120 150 180 100 150 156 101 179 110 B+ tree: Internal Node Overflow (2/3) • Leaf Split 100 k < 100 9 30 3 5 9 11 30 35 42 120 130 180 200 205 210 Indexing Techniques

120 180 205 150 100 156 101 179 110 B+ tree: Internal Node Overflow (3/3) 100 150 k < 100 9 30 3 5 9 11 30 35 42 120 130 180 200 205 210 Indexing Techniques

120 150 180 150 100 101 156 179 110 B+ tree: New Root (1/2) • Insert 210, insert 205 120 130 180 200 205 210 Indexing Techniques

150 180 120 205 100 150 101 156 179 110 B+ tree: New Root (2/2) 120 130 180 200 205 210 Indexing Techniques

Index Insert Exercise • Insert 8, 7, 41 9 30 3 5 9 11 30 35 42 Indexing Techniques

B+ tree: Delete • Simple delete case • Underflow case: • redistribute records • coalesce with siblings • update parents Indexing Techniques

150 180 120 205 100 150 101 156 179 110 B+ tree: Simple Delete (1/2) • Delete 110 120 130 180 200 205 210 215 Indexing Techniques

150 120 180 205 150 156 179 B+ tree: Simple Delete (2/2) • Leaf Updated 100 101 120 130 180 200 205 210 215 Indexing Techniques

150 120 180 205 150 156 179 B+ tree: Delete Redistribution (1/2) • Delete 180 100 101 120 130 180 200 205 210 215 Indexing Techniques

150 120 B+ tree: Delete Redistribution (2/2) • Redistribute entries • left or right sibling 179 205 100 101 120 130 150 156 179 200 205 210 Indexing Techniques

150 120 B+ tree: Delete Coalesce (1/4) • Delete 101 179 205 100 101 120 130 150 156 179 200 205 210 215 Indexing Techniques

150 120 B+ tree: Delete Coalesce (2/4) • Leaf updated • No redistribution • sibling coalesce 179 205 100 120 130 150 156 179 200 205 210 215 Indexing Techniques

150 B+ tree: Delete Coalesce (3/4) • Leaf updated • No redistribution • sibling coalesce 179 205 100 120 130 150 156 179 200 205 210 215 Indexing Techniques

150 179 B+ tree: Delete Coalesce (4/4) • Redistribution 205 100 120 130 150 156 179 200 205 210 215 Indexing Techniques

Hashing Techniques

null null h Static Hashing (1/2) • Store records in buckets with overflow chains • Allocate a fixed number of buckets M • Problems: • small M • long overflow chains, slow search-delete-insert Indexing Techniques

Static Hashing (2/2) • Problems: • large M • wasted space, slow scan null null h null Indexing Techniques

Dynamic Hashing • Splitting and coalescing buckets as the database grows-shrinks • One scheme: Extendible Hashing • Hash function generates large values, eg 32 bits • use i bits, change i as database size changes • If overflow, double the number of buckets • use i+1 bits of the hash function • but, expensive: read all pages M and distribute records in 2*M pages • solution: use a directory and double the size of the directory • only split bucket that overflowed Indexing Techniques

2 16 20 A 2 00 2 01 1 B 10 2 11 C 2 3 7 Directory D Buckets Extendible Hashing (1/4) h(18) = 10010 18 Indexing Techniques

Extendible Hashing (2/4) 2 16 20 A 2 00 2 01 1 B 10 h(4) = 00100 2 11 18 C 2 3 7 D Indexing Techniques

Extendible Hashing (3/4) 3 16 A 2 00 2 01 1 B 10 2 11 18 C 2 3 7 D 3 20 4 A1 Indexing Techniques

Extendible Hashing (4/4) 3 • Global Depth • Local Depth • If bucket full: • split bucket • increment LD • If GD=LD • increment GD • double directory 16 A 3 000 2 001 1 B 010 2 011 18 C 100 101 2 110 3 7 D 111 3 20 4 A1 Indexing Techniques

Extendible Hashing: Delete • If deletion make bucket empty • merge with split image • If directory pointers point to same bucket as split image • directory halved Indexing Techniques

Extendible Hashing: Summary • Avoids overflow pages • Directory can get large • Key search requires just 2 page reads • Space utilization fluctuates • 59-90% for uniformly distributed records Indexing Techniques

Extendible Hashing: Exercise • Initially GD = LD = 1 • M = 2 buckets • Hash function: h(k) = k mod 2i • inserts: 14, 18, 22, 3, 9 • deletes 9, 22, 3 1 12 8 1 00 1 01 5 Indexing Techniques

Indexing Techniques