Indexing

Indexing

Index Concept • Main idea: A separate data structure used to locate records • Many, many, many, many flavors of index organization have been proposed and tried • including structures which combine hashing and indexing • Various buffering schemes are somewhat orthogonal • We’ll focus on • General concepts [chapter 4.3-4.4] • ISAM (indexed sequential) [ch. 5.1] • B-trees and B+ trees [ch.5.2-5.8]

Index Terminology • Most generally, index is a list of value/address pairs • Each pair is an index “entry” • Value is the index “key” • Address will point to a data record, or to a data page • There might be many records on a page • The assumption is that the value/address pair will be much smaller in size than the full record • If index is small, a copy can be maintained in memory! • Permanent disk copy is still needed

Key Terminology • Index key field • Not necessarily the same as the primary DB key of the table! • But called a “key” anyway • Primary index • Key is the primary (DB) key • Only one index per file • Secondary index • Key is not the primary DB key • Could be many indices per file (or none)

More Indexing Terminology • Dense index • One index entry for each record (or page) • Non-dense or sparse index • Less than one index entry for each record • Inverted file: • File which has a dense secondary index • Clustering index • Preserves locality: close index entries refer to close data records • Multilevel indexing • each level is an index to the next level down

Indexing Pitfalls • Index itself is a file • Occupies disk space • Must worry about maintenance, consistency, recovery, etc. • Large indices won't fit in memory • May require multiple seeks to locate record entry

Desiderata for Multilevel Indexes • Should support efficient random access • Should also support efficient sequential access, if possible • Should have low height • Should be efficiently updatable • Should be storage-efficient • Top level(s) should fit in memory

ISAM • = Indexed Sequential Access Method • IBM terminology • “Indexed Sequential” more general term (non-IBM) • ISAM as described in textbook (5.1) is very close to B+ tree • simpler versions exist • Main idea: maintain sequential file but give it an index • Sequentiality for efficient “batch” processing • Index for random record access

ISAM Technique • Build a dense index of the pages (1st level index) • Sparse from a record viewpoint • Then build an index of the 1st level index (2nd level index) • Continue recursively until top level index fits on 1 page • Some implementations may stop after a fixed # of levels

Updating an ISAM File • Data set must be kept sequential • So that it can be processed without the index • May have to rewrite entire file to add records • Could use overflow pages • chained together or in fixed locations (overflow area) • Index is usually NOT updated as records are added! • Once in a while the whole thing is “reorganized” • Data pages recopied to eliminate overflows • Index recreated

ISAM Pros, Cons • Pro • Relatively simple • Great for true sequential access • Cons • Not very dynamic • Inefficient if lots of overflow pages • Can only be one ISAM index per file

B-Tree • B-Tree is a type of multilevel index • from another standpoint: it's a type of balanced tree • Invented in 1972 by Boeing engineers R. Bayer and E. McCreight • By 1979: "the standard organization for indexes in a database system" (Comer)

B-Tree Overview • Assume for now that keys are fixed-length and unique A B-tree can be thought of as a generalized binary search tree • multiple branches rather than just L or R • Trees are always perfectly balanced • Some wasted space in the nodes is tolerated

B-Tree Concepts • Each node contains • tree (index node) pointers, and • key values (with record or page pointers) • Given a key K and the two node pointers L and R around it • All key values pointed to by L are < K • All key values pointed to by R are > K • “Order p” means (up to) p tree pointers, (up to) p-1 keys • Terminology differs between authors

B+ Tree vs. B-tree • Textbook only discusses B+ trees • So do we from now on • Two big differences: • Original B-trees had record pointers in all of the index nodes; B+ trees only in leaf nodes • Given a key K and the two node pointers L and R around it • All key values pointed to by L are < K • All key values pointed to by R are >= K • B+ tree data pages are linked together to form a sequential file • Gives the advantages of ISAM • In our book, it’s a doubly-linked list

Alternate Views of the Leaf Nodes • [cf. Chapter 4.3.1] • Leaf nodes might be actual data pages • Leaf nodes might contain pointers to the actual data records or pages • For B+ trees, this implies the leaf node format is different from the non-leaf node format • may hold different number of entries • The leaf nodes can be chained together, regardless of whether the actual data pages are!

B+Tree Growth and Change The big idea: When a node is full, it splits. • middle value is propagated upward • If we’re lucky, there’s room for it in the level above • two new nodes are at same level as original node • Height of tree increases only when the root splits • A very nice property • This is what keeps the tree perfectly balanced • Recommended: split only “on the way down” • On deletion: two adjacent nodes recombine if both are < half full

Variations • Could redistribute records between adjacent blocks • esp. on deletion (B* tree) • Variable order: accommodate varying key lengths Could store the whole record in the index block • especially if records are few and small • in a B+ tree, this would make sequential access especially efficient

B+ Trees with Other Indices • Suppose you have a B+ tree for the file • Leaf nodes of the index are the actual pages of the file, doubly linked together for sequential access • Suppose you have some secondary indices • What happens when a B+ tree node splits or merges???

Other Forms of Indexing • Bitmap indexes • One index per value (property) of interest • One bit per record • TRUE if record has a particular property • Indexed hash: hash function takes you to an entry in an index • allows physical record locations to change • Clever indexing schemes are useful in optimizing complex queries

Indexing

Indexing

Presentation Transcript

Indexing

Indexing:

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing

Indexing