200 likes | 492 Views
Indexing. Index Concept. Main idea: A separate data structure used to locate records Many, many, many, many flavors of index organization have been proposed and tried including structures which combine hashing and indexing Various buffering schemes are somewhat orthogonal We’ll focus on
E N D
Index Concept • Main idea: A separate data structure used to locate records • Many, many, many, many flavors of index organization have been proposed and tried • including structures which combine hashing and indexing • Various buffering schemes are somewhat orthogonal • We’ll focus on • General concepts [chapter 4.3-4.4] • ISAM (indexed sequential) [ch. 5.1] • B-trees and B+ trees [ch.5.2-5.8]
Index Terminology • Most generally, index is a list of value/address pairs • Each pair is an index “entry” • Value is the index “key” • Address will point to a data record, or to a data page • There might be many records on a page • The assumption is that the value/address pair will be much smaller in size than the full record • If index is small, a copy can be maintained in memory! • Permanent disk copy is still needed
Key Terminology • Index key field • Not necessarily the same as the primary DB key of the table! • But called a “key” anyway • Primary index • Key is the primary (DB) key • Only one index per file • Secondary index • Key is not the primary DB key • Could be many indices per file (or none)
More Indexing Terminology • Dense index • One index entry for each record (or page) • Non-dense or sparse index • Less than one index entry for each record • Inverted file: • File which has a dense secondary index • Clustering index • Preserves locality: close index entries refer to close data records • Multilevel indexing • each level is an index to the next level down
Indexing Pitfalls • Index itself is a file • Occupies disk space • Must worry about maintenance, consistency, recovery, etc. • Large indices won't fit in memory • May require multiple seeks to locate record entry
Desiderata for Multilevel Indexes • Should support efficient random access • Should also support efficient sequential access, if possible • Should have low height • Should be efficiently updatable • Should be storage-efficient • Top level(s) should fit in memory
ISAM • = Indexed Sequential Access Method • IBM terminology • “Indexed Sequential” more general term (non-IBM) • ISAM as described in textbook (5.1) is very close to B+ tree • simpler versions exist • Main idea: maintain sequential file but give it an index • Sequentiality for efficient “batch” processing • Index for random record access
ISAM Technique • Build a dense index of the pages (1st level index) • Sparse from a record viewpoint • Then build an index of the 1st level index (2nd level index) • Continue recursively until top level index fits on 1 page • Some implementations may stop after a fixed # of levels
Updating an ISAM File • Data set must be kept sequential • So that it can be processed without the index • May have to rewrite entire file to add records • Could use overflow pages • chained together or in fixed locations (overflow area) • Index is usually NOT updated as records are added! • Once in a while the whole thing is “reorganized” • Data pages recopied to eliminate overflows • Index recreated
ISAM Pros, Cons • Pro • Relatively simple • Great for true sequential access • Cons • Not very dynamic • Inefficient if lots of overflow pages • Can only be one ISAM index per file
B-Tree • B-Tree is a type of multilevel index • from another standpoint: it's a type of balanced tree • Invented in 1972 by Boeing engineers R. Bayer and E. McCreight • By 1979: "the standard organization for indexes in a database system" (Comer)
B-Tree Overview • Assume for now that keys are fixed-length and unique A B-tree can be thought of as a generalized binary search tree • multiple branches rather than just L or R • Trees are always perfectly balanced • Some wasted space in the nodes is tolerated
B-Tree Concepts • Each node contains • tree (index node) pointers, and • key values (with record or page pointers) • Given a key K and the two node pointers L and R around it • All key values pointed to by L are < K • All key values pointed to by R are > K • “Order p” means (up to) p tree pointers, (up to) p-1 keys • Terminology differs between authors
B+ Tree vs. B-tree • Textbook only discusses B+ trees • So do we from now on • Two big differences: • Original B-trees had record pointers in all of the index nodes; B+ trees only in leaf nodes • Given a key K and the two node pointers L and R around it • All key values pointed to by L are < K • All key values pointed to by R are >= K • B+ tree data pages are linked together to form a sequential file • Gives the advantages of ISAM • In our book, it’s a doubly-linked list
Alternate Views of the Leaf Nodes • [cf. Chapter 4.3.1] • Leaf nodes might be actual data pages • Leaf nodes might contain pointers to the actual data records or pages • For B+ trees, this implies the leaf node format is different from the non-leaf node format • may hold different number of entries • The leaf nodes can be chained together, regardless of whether the actual data pages are!
B+Tree Growth and Change The big idea: When a node is full, it splits. • middle value is propagated upward • If we’re lucky, there’s room for it in the level above • two new nodes are at same level as original node • Height of tree increases only when the root splits • A very nice property • This is what keeps the tree perfectly balanced • Recommended: split only “on the way down” • On deletion: two adjacent nodes recombine if both are < half full
Variations • Could redistribute records between adjacent blocks • esp. on deletion (B* tree) • Variable order: accommodate varying key lengths Could store the whole record in the index block • especially if records are few and small • in a B+ tree, this would make sequential access especially efficient
B+ Trees with Other Indices • Suppose you have a B+ tree for the file • Leaf nodes of the index are the actual pages of the file, doubly linked together for sequential access • Suppose you have some secondary indices • What happens when a B+ tree node splits or merges???
Other Forms of Indexing • Bitmap indexes • One index per value (property) of interest • One bit per record • TRUE if record has a particular property • Indexed hash: hash function takes you to an entry in an index • allows physical record locations to change • Clever indexing schemes are useful in optimizing complex queries