Indexing and Hashing

Indexing and Hashing • Basic Index Concepts • Ordered Indices • B+-Tree Index Files • Hashing • Index Definition in SQL • Multiple-Key Access • Bitmap Indices

search-key pointer Basic Index Concepts • Indexes speed up access to data in a table. • card catalog in a library (author, title, subject) • A set of one or more attributes used to look up records in a table is referred to as a search key. • In the simplest case, an index fileconsists of records of the form: • Each such record is referred to as an index entry.

Basic Index Concepts, Cont. • An index file is a file, and suffers from many of the same problems as a data file, and uses some of the same organization techniques, e.g., pointer chains. • Index files are typically much smaller than the original file. • 10% - 25% is not unusual. • Two kinds of indices, primarily: • Ordered indices - entries are stored in sorted order, based on the search key. • Hash indices – entries are distributed uniformly across “buckets” using a “hash function.”

Index Evaluation Metrics • All indices are not created equal… • In an OLTP environment - Insertion, deletion and update time are important. • In a DSS environment - access time is important: • Point Queries - Records with a specified value in an attribute. • Range Queries - Records with an attribute value in a specified range. • In either case, space used is also important.

Ordered Indices • An index whose search key specifies the sequential order of the data file is a primary index. • Also called clusteringorclustered index. • Search key of a primary index is frequently the primary key. • An index whose search key does not specify the sequential order of the data file is a secondary index. • Also called a non-clustering or non-clustered index. • A sorted data file with a primary index on it is commonly referred to as an index-sequential file.

Dense Index Files • An index that contains an index record for every search-key value in the data file is a dense index. Informative, but misleading

Dense Index Files, Cont. • To locate the record(s) with search-key value K: • Find index record with search-key value K. • Follow pointer from the index record to the data record(s).

Dense Index Files, Cont. • To delete a record: • Locate the record in the data file, perhaps using the above procedure. • Delete the record from the data file (update free/used pointer lists as appropriate). • If the deleted record was the only one with that search-key value, then delete the search-key from the index (similar to data record deletion)

Dense Index Files, Cont. • To insert a record: • Perform an index lookup using the records’ search-key value. • If the search-key value appears in the index, follow the pointer to the data file and insert the record in an appropriate place. • If the search-key value does not appear in the index: • insert the search key into the index file • insert the record into the data file in an appropriate place • assign a pointer to the data record from the index record.

Sparse Index Files • An index that contains index records but only for some search-key values in the data file is a sparse index. • Typically one index entry for each data file block.

Sparse Index Files, Cont. • To locate a record with search-key value K: • Find the index record with largest search-key value <= K. • Search file sequentially from the record to which the index record points.

Sparse Index Files, Cont. • To delete a record: • Locate the record in the data file, perhaps using the above procedure. • Delete the record from the data file (update free/used lists as appropriate). • If the deleted record was the only record with its search-key value, and if an entry for the search key exists in the index, then replace the index entry with the next search-key value in the data file (in search-key order); if the next search-key value already has an index entry, the index entry is simply deleted.

Sparse Index Files, Cont. • To insert a record: (assume the index stores an entry for each data block) • Perform an index lookup using the records’ search-key value. • If the index entry points to a block with free space, then simply insert the record in that block, in sorted order. • If the index entry points to a full block, then allocate a new block and insert the first search-key value appearing in the new block into the index

Sparse Index Files, Cont. • Advantages (relative to dense indices): • Require less space • Less maintenance for insertions and deletions • Disadvantages: • Slower for locating records, especially if there is more than one block per index entry

Multilevel Index • In order to improve performance, an attempt is frequently made to store, i.e., pin, all index blocks in memory. • Unfortunately, sometimes an index is too big to fit into memory. • In such a case, the index can be treated as a sequential file on disk and a sparse index is built on it: • outer index – a sparse index • inner index – sparse or dense index • If the outer index is still too large to fit in main memory, yet another level of index can be created, and so on.

Multilevel Index, Cont. • Indices at all levels might require updating upon insertion or deletion. • Multilevel insertion, deletion and lookup algorithms are simple extensions of the single-level algorithms.

Secondary Indices • So far, our consideration of dense and sparse indices has only been in the context of primary indices. • Recall that an index whose search key does not specify the sequential order of the data file is called a secondary index. • A secondary index can be helpful when a table is searched using a search key other than the one on which the table is sorted. • Suppose account is sorted by account number, but searches are based on branch, or searching for a range of balances. • Suppose payment is sorted by loan# and payment#, but searches are based on id#

Secondary Indices • In a secondary index, each index entry will point to either a: • Single record containing the search key value (candidate key). • Bucket that contains pointers to all records with that search-key value (non-candidate key). • All previous algorithms and data structures can be modified to apply to secondary indices.

Secondary Indexon balance field of account

Index Classification • In summary, the indices we have considered so far are either: • Dense, or • Sparse • In addition, an index may be either: • Primary, or • Secondary • And the search key the index is built on may be either a: • Candidate key • Non-candidate key • Note, that the book claims a secondary index must be dense; why?

Index Performance • Although Indices improve performance, they can also hurt: • All indices must be updated upon insertion or deletion. • Performance degrades as the index file grows (physical order doesn’t match logical order, many overflow blocks get created, etc) consequently periodic reorganization (delete and rebuild) of the index is required. • Scanning a file sequentially in secondary search-key order can be expensive; worst case - each record access may fetch a new block from disk** • Thus, in the worst case, the number of data blocks retrieved when scanning a secondary index for a range query is equal to the number of tuples retrieved.

B+-Tree Index Files • B+-tree indices are a type of multi-level index. • Advantage of B+-tree index files: • Automatically reorganizes itself with small, local, changes. • Index reorganization is still required, but not as frequently*. • Disadvantage of B+-trees - extra time (insertion, deletion) and space overhead. • Advantages outweigh the disadvantages, and they are used extensively – the “gold standard” of index structures.

Example of a B+-tree • The structure of a B+-tree is determined by a parameter n, which determines the number of values and pointers that a node can have. B+-tree for account file (n = 3)

Observations about B+-trees • Each node is typically a disk block: • “logically” close blocks need not be “physically” close. • The value of n is typically determined by: • Block size • Search key size • Pointer size i.e., we squeeze in as many search-keys and pointers as possible.

Example of a B+-tree • All paths from root to leaf in a B+-tree are the same length. • No difference between best-case and worst-case index search. • This in contrast to hashing structures.

Example of a B+-tree • Each node that is not the root must be at least half full: • Between n/2 and n pointers. • Between n/2 – 1and n –1 search key values.

Example of a B+-tree • The root is a special case: • If the root is not a leaf, it can have as few as 2 children (regardless of n). • If the root is a leaf (that is, there are no other nodes in the tree), it can have as few as 1 value and no children.

Another Example B+-tree for account file (n = 5)

B+-Tree Node Structure • Node structure (leaf or internal): • Ki are the search-key values, in sorted order: K1 < K2 < K3 < . . .< Kn–1 • Pi are pointers to: • children, i.e., subtrees, for non-leaf nodes or • records or buckets of records for leaf nodes.

Non-Leaf Nodes in B+-Trees • For a non-leaf node: • All the search-keys in the subtree to which P1 points are less than K1 • For 2  i  n – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki • All the search-keys in the subtree to which Pnpoints are greater than Kn-1

Leaf Nodes in B+-Trees • For a leaf node: • Pointer Pi , where 1<= i <= n–1, either points to a record with search-key Ki, or to a bucket of pointers to records, each having search-key Ki. • Pn points to the next leaf node in search-key order

Queries on B+-Trees • Searching a B+ tree for a given search key value is a straight-forward generalization of searching a binary search tree. • Find all records with a search-key value of k (see page 488): • Start with the root node. • Examine the node for the smallest search-key value > k. • If such a value exists, call it is Kj, then follow Pjto the child node • Otherwise if k Kn–1, then follow Pn to the child node. • If the node is not a leaf, repeat the above procedure on that node. • Eventually reach a leaf node. • If for some i, key Ki = k, follow pointer Pito the desired record or bucket. • Otherwise no record with search-key value k exists.

Queries on B+-trees • Search examples: • Downtown • Newberry • Perryridge • All values between Mianus and Redwood (range query)

Queries on B+-Trees (Cont.) • In processing a query, a path is traversed in the tree from the root to some leaf node. • If there are K search-key values in the file, the path is no longer than logn/2(K). see www.cs.fit.edu/~pbernhar/teaching/cse5660/B-treeAnalysis.pdf • Since a node is generally the same size as a disk block, typically 4 kilobytes, n is typically around 100 (assuming 40 bytes per index entry). • For 1 million search key values and n = 100, at most log50(1,000,000) = 4 nodes are accessed in a lookup. • Contrast this with a balanced binary tree with 1 million search key values; around 20 nodes, i.e., blocks, are accessed.

Queries on B+-Trees (Cont.) • The authors claim (without proof or analysis) that if there are K search-key values in the file, the path is no longer than: logn/2(K) • The analysis from the previous page shows that the path is no longer than: 2 + logd(K/e) where d =n/2 and e =(n-1)/2. • Although the above expressions are different, both are O(logK), since d and e are fixed for any given index.

Updates on B+-Trees: Insertion • Find the leaf node in which the search-key value should appear. • If the search-key value is already in the leaf node: • If the index is on a candidate key field then report an error. • Otherwise (not a candidate key field), the record is added to data file, and • If necessary, a pointer is inserted into the bucket (secondary index on a non-candidate key). • If the search-key value is not in the leaf node: • Add the record to the data file, and • Create a bucket if the index is a secondary index on a non-candidate key field. • Add a pointer to the record in the bucket. • If there is room in the leaf node, insert the (key-value, pointer) pair. • Otherwise, split the node as discussed in the next slide.

Updates on B+-Trees: Insertion (Cont.) • Splitting a (leaf) node: • Examine the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. • Place the first n/2 in the original node, and the rest in a new node. • Let p be a pointer to the new node, and let k be the least key value in the node pointed to by p. • Insert (k,p) in the parent of the node being split; If the parent is full, split it and propagate the split further up (recursively). • Splitting nodes continues up the tree until a node is found that is not full. • In the worst case the root node is split increasing the height of the tree.

Updates on B+-Trees: Insertion (Cont.) B+-Tree before and after insertion of “Clearview”

Updates on B+-Trees: Insertion (Cont.) • Insert the following values into a B+ tree: 2 31 3 29 5 23 7 19 11 17

Updates on B+-Trees: Insertion (Cont.) 31 3 29 5 23 7 19 11 17 2

Updates on B+-Trees: Insertion (Cont.) 5 23 7 19 11 17 2 3 29 31 29

Updates on B+-Trees: Insertion (Cont.) 7 19 11 17 29 31 2 3 5 23 29 5

Indexing and Hashing