1 / 123

Indexing and Hashing

Indexing and Hashing. Basic Index Concepts Ordered Indices B + -Tree Index Files Hashing Index Definition in SQL Multiple-Key Access Bitmap Indices. search-key. pointer. Basic Index Concepts. Indexes speed up access to data in a table.

Download Presentation

Indexing and Hashing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Indexing and Hashing • Basic Index Concepts • Ordered Indices • B+-Tree Index Files • Hashing • Index Definition in SQL • Multiple-Key Access • Bitmap Indices

  2. search-key pointer Basic Index Concepts • Indexes speed up access to data in a table. • card catalog in a library (author, title, subject) • A set of one or more attributes used to look up records in a table is referred to as a search key. • In the simplest case, an index fileconsists of records of the form: • Each such record is referred to as an index entry.

  3. Basic Index Concepts, Cont. • An index file is a file, and suffers from many of the same problems as a data file, and uses some of the same organization techniques, e.g., pointer chains. • Index files are typically much smaller than the original file. • 10% - 25% is not unusual. • Two kinds of indices, primarily: • Ordered indices - entries are stored in sorted order, based on the search key. • Hash indices – entries are distributed uniformly across “buckets” using a “hash function.”

  4. Index Evaluation Metrics • All indices are not created equal… • In an OLTP environment - Insertion, deletion and update time are important. • In a DSS environment - access time is important: • Point Queries - Records with a specified value in an attribute. • Range Queries - Records with an attribute value in a specified range. • In either case, space used is also important.

  5. Ordered Indices • An index whose search key specifies the sequential order of the data file is a primary index. • Also called clusteringorclustered index. • Search key of a primary index is frequently the primary key. • An index whose search key does not specify the sequential order of the data file is a secondary index. • Also called a non-clustering or non-clustered index. • A sorted data file with a primary index on it is commonly referred to as an index-sequential file.

  6. Dense Index Files • An index that contains an index record for every search-key value in the data file is a dense index. Informative, but misleading

  7. Dense Index Files, Cont. • To locate the record(s) with search-key value K: • Find index record with search-key value K. • Follow pointer from the index record to the data record(s).

  8. Dense Index Files, Cont. • To delete a record: • Locate the record in the data file, perhaps using the above procedure. • Delete the record from the data file (update free/used pointer lists as appropriate). • If the deleted record was the only one with that search-key value, then delete the search-key from the index (similar to data record deletion)

  9. Dense Index Files, Cont. • To insert a record: • Perform an index lookup using the records’ search-key value. • If the search-key value appears in the index, follow the pointer to the data file and insert the record in an appropriate place. • If the search-key value does not appear in the index: • insert the search key into the index file • insert the record into the data file in an appropriate place • assign a pointer to the data record from the index record.

  10. Sparse Index Files • An index that contains index records but only for some search-key values in the data file is a sparse index. • Typically one index entry for each data file block.

  11. Sparse Index Files, Cont. • To locate a record with search-key value K: • Find the index record with largest search-key value <= K. • Search file sequentially from the record to which the index record points.

  12. Sparse Index Files, Cont. • To delete a record: • Locate the record in the data file, perhaps using the above procedure. • Delete the record from the data file (update free/used lists as appropriate). • If the deleted record was the only record with its search-key value, and if an entry for the search key exists in the index, then replace the index entry with the next search-key value in the data file (in search-key order); if the next search-key value already has an index entry, the index entry is simply deleted.

  13. Sparse Index Files, Cont. • To insert a record: (assume the index stores an entry for each data block) • Perform an index lookup using the records’ search-key value. • If the index entry points to a block with free space, then simply insert the record in that block, in sorted order. • If the index entry points to a full block, then allocate a new block and insert the first search-key value appearing in the new block into the index

  14. Sparse Index Files, Cont. • Advantages (relative to dense indices): • Require less space • Less maintenance for insertions and deletions • Disadvantages: • Slower for locating records, especially if there is more than one block per index entry

  15. Multilevel Index • In order to improve performance, an attempt is frequently made to store, i.e., pin, all index blocks in memory. • Unfortunately, sometimes an index is too big to fit into memory. • In such a case, the index can be treated as a sequential file on disk and a sparse index is built on it: • outer index – a sparse index • inner index – sparse or dense index • If the outer index is still too large to fit in main memory, yet another level of index can be created, and so on.

  16. Multilevel Index, Cont. • Indices at all levels might require updating upon insertion or deletion. • Multilevel insertion, deletion and lookup algorithms are simple extensions of the single-level algorithms.

  17. Secondary Indices • So far, our consideration of dense and sparse indices has only been in the context of primary indices. • Recall that an index whose search key does not specify the sequential order of the data file is called a secondary index. • A secondary index can be helpful when a table is searched using a search key other than the one on which the table is sorted. • Suppose account is sorted by account number, but searches are based on branch, or searching for a range of balances. • Suppose payment is sorted by loan# and payment#, but searches are based on id#

  18. Secondary Indices • In a secondary index, each index entry will point to either a: • Single record containing the search key value (candidate key). • Bucket that contains pointers to all records with that search-key value (non-candidate key). • All previous algorithms and data structures can be modified to apply to secondary indices.

  19. Secondary Indexon balance field of account

  20. Index Classification • In summary, the indices we have considered so far are either: • Dense, or • Sparse • In addition, an index may be either: • Primary, or • Secondary • And the search key the index is built on may be either a: • Candidate key • Non-candidate key • Note, that the book claims a secondary index must be dense; why?

  21. Index Performance • Although Indices improve performance, they can also hurt: • All indices must be updated upon insertion or deletion. • Performance degrades as the index file grows (physical order doesn’t match logical order, many overflow blocks get created, etc) consequently periodic reorganization (delete and rebuild) of the index is required. • Scanning a file sequentially in secondary search-key order can be expensive; worst case - each record access may fetch a new block from disk** • Thus, in the worst case, the number of data blocks retrieved when scanning a secondary index for a range query is equal to the number of tuples retrieved.

  22. B+-Tree Index Files • B+-tree indices are a type of multi-level index. • Advantage of B+-tree index files: • Automatically reorganizes itself with small, local, changes. • Index reorganization is still required, but not as frequently*. • Disadvantage of B+-trees - extra time (insertion, deletion) and space overhead. • Advantages outweigh the disadvantages, and they are used extensively – the “gold standard” of index structures.

  23. Example of a B+-tree • The structure of a B+-tree is determined by a parameter n, which determines the number of values and pointers that a node can have. B+-tree for account file (n = 3)

  24. Observations about B+-trees • Each node is typically a disk block: • “logically” close blocks need not be “physically” close. • The value of n is typically determined by: • Block size • Search key size • Pointer size i.e., we squeeze in as many search-keys and pointers as possible.

  25. Example of a B+-tree • All paths from root to leaf in a B+-tree are the same length. • No difference between best-case and worst-case index search. • This in contrast to hashing structures.

  26. Example of a B+-tree • Each node that is not the root must be at least half full: • Between n/2 and n pointers. • Between n/2 – 1and n –1 search key values.

  27. Example of a B+-tree • The root is a special case: • If the root is not a leaf, it can have as few as 2 children (regardless of n). • If the root is a leaf (that is, there are no other nodes in the tree), it can have as few as 1 value and no children.

  28. Another Example B+-tree for account file (n = 5)

  29. B+-Tree Node Structure • Node structure (leaf or internal): • Ki are the search-key values, in sorted order: K1 < K2 < K3 < . . .< Kn–1 • Pi are pointers to: • children, i.e., subtrees, for non-leaf nodes or • records or buckets of records for leaf nodes.

  30. Non-Leaf Nodes in B+-Trees • For a non-leaf node: • All the search-keys in the subtree to which P1 points are less than K1 • For 2  i  n – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki • All the search-keys in the subtree to which Pnpoints are greater than Kn-1

  31. Leaf Nodes in B+-Trees • For a leaf node: • Pointer Pi , where 1<= i <= n–1, either points to a record with search-key Ki, or to a bucket of pointers to records, each having search-key Ki. • Pn points to the next leaf node in search-key order

  32. Queries on B+-Trees • Searching a B+ tree for a given search key value is a straight-forward generalization of searching a binary search tree. • Find all records with a search-key value of k (see page 488): • Start with the root node. • Examine the node for the smallest search-key value > k. • If such a value exists, call it is Kj, then follow Pjto the child node • Otherwise if k Kn–1, then follow Pn to the child node. • If the node is not a leaf, repeat the above procedure on that node. • Eventually reach a leaf node. • If for some i, key Ki = k, follow pointer Pito the desired record or bucket. • Otherwise no record with search-key value k exists.

  33. Queries on B+-trees • Search examples: • Downtown • Newberry • Perryridge • All values between Mianus and Redwood (range query)

  34. Queries on B+-Trees (Cont.) • In processing a query, a path is traversed in the tree from the root to some leaf node. • If there are K search-key values in the file, the path is no longer than logn/2(K). see www.cs.fit.edu/~pbernhar/teaching/cse5660/B-treeAnalysis.pdf • Since a node is generally the same size as a disk block, typically 4 kilobytes, n is typically around 100 (assuming 40 bytes per index entry). • For 1 million search key values and n = 100, at most log50(1,000,000) = 4 nodes are accessed in a lookup. • Contrast this with a balanced binary tree with 1 million search key values; around 20 nodes, i.e., blocks, are accessed.

  35. Queries on B+-Trees (Cont.) • The authors claim (without proof or analysis) that if there are K search-key values in the file, the path is no longer than: logn/2(K) • The analysis from the previous page shows that the path is no longer than: 2 + logd(K/e) where d =n/2 and e =(n-1)/2. • Although the above expressions are different, both are O(logK), since d and e are fixed for any given index.

  36. Updates on B+-Trees: Insertion • Find the leaf node in which the search-key value should appear. • If the search-key value is already in the leaf node: • If the index is on a candidate key field then report an error. • Otherwise (not a candidate key field), the record is added to data file, and • If necessary, a pointer is inserted into the bucket (secondary index on a non-candidate key). • If the search-key value is not in the leaf node: • Add the record to the data file, and • Create a bucket if the index is a secondary index on a non-candidate key field. • Add a pointer to the record in the bucket. • If there is room in the leaf node, insert the (key-value, pointer) pair. • Otherwise, split the node as discussed in the next slide.

  37. Updates on B+-Trees: Insertion (Cont.) • Splitting a (leaf) node: • Examine the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. • Place the first n/2 in the original node, and the rest in a new node. • Let p be a pointer to the new node, and let k be the least key value in the node pointed to by p. • Insert (k,p) in the parent of the node being split; If the parent is full, split it and propagate the split further up (recursively). • Splitting nodes continues up the tree until a node is found that is not full. • In the worst case the root node is split increasing the height of the tree.

  38. Updates on B+-Trees: Insertion (Cont.) B+-Tree before and after insertion of “Clearview”

  39. Updates on B+-Trees: Insertion (Cont.) • Insert the following values into a B+ tree: 2 31 3 29 5 23 7 19 11 17

  40. Updates on B+-Trees: Insertion (Cont.) 31 3 29 5 23 7 19 11 17 2

  41. Updates on B+-Trees: Insertion (Cont.) 3 29 5 23 7 19 11 17 2 31

  42. Updates on B+-Trees: Insertion (Cont.) 29 5 23 7 19 11 17 2 3 31

  43. Updates on B+-Trees: Insertion (Cont.) 29 5 23 7 19 11 17 2 3 31

  44. Updates on B+-Trees: Insertion (Cont.) 5 23 7 19 11 17 2 3 29 31

  45. Updates on B+-Trees: Insertion (Cont.) 5 23 7 19 11 17 2 3 29 31 29

  46. Updates on B+-Trees: Insertion (Cont.) 5 23 7 19 11 17 2 3 29 31 29

  47. Updates on B+-Trees: Insertion (Cont.) 23 7 19 11 17 2 3 5 29 31 29

  48. Updates on B+-Trees: Insertion (Cont.) 23 7 19 11 17 2 3 5 29 31 29

  49. Updates on B+-Trees: Insertion (Cont.) 7 19 11 17 2 3 29 31 29 5 23

  50. Updates on B+-Trees: Insertion (Cont.) 7 19 11 17 29 31 2 3 5 23 29 5

More Related