
Index Structures [13]

Learn about index structures for efficient record retrieval, including indexed sequential files, B-trees, and hashing schemes (static, extensible, and linear). Understand their advantages, disadvantages, and operations.


Presentation Transcript


  1. Index Structures [13]

  2. Terms
  • Search key
    • Values for a collection of fields, used to find records with “matching” values of those fields
  • Index entry
    • A pair (search key value, database pointer)
  • Index structure
    • A data structure that allows us to locate the records that satisfy a search key condition
  • Primary vs secondary index
    • A primary index returns the database address of a record
    • A secondary index returns the value of the record's primary key
  • Dense vs sparse
  • Clustered vs non-clustered

  3. Indexes on sequential files
  • An index file consists of index entry records
  • A sorted file of records is called a sequential file
    • the sorted file being indexed is called the data file
  • Records can be stored in
    • contiguous sequential blocks (enabling binary search!)
    • a linked list of blocks
  • The index file is sorted in the same order as the data file
  • Dense indexes
    • have an index entry for each record in the data file
    • the index file is likely much smaller than the data file
      • it may fit in main memory
    • useful for finding records and answering some queries without reading the (likely larger) data file
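
A minimal sketch of the dense-index lookup just described, assuming the whole index fits in main memory; the block layout and the names (data_blocks, dense_index, lookup) are illustrative, not from the slides.

  import bisect

  # The sequential (data) file: sorted records packed into blocks.
  data_blocks = [
      [(3, "rec3"), (5, "rec5")],
      [(11, "rec11"), (30, "rec30")],
      [(35, "rec35"), (100, "rec100")],
  ]

  # Dense index: one (search key, block pointer) entry per record,
  # sorted in the same order as the data file.
  dense_index = [(key, b) for b, block in enumerate(data_blocks)
                 for key, _ in block]

  def lookup(key):
      """Binary-search the index, then read only one data block."""
      keys = [k for k, _ in dense_index]
      i = bisect.bisect_left(keys, key)
      if i < len(dense_index) and dense_index[i][0] == key:
          for k, rec in data_blocks[dense_index[i][1]]:
              if k == key:
                  return rec
      return None

  print(lookup(30))   # -> rec30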

  4. Indexes on sequential files
  • Three major approaches to reorganizing a data file
    • use overflow blocks
    • insert new blocks in the sequential order
    • move records between adjacent blocks
  • Since an index file, dense or sparse, is sorted on its search key, it is itself a sequential file
    • apply the same principles to it as to the data file

  5. Secondary indexes
  • Primary indexes determine the location of the indexed records
  • Secondary indexes do not determine the location of the indexed records
  • Secondary indexes are always dense, typically with duplicates
  • Indirection for secondary indexes
    • to avoid repeating the search key for duplicates, we can keep, for each distinct search key value, a bucket with references to all the records with that value
    • the idea naturally leads to inverted indexes in Information Retrieval
      • for each word, keep a secondary index of the documents that contain it
      • combine all these indexes into one
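
A small sketch of the bucket indirection idea and the inverted index it leads to; the bank-style records, document texts, and names are made up for illustration.

  from collections import defaultdict

  # Secondary index with indirection: each distinct search key value
  # maps to one bucket of record pointers (record ids stand in for
  # database pointers), so duplicates never repeat the key itself.
  records = {
      1: ("Downtown", 500),
      2: ("Brighton", 700),
      3: ("Downtown", 350),
      4: ("Mianus", 900),
  }

  secondary = defaultdict(list)
  for rid, (branch, _) in records.items():
      secondary[branch].append(rid)

  print(secondary["Downtown"])        # -> [1, 3]

  # The same shape gives an inverted index: word -> documents using it.
  docs = {10: "b-tree index", 20: "hash index"}
  inverted = defaultdict(set)
  for doc_id, text in docs.items():
      for word in text.split():
          inverted[word].add(doc_id)

  print(sorted(inverted["index"]))    # -> [10, 20]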

  6. Conventional indexes
  • Advantages
    • simple
    • the index is a sequential file
    • good for scans
  • Disadvantages
    • inserts are expensive
    • lose sequential order
    • become imbalanced

  7. B-trees
  • An index that organizes its blocks into a tree
  • A B-tree with parameter n
    • is a balanced tree, since all leaves are required to be at the same level
    • each node has space for n search keys and n+1 pointers to blocks of the tree
    • each internal node has at least ⌈(n+1)/2⌉ children
    • each leaf holds at least ⌊(n+1)/2⌋ keys from the data file
    • the root has at least two children
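
The minimum occupancies follow directly from these invariants; a quick helper makes them concrete (conventions for the parameter n vary between textbooks, so this follows the slide's convention):

  import math

  def btree_bounds(n):
      # Minimum and maximum occupancy for a B-tree with parameter n.
      return {
          "max keys per node":       n,
          "max children per node":   n + 1,
          "min children (internal)": math.ceil((n + 1) / 2),
          "min keys (leaf)":         math.floor((n + 1) / 2),
          "min children (root)":     2,
      }

  print(btree_bounds(3))
  # -> {'max keys per node': 3, 'max children per node': 4,
  #     'min children (internal)': 2, 'min keys (leaf)': 2,
  #     'min children (root)': 2}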

  8. B-trees
  • The keys in each node are stored in increasing sorted order
  • The keys in the leaves
    • are copies of keys from the data file
    • appear across all the leaves in increasing sorted order
  • Each key K in a non-leaf node
    • is flanked by two pointers to subtrees
    • the subtree to its left contains keys < K
    • the subtree to its right contains keys >= K
    • K is the smallest key appearing among the leaves of the right subtree

  9. B-tree example (n = 3)
  [Figure: root [100]; internal nodes [30] and [120 150 180]; leaves [3 5 11], [30 35], [100 101 110], [120 130], [150 156 179], [180 200]]

  10. B-tree example – non-leaf node
  [Figure: a non-leaf node holding keys 57, 81, 95; its four pointers lead to keys k < 57, 57 <= k < 81, 81 <= k < 95, and k >= 95]

  11. B-tree example – leaf node
  [Figure: a leaf node holding keys 57, 81, 95; each key has a pointer to the record with that key, and the last pointer leads to the next leaf in sequence]

  12. Operations on B-trees
  • Lookup
    • equality queries: traverse the tree from the root down to the appropriate leaf, using the keys stored at the nodes to decide which subtree(s) to visit
    • range queries, in which records with search keys in a given range are sought
  • Insertion
    • find the leaf block B in which to insert search key K
    • if B is not full, insert K there
    • if B is full, create a new block B’, distribute the keys among the two blocks, and recursively insert a search key for B’ into the parent of B
    • update the keys that appear on the paths from B (and B’) to the root
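
A runnable sketch of equality and range lookup over the n = 3 example tree of slide 9, using the key conventions of slide 8 (left subtree < K, right subtree >= K). The dict-based nodes and helper names are illustrative; a real B-tree reads nodes from disk blocks.

  def leaf(keys, nxt=None):
      return {"keys": keys, "next": nxt}

  l6 = leaf([180, 200])
  l5 = leaf([150, 156, 179], l6)
  l4 = leaf([120, 130], l5)
  l3 = leaf([100, 101, 110], l4)
  l2 = leaf([30, 35], l3)
  l1 = leaf([3, 5, 11], l2)

  root = {"keys": [100],
          "children": [{"keys": [30], "children": [l1, l2]},
                       {"keys": [120, 150, 180], "children": [l3, l4, l5, l6]}]}

  def find_leaf(node, key):
      # Descend from the root; the subtree right of separator K holds keys >= K.
      while "children" in node:
          i = 0
          while i < len(node["keys"]) and key >= node["keys"][i]:
              i += 1
          node = node["children"][i]
      return node

  def lookup(key):                      # equality query
      return key in find_leaf(root, key)["keys"]

  def range_query(lo, hi):              # range query: follow the leaf chain
      node, out = find_leaf(root, lo), []
      while node is not None:
          for k in node["keys"]:
              if k > hi:
                  return out
              if k >= lo:
                  out.append(k)
          node = node["next"]
      return out

  print(lookup(156))            # -> True
  print(range_query(30, 110))   # -> [30, 35, 100, 101, 110]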

  13. Operations on B-trees
  • Deletion
    • to delete search key K, locate the leaf block B that contains K
    • delete the entry for K from B
    • if B now has fewer than the required keys/pointers
      • if a sibling block B’ has an extra entry to spare, move it to B
      • else, coalesce by moving all the entries from B to one of its siblings and recursively delete the entry for B from its parent
    • update the keys along the paths from the modified blocks to the root
  • Coalescing is often not implemented
    • too hard for too little benefit
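
The borrow-vs-coalesce decision in isolation, with Python lists standing in for leaf blocks; it assumes the sibling is B's left sibling, and all names and return values are illustrative.

  import math

  def handle_underflow(B, left_sibling, n):
      # After deleting from leaf B, restore the minimum occupancy
      # floor((n+1)/2) required by the B-tree invariant.
      min_keys = math.floor((n + 1) / 2)
      if len(B) >= min_keys:
          return "ok"                          # no underflow
      if len(left_sibling) > min_keys:         # sibling has an entry to spare
          B.insert(0, left_sibling.pop())      # borrow its largest key
          return "borrowed (update parent separator)"
      left_sibling.extend(B)                   # coalesce B into the sibling,
      B.clear()                                # then delete B's parent entry
      return "coalesced (delete parent entry)"

  B, sib = [150], [120, 130, 140]
  print(handle_underflow(B, sib, 3), B, sib)
  # -> borrowed (update parent separator) [140, 150] [120, 130]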

  14. B-trees vs Indexed Sequential Files
  • Concurrency control is harder in B-trees
  • B-trees consume more space
  • A static index saves enough accesses to allow for reorganization
  • On the other hand
    • the DBA does not know when to reorganize
    • the DBA does not know how full to load the pages of a new index
  • Buffering
    • a B-tree has fixed buffer requirements
    • a static index must read several overflow blocks to be efficient (large, variable-size buffers are needed for this)
    • LRU is bad for B-trees (always keep the root block in memory)
  • B-trees are better

  15. Interesting problems
  • What is a good index structure when
    • records tend to be inserted with keys larger than the existing values?
    • we want to remove older data?
  • What is the best n
    • as disks get bigger, main memories increase, and CPUs get faster?

  16. Static hashing
  • Maintain an array of buckets, each bucket being a chain of blocks
  • Use a hash function h to compute the bucket index for the record(s) with a given search key K
    • search the chain of blocks of bucket h(K) for the sought records
  • Inserts/deletes are basic linked-list inserts/deletes
  • Collisions can deteriorate performance by producing a few long chains

  17. Static hashing
  • Uses a fixed hash function and number of buckets
    • if #buckets is too small, performance degrades due to frequent overflows and long chains
    • if #buckets is made large to anticipate future growth, secondary storage is wasted initially and whenever the data file shrinks
  • One rather expensive option is to periodically rebuild the hash index
  • Another option is to devise hashing that allows the hash function or #buckets to change dynamically with small overhead
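
A static hash index in a few lines, with a Python list standing in for each chain of blocks; the bucket count and hash function are illustrative.

  NUM_BUCKETS = 8                 # fixed when the index is created

  def h(key):
      return hash(key) % NUM_BUCKETS

  buckets = [[] for _ in range(NUM_BUCKETS)]

  def insert(key, rec):
      buckets[h(key)].append((key, rec))      # the chain grows on overflow

  def lookup(key):
      # Search the chain of bucket h(K) for the sought records.
      return [rec for k, rec in buckets[h(key)] if k == key]

  insert(42, "rec42")
  insert(7, "rec7")
  print(lookup(42))               # -> ['rec42']
  # With too few buckets, every chain grows long and lookups degrade to scans.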

  18. Extensible hashing
  • Uses a hash function h that generates values over [0, 2^b)
    • at any time only a prefix of i bits (the high-order bits) of the hash is used
    • i is initially 0 and changes as the size of the data file changes
  • Maintains a bucket array with 2^i pointers to buckets
    • the i-bit prefix h(K)[i] is used to index into the bucket array
    • multiple entries of the bucket array may point to the same bucket
  • Each bucket j stores with it its bucket prefix i_j
    • all the keys it contains have the same value for their i_j-bit prefix
    • there are 2^(i-i_j) pointers to bucket j from the bucket array
  • The number of distinct buckets changes dynamically due to coalescing and splitting of buckets

  19. Extensible hashing

  20. Operations with extensible hashing
  • Lookup of search key K
    • use the i high-order bits h(K)[i] of the hash to index into the bucket array and follow the pointer to the bucket
  • Insertion of search key K into bucket j
    • if there is room in bucket j, insert the record there
    • else split bucket j and attempt the insertion again (use overflow buckets if it is full again)
  • Deletion of search key K from bucket j
    • delete K from bucket j
    • remove bucket j if it is empty, by nullifying the relevant bucket array entry
    • coalescing of buckets is possible
      • a bucket can coalesce only with a “buddy” bucket having the same i_j value and the same (i_j - 1)-bit prefix, if such a bucket is present
    • decreasing the bucket array size is also possible

  21. Splitting buckets in extensible hashing
  • Splitting bucket j depends on the number of pointers to it in the bucket array
  • If i > i_j (more than one pointer to bucket j)
    • allocate a new bucket z with bucket prefix i_j + 1
    • set the bucket prefix of j to i_j + 1
    • make the upper half of the bucket array entries pointing to j now point to z
    • remove and reinsert each record of bucket j
  • If i = i_j (only one pointer to bucket j)
    • increment i and double the size of the bucket array
    • replace each entry of the bucket array by two entries pointing to the same bucket
    • now i > i_j, so split as above
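
A compact sketch combining slides 18, 20, and 21, using the prefix (high-order bits) convention of slide 18. B_BITS, CAPACITY, and the class layout are illustrative assumptions; the pathological case where more than CAPACITY keys share a full b-bit hash would need the overflow buckets mentioned on slide 20.

  B_BITS = 8                      # hash values lie in [0, 2**B_BITS)
  CAPACITY = 2                    # records per bucket

  def h(key):
      return hash(key) % (2 ** B_BITS)

  class Bucket:
      def __init__(self, depth):
          self.depth = depth      # the bucket prefix i_j
          self.items = {}         # key -> record

  class ExtensibleHash:
      def __init__(self):
          self.i = 0                           # global prefix length
          self.dir = [Bucket(0)]               # bucket array, 2**i entries

      def _index(self, key):                   # the i high-order bits h(K)[i]
          return h(key) >> (B_BITS - self.i)

      def lookup(self, key):
          return self.dir[self._index(key)].items.get(key)

      def insert(self, key, rec):
          b = self.dir[self._index(key)]
          if key in b.items or len(b.items) < CAPACITY:
              b.items[key] = rec
              return
          if b.depth == self.i:                # only one pointer to b:
              self.dir = [p for p in self.dir for _ in (0, 1)]
              self.i += 1                      # double the bucket array
          b.depth += 1                         # split b: new bucket z takes
          z = Bucket(b.depth)                  # the upper half of b's entries
          idxs = [j for j, p in enumerate(self.dir) if p is b]
          for j in idxs[len(idxs) // 2:]:
              self.dir[j] = z
          old, b.items = b.items, {}
          for k, v in old.items():             # remove and reinsert records
              self.dir[self._index(k)].items[k] = v
          self.insert(key, rec)                # retry (may split again)

  eh = ExtensibleHash()
  for k in (16, 80, 144, 208, 240):            # small ints hash to themselves
      eh.insert(k, "rec%d" % k)
  print(eh.i, eh.lookup(80))                   # -> 2 rec80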

  22. Extensible hashing - example
  [Figure: initial extensible hash structure, bucket size = 2 records]

  23. Extensible hashing – example
  • [Figure: after inserting one Brighton record and two Downtown records]
  • [Figure: after then inserting a Mianus record]

  24. Extensible hashing – example
  [Figure: after inserting three Perryridge records]

  25. Extensible hashing – example
  [Figure: the hash structure after inserting Redwood and Round Hill records]

  26. Extensible hashing
  • Advantages
    • hash performance does not degrade as the file grows
    • minimal space overhead
    • no full reorganizations
  • Disadvantages
    • an extra level of indirection to find the desired record
      • not bad if the bucket array is in memory
    • the bucket array may itself become very big (larger than memory)
      • then a tree structure is needed to locate the desired record in the structure!
    • changing the size of the bucket array is an expensive operation
  • Linear hashing is an alternative mechanism that avoids these disadvantages, at the possible cost of more bucket overflows

  27. Linear hashing
  • Uses a hash function h that generates values over [0, 2^b)
    • uses the suffix h(K)[i] of the i low-order bits of the hash h(K)
  • Maintains a bucket array with n buckets
  • Attempts to limit the load factor r/n, where r = #records
  • Linear hashing has all the advantages of extensible hashing, and
    • grows the #buckets more slowly
    • is less expensive when changing the size of the bucket array
    • long chains are still possible

  28. Operations for linear hashing
  • Lookup of the bucket for search key K
    • if m = h(K)[i] < n, return bucket m; else return bucket m - 2^(i-1)
  • Insertion of a record with search key K
    • look up the bucket j for K
    • if bucket j has space, insert the record there; else chain a new block to the bucket and insert the record there
    • if the load factor is above a threshold
      • add a new bucket to the bucket array; its number is 1a_2a_3…a_i in binary
      • split the records of bucket 0a_2a_3…a_i with the new bucket
      • if n = 2^i, then increment i by 1
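
A sketch of these rules with buckets as Python lists (each list stands in for a chain of blocks); the load-factor threshold and all names are illustrative.

  def h(key):
      return hash(key) % 256          # values over [0, 2**b) with b = 8

  class LinearHash:
      def __init__(self, threshold=1.5):
          self.i = 1                  # low-order bits of the hash in use
          self.buckets = [[]]         # n = 1 bucket initially
          self.r = 0                  # number of records
          self.threshold = threshold  # bound on the load factor r/n

      def _bucket(self, key):
          m = h(key) & ((1 << self.i) - 1)     # m = h(K)[i]
          if m >= len(self.buckets):           # that bucket does not exist yet
              m -= 1 << (self.i - 1)           # use bucket m - 2**(i-1)
          return m

      def lookup(self, key):
          return [r for k, r in self.buckets[self._bucket(key)] if k == key]

      def insert(self, key, rec):
          self.buckets[self._bucket(key)].append((key, rec))
          self.r += 1
          if self.r / len(self.buckets) > self.threshold:
              self._grow()

      def _grow(self):
          n = len(self.buckets)                # the new bucket gets number n
          if n == (1 << self.i):               # n = 2**i: need one more bit
              self.i += 1
          self.buckets.append([])
          buddy = n - (1 << (self.i - 1))      # bucket 0a2..ai vs 1a2..ai
          old, self.buckets[buddy] = self.buckets[buddy], []
          for k, r in old:                     # split the buddy's records
              self.buckets[self._bucket(k)].append((k, r))

  lh = LinearHash()
  for k in range(8):
      lh.insert(k, "rec%d" % k)
  print(len(lh.buckets), lh.lookup(5))         # -> 6 ['rec5']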

  29. Indexing vs hashing
  • Hashing is good for probes given a key
    • SELECT * FROM R WHERE R.K = 5;
  • Indexing is good for range queries
    • SELECT * FROM R WHERE R.K BETWEEN 5 AND 10;

  30. Index definition in SQL
  • CREATE [UNIQUE] INDEX idxname ON rname(attrs);
  • DROP INDEX idxname;
