300 likes | 312 Views
Learn about index structures for efficient record retrieval, including indexed sequential files and B-trees. Understand their advantages, disadvantages, and operations.
E N D
Terms • Search key • Values for a collection of fields used to find records with “matching” values of these fields • Index entry • Pair (search key value, database pointer) • Index structure • A data structure that allows us to locate a record that satisfies the search key conditions • Primary vs secondary index • Primary indexes return database address of record • Secondary return value of primary key of record • Dense vs sparse • Clustered vs non-clustered
Indexes on sequential files • A index file consists of index entry records • A sorted file of records is called a sequential file • the sorted file itself is called the data file • Records could be stored in • contiguous sequence blocks (can do binary search!) • a linked list of blocks • Index file is sorted in the same order as the data file • Dense indexes • Have an index entry for each record in the data file • Index file likely much smaller than data file • May fit in main memory • Useful to find records and answer some queries without reading the (likely larger) data file
Indexes on sequential files • Three major approaches to reorganize a data file • Use overflow blocks • Insert new blocks in the sequential order • Move records between adjacent block • Since an index file, dense or sparse, is sorted on its search key, it is a sequential file itself • Apply the same principles to it as for the data file
Secondary indexes • Primary indexes determine the location of the indexed records • Secondary indexes do not determine the location of the indexed records • Secondary indexes are always dense, typically with duplicates • Indirection for secondary indexes • To avoid repeating the search key for duplicates, we can have for each distinct search key value a bucket with references to all the records with that search key value • idea naturally leads to inverted indexes in Information Retrieval • For each word, have a secondary index of documents that contain it • Combine all these indexes into one
Conventional indexes • Advantages • Simple • Index is a sequential file • Good for scans • Disadvantages • Inserts are expensive • Loose sequential order • Become imbalanced
B-trees • An index that organizes its blocks into a tree • A B-tree with parameter n • Is a balanced tree since all leaves are required to be at the same level • Each node has space for n search keys and n+1 database pointers to blocks of the tree • All internal nodes have more than ceil(n+1/2) children • Each leaf has no less than floor(n+1/2) keys from the data file • Root has at least two children
B-trees • The keys in each node are stored in increasing sorted order • The keys in the leafs • Are copies of keys from the data file • Appear among all the leaves in increasing sorted order • Each key K in a non-leaf node • Is flanked by two pointers to subtrees • The subtree to its left contains keys < K • The subtree to its right contains keys >= K • K is the smallest key appearing among the leaves of the right subtree
B-tree example n=3 Root 100 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35
B-tree example - non-leaf node from non-leaf node 57 81 95 to keys 95 to keys 81 k<95 to keys 57 k<81 to keys < 57
B-tree example – leaf node from non-leaf node 57 81 95 to next leaf in sequence To record with key 57 with key 81 with key 85
Operations and B-trees • Lookup • Equality queries • Traverse the tree from the root down to an appropriate leaf using the keys stored at the nodes to decide which subtree(s) to visit • Range queries in which records with search keys in a given range are sought for • Insertion • Find leaf block B to insert search key K • If B is not full insert K there • If B is full, create new block B’, distribute the keys among the two blocks, and recursively insert a search key for B’ to the parent of B, • update the keys that appear on the paths from B (and B’) to the root
Operations and B-trees • Deletion • To delete search key K, locate leaf block B that contains K • Delete entry for K from B • If B now has fewer than the required keys/pointers • If sibling node B’ has an extra entry that could give to B, move it to B • Else, coalesce by moving all the entries from B to one of its siblings and recursively delete the entry for B from its parent • Update the keys along the path from the modified blocks to the root • Coalescing is often not implemented • Too hard for little benefits
B-trees vs Indexed Sequential Files • Concurrency control harder in B-trees • B-trees consume more space • Static index saves enough accesses to allow for reorganization • On the other hand • DBA does not know when to reorganize • DBA does not know how full to load pages of new index • Buffering • B-tree has fixed buffer requirements • Static index must read several overflow blocks to be efficient (large & variable size buffers needed for this) • LRU is bad for B-trees (always keep the root block in memory) • B-trees are better
Interesting problems • What is a good index structure when • records tend to be inserted with keys that are larger than existing values? • we want to remove older data • What is the best n? • As disks get bigger, main memories increase, and CPUs get faster?
Static hashing • Maintain an array of buckets, each bucket being a chain of blocks • Use a hash function to compute the bucket index for record(s) with given search key K • Search chain of blocks of bucket h(k) for the sought records • Inserts/deletes are basic linked list insert/deletes • Collisions can lead to deteriorated performance by having few long chains
Static hashing • Uses fixed hash function and number of buckets • If #buckets is too small, performance will degrade due to frequent overflows and long chains • If #buckets is made large to anticipate future growth, secondary storage is wasted initially or whenever the data file shrinks • One rather expensive option is to periodically rebuild the hash index • Another option would be to devise hashing that allows for the hash function or #buckets to change dynamically with small overhead
Extensible hashing • Uses a hash function h that generates values over [0, 2b) • at any time it uses only a prefix (high order bits) of i bits of the hash • i is initially 0 and changes as the size of the data file changes • it maintains an bucket array with 2i pointers to buckets • the i-prefix h(K)[i] is used to index into the bucket array • Multiple entries of the bucket array may point to the same bucket • Each bucket j stores with it the bucket prefix ij • All the keys it contains have the same value for their ij prefix • There are 2 (i-ij) pointers to bucket j from the bucket array • The number of distinct buckets changes dynamically due to coalescing and splitting of buckets.
Operations with extensible hashing • Lookup bucket for search key K • Use the i high order bits h(K)[i] of the hash to index into the bucket array and get the bucket • Inserting search key K into bucket j • If there is room in the bucket j insert record in the bucket • Else split the bucket j and attempt the insertion again (use overflow buckets if full again) • Deleting search key K from bucket j • Delete K from bucket j • Remove the bucket j if empty, by nullifying the relevant bucket array entry • Coalescing of buckets can be done • can coalesce only with a “buddy” bucket having same value of ij and same ij –1 prefix, if it is present • Decreasing bucket array size is also possible
Splitting buckets in extensible hashing • Splitting bucket j depends on the #pointers to it in the bucket array • If i > ij (more than one pointer to bucket j) • allocate a new bucket z with bucket prefix ij +1 • Set the bucket prefix of j to ij +1 • make the highest half of the bucket array entries pointing to j to now point to z • remove and reinsert each record in bucket j • If i = ij(only one pointer to bucket j) • increment i and double the size of the bucket array • replace each entry in the bucket array by two entries that point to the same bucket
Extensible hashing - example Initial Extendable Hash structure, bucket size = 2 records
Extensible hashing - example • After inserting of one Brighton and two Downtown records • After inserting of Mianus
Extensible hashing - example After inserting three Perryridge records
Extensible hashing - example Hash structure after insertion of Redwood and Round Hill records
Extensible hashing • Advantages • Hash performance does not degrade with growth of file • Minimal space overhead • No full reorganizations • Disadvantages • Extra level of indirection to find desired record • Not bad, if bucket array is in memory • Bucket array may itself become very big (larger than memory) • Need a tree structure to locate desired record in the structure! • Changing size of bucket array is an expensive operation • Linear hashingis an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows
Linear hashing • Hash function h that generates values over [0, 2b) • Uses suffix h(K)[i] of i low order bits of hash h(K) • Maintains array with n buckets • Attempts to limit the load factor r/n, where r=#records • Linear hashing has all the advantages of extensible hashing and • Grows the #buckets slower • Less expensive to change the size of the bucket array • Long chains still possible
Operations for Linear hashing • Lookup bucket for search key K • If m=h(K)[i] < n then return bucket m else return bucket m-2i-1 • Insert record with search key K • Lookup the bucket j for K • If bucket j has space insert it there, else chain a new block to the bucket and insert the record there • If the load factor is above a threshold • add a new bucket to the bucket array, whose number is 1a2a3…ai • split the records in bucket 0a2a3…ai with the new bucket • If n=2i then increment i by 1
Indexing vs hashing • Hashing good for probes given key • SELECT * FROM R WHERE R.K=5 • Indexing good for range queries • SELECT * FROM R WHERE R.K BETWEEN 5 AND 10;
Index definition in SQL • CREATE {UNIQUE} INDEX idxname ON rname(attrs) • DROP INDEX idxname;