Indexing and Hashing

Indexing and Hashing B.Ramamurthy Chapter 11 B.Ramamurthy

Representing Data • Attributes are represented in fixed or variable length collections called “fields” • Fields in turn are put into fixed or variable length collections called records. • Records are stored in physical blocks. • A collection of records that forms a relation is stored as a collection of blocks called a file. • This file different than OS file. How? • Organization is different. • Extra indices to accommodate easy search and access. B.Ramamurthy

Basic Concepts (indexing) • Indexing works the same way as a catalog for a book in a library. • Indexing needs to be efficient to allow fast access to records. • Two types of indices: • ordered indices and • hash indices B.Ramamurthy

Techniques and Evaluation • Access types : types of accesses that are supported efficiently. Search by specific value or by range. • Access time: Time sit takes to find a particular data or a set of data. • Insertion time: Time it takes to insert a new item. • Deletion time: Time it takes to delete an item. • Space overhead : Additional space occupied by the index structure. B.Ramamurthy

Ordered Indices • To gain fast access to records in a file we can use an index structure. • If the file containing the records is sequentially ordered, the index whose search key specifies the sequential order of the file is the primary key index. • Primary key indices are also called clustering indices. B.Ramamurthy

Primary Index • Assume that all files are ordered sequentially on some search key. • Such files, with primary key on the search key, are called index-sequential files. • These files accommodate both sequential and random access to individual records. B.Ramamurthy

Dense and Sparse Index • Dense index: • An index record appears for every search key value in the file. • The index record contains the search key and a pointer to the first data record with that search-key value. • Sparse index: • An index is created only for a few values. Each index contains a value and pointer to first record that contains that value. B.Ramamurthy

Brighton Brighton A-217 750 Downtown Downtown A-101 500 Mianus Downtown A-110 600 Perryridge Mianus A-215 700 Redwood Perryridge A-102 400 Round Hill Perryridge A-201 900 Perryridge A-218 700 Redwood A-222 700 Round Hill A-305 350 Dense Index B.Ramamurthy

Brighton A-217 750 Downtown A-101 500 Downtown A-110 600 Mianus A-215 700 Perryridge A-102 400 Perryridge A-201 900 Perryridge A-218 700 Redwood A-222 700 Round Hill A-305 350 Sparse Index Brighton Mianus Redwood Which one is better? Dense or sparse? It is a trade off Between access time and space overhead. B.Ramamurthy

Multi-level Indices • Indices themselves may become too large for efficient processing. • Example: • Consider file with 100000 records with 10 records in a block. • With sparse index and one index per block we have about 10,000 indices. • Assuming 100 indices fit into a block we need about 100 blocks. • It is desirable to keep the index file in the main memory. • Problem: Searching a large index file becomes expensive. B.Ramamurthy

Multi-level Index • Solution: Index the index file. We treat the index as we would treat any other sequential file and construct a sparse index on the primary index. • We binary-search the outer level index to find the largest search key less than or equal to the one we desire. • Two-level sparse index ; Figure 11.4 B.Ramamurthy

Secondary Index • Secondary index is on attributes whose values are not stored sequentially. • If the search key of a secondary index is not a candidate key, the index needs to be dense too. • We can use an extra level of indirection with buckets at the second level. • See fig.11.5 B.Ramamurthy

Brighton A-217 750 Downtown A-101 500 Downtown A-110 600 Mianus A-215 700 Perryridge A-102 400 Perryridge A-201 900 Perryridge A-218 700 Redwood A-222 700 Round Hill A-305 350 Secondary Index 350 400 500 600 700 750 900 B.Ramamurthy

B+ Tree Index Files • Main disadvantage of the index-sequential file organization is that performance degrades as the file grows both for index lookups and sequential scans. • B+ tree index structure is most widely used of several index structures that maintain their efficiency despite insertion and deletion of data. B.Ramamurthy

B+ Tree Index files • A B+ index tree is a balanced tree in which every path from root to leaf is of same length and each non-leaf node has between ceiling(n/2) and n nodes where n is fixed. • Typical node is a B+ tree: n-1 search keys K1, K2,… Kn-1 n pointers P1, P2, …Pn B.Ramamurthy

P1 K1 P2 K2 …… Pn-1 Kn-1 Pn B+ Tree Node B.Ramamurthy

B+ Tree (contd.) • Structure of a B+ tree • Queries on B+ trees • Updates on B+ trees (insertion , deletion) • B+ file organization • B Tree variation of B+ tree : avoiding redundancy B.Ramamurthy

Hashing • Can we avoid the IO operations that the result from accessing the index file? • Hashing offers a way. • It also provides a way of constructing indices (which need nor be sequential). • We will study static and dynamic hashing. B.Ramamurthy

Hash File Organization • Address of the disk block containing a desired record is computed using a function (hash function) and the search key. • Let K denote set of all search keys, B denote set of all bucket addresses. Hash function h is a function that maps K to B. • Bucket is typically a disk block. B.Ramamurthy

Operations • To insert a record with Ki as key, compute h(Ki) which gives the address of the bucket for the record. If there is space in the bucket then it is stored that bucket. (else chaining?) • To lookup a record with key Ki, compute h(ki). Check with every record in the bucket to obtain the record. • To delete a similar hash, find and delete is followed. B.Ramamurthy

Hash Functions • Hash function should be chosen so that • The distribution of records is uniform. • The distribution is random. • Handling bucket overflows: • May occur due to insufficient number of buckets. • Due to bucket skew. • Solution: Overflow buckets, chaining, double hashing, linear probing, quadratic probing B.Ramamurthy

Hash Indices • Hashing can be used for organizing indices.Hash index organizes search keys with their associated pointers. • See Fig.11.22 • Typically only secondary indices need to be organized using hashing. B.Ramamurthy

Dynamic Hashing • Many of today’s databases grow very large in (a short) time. • If you use static hash function we have three option: • Choose hash function based on current size, • Choose hash function based on anticipated size. • Periodically restructure the hash file in response to growth. • Another solution: dynamic hashing. B.Ramamurthy

Dynamic Hash Techniques • Dynamic hash techniques allow the hash function to be modified dynamically to accommodate the growth and shrinkage of the database. • It is also known as extendable hashing. • Extendable hashing copes with the growth in the database size by splitting and coalescing buckets as the database grows and shrinks. B.Ramamurthy

Indexing and Hashing