220 likes | 336 Views
Quick Review of Apr 10 material. B+-Tree File Organization similar to B+-tree index leaf nodes store records, not pointers to records stored in an original file leaf and interior nodes are different B-trees search key values appear only once
E N D
Quick Review of Apr 10 material • B+-Tree File Organization • similar to B+-tree index • leaf nodes store records, not pointers to records stored in an original file • leaf and interior nodes are different • B-trees • search key values appear only once • pointer to record/bucket for that search key value always stored with the search key itself, even in interior nodes • Hashing Overview • Hash functions • ideally uniform, random, easy to compute
Today • Overflow • Hash file performance • Hash indices • Dynamic Hashing (Extendable Hashing) • Note: HW#3 due next class (April 17) • HW #4: due Thursday April 24 (9 days from now) • Questions: 12.11, 12.12, 12.13, 12.16
Overflow • Overflow is when an insertion into a bucket can’t occur because it is full. • Overflow can occur for the following reasons: • too many records (not enough buckets) • poor hash function • skewed data: • multiple records might have the same search key • multiple search keys might be assigned the same bucket
Overflow (2) • Overflow is handled by one of two methods • chaining of multiple blocks in a bucket, by attaching a number of overflow buckets together in a linked list • double hashing: use a second hash function to find another (hopefully non-full) bucket • in theory we could use the next bucket that had space; this is often called open hashing or linear probing. This is often used to construct symbol tables for compilers • useful where deletion does not occur • deletion is very awkward with linear probing, so it isn’t useful in most database applications
Hashed File Performance Metrics • An important performance measure is the loading factor (number of records)/(B*f) B is the number of buckets f is the number of records that will fit in a single bucket • when loading factor too high (file becomes too full), double the number of buckets and rehash
Hashed File Performance (Assume that the hash table is in main memory) • Successful search: best case 1 block; worst case every chained bucket; average case half of worst case • Unsuccessful search: always hits every chained bucket (best case, worst case, average case) • With loading factor around 90% and a good hashing function, average is about 1.2 blocks • Advantage of hashing: very fast for exact queries • Disadvantage: records are not sorted in any order. As a result, it is effectively impossible to do range queries
Hash Indices • Hashing can be used for index-structure creation as well as for file organization • A hash index organizes the search keys (and their record pointers) into a hash file structure • strictly speaking, a hash index is always a secondary index • if the primary file was stored using the same hash function, an additional, separate primary hash index would be unnecessary • We use the term hash index to refer both to secondary hash indices and to files organized using hashing file structures
Example of a Hash Index Hash index into file account, on search key account-number; Hash function computes sum of digits in account number modulo 7. Bucket size is 2
Static Hashing • We’ve been discussing static hashing: the hash function maps search-key values to a fixed set of buckets. This has some disadvantages: • databases grow with time. Once buckets start to overflow, performance will degrade • if we attempt to anticipate some future file size and allocate sufficient buckets for that expected size when we build the database initially, we will waste lots of space • if the database ever shrinks, space will be wasted • periodic reorganization avoids these problems, but is very expensive • By using techniques that allow us to modify the number of buckets dynamically (“dynamic hashing”) we can avoid these problems • Good for databases that grow and shrink in size • Allows the hash function to be modified dynamically
Dynamic Hashing • One form of dynamic hashing is extendable hashing • hash function generates values over a large range -- typically b-bit integers, with b being something like 32 • At any given moment, only a prefix of the hash function is used to index into a table of bucket addresses • With the prefix at a given moment being j, with 0<=j<=32, the bucket address table size is 2j • Value of j grows and shrinks as the size of the database grows and shrinks • Multiple entries in the bucket address table may point to a bucket • Thus the actual number of buckets is < 2j • the number of buckets also changes dynamically due to coalescing and splitting of buckets
Use of Extendable Hash Structure • Each bucket j stores a value ij; all the entries that point to the same bucket have the same values on the first ij bits • To locate the bucket containing search key Kj; • compute H(Kj) = X • Use the first i high order bits of X as a displacement into the bucket address table and follow the pointer to the appropriate bucket • T insert a record with search-key value Kj • follow lookup procedure to locate the bucket, say j • if there is room in bucket j, insert the record • Otherwise the bucket must be split and insertion reattempted • in some cases we use overflow buckets instead (as explained shortly)
Splitting in Extendable Hash Structure To split a bucket j when inserting a record with search-key value Kj • if i> ij (more than one pointer in to bucket j) • allocate a new bucket z • set ij and iz to the old value ij incremented by one • update the bucket address table (change the second half of the set of entries pointing to j so that they now point to z) • remove all the entries in j and rehash them so that they either fall in z or j • reattempt the insert (Kj). If the bucket is still full, repeat the above.
Splitting in Extendable Hash Structure (2) To split a bucket j when inserting a record with search-key value Kj • if i= ij (only one pointer in to bucket j) • increment i and double the size of the bucket address table • replace each entry in the bucket address table with two entries that point to the same bucket • recompute new bucket address table entry for Kj • now i> ij so use the first case described earlier • When inserting a value, if the bucket is still full after several splits (that is, i reaches some preset value b), give up and create an overflow bucket rather than splitting the bucket entry table further • how might this occur?
Deletion in Extendable Hash Structure To delete a key value Kj • locate it in its bucket and remove it • the bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table) • coalescing of buckets is possible • can only coalesce with a “buddy” bucket having the same value of ij and same ij -1prefix, if one such bucket exists • decreasing bucket address table size is also possible • very expensive • should only be done if the number of buckets becomes much smaller than the size of the table
Extendable Hash Structure Example Hash function on branch name Initial hash table (empty)
Extendable Hash Structure Example (2) Hash structure after insertion of one Brighton and two Downtown records
Extendable Hash Structure Example (3) Hash structure after insertion of Mianus record
Extendable Hash Structure Example (4) Hash structure after insertion of three Perryridge records
Extendable Hash Structure Example (5) Hash structure after insertion of Redwood and Round Hill records
Extendable Hashing vs. Other Hashing • Benefits of extendable hashing: • hash performance doesn’t degrade with growth of file • minimal space overhead • Disadvantages of extendable hashing • extra level of indirection (bucket address table) to find desired record • bucket address table may itself become very big (larger than memory) • need a tree structure to locate desired record in the structure! • Changing size of bucket address table is an expensive operation • Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows
Comparison:Ordered Indexing vs. Hashing • Each scheme has advantages for some operations and situations. To choose wisely between different schemes we need to consider: • cost of periodic reorganization • relative frequency of insertions and deletions • is it desirable to optimize average access time at the expense of worst-case access time? • What types of queries do we expect? • Hashing is generally better at retrieving records for a specific key value • Ordered indices are better for range queries