Storage and Retrieval Structures

Storage and Retrieval Structures by Ron Peterson

Overview • Storage & Retrieval as an ADT • Simple implementations • Arrays of records • Sorted arrays • Trees • Efficiency issues • Hash tables

S & R ADT • A container with a bunch of records • Each record has a “key” field • Operations: • Add a record • Remove a record by key • Find a record by key, retrieve a copy
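
As a rough illustration of the ADT above, the three operations can be written as an abstract interface. This is a minimal sketch: the names RecordStore and DictStore are hypothetical, and the dictionary-backed class is only a stand-in used to exercise the interface, not one of the implementations discussed in these slides.

```python
from abc import ABC, abstractmethod
import copy


class RecordStore(ABC):
    """Hypothetical interface for the Storage & Retrieval ADT."""

    @abstractmethod
    def add(self, key, record):
        """Add a record under its key."""

    @abstractmethod
    def remove(self, key):
        """Remove the record with this key; return True if it existed."""

    @abstractmethod
    def find(self, key):
        """Return a copy of the record with this key, or None."""


class DictStore(RecordStore):
    """Stand-in implementation (assumption: a Python dict holds the records)."""

    def __init__(self):
        self._items = {}

    def add(self, key, record):
        self._items[key] = record

    def remove(self, key):
        if key in self._items:
            del self._items[key]
            return True
        return False

    def find(self, key):
        if key not in self._items:
            return None
        # Return a copy so callers cannot change the stored record.
        return copy.deepcopy(self._items[key])
```

Returning a copy from find matches the "retrieve a copy" wording: changing the returned record leaves the stored one untouched.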

Simple Implementations • Array of records • Insert at end, • Find by linear search • Sorted array • Insert at the position that keeps the keys in order, • Find by binary search • Trees and balanced trees • We’ll study these later
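
The two array-based approaches can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function and variable names are made up for the example, and records are assumed to be (key, data) pairs with comparable keys.

```python
import bisect

# Unsorted array: insert at the end, find by linear search.
unsorted_records = []  # list of (key, data) tuples


def add_unsorted(key, data):
    unsorted_records.append((key, data))


def find_unsorted(key):
    for k, data in unsorted_records:  # examine every record in turn
        if k == key:
            return data
    return None


# Sorted array: keep keys in order, find by binary search.
sorted_keys = []   # keys, kept in ascending order
sorted_data = []   # sorted_data[i] belongs to sorted_keys[i]


def add_sorted(key, data):
    i = bisect.bisect_left(sorted_keys, key)  # binary search for the position
    sorted_keys.insert(i, key)                # shifts later items along
    sorted_data.insert(i, data)


def find_sorted(key):
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_data[i]
    return None
```

The insert calls in add_sorted are where the cost of a sorted array shows up: finding the position is fast, but every later element has to move over by one.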

Efficiency Issues • Unsorted arrays – O(N) retrieval • Sorted arrays – O(log N) retrieval, • but O(N) add (insertion) • Trees – O(log N) retrieval & add on average, • but backup issues, and a degenerate tree falls back to O(N) • Balanced trees – O(log N), • but complex & backup issues • Alternative: Hash table – O(1) (constant time) or close

Hash Table Motivation • What if we used an array, • but every record had a unique location? • For example, we have an array of employee records, and the key is Employee-ID, which goes from 1 to 300 • Employee 17 gets put in location 17 • Add and retrieve are each O(1) (constant time) • Problem: what if the Social Security number (SSN) is the key?
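
The employee example can be sketched directly: the key itself is the array index, so no searching is needed. The names below are hypothetical, and the sketch assumes every Employee-ID is an integer from 1 to 300.

```python
MAX_EMPLOYEE_ID = 300

# One slot per possible Employee-ID; slot 0 is left unused so that
# employee 17 lands in location 17.
employee_slots = [None] * (MAX_EMPLOYEE_ID + 1)


def add_employee(emp_id, record):
    if not 1 <= emp_id <= MAX_EMPLOYEE_ID:
        raise ValueError("Employee-ID out of range")
    employee_slots[emp_id] = record  # one array access, no search


def find_employee(emp_id):
    if not 1 <= emp_id <= MAX_EMPLOYEE_ID:
        return None
    return employee_slots[emp_id]    # one array access, no search
```

This only works because the range of keys is small. With a 9-digit key, the same layout would need one slot for every possible value.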

The Hash Table Solution • For SSN as the Employee-ID • (as might be needed for Payroll) • One slot per 9-digit ID would require an array of one billion slots; not feasible! • Instead, let’s still have an array of 300 (or a few more) slots and then figure out: • An easy “mapping” function: • LocationIndex = Hash(SSN)

Hash Table Issues • Coming up with a Hash function • Easy to calculate • Result in correct range • Minimize duplicate answers • Duplicates (“collisions”) inevitable • Many-to-one function (keys to location) • Need a plan for dealing with it • “collision handling”

Collision Handling • A collision occurs when you add a record and the location given by the Hash function already holds a record with a different key • The same situation comes up again when you retrieve any record that collided when it was added • Both cases must follow the same rule for where to look next, so that Find retraces the path Add took.

Collision Handling Methods • Just increment the location (with wrap-around) until you find an empty slot (or the key sought) • Called “linear probing” • Often a poor choice because it tends to create long runs of filled slots (“clustering”), which slow down later adds and finds • Jumps of increasing size (+ wrap-around); • Most common version is “quadratic probing” • Using an overflow area with links
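
Linear probing can be sketched as below. This is a minimal illustration under stated assumptions: keys are non-negative integers, the Hash function is key % size, and removed records leave a "deleted" marker so that later finds do not stop early. The class name and the marker objects are hypothetical.

```python
EMPTY = object()    # slot never used
DELETED = object()  # slot held a record that was later removed


class LinearProbingTable:
    def __init__(self, size=11):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # Start at the hashed location and step by one, wrapping around.
        start = key % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def add(self, key, record):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break                      # key cannot be further along
            if slot is DELETED:
                if first_free is None:
                    first_free = i         # reusable, but keep looking for key
            elif slot[0] == key:
                self.slots[i] = (key, record)  # same key: overwrite
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = (key, record)

    def find(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return None                # reached a gap: key is absent
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def remove(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED    # leave a marker, not a gap
                return True
        return False
```

With a table of size 11, keys 5, 16, and 27 all hash to location 5 and end up in slots 5, 6, and 7, which is the kind of filled-up block the slide warns about. Swapping the step in _probe for a growing jump such as step * step would give a quadratic variant, though that sequence is not guaranteed to visit every slot.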

Hash function approaches • Numeric key: just use mod: • Hash(key): return key%Size • Non-numeric key: do a weighted sum of the ASCII codes of the characters: • Char[1] + 5*Char[2] + 17*Char[3] • Then Sum%Size • Special care is usually taken to avoid non-uniformity in distribution of keys.
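
Both approaches can be sketched in a few lines. The slide gives weights only for the first three characters (1, 5, 17); to handle keys of any length, the sketch below substitutes powers of a multiplier, which is an assumption and not the slide's scheme. The function names and the multiplier 31 are likewise illustrative.

```python
def hash_numeric(key, size):
    """Numeric key: the remainder after dividing by the table size."""
    return key % size


def hash_string(key, size, multiplier=31):
    """Non-numeric key: weighted sum of character codes, then mod size.

    Assumption: the weight of character i is multiplier**i, so the
    weights are 1, 31, 961, ... rather than the slide's 1, 5, 17.
    """
    total = 0
    weight = 1
    for ch in key:
        total += weight * ord(ch)  # ord gives the character's code
        weight *= multiplier
    return total % size
```

Because each position has a different weight, keys made of the same letters in a different order, such as "ab" and "ba", generally land in different locations, which a plain unweighted sum could not achieve.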

Design of a Hash Table • Choose a size that leaves room for growth and turnover (employees leaving?) • Add, Remove, and Find all use the same • Hash function • Collision handling • So the design work comes down to two steps: • Write a Hash function • Choose & implement a collision handling method

A Few Final Issues • If you run out of slots, you might need to rebuild the whole table with a bigger size. • The size is often chosen as a prime number so that cyclicity in the distribution of keys has the least effect. • New approaches to collision handling are continually being studied. • Hashing to pointers to linked lists can be very effective if the Hash function is good.
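
Two of these points, hashing to lists and rebuilding with a bigger prime size, can be sketched together. This is an illustration under stated assumptions: keys are non-negative integers hashed with key % size, Python lists stand in for linked lists, and the rebuild threshold (more than two records per slot on average) is an arbitrary choice for the example. All names are hypothetical.

```python
def next_prime(n):
    """Smallest prime greater than or equal to n (simple trial division)."""
    def is_prime(m):
        if m < 2:
            return False
        i = 2
        while i * i <= m:
            if m % i == 0:
                return False
            i += 1
        return True

    while not is_prime(n):
        n += 1
    return n


class ChainedTable:
    def __init__(self, size=7):
        self.size = size
        self.buckets = [[] for _ in range(size)]  # one list per slot
        self.count = 0

    def add(self, key, record):
        bucket = self.buckets[key % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, record)         # same key: overwrite
                return
        bucket.append((key, record))              # collisions share a list
        self.count += 1
        if self.count > 2 * self.size:            # assumed threshold
            self._rebuild(next_prime(2 * self.size))

    def find(self, key):
        for k, record in self.buckets[key % self.size]:
            if k == key:
                return record
        return None

    def remove(self, key):
        bucket = self.buckets[key % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.count -= 1
                return True
        return False

    def _rebuild(self, new_size):
        # Every record must be re-hashed, because key % size changes.
        old_items = [item for bucket in self.buckets for item in bucket]
        self.size = new_size
        self.buckets = [[] for _ in range(new_size)]
        for k, record in old_items:
            self.buckets[k % new_size].append((k, record))
```

Starting from 7 slots, the fifteenth record pushes the table over the assumed threshold and it is rebuilt with 17 slots, the first prime at or above 14.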
