
Efficient Information Retrieval Using Hashing Algorithm

Learn about the concept of hashing, ideal hashing functions, implementing hashing operations, and minimal perfect hash functions. Explore chained hashing, open addressing, and Sager's method for improving hashing efficiency.


Presentation Transcript


  1. Hashing Algorithm 9042635 羅正鴻 9142610 林彥廷 9142621 戴嘉宏

  2. Introduction • Hashing is a ubiquitous information retrieval strategy that provides efficient access to information based on a key • Information can usually be accessed in constant time • Hashing’s drawbacks

  3. Concept of hashing • The problem at hand is to define and implement a mapping from a domain of keys to a domain of locations • From the performance standpoint, the goal is to avoid collisions (A collision occurs when two or more keys map to the same location) • From the compactness standpoint, no application ever stores all keys in a domain simultaneously unless the size of the domain is small

  4. Concept of hashing (con’t) • The information to be retrieved is stored in a hash table, which is best thought of as an array of m locations, called buckets • The mapping between a key and a bucket is called the hash function • When no collisions occur, the time to store and retrieve data is proportional to the time to compute the hash function

  5. Hashing function • The ideal function, termed a perfect hash function, would distribute all elements across the buckets such that no collisions ever occur • A common form is h(v) = f(v) mod m, where m is the number of buckets • Knuth (1973) suggests choosing a prime number as the value of m

  6. Hashing function (con’t) • It is usually better to treat v as a sequence of bytes and do one of the following for f(v): (1) Sum or multiply all the bytes. Overflow can be ignored (2) Use the last (or middle) byte instead of the first (3) Use the square of a few of the middle bytes
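
The byte-based choices for f(v) can be sketched as follows. This is a minimal illustration, not a recommended hash function: the names `f_sum`, `f_middle_square`, and `h`, the table size 101, and the 32-bit mask used to imitate ignored overflow are all assumptions made for the example.

```python
M = 101  # assumed table size; a prime, following Knuth's suggestion


def f_sum(v: bytes) -> int:
    # Option (1): sum all the bytes. The mask imitates a 32-bit
    # machine word where overflow is simply ignored.
    return sum(v) & 0xFFFFFFFF


def f_middle_square(v: bytes, width: int = 2) -> int:
    # Option (3): square the integer formed by a few middle bytes.
    start = max(0, (len(v) - width) // 2)
    middle = int.from_bytes(v[start:start + width], "big")
    return middle * middle


def h(v: bytes, f=f_sum, m: int = M) -> int:
    # h(v) = f(v) mod m maps the key to one of m buckets.
    return f(v) % m


if __name__ == "__main__":
    for key in (b"hash", b"table", b"bucket"):
        print(key, h(key), h(key, f_middle_square))
```

Because the sum ignores byte order, keys that are permutations of each other (such as `b"ab"` and `b"ba"`) always collide under `f_sum`, which is one reason the slide lists alternatives.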

  7. Implementing hashing • The following operations are usually provided by an implementation of hashing: (1) Initialization (2) Insertion (3) Retrieval (4) Deletion

  8. Chained hashing

  9. Chained hashing (con’t) • In the worst case (where all n keys map to a single location), the average time to locate an element will be proportional to n/2. • In the best case (where all chains are of equal length), the time will be proportional to n/m.
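
A minimal sketch of chained hashing that provides the four operations listed earlier (initialization, insertion, retrieval, deletion). The class name `ChainedHashTable`, the byte-sum hash, and the default table size are assumptions for illustration; each bucket holds a Python list acting as the chain.

```python
class ChainedHashTable:
    def __init__(self, m: int = 11):
        # Initialization: m empty chains, one per bucket.
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key: str) -> int:
        # Assumed hash function: byte sum mod m.
        return sum(key.encode("utf-8")) % self.m

    def insert(self, key: str, value) -> None:
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:  # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def retrieve(self, key: str):
        # Walk the chain; cost grows with the chain length.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key: str) -> bool:
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


if __name__ == "__main__":
    table = ChainedHashTable(m=7)
    table.insert("ab", 1)
    table.insert("ba", 2)  # collides with "ab" under the byte-sum hash
    print(table.retrieve("ab"), table.retrieve("ba"))
```

If every key landed in the same bucket, `retrieve` would scan a single chain of length n, which is the worst case described above.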

  10. Open addressing
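
A minimal sketch of open addressing, assuming linear probing (the slide does not name a probing scheme) and tombstone markers for deletion. The class name `OpenAddressingTable`, the byte-sum hash, and the sentinels `_EMPTY` and `_DELETED` are hypothetical choices for this example.

```python
_EMPTY = object()    # slot never used
_DELETED = object()  # tombstone: slot was used, then deleted


class OpenAddressingTable:
    def __init__(self, m: int = 11):
        self.m = m
        self.slots = [_EMPTY] * m

    def _probe(self, key: str):
        # Linear probing: start at the home bucket, then step by one.
        start = sum(key.encode("utf-8")) % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key: str, value) -> None:
        first_free = None
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _EMPTY:
                if first_free is None:
                    first_free = idx
                break  # key cannot appear beyond an empty slot
            if slot is _DELETED:
                if first_free is None:
                    first_free = idx  # reusable, but keep looking for key
            elif slot[0] == key:
                self.slots[idx] = (key, value)
                return
        if first_free is None:
            raise RuntimeError("hash table is full")
        self.slots[first_free] = (key, value)

    def retrieve(self, key: str):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _EMPTY:
                return None
            if slot is not _DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key: str) -> bool:
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _EMPTY:
                return False
            if slot is not _DELETED and slot[0] == key:
                self.slots[idx] = _DELETED
                return True
        return False
```

The tombstone keeps later keys in the same probe sequence reachable after a deletion; marking the slot as empty instead would cut the sequence short.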

  11. Minimal perfect hash functions • A minimal perfect hash function (MPHF) is a perfect hash function that hashes m keys to m buckets with no collisions • Cichelli (1980) and Cercone et al. (1983) proposed two important concepts: (1) using tables of values as the parameters (2) using a mapping, ordering, and searching (MOS) approach
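
The definition can be checked mechanically for a given key set. The helper names `is_perfect` and `is_minimal_perfect` below are hypothetical, and the sketch assumes the keys passed in are distinct.

```python
def is_perfect(h, keys, m: int) -> bool:
    # Perfect: every key lands in a valid bucket and no two keys collide.
    values = [h(k) for k in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


def is_minimal_perfect(h, keys, m: int) -> bool:
    # Minimal perfect: perfect, and the table has exactly one bucket per key.
    return len(keys) == m and is_perfect(h, keys, m)


if __name__ == "__main__":
    keys = ["a", "b", "c"]
    by_letter = lambda k: ord(k) - ord("a")  # toy hash for single letters
    print(is_minimal_perfect(by_letter, keys, 3))  # m equals the key count
    print(is_perfect(by_letter, keys, 5))          # perfect, but with spare buckets
```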

  12. Minimal perfect hash functions (con’t) • Mapping: transform the key set from the original universe to a new universe • Ordering: place the keys in a sequence that determines the order in which hash values are assigned to keys • Searching: assign hash values to the keys of each level • Mapping → Ordering → Searching

  13. Sager’s method and improvement • Sager (1984, 1985) formalizes and extends Cichelli’s approach • In the mapping step, three auxiliary (hash) functions are defined on the original universe of keys U: h0: U → {0, …, m − 1}, h1: U → {0, …, r − 1}, h2: U → {r, …, 2r − 1}

  14. Sager’s method and improvement • The class of functions searched is h(k) = (h0(k) + g(h1(k)) + g(h2(k))) mod m • Sager uses a graph that represents the constraints among keys • The mapping step goes from keys to triples to a special bipartite graph, the dependency graph, whose vertices are the h1(k) and h2(k) values and whose edges represent the keys (words)
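
To make the function class h(k) = (h0(k) + g(h1(k)) + g(h2(k))) mod m concrete, here is a small sketch that finds a table g by brute force. It is not Sager's algorithm: the ordering step is skipped and every possible g is tried, which only works for tiny key sets. The auxiliary functions in `mapping` are simple byte-based stand-ins chosen for this example, and the names `mapping`, `find_g`, and `h` are hypothetical. Keys are assumed to be non-empty strings.

```python
from itertools import product


def mapping(keys, m: int, r: int):
    # Mapping step: key -> triple (h0, h1, h2) with
    # h0 in 0..m-1, h1 in 0..r-1, h2 in r..2r-1.
    triples = {}
    for k in keys:
        b = k.encode("utf-8")
        triples[k] = (sum(b) % m, b[0] % r, r + b[-1] % r)
    return triples


def find_g(triples, m: int, r: int):
    # Exhaustive search over all tables g of length 2r with entries in 0..m-1.
    # Returns the first g that gives every key a distinct hash value.
    for g in product(range(m), repeat=2 * r):
        values = {(h0 + g[h1] + g[h2]) % m for h0, h1, h2 in triples.values()}
        if len(values) == len(triples):
            return list(g)
    return None  # no g in this class separates the keys


def h(key, triples, g, m: int) -> int:
    h0, h1, h2 = triples[key]
    return (h0 + g[h1] + g[h2]) % m


if __name__ == "__main__":
    keys = ["ant", "bee", "cow", "elk"]
    m, r = 4, 2  # m equals the number of keys, so a solution is minimal
    triples = mapping(keys, m, r)
    g = find_g(triples, m, r)
    print("g =", g)
    for k in keys:
        print(k, triples[k], "->", h(k, triples, g, m))
```

Each key contributes one edge (h1(k), h2(k)) to the dependency graph, and the g values at its two endpoints are what the search adjusts. The brute-force loop tries m^(2r) tables, which is why the real method orders the keys and searches level by level instead.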

  15. Sager’s method and improvement

  16. The algorithm • The mapping step

  17. The algorithm (con’t) • The ordering step

  18. The algorithm (con’t) • The searching step

  19. Discussion • Hashing is a constant-time technique, and there are always advantages to being able to predict the time needed to locate a key • The MPHF uses a large amount of space
