A Practical Minimal Perfect Hashing Method. Fabiano C. Botelho (1), Yoshiharu Kohayakawa (2) and Nivio Ziviani (1). (1) Dept. of Computer Science, Federal University of Minas Gerais. (2) Dept. of Computer Science, University of São Paulo. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)
What is the Problem to Solve? • Finding algorithms that: • Construct minimal perfect hash functions faster than the ones available in the literature. • Use little memory to generate minimal perfect hash functions. • Generate minimal perfect hash functions that can be represented with a very economical description. • Construct minimal perfect hash functions that can be evaluated very fast.
Characteristics of Our Method • We are able to find minimal perfect hash functions using cyclic random graphs. • We have to impose a restriction: • The random graphs must have at most 50% of their edges in cycles.
Basic Concepts – Hash Function [Figure: a hash function maps a set S of n keys (for example case, if, set, for, while) to addresses 0, 1, ..., m-1 of a hash table; two keys mapped to the same address cause a collision.]
Basic Concepts – Perfect Hash Function [Figure: a perfect hash function maps the set of n keys, without collisions, into a hash table with addresses 0, 1, ..., m-1.]
Basic Concepts – Order Preserving Perfect Hash Function • A perfect hash function h is order preserving if the keys in S are arranged in some order and h preserves this order in the hash table. • For example, considering the lexicographic order: • Since Fabiano < Nivio < Yoshi, we have h(Fabiano) < h(Nivio) < h(Yoshi).
Basic Concepts – Minimal Perfect Hash Function [Figure: a minimal perfect hash function maps the set of n keys, without collisions, onto a hash table with exactly n addresses 0, 1, ..., n-1.]
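The three definitions above can be stated as small predicates. The sketch below is only illustrative: the function names and the dictionary-backed example function are assumptions made here, not part of the slides.

```python
def is_perfect(h, keys, m):
    """True if h maps the keys to distinct addresses in 0..m-1 (no collisions)."""
    addresses = [h(k) for k in keys]
    in_range = all(0 <= a < m for a in addresses)
    return in_range and len(set(addresses)) == len(keys)


def is_minimal_perfect(h, keys):
    """A perfect hash function is minimal when the table has exactly n slots."""
    return is_perfect(h, keys, len(keys))


def is_order_preserving(h, keys):
    """True if the lexicographic order of the keys is kept by the addresses."""
    addresses = [h(k) for k in sorted(keys)]
    return all(a < b for a, b in zip(addresses, addresses[1:]))


# Hypothetical example function backed by a lookup table.
names = ["Fabiano", "Nivio", "Yoshi"]
table = {"Fabiano": 0, "Nivio": 1, "Yoshi": 2}
h = table.__getitem__

print(is_perfect(h, names, 5), is_minimal_perfect(h, names), is_order_preserving(h, names))
```

A function that swaps two of the addresses would still be minimal and perfect, but would no longer be order preserving.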
Applications • Nowadays huge collections are common. • Memory-efficient storage and fast retrieval of items from static sets: • Words in natural languages. • Reserved words in programming languages or interactive systems. • Uniform resource locators (URLs) in Web search engines. • Item sets in data mining techniques. • Among others.
Approach Used for Constructing Minimal Perfect Hash Functions • MOS - Mapping, Ordering and Searching: • Mapping: transforms the key set from the original universe to a new universe. • Ordering: places the keys in a sequential order that determines the order in which hash values are assigned to keys. • Searching: attempts to assign hash values to the keys.
A Starting Point • We use as a starting point an algorithm proposed in: • Z.J. Czech, G. Havas, and B.S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257-264, 1992. • We refer to the algorithm by the acronym CHM, after its authors. • The CHM algorithm generates order preserving minimal perfect hash functions using acyclic random graphs. • This implies that more memory is needed to generate and to store the resulting function.
The CHM Algorithm • To generate an acyclic graph, two vertices h1(x) and h2(x) are computed for each key x in S, where h1 and h2 map each key to a vertex of G. • The set of edges is E(G) = {{h1(x), h2(x)} : x in S}. • To show how the CHM algorithm works, we use a small example with the abbreviated names of the first six months: S = {jan, feb, mar, apr, mai, jun}
CHM Algorithm - Mapping • Keys: jan, feb, mar, apr, mai, jun. • h1(jan) = 6, h2(jan) = 4; h1(feb) = 2, h2(feb) = 3; h1(mar) = 3, h2(mar) = 0; h1(apr) = 7, h2(apr) = 0; h1(mai) = 6, h2(mai) = 7; h1(jun) = 1, h2(jun) = 4. • The resulting graph must be acyclic. [Figure: the graph G on vertices 0 to 7.]
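The acyclicity requirement can be checked with a union-find pass over the edges. This is a minimal sketch using the h1/h2 values of the example; the function name and the union-find approach are choices made here for illustration, not necessarily how the authors test for cycles.

```python
def is_acyclic(edges, num_vertices):
    """Return True if the undirected graph given by `edges` has no cycle."""
    parent = list(range(num_vertices))

    def find(v):
        # Find the representative of v, halving the path on the way up.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both endpoints are already connected: this edge closes a cycle.
            return False
        parent[root_a] = root_b
    return True


# Edges {h1(x), h2(x)} for jan, feb, mar, apr, mai, jun.
CHM_EDGES = [(6, 4), (2, 3), (3, 0), (7, 0), (6, 7), (1, 4)]
print(is_acyclic(CHM_EDGES, 8))
```

For the six example edges the graph is a single path (2-3-0-7-6-4-1), so the check succeeds.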
CHM Algorithm - Ordering • Keys: jan, feb, mar, apr, mai, jun. • jan is placed in address 0, feb in address 1, mar in address 2, apr in address 3, mai in address 4 and jun in address 5. [Figure: the graph G with each edge labelled by the address of its key.]
How to Obtain an Acyclic Random Graph? • The algorithm repeatedly selects h1 and h2. • They proved that if |V(G)| = cn and c > 2, then the probability that G is acyclic is: [formula missing] • For c = 2.09, this probability is p = 0.342. • The expected number of iterations to obtain an acyclic graph is 1/p = 2.92.
CHM Algorithm - Searching • Given the acyclic random graph G = (V, E): • The problem is to find an assignment of values g to the vertices V(G) that makes the function h(x) = (g(h1(x)) + g(h2(x))) mod n an order preserving minimal perfect hash function. [Figure: the graph G with g values on its vertices, starting from g(0) = 0; one step shown is g(2) = (1 – 0) mod 6 = 1.]
CHM Algorithm – Evaluating the Resulting Function • MPHF: h(x) = (g(h1(x)) + g(h2(x))) mod n. • Example: h1(jan) = 6 and h2(jan) = 4, so h(jan) = (g(6) + g(4)) mod 6. [Figure: the graph G with the g values of its vertices.]
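The searching and evaluation steps of CHM can be sketched together. The code below assumes the standard CHM form h(x) = (g(h1(x)) + g(h2(x))) mod n, an acyclic graph, and the h1/h2 values of the example; the function names are hypothetical. The g labels it produces are one valid solution and are not checked against the figure.

```python
from collections import deque

KEYS = ["jan", "feb", "mar", "apr", "mai", "jun"]
H1 = {"jan": 6, "feb": 2, "mar": 3, "apr": 7, "mai": 6, "jun": 1}
H2 = {"jan": 4, "feb": 3, "mar": 0, "apr": 0, "mai": 7, "jun": 4}


def chm_assign(keys, h1, h2, num_vertices):
    """Label the vertices of an acyclic graph so that key i hashes to address i."""
    n = len(keys)
    adjacency = [[] for _ in range(num_vertices)]
    for address, key in enumerate(keys):
        adjacency[h1[key]].append((h2[key], address))
        adjacency[h2[key]].append((h1[key], address))
    g = [None] * num_vertices
    for root in range(num_vertices):
        if g[root] is not None:
            continue
        g[root] = 0  # Each tree of the forest starts with g = 0 at its root.
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u, address in adjacency[v]:
                if g[u] is None:
                    # Solve (g[u] + g[v]) mod n = address for g[u].
                    g[u] = (address - g[v]) % n
                    queue.append(u)
    return g


def chm_hash(key, g, h1, h2, n):
    """Evaluate the resulting function with two lookups and one addition."""
    return (g[h1[key]] + g[h2[key]]) % n


G_VALUES = chm_assign(KEYS, H1, H2, 8)
print([chm_hash(k, G_VALUES, H1, H2, len(KEYS)) for k in KEYS])
```

Because the graph is acyclic, every vertex is labelled exactly once and no edge constraint can conflict with another, which is why a single traversal is enough.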
Our Method • The goals are: • Constructing minimal perfect hash functions for huge sets of keys. • Constructing a minimal perfect hash function in O(n) time, with a small constant factor. • Using little memory to generate the functions. • Storing the functions with a very economical description.
Our Method • The algorithm shares several features with the CHM algorithm. • The differences are: • We generate cyclic random graphs G = (V, E) with |V(G)| = cn and |E(G)| = |S| = n, where c = 1.15 (determined empirically). • They generate acyclic random graphs with a greater number of vertices: |V(G)| = cn with c > 2. • They generate order preserving minimal perfect hash functions, while our algorithm does not preserve order. • Thus, our algorithm improves the space requirement at the expense of generating functions that are not order preserving.
Our Method • The algorithm uses the MOS approach to look for a function (MPHF) h(e) = g(a) + g(b), where: • e = {a, b} • a = h1(x) • b = h2(x) • x is a key in S • To show how our algorithm works, we use a small example with the abbreviated names of the first eight months: • S = {jan, feb, mar, apr, mai, jun, jul, aug}
Our Method - Mapping • Keys: jan, feb, mar, apr, mai, jun, jul, aug. • h1(jan) = 7, h2(jan) = 0; h1(feb) = 1, h2(feb) = 2; h1(mar) = 8, h2(mar) = 1; h1(apr) = 3, h2(apr) = 4; h1(mai) = 4, h2(mai) = 8; h1(jun) = 8, h2(jun) = 0; h1(jul) = 3, h2(jul) = 8; h1(aug) = 7, h2(aug) = 8. • G must be simple and must have at most 50% of its edges in cycles. [Figure: the graph G on vertices 0 to 8.]
How to Obtain a Simple Random Graph? • The probability that G = (V, E), with |E| = n and |V| = cn, is simple is approximately p = e^{-1/c^2}. • For c = 1.15, this probability is p = 0.47. • The expected number of iterations to obtain a simple graph is 1/p = 2.12.
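The quoted figures can be reproduced numerically. The sketch below assumes the asymptotic expression p = e^{-1/c^2} for the probability that G is simple; this expression is an assumption made here because it matches the values quoted for c = 1.15 on this slide and for c = 1 and c = 0.93 on the heuristic slide. The function names are hypothetical.

```python
import math


def prob_simple(c):
    """Assumed asymptotic probability that G (n edges, cn vertices) is simple."""
    return math.exp(-1.0 / c**2)


def expected_iterations(c):
    """Expected number of (h1, h2) draws before a simple graph is obtained."""
    return 1.0 / prob_simple(c)


for c in (1.15, 1.0, 0.93):
    print(f"c = {c}: p = {prob_simple(c):.3f}, 1/p = {expected_iterations(c):.3f}")
```

Small differences in the last digit with respect to the slide values come from rounding p before taking its reciprocal.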
Our Method - Ordering [Figure: the graph G on vertices 0 to 8, before and after the ordering step.]
Our Method - Ordering • The 2-core of G. [Figure: the graph G with its 2-core highlighted.]
Our Method - Ordering • The acyclic part of G. [Figure: the graph G with its acyclic part highlighted.]
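The split of G into its 2-core and its acyclic part can be computed by repeatedly removing vertices of degree 1. This is a minimal sketch under the assumption that G is simple; the edge list comes from the mapping slide (with aug as the eighth key), and the function name is hypothetical.

```python
EDGES = {"jan": (7, 0), "feb": (1, 2), "mar": (8, 1), "apr": (3, 4),
         "mai": (4, 8), "jun": (8, 0), "jul": (3, 8), "aug": (7, 8)}


def two_core(edges):
    """Return the edges that survive repeated removal of degree-1 vertices."""
    degree, incident = {}, {}
    for key, (a, b) in edges.items():
        for v in (a, b):
            degree[v] = degree.get(v, 0) + 1
            incident.setdefault(v, []).append(key)
    alive = set(edges)
    stack = [v for v, d in degree.items() if d == 1]
    while stack:
        v = stack.pop()
        if degree[v] != 1:
            continue  # The only edge of v was already removed from the other end.
        key = next(k for k in incident[v] if k in alive)
        alive.remove(key)
        for u in edges[key]:
            degree[u] -= 1
            if degree[u] == 1:
                stack.append(u)
    return {k: e for k, e in edges.items() if k in alive}


critical = two_core(EDGES)
acyclic_part = {k: e for k, e in EDGES.items() if k not in critical}
print(sorted(critical), sorted(acyclic_part))
```

On this toy graph the 2-core holds 6 of the 8 edges (two triangles sharing vertex 8), so it illustrates the decomposition rather than the 50% bound.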
How to Obtain a Random Graph With at Most 50% of Edges in Cycles? • The crucial step now is to determine the value of c (in |V(G)| = cn) that yields a random graph with at most 50% of its edges in cycles. • This is equivalent to determining the value of c for which the expected number of edges in the 2-core of G is 0.5|E(G)|.
Determining The Value of c Theoretically • Pittel and Wormald (2005) present detailed results for the 2-core of the giant component of the random graph G. • They have determined that |Vcrit| and |Ecrit| are given by: [formulas missing], expressed in terms of the average degree of the vertices in G and of T, where 0 < T < 1 is the unique solution to the equation [equation missing].
Determining The Value of c Theoretically • Using the equations to calculate |Vcrit| and |Ecrit| we have: [values missing] • We determined empirically that c = 1.15.
Our Method - Searching • The labelling g is defined such that h(x) = g(h1(x)) + g(h2(x)) is a minimal perfect hash function. • First, we obtain the g values for the vertices in Gcrit. • Second, we obtain the g values for the vertices in Gncrit.
Assignment of Values to Critical Vertices • The labels g(v) (v in Vcrit) are assigned in increasing order following a greedy strategy. • The critical vertices v are considered one at a time according to a breadth-first search on Gcrit. • If a candidate value x for g(v) is forbidden because setting g(v) = x would create two edges with the same sum, try x+1 for g(v).
Assignment of Values to Critical Vertices • Let us apply the algorithm to the critical graph (2-core) obtained for the considered example in the ordering step. [Figure: first steps of the assignment on the 2-core with vertices 0, 3, 4, 7 and 8; the labels g = 0, 1 and 2 are assigned in turn.]
Assignment of Values to Critical Vertices [Figure: the assignment continues with g = 3; the candidate values 4 and 5 for the last vertex produce edge sums that are already in use, so each one causes a reassignment.]
Assignment of Values to Critical Vertices [Figure: the last critical vertex receives g = 6.] • Used addresses: {1,2,3,5,6,7} • Unused addresses: {0,4}
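The greedy assignment on the critical graph can be sketched as follows. The code assumes a connected, simple 2-core, a breadth-first traversal that visits neighbours in increasing vertex order, and vertex 8 as the root; these are choices made here to mirror the example, and the function name is hypothetical.

```python
from collections import deque

# 2-core of the running example: key -> (h1, h2).
CRIT_EDGES = {"jan": (7, 0), "apr": (3, 4), "mai": (4, 8),
              "jun": (8, 0), "jul": (3, 8), "aug": (7, 8)}


def assign_critical(edges, root):
    """Greedy labelling of a connected 2-core; returns (g, used addresses)."""
    adjacency = {}
    for a, b in edges.values():
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    g, used = {}, set()
    x = 0  # Candidate label; it only ever increases.
    queue, seen = deque([root]), {root}
    while queue:
        v = queue.popleft()
        labelled = [u for u in adjacency[v] if u in g]
        while True:
            sums = [x + g[u] for u in labelled]
            if len(set(sums)) == len(sums) and used.isdisjoint(sums):
                break
            x += 1  # Reassignment: some edge sum is already taken, try x + 1.
        g[v] = x
        used.update(sums)
        x += 1
        for u in sorted(adjacency[v]):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return g, used


g_crit, used_addresses = assign_critical(CRIT_EDGES, root=8)
print(g_crit, sorted(used_addresses), sorted(set(range(8)) - used_addresses))
```

With this traversal order the candidates 4 and 5 are rejected for vertex 7 and it ends with g = 6, giving the used addresses {1,2,3,5,6,7}; a different visiting order may label individual vertices differently.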
Assignment of Values to Non-Critical Vertices [Figure: the non-critical vertices are labelled so that the remaining edges take the unused addresses; two labels shown are g = 0 and g = 4, and the set of unused addresses shrinks from {0,4} to {4} to {}.]
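The second phase can be sketched as a traversal of the acyclic part that hands out the unused addresses one by one. The code below assumes h(x) = g(h1(x)) + g(h2(x)), takes the critical labels as read from the figures (g(8) = 0, g(0) = 1, g(4) = 2, g(3) = 3, g(7) = 6) and uses the edges of the mapping slide; the function name and the order in which addresses are consumed are assumptions made here.

```python
from collections import deque

ALL_EDGES = {"jan": (7, 0), "feb": (1, 2), "mar": (8, 1), "apr": (3, 4),
             "mai": (4, 8), "jun": (8, 0), "jul": (3, 8), "aug": (7, 8)}
G_CRIT = {8: 0, 0: 1, 4: 2, 3: 3, 7: 6}  # Labels assumed from the figures.


def assign_non_critical(edges, g_crit):
    """Extend g to the non-critical vertices using the unused addresses."""
    n = len(edges)
    g = dict(g_crit)
    used = {g[a] + g[b] for a, b in edges.values() if a in g and b in g}
    free = deque(sorted(set(range(n)) - used))
    adjacency = {}
    for a, b in edges.values():
        if a in g_crit and b in g_crit:
            continue  # Critical edges already have their addresses.
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    # Grow trees from critical vertices first, then from any untouched vertex.
    roots = [v for v in adjacency if v in g] + [v for v in adjacency if v not in g]
    for root in roots:
        if root not in g:
            g[root] = 0
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adjacency[v]:
                if u not in g:
                    # Pick the next unused address and solve g[u] + g[v] = address.
                    g[u] = free.popleft() - g[v]
                    queue.append(u)
    return g


g_all = assign_non_critical(ALL_EDGES, G_CRIT)
print({key: g_all[a] + g_all[b] for key, (a, b) in ALL_EDGES.items()})
```

On the example this gives g(1) = 0 and g(2) = 4, so mar takes address 0 and feb takes address 4, and the eight keys cover the addresses 0 to 7 exactly once.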
Analysis of The Searching Step [Figure: the two reassignments of the example shown again.] • We have shown that the maximal value assigned to an edge is: [expression missing] • We have also shown that the number of back edges of G is: Nbedges = |Ecrit| - |Vcrit| + 1
Analysis of The Searching Step • Combining these results and considering that [expression missing], we obtain [expression missing]. • If [expression missing], then [expression missing] and an MPHF is generated in linear time. • The only problem left open is to prove that [expression missing].
Experimental Evidence • Experimental evidence that [expression missing]: • Recall: Nbedges = |Ecrit| - |Vcrit| + 1 = 0.501n – 0.401n + 1 = 0.1n + 1.
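The back-edge count can be checked on the 2-core of the running example. This small sketch assumes a connected critical graph, where a breadth-first spanning tree uses |Vcrit| - 1 edges and every remaining edge is a back edge; the edge list is the one derived from the mapping slide and the function name is hypothetical.

```python
# 2-core edges of the running example (jan, apr, mai, jun, jul, aug).
CRIT_EDGE_LIST = [(7, 0), (3, 4), (4, 8), (8, 0), (3, 8), (7, 8)]


def count_back_edges(edges):
    """Nbedges = |Ecrit| - |Vcrit| + 1 for a connected critical graph."""
    vertices = {v for edge in edges for v in edge}
    return len(edges) - len(vertices) + 1


print(count_back_edges(CRIT_EDGE_LIST))
```

With 6 edges and 5 vertices the count is 2, one back edge for each of the two triangles that meet at vertex 8.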
Experimental Settings • Our data consists of a collection of 100 million uniform resource locators (URLs) collected from the Web. • The average length of a URL in the collection is 63 bytes. • All experiments were carried out on a PC with a 2.4 gigahertz processor and 4 gigabytes of main memory. • The table entries shown in the following represent averages over 50 trials.
Experimental Results • Gains: • A 59% gain in the time for constructing an MPHF. • The resulting functions are generated using 25% less memory than the CHM algorithm. • The resulting functions are stored in 55% of the space needed to store the ones generated by the CHM algorithm.
Conclusions • We have presented a practical method to construct minimal perfect hash functions. • The method uses only 24.80n + O(1) bytes to generate the functions. • The method is very fast. • So, it is a good option for huge static sets. • The implementation of the method is available at http://cmph.sf.net under the LGPL free software license.
Questions?
Heuristic • We have proposed a heuristic that reduces the space requirement to any given value between 1.15n words and 0.93n words. • The heuristic reuses, when possible, the set of x values that caused reassignments, just before trying x+1. • Problem: decreasing the value of c increases the number of iterations needed to generate G. • For example: • For c = 1 and c = 0.93, the analytical expected numbers of iterations are 2.72 and 3.17, respectively. • However, the algorithm is still linear and needs less memory to generate and to store the resulting functions.
Heuristic • Let us suppose that we have the following 2-core: [Figure: a 2-core being labelled step by step; the g values 0 to 4 are assigned and the set of unused g values remains {}.]
Heuristic [Figure: the candidate g = 5 causes a reassignment and the vertex receives g = 6; the set of unused g values changes from {} to {5}.]
Heuristic [Figure: a further reassignment occurs; the labelling ends with the g values 6 and 7 while the set of unused g values is still {5}.]
Heuristic • Unfortunately, the heuristic does not work for the previous example. • However, it works well for random graphs. • That is why we are able to reduce the value of c.