Simple and Space-Efficient Minimal Perfect Hash Functions
Fabiano C. Botelho, Department of Computer Science, Federal University of Minas Gerais, Brazil
Rasmus Pagh, Computational Logic and Algorithms Group, IT University of Copenhagen, Denmark
Nivio Ziviani, Department of Computer Science, Federal University of Minas Gerais, Brazil
LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)
What Is The Problem to Solve?
Design, analyze and implement MPHFs that:
• Use space close to the optimal
• Are faster to generate than the ones available in the literature
• Are fast to compute
• Need little memory to generate
Perfect Hash Function
[Figure: a perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) without collisions into a hash table with slots 0, 1, ..., m-1.]
Minimal Perfect Hash Function
[Figure: a minimal perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) without collisions into a hash table with exactly n slots 0, 1, ..., n-1.]
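The two definitions above can be checked mechanically. The following is a minimal sketch in Python; the helper names `is_perfect` and `is_minimal_perfect` and the two example mappings are illustrative assumptions, not part of the talk.

```python
def is_perfect(h, keys, m):
    """True if h maps every key to a distinct slot in range(m)."""
    values = [h(k) for k in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


def is_minimal_perfect(h, keys):
    """True if h is perfect and the table has exactly len(keys) slots."""
    return is_perfect(h, keys, len(keys))


months = ["jan", "feb", "mar", "apr"]

# Hypothetical example mappings, used only to exercise the checks.
sparse = {"mar": 0, "jan": 2, "feb": 6, "apr": 7}   # perfect for m = 8
dense = {"mar": 0, "jan": 1, "feb": 2, "apr": 3}    # minimal perfect

print(is_perfect(sparse.get, months, 8))
print(is_minimal_perfect(sparse.get, months))
print(is_minimal_perfect(dense.get, months))
```

A PHF only has to avoid collisions; an MPHF additionally has to fill a table of exactly n slots.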
Lower Bounds For Storage Space
• PHFs (m ≈ n):
• MPHFs (m = n): approximately n log e ≈ 1.4427n bits
Related Work
• Theoretical Results
• Practical Results
• Heuristics
Theoretical Results
Practical Results
Heuristics
Our Family of Algorithms
• Near-optimal space
• Evaluation in constant time
• Function generation in linear time
• Simple to describe and implement
• Algorithms in the literature with near-optimal space either:
  • Require exponential time for construction and evaluation, or
  • Use near-optimal space only asymptotically, for large n
• Acyclic random hypergraphs
  • Used before by Majewski et al. (1996): O(n log n) bits
  • We proceed differently: O(n) bits (a lower space complexity, close to the theoretical lower bound)
Our Family of Algorithms - Remark
• Chazelle et al. (SODA 2004) presented a way of constructing PHFs that is equivalent to ours
• It is explained as a modification of the "Bloomier Filter" data structure, but they do not make explicit that a PHF is constructed
• Our contributions are:
  • An analysis and optimization of the constant in the space usage, considering implementation aspects
  • A way of constructing MPHFs from those PHFs
Random Hypergraphs (r-graphs)
• 3-graph on vertices 0, ..., 5 with one edge per key:
  h0(jan) = 1  h1(jan) = 3  h2(jan) = 5
  h0(feb) = 1  h1(feb) = 2  h2(feb) = 5
  h0(mar) = 0  h1(mar) = 3  h2(mar) = 4
• A 3-graph is induced by three uniform hash functions
• Our best result uses 3-graphs
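As a rough illustration of how r hash functions induce an r-graph, the sketch below gives each function its own slice of the m vertices, as in the example above where h0 lands in {0,1}, h1 in {2,3} and h2 in {4,5}. The seeded SHA-256 hashing and the names `make_hash` and `build_hypergraph` are assumptions made for this sketch; the hash values it produces will not match the ones on the slide.

```python
import hashlib


def make_hash(seed, offset, size):
    """Return a hash function mapping a key into [offset, offset + size)."""
    def h(key):
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        return offset + int.from_bytes(digest[:8], "big") % size
    return h


def build_hypergraph(keys, r, m, seed=0):
    """One edge (an r-tuple of vertices) per key; vertex i comes from h_i."""
    part = m // r  # each hash function owns one block of vertices
    hs = [make_hash(seed * r + i, i * part, part) for i in range(r)]
    return [tuple(h(k) for h in hs) for k in keys]


edges = build_hypergraph(["jan", "feb", "mar"], r=3, m=6)
for key, edge in zip(["jan", "feb", "mar"], edges):
    print(key, edge)
```

Changing `seed` selects a fresh set of hash functions, which is what the generation procedure needs when a graph has to be rejected.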
Acyclic 2-graph
[Figure sequence: testing whether Gr is acyclic. Gr has vertices 0-3 (range of h0) and 4-7 (range of h1) and one edge per key: jan, feb, mar, apr. Edges are removed one at a time and recorded in the list L.]
• Initially: L = Ø; edges in Gr: jan, mar, feb, apr
• After step 1: L = {0,5}; remaining edges: jan, feb, apr
• After step 2: L = {2,6} {0,5}; remaining edges: jan, apr
• After step 3: L = {2,7} {2,6} {0,5}; remaining edge: jan
• After step 4: L = {2,5} {2,6} {2,7} {0,5}; no edges remain, so Gr is acyclic
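The acyclicity test shown in this sequence can be sketched as a peeling procedure: repeatedly remove an edge that has a vertex of degree 1 and record it; the graph is acyclic exactly when every edge gets removed. The function name `peel` is an assumption, and the assignment of keys to edges (jan = {2,5}, feb = {2,6}, mar = {0,5}, apr = {2,7}) is inferred from the figure sequence. The removal order this sketch finds may differ from the one on the slides; any valid order works.

```python
from collections import defaultdict


def peel(edges):
    """Return the removal order (edge indices) or None if the graph is cyclic.

    Assumes the vertices within one edge are distinct.
    """
    incident = defaultdict(set)  # vertex -> indices of edges still present
    for e, verts in enumerate(edges):
        for v in verts:
            incident[v].add(e)

    stack = [v for v, es in incident.items() if len(es) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue  # degree changed since v was pushed
        (e,) = incident[v]
        order.append(e)
        for u in edges[e]:
            incident[u].discard(e)
            if len(incident[u]) == 1:
                stack.append(u)
    return order if len(order) == len(edges) else None


# Edges inferred from the slides: jan, feb, mar, apr.
edges = [(2, 5), (2, 6), (0, 5), (2, 7)]
order = peel(edges)
print(order)
print(peel([(0, 1), (1, 2), (2, 0)]))  # a triangle cannot be peeled
```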
The Family of Algorithms (r = 2)
[Figure: the Mapping step uses h0 and h1 to turn the key set S = {jan, feb, mar, apr} into the graph Gr; the Assigning step then uses the list L to fill the array g.]
• L: {0,5} {2,7} {2,6} {2,5}
• g[0] = 0, g[1] = r, g[2] = 0, g[3] = r, g[4] = r, g[5] = r, g[6] = 1, g[7] = 1
• Values in the range {0, 1, ..., r}
• r = 2 or r = 3
• At most 2 bits for each vertex in g
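One way to realize the Assigning step is sketched below, under stated assumptions: edges are processed in the reverse of the order in which they were removed during the acyclicity test, every entry of g starts at r, and for each edge the first not-yet-visited vertex is given the value that makes the sum of the edge's g values, mod r, equal to that vertex's position in the edge. The function name `assign` and the removal order (mar, feb, apr, jan) are inferred from the figures rather than stated in the text.

```python
def assign(edges, removal_order, m, r):
    """Fill g so that sum(g[v] for v in edge) % r selects a distinct vertex per edge."""
    g = [r] * m               # r marks a vertex that no key is assigned to
    visited = [False] * m
    for e in reversed(removal_order):
        verts = edges[e]
        # Position of the first vertex no earlier edge has touched.
        j = next(i for i, v in enumerate(verts) if not visited[v])
        u = verts[j]
        others = sum(g[v] for v in verts if v != u)
        g[u] = (j - others) % r
        for v in verts:
            visited[v] = True
    return g


# Edges inferred from the slides: jan, feb, mar, apr.
edges = [(2, 5), (2, 6), (0, 5), (2, 7)]
removal_order = [2, 1, 3, 0]  # mar, feb, apr, jan
g = assign(edges, removal_order, m=8, r=2)
print(g)
```

With these inputs the sketch yields g = (0, r, 0, r, r, r, 1, 1) with r = 2, the same array that appears on the slides.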
The Family of Algorithms (r = 2)
[Figure: the Ranking step maps the four assigned vertices of g to the hash table: 0 → mar, 1 → jan, 2 → feb, 3 → apr.]
Example for the key feb:
• i = (g(h0(feb)) + g(h1(feb))) mod r = (g(2) + g(6)) mod 2 = 1
• phf(feb) = hi(feb) = h1(feb) = 6
• mphf(feb) = rank(phf(feb)) = rank(6) = 2
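The evaluation above can be replayed in a few lines. This is a sketch that hard-codes the example: the array `G` and the value h1(feb) = 6 are taken from the slides, the remaining hash values in `H` are inferred from the figures, and `rank` is a plain linear scan standing in for the constant-time ranking structure that the extra ε bits per entry pay for.

```python
R = 2
G = [0, 2, 0, 2, 2, 2, 1, 1]  # g from the slides, with r written as 2

# (h0(key), h1(key)) for each key; inferred from the example figures.
H = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}


def phf(key):
    """Perfect hash: pick the vertex of the key's edge selected by g."""
    verts = H[key]
    i = sum(G[v] for v in verts) % R
    return verts[i]


def rank(v):
    """Number of assigned vertices (g != r) strictly before position v."""
    return sum(1 for x in G[:v] if x != R)


def mphf(key):
    """Minimal perfect hash: compress phf values into 0..n-1."""
    return rank(phf(key))


for key in ["mar", "jan", "feb", "apr"]:
    print(key, phf(key), mphf(key))
```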
Use of Acyclic Random Hypergraphs
• Acyclicity is a sufficient condition for the family of algorithms to work (Majewski et al. (1996))
• Repeatedly select h0, h1, ..., hr-1 until the resulting hypergraph is acyclic
• For r = 2, m = cn and c > 2: Pra = sqrt(1 - (2/c)^2)
• For c = 2.09, Pra = 0.29
• For r = 3 and c ≥ 1.23: the probability tends to 1
• Expected number of iterations is 1/Pra:
  • r = 2: 3.5 iterations
  • r = 3: 1.0 iteration
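The iteration counts can be reproduced with a short calculation. The sketch assumes the standard asymptotic result for random graphs with n edges and m = cn vertices, namely that the probability of being acyclic tends to sqrt(1 - (2/c)^2) for c > 2; the function name `pr_acyclic_2graph` is illustrative.

```python
import math


def pr_acyclic_2graph(c):
    """Asymptotic probability that a random 2-graph with m = c*n vertices is acyclic."""
    if c <= 2:
        raise ValueError("the formula assumes c > 2")
    return math.sqrt(1 - (2 / c) ** 2)


p = pr_acyclic_2graph(2.09)
iterations = 1 / p  # mean of a geometric distribution with success probability p

print(f"Pra = {p:.2f}")
print(f"expected iterations = {iterations:.2f}")
```

Each attempt succeeds independently with probability Pra, so the number of attempts is geometric with mean 1/Pra, which rounds to the 3.5 iterations quoted for r = 2.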
Space to Represent the Functions (r = 2)
[Figure: the array g = (0, r, 0, r, r, r, 1, 1) for vertices 0, ..., 7.]
• PHFs (ranking information not required):
  • g: [0,m-1] → {0,1}
  • m = cn bits, c = 2.09 → 2.09n bits
• MPHFs (ranking information required):
  • g: [0,m-1] → {0,1,2}
  • 2m + εm = (2 + ε)cn bits
  • For c = 2.09 and ε = 0.125 → 4.44n bits
• Packed MPHFs (range of size 3):
  • log 3 bits for each entry of g (arithmetic coding)
  • (log 3 + ε)cn bits
  • For c = 2.09 and ε = 0.125 → 3.6n bits
Space to Represent the Functions (r = 3)
• PHFs (ranking information not required):
  • g: [0,m-1] → {0,1,2}
  • 2m = 2cn bits, c = 1.23 → 2.46n bits
• Packed PHFs (range of size 3):
  • log 3 bits for each entry of g (arithmetic coding)
  • (log 3)cn bits, c = 1.23 → 1.95n bits
  • Optimal: 1.17n bits
• MPHFs (ranking information required):
  • g: [0,m-1] → {0,1,2,3}
  • 2m + εm = (2 + ε)cn bits
  • For c = 1.23 and ε = 0.125 → 2.62n bits
  • Optimal: 1.4427n bits
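The space figures for r = 2 and r = 3 all follow one pattern: (bits per entry of g + ε) × c bits per key. The sketch below recomputes them; the helper name `bits_per_key` is an assumption, and the constants are the rounded values quoted in the talk, so the r = 3 MPHF figure comes out near 2.61 rather than the quoted 2.62.

```python
import math


def bits_per_key(c, bits_per_entry, eps=0.0):
    """Space in bits per key for m = c*n entries plus eps bits of rank overhead each."""
    return (bits_per_entry + eps) * c


LOG3 = math.log2(3)  # bits per entry when a 3-valued g is packed

r2 = {
    "phf": bits_per_key(2.09, 1),
    "mphf": bits_per_key(2.09, 2, 0.125),
    "packed_mphf": bits_per_key(2.09, LOG3, 0.125),
}
r3 = {
    "phf": bits_per_key(1.23, 2),
    "packed_phf": bits_per_key(1.23, LOG3),
    "mphf": bits_per_key(1.23, 2, 0.125),
}

for name, table in (("r = 2", r2), ("r = 3", r3)):
    for kind, bits in table.items():
        print(f"{name} {kind}: {bits:.2f} n bits")
```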
Experimental Results
• Metrics:
  • Generation time
  • Storage space
  • Evaluation time
• Collection:
  • URLs collected from the web, 64 bytes long on average
• Experiments:
  • Commodity PC with a cache of 2 Mbytes
Related Algorithms
• Botelho, Kohayakawa and Ziviani (2005) - BKZ
• Fox, Chen and Heath (1992) - FCH
• Czech, Havas and Majewski (1992) - CHM
• Majewski, Wormald, Havas and Czech (1996) - MWHC
• Pagh (1999) - PAGH
Generation Time and Storage Space (n = 3,541,615 keys)
Evaluation Time (n = 3,541,615 keys)
Comparison of the Resulting PHFs and MPHFs (n = 3,541,615 keys)
Conclusions
• We have presented an efficient family of algorithms
• Near space-optimal PHFs and MPHFs
• The algorithms are simpler and have much lower constant factors than existing theoretical results
• They outperform the main practical general-purpose algorithms found in the literature, considering:
  • Generation time
  • Storage space
• Implementation available at http://cmph.sf.net
  • LGPL free software license
?