1 / 68

Near Space-Optimal Perfect Hashing Algorithms

Near Space-Optimal Perfect Hashing Algorithms. Nivio Ziviani Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil. International Conference on the Analysis of Algorithms Maresias, Brazil, April 19, 2008. Objective of the Presentation.

mills
Download Presentation

Near Space-Optimal Perfect Hashing Algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Near Space-OptimalPerfect Hashing Algorithms Nivio Ziviani Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil International Conference on the Analysis of Algorithms Maresias, Brazil, April 19, 2008 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)1

  2. Objective of the Presentation Present a perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets: • Key set fits in the internal memory • Internal memory based algorithm (IPH) • Key set larger than the internal memory • External memory based algorithm (EPH) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)2

  3. Objective of the Presentation Present a perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets: • Key set fits in the internal memory • Internal memory based algorithm (IPH) • Key set larger than the internal memory • External memory based algorithm (EPH) Algorithm is time efficient, higly scalable and near space-optimal LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)3

  4. Perfect Hash Function Static key set S of size n ... 0 1 n -1 Hash Table Perfect Hash Function ... m -1 0 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)4

  5. Minimal Perfect Hash Function Static key set S of size n ... 0 1 n -1 Minimal Perfect Hash Function Hash Table ... n -1 0 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)5

  6. Where to use a PHF or a MPHF? • Ubiquitous in Computer Science: access items based on the value of a key • Work with huge static item sets: • In data warehousing applications: • On-Line Analytical Processing (OLAP) applications • In Web search engines: • Represent the set of visited URLs when the Web is beeing crawled • Large vocabularies • Map long URLs in smaller integer numbers that are used as IDs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)6

  7. 0 Web Graph Vertices 1 URL 1 URLS 2 URL 2 URL3 3 URL4 URL5 4 URL6 URL7 5 ... 6 URLn ... n-1 Mapping URLs to Web Graph Vertices LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)7

  8. 0 Web Graph Vertices M P H F 1 URL 1 URLS 2 URL 2 URL3 3 URL4 URL5 4 URL6 URL7 5 ... 6 URLn ... n-1 Mapping URLs to Web Graph Vertices LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)8

  9. Information Theoretical Lower Bounds for Storage Space • PHFs (m ≈ n): • MPHFs (m = n): m < 3n LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)9

  10. Uniform Hashing Versus Universal Hashing Key universe U of size u Hash function Range M of size m LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)10

  11. Uniform Hashing Versus Universal Hashing Uniform hashing • # of functions from U to M? • # of bits to encode each fucntion • Independent functions with values uniformly distributed Key universe U of size u Hash function Range M of size m LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)11

  12. Uniform Hashing Versus Universal Hashing Uniform hashing • # of functions from U to M? • # of bits to encode each fucntion • Independent functions with values uniformly distributed Key universe U of size u Hash function Range M of size m Universal hashing • A family of hash functions H is universal if: • for any pair of distinct keys (x1, x2) from U and • a hash function h chosen uniformly from H then: LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)12

  13. Uniform Hashing Versus Universal Hashing • Intuition behind universal hashing • We often lose relatively little compared to using a completely random map (uniform hashing) • If S of size n is hashed to n2 buckets, with probability more than ½, no collisions occur • Even with complete randomness, we do not expect little o(n2) buckets to suffice (the birthday paradox) • So nothing is lost by using a universal family instead! LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)13

  14. Related Work • Theoretical Results (use uniform hashing) • Practical Results (assume uniform hashing for free) • Heuristics LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)14

  15. Theoretical Results LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)15

  16. Theoretical Results LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)16

  17. Practical Results – Assume Uniform Hashing For Free LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)17

  18. Practical Results – Assume Uniform Hashing For Free LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)18

  19. Practical Results – Assume Uniform Hashing For Free LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)19

  20. Empirical Results LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)20

  21. Internal and External Memory Based Algorithms • Near space optimal • Evaluation in constant time • Function generation in linear time • Simple to describe and implement • Known algorithms with near-optimal space either: • Require exponential time for construction and evaluation, or • Use near-optimal space only asymptotically, for large n • Acyclic random hypergraphs • Used before by Majewski et all (1996): O(n log n) bits • We proceed differently: O(n) bits (we changed space complexity, close to theoretical lower bound) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)21

  22. Random Hypergraphs (r-graphs) • 3-graph: 1 0 2 3 4 5 • 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)22

  23. Random Hypergraphs (r-graphs) • 3-graph: h0(jan) = 1 h1(jan) = 3 h2(jan) = 5 1 0 2 3 4 5 • 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)23

  24. Random Hypergraphs (r-graphs) • 3-graph: h0(jan) = 1 h1(jan) = 3 h2(jan) = 5 1 0 h0(feb) = 1 h1(feb) = 2 h2(feb) = 5 2 3 4 5 • 3-graph is induced by three uniform hash functions LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)24

  25. Random Hypergraphs (r-graphs) • 3-graph: h0(jan) = 1 h1(jan) = 3 h2(jan) = 5 1 0 h0(feb) = 1 h1(feb) = 2 h2(feb) = 5 2 3 h0(mar) = 0 h1(mar) = 3 h2(mar) = 4 4 5 • 3-graph is induced by three uniform hash functions • Our best result uses 3-graphs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)25

  26. The Internal memory based algorithm ... LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)26

  27. Acyclic 2-graph Gr: L:Ø h0 1 2 0 3 jan mar apr feb h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)27

  28. Acyclic 2-graph Gr: L: {0,5} h0 1 2 0 3 jan apr feb h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)28

  29. Acyclic 2-graph 0 1 Gr: L: {0,5} {2,6} h0 1 2 0 3 jan apr h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)29

  30. Acyclic 2-graph 0 1 2 Gr: L: {0,5} {2,6} {2,7} h0 1 2 0 3 jan h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)30

  31. Acyclic 2-graph 0 1 2 3 Gr: L: {0,5} {2,6} {2,7} {2,5} h0 1 2 0 3 Gr is acyclic h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)31

  32. Internal Algorithm (r = 2) S jan feb mar apr LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)32

  33. Internal Algorithm (r = 2) S Gr: jan feb mar apr h0 1 2 0 3 jan Mapping mar apr feb h1 4 5 6 7 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)33

  34. Internal Algorithm (r = 2) g S Gr: 0 r jan feb mar apr r 1 h0 1 2 0 3 r 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 r 7 r 0 1 2 3 L: {0,5} {2,6} {2,7} {2,5} LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)34

  35. Internal Algorithm (r = 2) g S Gr: 0 r jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 r 7 r 0 1 2 3 L: {0,5} {2,6} {2,7} {2,5} LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)35

  36. Internal Algorithm (r = 2) g S Gr: 0 r jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 r 7 1 0 1 2 3 L: {0,5} {2,6} {2,7} {2,5} LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)36

  37. Internal Algorithm (r = 2) g S assigned Gr: assigned 0 0 jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 1 7 1 assigned assigned LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)37

  38. Internal Algorithm (r = 2) g S assigned Gr: assigned 0 0 jan feb mar apr r 1 h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 1 7 1 assigned assigned i = (g(h0(feb)) + g(h1(feb))) mod r =(g(2) + g(6)) mod 2 = 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)38

  39. Internal Algorithm: PHF Hash Table g S assigned Gr: assigned 0 0 mar jan feb mar apr r - 1 h0 1 2 0 3 jan 0 2 L 3 - r jan Mapping Assigning mar apr - r feb 4 5 - r h1 4 5 6 7 6 feb 1 7 apr 1 assigned assigned i = (g(h0(feb)) + g(h1(feb))) mod r =(g(2) + g(6)) mod 2 = 1 phf(feb) = hi=1 (feb) = 6 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)39

  40. Internal Algorithm: MPHF g S assigned Gr: assigned 0 0 jan feb mar apr Hash Table r 1 h0 1 2 0 3 0 0 mar 2 L 3 jan r 1 jan Mapping Assigning Ranking mar apr 2 feb r feb 4 5 apr r 3 h1 4 5 6 7 6 1 7 1 assigned assigned i = (g(h0(feb)) + g(h1(feb))) mod r =(g(2) + g(6)) mod 2 = 1 phf(feb) = hi=1 (feb) = 6 mphf(feb) = rank(phf(feb)) = rank(6) = 2 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)40

  41. Internal Algorithm: Space to Represent the Functions g S Gr: 0 0 jan feb mar apr r 1 • Values in the range {0,1, ..., r} • r = 2 or r = 3 • At most 2 bits for each vertex in g h0 1 2 0 3 0 2 L 3 r jan Mapping Assigning mar apr r feb 4 5 r h1 4 5 6 7 6 1 7 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)41

  42. Space to Represent the Functions (r = 3) • PHFs g: [0,m-1] → {0,1,2} • m = cn bits, c = 1.23 → 2.46 n bits • Packed PHFs (Range of size 3): • (log 3) cn bits, c = 1.23 →1.95 n bits (arith. coding) • Optimal: 0.89n bits • MPHFs g: [0,m-1] → {0,1,2,3} (ranking info required) • 2m + εm = (2+ ε)cn bits • For c = 1.23 and ε = 0.125 →2.62 n bits • Optimal: 1.44n bits. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)42

  43. Use of Acyclic Random Hypergraphs • Sufficient condition to work • Repeatedly selects h0, h1..., hr-1 • For r = 2, m = cn and c ≥ 2.09: Pra = 0.29 • For r = 3, m = cn and c≥1.23: Pra tends to 1 • Number of iterations is 1/Pra: • r = 2: 3.5 iterations • r = 3: 1.0 iteration LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)43

  44. Internal Memory Based Algorithm Summary • Near space-optimal PHFs and MPHFs • Compact functions fit in cache • The algorithm is simpler and has much lower constant factors than existing theoretical results • Outperforms the main practical general purpose algorithms found in the literature considering • generation time • storage space LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)44

  45. The External memory based algorithm ... LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)45

  46. External Memory Based Algorithm First MPHF algorithm for very large key sets (in the order of billions of keys) This is possible because Deals with external memory efficiently Generates compact functions (near space optimal) Uses a little amount of internal memory to scale Works in linear time Two implementations: Theoretical well-founded EPH (uses uniform hashing) Heuristic EPH (uses universal hashing) LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)46

  47. The External Perfect Hashing Algorithm (EPH) n-1 0 1 Key Set S … h0 Partitioning Buckets … 0 1 2b - 1 2 Searching MPHF0 MPHF1 MPHF2 MPHF2b-1 Hash Table … n-1 0 1 MPHF(x) = MPHFi(x) + offset[i]; LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)47

  48. Key Set Does Not Fit In Internal Memory n-1 0 Key Set S of β bytes ... … ... µ bytes of Internal memory µ bytes of Internal memory Partitioning h0 h0 … … … 0 0 1 2b - 1 1 2b - 1 2 2 File N File 1 Each bucket ≤ 256 N = β/µ b = Number of bits of each bucket address LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)48

  49. Important Design Decisions • We map long URLs to smaller keys of fixed size using a hash function • Choose a linear time and near space-optimal algorithm to generate the MPHF of each bucket • How do we obtain a linear time complexity? • Using internal radix sorting to form the buckets • Using a heap of N entries to drive a N-way merge that reads the buckets from disk LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)49

  50. Use the Internal Algorithm for Each Bucket g S assigned Gr: assigned 0 0 jan feb mar apr Hash Table r 1 h0 1 2 0 3 0 0 mar 2 L 3 jan r 1 jan Mapping Assigning Ranking mar apr 2 feb r feb 4 5 apr r 3 h1 4 5 6 7 6 1 7 1 assigned assigned LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)50

More Related