Near Space-Optimal Perfect Hashing Algorithms. Nivio Ziviani Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil. 23rd Annual ACM Symposium on Applied Computing Fortaleza, Brazil, March 19, 2008. Objective of the Presentation.
Near Space-Optimal Perfect Hashing Algorithms
Nivio Ziviani, Fabiano C. Botelho
Department of Computer Science, Federal University of Minas Gerais, Brazil
23rd Annual ACM Symposium on Applied Computing, Fortaleza, Brazil, March 19, 2008
LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)
Objective of the Presentation
Present two perfect hashing algorithms:
• Internal memory based algorithm
• External memory based algorithm
Both algorithms are practical, time efficient and near space-optimal
Perfect Hash Function
[Figure: a perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) into a hash table with entries 0, 1, ..., m-1]
Minimal Perfect Hash Function
[Figure: a minimal perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) into a hash table with entries 0, 1, ..., n-1]
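The difference between the two definitions can be checked mechanically. The following is a minimal sketch, not code from the presentation: the helper names and the toy lookup table standing in for a hash function are assumptions made for illustration.

```python
def is_perfect(h, keys, m):
    """True if h maps every key to a distinct slot in [0, m-1]."""
    slots = [h(k) for k in keys]
    return all(0 <= s < m for s in slots) and len(set(slots)) == len(slots)


def is_minimal_perfect(h, keys):
    """A PHF is minimal when the table has exactly n = len(keys) slots."""
    return is_perfect(h, keys, len(keys))


# Hypothetical toy function: a lookup table stands in for a real PHF.
keys = ["jan", "feb", "mar", "apr"]
table = {"jan": 5, "feb": 0, "mar": 3, "apr": 6}
toy_phf = table.__getitem__

print(is_perfect(toy_phf, keys, 8))       # no collisions in a table of size 8
print(is_minimal_perfect(toy_phf, keys))  # slots 5 and 6 do not fit in a table of size 4
```

The toy function is perfect for m = 8 but not minimal, because it leaves gaps that a table of size n = 4 cannot hold.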
Where to use a PHF or an MPHF?
• Ubiquitous in Computer Science: access items based on the value of a key
• Working with huge static item sets has become a daily task:
  • In data warehousing applications:
    • On-Line Analytical Processing (OLAP) applications
  • In search engines:
    • Large vocabularies
    • To map long URLs to smaller integers that are used as IDs
    • To represent the set of visited URLs while the Web is being crawled
Indexing
[Figure: a collection of n documents (Doc 1, ..., Doc n) is indexed into an inverted index, which pairs a vocabulary (Term 1, ..., Term t) with an inverted list of documents for each term]
Representing the Vocabulary
[Figure: the vocabulary (Term 1, ..., Term t) of the inverted index, each term pointing to its inverted list of documents]
Link Analysis
Web Graph:
• Vertices: URLs
• Edges: Links
[Figure: example web graph with vertices 0 to 9]
Mapping URLs to Web Graph Vertices
[Figure: an MPHF maps the URLs (URL1, ..., URLn) to the web graph vertices 0, 1, ..., n-1]
Crawling
[Figure: crawler diagram with the components Web, Crawling, Pages, Parser, Visited URLs, NEW URLs and URLs To Be Crawled]
Representing the Visited URLs
• An MPHF is the most compact way of representing the set of visited URLs
• This enables us to keep many more URLs in the main memory of each machine
• When the set of new URLs becomes large, a new MPHF is generated for the whole set of URLs; this is repeated until the internal memory is completely used
Lower Bounds for Storage Space
• PHFs (m ≈ n):
• MPHFs (m = n):
Uniform Hashing Versus Universal Hashing
[Figure: a hash function maps the key universe U of size u to the range M of size m]
• Uniform hashing
  • # of functions from U to M: m^u
  • # of bits to encode each function: u log m
  • Independent functions with values uniformly distributed
• Universal hashing
  • A family of hash functions H is universal if:
    • for any pair of distinct keys (x1, x2) from U and
    • a hash function h chosen uniformly from H, then: Pr(h(x1) = h(x2)) ≤ 1/m
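The universal hashing condition can be observed empirically. The sketch below uses the classic family h(x) = ((a·x + b) mod P) mod m as an assumed example of a universal family; the slides do not say which family the authors use, and the prime P, the two test keys and the trial count are arbitrary choices.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; keys must be smaller than P


def random_universal_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m uniformly from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


def collision_rate(x1, x2, m, trials, rng):
    """Estimate Pr[h(x1) = h(x2)] over random choices of h."""
    hits = 0
    for _ in range(trials):
        h = random_universal_hash(m, rng)
        hits += h(x1) == h(x2)
    return hits / trials


rng = random.Random(2008)
m = 16
rate = collision_rate(12345, 67890, m, 20000, rng)
print(f"estimated collision probability {rate:.4f}, bound 1/m = {1 / m:.4f}")
```

Each function in this family is described by just the two numbers a and b, in contrast to the u log m bits needed to pick an arbitrary function from U to M.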
Related Work
• Theoretical Results (use uniform hashing)
• Practical Results (assume uniform hashing for free)
• Heuristics
Theoretical Results
Practical Results – Assume Uniform Hashing For Free
Empirical Results
Internal and External Memory Based Algorithms
• Near space-optimal
• Evaluation in constant time
• Function generation in linear time
• Simple to describe and implement
• Known algorithms with near-optimal space either:
  • Require exponential time for construction and evaluation, or
  • Use near-optimal space only asymptotically, for large n
• Acyclic random hypergraphs
  • Used before by Majewski et al. (1996): O(n log n) bits
  • We proceed differently: O(n) bits (the space complexity changes and is now close to the theoretical lower bound)
Random Hypergraphs (r-graphs)
• 3-graph on the vertices 0, 1, ..., 5, with one edge per key:
  • h0(jan) = 1, h1(jan) = 3, h2(jan) = 5
  • h0(feb) = 1, h1(feb) = 2, h2(feb) = 5
  • h0(mar) = 0, h1(mar) = 3, h2(mar) = 4
• 3-graph is induced by three uniform hash functions
• Our best result uses 3-graphs
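The 3-graph of this example can be written down directly. This is a small sketch in which lookup tables copied from the slide stand in for the three hash functions; the helper names are hypothetical.

```python
from collections import Counter

# Hash values copied from the slide; in practice h0, h1, h2 would be
# three hash functions, here they are hypothetical lookup tables.
H = {
    "jan": (1, 3, 5),
    "feb": (1, 2, 5),
    "mar": (0, 3, 4),
}


def build_hypergraph(keys, hashes):
    """Return one edge (h0(k), h1(k), h2(k)) per key."""
    return {k: hashes[k] for k in keys}


def vertex_degrees(edges):
    """Count how many edges touch each vertex."""
    return Counter(v for edge in edges.values() for v in edge)


edges = build_hypergraph(["jan", "feb", "mar"], H)
degrees = vertex_degrees(edges)
for key, edge in edges.items():
    print(key, edge)
print(sorted(degrees.items()))
```

Vertices of degree 1 (here 0, 2 and 4) matter later: an edge containing such a vertex can be removed first when the graph is tested for cycles.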
The Internal memory based algorithm ...
Acyclic 2-graph
Gr has the vertices 0 to 3 (range of h0) and 4 to 7 (range of h1); each key is an edge {h0(key), h1(key)}: jan = {2,5}, feb = {2,6}, mar = {0,5}, apr = {2,7}.
Edges are removed one at a time, always taking an edge that contains a vertex of degree 1, and appended to the list L:
L: Ø
L: {0,5}
L: {0,5} {2,6}
L: {0,5} {2,6} {2,7}
L: {0,5} {2,6} {2,7} {2,5}
All edges were removed: Gr is acyclic
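The acyclicity test shown in these slides can be sketched as a peeling procedure. The code below is an illustrative version under stated assumptions, not the authors' implementation: the function name `peel`, the queue-based ordering and the requirement that the vertices of an edge are distinct are all assumptions. The edge data are read off the slide.

```python
from collections import deque


def peel(edges, num_vertices):
    """Return the removal order L, or None if the graph has a cycle.

    Assumes the vertices of each edge are distinct, which holds here
    because h0 and h1 have disjoint ranges.
    """
    incident = [set() for _ in range(num_vertices)]
    for key, edge in edges.items():
        for v in edge:
            incident[v].add(key)

    queue = deque(v for v in range(num_vertices) if len(incident[v]) == 1)
    order = []
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue  # the edge at v was already removed through another vertex
        key = next(iter(incident[v]))
        order.append(key)
        for u in edges[key]:
            incident[u].discard(key)
            if len(incident[u]) == 1:
                queue.append(u)
    return order if len(order) == len(edges) else None


# Edges of the 2-graph in the slide: key -> (h0(key), h1(key)).
edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
L = peel(edges, 8)
print([edges[k] for k in L])
```

With this ordering the edges come out as {0,5}, {2,6}, {2,7}, {2,5}, the same list L as on the slide. A graph containing a cycle leaves some edges unremoved, and the function returns None.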
The Family of Algorithms (r = 2)
S = {jan, feb, mar, apr}
Mapping: h0 and h1 map each key of S to an edge of Gr (vertices 0 to 3 for h0, vertices 4 to 7 for h1)
L: {0,5} {2,6} {2,7} {2,5}
Assigning: the edges in L are used to fill the array g:
g[0] = 0, g[1] = r, g[2] = 0, g[3] = r, g[4] = r, g[5] = r, g[6] = 1, g[7] = 1
• Values in the range {0,1, ..., r}
• r = 2 or r = 3
• At most 2 bits for each vertex in g
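One way to obtain the array g from the list L is sketched below. It assumes that L is processed in reverse removal order and that each edge claims its first vertex not yet visited; the slides do not spell out this rule, so treat it as an assumption that happens to reproduce the values shown. The name `assign` is hypothetical.

```python
R = 2  # edge size r; the value R in g marks an unassigned vertex

# Data from the slide: edges as (h0(key), h1(key)) and the removal order L.
edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
L = ["mar", "feb", "apr", "jan"]


def assign(edges, order, num_vertices, r=R):
    """Fill g so that (sum of g over an edge) mod r selects a private vertex."""
    g = [r] * num_vertices
    visited = [False] * num_vertices
    for key in reversed(order):
        edge = edges[key]
        # Index of the first vertex not yet claimed by a previously handled edge.
        i = next(j for j, v in enumerate(edge) if not visited[v])
        others = sum(g[v] for j, v in enumerate(edge) if j != i)
        g[edge[i]] = (i - others) % r
        for v in edge:
            visited[v] = True
    return g


g = assign(edges, L, 8)
print(g)
```

For the example this yields g = [0, 2, 0, 2, 2, 2, 1, 1], where 2 = r marks the unassigned vertices. Reverse order works because the degree-1 vertex that allowed an edge to be removed cannot belong to any edge removed after it, so it is still free when that edge is handled.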
The Family of Algorithms (r = 2)
Ranking: the vertices with g ≠ r are the assigned ones (here 0, 2, 6 and 7); the rank of an assigned vertex is the number of assigned vertices before it, which gives its hash table entry
Hash Table: 0 → mar, 1 → jan, 2 → feb, 3 → apr
The Family of Algorithms (r = 2)
Evaluation for the key feb:
i = (g(h0(feb)) + g(h1(feb))) mod r = (g(2) + g(6)) mod 2 = 1
phf(feb) = h_{i=1}(feb) = 6
mphf(feb) = rank(phf(feb)) = rank(6) = 2
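The evaluation steps can be replayed for all four keys. In this sketch the g array is copied from the slides, the hash functions are hypothetical lookup tables holding the example's values, and `rank` is a plain linear scan written for clarity rather than speed.

```python
R = 2
g = [0, R, 0, R, R, R, 1, 1]  # array from the slide; R marks unassigned

# Hypothetical lookup tables standing in for the hash functions h0 and h1.
h = [
    {"jan": 2, "feb": 2, "mar": 0, "apr": 2},  # h0
    {"jan": 5, "feb": 6, "mar": 5, "apr": 7},  # h1
]


def phf(key):
    """Pick the vertex h_i(key) with i = (g(h0(key)) + g(h1(key))) mod r."""
    i = sum(g[h_j[key]] for h_j in h) % R
    return h[i][key]


def rank(v):
    """Number of assigned vertices (g != R) before position v."""
    return sum(1 for u in range(v) if g[u] != R)


def mphf(key):
    return rank(phf(key))


for key in ["jan", "feb", "mar", "apr"]:
    print(key, phf(key), mphf(key))
```

For feb this gives phf = 6 and mphf = 2, as on the slide. A practical implementation would answer rank in constant time from a small precomputed table instead of scanning g.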
Use of Acyclic Random Hypergraphs
• Acyclicity is a sufficient condition for the family of algorithms to work (Majewski et al. (1996))
• Repeatedly select h0, h1, ..., hr-1 until the resulting hypergraph is acyclic
• For r = 2, m = cn and c > 2: the probability Pra of an acyclic graph tends to sqrt(1 - (2/c)^2)
  • For c = 2.09, Pra = 0.29
• For r = 3 and c ≥ 1.23: probability tends to 1
• Number of iterations is 1/Pra:
  • r = 2: 3.5 iterations
  • r = 3: 1.0 iteration
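The iteration count follows from treating each choice of hash functions as an independent trial that succeeds with probability Pra. The sketch below assumes the standard limit Pra = sqrt(1 - (2/c)^2) for random 2-graphs with m = cn vertices, which agrees with the value 0.29 quoted for c = 2.09; the function names are illustrative.

```python
import math


def prob_acyclic_2graph(c):
    """Limit probability that a random 2-graph with m = c*n vertices is acyclic (c > 2)."""
    return math.sqrt(1 - (2 / c) ** 2)


def expected_iterations(p):
    """Mean number of independent trials until the first success."""
    return 1 / p


p = prob_acyclic_2graph(2.09)
print(f"Pra = {p:.2f}, expected iterations = {expected_iterations(p):.2f}")
```

The computed expectation is a little above 3.4, which the slide quotes rounded as 3.5 iterations. For r = 3 the probability tends to 1, so a single iteration is expected.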
Space to Represent the Functions (r = 3)
• PHFs (ranking information not required):
  • g: [0,m-1] → {0,1,2}
  • 2 bits for each entry of g: 2m = 2cn bits, c = 1.23 → 2.46n bits
• Packed PHFs (range of size 3):
  • log 3 bits for each entry of g (arithmetic coding)
  • (log 3)cn bits, c = 1.23 → 1.95n bits
  • Optimal: 1.17n bits
• MPHFs (ranking information required):
  • g: [0,m-1] → {0,1,2,3}
  • 2m + εm = (2 + ε)cn bits
  • For c = 1.23 and ε = 0.125 → 2.62n bits
  • Optimal: 1.4427n bits
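The per-key space figures are simple arithmetic on c and ε and can be recomputed as follows. This is only a check of the numbers on the slide; the variable names are illustrative, and the MPHF bound is taken to be log2(e) bits per key, which matches the quoted 1.4427n.

```python
import math

c = 1.23      # vertices per key for r = 3
eps = 0.125   # extra ranking bits per vertex, as on the slide

phf_bits = 2 * c                    # 2 bits per entry of g
packed_phf_bits = math.log2(3) * c  # log2(3) bits per entry
mphf_bits = (2 + eps) * c           # 2 bits plus ranking information
mphf_optimal = math.log2(math.e)    # lower bound for an MPHF, bits per key

print(f"PHF:        {phf_bits:.2f} bits/key")
print(f"packed PHF: {packed_phf_bits:.2f} bits/key")
print(f"MPHF:       {mphf_bits:.3f} bits/key")
print(f"MPHF bound: {mphf_optimal:.4f} bits/key")
```

The MPHF figure evaluates to about 2.614 bits per key, which the slide reports rounded up to 2.62n bits.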
Experimental Results
• Metrics:
  • Generation time
  • Storage space
  • Evaluation time
• Collection:
  • 3,541,615 URLs collected from the web
  • 64 bytes long on average
• Experiments:
  • Commodity PC with a cache of 2 Mbytes
  • 3.2 GHz, 1 GB, Linux, 32-bit architecture
Related Algorithms
• Botelho, Kohayakawa, Ziviani (2005) - BKZ
• Botelho, Pagh, Ziviani (2007) - Ours
• Fox, Chen and Heath (1992) - FCH
• Majewski, Wormald, Havas and Czech (1996) - MWHC
• Pagh (1999) - PAGH
All algorithms coded in the same framework: CMPH free software library (http://cmph.sourceforge.net)
Generation Time (n = 3,541,615 keys)
Storage Space (n = 3,541,615 keys)