External Perfect Hashing for Very Large Key Sets
Fabiano C. Botelho, Department of Computer Science, Federal University of Minas Gerais, Brazil
Nivio Ziviani, Department of Computer Science, Federal University of Minas Gerais, Brazil
LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)
What Is The Problem to Solve?
• Generate practical MPHFs for very large key sets (on the order of billions of keys)
• What is required for that?
  • Deal efficiently with external memory
  • Generate compact functions (near space-optimal)
  • Use only a small amount of internal memory, so that the method scales
  • Work in linear time
Where to Use a PHF or an MPHF for Large Sets?
• Ubiquitous in Computer Science: accessing items based on the value of a key
• Working with huge static item sets has become a daily task. For example:
  • In data warehousing applications:
    • On-Line Analytical Processing (OLAP) applications
  • In search engines:
    • Large vocabularies (ongoing work)
    • Mapping long URLs to smaller integer numbers that are used as IDs
Mapping Long URLs To Smaller Integer Numbers
• Link analysis (PageRank, HITS, SALSA, and so on)
[Figure: web graph with ten vertices numbered 0 to 9. Vertices: URLs. Edges: links among web pages.]
Mapping Long URLs To Smaller Integer Numbers
[Figure: an MPHF maps the URLs (URL1, URL2, ..., URLn) to the web graph vertices 0, 1, ..., n-1.]
Perfect Hash Function
[Figure: a perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) to a hash table with slots 0, 1, ..., m-1.]
Minimal Perfect Hash Function
[Figure: a minimal perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) to a hash table with slots 0, 1, ..., n-1.]
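The two definitions can be checked mechanically. The following is a minimal sketch, assuming hypothetical helper names (`is_perfect`, `is_minimal_perfect`) and toy functions stored as lookup tables; none of these names come from the slides.

```python
from typing import Callable, Sequence

def is_perfect(h: Callable[[str], int], keys: Sequence[str], m: int) -> bool:
    """True if h sends every key to a distinct slot in [0, m)."""
    slots = [h(k) for k in keys]
    in_range = all(0 <= s < m for s in slots)
    return in_range and len(set(slots)) == len(slots)

def is_minimal_perfect(h: Callable[[str], int], keys: Sequence[str]) -> bool:
    """A perfect hash function is minimal when the table has exactly n slots."""
    return is_perfect(h, keys, len(keys))

keys = ["jan", "feb", "mar", "apr"]
# Toy functions stored as tables (illustrative only; a real MPHF is computed, not looked up).
minimal = {"jan": 2, "feb": 0, "mar": 3, "apr": 1}.__getitem__
sparse = {"jan": 5, "feb": 0, "mar": 3, "apr": 1}.__getitem__

print(is_perfect(sparse, keys, 6))        # perfect into a table of m = 6 slots
print(is_minimal_perfect(sparse, keys))   # not minimal: slot 5 is outside [0, 4)
print(is_minimal_perfect(minimal, keys))  # minimal: the slots are exactly 0..3
```

The only difference between the two checks is the table size: m may exceed n for a PHF, while an MPHF uses m = n.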
Lower Bounds For Storage Space
• PHFs (m ≈ n):
• MPHFs (m = n):
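As background for the two cases, the standard information-theoretic lower bound from the literature can be written as follows. This is an assumption about which bound the slide refers to, not text recovered from it; u is the size of the key universe, with u much larger than n.

```latex
% Bits needed to represent a perfect hash function for n keys
% drawn from a universe of size u, with a table of m slots:
\[
  \text{bits} \;\ge\; \log_2 \frac{\binom{u}{n}}{\left(\frac{u}{m}\right)^{n}\binom{m}{n}}
\]
% Minimal case (m = n), for u much larger than n:
\[
  \text{bits} \;\approx\; n \log_2 e \;\approx\; 1.4427\, n
\]
```

Under this bound an MPHF cannot take much less than about 1.44 bits per key.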
Uniform Hashing Versus Universal Hashing
[Figure: a hash function maps the key universe U of size u to the range M of size m.]
• Uniform hashing
  • Number of functions from U to M?
  • Independent functions with values uniformly distributed
  • Number of bits to encode each function
• Universal hashing
  • A family of hash functions H is universal if:
    • for any pair of distinct keys (x1, x2) from U and
    • a hash function h chosen uniformly from H, then: Pr[h(x1) = h(x2)] ≤ 1/m
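The universal hashing condition can be verified exhaustively on a small family. The sketch below uses the classic family h(x) = ((a*x + b) mod p) mod m, chosen here for illustration and not taken from the slides; the prime p = 101 and range size m = 10 are arbitrary assumptions. By contrast, a uniformly random function is one of m^u possible functions from U to M, so encoding one takes u·log2(m) bits.

```python
from itertools import product

def collision_fraction(x1: int, x2: int, p: int, m: int) -> float:
    """Fraction of functions h(x) = ((a*x + b) % p) % m that collide on x1 and x2.

    Enumerates the whole family: a in 1..p-1, b in 0..p-1, with p prime
    and both keys smaller than p.
    """
    collisions = 0
    total = 0
    for a, b in product(range(1, p), range(p)):
        h1 = ((a * x1 + b) % p) % m
        h2 = ((a * x2 + b) % p) % m
        collisions += (h1 == h2)
        total += 1
    return collisions / total

P, M = 101, 10                      # assumed prime and range size
frac = collision_fraction(3, 77, P, M)
print(frac, "<=", 1 / M)            # the universal bound Pr[h(x1) = h(x2)] <= 1/m
```

Each function in this family is described by the two integers a and b, which is the practical appeal of universal hashing over storing a uniformly random function.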
Our Contribution
• First external perfect hashing (EPH) algorithm capable of dealing efficiently with sets on the order of billions of keys
• This is possible because:
  • It makes good use of the memory hierarchy.
  • It requires O(N) internal memory words for constructing the functions, where N is much smaller than n.
• Two implementations:
  • Theoretically well-founded EPH (uses uniform hashing)
  • Heuristic EPH (uses universal hashing)
• The resulting functions are:
  • generated in linear time
  • stored in O(n) bits
  • evaluated in constant time
Related Work
• Theoretical results (use uniform hashing)
• Practical results (assume uniform hashing for free)
• Empirical results
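The External Perfect Hashing Algorithm (EPH)
[Figure: two steps. Partitioning: h0 splits the key set S (keys 0, 1, ..., n-1) into buckets 0, 1, ..., 2^b - 1. Searching: one function per bucket, MPHF_0, MPHF_1, ..., MPHF_(2^b - 1), which together address a hash table with slots 0, 1, ..., n-1.]
mphf(x) = mphf_i(x) + offset[i], where i = h0(x) is the bucket of x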
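The two-step structure can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: BLAKE2b with a seed stands in for the hash functions, each bucket's MPHF is found by brute-force seed search (feasible only for tiny buckets), and `offset[i]` is assumed to be the number of keys in buckets before bucket i. The names `h`, `build` and `mphf` are hypothetical.

```python
import hashlib

def h(key: str, seed: int, m: int) -> int:
    """Seeded hash of key into [0, m). BLAKE2b is an illustrative stand-in."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % m

def build(keys, b):
    """Partition keys into 2^b buckets with h0 (seed 0), then search one MPHF per bucket."""
    n_buckets = 1 << b
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[h(key, 0, n_buckets)].append(key)

    seeds, sizes, offset = [], [], []
    placed = 0
    for bucket in buckets:
        size = len(bucket)
        seed = 1
        # Brute force: try seeds until the bucket's keys land on distinct slots.
        while size and len({h(k, seed, size) for k in bucket}) < size:
            seed += 1
        seeds.append(seed)
        sizes.append(size)
        offset.append(placed)      # keys stored in earlier buckets
        placed += size
    return seeds, sizes, offset

def mphf(x, seeds, sizes, offset):
    """mphf(x) = mphf_i(x) + offset[i]; defined only for keys in the original set."""
    i = h(x, 0, len(seeds))
    return h(x, seeds[i], sizes[i]) + offset[i]

keys = [f"http://example.com/page{i}" for i in range(200)]
seeds, sizes, offset = build(keys, b=7)
values = sorted(mphf(k, seeds, sizes, offset) for k in keys)
print(values == list(range(len(keys))))
```

Because every bucket owns a contiguous range of slots starting at its offset, the per-bucket functions never interfere with one another.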
Key Set Does Not Fit In Internal Memory
[Figure: the key set S of β bytes is read in chunks of µ bytes, the size of the internal memory. h0 partitions each chunk into buckets 0, 1, ..., 2^b - 1, and each partitioned chunk is written to disk as one of File 1, ..., File N.]
• Each bucket ≤ 256 keys
• N = β/µ
• b = number of bits of each bucket address
Important Design Decisions
• We map long URLs to smaller keys of fixed size (128 bits) using a hash function
• We choose a linear-time, near space-optimal algorithm to generate the MPHF of each bucket
• How do we obtain linear time complexity?
  • Using internal radix sorting to form the buckets
  • Using a heap of N entries to drive an N-way merge that reads the buckets from disk in one pass
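The sort-then-merge idea can be sketched as follows. This is an illustration under assumptions, not the authors' code: in-memory lists stand in for the N files on disk, a single counting-sort pass over the bucket address stands in for the internal radix sort, and Python's `heapq.merge`, which keeps one heap entry per input run, plays the role of the heap of N entries. The function names are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def sort_run(pairs, b):
    """Order one in-memory chunk of (bucket_address, key) pairs by address.

    One counting-sort pass over b-bit addresses, linear in the chunk size.
    """
    slots = [[] for _ in range(1 << b)]
    for addr, key in pairs:
        slots[addr].append(key)
    return [(addr, key) for addr, ks in enumerate(slots) for key in ks]

def merged_buckets(runs):
    """N-way merge of sorted runs; yields each bucket once, fully assembled."""
    merged = heapq.merge(*runs, key=itemgetter(0))
    for addr, group in groupby(merged, key=itemgetter(0)):
        yield addr, [key for _, key in group]

# Two chunks (N = 2) with 2-bit bucket addresses (b = 2).
chunk1 = [(3, "k1"), (0, "k2"), (1, "k3")]
chunk2 = [(1, "k4"), (3, "k5"), (0, "k6")]
runs = [sort_run(chunk1, 2), sort_run(chunk2, 2)]
result = {addr: sorted(ks) for addr, ks in merged_buckets(runs)}
print(result)
```

Each run is consumed sequentially and exactly once, which is what allows the buckets to be read from disk in one pass.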
Algorithm Used For The Buckets
[Figure: the three steps (Mapping, Assigning, Ranking) applied to the key set S = {information, knowledge, management}, showing the hash functions f0(x) and f1(x), the array g, and the resulting hash table.]
• The resulting functions require approximately 3.3 bits/key to be stored
Why Is the EPH Algorithm Well-Founded?
• First point: a pool of hash functions (f0, f1) that are uniform on each bucket (≤ 256 keys), shared by all the buckets
[Figure: buckets 0, 1, ..., 2^b - 1 sharing f0 and f1.]
Why Is the EPH Algorithm Well-Founded?
• Second point: we have shown how to create that pool based on the linear hash functions proposed by Alon et al. (JACM 1999)
[Figure: the linear hash functions and the pool.]
Why Is the EPH Algorithm Well-Founded?
• Second point: computing 128 bits with the linear hash functions . . .
Why Is the EPH Algorithm Well-Founded?
• Third point: how to keep the maximum bucket size smaller than l = 256?
The Heuristic EPH Algorithm
• Theoretically founded hash functions from the pool:
  • Slower than simpler universal hash functions and pseudo-random hash functions
  • Require a lookup table that does not fit in cache
• Heuristic EPH uses a pseudo-random hash function proposed by Jenkins (1997):
  • Faster hash function
  • Requires just one random integer to be stored as a seed
• With universal hashing we often lose relatively little compared to using a completely uniform random map
Experimental Results
• Metrics:
  • Construction time
  • Storage space
  • Evaluation time
• Collection:
  • 1.024 billion URLs collected from the web (64 bytes long on average)
• Experiments:
  • Commodity PC with a cache of 512 Kbytes
  • 1 GHz, 1 GB, Linux, 64-bit architecture
Construction Time (in Minutes)
[Table: time in minutes to construct an MPHF for 32, 128, 512 and 1024 million keys.]
Related Algorithms
• Botelho, Kohayakawa, Ziviani (2005) - BKZ
• Botelho, Pagh, Ziviani (2007) - BPZ
• Botelho, Ziviani (2007) - EPH
• Fox, Chen and Heath (1992) - FCH
• Czech, Havas and Majewski (1992) - CHM
• Pagh (1999) - PAGH
All algorithms are coded in the same framework: the CMPH free software library (http://cmph.sourceforge.net)
Construction Time, Storage Space and Evaluation Time
[Table: results for 2 million keys; key length = 64 bytes.]
Internal Memory Size Versus Generation Time
C Minimal Perfect Hashing Library (CMPH)
• Why build a library?
  • Lack of similar libraries in the free software community
  • To test the applicability of our algorithms in practice
• Feedback:
  • 1,310 downloads
  • Incorporated by Debian
• Applications:
  • Norton spam filter (Symantec)
  • Natural language translation models (University of California)
• Library address: http://cmph.sourceforge.net
Conclusions
• We have presented an efficient external-memory-based algorithm
• Two implementations were developed:
  • Theoretical EPH algorithm
  • Heuristic EPH algorithm
• Both algorithms generate near space-optimal functions in linear time
• The functions are computed in time O(1) and are simple enough to be used in practice.
• The EPH algorithm is the first theoretically well-founded algorithm that can be used in practice and will work for every key set.
Ongoing Work – Large Vocabularies
• Problem:
  • Large vocabularies use a considerable portion of the main memory
  • Access to search engine vocabularies follows a Zipf distribution
Ongoing Work – Large Vocabularies
• Solution:
  • An MPHF to index the vocabulary entries on disk and a cache to resolve most of the accesses quickly
• Evaluation:
  • Compare our approach with one that uses a B-tree
Ongoing Work - Distributed EPH
• Problem:
  • How to build a distributed version of the EPH algorithm?
• Solution:
  • Data parallelism
• Evaluation:
  • Compare with the original EPH algorithm
?
Important Design Decisions
[Figure: a) Buckets, logical view: buckets 0, 1, ..., 2^b - 1. b) Buckets, physical view: File 1, File 2, ..., File N.]
• Map long URLs to smaller keys of fixed size by using a hash function
• How do we obtain linear time complexity?
  • Using radix sort to form the buckets in internal memory.
  • Using a heap of N entries to drive an N-way merge that reads the buckets from disk in the Searching step.
Conclusions – C Minimal Perfect Hashing Library (CMPH)
• Why build a library?
  • Lack of similar libraries in the free software community
  • To test the applicability of our algorithms in practice
• Feedback:
  • We have released 3 versions
  • 1014 downloads (338 per version on average)
  • 100 emails thanking us for the library
  • .deb packages available for Ubuntu (a Linux distribution)
  • .deb packages distributed with Debian
  • An invitation to spend a summer at the University of California
  • A job offer from Symantec Corp.
Using uniform hash functions to generate a Random Graph
• Two hash functions induce a random 2-graph.
• Example for the key set S = {jan, feb, mar, apr}:
  • h0(jan) = 2, h1(jan) = 5
  • h0(feb) = 2, h1(feb) = 6
  • h0(mar) = 0, h1(mar) = 5
  • h0(apr) = 2, h1(apr) = 7
[Figure: graph Gr with vertices 0 to 3 on the h0 side and 4 to 7 on the h1 side; each key is the edge joining its two hash values, with edges labelled jan, mar, feb and apr.]
• Then, r hash functions induce a random r-graph.
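The construction on this slide can be reproduced directly from the listed hash values. In this minimal sketch the hash functions are given as lookup tables holding the slide's example values; the helper name `induced_graph` is hypothetical.

```python
from collections import Counter

# Hash values copied from the example; real hash functions would compute them.
h0 = {"jan": 2, "feb": 2, "mar": 0, "apr": 2}
h1 = {"jan": 5, "feb": 6, "mar": 5, "apr": 7}

def induced_graph(keys, *hash_functions):
    """Each key becomes one edge joining the vertices picked by the hash functions.

    With two functions the edges have two endpoints (a 2-graph);
    with r functions each edge has r endpoints (an r-graph).
    """
    return {key: tuple(h[key] for h in hash_functions) for key in keys}

edges = induced_graph(["jan", "feb", "mar", "apr"], h0, h1)
degree = Counter(v for edge in edges.values() for v in edge)

print(edges)
print(degree[2], degree[5])   # vertex 2 touches three edges, vertex 5 touches two
```

Passing a third table to `induced_graph` would give each key three endpoints, which is the r = 3 case of the last bullet.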
Influence of the internal memory size on the generation time