Experiments with Hashing 15-451 Feb. 15, 2001 • Some Hash Functions • Bucket Size Distribution • Maximum Bucket Sizes http://www.cs.cmu.edu/~bryant
Parameters • Keys • /usr/dict/words • N = 45,402 English words • From “Aarhus” to “Zurich” • 1–28 characters long • “antidisestablishmentarianism” • Hashing • Into M buckets • Load = N/M • 8 different hash functions
Hash Functions • Key x = c1 c2 … c_len(x) • Functions • h1(x) = c1 mod M • This is really bad! • There are only 52 possible characters • h2(x) = (Σi ci) mod M • Hashes “not” and “ton” to the same bucket • h3(x) = (Σi ai * ci) mod M • ai’s are random 22-bit numbers • This should be a good function • h4(x) = (Σi (ai * ci + bi)) mod M • ai’s, bi’s are random 22-bit numbers • This should be an even better function
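A minimal Python sketch of h1 through h4 may make the definitions concrete. It assumes that each formula sums over the character positions of the key (which the “not”/“ton” collision for h2 implies), that ci is the character code returned by ord(), and that the coefficients come from a seeded generator. The names A, B, MAX_LEN and the seed are illustrative, not from the slides. Python integers do not overflow, so this will not reproduce any fixed-width arithmetic the original experiment may have used.

```python
import random

MAX_LEN = 28                      # longest key, per the Parameters slide
rng = random.Random(451)          # arbitrary seed, only for reproducibility
A = [rng.getrandbits(22) for _ in range(MAX_LEN)]   # assumed: one a_i per position
B = [rng.getrandbits(22) for _ in range(MAX_LEN)]   # assumed: one b_i per position

def h1(x, m):
    """First character only."""
    return ord(x[0]) % m

def h2(x, m):
    """Sum of character codes; ignores character order."""
    return sum(ord(c) for c in x) % m

def h3(x, m):
    """Weighted sum with random 22-bit coefficients a_i."""
    return sum(a * ord(c) for a, c in zip(A, x)) % m

def h4(x, m):
    """Weighted sum with random coefficients a_i and offsets b_i."""
    return sum(a * ord(c) + b for a, b, c in zip(A, B, x)) % m
```

With these definitions, h2("not", 128) and h2("ton", 128) land in the same bucket because addition is commutative, while h1 sends every word starting with the same letter to one bucket.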
More Hash Functions • h5(x) = (Σi ai * ci) mod M • ai’s are random 22-bit numbers • All sums & products computed modulo p = 524,287 • This should be a good function • h6(x) = (Σi (ai * ci + bi)) mod M • ai’s, bi’s are random 22-bit numbers • All sums & products computed modulo p = 524,287 • This should be the best function • h7(x) = h6(first 5 characters of x) • Hashes “botch”, “botches”, “botching”, and “botched” to the same bucket • h8(x) = random(0..M-1) • Not a real hash function • Should represent the ideal case
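The remaining four functions can be sketched the same way. This is an illustration under assumptions: the formulas are taken to sum over character positions, ci is the ord() code of the character, and the seeds and the names A, B, coeff_rng, and bucket_rng are hypothetical. The prime p = 524,287 equals 2^19 - 1, so the 22-bit coefficients are reduced when the running total is taken modulo p.

```python
import random

P = 524_287                       # p from the slide; equals 2**19 - 1
MAX_LEN = 28                      # longest key, per the Parameters slide
coeff_rng = random.Random(451)    # arbitrary seed for the coefficients
A = [coeff_rng.getrandbits(22) for _ in range(MAX_LEN)]
B = [coeff_rng.getrandbits(22) for _ in range(MAX_LEN)]
bucket_rng = random.Random(2001)  # separate generator for h8

def h5(x, m):
    """Like h3, but every sum and product is reduced modulo P."""
    total = 0
    for a, c in zip(A, x):
        total = (total + a * ord(c)) % P
    return total % m

def h6(x, m):
    """Like h4, but every sum and product is reduced modulo P."""
    total = 0
    for a, b, c in zip(A, B, x):
        total = (total + a * ord(c) + b) % P
    return total % m

def h7(x, m):
    """h6 applied to the first five characters only."""
    return h6(x[:5], m)

def h8(x, m):
    """Random bucket; ignores the key, so it is not a usable hash function."""
    return bucket_rng.randrange(m)
```

Because h7 sees only the prefix, "botch", "botches", "botching", and "botched" all map to the same bucket for any M.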
Bucket Size Distribution • Experiment • Hash 45,402 keys into 128 buckets • Load = 354.7 • Average number of keys per bucket • Measure • Range of bucket sizes • Normalize as count/load • Average = 1.0 • Indicates how well the hash function distributes keys
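The measurement described here could be sketched as follows. The helper names bucket_sizes and normalized_range are hypothetical, the eight-word key list is a toy stand-in for /usr/dict/words, and h2 (sum of character codes) is used only as a simple example function.

```python
from collections import Counter

def bucket_sizes(keys, h, m):
    """Return the number of keys in each of the m buckets under hash h."""
    counts = Counter(h(k, m) for k in keys)
    return [counts.get(b, 0) for b in range(m)]

def normalized_range(sizes, n):
    """Smallest and largest bucket size, each divided by load = n / m."""
    load = n / len(sizes)
    return min(sizes) / load, max(sizes) / load

def h2(x, m):
    """Example hash: sum of character codes mod m."""
    return sum(ord(c) for c in x) % m

# Toy key set; the slides use the 45,402 words of /usr/dict/words.
keys = ["not", "ton", "hash", "bucket", "load", "key", "word", "list"]
sizes = bucket_sizes(keys, h2, 4)
smallest, largest = normalized_range(sizes, len(keys))
print(sizes, smallest, largest)
```

Dividing by the load puts every hash function on the same scale: the normalized sizes always average 1.0, so a function that spreads keys well keeps both ends of the range close to 1.0.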
Distribution Observations • Load = 354.7 • h1 is really bad • Only uses 52 buckets • Largest one has 4,532 elements • h7 is pretty bad too • Good function, but only over first 5 characters • Largest has 529 elements • Rest look fairly decent • h2: 441 max. [Ignores order of characters] • h4: 428 max. [Why not better than h3?] • h6: 409 max. [Why not better than h5?] • h8: 403 max. [Random] • Hey! This should be the best! • h3: 402 max. [Mod p helps] • h5: 400 max. [Mod p helps]
Maximum Bucket Size • Experiment • Hash 45,402 keys into M buckets • M ranges over powers of 2 from 128 to 65,536 • Load ranges from 354.7 down to 0.69 • Measure • Maximum bucket size • Normalize as count/load • Determines worst-case access time
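A rough sketch of this sweep, using the random assignment h8 as the stand-in hash function. The helper max_bucket_size, the seed, and the placeholder integer keys are assumptions for illustration; plugging in a real word list and any of h1 through h7 would follow the same pattern.

```python
import random
from collections import Counter

N = 45_402                                     # number of keys, per the slides
BUCKET_COUNTS = [2**k for k in range(7, 17)]   # M = 128, 256, ..., 65,536

def max_bucket_size(keys, h, m):
    """Size of the fullest bucket when keys are hashed by h into m buckets."""
    return max(Counter(h(k, m) for k in keys).values())

rng = random.Random(451)                       # arbitrary seed

def h8(x, m):
    """Random bucket, the ideal case from the slides; ignores the key."""
    return rng.randrange(m)

keys = range(N)                                # placeholder keys, since h8 ignores them
for m in BUCKET_COUNTS:
    load = N / m
    worst = max_bucket_size(keys, h8, m)
    print(f"M={m:6d}  load={load:7.2f}  max={worst:4d}  max/load={worst / load:6.2f}")
```

The normalized maximum grows as M increases even for the random case, because with a load below 1 a bucket holding only a handful of keys is already several times the average.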
Bucket Size Observations • h1 is really bad • Only uses 52 buckets • Largest one has 4,532 elements, independent of M • h7 is pretty bad too • Good function, but only over first 5 characters • Largest bucket with M=65,536 has 197 elements • h2 doesn’t do very well • Ignores order of characters • Largest bucket with M=65,536 has 163 elements • Rest are comparable • 6–7 elements in largest bucket for M=65,536 • Compare to theory • When M=N, E[largest bucket size] = Θ(log N / log log N) • For M=N=65,536, with logs base 2, this would be 16/4 = 4
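The closing arithmetic can be checked in a few lines. This sketch uses base-2 logarithms, as the 16/4 on the slide does; the function name predicted_max is hypothetical, and the formula is only an asymptotic estimate that ignores constant factors, so it is a rough guide rather than an exact prediction.

```python
import math

def predicted_max(n):
    """Estimate log2(n) / log2(log2(n)) for n keys hashed into n buckets."""
    return math.log2(n) / math.log2(math.log2(n))

print(predicted_max(65_536))   # log2(65,536) = 16 and log2(16) = 4, so 4.0
```

The observed 6–7 elements for the better functions is the same order of magnitude as this estimate, which is about as much agreement as an asymptotic bound can be expected to give.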