Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Michael Mitzenmacher, Salil Vadhan
And improvements with Kai-Min Chung
The Question
• Traditional analyses of hashing-based algorithms & data structures assume a truly random hash function.
• In practice: simple (e.g. universal) hash functions perform just as well.
• Why?
Outline
• Three hashing applications
• The new model and results
• Proof ideas
Bloom Filters
To approximately store S = {x1,…,xT} ⊆ [N]:
• Start with an array of M = O(T) zeroes.
• Hash each item k = O(1) times to [M] using h : [N] → [M]^k; put a one in each location.
To test y ∈ S:
• Hash & accept if there are ones in all k locations.
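A minimal Python sketch of this construction. The class name is illustrative, and Python's built-in hash (salted with random seeds) stands in for the k hash functions that the slides assume are truly random or universal:

```python
import random

class BloomFilter:
    """Minimal Bloom filter sketch: M bits, k hash locations per item."""

    def __init__(self, m, k):
        self.m = m                                   # array size M
        self.k = k                                   # hashes per item
        self.bits = [0] * m
        self.seeds = [random.randrange(2**32) for _ in range(k)]

    def _locations(self, x):
        # One location in [M] per (simulated) hash function.
        return [hash((seed, x)) % self.m for seed in self.seeds]

    def add(self, x):
        for loc in self._locations(x):
            self.bits[loc] = 1

    def might_contain(self, y):
        # Accept iff all k locations hold a one.
        # False positives are possible; false negatives are not.
        return all(self.bits[loc] for loc in self._locations(y))
```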
Bloom Filter Analysis
Thm [B70]: ∀S, ∀y ∉ S, if h is a truly random hash function,
Pr_h[accept y] = 2^(−(ln 2)·M/T) + o(1)
for an optimal choice of k.
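Where the constant comes from (the standard Bloom filter calculation, reconstructed here rather than taken from the slide): with k truly random hash functions the false positive probability is roughly (1 − e^(−kT/M))^k, and optimizing over k gives the stated bound.

```latex
\Pr[\text{accept } y] \approx \left(1 - e^{-kT/M}\right)^{k},
\quad \text{minimized at } k = (\ln 2)\,\frac{M}{T},
\quad \text{giving } 2^{-(\ln 2)\,M/T}.
```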
Balanced Allocations
• Hashing T items into T buckets.
• What is the maximum number of items, or load, of any bucket?
• Assume buckets chosen independently & uniformly at random.
• Well-known result: Θ(log T / log log T) maximum load w.h.p.
Power of Two Choices
• Suppose each ball can pick two bins independently and uniformly and goes to the bin with the lesser load.
• Thm [ABKU94]: maximum load log log T / log 2 + Θ(1) w.h.p.
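A small simulation contrasting the two regimes (function name and parameters are illustrative assumptions):

```python
import random

def max_load(T, choices):
    """Throw T balls into T bins; each ball inspects `choices` uniformly
    random bins and goes into the least loaded one. Returns the max load."""
    bins = [0] * T
    for _ in range(T):
        picks = [random.randrange(T) for _ in range(choices)]
        best = min(picks, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

T = 1 << 16
print("one choice :", max_load(T, 1))   # ~ Theta(log T / log log T)
print("two choices:", max_load(T, 2))   # ~ log log T / log 2 + Theta(1)
```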
Linear Probing
• Hash elements into an array of length M.
• If h(x) is already full, try h(x)+1, h(x)+2, … until an empty spot is found; place x there.
• Thm [K63]: Expected insertion time for the T'th item is 1/(1−(T/M))² + o(1).
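A sketch of the insertion procedure (the wrap-around mod M and the helper's name are assumptions; the slide leaves boundary behavior implicit):

```python
def insert_linear_probing(table, h_x, x):
    """Place x starting at its hash location h_x, probing
    h_x, h_x+1, h_x+2, ... (mod M) until an empty slot is found.
    Returns the number of probes used. Assumes the table is not full."""
    M = len(table)
    probes = 1
    i = h_x
    while table[i] is not None:   # slot occupied, keep probing
        i = (i + 1) % M
        probes += 1
    table[i] = x
    return probes
```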
Explicit Hash Functions
Can sometimes analyze for explicit (e.g. universal [CW79]) hash functions, but:
• performance is somewhat worse, and/or
• the hash functions are complex/inefficient.
It has been noted since the 1970s that simple hash functions match idealized performance on real data.
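For concreteness, a sketch of sampling from the classic universal family of [CW79]; the specific Mersenne prime is an illustrative choice, valid assuming N ≤ 2^61 − 1:

```python
import random

def random_universal_hash(N, M):
    """Sample h(x) = ((a*x + b) mod p) mod M from the Carter-Wegman
    universal family [CW79], for a fixed prime p >= N."""
    p = (1 << 61) - 1              # a Mersenne prime; p >= N assumed
    a = random.randrange(1, p)     # random nonzero multiplier
    b = random.randrange(p)        # random offset
    return lambda x: ((a * x + b) % p) % M
```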
Simple Hash Functions Don't Always Work
• ∃ pairwise independent hash families & inputs s.t. Linear Probing has Ω(log T) insertion time [PPR07].
• ∃ k-wise independent hash families & inputs s.t. Bloom Filter error prob. is higher than ideal [MV08].
• Open for Balanced Allocations.
• Worst case does not match practice.
Average-Case Analysis?
• Data uniform & independent in [N].
• Not a good model for real data.
• Trivializes hashing.
• Need an intermediate model between worst-case and average-case analysis.
Our Model: Block Sources [CG85]
• Data is a finite stream, modeled by a sequence of random variables X1, X2, …, XT ∈ [N].
• Each stream element has at least k bits of (Renyi) entropy, conditioned on previous elements:
H2(Xi | X1=x1, …, Xi−1=xi−1) ≥ k, where H2(X) = log(1/cp(X)) and cp(X) = Σx Pr[X=x]².
• Similar in spirit to semi-random graphs [BS95], smoothed analysis [ST01].
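A sketch of estimating these quantities from data. The model itself is distributional; empirical frequencies over samples, and the unconditional (rather than per-block conditional) entropy, are illustrative shortcuts:

```python
import math
from collections import Counter

def renyi_entropy(samples):
    """Estimate cp(X) = sum_x Pr[X=x]^2 from empirical frequencies and
    return the Renyi entropy H_2(X) = log2(1 / cp(X))."""
    n = len(samples)
    counts = Counter(samples)
    cp = sum((c / n) ** 2 for c in counts.values())
    return math.log2(1 / cp)
```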
An Approach
• H truly random: for all distinct x1, …, xT, (H(x1), …, H(xT)) is uniform in [M]^T.
• Goal: if H is a random universal hash function and X1, X2, …, XT is a block source, then (H(X1), …, H(XT)) is "close" to uniform.
⇒ Randomness extractors!
Classic Extractor Results [BBR88, ILL89, CG85, Z90]
• Leftover Hash Lemma: If H : [N] → [M] is a random universal hash function and X has Renyi entropy at least log M + 2 log(1/ε), then (H, H(X)) is ε-close to uniform.
• Thm: If H : [N] → [M] is a random universal hash function and X1, X2, …, XT is a block source with Renyi entropy at least log M + 2 log(T/ε) per block, then (H, H(X1), …, H(XT)) is ε-close to uniform.
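A brute-force sketch for checking the single-block statement numerically on a small hash family; all function and variable names here are illustrative assumptions:

```python
def statistical_distance(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

def extractor_distance(family, X, M):
    """Distance of (H, H(X)) from uniform over family x [M].
    `family` is a list of hash functions, `X` maps x -> Pr[X=x]."""
    joint = {}
    for i, h in enumerate(family):          # H uniform over the family
        for x, px in X.items():
            key = (i, h(x))
            joint[key] = joint.get(key, 0) + px / len(family)
    uniform = {(i, m): 1 / (len(family) * M)
               for i in range(len(family)) for m in range(M)}
    return statistical_distance(joint, uniform)
```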
Sample Parameters
• Network flows (IP addresses, ports, transport protocol): N = 2^104.
• Number of items: T = 2^16.
• Hash range (2 values per item): M = 2^32.
• Entropy needed per item: 64 + 2 log(1/ε).
• Can we do better?
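The per-item entropy figure follows by plugging these parameters into the classic theorem above (a worked instance, not on the original slide):

```latex
\log M + 2\log(T/\varepsilon)
  = 32 + 2\left(16 + \log(1/\varepsilon)\right)
  = 64 + 2\log(1/\varepsilon).
```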
Improved Bounds I
Thm [CV08]: If H : [N] → [M] is a random universal hash function and X1, X2, …, XT is a block source with Renyi entropy at least log M + log T + 2 log(1/ε) + O(1) per block, then (H, H(X1), …, H(XT)) is ε-close to uniform.
Tight up to the additive constant [CV08].
Improved Bounds II
Thm [MV08, CV08]: If H : [N] → [M] is a random universal hash function and X1, X2, …, XT is a block source with Renyi entropy at least log M + log T + log(1/ε) + O(1) per block, then (H, H(X1), …, H(XT)) is ε-close to a distribution with collision probability O(1/M^T).
Tight up to the dependence on ε [CV08].
Proof Ideas: Upper Bounds
1. Bound average conditional collision probabilities:
cp(H(Xi) | H, H(X1), …, H(Xi−1)) ≤ 1/M + 1/2^k.
2a. Statistical closeness to uniform: inductively bound the "Hellinger distance" from uniform.
2b. Closeness to small collision probability: by Markov,
(1/T) · Σi cp(H(Xi) | H=h, H(X1)=y1, …, H(Xi−1)=yi−1) ≤ 1/M + 1/(ε·2^k)
w.p. 1−ε over h, y1, …, yi−1.
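The Markov step in 2b, spelled out in the slide's notation (a reconstruction of the standard argument):

```latex
% Let Z = (1/T) \sum_i cp\big(H(X_i) \mid H=h,\, H(X_1)=y_1,\dots,H(X_{i-1})=y_{i-1}\big) - 1/M.
% Step 1 gives E_{h,\,y_1,\dots,y_{i-1}}[Z] \le 1/2^k, so Markov's inequality yields
\Pr\!\left[\, Z > \frac{1}{\varepsilon\, 2^k} \,\right] \le \varepsilon,
% i.e. with probability 1-\varepsilon the average conditional collision
% probability is at most 1/M + 1/(\varepsilon 2^k).
```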
Proof Ideas: Lower Bounds
• Lower bound for randomness extractors [RT97]: if k is not large enough, then ∃ X of min-entropy k s.t. h(X) is "far" from uniform for most h.
• Take X1, X2, …, XT to be i.i.d. copies of X.
• Show that error accumulates, e.g. statistical distance grows by a factor of Ω(√T) [R04, CV08].
Open Problems
• Tightening the connection to practice.
  • How to estimate the relevant entropy of data streams?
  • Cryptographic hash functions (MD5, SHA-1)?
  • Other data models?
• Block source data model.
  • Other uses, implications?