History-Independent Cuckoo Hashing

  1. History-Independent Cuckoo Hashing • Gil Segev, Moni Naor, Udi Wieder • Weizmann Institute, Israel • Microsoft Research Silicon Valley

  2. Election Day • Elections for class president • Each student whispers in Mr. Drew’s ear • Mr. Drew writes down the votes • Problem: Mr. Drew’s notebook leaks sensitive information • First student voted for Carol • Second student voted for Alice • … • May compromise the privacy of the elections • [Figure: notebook listing the votes in arrival order: Carol, Alice, Alice, Bob]

  3. Election Day • What about more involved applications? • Write-in candidates • Votes which are subsets or rankings • … • A simple solution: • Lexicographically sorted list of candidates • Unary counters • [Figure: notebook in arrival order (Carol, Alice, Alice, Bob) next to the sorted list with unary counters: Alice 1 1, Bob 1, Carol 1]
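
A minimal sketch of the "sorted list plus unary counters" notebook from this slide, in Python. The function name `canonical_tally` and the use of the string `"1" * k` as a unary counter are illustrative assumptions, not taken from the talk.

```python
from collections import Counter

def canonical_tally(votes):
    """Return the notebook as a lexicographically sorted list of
    (candidate, unary counter) pairs. The result depends only on how
    many votes each candidate got, not on the order of arrival."""
    counts = Counter(votes)
    return [(name, "1" * counts[name]) for name in sorted(counts)]

# Two different arrival orders with the same multiset of votes.
arrival_a = ["Carol", "Alice", "Alice", "Bob"]
arrival_b = ["Alice", "Bob", "Alice", "Carol"]

print(canonical_tally(arrival_a))
# Both orders give the same notebook, so the order of voters is not leaked.
assert canonical_tally(arrival_a) == canonical_tally(arrival_b)
```

Write-in candidates fit the same scheme, since a new name simply takes its place in the sorted order.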

  4. Learning From History • The two levels of a data structure • “Legitimate” interface • Memory representation • History independence: the memory representation should not reveal information that cannot be obtained using the legitimate interface • A simple example: sorted list • Canonical memory representation • Not really efficient...
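
The sorted-list example can be sketched as follows. The class name `SortedListSet` and its methods are hypothetical; the point is only that the memory layout (`cells`) is a function of the current content, while each update may shift O(n) cells, which is why the slide calls it not really efficient.

```python
import bisect

class SortedListSet:
    """A set whose memory representation is a sorted array.
    The layout is canonical: it depends only on the current content."""

    def __init__(self):
        self.cells = []  # stands in for the raw memory representation

    def insert(self, x):
        i = bisect.bisect_left(self.cells, x)
        if i == len(self.cells) or self.cells[i] != x:
            self.cells.insert(i, x)  # shifts up to O(n) cells

    def delete(self, x):
        i = bisect.bisect_left(self.cells, x)
        if i < len(self.cells) and self.cells[i] == x:
            self.cells.pop(i)  # shifts up to O(n) cells

    def contains(self, x):
        i = bisect.bisect_left(self.cells, x)
        return i < len(self.cells) and self.cells[i] == x

# Two different histories that end with the same content {2, 5, 9}.
a = SortedListSet()
for x in (9, 2, 5):
    a.insert(x)

b = SortedListSet()
for x in (5, 7, 2, 9):
    b.insert(x)
b.delete(7)

# Same content, same memory: the history cannot be read off the layout.
assert a.cells == b.cells == [2, 5, 9]
```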

  5. Typical Applications • Incremental cryptography [BGG94, Mic97] • Voting [MKSW06, MNS07] • Set comparison & reconciliation [MNS08] • Computational Geometry [BGV08] • ...

  6. Our Contribution The first HI dictionary that simultaneously achieves the following: • Efficiency: • Lookup time – O(1) worst case • Update time – O(1) expected amortized • Memory utilization 50% (25% with deletions) • Strongest notion of history independence • Simple and fast

  7. Notions of History Independence • Micciancio (1997): oblivious trees • Motivated by incremental cryptography • Only considered the shape of the trees and not their memory representation • Naor and Teague (2001) • Memory representation • Weak & strong history independence

  8. Notions of History Independence • Weak history independence • Memory revealed at the end of an activity period • Any two sequences of operations S1 and S2 that lead to the same content induce the same distribution on the memory representation • Strong history independence • Memory revealed several times during an activity period • Any two sets of breakpoints along S1 and S2 with the same content at each breakpoint induce the same distributions on the memory representation at all these points • Completely randomizing memory after each operation is not good enough

  9. Notions of History Independence • We consider strong history independence • Canonical representation (up to initial randomness) implies SHI • Other direction shown to hold for reversible data structures [HHMPR05] • Weak & strong are not equivalent • WHI for reversible data structures is possible without a canonical representation • Provable efficiency gaps [BP06] (in restricted models)

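  10. SHI Dictionaries (table columns: lookup time, update time, memory utilization, deletions, practical?) • Naor & Teague ‘01: lookup O(1) worst case, update O(1) expected, memory utilization 99% (mem. util. < 50%) • Blelloch & Golovin ‘07: lookup O(1) expected, update O(1) expected, memory utilization 99% (mem. util. < 50%), ? • Blelloch & Golovin ‘07: lookup O(1) worst case, update O(1) expected, memory utilization < 9% • This work: lookup O(1) worst case, update O(1) expected, memory utilization > 25% with deletions (> 50% without)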

  11. Our Approach • Cuckoo hashing [PR01]: a simple & practical scheme with worst-case constant lookup time • Force a canonical representation on cuckoo hashing • No significant loss in efficiency • Avoid rehashing by using a small stash • What happens when hash functions fail? • Rehashing is highly problematic in SHI data structures • All hash functions need to be sampled in advance • When an item is deleted, may need to roll back to previous functions • We use a secondary storage to reduce the failure probability exponentially [KMW08]

  12. Cuckoo Hashing • Tables T1 and T2 with hash functions h1 and h2 • Store x in one of T1[h1(x)] and T2[h2(x)] • Insert(x): • Greedily insert in T1 or T2 • If both are full, insert in T1 • Repeat in other table with the previous occupant (if any) • [Figure: tables T1 and T2 with keys V, W, X, Y, Z, before and after; successful insertion]

  13. Cuckoo Hashing • Tables T1 and T2 with hash functions h1 and h2 • Store x in one of T1[h1(x)] and T2[h2(x)] • Insert(x): • Greedily insert in T1 or T2 • If both are full, insert in T1 • Repeat in other table with the previous occupant (if any) • [Figure: tables T1 and T2 with keys U, V, X, Y, Z; failure, rehash required]
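
The Insert(x) procedure on these two slides can be sketched in Python as below. This is plain (history-dependent) cuckoo hashing, not the canonical variant. The class name `CuckooTable`, the eviction cap `max_kicks` used to detect failure, and the toy hash functions are assumptions made for illustration.

```python
class CuckooTable:
    """Plain cuckoo hashing with two tables of r cells each."""

    def __init__(self, r, h1, h2, max_kicks=32):
        self.t1 = [None] * r
        self.t2 = [None] * r
        self.h1, self.h2 = h1, h2
        self.max_kicks = max_kicks  # assumed failure threshold

    def lookup(self, x):
        # Worst-case constant time: only two cells can hold x.
        return self.t1[self.h1(x)] == x or self.t2[self.h2(x)] == x

    def insert(self, x):
        """Return None on success, or the key left without a cell
        (the case where the slides call for a rehash)."""
        if self.lookup(x):
            return None
        # Greedy step: use a free candidate cell if there is one.
        if self.t1[self.h1(x)] is None:
            self.t1[self.h1(x)] = x
            return None
        if self.t2[self.h2(x)] is None:
            self.t2[self.h2(x)] = x
            return None
        # Both full: place in T1, then keep moving the previous occupant
        # to its cell in the other table.
        tables, hashes = (self.t1, self.t2), (self.h1, self.h2)
        side = 0
        for _ in range(self.max_kicks):
            pos = hashes[side](x)
            x, tables[side][pos] = tables[side][pos], x
            if x is None:
                return None
            side = 1 - side
        return x

# Toy hash functions given as lookup tables (illustrative only).
H1 = {"a": 0, "b": 0, "c": 0, "d": 1}
H2 = {"a": 0, "b": 0, "c": 0, "d": 2}
table = CuckooTable(3, H1.__getitem__, H2.__getitem__)

assert table.insert("a") is None      # lands in T1[0]
assert table.insert("b") is None      # T1[0] is full, lands in T2[0]
assert table.insert("d") is None      # lands in T1[1]
homeless = table.insert("c")          # three keys compete for two cells
assert homeless is not None           # failure: a rehash would be required
```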

  14. The Cuckoo Graph • Set S ⊆ U containing n keys • h1, h2 : U → {1,...,r} • Bipartite graph with sets of size r • Edge (h1(x), h2(x)) for every x ∈ S • S is successfully stored if and only if every connected component has at most one cycle • Main theorem: If r ≥ (1 + ε)n and h1, h2 are log(n)-wise independent, then failure probability is Θ(1/n)
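
The storability condition on this slide can be checked directly: a connected component has at most one cycle exactly when it has no more edges than vertices. Below is a minimal union-find sketch; the function name `storable` and the toy hash tables are illustrative assumptions.

```python
def storable(keys, h1, h2):
    """True iff every connected component of the cuckoo graph has at
    most one cycle, i.e. no component has more edges than vertices."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    edges = {}  # component root -> number of edges (keys)
    for x in keys:
        a = find(("T1", h1(x)))
        b = find(("T2", h2(x)))
        if a != b:
            parent[b] = a
            edges[a] = edges.get(a, 0) + edges.pop(b, 0)
        edges[a] = edges.get(a, 0) + 1

    sizes = {}  # component root -> number of vertices (cells)
    for v in list(parent):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1

    return all(edges.get(root, 0) <= size for root, size in sizes.items())

# A tree: 4 keys touching 5 cells.
tree_h1 = {1: 0, 2: 0, 3: 1, 4: 2}
tree_h2 = {1: 0, 2: 1, 3: 1, 4: 1}
assert storable([1, 2, 3, 4], tree_h1.__getitem__, tree_h2.__getitem__)

# Three keys that all map to the same two cells: two cycles, not storable.
assert not storable(["a", "b", "c"], lambda x: 0, lambda x: 0)
```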

  15. The Canonical Representation • Assume that S can be stored using h1 and h2 • We force a canonical representation on the cuckoo graph • Suffices to consider a single connected component • Assume that S forms a tree in the cuckoo graph (the typical case) • One location must be empty; the choice of the empty location uniquely determines the location of all elements • Rule: h1(minimal element) is empty • [Figure: tree component with elements a, b, c, d, e]
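
For the tree case, the rule can be sketched as a breadth-first traversal from the designated empty cell: every element is stored at the endpoint of its edge that is farther from that cell. The function name `canonical_tree_layout` and the toy hash tables are assumptions; the sketch computes the layout from scratch and is not the incremental update procedure of the talk.

```python
from collections import defaultdict, deque

def canonical_tree_layout(keys, h1, h2):
    """Layout for one component that is a tree in the cuckoo graph.
    Cell ("T1", h1(min key)) is left empty; every other key is placed
    in the endpoint of its edge that is farther from that empty cell."""
    adj = defaultdict(list)  # cell -> [(neighbouring cell, key)]
    for x in keys:
        u, v = ("T1", h1(x)), ("T2", h2(x))
        adj[u].append((v, x))
        adj[v].append((u, x))

    empty = ("T1", h1(min(keys)))
    layout = {}  # key -> cell that stores it
    seen = {empty}
    queue = deque([empty])
    while queue:
        u = queue.popleft()
        for v, x in adj[u]:
            if v not in seen:
                seen.add(v)
                layout[x] = v  # x lives in the cell away from `empty`
                queue.append(v)
    return layout

# Toy tree component: 4 keys over 5 cells (illustrative hash values).
H1 = {1: 0, 2: 0, 3: 1, 4: 2}
H2 = {1: 0, 2: 1, 3: 1, 4: 1}
layout = canonical_tree_layout([3, 1, 4, 2], H1.__getitem__, H2.__getitem__)
print(layout)

# The layout depends on the set of keys only, not on the order given.
assert layout == canonical_tree_layout([1, 2, 3, 4], H1.__getitem__, H2.__getitem__)
```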

  16. The Canonical Representation • Assume that S can be stored using h1 and h2 • We force a canonical representation on the cuckoo graph • Suffices to consider a single connected component • Assume that S has one cycle • Two ways to assign elements in the cycle • Each choice uniquely determines the location of all elements • Rule: minimal element in cycle lies in T1 • [Figure: component with one cycle, elements a, b, c, d, e]
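
For a component with exactly one cycle, one way to realize the rule is to first peel off the tree parts (each such element is forced into its leaf-side cell) and then orient the cycle starting from its minimal element in T1. The function name `canonical_unicyclic_layout` and the toy hash tables are assumptions, and the sketch assumes its input really is a single component with exactly one cycle.

```python
from collections import defaultdict

def canonical_unicyclic_layout(keys, h1, h2):
    """Layout for one component with exactly one cycle (assumed)."""
    ends = {x: (("T1", h1(x)), ("T2", h2(x))) for x in keys}
    incident = defaultdict(set)  # cell -> keys that may use it
    for x, (u, v) in ends.items():
        incident[u].add(x)
        incident[v].add(x)

    layout = {}
    # Peel leaves: a cell with a single candidate key must store it.
    leaves = [cell for cell, xs in incident.items() if len(xs) == 1]
    while leaves:
        cell = leaves.pop()
        (x,) = incident[cell]
        layout[x] = cell
        incident[cell].clear()
        u, v = ends[x]
        other = v if cell == u else u
        incident[other].discard(x)
        if len(incident[other]) == 1:
            leaves.append(other)

    # What remains is the cycle. Rule: its minimal key lies in T1.
    x = min(k for k in keys if k not in layout)
    cell = ends[x][0]
    while x not in layout:
        layout[x] = cell
        u, v = ends[x]
        other = v if cell == u else u
        (nxt,) = incident[other] - {x}  # the other cycle key at `other`
        x, cell = nxt, other
    return layout

# Toy component: a 4-cycle on keys 1..4 plus a pendant key 5.
H1 = {1: 0, 2: 1, 3: 1, 4: 0, 5: 2}
H2 = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
layout = canonical_unicyclic_layout([5, 3, 1, 4, 2], H1.__getitem__, H2.__getitem__)
print(layout)
assert layout[1] == ("T1", 0)  # minimal cycle element lies in T1
```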

  17. The Canonical Representation • Updates efficiently maintain the canonical representation • Insertions: • New leaf: check if new element is smaller than current min • New cycle: • Same component… • Merging two components… • All cases straightforward • Deletions: • Find the new min, split component,… • Requires connecting all elements in the component with a cyclic list • Memory utilization drops to 25% • All cases straightforward • Update time < size of component = expected (small) constant

  18. Rehashing • What if S cannot be stored using h1 and h2? • Happens with probability Θ(1/n) • Can we simply pick new functions? • Canonical memory implies we need to sample all hash functions in advance • Whenever an item is deleted, need to check whether we can roll back to previous hash functions • A bad item which is repeatedly inserted and deleted would cause a rehash every operation!

  19. Using a Stash • Whenever an insert fails, put a ‘bad’ item in a secondary data structure • Bad item: smallest item that belongs to a cycle • Secondary data structure must be SHI in itself • Theorem [KMW08]: Pr[|stash| > s] < n^(-s) • In practice keeping the stash as a sorted list is probably the best solution • Effectively the query time is constant with (very) high probability • In theory the stash could be any SHI data structure with constant lookup time • A deterministic hashing scheme, where the elements are rehashed whenever the content changes [AN96, HMP01]
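
A minimal sketch of the "stash as a sorted list" option: a lookup probes the two table cells and then binary-searches the stash. The function names `stash_insert` and `lookup`, and the toy tables, are assumptions; choosing which item is the bad one is not shown here.

```python
import bisect

def stash_insert(stash, x):
    """Insert x into the stash, kept sorted and duplicate-free so that
    its layout depends only on its content."""
    i = bisect.bisect_left(stash, x)
    if i == len(stash) or stash[i] != x:
        stash.insert(i, x)

def lookup(x, t1, t2, h1, h2, stash):
    """Probe T1[h1(x)] and T2[h2(x)], then fall back to the stash."""
    if t1[h1(x)] == x or t2[h2(x)] == x:
        return True
    i = bisect.bisect_left(stash, x)
    return i < len(stash) and stash[i] == x

# Toy state: "a" and "b" occupy the only two cells that "c" could use,
# so "c" was moved to the stash (illustrative values).
h1 = h2 = lambda key: 0
t1, t2 = ["a", None], ["b", None]
stash = []
stash_insert(stash, "c")

assert lookup("a", t1, t2, h1, h2, stash)
assert lookup("c", t1, t2, h1, h2, stash)      # found in the stash
assert not lookup("z", t1, t2, h1, h2, stash)
```

Since the stash is expected to stay very small, the extra binary search adds little to the lookup cost.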

  20. Conclusions and Problems • Cuckoo hashing is a robust and flexible hashing scheme • Easily ‘molded’ into a history independent data structure • We don’t know how to analyze variants with more than 2 hash functions and/or more than 1 element per bucket • Expected size of connected component is not constant • Full performance analysis

