The Bloomier Filter Bernard Chazelle Princeton Un., NEC Lab. Joe Kilian NEC Lab. Ronitt Rubinfeld NEC Lab. Ayellet Tal Technion, Princeton Un. Presented by Lilach Bien
Overview • The Problem • Definitions • The algorithm • Analysis • Lower Bounds • Deterministic algorithm • Mutable version of the problem
The Problem Bloom & Bloomier Filters
The Problem – Bloom Filters • A large set of data D, with a small subset S • We want to query whether an item d belongs to S • No false negatives (if d belongs to S, we always recognize it) • A small false-positive rate (we may say d belongs to S although it doesn’t) • Allowing a small false-positive rate makes a compact data structure possible
The Problem – Bloomier Filters • Bloom Filters – membership queries on a small subset of D. • Bloomier Filters – computing arbitrary functions defined only on a small subset of D. • The function is computed correctly for all members of S (no false negatives) • For items not in S, we almost always return a special value ⊥. • Dynamic updates to the function are allowed, as long as S doesn’t change.
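As a point of reference, here is a minimal Bloom filter sketch. It is not taken from the slides: the class name, the use of salted SHA-256 as the hash family, and the parameters m = 256, k = 3 are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit array of m positions probed by k hash functions (illustrative)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the probe index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        # Always True for added items; may also be True for a few others.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=256, k=3)
for member in ["alice", "bob", "carol"]:
    bf.add(member)
```

Every added item tests positive, which is the "no false negatives" guarantee; an item that was never added can still test positive if all of its k positions happen to be set.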
Example • D={1,…,100} S={1,2,3} R={1,2} • f(1)=1 f(2)=1 f(3)=2 • Update: f(2)=2 • [Figure: sample lookups before and after the update]
Bloomier Filters - Uses • Building a meta database for a union of databases. Keeps track of which database contains information about each entry. • Maintaining directories if the data or code is maintained in multiple locations.
Formal Definitions • f is a function from D={0,…,N−1} • The range is R={⊥,1,…,2^r−1} • S={t_1,…,t_n} is a subset of D of size n. • f(t_i)=v_i, v_i∈R • f(x)=⊥ for x outside of S • f can be specified by the assignment A={(t_1,v_1),…,(t_n,v_n)}
Formal Definitions (Cont.) • A Bloomier filter always answers queries for f correctly at every point of S • For a random x∈D\S the query returns f(x)=⊥ with probability at least 1−ε • The input to the algorithm is A and ε
Supported Operations • CREATE (A): • Given an assignment A={(t_1,v_1),…,(t_n,v_n)}, we initialize the data structure Tables. • SET_VALUE(t,v,Tables): • For t∈D and v∈R we associate the value v with the domain element t in Tables. • It is required that t belongs to S.
Supported Operations (Cont.) • LOOKUP(t, Tables): • For t∈S we return the last value v associated with t. • For all but an ε fraction of D\S we return ⊥. • For the remaining elements of D\S we return an arbitrary element of R.
The Idea • We encode the values in R as elements of the additive group Q={0,1}^q • Addition in Q is bitwise XOR • Any x∈R is mapped into Q by its q-bit binary expansion ENCODE(x) • For y∈Q we define DECODE(y) as • the corresponding number in R, if y<|R| • ⊥ otherwise
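ENCODE and DECODE can be sketched in a few lines. The sketch assumes that Python integers stand in for q-bit strings, that ⊥ is represented by None, and that code 0 is the one reserved for ⊥; the slides do not fix any of these choices, and the names below are hypothetical.

```python
from typing import Optional

Q_BITS = 8  # q, chosen only for illustration


def encode(x: int, q: int = Q_BITS) -> int:
    """Return x as a q-bit value; addition in Q is then the ^ operator."""
    if not 0 <= x < (1 << q):
        raise ValueError("x does not fit in q bits")
    return x


def decode(y: int, range_size: int) -> Optional[int]:
    """Map a q-bit value back to R; None plays the role of the undefined value."""
    # Assumption of this sketch: code 0 is the slot of R reserved for "undefined".
    if 1 <= y < range_size:
        return y
    return None
```

With range_size = 4, decode(3, 4) gives 3 while decode(200, 4) gives None, matching the rule that any q-bit value outside the code range of R decodes to ⊥.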
The Idea (Cont.) • We’ll store the function values for elements of S in a table. • We’ll use a hash function to compute a random q-bit masking value M for every x in D. • To look up the value of x, we’ll access a set of places in the table and calculate a q-bit number a. • We’ll return M XOR a.
The Idea (Cont.) • If t is in S, we build the table so that a XOR M = f(t). • Otherwise, since M is random, y = a XOR M is a random q-bit number. • Proof: consider the i-th bit of y • Suppose a_i=0 (without loss of generality) • We get Pr[y_i=1] = Pr[M_i=1] = 1/2
The Idea (Cont.) • Since y is random, for big enough q, DECODE(y) will return ⊥ with high probability • If the table stores elements of R, DECODE(y) will not return ⊥ with probability |R|/2^q • We can do better.
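The uniformity argument can be checked exhaustively for a small q. The sketch below assumes q = 6, |R| = 4 and an arbitrary value for a; none of these numbers come from the slides.

```python
q = 6
range_size = 4          # |R|, assumed for illustration
a = 0b101101            # arbitrary fixed value read from the table

# As M runs over all q-bit masks, y = a XOR M takes every q-bit value once,
# so a uniform M yields a uniform y.
outputs = [a ^ mask for mask in range(1 << q)]
is_uniform = sorted(outputs) == list(range(1 << q))

# Masks for which y lands inside the code range of R.
hits = sum(1 for y in outputs if y < range_size)
fraction = hits / (1 << q)   # equals |R| / 2**q
```

Here exactly |R| of the 2^q masks produce a value below |R|, which is the |R|/2^q error bound of the single-table scheme.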
Using 2 Tables • We have a table of size m, and a hash function HASH: D → {1,…,m}^k • If HASH(t)=(h_1,…,h_k) we say that {h_1,…,h_k} is the neighborhood of t, N(t) • For large enough m and k, we can choose for each t∈S an element τ(t) of N(t) such that: • For each t’∈S, t’≠t, it holds that τ(t) ≠ τ(t’) • If τ(t)=h_i we use ι(t) to denote i.
Using 2 Tables (Cont.) • We’ll use 2 tables: • The first table stores values in {⊥,1,…,k} encoded as values in Q. • It returns ι(t) for t in S, and ⊥ for most of the other items. • The second table stores values in R. For each t in S the value f(t) is stored at location τ(t).
Using 2 Tables (Cont.) • If x is in D\S then with probability at most k/2^q the first table will not return ⊥. • Only in that case do we access the second table and return “garbage”. • Now we can also change function values if we want. • We use the first table to check which place in the second table stores the value we want to change. • We change the value in the second table.
The First Table • Reminder: • We want to use the table to compute a value a for each item t in D. • For items in S, a XOR M will give us the encoded ι(t). • When we access the first table with an element t we know N(t)={h_1,…,h_k} and M. • We’ll compute a = Table1[h_1] XOR … XOR Table1[h_k] • We want to set the values at the indices of N(t) so that a XOR M gives the encoded ι(t).
Order Respecting Matching • Let S be a set with a neighborhood N(t) defined for each t∈S. • Let Π be a total ordering on the elements of S. • A matching τ respects (S, Π, N) if • For all t∈S, τ(t)∈N(t) • If t_i > t_j (in Π) then τ(t_i)∉N(t_j)
Order Respecting Matching (Cont.) • If, for N defined by HASH, a matching τ respects (S, Π, N), it has all the properties we wanted: • For all t∈S, τ(t)∈N(t) • For all t,t’∈S with t≠t’, τ(t) ≠ τ(t’) • We may build the first table incrementally so that for every t∈S, a XOR M gives the encoded ι(t).
Building The First Table • Input: • Order Π • Neighborhood N(t) defined by HASH • Order respecting matching τ • For t=Π[1],…,Π[n] we set Table1[τ(t)] so that a XOR M encodes ι(t). • Since τ is order respecting, this cannot affect the value already obtained for any t’<t.
Finding A Good Ordering And Matching • We get S and HASH, and compute Π and τ such that τ respects (S, Π, N). • A location h∈{1,…,m} is a singleton for S if h∈N(t) for exactly one t∈S. • TWEAK(t,S,HASH) is the smallest value j such that h_j is a singleton for S, where N(t)=(h_1,…,h_k) • TWEAK(t,S,HASH)=⊥ if no such j exists. • If TWEAK(t,S,HASH) is defined we may set ι(t)=TWEAK(t,S,HASH). This is an “easy match”.
Finding A Good Ordering And Matching (Cont.) • If t is an easy match, τ(t) does not lie in the neighborhood of any other t’∈S. • E is the subset of S with easy matches. • H=S\E. • We recursively find (Π’,τ’) for H. • We extend (Π’,τ’) to (Π,τ): • We first put the ordered elements of H, and then the elements of E. • τ is the union of the matchings for H and E.
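The role of the ordering can be seen on a two-key toy instance. Everything concrete below is an assumption for illustration: the function names, the 0-based locations, the 4-bit masks, the arbitrary initial table, and the convention that ι is stored 1-based so that code 0 stays free for ⊥.

```python
def build_table1(table, order, hashes, iota):
    """Fill Table1 in place, visiting keys smallest-first."""
    for t in order:
        hs, mask = hashes[t]
        l = iota[t]
        value = (l + 1) ^ mask          # ENCODE(iota(t)) XOR M
        for i, h in enumerate(hs):
            if i != l:
                value ^= table[h]       # cancel the other neighborhood cells
        table[hs[l]] = value            # write only at tau(t)
    return table


def read_index(table, hs, mask):
    """Compute a XOR M for one key; a member should get iota(t) + 1."""
    y = mask
    for h in hs:
        y ^= table[h]
    return y


# Key "a" has neighborhood (0, 1); key "b" has neighborhood (0, 2).
hashes = {"a": ((0, 1), 0b1010), "b": ((0, 2), 0b0110)}
iota = {"a": 0, "b": 1}                 # tau(a) = 0, tau(b) = 2

# tau(a) = 0 lies in N(b), so "a" must come before "b" in the ordering.
good = build_table1([7, 5, 3, 9], ["a", "b"], hashes, iota)
bad = build_table1([7, 5, 3, 9], ["b", "a"], hashes, iota)
```

With the order (a, b) both keys read back their index; with the order reversed, writing the cell for a overwrites a cell that b's value already depended on, so b no longer decodes correctly.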
FIND_MATCH
FIND_MATCH (HASH, S)[m, k]
Find (Π, τ) for S, HASH
(Base case: if S = ∅, return the empty ordering and the empty matching.)
1. E = ∅; Π = ∅
   For t_i ∈ S
     If TWEAK (t_i, S, HASH) is defined
       ι_i = TWEAK (t_i, S, HASH)
       E = E + t_i
   If E = ∅
     Return (failure)
2. H = S \ E
   Recursively compute (Π’, τ’) = FIND_MATCH (HASH, H)[m, k].
   If FIND_MATCH (HASH, H)[m, k] = failure
     Return (failure)
3. Π = Π’
   For t_i ∈ E
     Add t_i to the end of Π (i.e., make t_i the largest element in Π thus far)
   Return (Π; τ = {ι_1,…,ι_n})
   (where ι_i is determined for t_i ∈ E in Step 1, and for t_i ∈ H (via τ’) in Step 2.)
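The recursion can be sketched directly in Python. This is an illustrative version, not the presenters' code: neighborhoods are passed in as an explicit dictionary instead of being computed by HASH, indices are 0-based, failure is signalled by None, and an explicit base case for the empty set is added.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple

Neighborhoods = Dict[Hashable, Tuple[int, ...]]
Match = Tuple[List[Hashable], Dict[Hashable, int]]


def find_match(nbrs: Neighborhoods) -> Optional[Match]:
    """Return (order, iota), or None on failure.

    order lists the keys smallest-first; iota[t] is the 0-based index
    into nbrs[t] of the location matched to t.
    """
    if not nbrs:
        return [], {}                  # base case: nothing left to match

    # How many distinct keys touch each location.
    usage = Counter(h for hs in nbrs.values() for h in set(hs))

    easy: Dict[Hashable, int] = {}
    for t, hs in nbrs.items():
        for j, h in enumerate(hs):
            if usage[h] == 1:          # h is a singleton for the current set
                easy[t] = j            # TWEAK: the smallest such index
                break
    if not easy:
        return None                    # stuck: no easy match exists

    rest = {t: hs for t, hs in nbrs.items() if t not in easy}
    sub = find_match(rest)
    if sub is None:
        return None
    order, iota = sub
    order = order + list(easy)         # easy matches become the largest elements
    iota.update(easy)
    return order, iota


ok = find_match({"a": (0, 1), "b": (1, 2), "c": (2, 3)})
stuck = find_match({"a": (0, 1), "b": (1, 2), "c": (1, 2)})
```

In the first call, a and c are easy matches at the top level and b becomes easy once they are removed, so b ends up smallest in the ordering. In the second call, b and c share the same neighborhood, no singleton remains after a is removed, and the function reports failure.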
CREATE
CREATE (A = {(t_1, v_1), …, (t_n, v_n)})[m, k, q]
(create a mutable table)
1. Uniformly choose HASH : D → {1,…,m}^k × {0,1}^q
   S = {t_1,…,t_n}
   Create Table1 to be an array of m elements of {0,1}^q
   Create Table2 to be an array of m elements of R
   (the initial values for both tables are arbitrary)
   Put (HASH, m, k, q) into the "header" of Table1
   (we assume that these values may be recovered from Table1)
2. (Π, τ) = FIND_MATCH (HASH, S)[m, k]
   If FIND_MATCH (HASH, S)[m, k] = failure
     Goto Step 1
3. For t = Π[1], …, Π[n]
     v = A(t) (i.e., the value assigned by A to t)
     (h_1,…,h_k, M) = HASH (t)
     L = τ(t); l = ι(t) (i.e., L = h_l)
     Table1[L] = ENCODE (l) ⊕ M ⊕ (⊕_{i≠l} Table1[h_i])
     Table2[L] = v
4. Return (Table = (Table1, Table2))
LOOKUP & SET_VALUE
LOOKUP (t, Table = (Table1, Table2))
1. Get (HASH, m, k, q) from Table1
   (h_1,…,h_k, M) = HASH (t)
   l = DECODE (M ⊕ Table1[h_1] ⊕ … ⊕ Table1[h_k])
2. If l is defined
     L = h_l
     Return (Table2[L])
   Else
     Return (⊥)
SET_VALUE (t, v, Table = (Table1, Table2))
1. Get (HASH, m, k, q) from Table1
   (h_1,…,h_k, M) = HASH (t)
   l = DECODE (M ⊕ Table1[h_1] ⊕ … ⊕ Table1[h_k])
2. If l is defined
     L = h_l
     Table2[L] = v
     Return (success)
   Else
     Return (failure)
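Putting CREATE, LOOKUP and SET_VALUE together gives the following compact sketch. Several things here are assumptions rather than part of the slides: the class and method names, HASH being simulated by seeding random.Random from a SHA-256 digest (which also makes the k locations distinct), ⊥ being None, ι being stored 1-based, Table1 starting at all zeros, the peeling being done iteratively instead of recursively, and the retry limit of 100.

```python
import hashlib
import random
from collections import Counter


class BloomierFilter:
    def __init__(self, assignment: dict, m: int, k: int = 3, q: int = 16):
        self.m, self.k, self.q = m, k, q
        for seed in range(100):              # CREATE step 1-2: retry with a new HASH
            self.seed = seed
            match = self._find_match(list(assignment))
            if match is not None:
                break
        else:
            raise RuntimeError("no matching found; try a larger m")
        order, iota = match
        self.table1 = [0] * m
        self.table2 = [None] * m
        for t in order:                      # CREATE step 3, smallest key first
            hs, mask = self._hash(t)
            l = iota[t]
            acc = (l + 1) ^ mask             # ENCODE(l) XOR M
            for i, h in enumerate(hs):
                if i != l:
                    acc ^= self.table1[h]
            self.table1[hs[l]] = acc
            self.table2[hs[l]] = assignment[t]

    def _hash(self, t):
        # Stand-in for HASH(t) = (h_1..h_k, M): k distinct locations and a q-bit mask.
        digest = hashlib.sha256(f"{self.seed}:{t!r}".encode()).digest()
        rng = random.Random(int.from_bytes(digest, "big"))
        return rng.sample(range(self.m), self.k), rng.getrandbits(self.q)

    def _find_match(self, keys):
        # Peel off easy matches layer by layer (iterative FIND_MATCH).
        nbrs = {t: self._hash(t)[0] for t in keys}
        layers, iota = [], {}
        while nbrs:
            usage = Counter(h for hs in nbrs.values() for h in hs)
            easy = {}
            for t, hs in nbrs.items():
                for j, h in enumerate(hs):
                    if usage[h] == 1:
                        easy[t] = j
                        break
            if not easy:
                return None
            layers.append(list(easy))
            iota.update(easy)
            for t in easy:
                del nbrs[t]
        # Keys peeled later are smaller in the ordering, so reverse the layers.
        return [t for layer in reversed(layers) for t in layer], iota

    def _slot(self, t):
        hs, mask = self._hash(t)
        y = mask
        for h in hs:
            y ^= self.table1[h]
        return hs[y - 1] if 1 <= y <= self.k else None   # DECODE

    def lookup(self, t):
        slot = self._slot(t)
        return None if slot is None else self.table2[slot]

    def set_value(self, t, v) -> bool:
        slot = self._slot(t)
        if slot is None:
            return False
        self.table2[slot] = v
        return True
```

A typical use would be `bf = BloomierFilter({"x": 1, "y": 1, "z": 2}, m=16)`, after which `bf.lookup("x")` returns the stored value and `bf.set_value("y", 2)` changes a value without touching Table1. Lookups of keys outside S return None except with probability about k/2^q, in which case they return whatever Table2 holds at the decoded slot.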
Analyzing FIND_MATCH • We show that FIND_MATCH succeeds with constant probability for every S. • We’ll define a bipartite graph G: • On the left side there are n vertices L={L_1,…,L_n} corresponding to S. • On the right side there are m vertices R={R_1,…,R_m} corresponding to {1,…,m} • There is an edge between L_i and R_j if j∈N(t_i), i.e., HASH(t_i)=(h_1,…,h_k) and j=h_l for some l.
The Singleton Property • We say that G has the singleton property if for all nonempty A⊆L there exists a vertex R_i∈R such that R_i is adjacent to exactly one vertex in A. • If G has the singleton property FIND_MATCH will never get stuck (there will always be easy matches). • N(v) is the set of neighbors of v∈L. • N(A) is the set of neighbors of the elements in A.
Lossless Expansion Property • We say that G has the lossless expansion property if for all nonempty A⊆L, |N(A)|>k|A|/2 • If G has the lossless expansion property it has the singleton property: • Assume to the contrary that there is an A such that each node in N(A) has at least 2 neighbors in A. • The sub-graph for A has at least 2|N(A)| edges. • Since |N(A)|>k|A|/2, the sub-graph has more than k|A| edges. This is a contradiction, since each vertex of A has at most k edges.
Lossless Expansion Property (Cont.) • For a random graph G with • fixed k, k>2 • m=ckn for a fixed c • G is a lossless expander with constant probability. • Hence FIND_MATCH will succeed with constant probability.
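For very small graphs both properties can be checked by brute force over all subsets A. The sketch below is only an illustration: the graph representation (a dictionary from left vertices to sets of right vertices) and the function names are assumptions, and the running time is exponential in |L|.

```python
from itertools import combinations


def neighbors(graph, subset):
    """Union of the right-side neighbors of the left vertices in subset."""
    return set().union(*(graph[v] for v in subset))


def nonempty_subsets(graph):
    left = list(graph)
    for size in range(1, len(left) + 1):
        yield from combinations(left, size)


def has_lossless_expansion(graph, k):
    """|N(A)| > k|A|/2 for every nonempty A."""
    return all(
        len(neighbors(graph, A)) > k * len(A) / 2
        for A in nonempty_subsets(graph)
    )


def has_singleton_property(graph):
    """Every nonempty A has a right vertex adjacent to exactly one member of A."""
    return all(
        any(sum(r in graph[v] for v in A) == 1 for r in neighbors(graph, A))
        for A in nonempty_subsets(graph)
    )


path = {"a": {0, 1}, "b": {1, 2}, "c": {2, 3}}   # k = 2
twins = {"a": {0, 1}, "b": {0, 1}}               # identical neighborhoods
```

The graph `path` satisfies both properties, while `twins` fails both: for A = {a, b} it has |N(A)| = 2, which is not greater than k|A|/2 = 2, and every right vertex has two neighbors in A.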
Data Structure Complexity • The error probability is k/2^q • To make it at most ε we have to set q = ⌈log(k/ε)⌉ • Space: O(n(r + log(1/ε))) bits • Lookup time: O(1) • Update time: O(1)
Data Structure Complexity (Cont.) • FIND_MATCH – we’ll use the graph again. • We may show that with high probability, for all non-empty A⊆L, |N(A)|>c|A| for some constant c>k/2. • For a set A⊆L, suppose a vertices of N(A) have exactly one neighbor in A and at least c|A|−a have more than one. • The sub-graph for A has at least a+2(c|A|−a)=2c|A|−a edges. • On the other hand it has at most k|A| edges, so a ≥ (2c−k)|A|.
Data Structure Complexity (Cont.) • Each item in A has at most k neighbors. • The number of items in A that have a neighbor belonging only to them is at least a/k ≥ (2c−k)|A|/k = (2c/k−1)|A| = p|A| • These items are easy matches. • If such a c exists, the run-time of FIND_MATCH is O(n)+O((1−p)n)+O((1−p)^2 n)+…=O(n) • That is also the expected run-time of CREATE
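The number of mask bits follows from requiring k/2^q ≤ ε. A small helper makes the arithmetic concrete; the function name is hypothetical and the logarithm is taken base 2.

```python
import math


def mask_bits(k: int, eps: float) -> int:
    """Smallest q (at least 1) with k / 2**q <= eps."""
    return max(1, math.ceil(math.log2(k / eps)))


q = mask_bits(k=3, eps=0.01)   # 3 / 2**9 is about 0.0059, 3 / 2**8 about 0.0117
```

For k = 3 and ε = 0.01 this gives q = 9, so Table1 needs 9 bits per cell, which is where the log(1/ε) term of the space bound comes from.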
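The counting argument and the running-time bound can be written out step by step, using the same symbols a, c, k, p and n as in the slides and assuming p is a constant in (0, 1]:

```latex
\begin{align*}
a + 2\,(c|A| - a) \le k|A|
  &\;\Longrightarrow\; a \ge (2c - k)\,|A|, \\
\frac{a}{k} &\ge \Bigl(\frac{2c}{k} - 1\Bigr)|A| = p\,|A|, \\
\sum_{i \ge 0} (1-p)^i\, n &= \frac{n}{p} = O(n).
\end{align*}
```

Each level of the recursion removes at least a p fraction of the remaining keys, so the work per level shrinks geometrically and the total stays linear in n.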
Deterministic Algorithm • If R={1,2,⊥}, S splits into subsets A and B that map to 1 and 2, resp. • Even in that case, deterministic Bloomier filtering requires Ω(n + log log N) bits of storage. • Define G, a graph where each node is a vector in {−1,0,1}^N with exactly n coordinates equal to 1 and n others equal to −1. • The 1’s represent A and the −1’s represent B. • Two nodes v and v’ are adjacent if the set A of one intersects the set B of the other (if v=(x_1,…,x_N) and v’=(y_1,…,y_N), they are adjacent if there is an i such that x_i·y_i=−1)
Deterministic Algorithm (Cont.) • Since the memory is the only source of information about A and B, no two adjacent nodes may correspond to the same memory configuration. • The memory size m is at least log χ(G) (χ(G) is the minimum number of colors required to color G). • We’ll show that χ(G) is between Ω(2^n log N) and O(n 2^n log N).
Lower Bound On χ(G) • For every color c used to color G we have a vector z_c in {−1,1}^N. • A node v=(x_1,…,x_N) may receive color c only if each nonzero x_i equals the i-th coordinate of z_c (nodes of one color are pairwise non-adjacent, so such a z_c exists). • A set of binary vectors of length l is (l,k)-universal if, for every choice of k coordinate positions, the vectors restricted to these positions produce all 2^k possible patterns. • We’ll show that the set {z_c} is (N,n)-universal once the minus ones are turned into zeroes.
Lower Bound On χ(G) (Cont.) • Let i_1,…,i_n be n coordinate positions. • For each w in {−1,1}^n there is a node v whose coordinates i_1,…,i_n match w. • If v has color c then the coordinates i_1,…,i_n of z_c match w. • Therefore, every choice of n coordinate positions yields all the possible patterns. • The size of an (N,n)-universal set is Ω(2^n log N), so this is a lower bound on χ(G).
Upper Bound On χ(G) • There exists an (N,2n)-universal set of vectors of size O(n 2^n log N). • We’ll turn all the zeroes into minus ones. • We’ll use that set as the vectors z_c. • Because the set is universal, we may select for each node a vector z_c that matches the 1’s and −1’s of the node. • c will be the color of the node.
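The coloring argument can be tried out on a toy instance. The sketch below assumes N = 4 and n = 1 and uses a hand-picked set of five ±1 vectors that is universal on every pair of positions; the function names and the specific vectors are illustrative only.

```python
from itertools import combinations, permutations

N, n = 4, 1


def is_universal(vectors, length, k):
    """Every choice of k positions shows all 2**k patterns."""
    return all(
        len({tuple(v[i] for i in pos) for v in vectors}) == 2 ** k
        for pos in combinations(range(length), k)
    )


def adjacent(v, w):
    """Some coordinate is +1 in one node and -1 in the other."""
    return any(x * y == -1 for x, y in zip(v, w))


def color(node, vectors):
    """Index of the first vector agreeing with the node on its nonzero coordinates."""
    for c, z in enumerate(vectors):
        if all(x == 0 or x == z[i] for i, x in enumerate(node)):
            return c
    raise ValueError("no matching vector; the set is not universal enough")


# A small (N, 2n)-universal set, written with -1 in place of 0.
Z = [(-1, -1, -1, -1), (-1, -1, 1, 1), (-1, 1, -1, 1), (1, -1, -1, 1), (1, 1, 1, -1)]

# All nodes: exactly n coordinates equal to 1 and n equal to -1.
nodes = sorted(set(permutations((1,) * n + (-1,) * n + (0,) * (N - 2 * n))))
colors = {v: color(v, Z) for v in nodes}
proper = all(
    colors[v] != colors[w] for v, w in combinations(nodes, 2) if adjacent(v, w)
)
```

Two nodes that share a color agree with the same z_c on all their nonzero coordinates, so no coordinate can be +1 in one and −1 in the other; this is why the coloring is proper and why the size of the universal set bounds χ(G) from above.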
Mutable Filtering • If and the number m of storage bits satisfies for some large enough constant c, then Bloomier filtering cannot support dynamic updates on a set S of size 2n. • The proof is for the case R={1,2,⊥}, where S splits into subsets A and B of size n that map to 1 and 2, resp. • We assume the algorithm is randomized.
Mutable Filtering (Cont.) • Fix a sequence of random choices made by the algorithm when its input is A and B. • We fix B to a specific set B_org and vary A. • For each possible A we have a corresponding memory configuration. • In other words, each memory configuration has a family of possible sets A that lead to it. • Let F be the largest such family.
Mutable Filtering (Cont.) • Now we change B: each possible B_new leads to a new memory configuration. • For each configuration 1≤i≤2^m there is a family of sets B_new that lead to it. We denote it by G_i. • [Figure: stage I (configurations reached by varying A) and stage II (configurations reached after changing B to B_new)]
Mutable Filtering (Cont.) • Given a memory configuration C in II, for any path that leads to it: • B can be the B_new on the path. For each item in such a set we must answer ‘in B’. • A can be the set on the path before the configuration in I. For each item of such a set that was not moved into B_new on the path we must answer ‘in A’. • Suppose that in I we were in the configuration that F leads to, and then we randomly chose B_new. • i(B_new) denotes the j such that B_new∈G_j • In II we have to: • Answer ‘in A’ for each item of a set in F that was not moved into B_new • Answer ‘in B’ for each item of a set in G_i(B_new)
The Proof • is the subset of F whose sets intersect Bnew. • We show that with high probability (over the selection of Bnew) the sets are intersecting. • There is an item for which the algorithm must answer both ‘in A’ and ‘in B’. • There is a set Bnew that causes the algorithm to make errors.
Lk And Its Size • Lk is the set of items that belong to at least k sets in F. • We’ll look at subsets of that belong to Lk and show they intersect. • We first bound the size of Lk. • Fk is the sub-family of F that contains only subsets of Lk.