Sketching in Adversarial Environments Or Sublinearity and Cryptography. Moni Naor. Joint work with: Ilya Mironov and Gil Segev. Comparing Streams. How to compare data streams without storing them?. S A. S B. Step 1: Compress data on-line into sketches
Sketching in Adversarial Environments, or Sublinearity and Cryptography. Moni Naor. Joint work with Ilya Mironov and Gil Segev
Comparing Streams • How to compare data streams without storing them? (Two parties hold streams $S_A$ and $S_B$.) • Step 1: Compress data on-line into sketches • Step 2: Interact using only the sketches • Goal: Minimize sketches, update time, and communication
Comparing Streams • How to compare data streams that cannot be stored? • Both parties have access to shared randomness • Real-life applications: massive data sets, on-line data, ... • Highly efficient solutions assuming shared randomness
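To make the shared-randomness setting concrete, here is a minimal sketch of one standard way to compare two sets through short sketches. It is not taken from the slides: the class name, the XOR-of-masks design, and the use of keyed BLAKE2b as a stand-in for a shared random function are all assumptions made for illustration.

```python
import hashlib


class SharedRandomnessSketch:
    """Equality sketch for sets: XOR of per-element pseudo-random masks."""

    def __init__(self, shared_seed: bytes, bits: int = 64):
        self.seed = shared_seed      # stands in for the shared random string
        self.nbytes = bits // 8      # sketch size in bytes
        self.value = 0               # the whole sketch is this one integer

    def _mask(self, x: int) -> int:
        # Keyed BLAKE2b plays the role of a random function of x (assumption).
        digest = hashlib.blake2b(str(x).encode(), key=self.seed,
                                 digest_size=self.nbytes).digest()
        return int.from_bytes(digest, "big")

    def insert(self, x: int) -> None:
        self.value ^= self._mask(x)

    def delete(self, x: int) -> None:
        # For sets, XOR-ing the same mask again cancels the earlier insert.
        self.value ^= self._mask(x)


# Two parties sketch their streams independently, then compare sketches.
alice = SharedRandomnessSketch(b"shared seed")
bob = SharedRandomnessSketch(b"shared seed")
for x in [1, 2, 3, 7]:
    alice.insert(x)
alice.delete(7)
for x in [3, 1, 2]:
    bob.insert(x)
print(alice.value == bob.value)
```

If the masks were truly random and chosen independently of the inputs, two different sets would collide with probability about 2^-bits. That independence is exactly what is lost once the inputs may depend on the randomness.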
Comparing Streams • How to compare data streams that cannot be stored? • Both parties have access to shared randomness • Example: plagiarism detection • Is shared randomness a reasonable assumption? • No guarantees when set adversarially • Inputs may be adversarially chosen depending on the randomness
The Adversarial Sketch Model • Combines three ingredients: • Communication complexity • “Adversarial” factors: no secrets, adversarially-chosen inputs • Massive data sets: sketching, streaming
The Adversarial Sketch Model • Goal: Compute f(A,B) • Sketch phase (small sketches, fast updates) • An adversary chooses the inputs of the parties • Provided as on-line sequences of insert and delete operations • No shared secrets • The parties are not allowed to communicate • Any public information is known to the adversary in advance • Adversary is computationally all-powerful • Interaction phase (low communication & computation)
Our Results • Equality testing • $A, B \subseteq [N]$ of size at most $K$ • Error probability $\epsilon$ • If we had public randomness… • Sketches of size $O(\log(1/\epsilon))$ • Similar update time, communication and computation • Lower Bound: Equality testing in the adversarial sketch model requires sketches of size $\Omega(\sqrt{K \cdot \log(N/K)})$
Our Results • Equality testing • $A, B \subseteq [N]$ of size at most $K$ • Error probability $\epsilon$ • Lower Bound: Equality testing in the adversarial sketch model requires sketches of size $\Omega(\sqrt{K \cdot \log(N/K)})$ • Upper Bound: Explicit and efficient protocol • Sketches of size $\sqrt{K \cdot \mathrm{polylog}(N) \cdot \log(1/\epsilon)}$ • Update time, communication and computation $\mathrm{polylog}(N)$
Our Results • Symmetric difference approximation • $A, B \subseteq [N]$ of size at most $K$ • Goal: approximate $|A \Delta B|$ with error probability $\epsilon$ • Upper Bound: $(1+\rho)$-approximation for any constant $\rho$ • Sketches of size $\sqrt{K \cdot \mathrm{polylog}(N) \cdot \log(1/\epsilon)}$ • Update time, communication and computation $\mathrm{polylog}(N)$ • Explicit construction: $\mathrm{polylog}(N)$-approximation
Outline • Lower bound • Equality testing • Main tool: Incremental encoding • Explicit construction using dispersers • Symmetric difference approximation • Summary & open problems
Simultaneous Messages Model • (Figure: two parties holding x and y each send one message to a referee, who outputs f(x,y).)
Simultaneous Messages Model • Lower Bound: Equality testing in the private-coin SM model requires communication $\Omega(\sqrt{K \cdot \log(N/K)})$ [NS96, BK97] • Communication in the SM model corresponds to sketches in the adversarial sketch model
Simultaneous Equality Testing • Inputs x, y of length $K$ • Encodings C(x), C(y), each arranged as a $\sqrt{K} \times \sqrt{K}$ matrix • Communication $\sqrt{K}$
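The matrix layout can be illustrated with a small simulation of a row-versus-column test. This is a sketch under assumptions, not the construction from the talk: it uses a Reed-Solomon-style code over a prime field, and the function names, the modulus `P`, and the choice of four times as many evaluation points as message symbols are all illustrative.

```python
import math
import random

P = 2_147_483_647  # prime modulus 2^31 - 1 (illustrative choice)


def matrix_side(k):
    """Smallest side length such that side * side >= 4 * k."""
    return math.isqrt(4 * k - 1) + 1


def encode(msg, side):
    """Evaluate the polynomial with coefficients `msg` at side*side points
    and lay the values out as a side x side matrix."""
    def poly_eval(t):
        acc = 0
        for coeff in reversed(msg):  # Horner's rule
            acc = (acc * t + coeff) % P
        return acc
    return [[poly_eval(r * side + c) for c in range(side)] for r in range(side)]


def alice_message(x, side, rng):
    """Alice sends a random row index and that row of C(x)."""
    row = rng.randrange(side)
    return row, encode(x, side)[row]


def bob_message(y, side, rng):
    """Bob sends a random column index and that column of C(y)."""
    col = rng.randrange(side)
    matrix = encode(y, side)
    return col, [matrix[r][col] for r in range(side)]


def referee(alice_msg, bob_msg):
    """Compare the single entry where Alice's row meets Bob's column."""
    row, row_vals = alice_msg
    col, col_vals = bob_msg
    return row_vals[col] == col_vals[row]


rng = random.Random()
x, y = [1, 2, 3, 4], [1, 2, 3, 5]
side = matrix_side(len(x))
print(referee(alice_message(x, side, rng), bob_message(x, side, rng)))
print(referee(alice_message(x, side, rng), bob_message(y, side, rng)))
```

Two distinct polynomials of degree below k agree on fewer than k of the at least 4k evaluation points, so unequal inputs are caught with probability above 3/4 in this toy version. Each message is one row or column, roughly the square root of the codeword length.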
First Attempt • (Figure: one sketch keeps row 3 of C(A), the other keeps column 2 of C(B); the entry compared is $C(B)_{3,2}$.) • Sketches of size $\sqrt{K}$ • Problem: update time $\sqrt{K}$
Incrementality vs. Distance • Incrementality: Given C(S) and $x \in [N]$, the encodings of $S \cup \{x\}$ and $S \setminus \{x\}$ are obtained by modifying very few (logarithmically many) entries • High distance: For every distinct $A, B \subseteq [N]$ of size at most $K$, $d(C(A),C(B)) > 1 - \epsilon$ (a constant) • Impossible to achieve both properties simultaneously with Hamming distance
Incremental Encoding • $S \mapsto C(S)_1, \ldots, C(S)_r$ • $d(C(A),C(B)) = 1 - \prod_{i=1}^{r} \{1 - d_H(C(A)_i, C(B)_i)\}$, where $d_H$ is the normalized Hamming distance • r=1: Hamming distance • Hope: Larger r will enable fast updates • r corresponds to the communication complexity of our protocol • Want to keep r as small as possible • Explicit construction with $r = \log K$: • Codeword size $K \cdot \mathrm{polylog}(N)$ • Update time $\mathrm{polylog}(N)$
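The distance measure on this slide can be computed directly. The following is a small illustration with hypothetical function names, representing each encoding as a list of r equal-length codewords.

```python
from math import prod


def hamming_distance(u, v):
    """Normalized Hamming distance between two equal-length codewords."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v)) / len(u)


def incremental_distance(code_a, code_b):
    """d(C(A), C(B)) = 1 - prod_i (1 - d_H(C(A)_i, C(B)_i))."""
    return 1 - prod(1 - hamming_distance(u, v) for u, v in zip(code_a, code_b))


# Two encodings with r = 2 codewords each.
code_a = [[0, 0], [1, 1]]
code_b = [[0, 1], [1, 1]]
print(incremental_distance(code_a, code_b))  # 0.5
```

The product term is the probability that one uniformly sampled position per codeword pair shows no difference, so a single codeword pair with distance close to 1 is enough to push the overall distance close to 1.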
Equality Protocol • (Figure: codeword pairs $C(A)_1, C(B)_1$; $C(A)_2, C(B)_2$; $C(A)_3, C(B)_3$. Messages: rows (3,1,1); cols (2,3,1) and values.) • Error probability: $\prod_{i=1}^{r} \{1 - d_H(C(A)_i, C(B)_i)\} = 1 - d(C(A),C(B)) < \epsilon$
The Encoding • Global encoding • Map each element to several entries of each codeword • Exploit “random-looking” graphs • Local encoding • Resolve collisions separately in each entry • A simple solution when |A Δ B| is guaranteed to be small
The Local Encoding • Suppose that $|A \Delta B| \le \ell$
Missing Number Puzzle • Let $S = \{1, \ldots, N\} \setminus \{i\}$ • $\pi$: a random permutation over S • $\pi(1), \ldots, \pi(N)$ arrives as a one-way stream • One number i is missing • Goal: Determine the missing number i using $O(\log N)$ bits • What if there are $\ell$ missing numbers? • Can it be done using $O(\ell \cdot \log N)$ bits?
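For the single-missing-number case, a well-known answer keeps one running counter. The function name and signature below are illustrative.

```python
def find_missing_number(stream, n):
    """Return the single element of {1, ..., n} absent from `stream`."""
    remaining = n * (n + 1) // 2  # sum of 1..n
    for value in stream:           # one pass; only one counter is kept
        remaining -= value
    return remaining


print(find_missing_number([3, 1, 5, 2], 5))  # 4
```

The counter never exceeds n(n+1)/2, so it fits in about 2·log n bits, regardless of the order in which the stream arrives.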
The Local Encoding • Suppose that $|A \Delta B| \le \ell$ • A simple & well-known solution: • Associate each $x \in [N]$ with a vector $v(x)$ such that for any distinct $x_1, \ldots, x_\ell$ the vectors $v(x_1), \ldots, v(x_\ell)$ are linearly independent • $C(S) = \sum_{x \in S} v(x)$ • If $1 \le |A \Delta B| \le \ell$ then $C(A) \ne C(B)$ • For example $v(x) = (1, x, \ldots, x^{\ell-1})$ • Size & update time $O(\ell \cdot \log N)$, independent of the size of the sets
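A minimal sketch of this local encoding, using the example vectors $v(x) = (1, x, \ldots, x^{\ell-1})$. The arithmetic is done modulo a prime `P` assumed to be larger than the universe size N; the field choice and the class name are illustrative assumptions, not details from the slides.

```python
P = 2_147_483_647  # prime assumed larger than the universe size N


class LocalEncoding:
    """Maintains C(S) = sum over x in S of (1, x, ..., x^(ell-1)) mod P."""

    def __init__(self, ell):
        self.ell = ell
        self.coords = [0] * ell

    def _update(self, x, sign):
        power = 1  # x^0
        for j in range(self.ell):
            self.coords[j] = (self.coords[j] + sign * power) % P
            power = (power * x) % P

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)


enc_a, enc_b = LocalEncoding(3), LocalEncoding(3)
for x in [2, 5, 9]:
    enc_a.insert(x)
for x in [2, 5, 11, 13]:
    enc_b.insert(x)
print(enc_a.coords != enc_b.coords)  # True: the sets differ in 3 elements
```

Each update touches $\ell$ field elements, and the encoding size does not grow with the number of inserted elements. C(A) − C(B) is a combination with coefficients ±1 of at most $\ell$ linearly independent vectors, so it is nonzero whenever the sets differ in between 1 and $\ell$ elements.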
The Global Encoding • Each element is mapped into several entries of each codeword • The content of each entry is locally encoded • (Figure: a universe of size N mapped into codewords $C_1$, $C_2$, $C_3$.)
The Global Encoding • Each element is mapped into several entries of each codeword • The content of each entry is locally encoded • The local guarantee: If $1 \le |C_i[y] \cap (A \Delta B)| \le \ell$ then C(A) and C(B) differ on $C_i[y]$ • (Figure, for $\ell = 1$: elements of A and B in a universe of size N mapped to entries such as $C_1[2]$; C(A) and C(B) differ at least on the marked entries.)
The Global Encoding • Identify each codeword with a bipartite graph $G = ([N], R, E)$ • For $S \subseteq [N]$ define $\Gamma(S,\ell) \subseteq R$ as the set of all $y \in R$ for which $1 \le |\Gamma(y) \cap S| \le \ell$ • $(K, \epsilon, \ell)$-Bounded-Neighbor Disperser: For any $S \subset [N]$ such that $K \le |S| \le 2K$ it holds that $|\Gamma(S,\ell)| > (1 - \epsilon)|R|$
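The definition of $\Gamma(S,\ell)$ can be checked by brute force on a small graph. The helper names and the dictionary representation of the bipartite graph below are illustrative assumptions.

```python
def bounded_neighbors(right_neighbors, subset, ell):
    """Return Gamma(S, ell): right vertices y with 1 <= |Gamma(y) & S| <= ell.

    `right_neighbors` maps each right vertex y to the set Gamma(y) of its
    left neighbors; `subset` is S.
    """
    s = set(subset)
    return {y for y, nbrs in right_neighbors.items()
            if 1 <= len(s & set(nbrs)) <= ell}


def satisfies_bnd_condition(right_neighbors, subset, ell, eps):
    """Check |Gamma(S, ell)| > (1 - eps) * |R| for one particular S."""
    hit = bounded_neighbors(right_neighbors, subset, ell)
    return len(hit) > (1 - eps) * len(right_neighbors)


# A toy graph with 4 right vertices.
graph = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 2, 3}}
print(bounded_neighbors(graph, {1, 2}, ell=1))  # {1}
print(bounded_neighbors(graph, {1, 2}, ell=2))  # {0, 1, 3}
```

A graph is a bounded-neighbor disperser only if the condition holds for every S with $K \le |S| \le 2K$, so this check is a test for one set, not a verification of the whole property.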
The Global Encoding • Bounded-Neighbor Disperser • $r = \log K$ codewords; each $C_i$ is identified with a $(2^i, \epsilon, \ell)$-BND • For $i = \log_2 |A \Delta B|$ we have $d_H(C(A)_i, C(B)_i) > 1 - \epsilon$ • In particular, $d(C(A),C(B)) = 1 - \prod_{i=1}^{r} \{1 - d_H(C(A)_i, C(B)_i)\} > 1 - \epsilon$ • (Figure: sets A and B in a universe of size N mapped into codewords $C_1$, $C_2$, $C_3$.)
Constructing BNDs • $(K, \epsilon, \ell)$-Bounded-Neighbor Disperser: For any $S \subset [N]$ such that $K \le |S| \le 2K$ it holds that $|\Gamma(S,\ell)| > (1 - \epsilon)|R|$ • Given N and K, want to optimize M, $\ell$, $\epsilon$ and the left-degree D • Optimal Extractor Disperser $\ell$ 1 O(1) polylog(N) polylog(N) D log(N/K) $2^{(\log\log N)^2}$ Codeword of length M K M $K \cdot \log(N/K)$ $K \cdot 2^{(\log\log N)^2}$ Universe of size N
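Symmetric Difference Approximation • Sketch input streams into codewords: $A \mapsto C(A)_1, \ldots, C(A)_k$ and $B \mapsto C(B)_1, \ldots, C(B)_k$ • Compare s entries from each pair of codewords • $d_i$ = # of differing entries sampled from the i-th pair • Output $APX = (1+\rho)^i$ for the maximal i s.t. $d_i \gtrsim (1-\epsilon)s$ • Guarantee: $|A \Delta B| \le APX \le (1+\rho) \cdot \frac{KD}{(1-\epsilon)M} \cdot |A \Delta B|$ • The factor $\frac{KD}{(1-\epsilon)M}$ is $\sim 1$ for the non-explicit construction and $\mathrm{polylog}(N)$ for the explicit one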
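The output rule on this slide can be written as a few lines of code. This is only a sketch of the final selection step: it assumes the counts $d_1, \ldots, d_k$ have already been obtained by sampling, reads the approximate threshold $d_i \gtrsim (1-\epsilon)s$ as a plain `>=` comparison, and uses a hypothetical function name.

```python
def approximate_symmetric_difference(diff_counts, samples, rho, eps):
    """Return (1 + rho)^i for the largest i with d_i >= (1 - eps) * samples.

    `diff_counts[i - 1]` holds d_i, the number of differing entries among
    the `samples` entries compared in the i-th codeword pair.
    Returns None if no pair reaches the threshold.
    """
    best = None
    for i, d in enumerate(diff_counts, start=1):
        if d >= (1 - eps) * samples:
            best = i
    return None if best is None else (1 + rho) ** best


print(approximate_symmetric_difference([100, 95, 40], samples=100,
                                       rho=1.0, eps=0.1))  # 4.0
```

With $\rho = 1$ the candidate outputs are powers of two. A smaller constant $\rho$ gives a finer grid of candidates at the cost of more codeword pairs.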
Summary • Formalized a realistic model for computation over massive data sets • The adversarial sketch model combines: • Communication complexity • “Adversarial” factors: no secrets, adversarially-chosen inputs • Massive data sets: sketching, streaming
Summary • Formalized a realistic model for computation over massive data sets • Incremental encoding • Main technical contribution • Additional applications? • $S \mapsto C(S)_1, \ldots, C(S)_r$ with $d(C(A),C(B)) = 1 - \prod_{i=1}^{r} \{1 - d_H(C(A)_i, C(B)_i)\}$ • Determined the complexity of two fundamental tasks • Equality testing • Symmetric difference approximation
Open Problems • Better explicit approximation for symmetric difference • Our $(1+\rho)$-approximation is non-explicit • Explicit approximation: $\mathrm{polylog}(N)$ • Approximating various similarity measures • $L_p$ norms, resemblance, ... • The Power of Adversarial Sketching • Characterizing the class of functions that can be “efficiently” computed in the adversarial sketch model (sublinear sketches, polylog updates) • Possible approach: public-coins to private-coins transformation that “preserves” the update time
Computational Assumptions • Better schemes using computational assumptions? • Equality testing: Incremental collision-resistant hashing [BGG ’94] • Significantly smaller sketches • Existing constructions either have very long public descriptions, or rely on random oracles • Practical constructions without random oracles? • Symmetric difference approximation: Not known • Even with random oracles! Thank you!
Pan-Privacy Model • Data is a stream of items; each item belongs to a user • Data of different users is interleaved arbitrarily • Curator sees items, updates its internal state, and produces output at stream end • Pan-Privacy: For every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private • Can also consider multiple intrusions
Adjacency: User Level • Universe U of users whose data is in the stream; $x \in U$ • Streams are x-adjacent if they have the same projections of users onto $U \setminus \{x\}$ • Example: axbxcxdxxxex and abcdxe are x-adjacent • Both project to abcde • Notion of “corresponding locations” in x-adjacent streams • U-adjacent: $\exists x \in U$ for which they are x-adjacent • Simply “adjacent” if U is understood • Note: Streams of different lengths can be adjacent
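The x-adjacency example can be checked mechanically. The sketch below assumes a simplified model in which each stream item is just the identifier of the user it belongs to; the function names are illustrative.

```python
def project_out(stream, user):
    """Project a stream onto U \\ {user} by dropping that user's items."""
    return [item for item in stream if item != user]


def x_adjacent(stream_a, stream_b, user):
    """Two streams are x-adjacent if they agree once x's items are removed."""
    return project_out(stream_a, user) == project_out(stream_b, user)


print(x_adjacent("axbxcxdxxxex", "abcdxe", "x"))  # True
print("".join(project_out("axbxcxdxxxex", "x")))   # abcde
```

The two streams in the example have lengths 12 and 6, which illustrates that adjacent streams need not have the same length.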
Example: Stream Density or # Distinct Elements • Universe U of users; estimate how many distinct users in U appear in the data stream • Application: # distinct users who searched for “flu” • Ideas that don’t work: • Naïve: keep list of users that appeared (bad privacy and space) • Streaming: • Track random sub-sample of users (bad privacy) • Hash each user, track minimal hash (bad privacy)
Pan-Private Density Estimator • Inspired by randomized response • Store for each user $x \in U$ a single bit $b_x$ • Initially, each $b_x$ is drawn from distribution $D_0$: 0 w.p. ½, 1 w.p. ½ • When encountering x, redraw $b_x$ from distribution $D_1$: 0 w.p. ½−ε, 1 w.p. ½+ε • Final output: [(fraction of 1’s in table − ½)/ε] + noise • Pan-Privacy: If a user never appeared, the entry is drawn from $D_0$; if the user appeared any number of times, the entry is drawn from $D_1$; $D_0$ and $D_1$ are 4ε-differentially private
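The estimator described above can be simulated in a few lines. This is a sketch under assumptions: users are indexed 0 to |U|−1, the class and method names are hypothetical, and the output noise is taken to be Laplace with a free `noise_scale` parameter, since the slide only says “+ noise”.

```python
import random


class PanPrivateDensityEstimator:
    def __init__(self, universe_size, eps, rng=None):
        self.eps = eps
        self.rng = rng or random.Random()
        # Initially every bit is drawn from D0: uniform over {0, 1}.
        self.bits = [self.rng.random() < 0.5 for _ in range(universe_size)]

    def observe(self, user):
        # Redraw from D1 (1 w.p. 1/2 + eps), ignoring the previous value.
        self.bits[user] = self.rng.random() < 0.5 + self.eps

    def estimate(self, noise_scale=0.0):
        frac_ones = sum(self.bits) / len(self.bits)
        # Laplace noise as a difference of two exponentials (assumed choice).
        noise = 0.0
        if noise_scale > 0:
            noise = (self.rng.expovariate(1 / noise_scale)
                     - self.rng.expovariate(1 / noise_scale))
        return (frac_ones - 0.5) / self.eps + noise


est = PanPrivateDensityEstimator(universe_size=20000, eps=0.25)
for user in range(6000):   # 30% of the universe appears, some users twice
    est.observe(user)
    if user % 2 == 0:
        est.observe(user)
print(est.estimate(noise_scale=0.01))
```

A user who appears contributes a bit with expectation ½+ε no matter how many times they appear, so the expected fraction of 1's is ½ + ε·(density). Subtracting ½ and dividing by ε recovers the density up to sampling error and the added noise.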