Foundations of Privacy, Lecture 5
Lecturer: Moni Naor
Desirable Properties from a Sanitization Mechanism
• Composability: applying the sanitization several times yields a graceful degradation.
  • Will see: t releases, each ε-DP, are t·ε-DP.
  • Next class: roughly (√t·ε + t·ε², δ)-DP.
• Robustness to side information: no need to specify exactly what the adversary knows; the adversary may know everything except one row.
Differential privacy satisfies both…
Differential Privacy [Dwork, McSherry, Nissim & Smith 2006]
Protect individual participants.
[Figure: a curator/sanitizer M applied to two adjacent databases D1 and D2.]
Differential Privacy
Protect individual participants: the probability of every bad event (or any event) increases only by a small multiplicative factor when I enter the DB. May as well participate in the DB…
ε-differentially private sanitizer M: for all DBs D, all individuals I, and all events T,
e^{−ε} ≤ Pr_M[M(D+I) ∈ T] / Pr_M[M(D−I) ∈ T] ≤ e^ε ≈ 1+ε
Adjacency: D+I and D−I. Handles auxiliary input.
Differential Privacy
Sanitizer M gives ε-differential privacy if for all adjacent D1 and D2, and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A]
The ratio is bounded: participation in the data set poses no additional risk.
[Figure: the response distributions on two databases differing in one user are pointwise within a factor e^ε, so bad responses are almost equally likely either way.]
Example of Differential Privacy
X is a set of (name, tag ∈ {0,1}) tuples.
One query: # of participants with tag = 1.
Sanitizer: output # of 1's + noise.
• Noise from the Laplace distribution with parameter 1/ε
• Pr[noise = k−1] ≈ e^ε · Pr[noise = k]
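A minimal sketch of this sanitizer in Python, assuming numpy is available; the function name and toy data are illustrative, not from the lecture:

```python
import numpy as np

def sanitized_count(tags, epsilon, rng=np.random.default_rng()):
    """Release (# of participants with tag = 1) + Lap(1/epsilon) noise."""
    true_count = int(np.sum(tags))
    # Laplace noise with scale b = 1/epsilon, as on the slide.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

tags = np.array([1, 0, 1, 1, 0, 1])   # toy database of tag bits
print(sanitized_count(tags, epsilon=0.5))
```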
(ε, δ)-Differential Privacy
Sanitizer M gives (ε, δ)-differential privacy if for all adjacent D1 and D2, and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A] + δ
The ratio is bounded. This course: δ negligible. Typical setting: ε constant and δ negligible.
Example: NO Differential Privacy
U is a set of (name, tag ∈ {0,1}) tuples.
One counting query: # of participants with tag = 1.
Sanitizer A: choose and release a few random tags.
Bad event T: only my tag is 1, and my tag is released.
Pr_A[A(D+Me) ∈ T] ≥ 1/n, but Pr_A[A(D−Me) ∈ T] = 0.
• Not ε-differentially private for any ε!
• It is (0, 1/n)-differentially private.
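A small simulation of this counterexample (a sketch, assuming numpy; names are illustrative). With me in the database the bad event occurs with probability about 1/n; without me it never occurs, so no finite ε can bound the ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def release_random_tag(tags):
    # Sanitizer A (simplified to one tag): publish a uniformly random tag.
    return tags[rng.integers(len(tags))]

n = 100
d_minus_me = np.zeros(n, dtype=int)        # everyone else's tag is 0
d_plus_me = np.append(d_minus_me, 1)       # ...and my tag is 1

trials = 100_000
p_plus = np.mean([release_random_tag(d_plus_me) for _ in range(trials)])
p_minus = np.mean([release_random_tag(d_minus_me) for _ in range(trials)])
print(p_plus, p_minus)   # ~1/(n+1) vs exactly 0: the ratio is unbounded
```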
Counting Queries
Database x of size n: n individuals, each contributing a single point in U.
Counting queries: Q is a set of predicates q: U → {0,1}.
Query: how many participants of x satisfy q? (Sometimes stated as a fraction.)
Relaxed accuracy: answer each query within α additive error w.h.p.
Not so bad: some error is anyway inherent in statistical analysis.
Bounds on Achievable Privacy
Want to get bounds on:
• Accuracy: the responses from the mechanism to all queries are within α, except with small probability.
• The number of queries t for which we can receive accurate answers.
• The privacy parameter ε for which ε-differential privacy, or (ε, δ)-differential privacy, is achievable.
Blatant Non-Privacy
Mechanism M is blatantly non-private if there is an adversary A that, on any database D of size n, can select queries and use the responses M(D) to reconstruct D′ with ||D − D′||₁ ∈ o(n), i.e., D′ agrees with D in all but o(n) of the entries.
Claim: blatant non-privacy implies that M is not (ε, δ)-DP for any constant ε.
Sanitization Can't Be Too Accurate
Usual counting queries:
• Query: q ⊆ [n]
• Response = answer + noise, where the answer is Σ_{i∈q} d_i
Blatant non-privacy: the adversary guesses 99% of the bits.
Theorem: if all responses are within o(n) of the true answer, then the algorithm is blatantly non-private.
But: this attack requires an exponential # of queries.
Proof: Exponential Adversary
• Focus on the column containing the super-private bit: "the database" is a vector d ∈ {0,1}^n.
• Assume all answers are within error bound E.
Will show that E cannot be o(n).
Proof: Exponential Adversary for Blatant Non-Privacy
• Estimate # of 1's in all possible sets: ∀S ⊆ [n], the answer M(S) satisfies |M(S) − Σ_{i∈S} d_i| ≤ E.
• Weed out "distant" DBs: for each candidate database c ∈ {0,1}^n, if for any S ⊆ [n]: |Σ_{i∈S} c_i − M(S)| > E, then rule out c.
• If c is not ruled out, halt and output c.
Claim: the real database d won't be ruled out.
Proof: Exponential Adversary
Assume: ∀S ⊆ [n], |M(S) − Σ_{i∈S} d_i| ≤ E.
Claim: for any c that has not been ruled out, the Hamming distance between c and d is ≤ 4E.
Proof sketch: let S₀ = {i : d_i = 0} and S₁ = {i : d_i = 1}. Since c was not ruled out, |M(S₀) − Σ_{i∈S₀} c_i| ≤ E and |M(S₁) − Σ_{i∈S₁} c_i| ≤ E. Combining each with the accuracy guarantee on d, c and d disagree on at most 2E entries of S₀ and at most 2E entries of S₁.
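A brute-force sketch of this exponential adversary (illustrative Python, runnable only for tiny n; the helper names are mine, not the lecture's):

```python
import itertools

def exponential_adversary(M, n, E):
    """M(S) returns sum_{i in S} d_i up to additive error E; output a candidate
    c consistent with every subset query, hence within Hamming distance 4E of d."""
    subsets = [frozenset(s) for r in range(n + 1)
               for s in itertools.combinations(range(n), r)]
    answers = {S: M(S) for S in subsets}
    for c in itertools.product((0, 1), repeat=n):
        if all(abs(answers[S] - sum(c[i] for i in S)) <= E for S in subsets):
            return c          # the real d is never ruled out, so this exists
    return None

d = (1, 0, 1, 1)
print(exponential_adversary(lambda S: sum(d[i] for i in S), n=4, E=0))
```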
Impossibility of Exponential Queries
The result means that we cannot sanitize the data and publish a data structure so that, for all queries, the answer can be deduced correctly to within o(n).
On the other hand: we will see that we can get accuracy up to log |Q|.
[Figure: the sanitizer receives query 1, query 2, … and returns answer 1, answer 2, answer 3, … about the database.]
What Can We Do Efficiently?
We allowed "too" much power to the adversary:
• Number of queries: exponential
• Computation: exponential
• On the other hand: we assumed a lack of wild errors in the responses.
Theorem: for any sanitization algorithm, if all responses are within o(√n) of the true answer, then it is blatantly non-private even against a polynomial-time adversary making O(n log² n) random queries.
The Model
• As before: the database d is a bit string of length n.
• Counting queries: a query is a subset q ⊆ {1, …, n}; the (exact) answer is a_q = Σ_{i∈q} d_i.
• E-perturbation: the response to a query is a_q ± E.
What If We Had Exact Answers?
Consider a mechanism with 0-perturbation: we receive the exact answer a_q = Σ_{i∈q} d_i. Then with n linearly independent queries (over the reals) we could reconstruct d precisely: obtain n linear equations a_q = Σ_{i∈q} c_i and solve uniquely.
With E-perturbations we only get inequalities: a_q − E ≤ Σ_{i∈q} c_i ≤ a_q + E.
Idea: use linear programming. A solution must exist: d itself.
Privacy Requires Ω(√n) Perturbation
Consider a database with o(√n) perturbation.
• The adversary makes t = n log² n random queries q_j, getting noisy answers a_j.
• Privacy-violating algorithm: construct a database c = {c_i}_{1 ≤ i ≤ n} by solving the linear program
  0 ≤ c_i ≤ 1 for 1 ≤ i ≤ n
  a_j − E ≤ Σ_{i∈q_j} c_i ≤ a_j + E for 1 ≤ j ≤ t
• Round the solution: if c_i > 1/2 set it to 1, and to 0 otherwise.
A solution must exist: d itself. For every query q_j, the answer according to c is at most 2E far from its (real) answer in d.
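A runnable sketch of this LP-based reconstruction attack (assuming numpy and scipy; the parameters and names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def lp_reconstruct(Q, answers, E):
    """Find c in [0,1]^n with a_j - E <= sum_{i in q_j} c_i <= a_j + E,
    then round each coordinate to the nearest bit."""
    t, n = Q.shape
    A_ub = np.vstack([Q, -Q])                     # Qc <= a+E and -Qc <= -(a-E)
    b_ub = np.concatenate([answers + E, -(answers - E)])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * n, method="highs")
    return (res.x > 0.5).astype(int)

rng = np.random.default_rng(0)
n = 64
d = rng.integers(0, 2, size=n)
t = int(n * np.log2(n) ** 2)                      # t = n log^2 n queries
Q = rng.integers(0, 2, size=(t, n))               # random subset queries
E = int(np.sqrt(n)) // 4                          # perturbation well below sqrt(n)
answers = Q @ d + rng.integers(-E, E + 1, size=t) # noisy answers within ±E
print("fraction of bits recovered:", (lp_reconstruct(Q, answers, E) == d).mean())
```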
Bad Solutions to the LP Do Not Survive
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d: |Σ_{i∈q} c_i − Σ_{i∈q} d_i| > 2E.
• Idea: show that for a database c that is far away from d, a random query disqualifies c with some constant probability δ.
• Want to use the union bound: all far-away solutions are disqualified w.p. at least 1 − n^n(1 − δ)^t = 1 − neg(n).
How do we limit the solution space? Round each value to the closest multiple of 1/n.
Privacy Requires Ω(√n) Perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d.
Lemma: if c is far away from d, then a random query disqualifies c with some constant probability. That is, if Pr_{i∈[n]}[|d_i − c_i| ≥ 1/3] > β, then there is a δ > 0 such that Pr_{q⊆[n]}[|Σ_{i∈q}(c_i − d_i)| ≥ 2E + 1] > δ.
The proof uses Azuma's inequality.
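An empirical look at this lemma (a sketch, assuming numpy; the sizes are illustrative): flip a constant fraction of d's coordinates and check how often a random subset query separates c from d by more than 2E + 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
E = int(np.sqrt(n)) // 4                 # the o(sqrt(n)) perturbation regime
d = rng.integers(0, 2, size=n).astype(float)
c = d.copy()
flips = rng.choice(n, size=n // 3, replace=False)
c[flips] = 1 - c[flips]                  # c is far from d: n/3 coordinates differ

queries = rng.integers(0, 2, size=(5_000, n))   # 5,000 random subset queries
gaps = np.abs(queries @ (c - d))
print("fraction of disqualifying queries:", np.mean(gaps > 2 * E + 1))
```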
Privacy Requires Ω(√n) Perturbation
We can discretize all potential databases c ∈ [0,1]^n: suppose we round each entry c_i to the closest fraction with denominator n, so that |c_i − w_i/n| ≤ 1/n. Then the response on any q changes by at most 1.
• If we disqualify all 'discrete' databases, then we also effectively eliminate all c ∈ [0,1]^n.
• There are n^n 'discrete' databases.
Privacy Requires Ω(√n) Perturbation
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2E far from the answer in d. ("Far from d" counts the number of entries far from d; the union bound can be applied via discretization.)
Claim: if c is far away from d, then a random query disqualifies c with some constant probability δ.
• Therefore: t = n log² n queries leave a negligible probability for each far-away reconstruction.
• Union bound: all far-away suggestions are disqualified w.p. at least 1 − n^n(1 − δ)^t = 1 − neg(n).
Review and Conclusion
• When the perturbation is o(√n), choosing Õ(n) random queries gives enough information to efficiently reconstruct an o(n)-close database.
• The database is reconstructed using linear programming, in polynomial time.
• Mechanisms with o(√n) perturbation are blatantly non-private: the database is poly(n)-time reconstructable.
Composition
Suppose we are going to apply a DP mechanism t times, perhaps on different databases. We want to argue that the result is differentially private.
• A value b ∈ {0,1} is chosen.
• In each of the t rounds, adversary A picks two adjacent databases D_i^0 and D_i^1 and receives the result z_i of an ε-DP mechanism M_i on D_i^b.
• Want to argue: A's view is within e^{tε} for both values of b.
• A's view: (z_1, z_2, …, z_t) plus the randomness used.
Differential Privacy: Composition
Suppose:
• A1(D) is ε1-diffP
• for all z1, A2(D, z1) is ε2-diffP.
Then A2(D, A1(D)) is (ε1+ε2)-diffP. (Handles auxiliary information; composes naturally.)
Proof: write P[z1] = Pr_{z∼A1(D)}[z = z1], P′[z1] = Pr_{z∼A1(D′)}[z = z1], P[z2] = Pr_{z∼A2(D,z1)}[z = z2], and P′[z2] = Pr_{z∼A2(D′,z1)}[z = z2]. For all adjacent D, D′ and all (z1, z2):
e^{−ε1} ≤ P[z1]/P′[z1] ≤ e^{ε1} and e^{−ε2} ≤ P[z2]/P′[z2] ≤ e^{ε2},
so e^{−(ε1+ε2)} ≤ P[(z1,z2)]/P′[(z1,z2)] ≤ e^{ε1+ε2}.
Differential Privacy: Composition
• If all mechanisms M_i are ε-DP, then for any view, the probabilities that A gets that view when b = 0 and when b = 1 are within a factor e^{tε} of each other.
Therefore results for a single query translate to results on several queries.
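A quick numerical illustration of the composition bound for two Laplace releases (a sketch, assuming numpy; the counts and observed outputs are made up): the likelihood ratio of a pair of answers factors into per-round ratios, each bounded by e^ε:

```python
import numpy as np

eps = 0.3
b = 1.0 / eps

def lap_pdf(y, mu):
    # Density of Lap(b) noise centered at the true answer mu.
    return np.exp(-abs(y - mu) / b) / (2 * b)

# Two rounds on adjacent databases: the true counts differ by 1 in each round.
true_D, true_Dprime = (40.0, 17.0), (41.0, 18.0)
z = (40.5, 16.2)                        # an observed pair of noisy answers

ratio = (lap_pdf(z[0], true_D[0]) * lap_pdf(z[1], true_D[1])) / \
        (lap_pdf(z[0], true_Dprime[0]) * lap_pdf(z[1], true_Dprime[1]))
print(ratio, "<= e^(2*eps) =", np.exp(2 * eps))
```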
Answering a Single Counting Query
U is a set of (name, tag ∈ {0,1}) tuples.
One counting query: # of participants with tag = 1.
Sanitizer A: output # of 1's + noise.
Differentially private, if the noise is chosen properly: choose noise from the Laplace distribution.
Laplacian Noise
The Laplace distribution Y = Lap(b) has density function Pr[Y = y] = (1/2b) · e^{−|y|/b}; its standard deviation is O(b).
Take b = 1/ε, and get that Pr[Y = y] ∝ e^{−ε|y|}.
Laplacian Noise: ε-Privacy
Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}. Release: q(D) + Lap(1/ε).
For adjacent D, D′: |q(D) − q(D′)| ≤ 1. Hence for every output a:
e^{−ε} ≤ Pr_D[a] / Pr_{D′}[a] ≤ e^ε
Laplacian Noise: ε-Privacy
Theorem: the Laplace mechanism with parameter b = 1/ε is ε-differentially private.
Laplacian Noise: Õ(1/ε) Error
Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}.
Concentration of the Laplace distribution: Pr_{y∼Y}[|y| > k·(1/ε)] = O(e^{−k}).
Setting k = O(log n): the expected error is 1/ε, and w.h.p. the error is Õ(1/ε).
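A quick empirical check of this tail bound (a sketch, assuming numpy): Pr[|Lap(1/ε)| > k·(1/ε)] equals e^{−k}, so with k = O(log n) the error is Õ(1/ε) w.h.p.:

```python
import numpy as np

rng = np.random.default_rng(3)
eps, k = 0.5, 5.0
samples = rng.laplace(scale=1.0 / eps, size=1_000_000)
print(np.mean(np.abs(samples) > k / eps), "vs e^-k =", np.exp(-k))
```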