Foundations of Privacy, Lecture 7
Lecturer: Moni Naor
(ε, δ)-Differential Privacy
Sanitizer M gives (ε, δ)-differential privacy if, for all adjacent D1 and D2 and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A] + δ
[Figure: Pr[response] plotted over the possible responses, with the bad responses marked; the ratio between the two distributions is bounded.]
This course: δ negligible
Typical setting: ε … and δ negligible
Example: NO Differential Privacy
U: set of (name, tag ∈ {0,1}) tuples
One counting query: # of participants with tag = 1
Sanitizer A: choose and release a few random tags
Bad event T: only my tag is 1, and my tag is released
• Pr_A[A(D+I) ∈ T] ≥ 1/n
• Pr_A[A(D-I) ∈ T] = 0
ε-differential privacy would require e^(-ε) ≤ Pr_A[A(D+I) ∈ T] / Pr_A[A(D-I) ∈ T] ≤ e^ε ≈ 1+ε
• Not ε-differentially private for any ε!
• It is (0, 1/n)-differentially private
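The contrast between the two claims above can be checked numerically. The following is a minimal sketch, assuming the sanitizer's output is summarized by just two outcomes ("T" and "not T") and taking Pr[T] = 1/n exactly; the helper name `satisfies_dp` and the dictionaries are illustrative, not from the lecture. It relies on the fact that, for a finite outcome space, the worst event A is the set of outcomes where one probability exceeds e^ε times the other.

```python
import math

def satisfies_dp(p, q, eps, delta, tol=1e-12):
    """Check Pr_p[A] <= e^eps * Pr_q[A] + delta, and the symmetric bound,
    for every event A over a finite outcome space.

    The worst-case event collects exactly the outcomes where one side
    exceeds e^eps times the other, so summing those excesses suffices.
    """
    outcomes = set(p) | set(q)

    def excess(a, b):
        return sum(
            max(0.0, a.get(o, 0.0) - math.exp(eps) * b.get(o, 0.0))
            for o in outcomes
        )

    return excess(p, q) <= delta + tol and excess(q, p) <= delta + tol

# Coarse model of the slide's example (an assumption of this sketch):
n = 100
with_me = {"T": 1 / n, "not T": 1 - 1 / n}   # database D+I
without_me = {"T": 0.0, "not T": 1.0}        # database D-I

print(satisfies_dp(with_me, without_me, eps=0.0, delta=1 / n))  # (0, 1/n)-DP holds
print(satisfies_dp(with_me, without_me, eps=5.0, delta=0.0))    # pure eps-DP fails
```

Because Pr[T] is zero on D-I and positive on D+I, no finite ε can absorb the gap; only the additive δ term can.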
Counting Queries
• Database x of size n: n individuals, each contributing a single point in U
• Counting queries: Q is a set of predicates q: U → {0,1}
• Query: how many participants in x satisfy q?
• Relaxed accuracy: answer each query within α additive error w.h.p.
• Not so bad: some error is inherent in statistical analysis anyway
• Sometimes we talk about the fraction rather than the count
Bounds on Achievable Privacy
Bounds on the:
• Accuracy: the responses from the mechanism to all queries are assured to be within α except with probability β
• Number of queries t for which we can receive accurate answers
• The privacy parameter ε for which ε-differential privacy is achievable
• Or for which (ε, δ)-differential privacy is achievable
Composition: t-Fold
Suppose we are going to apply a DP mechanism t times, perhaps on different databases.
Want: the combined outcome is differentially private
• A value b ∈ {0,1} is chosen
• In each of the t rounds:
  • adversary A picks two adjacent databases D_i^0 and D_i^1 and an ε-DP mechanism M_i
  • A receives the result z_i of the ε-DP mechanism M_i on D_i^b
• Want to argue: A's view is within ε' for both values of b
• A's view: (z_1, z_2, …, z_t) plus the randomness used.
Adversary's view
• A's view: randomness + (z_1, z_2, …, z_t)
• Distribution with b: V^b
[Figure: A picks the pairs (D_1^0, D_1^1), (D_2^0, D_2^1), …, (D_t^0, D_t^1); mechanism M_i is run on D_i^b and returns z_i = M_i(D_i^b) to A, for i = 1, …, t.]
Differential Privacy: Composition
Last week:
• If all mechanisms M_i are ε-DP, then for any view, the probabilities that A gets the view when b=0 and when b=1 are within a factor of e^(εt)
• t releases, each ε-DP, are t·ε-DP
Today:
• t releases, each ε-DP, are (√t·ε + t·ε², δ)-DP (roughly)
Therefore results for a single query translate to results on several queries
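To get a feel for the gap between the two bounds, here is a small sketch comparing them. The slide only states the second bound "roughly"; the exact expression used below, ε' = √(2t·ln(1/δ'))·ε + t·ε·(e^ε - 1), is the standard advanced composition bound of Dwork, Rothblum and Vadhan, and is an assumption about which precise form the lecture has in mind. The function names are illustrative.

```python
import math

def basic_composition(eps, t):
    """t releases, each eps-DP, are (t * eps)-DP."""
    return t * eps

def advanced_composition(eps, t, delta_prime):
    """Privacy parameter eps' such that t releases, each eps-DP,
    are (eps', delta_prime)-DP (standard advanced composition form)."""
    return (math.sqrt(2 * t * math.log(1 / delta_prime)) * eps
            + t * eps * (math.exp(eps) - 1))

eps, t, delta_prime = 0.01, 10_000, 1e-6
print(basic_composition(eps, t))                  # linear in t
print(advanced_composition(eps, t, delta_prime))  # roughly sqrt(t)*eps + t*eps^2
```

For small ε and many rounds the second bound is far smaller; for very few rounds the ln(1/δ') factor can make it the larger of the two.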
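Privacy Loss as a Random Walk
[Figure: privacy loss plotted against the number of steps t as a walk with steps +1 and -1 (1, -1, 1, 1, -1, 1, 1, -1, …) over the potentially dangerous rounds; the privacy loss grows as roughly √t.]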
The Exponential Mechanism [McSherry, Talwar]
A general mechanism that
• yields differential privacy
• may yield utility/approximation
• is defined and evaluated by considering all possible answers
The definition does not yield an efficient way of evaluating it.
Application/original motivation: approximate truthfulness of auctions
• Collusion resistance
• Compatibility
Side bar: Digital Goods Auction
• Some product with 0 cost of production
• n individuals with valuations v_1, v_2, …, v_n
• Auctioneer wants to maximize profit
Key to truthfulness: what you say should not affect what you pay
• What about approximate truthfulness?
Example of the Exponential Mechanism
• Data: x_i = website visited by student i today
• Range: Y = {website names}
• For each name y, let q(y, X) = #{i : x_i = y}
Goal: output the most frequently visited site
• Procedure: given X, output website y with probability proportional to e^(ε·q(y,X))
• Popular sites are exponentially more likely than rare ones
• Website scores don't change too quickly: q(y, X) is the size of a subset, so changing one entry changes it by at most 1
Setting
• For input D ∈ U^n, want to find r ∈ R
• Base measure μ on R, usually uniform
• Score function w: U^n × R → ℝ (the reals) assigns any pair (D, r) a real value
• Want to maximize it (approximately)
The exponential mechanism
• Assign output r ∈ R with probability proportional to e^(ε·w(D,r))·μ(r)
• Normalizing factor: ∑_r e^(ε·w(D,r))·μ(r)
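The exponential mechanism is private
• Let Δ = max_{D,D',r} |w(D,r) - w(D',r)|, taken over adjacent D and D' (the sensitivity of w)
Claim: the exponential mechanism yields a 2·ε·Δ-differentially private solution
For adjacent databases D and D' and for all possible outputs r ∈ R:
• Prob[output = r when input is D] = e^(ε·w(D,r))·μ(r) / ∑_{r'} e^(ε·w(D,r'))·μ(r')
• Prob[output = r when input is D'] = e^(ε·w(D',r))·μ(r) / ∑_{r'} e^(ε·w(D',r'))·μ(r')
The ratio is bounded by e^(εΔ)·e^(εΔ) = e^(2εΔ): the numerators differ by a factor of at most e^(εΔ), and so do the denominators.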
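The definition and the privacy claim can be sketched in a few lines for a finite range. This is a minimal illustration assuming a uniform base measure μ and the website-count score from the earlier example; the function names and the toy data are hypothetical. The loop at the end checks the 2·ε·Δ bound on one pair of adjacent databases (Δ = 1 for counts when one entry is replaced).

```python
import math
import random
from collections import Counter

def exponential_mechanism_probs(scores, eps):
    """Return {r: probability}, with probability proportional to exp(eps * score).
    Assumes a uniform base measure over the keys of `scores`."""
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {r: math.exp(eps * (s - top)) for r, s in scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

def sample_output(probs, rng):
    """Draw one output according to the given probabilities."""
    outcomes = list(probs)
    return rng.choices(outcomes, weights=[probs[r] for r in outcomes], k=1)[0]

def site_scores(visits, sites):
    """q(y, X) = number of students whose visited site is y."""
    counts = Counter(visits)
    return {y: counts[y] for y in sites}

sites = ["a.example", "b.example", "c.example", "d.example"]
X = ["a.example"] * 5 + ["b.example"] * 3 + ["c.example"]
X_adj = X[:-1] + ["d.example"]   # adjacent database: one entry replaced

eps = 0.5
p = exponential_mechanism_probs(site_scores(X, sites), eps)
p_adj = exponential_mechanism_probs(site_scores(X_adj, sites), eps)

for y in sites:
    ratio = p[y] / p_adj[y]
    print(y, round(p[y], 4), ratio <= math.exp(2 * eps))

print(sample_output(p, random.Random(0)))
```

Enumerating every output is what makes this sketch easy and also what makes the mechanism expensive when the range is large.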
Laplace Noise as Exponential Mechanism
• On query q: U^n → ℝ let w(D, r) = -|q(D) - r|
• Prob[noise = y] = e^(-ε|y|) / (2·∫_{y≥0} e^(-εy) dy) = (ε/2)·e^(-ε|y|)
Laplace distribution: Y = Lap(b) has density function Pr[Y = y] = (1/(2b))·e^(-|y|/b)
[Figure: the Laplace density plotted for y from -4 to 5, peaked at 0.]
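Read this way, answering q with the exponential mechanism amounts to adding Lap(1/ε) noise to q(D). A minimal sketch follows, assuming a query of sensitivity 1 by default; the function names are illustrative. It uses the fact that the difference of two independent exponential variables with mean b is distributed as Lap(b).

```python
import math
import random

def laplace_noise(b, rng):
    """Sample from Lap(b) as the difference of two exponentials with mean b."""
    return rng.expovariate(1 / b) - rng.expovariate(1 / b)

def laplace_density(y, b):
    """Density of Lap(b): (1 / 2b) * exp(-|y| / b)."""
    return math.exp(-abs(y) / b) / (2 * b)

def laplace_mechanism(true_answer, eps, rng, sensitivity=1.0):
    """Release true_answer plus Lap(sensitivity / eps) noise."""
    return true_answer + laplace_noise(sensitivity / eps, rng)

rng = random.Random(1)
print(laplace_mechanism(42.0, eps=0.5, rng=rng))
```

Shifting the true answer by 1 changes the density at any output by a factor of at most e^ε, which is the ε-DP guarantee for a sensitivity-1 query.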
Any Differentially Private Mechanism is an Instance of the Exponential Mechanism
• Let M be a differentially private mechanism
• Take w(D, r) to be log(Prob[M(D) = r])
Remaining issue: accuracy
Private Ranking
• Each element i ∈ {1, …, n} has a real-valued score S_D(i) based on a data set D.
• Goal: output k elements with the highest scores.
• Privacy:
  • Data set D consists of n entries in domain 𝒟.
  • Differential privacy: protects the privacy of the entries in D.
• Condition: insensitive scores
  • for any element i, and for any data sets D and D' that differ in one entry: |S_D(i) - S_D'(i)| ≤ 1
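Approximate ranking
• Let S_k be the k-th highest score on data set D.
• An output list is α-useful if:
  • Soundness: no element in the output has score ≤ S_k - α
  • Completeness: every element with score ≥ S_k + α is in the output.
[Figure: the score axis split into three regions: score ≤ S_k - α (never output), S_k - α ≤ score ≤ S_k + α (may or may not be output), and S_k + α ≤ score (always output).]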
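The two conditions translate directly into a checker. The sketch below follows the inequalities as stated on the slide; the function name `is_alpha_useful` and the toy scores are illustrative.

```python
def is_alpha_useful(scores, output, k, alpha):
    """Check soundness and completeness of an output list.

    scores: dict mapping element -> score on the data set
    output: iterable of chosen elements
    """
    s_k = sorted(scores.values(), reverse=True)[k - 1]  # k-th highest score
    chosen = set(output)
    # Soundness: no chosen element has score <= S_k - alpha.
    sound = all(scores[i] > s_k - alpha for i in chosen)
    # Completeness: every element with score >= S_k + alpha is chosen.
    complete = all(i in chosen for i, s in scores.items() if s >= s_k + alpha)
    return sound and complete

scores = {"a": 10, "b": 8, "c": 5, "d": 1}
print(is_alpha_useful(scores, ["a", "b"], k=2, alpha=2))
print(is_alpha_useful(scores, ["a", "c"], k=2, alpha=2))
```

Elements whose scores fall strictly inside the band around S_k are unconstrained, which is what gives a private algorithm room to make mistakes.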
Two Approaches
Note: each input affects all scores
• Score perturbation
  • Perturb the scores of the elements with noise
  • Pick the top k elements in terms of noisy scores.
  • Fast and simple implementation
  • Question: what sort of noise should be added? What sort of guarantees?
• Exponential sampling
  • Run the exponential mechanism k times.
  • More complicated and slower implementation
  • What sort of guarantees?
Exponential Mechanism: Simple Example
(Almost free) private lunch
• Database of n individuals, lunch options {1, …, k}; each individual likes or dislikes each option (1 or 0)
• Goal: output a lunch option that many like
• For each lunch option j ∈ [k], ℓ(j) is the # of individuals who like j
• Exponential mechanism: output j with probability proportional to e^(ε·ℓ(j))
• Actual probability: e^(ε·ℓ(j)) / ∑_i e^(ε·ℓ(i)), where the denominator is the normalizer
The Net Mechanism
• Idea: limit the number of possible outputs
  • Want |R| to be small
• Why is it good?
  • The good (accurate) output has to compete with only a few possible outputs
  • If there is a guarantee that there is at least one good output, then the total weight of the bad outputs is limited
Nets
A collection N of databases is called an α-net of databases for a class of queries C if:
• for all possible databases x there exists a y ∈ N such that max_{q∈C} |q(x) - q(y)| ≤ α
If we use the closest member of N instead of the real database, we lose at most α in terms of the worst query.
The Net Mechanism
For a class of queries C, privacy ε and accuracy α, on database x:
• Let N be an α-net for the class of queries C
• Let w(x, y) = -max_{q∈C} |q(x) - q(y)|
• Sample and output according to the exponential mechanism with x, w, and R = N
• For y ∈ N: Prob[y] is proportional to e^(ε·w(x,y))
Privacy and Utility Claims
Privacy: the net mechanism is ε·Δ-differentially private
Utility: the net mechanism is (2α, β)-accurate for any α and β such that
• α ≥ (2/ε)·log(|N|/β)
Proof:
• there is at least one good solution: it gets weight at least e^(-εα)
• there are at most |N| (bad) outputs, i.e. outputs whose accuracy is not less than 2α: each gets weight at most e^(-2εα)
• use the union bound
Synthetic DB: Output is a DB
[Figure: the sanitizer sits between the database and the user, receiving query 1, query 2, … and returning answer 1, answer 2, answer 3.]
Synthetic DB: the output is always a DB
• of entries from the same universe U
• the user reconstructs answers to queries by evaluating the query on the output DB
Benefits: software and people compatible; consistent answers
Counting Queries
• Database D of size n
• Queries with low sensitivity
• Counting queries: C is a set of predicates c: U → {0,1}
• Query: how many participants in D satisfy c?
• Relaxed accuracy: answer each query within α additive error w.h.p.
• Not so bad: error is inherent in statistical analysis anyway
• Non-interactive: assume all queries are given in advance
α-Net For Counting Queries
If we want to answer many counting queries C with differential privacy:
• it is sufficient to come up with an α-net for C
• resulting accuracy: max{α, log(|N|/β)/ε}
Claim: consider the set N consisting of all databases of size m, where m = log|C|/α². Consider each element in the set to have weight n/m. Then N is an α-net for any collection C of counting queries.
Error is Õ(n^(2/3) log|C|)
Remarkable
Hope for rich private analysis of small DBs!
• Quantitative: #queries >> DB size
• Qualitative: the output of the sanitizer is a synthetic DB, i.e. the output is a DB itself
The BLR Algorithm [Blum, Ligett, Roth 2008]
For DBs F and D: dist(F, D) = max_{q∈C} |q(F) - q(D)|
Intuition: far-away DBs get smaller probability
Algorithm on input DB D: sample from a distribution on DBs of size m (m < n), where DB F gets picked w.p. ∝ e^(-ε·dist(F,D))
The BLR Algorithm: Idea
• In general: do not use a large DB
• Sample and answer accordingly
• A DB of size m guarantees hitting each query with sufficient accuracy
The BLR Algorithm: Error Õ(n^(2/3) log|C|)
Goodness Lemma: there exists F_good of size m = Õ((n/α)²·log|C|) s.t. dist(F_good, D) ≤ α
Proof: construct F_good, a member of U^m, by taking m random samples from the entries of D
The BLR Algorithm: Error Õ(n^(2/3) log|C|), continued
• By the Goodness Lemma there exists F_good of size m = Õ((n/α)²·log|C|) s.t. dist(F_good, D) ≤ α, so Pr[F_good] ~ e^(-εα)
• For any F_bad with dist ≥ 2α: Pr[F_bad] ~ e^(-2εα)
• Union bound: ∑_{bad DB F_bad} Pr[F_bad] ~ |U|^m·e^(-2εα)
• For α = Õ(n^(2/3) log|C|): Pr[F_good] >> ∑ Pr[F_bad]
The BLR Algorithm: 2ε-Privacy
• For adjacent D, D' and for every F: |dist(F, D) - dist(F, D')| ≤ 1
• Probability of F by D: e^(-ε·dist(F,D)) / ∑_{G of size m} e^(-ε·dist(G,D))
• Probability of F by D': the numerator and the denominator can each change by an e^ε factor ⇒ 2ε-privacy
The BLR Algorithm: Running Time
Generating the distribution by enumeration: need to enumerate every size-m database, where m = Õ((n/α)²·log|C|)
Running time ≈ |U|^(Õ((n/α)²·log|C|))
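Enumeration is only feasible for toy parameters, but it makes the algorithm and its 2ε-privacy argument concrete. The sketch below is an illustration under stated assumptions: candidate databases are ordered m-tuples over U (so there are |U|^m of them), each candidate record is weighted n/m so that counts are comparable with those of D, and the universe, queries, and databases are made-up toy data. The names `scaled_count`, `dist`, and `blr_distribution` are hypothetical.

```python
import itertools
import math

def scaled_count(db, q, n):
    """Number of records in db satisfying q, each record weighted n / len(db)."""
    return sum(1 for r in db if q(r)) * n / len(db)

def dist(F, D, queries):
    """dist(F, D) = max over queries of |q(F) - q(D)|, with F rescaled to size n."""
    n = len(D)
    return max(abs(scaled_count(F, q, n) - scaled_count(D, q, n)) for q in queries)

def blr_distribution(D, universe, queries, m, eps):
    """Probability of each size-m database F, proportional to exp(-eps * dist(F, D)).
    Enumerates all |U|^m candidates, so it only works for tiny inputs."""
    candidates = list(itertools.product(universe, repeat=m))
    dists = {F: dist(F, D, queries) for F in candidates}
    closest = min(dists.values())  # shift exponents for numerical stability
    weights = {F: math.exp(-eps * (d - closest)) for F, d in dists.items()}
    total = sum(weights.values())
    return {F: w / total for F, w in weights.items()}

universe = [0, 1, 2, 3]
queries = [lambda r: r == 0, lambda r: r <= 1, lambda r: r == 3]
D = [0, 0, 0, 1, 1, 2]
D_adj = [0, 0, 0, 1, 1, 3]   # adjacent: one entry replaced

eps, m = 1.0, 2
p = blr_distribution(D, universe, queries, m, eps)
p_adj = blr_distribution(D_adj, universe, queries, m, eps)
worst = max(max(p[F] / p_adj[F], p_adj[F] / p[F]) for F in p)
print(len(p), max(p, key=p.get), worst <= math.exp(2 * eps))
```

Already at |U| = 4 and m = 2 there are 16 candidates; the count grows as |U|^m, which is the super-polynomial running time noted above.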
Conclusion
• Offline algorithm, 2ε-differential privacy for any set C of counting queries
• Error α is Õ(n^(2/3) log|C| / ε)
• Super-polynomial running time: |U|^(Õ((n/α)²·log|C|))
Maintaining State
[Figure: queries q arrive one at a time; State = Distribution D.]
The Multiplicative Weights Algorithm
• Powerful tool in algorithm design
• Learn a probability distribution iteratively
• In each round:
  • either the current distribution is good
  • or we get a lot of information on the distribution
• Update the distribution
The PMW Algorithm
Maintain a distribution D on universe U (this is the state; it is completely public!)
Initialize D to be uniform on U
Repeat up to k times:
• Set T̂ ← T + Lap(σ)
• Repeat while no update occurs:
  • Receive query q ∈ Q
  • Let â = x(q) + Lap(σ), where x(q) is the true value
  • Test: if |q(D) - â| ≤ T̂, output q(D)
  • Else (update):
    • Output â
    • Update D[i] ∝ D[i]·e^(±(T/4)·q[i]) and re-weight (the plus or minus is according to the sign of the error)
The algorithm fails if more than k updates occur
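The loop above can be sketched as follows. This is an illustration under several assumptions not fixed by the slide: the database x is given as a normalized histogram over U (fractions summing to 1), each query is a 0/1 vector over U, q(D) is the inner product of q with the distribution, and the noise scale is passed in as `sigma`. The example call uses a tiny `sigma` only so that the run is easy to follow; such a setting gives no meaningful privacy. The names `lap` and `pmw` are hypothetical.

```python
import math
import random

def lap(scale, rng):
    """Sample Laplace noise as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def pmw(x, queries, T, sigma, k, rng):
    """Answer queries with a public distribution D, updating it by
    multiplicative weights whenever its answer is far from the noisy truth."""
    size = len(x)
    D = [1.0 / size] * size           # public state: uniform on U
    answers, updates = [], 0
    T_hat = T + lap(sigma, rng)       # noisy threshold for the current round
    for q in queries:
        noisy = sum(xi * qi for xi, qi in zip(x, q)) + lap(sigma, rng)
        estimate = sum(di * qi for di, qi in zip(D, q))
        if abs(estimate - noisy) <= T_hat:
            answers.append(estimate)  # the current distribution is good enough
            continue
        updates += 1
        if updates > k:
            raise RuntimeError("PMW failed: more than k updates")
        answers.append(noisy)
        sign = 1.0 if noisy > estimate else -1.0   # sign of the error
        D = [di * math.exp(sign * (T / 4) * qi) for di, qi in zip(D, q)]
        total = sum(D)
        D = [di / total for di in D]  # re-weight so D sums to 1
        T_hat = T + lap(sigma, rng)   # fresh threshold after each update
    return answers, D, updates

x = [0.5, 0.5, 0.0, 0.0]              # toy histogram over a universe of size 4
q = [1, 1, 0, 0]                      # true answer is 1.0
answers, D, updates = pmw(x, [q] * 60, T=0.2, sigma=1e-9, k=50, rng=random.Random(0))
print(updates, [round(d, 3) for d in D])
```

Repeating the same query shows the two regimes: early rounds trigger updates and release the noisy answer, and once D is close enough the mechanism answers from the public state alone.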
Overview: Privacy Analysis
For the query family Q = {0,1}^U, for (ε, δ, β) and t, the PMW mechanism is
• (ε, δ)-differentially private
• (α, β)-accurate for up to t queries, where α = Õ(1/(ε·n)^(1/2)); the accuracy has log dependency on |U|, δ, and t
• State = distribution is privacy preserving for individuals (but not for queries)