Toward Privacy in Public Databases Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee Work Done at Microsoft Research
Database Privacy • Think “Census” • Individuals provide information • Census Bureau publishes sanitized records • Privacy is legally mandated; what utility can we achieve? • Inherent Privacy vs Utility trade-off • One extreme – complete privacy; no information • Other extreme – complete information; no privacy • Goals: • Find a middle path • preserve macroscopic properties • “disguise” individual identifying information • Change the nature of discourse • Establish framework for meaningful comparison of techniques
Current solutions • Statistical approaches • Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means • Additionally, erase values that reveal too much • Query-based approaches • Disallow queries that reveal too much • Output perturbation (add noise to the true answer) • Unsatisfying • Ad hoc definitions of privacy breach • Erasure can disclose information • Noise can cancel (although see the work of Nissim et al.) • Combinations of several seemingly innocuous queries can reveal information; refusal to answer can itself be revelatory
Everybody’s First Suggestion • Learn the distribution, then output • A description of the distribution, or • Samples from the learned distribution • Want to reflect facts on the ground • Statistically insignificant clusters can be important for allocating resources
Our Approach • Crypto-flavored definitions • Mathematical characterization of Adversary’s goal • Precise definition of when sanitization procedure fails • Intuition: seeing sanitized DB gives Adversary an “advantage” • Statistical Techniques • Perturbation of attribute values • Differs from previous work: perturbation amounts depend on local densities of points • Highly abstracted version of problem • If we can’t understand this, we can’t understand real life (and we can’t…) • If we get negative results here, the world is in trouble.
What do WE mean by privacy? • [Ruth Gavison] Protection from being brought to the attention of others • inherently valuable • attention invites further privacy loss • Privacy is assured to the extent that one blends in with the crowd • Appealing definition; can be converted into a precise mathematical statement…
A geometric view • Abstraction: • Database consists of points in high-dimensional space R^d, drawn as independent samples from some underlying distribution • Points are unlabeled: you are your collection of attributes • Distance is everything: points are similar if and only if they are close (L2 norm) • Real Database (RDB), private: n unlabeled points in d-dimensional space • Sanitized Database (SDB), public: n' new points, possibly in a different space
The adversary or Isolator – Intuition • On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d • q "isolates" a real DB point x if it is much closer to x than to x's near neighbors • q fails to isolate x if q looks roughly as much like everyone in x's neighborhood as it looks like x itself • Tightly clustered points have a smaller radius of isolation
Isolation – the definition • I(SDB, aux) = q; let δ_x = |q − x| • x is isolated if B(q, cδ_x) contains fewer than T other points from RDB • T-radius of x: the distance to its T-th-nearest neighbor • x is "safe" if δ_x > (T-radius of x)/(c − 1), i.e., B(q, cδ_x) contains x's entire T-neighborhood • c is the privacy parameter, e.g., c = 4 • Why: if |x − p| ≤ T-rad_x < (c − 1)δ_x, then |q − p| ≤ |q − x| + |x − p| < δ_x + T-rad_x < cδ_x • (see the sketch below)
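To make the definition concrete, here is a minimal NumPy sketch of the isolation test. The function name is_isolated and the defaults c=4 and T=10 are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def is_isolated(q, x, rdb, c=4, T=10):
    """Return True if the adversary's point q c-isolates the RDB point x,
    i.e. the ball B(q, c*|q - x|) contains fewer than T other RDB points.
    Assumes x is one of the rows of rdb (an n x d array)."""
    delta = np.linalg.norm(q - x)              # delta_x = |q - x|
    dists = np.linalg.norm(rdb - q, axis=1)    # distances from q to every RDB point
    others_inside = np.count_nonzero(dists < c * delta) - 1  # exclude x itself
    return others_inside < T
```

By the triangle-inequality bullet above, any x whose distance from q exceeds (T-radius of x)/(c − 1) automatically fails to be isolated under this test.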
Requirements for the sanitizer • No way of obtaining privacy if AUX already reveals too much! • The sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success • The definition of "considerably" can be forgiving, say, n^{-2} • Made rigorous by quantification over adversaries, distributions, auxiliary information, sanitizations, and samples: for every isolator I there is an isolator I' such that, with overwhelming probability over D, aux z, and x ∈ D, |Pr[I(SDB, z) isolates x] − Pr[I'(z) isolates x]| is small/n • Provides a framework for describing the power of a sanitization method, and hence for comparisons
The Sanitizer • The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius • x' = San(x) ∈_R B(x, T-rad(x)) • Intuition: • we are blending x in with its crowd • we are adding to x random noise with mean zero, so several macroscopic properties should be preserved • (a code sketch follows below)
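A minimal sketch of this sanitizer, reading "∈_R B(x, T-rad(x))" as sampling uniformly at random from that ball; the names t_radius and sanitize_point, the default T=10, and the uniform-in-ball sampling routine are my own choices rather than the paper's.

```python
import numpy as np

def t_radius(rdb, i, T=10):
    """Distance from rdb[i] to its T-th nearest neighbor in the RDB."""
    dists = np.linalg.norm(rdb - rdb[i], axis=1)
    return np.sort(dists)[T]                 # index 0 is the point itself

def sanitize_point(rdb, i, T=10, rng=None):
    """Perturb rdb[i] to a uniformly random point of B(x_i, T-rad(x_i))."""
    rng = np.random.default_rng() if rng is None else rng
    d = rdb.shape[1]
    r = t_radius(rdb, i, T)
    v = rng.standard_normal(d)               # uniform direction ...
    v /= np.linalg.norm(v)
    u = rng.random() ** (1.0 / d)            # ... and radius for a uniform ball sample
    return rdb[i] + r * u * v
```

Because the noise has mean zero, averages over many sanitized points should track the corresponding averages of the real data, which is the macroscopic-property intuition above.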
Flavor of Results (Preliminary) • Assumptions: data arise from a mixture of Gaussians; the dimension d and number of points n are large; d = ω(log n) • Results: • Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^{−Ω(d)} • several special cases; general result not yet proved • very different proof techniques from anything in the statistics or crypto literatures! • Utility: a user who does not know the Gaussians can compute the means with high probability
The “simplest” interesting case • Two points, x and y, generated uniformly from the surface of a ball B(o, r) • The adversary knows x', y', r, and δ = |x − y| • We prove there are 2^{Ω(d)} "decoy" pairs (x_i, y_i) such that |x_i − y_i| = δ and Pr[x_i, y_i | x', y'] = Pr[x, y | x', y'] • Furthermore, the adversary can only isolate one point x_i or y_i at a time: they are "far apart" with respect to δ • Proof based on symmetry arguments and coding theory; high dimensionality is crucial
Finding Decoy Pairs [figure: x, y, their sanitizations x', y', and reflections x_H, y_H across the hyperplane H] • Consider a hyperplane H through x', y', and o • x_H, y_H: mirror reflections of x, y through H (note: reflections preserve distances!) • The world of x_H, y_H looks identical to the world of x, y: Pr[x_H, y_H | x', y'] = Pr[x, y | x', y'] • (a small numerical check follows below)
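A small numerical check (not from the talk) of the key fact used here: reflecting x and y across a hyperplane through the origin leaves |x − y| unchanged. In the argument, H also passes through x' and y'; below the hyperplane normal is arbitrary, just to illustrate the isometry.

```python
import numpy as np

def reflect(p, normal):
    """Householder reflection of p across the hyperplane through the
    origin whose unit normal is `normal`."""
    n = normal / np.linalg.norm(normal)
    return p - 2.0 * np.dot(p, n) * n

rng = np.random.default_rng(0)
d = 50
x, y = rng.standard_normal(d), rng.standard_normal(d)
n = rng.standard_normal(d)                   # a random hyperplane through the origin
xH, yH = reflect(x, n), reflect(y, n)
assert np.isclose(np.linalg.norm(x - y), np.linalg.norm(xH - yH))  # distances preserved
assert np.isclose(np.linalg.norm(x), np.linalg.norm(xH))           # norms preserved too
```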
Lots of choices for H [figure: two reflections x_1, x_2 of x on the sphere of radius r, separated by angle 2θ and hence by distance 2r·sin θ] • x_H, y_H: reflections of x, y through H(x', y', o); reflections preserve distances, so the world of x_H, y_H looks identical to the world of x, y • How many different H are there such that the corresponding x_H are pairwise distant (and distant from x)? • Sufficient to pick r > 2/3 d and θ = 30° • Fact: there are 2^{Ω(d)} vectors in d dimensions at angle 60° from each other • Probability that the adversary wins ≤ 2^{−Ω(d)}
Towards the general case… n points • The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x'_1 • Symmetry does not work: too many constraints • A more direct argument: • let Z = { p ∈ R^d | p is a legal pre-image for x'_1 } and Q = { p | if x_1 = p then x_1 is isolated by q } • show that Pr[x_1 ∈ Q∩Z | x'_1] ≤ 2^{−Ω(d)} • Pr[x_1 ∈ Q∩Z | x'_1] = (probability-mass contribution of Q∩Z) / (contribution of Z), which is at most 2^{1−d} / (1/4)
Why does Q∩Z contribute so little mass? [figure: the sanitized point x'_1, its pre-image region Z, the isolating region Q∩Z near the query q, and other points x_2, …, x_6] • Z = { p | p is a legal pre-image for x'_1 }; Q = { p | if x_1 = p then x_1 is isolated by q } • T = 1; perturb to the 1-radius, so |x'_1 − x_1| = 1-rad(x_1) • Key observation: • as |q − x'_1| increases, Q becomes larger • but a larger distance from x'_1 implies smaller probability mass, as x_1 is randomized over a larger area
The general case… n sanitized points • Initial intuition is wrong: privacy of x_1 given x'_1 and all the other points in the clear does not imply privacy of x_1 given x'_1 and the sanitizations of the others! • Sanitization of the other points reveals information about x_1
Digression: Histogram Sanitization • U = d-dimensional cube with side 2 • Cut U into 2^d subcubes by splitting along each axis; each subcube has side 1 • For each subcube: if the number of RDB points in it is > 2T, then recurse • Output: list of cells and counts • (a recursive sketch follows below)
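A hedged sketch of the recursive subdivision just described. The (cell_low, side, count) bookkeeping and the function name histogram_sanitize are my own; note that enumerating 2^d subcells is only feasible for small d, so this is purely illustrative.

```python
import numpy as np
from itertools import product

def histogram_sanitize(points, low, side, T=10):
    """Subdivide the cube [low, low + side]^d into 2^d subcells and recurse
    on any subcell holding more than 2T points; return (low, side, count)
    triples for the cells that are not subdivided further."""
    d = points.shape[1]
    half = side / 2.0
    cells = []
    for corner in product((0, 1), repeat=d):
        cell_low = low + half * np.array(corner)
        inside = np.all((points >= cell_low) & (points < cell_low + half), axis=1)
        sub = points[inside]
        if len(sub) > 2 * T:
            cells.extend(histogram_sanitize(sub, cell_low, half, T))
        else:
            cells.append((cell_low, half, len(sub)))
    return cells

# usage, matching the slide's cube of side 2 (e.g. [-1, 1]^d):
# cells = histogram_sanitize(points, low=-np.ones(d), side=2.0, T=10)
```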
Digression: Histogram Sanitization • Theorem: if n = 2^{o(d)} and points are drawn uniformly from U, then histogram sanitizations are safe with respect to 8-isolation: Pr[I(SDB) succeeds] ≤ 2^{−Ω(d)} • Rough intuition: for q ∈ C, the expected distance to any x ∈ C is relatively large (and even larger for x ∈ C'); distances are tightly concentrated; increasing the radius by a factor of 8 captures almost all of the parent cell, which contains at least 2T points
Combining the Two Sanitizations • Partition RDB into two sets A and B • Cross-training: • compute the histogram sanitization for B • for each v ∈ A, take the side length of the histogram cell C containing v as the perturbation scale for v • output GSan(v, ·) with that scale • (a hedged sketch follows below)
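A hedged sketch of cross-training that reuses the histogram_sanitize sketch above. Reading GSan as spherical Gaussian noise scaled to the cell side is my interpretation, and the fallback scale of 1.0 for a point outside every output cell is an arbitrary choice.

```python
import numpy as np

def cross_train_sanitize(A, B, T=10, rng=None):
    """Histogram-sanitize B, then perturb each point of A with Gaussian
    noise whose scale is the side of the histogram cell containing it."""
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[1]
    cells = histogram_sanitize(B, low=-np.ones(d), side=2.0, T=T)
    sanitized_A = []
    for v in A:
        # side of the output cell containing v (cells are disjoint)
        side = min((s for lo, s, _ in cells
                    if np.all(v >= lo) and np.all(v < lo + s)),
                   default=1.0)
        sanitized_A.append(v + side * rng.standard_normal(d))
    return cells, np.array(sanitized_A)
```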
Cross-Training Privacy • Privacy for B: only histogram information about B is used • Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v' of v
Learning mixtures of Gaussians – spectral techniques • Observation: an optimal low-rank approximation to a matrix of complex data yields the underlying structure, e.g., the means [M01, VW02] • We show that McSherry's algorithm works for clustering sanitized Gaussian data: the original distribution (mixture of Gaussians) is recovered • (an illustrative sketch follows below)
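An illustrative spectral sketch in the same spirit: project the sanitized data onto its top-k singular directions, cluster there, and read off the cluster means. This is not McSherry's algorithm itself; the Lloyd-style clustering loop, the fixed iteration count, and the seeding are assumptions made only to keep the example self-contained.

```python
import numpy as np

def estimate_means_spectral(sdb, k, iters=50, seed=0):
    """Estimate the k component means of a mixture from sanitized data
    via a rank-k projection followed by simple k-means-style clustering."""
    rng = np.random.default_rng(seed)
    X = sdb - sdb.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:k].T                       # coordinates in the top-k subspace

    centers = proj[rng.choice(len(proj), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(proj[:, None, :] - centers[None, :, :], axis=2), axis=1)
        centers = np.array([proj[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])

    # report cluster means in the original coordinates
    # (falls back to the global mean for a degenerate empty cluster)
    return np.array([sdb[labels == j].mean(axis=0)
                     if np.any(labels == j) else sdb.mean(axis=0)
                     for j in range(k)])
```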
Spectral techniques for perturbed data • A sanitized point is the sum of two Gaussian variables – sample + noise • w.h.p. the T-radius of a point is less than the “radius” of its Gaussian • Variance of the noise is small • Previous techniques work
What about the real world? • Lessons from the abstract model: • high dimensionality is our friend • Gaussian perturbations seem to be the right thing to do • we need to scale different attributes appropriately, so that the data is well rounded • Moving towards real data: • outliers: our notion of c-isolation deals with them, though the existence of an outlier may be disclosed • discrete attributes: convert them into real-valued attributes, e.g., convert a binary variable into a probability