This talk, presenting joint research, explores the trade-off between privacy and utility in public databases, proposing a novel approach to sanitizing data while preserving its macroscopic properties. It examines statistical techniques based on perturbation of attribute values to achieve privacy without compromising data utility.
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
Work done at Microsoft Research, SVC
Database Privacy
• Census data – a prototypical example
  • Individuals provide information
  • Census bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we achieve?
• Inherent privacy vs. utility trade-off
  • One extreme – complete privacy, no information
  • Other extreme – complete information, no privacy
• Goals:
  • Find a middle path: preserve macroscopic properties while "disguising" individual identifying information
  • Change the nature of the discourse
  • Establish a framework for meaningful comparison of techniques
Current solutions
• Statistical approaches
  • Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
  • Additionally, erase values that reveal too much
• Query-based approaches
  • Perturb output or disallow queries that breach privacy
• Unsatisfying
  • Overly constrained definitions; ad-hoc techniques
  • Ad-hoc treatment of external sources of information
  • Erasure can disclose information; refusal to answer may be revelatory
Our Approach
• Crypto-flavored definitions
  • Mathematical characterization of the adversary's goal
  • Precise definition of when a sanitization procedure fails
  • Intuition: seeing the sanitized DB gives the adversary an "advantage"
• Statistical techniques
  • Perturbation of attribute values
  • Differs from previous work: perturbation amounts depend on local densities of points
• Highly abstracted version of the problem
  • If we can't understand this, we can't understand real life
  • If we get negative results here, the world is in trouble
An outline of this talk
• A mathematical formalism
  • What do we mean by privacy?
  • An abstract model of datasets
  • Isolation
  • Good sanitizations
• A candidate sanitization
  • Privacy for the 2-point case
  • General argument for privacy of n-point datasets
• A brief overview of results
• Open issues; moving on to real-world applications
What do WE mean by privacy?
• [Ruth Gavison] Protection from being brought to the attention of others
  • Inherently valuable
  • Attention invites further privacy loss
• Privacy is assured to the extent that one blends in with the crowd
• Appealing definition; can be converted into a precise mathematical statement…
A geometric view
• Abstraction (a minimal model sketch follows):
  • Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
  • Points are unlabeled; you are your collection of attributes
  • Distance is everything: points are similar if and only if they are close (L2 norm)
• Real Database (RDB) – private; n unlabeled points in d-dimensional space
• Sanitized Database (SDB) – public; n' new points, possibly in a different space
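A minimal sketch of this abstraction, assuming points are rows of a NumPy array; the sizes n and d are arbitrary illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                     # the talk wants d large relative to log n
RDB = rng.normal(size=(n, d))      # real database: one row per individual

# "Distance is everything": pairwise L2 distances between individuals.
diffs = RDB[:, None, :] - RDB[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)   # n x n distance matrix
```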
The adversary or Isolator
• Using SDB and auxiliary information (AUX), outputs a point q
• q "isolates" a real point x if it is much closer to x than to x's neighbors
• Even if q looks similar to x, it may fail to isolate x if it is just as similar to x's neighbors
• Tightly clustered points have a smaller radius of isolation
[Figure: points of the RDB with examples of isolating and non-isolating query points]
The adversary or Isolator (contd.)
• I(SDB, AUX) = q
• x is isolated if B(q, cδ) contains fewer than T points, where δ = |q − x| (a small test in code follows)
• T-radius of x – distance to its T-th nearest neighbor
• x is "safe" if δ_x > (T-radius of x)/(c − 1)
  • Then B(q, cδ_x) contains x's entire T-neighborhood
• c – privacy parameter, e.g., c = 4; large T and small c is good
[Figure: q at distance δ from x, the ball B(q, cδ), and the margin (c − 1)δ]
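A minimal sketch of the isolation test just defined; the function names and the default T = 20 are illustrative assumptions (c = 4 is the example value from the slide):

```python
import numpy as np

def t_radius(x, rdb, T):
    """Distance from x to its T-th nearest neighbor in the real database."""
    d = np.sort(np.linalg.norm(rdb - x, axis=1))
    return d[T]                     # d[0] == 0 when x itself is a row of rdb

def isolates(q, x, rdb, c=4, T=20):
    """c-isolation test: q isolates x if B(q, c*delta) captures fewer
    than T real points, where delta = |q - x|."""
    delta = np.linalg.norm(q - x)
    return np.sum(np.linalg.norm(rdb - q, axis=1) < c * delta) < T

def is_safe(q, x, rdb, c=4, T=20):
    """x is "safe" from q if delta_x > (T-radius of x)/(c - 1): then
    B(q, c*delta_x) contains x's entire T-neighborhood."""
    return np.linalg.norm(q - x) > t_radius(x, rdb, T) / (c - 1)
```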
A good sanitization
• No way of obtaining privacy if AUX already reveals too much!
• A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
• The definition of "considerably" can be forgiving, say, n^(−2)
• A rigorous definition:
  ∀ I, ∀ D, ∀ aux z, ∀ x, ∃ I' such that
  | Pr[I(SDB, z) succeeds on x] − Pr[I'(z) succeeds on x] | is small
• Provides a framework for describing the power of a sanitization method, and hence for comparisons
The Sanitizer
• The privacy of x is linked to its T-radius
• Randomly perturb x in proportion to its T-radius:
  x' = San(x) ∈_R B(x, T-rad(x))
[Figure: x perturbed within the ball B(x, T-radius(x)); here T = 1]
The Sanitizer (contd.)
• Intuition:
  • We are blending x in with its crowd: if the number of dimensions d is large, there are "many" pre-images for x', and the adversary cannot conclusively pick any one (see the sketch below)
  • We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
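A minimal sketch of this sanitizer, computing each T-radius directly from the real database; the function name is hypothetical, and the uniform-in-ball sampling trick (uniform direction scaled by rad · u^(1/d)) is a standard technique, not code from the talk:

```python
import numpy as np

def sanitize(rdb, T=1, rng=None):
    """Replace each x by x' drawn uniformly at random from B(x, T-radius(x))."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = rdb.shape
    sdb = np.empty_like(rdb)
    for i, x in enumerate(rdb):
        # T-radius: distance to the T-th nearest neighbor (index 0 is x itself).
        rad = np.sort(np.linalg.norm(rdb - x, axis=1))[T]
        # Uniform sample from a d-ball: uniform direction, radius rad * u^(1/d).
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        sdb[i] = x + rad * rng.uniform() ** (1.0 / d) * u
    return sdb
```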
Flavor of Results (Preliminary)
• Assumptions:
  • Data arises from a mixture of Gaussians
  • The dimension d and the number of points n are large; d = ω(log n)
• Results:
  • Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^(−Ω(d)) (several special cases; the general result is not yet proved)
  • Utility: an honest user who does not know the Gaussians can compute the means with high probability
The "simplest" interesting case
• RDB = {x, y}; x, y ∈_R B(o, r), where o is the origin
• T = 1; c = 4; SDB = {x', y'}
• The adversary knows x', y', r, and δ = |x − y|
• We show: there are m = 2^(Ω(d)) "decoy" pairs (x_i, y_i)
  • (x_i, y_i) are legal pre-images of (x', y'): that is, |x_i − y_i| = δ and Pr[x_i, y_i | x', y'] = Pr[x, y | x', y']
  • The adversary cannot know which of the (x_i, y_i) represents reality
  • The adversary can only isolate one point in {x_1, y_1, …, x_m, y_m} at a time
The "simplest" interesting case (contd.)
• Consider a hyperplane H through x', y', and o
• x_H, y_H – mirror reflections of x, y through H
  • Note: reflections preserve distances!
• The world of (x_H, y_H) looks identical to the world of (x, y):
  Pr[x_H, y_H | x', y'] = Pr[x, y | x', y']
[Figure: x, y and their reflections x_H, y_H through the hyperplane H containing x' and y']
The "simplest" interesting case (contd.)
• How many different hyperplanes H are there such that the corresponding reflections x_H are pairwise distant? (worked out below)
  • Two reflections at angle 2θ apart on a sphere of radius r lie at distance 2r sin θ
  • It suffices to pick r = (2/3)δ and θ = 30°
  • Fact: there are 2^(Ω(d)) vectors in d dimensions at angle 60° from each other
• Probability that the adversary wins ≤ 2^(−Ω(d))
[Figure: reflections x_1, x_2 at angle 2θ, chord length 2r sin θ, radius r = (2/3)δ]
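A worked version of the counting step, reconstructed from the slide's figure labels (2θ and 2r sin θ); the value r = (2/3)δ is copied from the slide, and the geometric reading of the figure is an assumption:

```latex
% Reflections x_1, x_2 of x through two hyperplanes whose normals differ
% by an angle \theta lie at angle 2\theta on a sphere of radius r, so
\[
  |x_1 - x_2| = 2r\sin\theta .
\]
% With \theta = 30^\circ this chord equals r, so the decoys are pairwise
% far apart. Since there exist 2^{\Omega(d)} directions in \mathbb{R}^d
% that are pairwise at angle 60^\circ, we get m = 2^{\Omega(d)} decoy
% pairs; a single query point q can isolate at most one of them, hence
\[
  \Pr[\text{adversary isolates a real point}] \le 2^{-\Omega(d)} .
\]
```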
The general case… n points
• The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x'_1; T = 1; flat prior
• Reflections do not work – too many constraints
• A more direct argument – examine the posterior distribution on x_1
• Let Z = { p ∈ R^d | p is a legal pre-image for x'_1 } and
  Q = { p | if x_1 = p then x_1 is isolated by q }
• We show that Pr[Q ∩ Z | x'_1] ≤ 2^(−Ω(d)) · Pr[Z | x'_1]
  • Pr[x_1 ∈ Q ∩ Z | x'_1] = (probability mass contributed by Q ∩ Z)/(mass contributed by Z) ≤ 2^(1−d)/(1/4), as written out below
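The slide's bound written out as a chain; the constants 2^(1−d) and 1/4 are taken directly from the slide:

```latex
\[
  \Pr\!\left[x_1 \in Q \cap Z \mid x_1'\right]
  = \frac{\text{mass of } Q \cap Z}{\text{mass of } Z}
  \le \frac{2^{\,1-d}}{1/4}
  = 2^{\,3-d}
  = 2^{-\Omega(d)} ,
\]
% so under the flat prior, the posterior probability that the query q
% isolates x_1 is exponentially small in the dimension d.
```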
The general case… n points (contd.)
• Z = { p | p is a legal pre-image for x'_1 }; Q = { p | x_1 = p is isolated by q }
• Key observation:
  • As |q − x'| increases, Q becomes larger
  • But a larger distance from x' implies a smaller probability mass, because x is randomized over a larger area
  • The probability depends only on the solid angle subtended at x'
[Figure: the cone Q around q, the pre-image region Z around x', and nearby points x_2, …, x_6]
The general case… n sanitized points
• Privacy does not follow immediately from the previous analysis with real points!
• Problem: sanitization is non-oblivious – other sanitized points reveal information about x if x is their nearest neighbor
• Solution: decouple the two kinds of information – that from x' and that from the x'_i
[Figure: the dataset partitioned into two halves, L and R]
The general case… n sanitized points (contd.)
• Claim 1 (privacy for L): given all sanitizations, all points in R, and all but one point in L, the adversary cannot isolate the last point
  • Follows from the proof for n − 1 real points
• Claim 2 (privacy for R): given all sanitizations, all points in L, and all but one point in R, the adversary cannot isolate the last point
  • Work in progress
  • Idea: show that the adversary cannot distinguish whether or not R contains some point x (an information-theoretic argument)
Results on privacy… An overview
Results on utility… An overview
Learning mixtures of Gaussians (spectral methods)
• Observation: the top eigenvectors of a matrix span a low-dimensional space that yields a good approximation of complex data sets, in particular Gaussian data
• Intuition:
  • Sampled points are "close" to the means of the corresponding Gaussians in any subspace
  • The span of the top k singular vectors approximates the span of the means
  • Distances between means of Gaussians are preserved
  • Other distances shrink by a factor of √(k/n)
• Our goal: show that the same algorithm works for clustering sanitized data (see the sketch below)
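A short sketch of the spectral step described above; the function name is hypothetical, and the SVD-then-project pipeline follows the standard spectral approach rather than any code from the talk:

```python
import numpy as np

def spectral_project(points, k):
    """Project points onto the span of their top-k right singular vectors.
    Per the intuition above, this approximately preserves distances between
    Gaussian means while shrinking other distances by about sqrt(k/n)."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return points @ vt[:k].T        # n x k low-dimensional representation

# Usage idea: project the (sanitized) points with k = number of mixture
# components, then run any distance-based clustering (e.g. k-means) in
# the k-dimensional space.
```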
Spectral techniques for perturbed data
• A sanitized point is the sum of two Gaussian variables – sample + noise
• W.h.p. the 1-radius of a point is less than the "radius" of its Gaussian, so the variance of the noise is small
• Sanitized points are still close to their means (uses independence of direction)
• The span of the top k singular vectors still approximates the span of the means of the Gaussians
• Distances between means are preserved; others shrink
Future directions
• Extend the privacy argument to other "nice" distributions
  • Can revealing the distribution hurt privacy?
• Characterize the kind of auxiliary information that is acceptable
  • Depends on the distribution of the data points
• The low-dimensional case
  • Is it inherently impossible? Dinur & Nissim show impossibility for the 1-dimensional case
• Extend the utility argument to other interesting macroscopic properties
What about the real world?
• Lessons from the abstract model:
  • High dimensionality is our friend
  • Gaussian/spherically symmetric perturbations seem to be the right thing to do
  • Different attributes need to be scaled appropriately, so that the data is well rounded (a preprocessing sketch follows)
• Moving toward real data:
  • Outliers – our notion of c-isolation deals with them, but the existence of an outlier may be disclosed
  • Discrete attributes – convert them into real-valued attributes, e.g., convert a binary variable into a probability
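A rough sketch of the attribute-scaling point above; the function is hypothetical, and the binary-to-probability conversion is deliberately omitted since it requires a model of that attribute given the others:

```python
import numpy as np

def rescale(X):
    """Scale each real-valued attribute to zero mean and unit variance so
    that no single attribute dominates the L2 distance and the data is
    "well rounded", as the abstract model assumes."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0             # guard against constant columns
    return (X - X.mean(axis=0)) / std
```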
Questions?