Explore the trade-off between privacy and utility in public databases, with a focus on disguising individual identifying information and preserving macroscopic properties.
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
Database Privacy
• Census data – a prototypical example
  – Individuals provide information
  – Census bureau publishes sanitized records
  – Privacy is legally mandated; what utility can we achieve?
• Our goal:
  – Pin down what we mean by preservation of privacy
  – Characterize the trade-off between privacy and utility: disguise individual identifying information while preserving macroscopic properties
  – Develop a “good” sanitizing procedure with theoretical guarantees
An outline of this talk
• A mathematical formalism
  – What do we mean by privacy?
  – Prior work
  – An abstract model of datasets
  – Isolation; good sanitizations
• A candidate sanitization
  – A brief overview of results
  – General argument for privacy of n-point datasets
• Open issues and concluding remarks
Everybody’s First Suggestion
• Learn the distribution, then output:
  – A description of the distribution, or
  – Samples from the learned distribution
• Want to reflect facts on the ground
  – Statistically insignificant clusters can be important for allocating resources
Database Privacy
• A long-standing research problem, with a wide variety of definitions and techniques
• Statistical approaches
  – Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
  – Additionally, erase values that reveal too much
• Query-based approaches
  – Perturb output, or disallow queries that breach privacy
Privacy… a philosophical viewpoint
• [Ruth Gavison] Privacy is protection from being brought to the attention of others
• Attention invites further loss of privacy
• Privacy is assured to the extent that one blends in with the crowd
• An appealing definition, and one that can be converted into a precise mathematical statement!
What is a breach of privacy?
• The statistical approach
  – Inferring that the database contains too few (≤ 3) people with a given set of characteristics
• The cryptographic approach
  – Guessing a value with high probability
• Both definitions are unsatisfying
  – “Approximating” a real-valued attribute may be sufficient to breach privacy
  – A case of “one size fits all”
• A combination of the two (see the sketch below)
  – Guessing enough attributes that, taken together, they “match” few records
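To make the combined notion concrete, here is a minimal sketch in Python (ours, not from the talk); the records list and guess dictionary are toy assumptions:

```python
def matches(guess: dict, records: list[dict]) -> int:
    """Count the records that agree with every guessed attribute value."""
    return sum(all(rec.get(k) == v for k, v in guess.items()) for rec in records)

# A guess that "matches" very few records (say, <= 3) would count as a
# breach under the combined definition.
records = [{"zip": "53706", "age": 33}, {"zip": "53706", "age": 41}]
print(matches({"zip": "53706", "age": 33}, records))  # -> 1
```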
A geometric view
• Abstraction:
  – Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
  – Points are unlabeled; you are your collection of attributes
  – Distance is everything
• Real Database (RDB) – n private, unlabeled points in d-dimensional space
• Sanitized Database (SDB) – n′ new, public points, possibly in a different space
The adversary, or Isolator
• Using the SDB and auxiliary information (AUX), the isolator outputs a point q
• q “isolates” a real point x if it is much closer to x than to x’s neighbors: letting δ = |q − x|, q isolates x if B(q, cδ) contains fewer than T points
• T-radius of x – the distance from x to its T-th nearest neighbor
• x is “safe” if δ > (T-radius of x)/(c − 1), since then B(q, cδ) contains x’s entire T-neighborhood
• c is the privacy parameter, e.g. c = 4; large T and small c are good
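A minimal sketch of this test in Python (our transcription; the RDB is assumed to be an n×d NumPy array, and we read “fewer than T points” as counting database points other than x itself):

```python
import numpy as np

def t_radius(rdb: np.ndarray, i: int, T: int) -> float:
    """Distance from rdb[i] to its T-th nearest neighbor in the RDB."""
    dists = np.linalg.norm(rdb - rdb[i], axis=1)
    dists[i] = np.inf                    # exclude the point itself
    return float(np.sort(dists)[T - 1])  # T-th smallest remaining distance

def isolates(q: np.ndarray, rdb: np.ndarray, x_idx: int, c: float, T: int) -> bool:
    """q (c,T)-isolates x = rdb[x_idx] if B(q, c*delta), with delta = |q - x|,
    contains fewer than T database points besides x itself."""
    delta = np.linalg.norm(q - rdb[x_idx])
    dists = np.linalg.norm(rdb - q, axis=1)
    others = np.delete(dists, x_idx)     # distances of the other n-1 points to q
    return int((others < c * delta).sum()) < T
```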
A good sanitization
• A sanitizing algorithm compromises privacy if the adversary is able to considerably increase its probability of isolating a point by looking at the algorithm’s output
• A rigorous definition: ∀ I ∃ I′ ∀ D, aux z, x: | Pr[I(SDB, z) succeeds on x] − Pr[I′(z) succeeds on x] | is small
• The definition of “small” can be forgiving, say, n^(−2)
• Quantification over x: even if aux reveals information about some x, the privacy of every other y should still be preserved
• Provides a framework for describing the power of a sanitization method, and hence for comparisons
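Written out as a display (our reading of the slide’s garbled quantifier string; the exact quantifier order in the paper may differ):

```latex
\forall\, \mathcal{I}\;\; \exists\, \mathcal{I}'\;\;
\forall\, \mathcal{D},\ z,\ x:\quad
\bigl|\Pr[\mathcal{I}(\mathrm{SDB},z)\ \text{succeeds on}\ x]
 - \Pr[\mathcal{I}'(z)\ \text{succeeds on}\ x]\bigr| \le \varepsilon,
\qquad \text{e.g. } \varepsilon = n^{-2}.
```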
The Sanitizer
• The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius
• x′ = San(x) ∈_R B(x, T-radius(x)), i.e., x′ is drawn uniformly at random from the ball of radius T-radius(x) around x
• Intuition:
  – We are blending x in with its crowd: if the number of dimensions d is large, there are “many” pre-images for x′, and the adversary cannot conclusively pick any one
  – We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
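A sketch of this perturbation (ours; it reuses t_radius from the isolation sketch above, and uses the standard direction-plus-radius trick for sampling uniformly from a ball):

```python
import numpy as np

def sample_in_ball(center: np.ndarray, radius: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Uniform sample from the ball B(center, radius) in R^d."""
    d = center.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                   # uniformly random direction
    r = radius * rng.random() ** (1.0 / d)   # CDF r^d gives uniform volume
    return center + r * u

def sanitize(rdb: np.ndarray, T: int, rng=None) -> np.ndarray:
    """San(x) = a uniform point in B(x, T-radius(x)), independently per point."""
    rng = rng or np.random.default_rng()
    return np.array([sample_in_ball(rdb[i], t_radius(rdb, i, T), rng)
                     for i in range(len(rdb))])
```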
Flavor of Results (Preliminary)
• Assumptions
  – Data arises from a mixture of Gaussians
  – The dimension d and the number of points n are large; d = ω(log n)
• Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^(−Ω(d))
  – Several special cases proved; the general result is not yet proved
  – Very different proof techniques from anything in the statistics or crypto literatures!
• Utility: an honest user who does not know the Gaussians can compute the means with high probability
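As a sanity check on the utility claim, a user could run plain k-means on the sanitized points: the added noise has mean zero, so the cluster centers should track the true Gaussian means when d is large and the components are well separated. A minimal sketch (ours; unlike the talk’s setting, the number of components k is assumed known here):

```python
import numpy as np

def estimate_means(sdb: np.ndarray, k: int, iters: int = 50,
                   seed: int = 0) -> np.ndarray:
    """Lloyd's k-means on the sanitized database; returns k estimated means."""
    rng = np.random.default_rng(seed)
    centers = sdb[rng.choice(len(sdb), size=k, replace=False)]
    for _ in range(iters):
        # assign each sanitized point to its nearest current center
        d2 = ((sdb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centers; keep the old center if a cluster goes empty
        centers = np.array([sdb[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return centers
```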
Results on privacy: an overview
Results on utility: an overview
A representative case: one sanitized point
• RDB = {x_1, …, x_n}
• The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x′_1; T = 1; “flat” prior
• Recall: x′_1 ∈_R B(x_1, |x_1 − y|), where y is the nearest neighbor of x_1
• Main idea: consider the posterior distribution on x_1, and show that the adversary cannot isolate a large probability mass under this distribution (a toy experiment with this setup follows below)
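To see the setup concretely, one can resample x_1’s perturbation many times and test the naive adversary that simply guesses q = x′_1. A toy experiment reusing the earlier sketches; the standard-normal cloud stands in for the “flat” prior, and all parameters are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, c, T = 100, 200, 4.0, 1
rdb = rng.standard_normal((n, d))        # toy stand-in for the "flat" prior

trials, hits = 1000, 0
for _ in range(trials):
    x1_prime = sample_in_ball(rdb[0], t_radius(rdb, 0, T), rng)
    hits += isolates(x1_prime, rdb, 0, c, T)   # adversary guesses q = x1'
print(f"empirical isolation rate for q = x1': {hits / trials:.3f}")
```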
A representative case: one sanitized point (continued)
[Figure: the sanitized point x′_1, real points x_2, …, x_6, the adversary’s guess q, the pre-image region Z, and the sub-region Q∩Z where |p − q| ≤ (1/3)|p − x′_1|]
• Let Z = { p ∈ R^d : p is a legal pre-image for x′_1 } and Q = { p : if x_1 = p, then x_1 is isolated by q }
• We show that Pr[Q∩Z | x′_1] ≤ 2^(−Ω(d)) · Pr[Z | x′_1]
• Pr[x_1 ∈ Q∩Z | x′_1] = (probability mass contributed by Q∩Z)/(mass contributed by Z) ≤ 2^(1−d)/(1/4)
Contribution from Z
[Figure: x′_1, the region Z, and a small patch S at distance r from x′_1]
• Pr[x_1 = p | x′_1] ∝ Pr[x′_1 | x_1 = p] ∝ 1/r^d, where r = |x′_1 − p|
• As r increases, x′_1 gets randomized over a larger region, with volume proportional to r^d; hence the inverse dependence
• Pr[x′_1 | x_1 ∈ S] ∝ ∫_{s∈S} 1/r^d ∝ the solid angle subtended by S at x′_1
• Z subtends a solid angle equal to at least half a sphere at x′_1
Contribution from Q∩Z
[Figure: Q∩Z is an ellipsoid near q with longest radius r, lying roughly at distance r from x′_1]
• The ellipsoid is roughly as far from x′_1 as its longest radius
• So the contribution from the ellipsoid is at most about 2^(−d) × the total solid angle
• Therefore, Pr[x_1 ∈ Q∩Z] / Pr[x_1 ∈ Z] ≲ 2^(−d)
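Chaining the two solid-angle estimates gives the bound stated two slides back (our paraphrase of the three preceding slides, with the slides’ constants):

```latex
\Pr[x_1 \in Q \cap Z \mid x_1']
  \;=\; \frac{\text{mass contributed by } Q \cap Z}
             {\text{mass contributed by } Z}
  \;\le\; \frac{2^{1-d}}{1/4}
  \;=\; 2^{3-d}
  \;=\; 2^{-\Omega(d)} .
```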
The general case: n sanitized points
• Initial intuition is wrong: privacy of x_1 given x′_1 and all the other points in the clear does not imply privacy of x_1 given x′_1 and the sanitizations of the others!
• Sanitization is non-oblivious – other sanitized points reveal information about x if x is their nearest neighbor
• A possible approach: decouple the two kinds of information – that from x′ and that from the other x′_i
• Where we are now:
  – Consider some example of a safe sanitization (not necessarily using perturbations): density regions? Histograms?
  – Relate perturbations to the safe sanitization
Summary (1): results on privacy
Summary (2): results on utility
Future directions
• Extend the privacy argument to other “nice” distributions
  – For what distributions is there no meaningful privacy-utility trade-off?
  – Can revealing the distribution hurt privacy?
• Characterize the kind of auxiliary information that is acceptable
  – This depends on the distribution on the data points
  – Think of auxiliary information as an a priori distribution
  – Our proofs assume full knowledge of some real points and no knowledge of the others
Future directions
• The low-dimensional case
  – Is it inherently impossible? Dinur & Nissim show impossibility for the 1-dimensional case
• Discrete-valued attributes
  – Real-world data is rarely real-valued, and our proofs require a “spread” in all attributes
  – Possible solution: convert binary values to probabilities (for example); can the adversary gain an advantage from rounding off the values?
• Extend the utility argument to other interesting macroscopic properties, e.g. correlations
Questions?