
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Explore the trade-off between privacy and utility in public databases, with a focus on disguising individual identifying information and preserving macroscopic properties.


Presentation Transcript


  1. From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee

  2. Database Privacy • Census data – a prototypical example • Individuals provide information • Census bureau publishes sanitized records • Privacy is legally mandated; what utility can we achieve? • Our Goal: • What do we mean by preservation of privacy? • Characterize the trade-off between privacy and utility – disguise individual identifying information – preserve macroscopic properties • Develop a “good” sanitizing procedure with theoretical guarantees Shuchi Chawla

  3. An outline of this talk • A mathematical formalism • What do we mean by privacy? • Prior work • An abstract model of datasets • Isolation; Good sanitizations • A candidate sanitization • A brief overview of results • General argument for privacy of n-point datasets • Open issues and concluding remarks

  4. Everybody’s First Suggestion • Learn the distribution, then output: • A description of the distribution, or • Samples from the learned distribution • Want to reflect facts on the ground • Statistically insignificant clusters can be important for allocating resources

  5. Database Privacy • A long-standing research problem – a wide variety of definitions and techniques • Statistical approaches • Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means • Additionally, erase values that reveal too much • Query-based approaches • Perturb output or disallow queries that breach privacy

  6. Privacy… a philosophical viewpoint • [Ruth Gavison] Privacy is protection from being brought to the attention of others • Attention invites further loss of privacy • Privacy is assured to the extent that one blends in with the crowd • Appealing definition; can be converted into a precise mathematical statement!

  7. What is a breach of privacy? • The statistical approach • Inferring that the database contains too few (≤ 3) people with a set of characteristics • The cryptographic approach • Guessing a value with high probability • Unsatisfying definitions • “Approximating” a real-valued attribute may be sufficient to breach privacy • A case of “one size fits all” • A combination of the two • Guessing enough attributes such that these together “match” few records

  8. A geometric view • Abstraction: • Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution • Points are unlabeled; you are your collection of attributes • Distance is everything • Real Database (RDB) – private; n unlabeled points in d-dimensional space • Sanitized Database (SDB) – public; n’ new points, possibly in a different space

  9. The adversary or Isolator • Using SDB and auxiliary information (AUX), outputs a point q • q “isolates” a real point x if it is much closer to x than to x’s neighbors, i.e., if B(q, cδ) contains fewer than T points, where δ = |q − x| • T-radius of x – distance to its T-th nearest neighbor • x is “safe” if δ_x > (T-radius of x)/(c−1), since then B(q, cδ_x) contains x’s entire T-neighborhood • c – privacy parameter, e.g. 4 • Large T and small c is good [Figure: point q at distance δ from x; ball of radius cδ around q; margin (c−1)δ]
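
The isolation test on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`t_radius`, `isolates`, `is_safe_from`) and the toy dataset are hypothetical, distances are Euclidean, and the point x itself is not counted among the points inside the ball (the slide does not say which convention it uses).

```python
import math

def t_radius(i, rdb, T):
    """Distance from rdb[i] to its T-th nearest neighbor among the other points."""
    dists = sorted(math.dist(rdb[i], p) for j, p in enumerate(rdb) if j != i)
    return dists[T - 1]

def isolates(q, i, rdb, c, T):
    """True if q isolates rdb[i]: fewer than T other points lie in B(q, c*delta)."""
    delta = math.dist(q, rdb[i])
    # Assumption: x = rdb[i] is excluded from the count of points in the ball.
    inside = sum(1 for j, p in enumerate(rdb)
                 if j != i and math.dist(q, p) <= c * delta)
    return inside < T

def is_safe_from(q, i, rdb, c, T):
    """Sufficient condition from the slide: delta > (T-radius of x) / (c - 1)."""
    return math.dist(q, rdb[i]) > t_radius(i, rdb, T) / (c - 1)

# Toy real database: the target x is rdb[0]; its nearest neighbor is at distance 1.
rdb = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(isolates((0.1, 0.0), 0, rdb, c=4, T=1))  # True: ball of radius 0.4 holds no other point
print(isolates((0.5, 0.0), 0, rdb, c=4, T=1))  # False: ball of radius 2.0 reaches (1, 0)
```

The second guess is at distance 0.5 > 1/3 = (T-radius)/(c−1) from x, so the sufficient condition already rules out isolation there, matching the direct count.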

  10. A good sanitization • A sanitizing algorithm compromises privacy if the adversary is able to considerably increase his probability of isolating a point by looking at its output • A rigorous definition: ∀ I ∀ D ∀ aux z ∀ x ∃ I’ : | Pr[I(SDB, z) succeeds on x] − Pr[I’(z) succeeds on x] | is small • The definition of “small” can be forgiving, say, n^(−2) • Quantification over x: if aux reveals info about some x, the privacy of some other y should still be preserved • Provides a framework for describing the power of a sanitization method, and hence for comparisons

  11. The Sanitizer • The privacy of x is linked to its T-radius ⇒ randomly perturb it in proportion to its T-radius • x’ = San(x) ∈_R B(x, T-rad(x)) • Intuition: – We are blending x in with its crowd: if the number of dimensions (d) is large, there are “many” pre-images for x’, and the adversary cannot conclusively pick any one – We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
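
A minimal sketch of such a perturbation, assuming that the random choice from the ball is uniform over B(x, T-rad(x)) (the slide does not pin down the distribution). The helper names and the synthetic Gaussian data are hypothetical and only serve the illustration.

```python
import math
import random

def t_radius(i, rdb, T):
    """Distance from rdb[i] to its T-th nearest neighbor among the other points."""
    dists = sorted(math.dist(rdb[i], p) for j, p in enumerate(rdb) if j != i)
    return dists[T - 1]

def sample_in_ball(center, radius, rng):
    """Draw a point uniformly from the ball B(center, radius)."""
    d = len(center)
    # A normalized Gaussian vector gives a uniformly random direction.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    # radius * U**(1/d) makes the point uniform in volume, not just in radius.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * v for c, v in zip(center, g)]

def sanitize(rdb, T, rng):
    """Perturb every point within its own T-radius (assumed uniform in the ball)."""
    return [sample_in_ball(x, t_radius(i, rdb, T), rng) for i, x in enumerate(rdb)]

def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

# Hypothetical data: 200 points from a standard Gaussian in 8 dimensions.
rng = random.Random(0)
rdb = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
sdb = sanitize(rdb, T=3, rng=rng)
shift = math.dist(mean(rdb), mean(sdb))
print(f"mean shift after sanitization: {shift:.4f}")
```

Each sanitized point stays within the T-radius of its original, so the shift of the overall mean can never exceed the largest T-radius; because the noise has mean zero, in practice it tends to be much smaller.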

  12. Flavor of Results (Preliminary) • Assumptions – Data arises from a mixture of Gaussians – Dimension d and number of points n are large; d = ω(log n) • Results – Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^(−Ω(d)) – Several special cases; general result not yet proved – Very different proof techniques from anything in the statistics or crypto literatures! – Utility: an honest user who does not know the Gaussians can compute the means with high probability

  13. Results on privacy.. An overview

  14. Results on utility… An overview

  15. A representative case - one sanitized point • RDB = {x1,…,xn} • The adversary is given n−1 real points x2,…,xn and one sanitized point x’1; T = 1; “flat” prior • Recall: x’1 ∈_R B(x1, |x1 − y|), where y is the nearest neighbor of x1 • Main idea: consider the posterior distribution on x1 – show that the adversary cannot isolate a large probability mass under this distribution

  16. A representative case - one sanitized point • Let Z = { p ∈ R^d | p is a legal pre-image for x’1 } and Q = { p | if x1 = p then x1 is isolated by q } • We show that Pr[ Q∩Z | x’1 ] ≤ 2^(−Ω(d)) · Pr[ Z | x’1 ] • Pr[x1 ∈ Q∩Z | x’1] = (prob. mass contribution from Q∩Z) / (contribution from Z) = 2^(1−d) / (1/4) • |p − q| ≤ (1/3)·|p − x’1| [Figure: real points x2,…,x6, sanitized point x’, adversary’s point q, and the regions Z, Q and Q∩Z]
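
To make the final ratio on this slide concrete (2^(1−d) divided by 1/4), the arithmetic can be worked out as follows. The value d = 20 is only an illustrative choice, not a figure from the talk.

```latex
\[
\frac{2^{1-d}}{1/4} \;=\; 4 \cdot 2^{1-d} \;=\; 2^{3-d} \;=\; 2^{-\Omega(d)},
\qquad \text{e.g. } d = 20:\quad 2^{3-20} = 2^{-17} \approx 7.6 \times 10^{-6}.
\]
```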

  17. Contribution from Z • Pr[x1 = p | x’1] ∝ Pr[x’1 | x1 = p] ∝ 1/r^d (r = |x’1 − p|) • Increase in r ⇒ x’1 gets randomized over a larger area, proportional to r^d; hence the inverse dependence • Pr[x’1 | x1 ∈ S] ∝ ∫_S 1/r^d ∝ solid angle subtended at x’1 • Z subtends a solid angle equal to at least half a sphere at x’1 [Figure: region S at distance r from x’, among real points x2,…,x6 and region Z]
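
The inverse r^d dependence used on this slide follows from the standard volume formula for a d-dimensional ball. The sketch below assumes the perturbation is uniform over a ball of radius ρ (the symbol ρ is introduced here for illustration only); the slide takes this volume to scale as r^d.

```latex
\[
\operatorname{Vol}\bigl(B(p,\rho)\bigr)
  = \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)}\,\rho^{d},
\qquad
\text{so a uniform density on } B(p,\rho) \text{ equals }
\frac{\Gamma\!\left(\tfrac{d}{2}+1\right)}{\pi^{d/2}}\,\rho^{-d} \;\propto\; \rho^{-d}.
\]
```

Doubling the radius therefore divides the density by 2^d, which is why far-away candidate pre-images carry exponentially less weight in high dimensions.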

  18. Contribution from Q ∩ Z • The ellipsoid is roughly as far from x’1 as its longest radius • Contribution from the ellipsoid is at most about 2^(−d) × total solid angle • Therefore, Pr[x1 ∈ Q∩Z] / Pr[x1 ∈ Z] is at most about 2^(−d) [Figure: real points x2,…,x6, sanitized point x’, adversary’s point q, and the regions Z, Q and Q∩Z, with radii r marked]

  19. The general case… n sanitized points • Initial intuition is wrong: • Privacy of x1 given x1’ and all the other points in the clear does not imply privacy of x1 given x1’ and sanitizations of others! • Sanitization is non-oblivious – Other sanitized points reveal information about x, if x is their nearest neighbor • A possible approach: Decouple the two kinds of information – from x’ and x’i

  20. The general case… n sanitized points • Initial intuition is wrong: • Privacy of x1 given x1’ and all the other points in the clear does not imply privacy of x1 given x1’ and sanitizations of others! • Sanitization is non-oblivious – Other sanitized points reveal information about x, if x is their nearest neighbor • Where we are now • Consider some example of safe sanitization (not necessarily using perturbations) • Density regions? Histograms? • Relate perturbations to the safe sanitization

  21. Summary.. (1) Results on privacy

  22. Summary.. (2) Results on utility

  23. Future directions • Extend the privacy argument to other “nice” distributions • For what distributions is there no meaningful privacy-utility trade-off? • Can revealing the distribution hurt privacy? • Characterize the kind of auxiliary information that is acceptable • Depends on the distribution on the data points • Think of auxiliary information as an a priori distribution • Our proofs – full knowledge about some real points; no knowledge about others

  24. Future directions • The low-dimensional case • Is it inherently impossible? • Dinur & Nissim show impossibility for the 1-dimensional case • Discrete-valued attributes • Real-world data is rarely real-valued • Our proofs require a “spread” in all attributes • Possible solution: convert binary values to probabilities (for example) • Can the adversary gain advantage from rounding off the values? • Extend the utility argument to other interesting macroscopic properties – e.g. correlations

  25. Questions?
