From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee
Database Privacy
• Census data – a prototypical example
  • Individuals provide information
  • Census bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we achieve?
• Our goal:
  • Pin down what we mean by preservation of privacy
  • Characterize the trade-off between privacy and utility – disguise individual identifying information while preserving macroscopic properties
  • Develop a “good” sanitizing procedure with theoretical guarantees
An outline of this talk
• A mathematical formalism
  • What do we mean by privacy?
  • Prior work
  • An abstract model of datasets
  • Isolation; good sanitizations
• A candidate sanitization
  • A brief overview of results
  • A general argument for privacy of n-point datasets
• Open issues and concluding remarks
Privacy… a philosophical viewpoint
• [Ruth Gavison] “… includes protection from being brought to the attention of others …”
• Matches intuition; inherently desirable
• Attention invites further loss of privacy
• Privacy is assured to the extent that one blends in with the crowd
• An appealing definition – and one that can be converted into a precise mathematical statement!
Database Privacy
• Statistical approaches
  • Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
  • Additionally, erase values that reveal too much
• Query-based approaches
  • Involve a permanent trusted third party
  • Query monitoring: disallow queries that breach privacy
  • Perturbation: add noise to the query output [Dinur Nissim ’03, Dwork Nissim ’04] (a toy sketch follows below)
• Statistical perturbation + adversarial analysis
  • [Evfimievski et al. ’03] combine statistical techniques with analysis similar to query-based approaches
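The query-perturbation idea can be made concrete with a toy sketch. This is only an illustration of output perturbation, not the mechanism of the cited papers; the function name, the 0/1 database, and the bounded uniform noise are assumptions made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_subset_sum(db_bits, query_indices, noise_bound):
    """Answer a subset-sum query over a 0/1 database with bounded
    additive noise (output perturbation)."""
    true_answer = int(db_bits[query_indices].sum())
    # Bounded uniform noise is a placeholder; [Dinur Nissim '03] analyze
    # how much noise is needed to resist reconstruction attacks.
    return true_answer + rng.uniform(-noise_bound, noise_bound)

db = rng.integers(0, 2, size=1000)            # hypothetical 0/1 database
print(noisy_subset_sum(db, np.arange(100), noise_bound=5.0))
```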
Everybody’s First Suggestion
• Learn the distribution, then output:
  • A description of the distribution, or,
  • Samples from the learned distribution
• But we want to reflect facts on the ground: statistically insignificant facts can be important, e.g., for allocating resources
Our Approach
• Crypto-flavored definitions
  • Mathematical characterization of the adversary’s goal
  • Precise definition of when a sanitization procedure fails
  • Intuition: seeing the sanitized DB gives the adversary an “advantage”
• Statistical techniques
  • Perturbation of attribute values
  • Differs from previous work: perturbation amounts depend on local densities of points
• Highly abstracted version of the problem
  • If we can’t understand this, we can’t understand real life
  • If we get negative results here, the world is in trouble
A geometric view
• Abstraction:
  • Points in a high-dimensional metric space, say ℝ^d, drawn i.i.d. from some distribution
  • Points are unlabeled; you are your collection of attributes
  • Distance is everything
• Real Database (RDB) – private: n unlabeled points in d-dimensional space
• Sanitized Database (SDB) – public: n’ new points, possibly in a different space
The adversary, or Isolator
• Using the SDB and auxiliary information (AUX), outputs a point q
• q “isolates” a real point x if it is much closer to x than to x’s neighbors
• Even if q looks similar to x, it may fail to isolate x if it looks just as similar to x’s neighbors
• Tightly clustered points have a smaller radius of isolation
[Figure: points of the RDB with one isolating and one non-isolating query]
The adversary, or Isolator (formally)
• I(SDB, AUX) = q
• Let δ = d(q, x); then x is isolated if B(q, cδ) contains fewer than T points of the RDB
• T-radius of x – the distance from x to its T-th nearest neighbor
• x is “safe” from q if δ > (T-radius of x)/(c−1), for then B(q, cδ) contains x’s entire T-neighborhood
• c – privacy parameter, e.g., c = 4; large T and small c give a stronger guarantee
[Figure: q at distance δ from x; the ball B(q, cδ) contains the ball of radius (c−1)δ around x]
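A minimal sketch of the isolation test and the T-radius in Python/NumPy; the function names and default parameter values are illustrative, not from the paper.

```python
import numpy as np

def isolates(q, x, rdb, c=4.0, T=20):
    """Does q c-isolate x? With delta = |q - x|, q isolates x iff
    the ball B(q, c*delta) contains fewer than T points of the RDB."""
    delta = np.linalg.norm(q - x)
    in_ball = np.linalg.norm(rdb - q, axis=1) <= c * delta
    return int(in_ball.sum()) < T

def t_radius(x, rdb, T=20):
    """Distance from x to its T-th nearest neighbor in the RDB.
    Assumes x is itself a row of rdb, so dists[0] == 0 is x."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    return dists[T]
```

The safety condition on the slide then reads: q cannot isolate x whenever np.linalg.norm(q - x) > t_radius(x, rdb, T) / (c - 1).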
A good sanitization
• A sanitizing algorithm compromises privacy if the adversary can considerably increase its probability of isolating a point by looking at the algorithm’s output
• A rigorous (and too ideal) definition:
  ∀ distributions D, ∀ isolators I, ∃ an isolator I’ such that, with overwhelming probability over RDB ∈_R D^n, ∀ aux z and ∀ x ∈ RDB:
  | Pr[I(SDB, z) isolates x] − Pr[I’(z) isolates x] | ≤ ε/n
• The definition of ε can be forgiving, say 2^−Ω(d), or 1 in a 1000
• Quantification over x: if aux reveals info about some x, the privacy of every other y should still be preserved
• Provides a framework for describing the power of a sanitization method, and hence for comparisons
The Sanitizer
• The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius
• x’ = San(x) ∈_R S(x, T-rad(x))
• Intuition:
  • We are blending x in with its crowd: if the number of dimensions d is large, there are “many” pre-images for x’, and the adversary cannot conclusively pick any one
  • We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
[Figure: a point perturbed within the sphere whose radius is its 1-radius (T = 1)]
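A sketch of the sanitizer, reading S(x, r) as the sphere of radius r around x (sampling from the ball instead would be a one-line change); T = 20 is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sanitize(rdb, T=20):
    """For each x in the RDB, output x' drawn uniformly at random
    from the sphere S(x, T-radius(x))."""
    sdb = []
    for x in rdb:
        r = np.sort(np.linalg.norm(rdb - x, axis=1))[T]  # T-radius of x
        u = rng.standard_normal(rdb.shape[1])            # uniform direction:
        sdb.append(x + r * u / np.linalg.norm(u))        #  normalized Gaussian
    return np.array(sdb)
```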
Results on privacy… an overview
• The adversary is computationally unbounded
[Results table not reproduced]
Results on utility… an overview
[Results table not reproduced]
A special case – one sanitized point
• RDB = {x1, …, xn}
• The adversary is given n−1 real points x2, …, xn and one sanitized point x’1; T = 1; c = 4; “flat” prior
• Recall: x’1 ∈_R S(x1, |x1 − y|), where y is the nearest neighbor of x1
• Main idea: consider the posterior distribution on x1, and show that the adversary cannot isolate a large probability mass under this distribution
A special case – one sanitized point
• Let Z = { p ∈ ℝ^d : p is a legal pre-image for x’1 } and Q = { p : if x1 = p, then x1 is isolated by q }
• With T = 1 and c = 4, Q ⊆ { p : |p − q| ≤ (1/3)|p − x’1| }
• We show that Pr[ x1 ∈ Q∩Z | x’1 ] ≤ 2^−Ω(d) · Pr[ x1 ∈ Z | x’1 ]:
  Pr[ x1 ∈ Q∩Z | x’1 ] = (probability-mass contribution from Q∩Z)/(contribution from Z) ≤ 2^(1−d)/(1/4)
[Figure: the query q, the regions Q, Z, and Q∩Z, with x’1 and the real points x2, …, x6]
Contribution from Z
• Pr[x1 = p | x’1] ∝ Pr[x’1 | x1 = p] ∝ 1/r^d, where r = |x’1 − p|
• As r increases, x’1 gets randomized over a larger area, proportional to r^d – hence the inverse dependence
• Pr[x’1 | x1 ∈ S] ∝ ∫_{s ∈ S} 1/r^d ds ∝ solid angle subtended by S at x’1
• Z subtends a solid angle equal to at least half a sphere at x’1
[Figure: a region S at distance r from x’1, among the real points x2, …, x6]
Contribution from Q∩Z
• The ellipsoid Q∩Z is roughly as far from x’1 as its longest radius
• The contribution from the ellipsoid is at most 2^−d × the total solid angle
• Therefore, Pr[x1 ∈ Q∩Z] / Pr[x1 ∈ Z] ≤ 2^−Ω(d)
[Figure: the ellipsoid Q∩Z at distance comparable to its longest radius from x’1]
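Collecting the last three slides into one calculation (with the slides’ constants: flat prior, T = 1, c = 4):

```latex
% Posterior over legal pre-images p of x_1':
\Pr[x_1 = p \mid x_1'] \;\propto\; \Pr[x_1' \mid x_1 = p]
  \;\propto\; r^{-d}, \qquad r = \lVert p - x_1' \rVert .
```

So the posterior mass of a region is governed by the solid angle it subtends at x’1. Z subtends at least half a sphere, while the ellipsoid Q∩Z, lying about as far from x’1 as its longest radius, contributes at most a 2^−d fraction of the total solid angle; hence, as on the slides,

```latex
\frac{\Pr[x_1 \in Q \cap Z \mid x_1']}{\Pr[x_1 \in Z \mid x_1']}
  \;\le\; \frac{2^{1-d}}{1/4} \;=\; 2^{3-d} \;=\; 2^{-\Omega(d)} .
```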
The general case… n sanitized points
• Initial intuition is wrong: privacy of x1 given x’1 and the other points in the clear does not imply privacy of x1 given x’1 and the sanitizations of the others!
• Problem: sanitization is non-oblivious – other sanitized points reveal information about x if x is their nearest neighbor
• Solution: decouple the two kinds of information – what x’ reveals about x, and what the other x’i reveal
[Figure: the space split into a left half L and a right half R]
The general case… n sanitized points
• Make the perturbation of the points in L a function of the points in R (and vice versa)
• What function of R would reveal no information about R? Answer: coarse-grained histogram information!
  • Divide space into “cells”
  • Histogram count of a cell C = number of points in R∩C
  • Perturbation radius of a point p ∝ 1/(density of points in the cell containing p)
Histogram-based sanitization
• Recursively divide space into “cells” until all cells have few points
• Reveal the EXACT count of points in each cell (a sketch follows below)
• Contrast this with k-anonymity
[Figure: a recursive subdivision of the plane with per-cell counts 2, 0, 2, 0, 2, 2, 4, 1, 2, 3, 5, 0, 2; here T = 6]
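A minimal sketch of the recursive subdivision, assuming axis-aligned cells split in half round-robin across dimensions; the splitting rule, the threshold name max_pts (playing the role of T, with T = 6 in the figure), and the depth cap are illustrative choices, not the paper’s exact procedure.

```python
import numpy as np

def histogram_cells(points, lo, hi, max_pts=6, depth=0, max_depth=24):
    """Recursively split the cell [lo, hi) until it holds at most
    max_pts points, then publish the cell and its EXACT count.
    points: (n, d) array; lo, hi: length-d arrays (cell corners)."""
    if len(points) <= max_pts or depth >= max_depth:
        return [(lo, hi, len(points))]
    axis = depth % len(lo)                    # round-robin split axis
    mid = (lo[axis] + hi[axis]) / 2.0
    left = points[:, axis] < mid
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[axis] = mid
    lo_right[axis] = mid
    return (histogram_cells(points[left], lo, hi_left, max_pts, depth + 1, max_depth)
          + histogram_cells(points[~left], lo_right, hi, max_pts, depth + 1, max_depth))
```

Note the contrast with k-anonymity: here the published per-cell counts are exact and may be small; the privacy argument is geometric (high-dimensional cells) rather than based on enforcing a minimum group size.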
Histogram-based sanitization
• The adversary outputs (q, r) – a guess q and a radius of isolation r
• The adversary wins if the inner (purple) ball B(q, r) contains at least one point and the outer (orange) ball B(q, cr) contains fewer than T points
[Figure: concentric purple and orange balls around the query q]
Histogram-based sanitization
• We show:
  • If the purple ball is “large”, then the orange ball contains the parent cell, hence at least T points
  • If the purple ball is “small”, then the orange ball is exponentially larger than the purple ball, so either the purple ball contains no points or the orange ball contains at least T points
  • Recall: cells are d-dimensional
[Figure: the purple and orange balls relative to the cell subdivision]
Future directions
• Extend the privacy argument to other “nice” distributions
• For what distributions is there no meaningful privacy-utility trade-off?
• Characterize acceptable auxiliary information
  • Think of auxiliary information as an a priori distribution
• The low-dimensional case – is it inherently impossible?
• Discrete-valued attributes
  • Our proofs require a “spread” in all attributes
• Extend the utility argument to other interesting macroscopic properties, e.g., correlations
Conclusions
• A first step towards understanding the privacy-utility trade-off
• A general and rigorous definition of privacy
• A work in progress!
Questions?