Calibrating Noise to Sensitivity in Private Data Analysis
Kobbi Nissim (BGU)
With Cynthia Dwork, Frank McSherry, Adam Smith, Enav Weinreb
The Setting
[Diagram: a database x ∈ Dⁿ (n rows x₁,…,xₙ, each from domain D) sits behind a sanitizer "San". Users (government, researchers, marketers, …) send queries and receive answers. One user: "I just want to learn a few harmless global statistics." Another: "Can I combine these to learn some private info?"]
What is privacy? • Clearly we cannot undo the harm done by others • Can we minimize the additional harm while providing utility? • Goal: Whether or not I contribute my data does not affect my privacy
Output Perturbation
[Diagram: the database x = (x₁,…,xₙ) behind San; a query f arrives, and San returns f(x) + noise, using its random coins.]
• San controls:
  • which functions f
  • the kind of perturbation
When Can I Release f(x) accurately? • Intuition: global information is “insensitive” to individual data and is safe • f(x1,…,xn) is sensitive if changing a few entries can drastically change its value
Talk Outline • A framework for output perturbation based on “sensitivity” • Formalize “sensitivity” and relate it to privacy definitions • Examples of sensitivity based analysis • New ideas • Basic models for privacy • Local vs. global • Noninteractive vs. Interactive
Related Work • Relevant work in Statistics, Data Mining, Computer Security, Databases • Largely: no precise definitions and analysis of privacy • Recently: a foundational approach • [DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …] • This work extends [DN03, DN04, BDMN05]
Privacy as Indistinguishability
[Diagram: two executions. On input x = (x₁,…,xₙ), the user issues queries 1…T to San (which uses random coins) and receives answers, yielding transcript T(x). On input x′, which differs from x in exactly one row, the same process yields transcript T(x′). Requirement: the two transcript distributions are at "distance" < ε.]
ε-Indistinguishability
A sanitizer is ε-indistinguishable if
• for all pairs x, x′ ∈ Dⁿ which differ on at most one entry,
• for all adversaries A,
• for all transcripts t:
Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x′) = t]
Semantically Flavored Definitions
• Indistinguishability – easy to work with, but it does not directly say what the adversary can do and learn
• "Ideal" semantic definition: the adversary does not change his beliefs about me
  • Problem: dependencies, e.g. in the form of side information
  • Say you know that I am 20 pounds heavier than the average Israeli…
  • Then you will learn my weight from the census results, whether or not I participate
• Ways to get around this:
  • Assume "independence" of X₁,…,Xₙ [DN03, DN04, BDMN05]
  • Compare "what A knows now" vs. "what A would have learned anyway" [DM]
Incremental Risk
• Suppose the adversary has prior "beliefs" about x
  • A probability distribution, r.v. X = (X₁,…,Xₙ)
• Given transcript t, the adversary updates his "beliefs" according to Bayes' rule
  • New distribution: X′ᵢ | T(X)=t
Incremental Risk
Two options:
• I participate in the census (input = X)
• I do not participate (input Yᵢ = X₁,…,Xᵢ₋₁, *, Xᵢ₊₁,…,Xₙ)
Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs. For all transcripts t, for all i:
X′ᵢ | T(X)=t ≈ X′ᵢ | T(Yᵢ)=t
[Diagram: San run on X vs. San run on Yᵢ. Adversary: "Bugger! It's the same whether you participate or not."]
"Proof:" indistinguishability guarantees that the two Bayesian updates agree to within a factor of e^{±ε} ≈ 1±ε
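This "proof" can be checked numerically. Below is a small sketch (not from the talk: the Bernoulli(1/2) prior, the sum-query mechanism with Lap(1/ε) noise, and all names are illustrative assumptions) showing that the likelihood ratio for neighboring inputs is capped at e^ε, which caps how far any transcript can move the adversary's posterior away from what it would have been without my participation:

```python
import numpy as np
from math import comb

eps, n_others, prior = 0.5, 4, 0.5
b = 1 / eps  # Laplace scale for a sensitivity-1 sum query

def lap_pdf(y, b):
    """Density of Lap(b) at y."""
    return np.exp(-np.abs(y) / b) / (2 * b)

# Adversary's prior: rows are independent Bernoulli(1/2) bits, so the sum
# of the other n-1 rows is Binomial(n_others, 1/2).
sums = np.arange(n_others + 1)
p_sum = np.array([comb(n_others, k) for k in sums]) / 2 ** n_others

def pr_t_given_xi(t, xi):
    """Pr[T = t | X_i = xi] when i participates and T = sum + Lap(1/eps)."""
    return float(np.sum(p_sum * lap_pdf(t - (sums + xi), b)))

for t in np.linspace(-2.0, 8.0, 6):
    like1, like0 = pr_t_given_xi(t, 1), pr_t_given_xi(t, 0)
    post = prior * like1 / (prior * like1 + (1 - prior) * like0)
    # If i does not participate, T is independent of X_i and the posterior
    # stays at the prior. Indistinguishability caps the likelihood ratio at
    # e^eps, which caps how far the posterior above can move from the prior.
    assert max(like1 / like0, like0 / like1) <= np.exp(eps) + 1e-9
    print(f"t={t:5.2f}  posterior={post:.3f}  (prior={prior}, e^eps={np.exp(eps):.3f})")
```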
Recall – ε-Indistinguishability
• For all pairs x, x′ ∈ Dⁿ s.t. dist(x, x′) = 1
• For all transcripts t:
Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x′) = t]
An Example – Sum Queries
[Diagram: a database x ∈ [0,1]ⁿ behind San; the user asks "Please let me know f_A(x) = Σ_{i∈A} xᵢ", and San returns f_A(x) + noise, using its random coins.]
Sum Queries – Answering a Query
• x ∈ [0,1]ⁿ, f_A(x) = Σ_{i∈A} xᵢ
• Can be used as a basis for other tasks: clustering, learning, classification… [BDMN05]
• Answer: Σ_{i∈A} xᵢ + Y, where Y ∼ Lap(1/ε)
  • Laplace distribution: h(y) ∝ e^{−ε|y|}
• Note: |f_A(x) − f_A(x′)| ≤ 1
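A minimal sketch of this mechanism (function and variable names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def private_sum_query(x, A, eps):
    """Answer f_A(x) = sum over i in A of x_i, for x in [0,1]^n.

    Changing a single row moves f_A by at most 1, so adding
    Y ~ Lap(1/eps) suffices for eps-indistinguishability.
    """
    return float(np.sum(x[A])) + rng.laplace(scale=1.0 / eps)

x = rng.uniform(size=1000)      # the database: one value in [0,1] per row
A = rng.random(1000) < 0.5      # the query's subset A, as a boolean mask
print(np.sum(x[A]), private_sum_query(x, A, eps=0.1))
```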
Sum Queries – Proof of ε-Indistinguishability
[Figure: two Laplace densities, one centered at f_A(x), the other at f_A(x′).]
• Property of Lap: for all x, y: h(x)/h(y) ≤ e^{ε|x−y|}
• Pr[T(x) = t] ∝ e^{−ε|f_A(x) − t|}
• Pr[T(x′) = t] ∝ e^{−ε|f_A(x′) − t|}
• Hence Pr[T(x) = t] / Pr[T(x′) = t] ≤ e^{ε|f_A(x) − f_A(x′)|} ≤ e^ε, since max |f_A(x) − f_A(x′)| = 1
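A quick numeric check of this density-ratio argument, under an illustrative choice of query and neighboring databases:

```python
import numpy as np

eps = 0.1
x = np.zeros(1000)                        # neighbors: differ only in row 0
xp = x.copy(); xp[0] = 1.0
f_A = lambda db: float(db[:500].sum())    # sum query over A = first 500 rows

# Transcript densities: Laplace centered at f_A(x) and f_A(x').
h = lambda t, center: (eps / 2) * np.exp(-eps * abs(t - center))

for t in (-5.0, 0.0, 0.3, 2.0, 10.0):
    ratio = h(t, f_A(x)) / h(t, f_A(xp))
    # ratio <= e^{eps * |f_A(x) - f_A(x')|} <= e^eps, as in the proof
    assert ratio <= np.exp(eps) + 1e-12
    print(f"t={t:5.1f}  ratio={ratio:.4f}  bound={np.exp(eps):.4f}")
```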
Sensitivity
We chose the noise magnitude to cover for max |f(x) − f(x′)|.
• (Global) sensitivity: S_f = max_{dist(x,x′)=1} ‖f(x) − f(x′)‖₁
• Local sensitivity: LS_f(x) = max_{dist(x,x′)=1} ‖f(x) − f(x′)‖₁
[Diagram: San answers f(x) + noise on x = (x₁,…,xₙ), and f(x′) + noise on a neighbor x′ that differs in one row.]
Calibrating Noise to Sensitivity
[Diagram: a database x ∈ Dⁿ behind San; the user asks "Please let me know f(x)", and San returns f(x) + Lap(S_f/ε), using its random coins.]
Noise density: h(y) ∝ e^{−(ε/S_f)·‖y‖₁}
Calibrating Noise to Sensitivity – Why It Works
• S_f = max_{dist(x,x′)=1} ‖f(x) − f(x′)‖₁
• Noise density: h(y) ∝ e^{−(ε/S_f)·‖y‖₁}
• Property of Lap: for all x, y: h(x)/h(y) ≤ e^{(ε/S_f)·‖x−y‖₁}
• Hence Pr[T(x) = t] / Pr[T(x′) = t] ≤ e^{(ε/S_f)·‖f(x) − f(x′)‖₁} ≤ e^ε
Main Result
Theorem: If a user U is limited to T adaptive queries, each of sensitivity at most S_f, then ε-indistinguishability holds if i.i.d. Lap(S_f·T/ε) noise is added to the query answers.
• The same idea works with other metrics and noise distributions
• Which useful functions are insensitive?
  • All useful functions should be insensitive…
  • Statistical conclusions should not depend on small variations in the data
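A sketch of the theorem's mechanism, assuming sensitivity-1 sum queries over [0,1]ⁿ (the interface is illustrative, not from the paper):

```python
import numpy as np

class Sanitizer:
    """Sketch of the theorem's mechanism: up to T adaptive queries, each
    of sensitivity at most S_f, answered with i.i.d. Lap(S_f*T/eps) noise.
    (Class and method names are illustrative, not from the paper.)"""

    def __init__(self, x, eps, T, S_f=1.0, seed=0):
        self.x, self.eps, self.T, self.S_f = x, eps, T, S_f
        self.asked = 0
        self.rng = np.random.default_rng(seed)

    def answer(self, f):
        if self.asked >= self.T:
            raise RuntimeError("query budget of T answers exhausted")
        self.asked += 1
        # Noise scale grows with T so that the T answers compose to eps overall.
        return f(self.x) + self.rng.laplace(scale=self.S_f * self.T / self.eps)

x = np.random.default_rng(2).uniform(size=1000)   # database in [0,1]^n
san = Sanitizer(x, eps=1.0, T=10)                 # sum queries: S_f = 1
print(san.answer(np.sum))                         # first adaptive query
print(san.answer(lambda db: db[:100].sum()))      # the next may depend on it
```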
Using insensitive functions
• Strategies:
  • Use the theorem: output f(x) + Lap(S_f/ε)
    • S_f may be hard to analyze/compute
    • S_f may be high even for functions considered 'insensitive'
  • Express f in terms of insensitive functions
    • The resulting noise depends on the input (in both form and magnitude)
Example - Expressing f in terms of insensitive functions
• x ∈ {0,1}ⁿ, f(x) = (Σᵢ xᵢ)²
• S_f = n² − (n−1)² = 2n − 1
• Direct answer: a_f = (Σᵢ xᵢ)² + Lap(2n/ε)
  • If f(x) ≪ n, the noise dominates
• However, f(x) = (g(x))² where g(x) = Σᵢ xᵢ, and S_g = 1
  • Better to query for g: get a_g = Σᵢ xᵢ + Lap(1/ε), and estimate f(x) as (a_g)²
  • Taking ε constant, the error of (a_g)² has stddev O(Σᵢ xᵢ), plus an additive (1/ε)² term
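The gap between the two strategies is easy to see numerically; a sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 10_000, 1.0
x = (rng.random(n) < 0.01).astype(float)   # sparse 0/1 data, so f(x) << n^2
true_f = x.sum() ** 2

# Direct release: S_f = 2n-1, so the noise scale ~2n/eps swamps f(x).
direct = true_f + rng.laplace(scale=2 * n / eps)

# Indirect: query g(x) = sum(x), which has S_g = 1, then square the answer.
a_g = x.sum() + rng.laplace(scale=1.0 / eps)
indirect = a_g ** 2

print(f"f(x) = {true_f:.0f}   direct: {direct:.0f}   indirect: {indirect:.0f}")
```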
Useful Insensitive functions • Means, variances,… • With appropriate assumptions on data • Histograms & contingency tables • Singular value decomposition • Distance to a property • Functions with low query complexity
Histograms/Contingency Tables
• x₁,…,xₙ ∈ D, where D is partitioned into d disjoint bins b₁,…,b_d
• h(x) = (v₁,…,v_d) where vⱼ = |{i : xᵢ ∈ bⱼ}|
• S_h = 2
  • Changing one value xᵢ changes the count vector by ≤ 2 in L1
  • Irrespective of d
• Add a Laplacian with std. dev. 2/ε to each count
• (Can also do this with sum queries…)
[Figure: a histogram over bins b₁, b₂, …]
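A sketch of this histogram release (names are illustrative):

```python
import numpy as np

def private_histogram(bins_of_rows, d, eps, rng=np.random.default_rng(4)):
    """Release all d bin counts at once.

    Moving one row between bins changes the count vector by at most 2
    in L1 norm (S_h = 2, irrespective of d), so Laplace noise of
    magnitude 2/eps per count suffices.
    """
    counts = np.bincount(bins_of_rows, minlength=d).astype(float)
    return counts + rng.laplace(scale=2.0 / eps, size=d)

rows = np.random.default_rng(5).integers(0, 8, size=10_000)  # bin index per row
print(np.round(private_histogram(rows, d=8, eps=0.5)))
```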
Distance to a Property
[Figure: a point x outside the set P, with an arrow marking the distance from x to P.]
• Say P = a set of "good" databases
• Distance to P = min # of points in x that must be changed to put x in P
• Always has sensitivity 1
• Add a Laplacian with stddev 1/ε
• Examples:
  • Distance to being clusterable
  • Weight of a minimum cut in a graph
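As one concrete instance of the pattern (sortedness is my illustrative choice of property, not one of the slide's examples): distance to "sorted in nondecreasing order" is n minus the longest nondecreasing subsequence, and like any distance to a property it has sensitivity 1.

```python
import bisect
import numpy as np

def private_distance_to_sorted(x, eps, rng=np.random.default_rng(6)):
    """Distance from x to the property 'nondecreasingly sorted': the
    number of points that must change, i.e. n minus the longest
    nondecreasing subsequence. Like any distance to a property, this
    has sensitivity 1, so Lap(1/eps) noise suffices.
    """
    tails = []                              # patience sorting for the LNDS
    for v in x:
        i = bisect.bisect_right(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    distance = len(x) - len(tails)
    return distance + rng.laplace(scale=1.0 / eps)

print(private_distance_to_sorted([3, 1, 2, 5, 4], eps=1.0))  # true distance: 2
```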
Approximations with Low Query Complexity
• Lemma: Assume algorithm A randomly samples a few of the n points and Pr[A(x) ∈ f(x) ± α] > (1+δ)/2. Then S_f ≤ 2α.
• Proof:
  • Consider x, x′ that differ on point i
  • Let Aᵢ be A conditioned on not choosing point i
  • Pr[Aᵢ(x) ∈ f(x) ± α | pt i not sampled] > 1/2
  • Pr[Aᵢ(x′) ∈ f(x′) ± α | pt i not sampled] > 1/2
  • Since Aᵢ(x) = Aᵢ(x′), there is a point p within distance α from both f(x) and f(x′) ⇒ S_f ≤ 2α
[Figure: the supports of Aᵢ(x) and Aᵢ(x′) overlap at a point p.]
Local Sensitivity
• Median – typically insensitive, yet has large (global) sensitivity
• LS_f(x) = max_{dist(x,x′)=1} ‖f(x) − f(x′)‖₁
• Example: f(x) = min(Σᵢ xᵢ, 10) where xᵢ ∈ {0,1}
  • LS_f(x) = 1 if Σᵢ xᵢ ≤ 10, and 0 otherwise
[Figure: f(x) as a function of Σᵢ xᵢ, increasing up to the cap at 10.]
Local Sensitivity – First Attempt
• Calibrate the noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε)
• If x₁ = … = x₁₀ = 1 and x₁₁ = … = xₙ = 0: answer = 10 + Lap(1/ε)
• If x₁ = … = x₁₁ = 1 and x₁₂ = … = xₙ = 0: answer = 10, exactly
• The noise magnitude itself may be disclosive!
How to Calibrate Noise to Local Sensitivity?
• The noise magnitude at a point x must depend on LS_f(y) for all y ∈ Dⁿ:
N*_f(x) = max_{y ∈ Dⁿ} ( LS_f(y) · e^{−ε·dist(x,y)} )
• Example: the median
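For the capped-sum example from the previous slides, N*_f(x) has a simple closed form; the brute-force sketch below (illustrative, not the talk's construction) just scans over the possible sums of y:

```python
import numpy as np

def noise_magnitude(x, eps, threshold=10):
    """N*_f(x) = max over y of LS_f(y) * exp(-eps * dist(x, y)), for
    f(x) = min(sum(x), threshold) on x in {0,1}^n. Here LS_f(y) = 1
    iff sum(y) <= threshold, and the nearest y with sum s' is
    |sum(x) - s'| bit flips away, so a scan over sums suffices.
    """
    s, n = int(np.sum(x)), len(x)
    return max(
        (1.0 if s_prime <= threshold else 0.0) * np.exp(-eps * abs(s - s_prime))
        for s_prime in range(n + 1)
    )

x = np.zeros(100); x[:30] = 1         # sum = 30, far above the threshold
print(noise_magnitude(x, eps=0.5))    # ~ e^{-10}: noise can shrink safely here
```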
Talk Outline • A framework for output perturbation based on “sensitivity” • Formalize “sensitivity” and relate it to privacy definitions • Examples of sensitivity based analysis • New ideas • Basic models for privacy • Local vs. global • Noninteractive vs. Interactive
Models for Data Privacy
[Diagram: individuals (Alice, Bob, …, You) → collection and sanitization → users (government, researchers, marketers, …).]
Models for Data Privacy – Local vs. Global
• Local: each individual (Alice, Bob, You) runs their own San before the data is collected
• Global: the data is collected first, then sanitized centrally (including via "SFE")
Models for Data Privacy – Interactive vs. Noninteractive
• Interactive: users send queries to the collected, sanitized data and receive answers, round by round
• Noninteractive: the collected data is sanitized and published once, with no further interaction
Models for Data Privacy - Summary
• Local (vs. Global)
  • No central trusted party
  • Individuals interact directly with the (untrusted) user
  • Individuals control their own privacy
• Noninteractive (vs. Interactive)
  • Easier distribution: web site, book, CD, …
  • More secure: the data can be erased once it is processed
  • Almost all work in statistics and data mining is noninteractive!
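The canonical local-model sanitizer is randomized response, where each individual perturbs their own bit before it ever leaves their hands; a sketch of this standard example (not specific to this talk):

```python
import numpy as np

def randomized_response(bit, eps, rng):
    """Local-model sanitizer run by each individual on their own bit:
    report truthfully with probability e^eps/(1+e^eps), else flip.
    Any report's likelihood ratio is exactly e^eps."""
    p_true = np.exp(eps) / (1 + np.exp(eps))
    return bit if rng.random() < p_true else 1 - bit

rng = np.random.default_rng(7)
eps = 1.0
bits = rng.integers(0, 2, size=100_000)
reports = np.array([randomized_response(b, eps, rng) for b in bits])

# The (untrusted) user can still debias the aggregate:
p = np.exp(eps) / (1 + np.exp(eps))
estimate = (reports.mean() - (1 - p)) / (2 * p - 1)
print(bits.mean(), estimate)
```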
Four Basic Models
[Diagram: a 2×2 grid of models, {Local, Global} × {Interactive, Noninteractive}; arrows indicate relative power, with some pairs possibly incomparable (?).]
Interactive vs. Noninteractive
[Diagram: the same 2×2 grid, highlighting the interactive vs. noninteractive comparison.]
Separating Interactive from Noninteractive
• Random samples: can compute estimates for many statistics
  • (Essentially) no need to decide upon the queries ahead of time
  • But not private (unless the domain and the sample are small [CM06])
• Interaction: get the power of random samples – with privacy!
  • E.g. sum queries f(x) = Σᵢ fᵢ(xᵢ)
  • Even chosen adaptively!
• Noninteractive schemes seem weaker
  • Intuition: with privacy, one cannot answer all questions ahead of time (e.g. [DN03])
  • Intuition: the sanitization must be tailored to specific functions
Separating Interactive from Noninteractive
• Theorem: If D = {0,1}^d, then for any private, noninteractive scheme, the answers to many sum queries cannot be learned, unless d = o(log n)
• So the noninteractive model is weaker than the interactive one
  • It cannot emulate a random sample when the data is complex
Local vs. Global
[Diagram: the same 2×2 grid, highlighting the local vs. global comparison.]
Separating Local from Global
• D = {0,1}^d for d = Θ(log n); view x as an n×d matrix
• Global: rank(x) has sensitivity 1, so it can be released with low noise
• Local: cannot distinguish whether rank(x) = k or much larger than k
  • For a suitable choice of d, n, k
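A sketch of the global-model side, releasing a noisy rank (over the reals for simplicity; the talk's setting is 0/1 matrices):

```python
import numpy as np

rng = np.random.default_rng(9)

def private_rank(x, eps):
    """Global-model release of rank(x) for an n x d matrix x whose rows
    are individual records. Replacing one row is a rank-1 update, so
    the rank changes by at most 1, and Lap(1/eps) noise suffices."""
    return np.linalg.matrix_rank(x) + rng.laplace(scale=1.0 / eps)

x = rng.integers(0, 2, size=(1000, 10)).astype(float)
print(private_rank(x, eps=1.0))
```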
To sum up
• Defined privacy in terms of indistinguishability
• Considered semantic versions of the definitions
  • "Crypto" with non-negligible error
• Showed how to calibrate the noise to the sensitivity and to the number of queries
  • It seems that useful statistics should be insensitive
  • Some commonly used functions have low sensitivity
  • For the others – local sensitivity?
• Began to explore the relationships between the basic models
Questions
• Which useful functions are insensitive?
  • What would you like to compute?
• Can we get stronger results using:
  • Local sensitivity?
  • Computational assumptions? [MS06]
  • Entropy in the data?
• How to deal with small databases?
• Privacy in a broader context
  • Rationalizing privacy and privacy-related decisions
  • Which types of privacy? How to decide upon privacy parameters? …
• Handling rich data: audio, video, pictures, text, …