
Calibrating Noise to Sensitivity in Private Data Analysis


Presentation Transcript


  1. Calibrating Noise to Sensitivity in Private Data Analysis. Kobbi Nissim (BGU), with Cynthia Dwork, Frank McSherry, Adam Smith, and Enav Weinreb

  2. The Setting. A database x = (x_1, …, x_n) ∈ D^n (n rows, each from a domain D) sits behind a sanitizer San. Users (government, researchers, marketers, …) send queries and receive answers. The user's view: "I just want to learn a few harmless global statistics." The concern: "Can I combine these to learn some private info?"

  3. What is privacy? • Clearly we cannot undo the harm done by others • Can we minimize the additional harm while providing utility? • Goal: Whether or not I contribute my data does not affect my privacy

  4. Output Perturbation. The user sends a function f; San (using its random coins) returns f(x) + noise. • San controls: • which functions f are allowed • the kind of perturbation

  5. When Can I Release f(x) Accurately? • Intuition: global information is "insensitive" to individual data and is safe • f(x_1,…,x_n) is sensitive if changing a few entries can drastically change its value

  6. Talk Outline • A framework for output perturbation based on “sensitivity” • Formalize “sensitivity” and relate it to privacy definitions • Examples of sensitivity based analysis • New ideas • Basic models for privacy • Local vs. global • Noninteractive vs. Interactive

  7. Related Work • Relevant work in statistics, data mining, computer security, databases • Largely: no precise definitions and analysis of privacy • Recently: a foundational approach • [DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …] • This work extends [DN03, DN04, BDMN05]

  8. Privacy as Indistinguishability. Run San (with its random coins) on x and on x', where x and x' differ in exactly one row. The user issues queries 1…T and receives answers 1…T, producing transcripts T(x) and T(x'). Requirement: the distributions of T(x) and T(x') are at "distance" < ε.

  9. ε-Indistinguishability. A sanitizer is ε-indistinguishable if • for all pairs x, x' ∈ D^n which differ on at most one entry • for all adversaries A • for all transcripts t: Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x') = t]

  10. Semantically Flavored Definitions • Indistinguishability is easy to work with, but it does not directly say what the adversary can do and learn • "Ideal" semantic definition: • The adversary does not change his beliefs about me • Problem: dependencies, e.g. in the form of side information • Say you know that I am 20 pounds heavier than the average Israeli… • You will learn my weight from census results • Whether or not I participate • Ways to get around this: • Assume "independence" of X_1,…,X_n [DN03, DN04, BDMN05] • Compare "what A knows now" vs. "what A would have learned anyway" [DM]

  11. Incremental Risk • Suppose the adversary has prior "beliefs" about x • A probability distribution over random variables X = (X_1,…,X_n) • Given a transcript t, the adversary updates his "beliefs" according to Bayes' rule • New distribution: X'_i | T(X) = t

  12. Incremental Risk. Two options: I participate in the census (input = X), or I do not participate (input Y_i = X_1,…,X_{i-1}, *, X_{i+1},…,X_n). Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs. For all transcripts t, for all i: X'_i | T(X)=t ≈ X'_i | T(Y_i)=t. "Bugger! It's the same whether you participate or not." "Proof:" indistinguishability guarantees that the Bayesian updates are the same to within a 1±ε factor.

  13. Recall – ε-Indistinguishability • For all pairs x, x' ∈ D^n s.t. dist(x, x') = 1 • For all transcripts t: Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x') = t]

  14. An Example – Sum Queries. Database x ∈ [0,1]^n. The user asks: "Please let me know f_A(x) = Σ_{i∈A} x_i"; San (using its random coins) returns f_A(x) + noise.

  15. Sum Queries – Answering a Query • x ∈ [0,1]^n, f_A(x) = Σ_{i∈A} x_i • Can be used as a basis for other tasks: clustering, learning, classification… [BDMN05] • Answer: Σ_{i∈A} x_i + Y where Y ~ Lap(1/ε) • Laplace distribution: h(y) ∝ e^{-ε|y|} • Note: |f_A(x) - f_A(x')| ≤ 1
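
As a concrete illustration (not part of the slides), here is a minimal Python sketch of this sum-query mechanism; the function name and the example database are invented for the example.

```python
import numpy as np

def answer_sum_query(x, subset, epsilon, rng=None):
    """Answer f_A(x) = sum_{i in A} x_i with Laplace noise of scale 1/epsilon.

    Each x_i lies in [0,1], so changing one row moves the sum by at most 1;
    Lap(1/epsilon) noise then suffices for epsilon-indistinguishability.
    """
    rng = rng or np.random.default_rng()
    true_answer = float(np.sum(np.asarray(x)[list(subset)]))
    return true_answer + rng.laplace(scale=1.0 / epsilon)

# Hypothetical database of n = 1000 values in [0,1], query on rows 0..499.
x = np.random.default_rng(0).random(1000)
print(answer_sum_query(x, subset=range(500), epsilon=0.1))
```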

  16. f(x’) Sum Queries – Proof of -Indistinguishability • Property of Lap •  x,y: h(x)/h(y)  e|x-y| • Pr[T(x)=t]  e|fA(x)-t| • Pr[T(x’)=t]  e|fA(x’)-t| • Pr[T(x)=t] / Pr[T(x’)=t]  e | fA(x)- fA(x’)|  e f(x) max |fA(x)-fA(x’)| = 1

  17. Sensitivity. We chose the noise magnitude to cover for max |f(x) − f(x')| over neighboring databases: San returns f(x) + noise on x and f(x') + noise on x', where x and x' differ in one row. • (Global) Sensitivity: S_f = max_{dist(x,x')=1} ||f(x) − f(x')||_1 • Local Sensitivity: LS_f(x) = max_{dist(x,x')=1} ||f(x) − f(x')||_1

  18. Calibrating Noise to Sensitivity. Database x ∈ D^n. The user asks: "Please let me know f(x)"; San (using its random coins) returns f(x) + Lap(S_f/ε), i.e. noise with density h(y) ∝ e^{−(ε/S_f)·||y||_1}.
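
A minimal sketch of this general mechanism in Python (the helper name and the mean example are assumptions made for illustration; the caller must supply the correct sensitivity):

```python
import numpy as np

def laplace_mechanism(x, f, sensitivity, epsilon, rng=None):
    """Release f(x) + Lap(sensitivity/epsilon) noise on each output coordinate.

    `sensitivity` is the analyst-supplied L1 sensitivity
    S_f = max over neighboring x, x' of ||f(x) - f(x')||_1.
    """
    rng = rng or np.random.default_rng()
    value = np.atleast_1d(np.asarray(f(x), dtype=float))
    return value + rng.laplace(scale=sensitivity / epsilon, size=value.shape)

# Illustrative use: the mean of n values in [0,1] has sensitivity 1/n.
x = np.random.default_rng(1).random(2000)
print(laplace_mechanism(x, np.mean, sensitivity=1 / len(x), epsilon=0.1))
```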

  19. Calibrating Noise to Sensitivity – Why Does it Work? Noise density h(y) ∝ e^{−(ε/S_f)·||y||_1}. • S_f = max_{dist(x,x')=1} ||f(x) − f(x')||_1 • Property of Lap: ∀ x, y: h(x)/h(y) ≤ e^{(ε/S_f)·||x−y||_1} • Pr[T(x)=t] / Pr[T(x')=t] ≤ e^{(ε/S_f)·||f(x)−f(x')||_1} ≤ e^ε for dist(x,x') = 1

  20. Main Result. Theorem: If a user U is limited to T adaptive queries, each of sensitivity S_f, then ε-indistinguishability holds if i.i.d. Lap(S_f·T/ε) noise is added to the query answers. (A sketch of such a sanitizer follows below.) • The same idea works with other metrics and noise distributions • Which useful functions are insensitive? • All useful functions should be insensitive… • Statistical conclusions should not depend on small variations in the data
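
One way such a query-limited sanitizer might look, as a rough sketch (the class name, the budget bookkeeping, and the example queries are assumptions, not from the slides):

```python
import numpy as np

class Sanitizer:
    """Answers up to max_queries adaptive queries, each of sensitivity at most
    `sensitivity`, adding iid Lap(sensitivity * max_queries / epsilon) noise."""

    def __init__(self, x, epsilon, sensitivity, max_queries, seed=None):
        self.x = np.asarray(x)
        self.scale = sensitivity * max_queries / epsilon
        self.remaining = max_queries
        self.rng = np.random.default_rng(seed)

    def answer(self, f):
        if self.remaining == 0:
            raise RuntimeError("query budget exhausted")
        self.remaining -= 1
        return float(f(self.x)) + self.rng.laplace(scale=self.scale)

# Illustrative: 5 adaptive sum queries over bits, each of sensitivity 1.
san = Sanitizer(np.random.default_rng(2).integers(0, 2, 1000),
                epsilon=1.0, sensitivity=1.0, max_queries=5)
print(san.answer(lambda x: x[:400].sum()))
print(san.answer(lambda x: x[200:700].sum()))
```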

  21. Using Insensitive Functions • Strategies: • Use the theorem: output f(x) + Lap(S_f/ε) • But S_f may be hard to analyze/compute • And S_f may be high even for functions considered 'insensitive' • Alternatively, express f in terms of insensitive functions • The resulting noise then depends on the input (in form and magnitude)

  22. Example – Expressing f in Terms of Insensitive Functions • x ∈ {0,1}^n, f(x) = (Σ x_i)^2 • S_f = n^2 − (n−1)^2 = 2n − 1 • a_f = (Σ x_i)^2 + Lap(2n/ε) • If f(x) << n, the noise dominates • However: • f(x) = (g(x))^2 where g(x) = Σ x_i • S_g = 1 • Better to query for g • Get a_g = Σ x_i + Lap(1/ε) • Estimate f(x) as (a_g)^2 • Taking ε constant, the error has std. dev. O(Σ x_i), plus an additive (1/ε)^2 term (see the sketch below)
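
A small sketch contrasting the two strategies on this example (the specific n, ε, and data are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, epsilon = 10_000, 0.5
x = np.zeros(n); x[:100] = 1                 # sum(x) = 100, so f(x) = 10,000
f_true = x.sum() ** 2

# Direct release: S_f = 2n - 1, so the added noise swamps f(x) here.
direct = f_true + rng.laplace(scale=(2 * n - 1) / epsilon)

# Indirect: query g(x) = sum(x), which has sensitivity 1, then square the answer.
a_g = x.sum() + rng.laplace(scale=1.0 / epsilon)
indirect = a_g ** 2

print(f"true {f_true:.0f}   direct {direct:.0f}   indirect {indirect:.0f}")
```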

  23. Useful Insensitive functions • Means, variances,… • With appropriate assumptions on data • Histograms & contingency tables • Singular value decomposition • Distance to a property • Functions with low query complexity

  24. Histograms/Contingency Tables • x_1,…,x_n ∈ D, where D is partitioned into d disjoint bins b_1,…,b_d • h(x) = (v_1,…,v_d) where v_j = |{i : x_i ∈ b_j}| • S_h = 2: changing one value x_i changes the count vector by ≤ 2 in L1 norm, irrespective of d • Add Laplace noise with std. dev. 2/ε to each count (can also do this with sum queries)
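
A minimal sketch of this histogram release, assuming the data points have already been mapped to bin indices (the function name and example data are illustrative):

```python
import numpy as np

def private_histogram(bin_indices, num_bins, epsilon, rng=None):
    """Release bin counts with Lap(2/epsilon) noise per bin.

    Changing one row moves at most one point between bins, so the whole count
    vector changes by at most 2 in L1 norm, independent of the number of bins.
    """
    rng = rng or np.random.default_rng()
    counts = np.bincount(bin_indices, minlength=num_bins).astype(float)
    return counts + rng.laplace(scale=2.0 / epsilon, size=num_bins)

# Illustrative use: 5,000 points already mapped to bin indices 0..9.
data = np.random.default_rng(4).integers(0, 10, 5000)
print(np.round(private_histogram(data, num_bins=10, epsilon=0.5)))
```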

  25. Distance to a Property • Say P = a set of "good" databases • Distance to P = the minimum number of points in x that must be changed to put x in P • Always has sensitivity 1 • Add Laplace noise with std. dev. 1/ε • Examples: • Distance to being clusterable • Weight of a minimum cut in a graph
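
A toy sketch of this idea, using a deliberately simple property (P = "all rows are identical") chosen for illustration rather than the clusterability or min-cut examples from the slide:

```python
from collections import Counter
import numpy as np

def distance_to_all_equal(rows):
    """Distance to P = {databases whose rows are all identical}: the number of
    rows that must change, i.e. n minus the largest multiplicity. Changing a
    single row moves this quantity by at most 1, so its sensitivity is 1."""
    return len(rows) - max(Counter(rows).values())

def private_distance(rows, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    return distance_to_all_equal(rows) + rng.laplace(scale=1.0 / epsilon)

print(private_distance(["a", "a", "b", "a", "c"], epsilon=1.0))
```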

  26. Approximations with Low Query Complexity • Lemma: Assume an algorithm A that randomly samples some of the n points and satisfies Pr[A(x) ∈ f(x) ± α] > (1+δ)/2. Then S_f ≤ 2α. • Proof: • Consider x, x' that differ on point i • Let A_i be A conditioned on not choosing point i • Pr[A_i(x) ∈ f(x) ± α] > 1/2 • Pr[A_i(x') ∈ f(x') ± α] > 1/2 • ⇒ there is a point p within distance α of both f(x) and f(x') ⇒ S_f ≤ 2α

  27. Local Sensitivity • Median – typically insensitive, but has large (global) sensitivity • LS_f(x) = max_{dist(x,x')=1} ||f(x) − f(x')||_1 • Example: • f(x) = min(Σ x_i, 10) where x_i ∈ {0,1} • LS_f(x) = 1 if Σ x_i ≤ 10, and 0 otherwise

  28. Local Sensitivity – First Attempt • Calibrate the noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε) • If x_1…x_10 = 1 and x_11…x_n = 0: answer = 10 + Lap(1/ε) • If x_1…x_11 = 1 and x_12…x_n = 0: answer = 10 (no noise, since LS_f(x) = 0) • The noise magnitude itself may be disclosive!

  29. How to Calibrate Noise to Local Sensitivity? • The noise magnitude at a point x should depend on LS_f(y) for all y ∈ D^n • N*_f(x) = max_{y ∈ D^n} (LS_f(y) · e^{−ε·dist(x,y)}) • Example: the median
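
A tiny sketch of this quantity for the capped-sum example from slide 27, assuming Hamming distance between bit vectors and the LS formula given there (the closed form below is my own simplification):

```python
import math

def smooth_noise_bound(sum_x, epsilon, cap=10):
    """N*_f(x) = max_y LS_f(y) * exp(-epsilon * dist(x, y)) for the capped sum
    f(x) = min(sum(x), cap) over bit vectors.

    LS_f(y) = 1 exactly when sum(y) <= cap, and the nearest such y lies at
    Hamming distance max(0, sum(x) - cap), where the maximum is attained.
    """
    return math.exp(-epsilon * max(0, sum_x - cap))

for s in [3, 10, 11, 15, 30]:
    print(s, round(smooth_noise_bound(s, epsilon=0.5), 4))
```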

  30. Talk Outline • A framework for output perturbation based on “sensitivity” • Formalize “sensitivity” and relate it to privacy definitions • Examples of sensitivity based analysis • New ideas • Basic models for privacy • Local vs. global • Noninteractive vs. Interactive

  31. Models for Data Privacy. Individuals (Alice, Bob, you, …) contribute data, which goes through collection and sanitization before reaching users (government, researchers, marketers, …).

  32. Models for Data Privacy – Local vs. Global • Local: each individual (Alice, Bob, you) sanitizes their own data before it is collected • Global: data is collected first and sanitized centrally (including via "SFE")

  33. Models for Data Privacy – Interactive vs. Noninteractive • Interactive: users engage in a back-and-forth query/answer protocol with the sanitizer • Noninteractive: the data is sanitized and published once, with no further interaction

  34. Models for Data Privacy – Summary • Local (vs. Global) • No central trusted party • Individuals interact directly with the (untrusted) user • Individuals control their own privacy • Noninteractive (vs. Interactive) • Easier distribution: web site, book, CD, … • More secure: the data can be erased once it is processed • Almost all work in statistics and data mining is noninteractive!

  35. Four Basic Models: global interactive, local interactive, global noninteractive, local noninteractive. Which are strictly stronger than which, and which are incomparable?

  36. Interactive vs. Noninteractive (among the four models: global interactive, local interactive, global noninteractive, local noninteractive)

  37. Separating Interactive from Noninteractive • Random samples: can compute estimates for many statistics • (Essentially) no need to decide upon queries ahead of time • But not private (unless small domain, small sample [CM06]) • Interaction: get the power of random samples • With privacy! • E.g. sum queries f(x) = Σ_i f_i(x_i) • Even chosen adaptively! • Noninteractive schemes seem weaker • Intuition: privacy ⇒ cannot answer all questions ahead of time (e.g. [DN03]) • Intuition: sanitization must be tailored to specific functions

  38. Separating Interactive from Noninteractive • Theorem: If D = {0,1}^d, then for any private, noninteractive scheme, many sum queries cannot be learned, unless d = o(log n) • Weaker than interactive • Cannot emulate a random sample if the data is complex

  39. Local vs. Global (among the four models: global interactive, local interactive, global noninteractive, local noninteractive)

  40. Separating Local from Global • D = {0,1}^d for d = Ω(log n) • View x as an n × d matrix • Global: rank(x) has sensitivity 1, so it can be released with low noise • Local: cannot distinguish whether rank(x) = k or rank(x) is much larger than k • For a suitable choice of d, n, k

  41. To Sum Up • Defined privacy in terms of indistinguishability • Considered semantic versions of the definitions • "Crypto" with non-negligible error • Showed how to calibrate noise to sensitivity and the number of queries • It seems that useful statistics should be insensitive • Some commonly used functions have low sensitivity • For others – local sensitivity? • Began to explore the relationships between the basic models

  42. Questions • Which useful functions are insensitive? • What would you like to compute? • Can we get stronger results using: • Local sensitivity? • Computational assumptions? [MS06] • Entropy in the data? • How to deal with small databases? • Privacy in a broader context • Rationalizing privacy and privacy-related decisions • Which types of privacy? How to decide upon privacy parameters? … • Handling rich data • Audio, video, pictures, text, …
