
Privacy by Learning the Database


Presentation Transcript


  1. Privacy by Learning the Database Moritz Hardt DIMACS, October 24, 2012

  2. Isn’t privacy the opposite of learning the database?

  3. [Diagram] The Analyst sends a query set Q to the Curator; the Curator returns a privacy-preserving structure S accurate on Q. Data set D = multi-set over universe U.

  4. Data set D as N-dimensional histogram, where N = |U| and D[i] = # elements in D of type i. The normalized histogram is a distribution over the universe. [Figure: histogram over types 1, 2, 3, ..., N] Statistical query q (aka linear/counting query): a vector q in [0,1]^N, with answer q(D) := ⟨q, D⟩ ∈ [0,1].
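The statistical-query setup on this slide fits in a few lines of NumPy; the example histogram and query here are mine, not from the talk:

```python
import numpy as np

# Hypothetical universe of N = 5 types; D starts as raw counts.
D = np.array([10, 0, 5, 3, 2], dtype=float)
D /= D.sum()                    # normalized histogram = distribution over U

q = np.array([1.0, 1.0, 0.0, 0.5, 0.0])   # a statistical query q in [0,1]^N

answer = float(np.dot(q, D))    # q(D) := <q, D>, always in [0, 1]
print(round(answer, 3))         # → 0.575
```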

  5. Why statistical queries? Lots of data analysis reduces to multiple statistical queries • Perceptron, ID3 decision trees, PCA/SVM, k-means clustering [BlumDworkMcSherryNissim’05] • Any SQ-learning algorithm [Kearns’98] • includes “most” known PAC-learning algorithms

  6. Curator’s wildest dream: This seems hard!

  7. Curator’s 2nd attempt: Intuition: Entropy implies privacy

  8. Two pleasant surprises Approximately solved by multiplicative weights update [Littlestone'89, ...] Can easily be made differentially private

  9. Why did learning theorists care to solve privacy problems 20 years ago? Answer: Entropy implies generalization

  10. [Diagram] An unknown concept labels the example set Q; the Learner outputs a hypothesis h accurate on all examples. Maximizing entropy implies the hypothesis generalizes.

  11. Learning vs. Privacy:
      Learning: unknown concept; examples labeled by the concept; hypothesis approximates the target concept on the examples; must generalize.
      Privacy: sensitive database; queries labeled by their answer on the DB; synopsis approximates the DB on the query set; must preserve privacy.

  12. How can we solve this? It is concave maximization subject to linear constraints, so the Ellipsoid method would work. We'll take a different route.

  13. Start with uniform D0. "What's wrong with it?" Some query q violates a constraint! Minimize entropy loss subject to correcting it. Closed-form expression for Dt+1? Well...

  14. Relax. Think approximate. Closed-form expression for Dt+1? YES!

  15. Multiplicative Weights Update

  16. [Figure: histograms of Dt and D over types 1, ..., N] At step t

  17. [Figure: histograms of Dt and D, with query q] At step t. Suppose q(Dt) < q(D)

  18. [Figure: Dt shifted toward D on the support of q] After step t

  19. Multiplicative Weights Update. Algorithm: D0 uniform; for t = 1...T: find a bad query q, then Dt+1 = Update(Dt, q). How quickly do we run out of bad queries?
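The loop above (still without privacy) can be sketched as follows; the learning rate eta = α/2, the stopping rule, and all names are my choices, not from the talk:

```python
import numpy as np

# Non-private sketch of the MW loop: repeatedly find a query whose answer
# on Dt is off by more than alpha, and reweight Dt toward the true D.
def mw_fit(D, queries, alpha=0.1, max_steps=500):
    Dt = np.ones(len(D)) / len(D)            # D0 = uniform
    for _ in range(max_steps):
        errs = [float(q @ D - q @ Dt) for q in queries]
        j = int(np.argmax(np.abs(errs)))
        if abs(errs[j]) <= alpha:            # no bad query left: done
            break
        # Multiplicative weights update: scale coordinate i by
        # exp(+-eta * q[i]), with the sign of the error as direction.
        Dt = Dt * np.exp((alpha / 2) * np.sign(errs[j]) * queries[j])
        Dt /= Dt.sum()                       # re-normalize to a distribution
    return Dt

D = np.array([0.5, 0.0, 0.25, 0.15, 0.1])    # true normalized histogram
queries = [np.array([1., 1., 0., .5, 0.]), np.array([0., 0., 1., 1., 1.])]
Dt = mw_fit(D, queries)
print([round(float(q @ Dt), 3) for q in queries])   # each close to q(D)
```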

  20. Progress Lemma: if q is bad, the update makes progress. Put Φt := KL(D ‖ Dt) = Σi D[i] log(D[i] / Dt[i]).

  21. Progress Lemma: if q is bad, Φt − Φt+1 ≥ Ω(α²). Facts: Φ0 ≤ log N and Φt ≥ 0, so there are at most O(log N / α²) update steps. Error bound: every query is answered to within α.

  22. Algorithm: D0 uniform; for t = 1...T: find a bad query q (the only step that interacts with D), then Dt+1 = Update(Dt, q). What about privacy?

  23. Differential Privacy [Dwork-McSherry-Nissim-Smith'06] Two data sets D, D' are called neighboring if they differ in one element. Definition (Differential Privacy): A randomized algorithm M(D) is called (ε,δ)-differentially private if for any two neighboring data sets D, D' and all events S: Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ.

  24. Laplacian Mechanism [DMNS'06] Given query q: compute q(D), output q(D) + Lap(1/ε0n). Fact: this satisfies ε0-differential privacy. Note: the sensitivity of q on neighboring databases of size n is 1/n.
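A minimal sketch of this mechanism, assuming a database of size n = 10,000 and a made-up true answer (both illustrative, not from the talk):

```python
import numpy as np

# Laplace mechanism for one statistical query: sensitivity 1/n, so noise
# of scale 1/(eps0 * n) gives eps0-differential privacy.
def laplace_mechanism(true_answer, eps0, n, rng):
    return true_answer + rng.laplace(loc=0.0, scale=1.0 / (eps0 * n))

rng = np.random.default_rng(0)
n, eps0 = 10_000, 0.1
noisy = laplace_mechanism(0.575, eps0, n, rng)
print(noisy)   # close to 0.575: the noise std is sqrt(2)/(eps0*n) ≈ 0.0014
```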

  25. Query selection [Figure: bar chart of violations |q(D) − q(Dt)| over queries q1, q2, q3, ..., qk]

  26. Query selection: add Lap(1/ε0n) noise to each violation.

  27. Query selection: pick the maximal (noisy) violation.

  28. Query selection. Lemma [McSherry-Talwar'07]: the selected index satisfies ε0-differential privacy, and w.h.p. its true violation is within O(log k / (ε0n)) of the maximum.
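The noisy-max selection step can be sketched as below (a report-noisy-max variant with Laplace noise; the example data, parameters, and names are mine):

```python
import numpy as np

# Pick the query whose violation |q(D) - q(Dt)| is (noisily) largest.
def select_query(D, Dt, queries, eps0, n, rng):
    violations = np.array([abs(float(q @ D - q @ Dt)) for q in queries])
    noisy = violations + rng.laplace(scale=1.0 / (eps0 * n), size=len(queries))
    return int(np.argmax(noisy))      # index of the maximal noisy violation

rng = np.random.default_rng(0)
D  = np.array([0.5, 0.0, 0.25, 0.15, 0.1])
Dt = np.ones(5) / 5
queries = [np.array([0., 0., 1., 1., 1.]),   # violation |0.5 - 0.6| = 0.1
           np.array([1., 0., 0., 0., 0.])]   # violation |0.5 - 0.2| = 0.3
idx = select_query(D, Dt, queries, eps0=0.5, n=10_000, rng=rng)
print(idx)   # almost surely 1: the 0.2 gap dwarfs noise of scale 0.0002
```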

  29. Also use the noisy answer in the update rule. Algorithm: D0 uniform; for t = 1...T: noisy selection of q, then Dt+1 = Update(Dt, q). Now each step satisfies ε0-differential privacy! New error bound accounts for the noise. What is the total privacy guarantee?

  30. T-fold composition of ε0-differential privacy satisfies: Answer 1 [DMNS'06]: ε0T-differential privacy. Answer 2 [DRV'10]: (ε,δ)-differential privacy with ε ≈ ε0·sqrt(T log(1/δ)). Note: the second bound beats the first for small enough ε0.
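The gap between the two answers can be checked numerically. The advanced-composition expression below is the common ε0·sqrt(2T ln(1/δ)) + Tε0(e^ε0 − 1) form attributed to [DRV'10]; the specific parameter values are mine:

```python
import math

# Compare basic composition with advanced composition for T rounds.
eps0, T, delta = 0.01, 1000, 1e-6
basic = eps0 * T                                   # Answer 1 [DMNS'06]
advanced = (eps0 * math.sqrt(2 * T * math.log(1 / delta))
            + T * eps0 * (math.exp(eps0) - 1))     # Answer 2 [DRV'10]
print(basic, advanced)   # advanced is much smaller for small eps0
```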

  31. Theorem 1. On databases of size n, MW achieves ε-differential privacy, with error bound obtained via the composition theorems by optimizing T and ε0. Theorem 2. MW achieves (ε,δ)-differential privacy with optimal dependence on |Q| and n.

  32. Offline (non-interactive) ✔ vs. online (interactive) ? [Diagram: offline, the curator releases a structure S answering all of Q; online, the analyst asks q1, q2, ... adaptively and receives answers a1, a2, ...] H-Ligett-McSherry'12, Gupta-H-Roth-Ullman'11, H-Rothblum'10. See also: Roth-Roughgarden'10, Dwork-Rothblum-Vadhan'10, Dwork-Naor-Reingold-Rothblum-Vadhan'09, Blum-Ligett-Roth'08.

  33. Private MW Online [H-Rothblum'10] Algorithm, given query qt: • If |qt(Dt) − qt(D)| < α/2 + Lap(1/ε0n): output qt(Dt). • Otherwise: output qt(D) + Lap(1/ε0n), and set Dt+1 = Update(Dt, qt). Achieves the same error bounds!
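A minimal sketch of this online loop, with the database size, parameters, and all names as illustrative assumptions (not the paper's code); noise details are simplified:

```python
import numpy as np

# Online Private MW: answer from the public Dt when it is already accurate
# (lazy round); otherwise release a noisy true answer and update Dt (busy).
def pmw_online(D, query_stream, n, eps0, alpha, rng):
    Dt = np.ones(len(D)) / len(D)            # D0 = uniform
    answers = []
    for q in query_stream:
        true_ans, est = float(q @ D), float(q @ Dt)
        if abs(est - true_ans) < alpha / 2 + rng.laplace(scale=1 / (eps0 * n)):
            answers.append(est)              # lazy: D is not touched
        else:
            noisy = true_ans + rng.laplace(scale=1 / (eps0 * n))
            answers.append(noisy)            # busy: pay for a noisy answer,
            Dt = Dt * np.exp((alpha / 2) * np.sign(noisy - est) * q)
            Dt /= Dt.sum()                   # and use it to drive the update
    return answers

rng = np.random.default_rng(1)
D = np.array([0.5, 0.0, 0.25, 0.15, 0.1])
qs = [np.array([1., 1., 0., .5, 0.]), np.array([0., 0., 1., 1., 1.])] * 3
ans = pmw_online(D, qs, n=10_000, eps0=0.5, alpha=0.1, rng=rng)
print([round(a, 3) for a in ans])            # each within alpha of q(D)
```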

  34. Overview: Privacy Analysis • Offline setting: T << n steps; simple analysis using composition theorems. • Online setting: k >> n invocations of Laplace; composition theorems don't suggest small error! • Idea: analyze the privacy loss like a lazy random walk (goes back to Dinur-Dwork-Nissim'03).

  35.–43. Privacy Loss as a lazy random walk. [Figure, built up over slides 35–43: privacy loss plotted against the number of steps; long lazy stretches contribute no loss, punctuated by busy rounds each contributing a unit step.] A busy round = a noisy answer close to forcing an update. W.h.p. the total loss is bounded by O(sqrt(#busy)).
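The picture above can be simulated as a toy random walk (illustrative only; the round count, busy probability, and ±ε0 step model are my simplifications):

```python
import numpy as np

# Lazy rounds add zero privacy loss; busy rounds add a +/- eps0 step.
# The total loss then behaves like a random walk of length #busy, not k,
# concentrating around eps0 * sqrt(#busy) instead of growing like eps0 * k.
rng = np.random.default_rng(0)
eps0, k, p_busy = 0.1, 10_000, 0.02
busy = rng.random(k) < p_busy                         # which rounds are busy
steps = np.where(busy, rng.choice([-eps0, eps0], size=k), 0.0)
total_loss = abs(float(steps.sum()))
print(total_loss, eps0 * k, eps0 * float(np.sqrt(busy.sum())))
```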

  44. Formalizing the random walk Imagine the output of PMW is a 0/1 indicator vector v, where vt = 1 if round t is an update and 0 otherwise. Recall: very few updates, so the vector is sparse. Theorem: the vector v is (ε,δ)-differentially private.

  45. Let D, D' be neighboring DBs, and let P, Q be the corresponding output distributions. Approach: (1) sample v from P; (2) consider X = log(P(v)/Q(v)). Intuition: X is the privacy loss. (3) Argue Pr{ |X| > ε } ≤ δ. Lemma: (3) implies (ε,δ)-differential privacy.

  46. Total privacy loss X = X1 + ... + Xk, where Xt is the privacy loss in round t [DRV'10]. We'll show: Xt = 0 if round t is not busy; |Xt| ≤ ε0 if it is; the number of busy rounds is O(#updates); so E[X1+...+Xk] ≤ O(ε0² · #updates). Azuma then gives strong concentration around the expectation.

  47. Defining the "busy" event. [Formulas: the update condition, and the relaxed busy event]

  48. [Diagram as on slide 32] Offline (non-interactive) ✔ Online (interactive) ✔: both settings are now solved.

  49. What we can do • Offline/batch setting: every set of linear queries. • Online/interactive setting: every sequence of adaptive and adversarial linear queries. • Theoretical performance: nearly optimal in the worst case. • For instance-by-instance guarantees see H-Talwar'10 and Nikolov-Talwar (upcoming!), which use different techniques. • Practical performance: compares favorably to previous work! See Katrina's talk. Are we done?

  50. What we would like to do Running time: linear dependence on |U|, and |U| is exponential in the # of attributes of the data. Can we get poly(n)? No in the worst case for synthetic data [DNRRV09], even for simple query classes [Ullman-Vadhan10]. No in the interactive setting without restricting the query class [Ullman12]. What can we do about it?
