Privacy by Learning the Database. Moritz Hardt, DIMACS, October 24, 2012
Isn’t privacy the opposite of learning the database?
Setup: the Curator holds a data set D, a multi-set over a universe U. The Analyst sends a query set Q; the Curator responds with a privacy-preserving structure S that is accurate on Q.
View the data set D as an N-dimensional histogram, where N = |U| and D[i] = # elements in D of type i. The normalized histogram is a distribution over the universe.
Statistical query q (aka linear/counting query): a vector q in [0,1]^N with answer q(D) := <q,D>, so q(D) in [0,1].
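To make the histogram view concrete, here is a minimal Python sketch (the function and variable names are mine, not the talk's) of building the normalized histogram and evaluating a statistical query as an inner product:

```python
import numpy as np

def normalized_histogram(data, universe_size):
    """Represent a multi-set over {0, ..., N-1} as a distribution D in [0,1]^N."""
    D = np.zeros(universe_size)
    for x in data:
        D[x] += 1.0
    return D / len(data)

def evaluate_query(q, D):
    """Statistical (linear/counting) query: q in [0,1]^N, answer q(D) = <q, D> in [0,1]."""
    return float(np.dot(q, D))

# Example: fraction of records whose type lies in a subset of the universe.
data = [0, 2, 2, 3, 3, 3]                     # toy data set, universe U = {0,...,4}
D = normalized_histogram(data, 5)
q = np.array([0, 0, 1, 1, 0], dtype=float)    # counting query: "is the type 2 or 3?"
print(evaluate_query(q, D))                   # 5/6 of the records
```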
Why statistical queries? Lots of data analysis reduces to multiple statistical queries • Perceptron, ID3 decision trees, PCA/SVM, k-means clustering [BlumDworkMcSherryNissim’05] • Any SQ-learning algorithm [Kearns’98] • includes “most” known PAC-learning algorithms
Curator’s wildest dream: release a single structure S that answers every query in Q accurately while preserving privacy. This seems hard!
Curator’s 2nd attempt: release the distribution of maximum entropy among those that answer every query in Q to within α. Intuition: Entropy implies privacy.
Two pleasant surprises:
• Approximately solved by the multiplicative weights update [Littlestone'89, ...]
• Can easily be made differentially private
Why did learning theorists care to solve privacy problems 20 years ago? Answer: Entropy implies generalization
Learning setup: an unknown concept, an example set Q, and a Learner that outputs a hypothesis h accurate on all examples. Maximizing entropy implies the hypothesis generalizes.
Learning | Privacy
Unknown concept | Sensitive database
Examples labeled by concept | Queries labeled by answer on DB
Hypothesis approximates target concept on examples | Synopsis approximates DB on query set
Must generalize | Must preserve privacy
How can we solve this? It is a concave maximization subject to linear constraints, so the Ellipsoid method would work. We'll take a different route.
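For concreteness, the program behind the "2nd attempt" can be written out as follows (a sketch reconstructed from the surrounding slides; α is the accuracy target):

```latex
\begin{align*}
\max_{D'} \quad & H(D') = -\sum_{i=1}^{N} D'[i]\,\log D'[i] \\
\text{s.t.} \quad & \big|\langle q, D'\rangle - \langle q, D\rangle\big| \le \alpha
\quad \text{for all } q \in Q, \\
& \sum_{i=1}^{N} D'[i] = 1, \qquad D'[i] \ge 0 .
\end{align*}
```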
Start with the uniform distribution D0 and ask: "What's wrong with it?" Some query q violates a constraint. Update: minimize the entropy loss subject to correcting that constraint. Closed-form expression for Dt+1? Well...
Relax, think, approximate: correct the violated constraint only approximately. Closed-form expression for Dt+1? YES!
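The closed form that comes out of the relaxed step is the multiplicative weights update; one standard way to write it (a sketch with learning rate η, for the case q(Dt) < q(D)) is:

```latex
D_{t+1}[i] \;=\; \frac{D_t[i]\,e^{\eta\, q[i]}}{\sum_{j=1}^{N} D_t[j]\,e^{\eta\, q[j]}},
\qquad \text{e.g. } \eta = \alpha/2 .
```

When q(Dt) > q(D), the sign of the exponent flips.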
[Figure: histograms of Dt and the true D at step t and after step t. Suppose q(Dt) < q(D); the update increases Dt on coordinates where q is large, moving q(Dt+1) toward q(D).]
Multiplicative Weights Update Algorithm:
  D0 = uniform
  For t = 1...T:
    Find a bad query q
    Dt+1 = Update(Dt, q)
How quickly do we run out of bad queries?
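A minimal non-private sketch of this loop in Python (names are mine, not the talk's; `queries` is a list of vectors in [0,1]^N and `alpha` is the target error):

```python
import numpy as np

def mw_update(Dt, q, direction, eta):
    """Multiplicative weights step: shift mass toward (or away from) large coordinates of q."""
    w = Dt * np.exp(direction * eta * q)
    return w / w.sum()

def mw_fit(D, queries, alpha, max_steps=1000):
    """Find a distribution Dt answering all queries within alpha of the true histogram D."""
    N = len(D)
    Dt = np.full(N, 1.0 / N)                 # D0 = uniform
    eta = alpha / 2.0
    for _ in range(max_steps):
        # Find a "bad" query: one whose answer on Dt is off by more than alpha.
        errors = [q @ D - q @ Dt for q in queries]
        j = int(np.argmax(np.abs(errors)))
        if abs(errors[j]) <= alpha:
            break                            # no bad queries left
        Dt = mw_update(Dt, queries[j], np.sign(errors[j]), eta)
    return Dt
```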
Progress Lemma: if q is α-bad, the potential drops by at least α²/4. Put Φt := RE(D ‖ Dt), the relative entropy between D and Dt.
Facts: Φ0 ≤ log N and Φt ≥ 0, so there are at most 4·log N / α² update steps.
Error bound: after T steps, α = O(√(log N / T)).
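One way to spell out the progress lemma, in the case q(D) − q(Dt) ≥ α (the other case is symmetric); this is the standard multiplicative-weights potential argument, so constants may differ from the talk's:

```latex
\Phi_t - \Phi_{t+1}
  \;=\; \eta\,\langle q, D\rangle - \log\!\Big(\sum_{i} D_t[i]\,e^{\eta q[i]}\Big)
  \;\ge\; \eta\big(q(D) - q(D_t)\big) - \eta^2
  \;\ge\; \eta\alpha - \eta^2 \;=\; \frac{\alpha^2}{4}
  \qquad\text{for } \eta = \alpha/2 .
```

Since Φ0 ≤ log N and Φt never goes negative, the potential can absorb at most 4·log N / α² such drops.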
Algorithm: D0 = uniform. For t = 1...T: find a bad query q; Dt+1 = Update(Dt, q). What about privacy? Finding the bad query is the only step that interacts with D.
Differential Privacy [Dwork-McSherry-Nissim-Smith'06]. Two data sets D, D' are called neighboring if they differ in one element. Definition (Differential Privacy): A randomized algorithm M(D) is called (ε,δ)-differentially private if for any two neighboring data sets D, D' and all events S: Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ.
Laplace Mechanism [DMNS'06]. Given query q: compute q(D) and output q(D) + Lap(1/ε0n). Fact: this satisfies ε0-differential privacy. Note: the sensitivity of q is 1/n.
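A sketch of this mechanism in Python (assuming the normalized-histogram representation from earlier; `eps0` is the per-query privacy parameter and `n` the database size):

```python
import numpy as np

def laplace_mechanism(q, D, n, eps0):
    """Answer q(D) with Laplace noise scaled to the query's sensitivity 1/n."""
    true_answer = float(q @ D)                                  # q(D) = <q, D> in [0,1]
    noise = np.random.laplace(loc=0.0, scale=1.0 / (eps0 * n))  # Lap(1/(eps0 * n))
    return true_answer + noise                                  # eps0-differentially private
```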
Query selection: compute the violation |q(D) − q(Dt)| for each query q1, ..., qk, add independent Lap(1/ε0n) noise to each, and pick the maximal noisy violation.
Lemma [McSherry-Talwar'07]: the selected index satisfies ε0-differential privacy, and w.h.p. its violation is within O(log k / (ε0n)) of the maximum.
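A Python sketch of the selection step (hypothetical names; this is the "report noisy max" view of the selection, matching the slide's use of Laplace noise):

```python
import numpy as np

def select_bad_query(queries, D, Dt, n, eps0):
    """Noisy selection of the query with the largest violation |q(D) - q(Dt)|."""
    violations = np.array([abs(q @ D - q @ Dt) for q in queries])
    noisy = violations + np.random.laplace(scale=1.0 / (eps0 * n), size=len(queries))
    return int(np.argmax(noisy))    # releasing the index satisfies eps0-differential privacy (lemma above)
```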
Also use the noisy answer in the update rule. Algorithm: D0 = uniform; for t = 1...T: noisy selection of q; Dt+1 = Update(Dt, q). Now each step satisfies ε0-differential privacy! The new error bound picks up the Laplace noise on top of the MW error. What is the total privacy guarantee?
T-fold composition of ε0-differential privacy satisfies:
Answer 1 [DMNS'06]: ε0·T-differential privacy.
Answer 2 [DRV'10]: (ε,δ)-differential privacy with a much smaller ε for small enough ε0 (the standard form of the bound is sketched below).
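The DRV'10 bound in its usual form (this is the standard statement of advanced composition, not copied from the slides):

```latex
\varepsilon \;=\; \sqrt{2T\ln(1/\delta)}\;\varepsilon_0 \;+\; T\,\varepsilon_0\big(e^{\varepsilon_0}-1\big)
\;\approx\; \varepsilon_0\sqrt{2T\ln(1/\delta)}
\qquad \text{for small enough } \varepsilon_0 .
```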
Combining the error bound with the composition theorems and optimizing T and ε0:
Theorem 1. On databases of size n, MW achieves ε-differential privacy with error α = Õ((log N · log|Q| / (εn))^(1/3)).
Theorem 2. MW achieves (ε,δ)-differential privacy with error α = Õ((√(log N) · log|Q| / (εn))^(1/2)).
Optimal dependence on |Q| and n.
Offline (non-interactive): the analyst sends the whole query set Q and the curator releases a synopsis S. ✔
Online (interactive): queries q1, q2, ... arrive one at a time and each answer a1, a2, ... must be given before the next query is seen. ?
H-Ligett-McSherry'12, Gupta-H-Roth-Ullman'11, H-Rothblum'10. See also: Roth-Roughgarden'10, Dwork-Rothblum-Vadhan'10, Dwork-Naor-Reingold-Rothblum-Vadhan'09, Blum-Ligett-Roth'08.
Private MW Online [H-Rothblum'10]. Algorithm, given query qt:
• If |qt(Dt) − qt(D)| < α/2 + Lap(1/ε0n): output qt(Dt)
• Otherwise: output qt(D) + Lap(1/ε0n) and set Dt+1 = Update(Dt, qt)
Achieves the same error bounds!
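A Python sketch of the online mechanism (hypothetical names; fresh Laplace noise is drawn for the threshold test and for the released answer, following the pseudocode above):

```python
import numpy as np

def private_mw_online(D, query_stream, n, alpha, eps0):
    """Answer an adaptive stream of statistical queries with the private MW mechanism."""
    N = len(D)
    Dt = np.full(N, 1.0 / N)                      # D0 = uniform
    eta = alpha / 2.0
    for q in query_stream:
        gap = q @ Dt - q @ D                      # qt(Dt) - qt(D)
        if abs(gap) < alpha / 2.0 + np.random.laplace(scale=1.0 / (eps0 * n)):
            yield float(q @ Dt)                   # lazy round: answer from the public Dt
        else:
            noisy = float(q @ D) + np.random.laplace(scale=1.0 / (eps0 * n))
            yield noisy                           # busy round: release a noisy true answer
            direction = np.sign(noisy - q @ Dt)   # use the noisy answer in the update
            Dt = Dt * np.exp(direction * eta * q)
            Dt = Dt / Dt.sum()
```

Usage would look like `answers = list(private_mw_online(D, queries, n, alpha=0.1, eps0=0.05))`, with the parameter values chosen from the error and privacy analysis.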
Overview: Privacy Analysis
• Offline setting: T << n steps; simple analysis using the composition theorems.
• Online setting: k >> n invocations of the Laplace mechanism; the composition theorems alone don't suggest small error!
• Idea: analyze the privacy loss like a lazy random walk (goes back to Dinur-Dwork-Nissim'03).
Privacy Loss as a lazy random walk. [Figure: privacy loss plotted against the number of steps; lazy rounds contribute nothing, busy rounds contribute a ±1 step.] A busy round is one where the noisy answer comes close to forcing an update. W.h.p. the total privacy loss is bounded by O(√(#busy)).
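A toy simulation (not from the talk) that illustrates the picture: lazy rounds contribute nothing, busy rounds contribute a ±ε0 step, and the walk typically ends within about ε0·√(#busy) of zero:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p_busy, eps0 = 10_000, 0.02, 0.1      # rounds, fraction of busy rounds, per-round loss

busy = rng.random(k) < p_busy            # lazy rounds contribute 0 to the privacy loss
steps = np.where(busy, rng.choice([-eps0, eps0], size=k), 0.0)
walk = np.cumsum(steps)                  # running privacy loss over the k rounds

print("busy rounds:       ", int(busy.sum()))
print("final |loss|:      ", abs(walk[-1]))
print("eps0 * sqrt(#busy):", eps0 * np.sqrt(busy.sum()))
```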
Formalizing the random walk: imagine the output of PMW is a 0/1 indicator vector v, where vt = 1 if round t is an update round and 0 otherwise. Recall: very few updates, so the vector is sparse. Theorem: the vector v is (ε,δ)-differentially private.
Let D, D' be neighboring DBs and let P, Q be the corresponding output distributions. Intuition: X = privacy loss. Approach: (1) sample v from P; (2) consider X = log(P(v)/Q(v)); (3) argue Pr{ |X| > ε } ≤ δ. Lemma: (3) implies (ε,δ)-differential privacy.
Total privacy loss X = X1 + ... + Xk, where Xt is the privacy loss in round t (as in DRV'10). We'll show: Xt = 0 if round t is not busy, and |Xt| ≤ ε0 if it is busy; the number of busy rounds is O(#updates); and E[X1 + ... + Xk] ≤ O(ε0² · #updates). Azuma then gives strong concentration around the expectation.
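A sketch of how these pieces combine via Azuma's inequality (standard form, with indicative constants rather than the talk's):

```latex
\Pr\Big[\Big|\sum_t X_t - \mathbb{E}\sum_t X_t\Big| > \lambda\Big]
\;\le\; 2\exp\!\Big(-\frac{\lambda^2}{2\,\varepsilon_0^2\cdot\#\mathrm{busy}}\Big),
\qquad\text{so w.h.p.}\quad
\Big|\sum_t X_t\Big| \;\le\; O\!\big(\varepsilon_0^2\cdot\#\mathrm{busy}\big)
+ O\!\big(\varepsilon_0\sqrt{\#\mathrm{busy}\cdot\log(1/\delta)}\big).
```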
Defining the "busy" event. Update condition: the noisy violation exceeds the threshold, i.e. |qt(Dt) − qt(D)| + Lap(1/ε0n) ≥ α/2. Busy event: the true violation |qt(Dt) − qt(D)| is already close to the threshold α/2, so the noise could plausibly trigger an update.
Offline (non-interactive): ✔ Online (interactive): ✔ (both settings handled)
What we can do
• Offline/batch setting: every set of linear queries
• Online/interactive setting: every sequence of adaptive and adversarial linear queries
• Theoretical performance: nearly optimal in the worst case; for instance-by-instance guarantees see H-Talwar'10 and Nikolov-Talwar (upcoming!), which use different techniques
• Practical performance: compares favorably to previous work! See Katrina's talk.
Are we done?
What we would like to do
Running time: linear dependence on |U|, and |U| is exponential in the number of attributes of the data. Can we get poly(n)?
• No, in the worst case for synthetic data [DNRRV09], even for simple query classes [Ullman-Vadhan'10]
• No, in the interactive setting without restricting the query class [Ullman'12]
What can we do about it?