Privacy by Learning the Database. Moritz Hardt, DIMACS, October 24, 2012
Isn’t privacy the opposite of learning the database?
Setup: the Curator holds a data set D, a multi-set over a universe U. The Analyst sends a query set Q; the Curator responds with a privacy-preserving structure S that is accurate on Q.
View the data set D as an N-dimensional histogram, where N = |U| and D[i] = # elements in D of type i. The normalized histogram is a distribution over the universe.
Statistical query q (aka linear/counting query): a vector q in [0,1]^N with answer q(D) := <q,D>, so q(D) in [0,1].
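To make the histogram view concrete, here is a minimal Python sketch (the function and variable names are mine, not the talk's) of building the normalized histogram and evaluating a statistical query as an inner product:

```python
import numpy as np

def normalized_histogram(data, universe_size):
    """Represent a multi-set over {0, ..., N-1} as a distribution D in [0,1]^N."""
    D = np.zeros(universe_size)
    for x in data:
        D[x] += 1.0
    return D / len(data)

def evaluate_query(q, D):
    """Statistical (linear/counting) query: q in [0,1]^N, answer q(D) = <q, D> in [0,1]."""
    return float(np.dot(q, D))

# Example: fraction of records whose type lies in a subset of the universe.
data = [0, 2, 2, 3, 3, 3]                     # toy data set, universe U = {0,...,4}
D = normalized_histogram(data, 5)
q = np.array([0, 0, 1, 1, 0], dtype=float)    # counting query: "is the type 2 or 3?"
print(evaluate_query(q, D))                   # 5/6 of the records
```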
Why statistical queries? Lots of data analysis reduces to multiple statistical queries • Perceptron, ID3 decision trees, PCA/SVM, k-means clustering [BlumDworkMcSherryNissim’05] • Any SQ-learning algorithm [Kearns’98] • includes “most” known PAC-learning algorithms
Curator’s wildest dream: release a single structure S that answers every query in Q accurately while preserving privacy. This seems hard!
Curator’s 2nd attempt: release the distribution of maximum entropy among those that answer every query in Q to within α. Intuition: Entropy implies privacy.
Two pleasant surprises:
• Approximately solved by the multiplicative weights update [Littlestone'89, ...]
• Can easily be made differentially private
Why did learning theorists care to solve privacy problems 20 years ago? Answer: Entropy implies generalization
Learning setup: an unknown concept, an example set Q, and a Learner that outputs a hypothesis h accurate on all examples. Maximizing entropy implies the hypothesis generalizes.
Learning | Privacy
Unknown concept | Sensitive database
Examples labeled by concept | Queries labeled by answer on DB
Hypothesis approximates target concept on examples | Synopsis approximates DB on query set
Must generalize | Must preserve privacy
How can we solve this? It is a concave maximization subject to linear constraints, so the Ellipsoid method would work. We'll take a different route.
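For concreteness, the program behind the "2nd attempt" can be written out as follows (a sketch reconstructed from the surrounding slides; α is the accuracy target):

```latex
\begin{align*}
\max_{D'} \quad & H(D') = -\sum_{i=1}^{N} D'[i]\,\log D'[i] \\
\text{s.t.} \quad & \big|\langle q, D'\rangle - \langle q, D\rangle\big| \le \alpha
\quad \text{for all } q \in Q, \\
& \sum_{i=1}^{N} D'[i] = 1, \qquad D'[i] \ge 0 .
\end{align*}
```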
Start with the uniform distribution D0 and ask: "What's wrong with it?" Some query q violates a constraint. Update: minimize the entropy loss subject to correcting that constraint. Closed-form expression for Dt+1? Well...
Relax, think, approximate: correct the violated constraint only approximately. Closed-form expression for Dt+1? YES!
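The closed form that comes out of the relaxed step is the multiplicative weights update; one standard way to write it (a sketch with learning rate η, for the case q(Dt) < q(D)) is:

```latex
D_{t+1}[i] \;=\; \frac{D_t[i]\,e^{\eta\, q[i]}}{\sum_{j=1}^{N} D_t[j]\,e^{\eta\, q[j]}},
\qquad \text{e.g. } \eta = \alpha/2 .
```

When q(Dt) > q(D), the sign of the exponent flips.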
[Figure: histograms of Dt and the true D at step t and after step t. Suppose q(Dt) < q(D); the update increases Dt on coordinates where q is large, moving q(Dt+1) toward q(D).]
Multiplicative Weights Update Algorithm:
  D0 = uniform
  For t = 1...T:
    Find a bad query q
    Dt+1 = Update(Dt, q)
How quickly do we run out of bad queries?
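A minimal non-private sketch of this loop in Python (names are mine, not the talk's; `queries` is a list of vectors in [0,1]^N and `alpha` is the target error):

```python
import numpy as np

def mw_update(Dt, q, direction, eta):
    """Multiplicative weights step: shift mass toward (or away from) large coordinates of q."""
    w = Dt * np.exp(direction * eta * q)
    return w / w.sum()

def mw_fit(D, queries, alpha, max_steps=1000):
    """Find a distribution Dt answering all queries within alpha of the true histogram D."""
    N = len(D)
    Dt = np.full(N, 1.0 / N)                 # D0 = uniform
    eta = alpha / 2.0
    for _ in range(max_steps):
        # Find a "bad" query: one whose answer on Dt is off by more than alpha.
        errors = [q @ D - q @ Dt for q in queries]
        j = int(np.argmax(np.abs(errors)))
        if abs(errors[j]) <= alpha:
            break                            # no bad queries left
        Dt = mw_update(Dt, queries[j], np.sign(errors[j]), eta)
    return Dt
```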
Progress Lemma: if q is α-bad, the potential drops by at least α²/4. Put Φt := RE(D ‖ Dt), the relative entropy between D and Dt.
Facts: Φ0 ≤ log N and Φt ≥ 0, so there are at most 4·log N / α² update steps.
Error bound: after T steps, α = O(√(log N / T)).
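One way to spell out the progress lemma, in the case q(D) − q(Dt) ≥ α (the other case is symmetric); this is the standard multiplicative-weights potential argument, so constants may differ from the talk's:

```latex
\Phi_t - \Phi_{t+1}
  \;=\; \eta\,\langle q, D\rangle - \log\!\Big(\sum_{i} D_t[i]\,e^{\eta q[i]}\Big)
  \;\ge\; \eta\big(q(D) - q(D_t)\big) - \eta^2
  \;\ge\; \eta\alpha - \eta^2 \;=\; \frac{\alpha^2}{4}
  \qquad\text{for } \eta = \alpha/2 .
```

Since Φ0 ≤ log N and Φt never goes negative, the potential can absorb at most 4·log N / α² such drops.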
Algorithm: D0 = uniform. For t = 1...T: find a bad query q; Dt+1 = Update(Dt, q). What about privacy? Finding the bad query is the only step that interacts with D.
Differential Privacy [Dwork-McSherry-Nissim-Smith'06]. Two data sets D, D' are called neighboring if they differ in one element. Definition (Differential Privacy): A randomized algorithm M(D) is called (ε,δ)-differentially private if for any two neighboring data sets D, D' and all events S: Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ.
Laplace Mechanism [DMNS'06]. Given query q: compute q(D) and output q(D) + Lap(1/ε0n). Fact: this satisfies ε0-differential privacy. Note: the sensitivity of q is 1/n.
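A sketch of this mechanism in Python (assuming the normalized-histogram representation from earlier; `eps0` is the per-query privacy parameter and `n` the database size):

```python
import numpy as np

def laplace_mechanism(q, D, n, eps0):
    """Answer q(D) with Laplace noise scaled to the query's sensitivity 1/n."""
    true_answer = float(q @ D)                                  # q(D) = <q, D> in [0,1]
    noise = np.random.laplace(loc=0.0, scale=1.0 / (eps0 * n))  # Lap(1/(eps0 * n))
    return true_answer + noise                                  # eps0-differentially private
```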
Query selection: compute the violation |q(D) − q(Dt)| for each query q1, ..., qk, add independent Lap(1/ε0n) noise to each, and pick the maximal noisy violation.
Lemma [McSherry-Talwar'07]: the selected index satisfies ε0-differential privacy, and w.h.p. its violation is within O(log k / (ε0n)) of the maximum.
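A Python sketch of the selection step (hypothetical names; this is the "report noisy max" view of the selection, matching the slide's use of Laplace noise):

```python
import numpy as np

def select_bad_query(queries, D, Dt, n, eps0):
    """Noisy selection of the query with the largest violation |q(D) - q(Dt)|."""
    violations = np.array([abs(q @ D - q @ Dt) for q in queries])
    noisy = violations + np.random.laplace(scale=1.0 / (eps0 * n), size=len(queries))
    return int(np.argmax(noisy))    # releasing the index satisfies eps0-differential privacy (lemma above)
```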
Also use the noisy answer in the update rule. Algorithm: D0 = uniform; for t = 1...T: noisy selection of q; Dt+1 = Update(Dt, q). Now each step satisfies ε0-differential privacy! The new error bound picks up the Laplace noise on top of the MW error. What is the total privacy guarantee?
T-fold composition of ε0-differential privacy satisfies:
Answer 1 [DMNS'06]: ε0·T-differential privacy.
Answer 2 [DRV'10]: (ε,δ)-differential privacy with a much smaller ε for small enough ε0 (the standard form of the bound is sketched below).
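The DRV'10 bound in its usual form (this is the standard statement of advanced composition, not copied from the slides):

```latex
\varepsilon \;=\; \sqrt{2T\ln(1/\delta)}\;\varepsilon_0 \;+\; T\,\varepsilon_0\big(e^{\varepsilon_0}-1\big)
\;\approx\; \varepsilon_0\sqrt{2T\ln(1/\delta)}
\qquad \text{for small enough } \varepsilon_0 .
```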
Combining the error bound with the composition theorems and optimizing T and ε0:
Theorem 1. On databases of size n, MW achieves ε-differential privacy with error α = Õ((log N · log|Q| / (εn))^(1/3)).
Theorem 2. MW achieves (ε,δ)-differential privacy with error α = Õ((√(log N) · log|Q| / (εn))^(1/2)).
Optimal dependence on |Q| and n.
Offline (non-interactive): the analyst sends the whole query set Q and the curator releases a synopsis S. ✔
Online (interactive): queries q1, q2, ... arrive one at a time and each answer a1, a2, ... must be given before the next query is seen. ?
H-Ligett-McSherry'12, Gupta-H-Roth-Ullman'11, H-Rothblum'10. See also: Roth-Roughgarden'10, Dwork-Rothblum-Vadhan'10, Dwork-Naor-Reingold-Rothblum-Vadhan'09, Blum-Ligett-Roth'08.
Private MW Online [H-Rothblum'10]. Algorithm, given query qt:
• If |qt(Dt) − qt(D)| < α/2 + Lap(1/ε0n): output qt(Dt)
• Otherwise: output qt(D) + Lap(1/ε0n) and set Dt+1 = Update(Dt, qt)
Achieves the same error bounds!
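A Python sketch of the online mechanism (hypothetical names; fresh Laplace noise is drawn for the threshold test and for the released answer, following the pseudocode above):

```python
import numpy as np

def private_mw_online(D, query_stream, n, alpha, eps0):
    """Answer an adaptive stream of statistical queries with the private MW mechanism."""
    N = len(D)
    Dt = np.full(N, 1.0 / N)                      # D0 = uniform
    eta = alpha / 2.0
    for q in query_stream:
        gap = q @ Dt - q @ D                      # qt(Dt) - qt(D)
        if abs(gap) < alpha / 2.0 + np.random.laplace(scale=1.0 / (eps0 * n)):
            yield float(q @ Dt)                   # lazy round: answer from the public Dt
        else:
            noisy = float(q @ D) + np.random.laplace(scale=1.0 / (eps0 * n))
            yield noisy                           # busy round: release a noisy true answer
            direction = np.sign(noisy - q @ Dt)   # use the noisy answer in the update
            Dt = Dt * np.exp(direction * eta * q)
            Dt = Dt / Dt.sum()
```

Usage would look like `answers = list(private_mw_online(D, queries, n, alpha=0.1, eps0=0.05))`, with the parameter values chosen from the error and privacy analysis.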
Overview: Privacy Analysis
• Offline setting: T << n steps; simple analysis using the composition theorems.
• Online setting: k >> n invocations of the Laplace mechanism; the composition theorems alone don't suggest small error!
• Idea: analyze the privacy loss like a lazy random walk (goes back to Dinur-Dwork-Nissim'03).
Privacy Loss as a lazy random walk. [Figure: privacy loss plotted against the number of steps; lazy rounds contribute nothing, busy rounds contribute a ±1 step.] A busy round is one where the noisy answer comes close to forcing an update. W.h.p. the total privacy loss is bounded by O(√(#busy)).
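A toy simulation (not from the talk) that illustrates the picture: lazy rounds contribute nothing, busy rounds contribute a ±ε0 step, and the walk typically ends within about ε0·√(#busy) of zero:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p_busy, eps0 = 10_000, 0.02, 0.1      # rounds, fraction of busy rounds, per-round loss

busy = rng.random(k) < p_busy            # lazy rounds contribute 0 to the privacy loss
steps = np.where(busy, rng.choice([-eps0, eps0], size=k), 0.0)
walk = np.cumsum(steps)                  # running privacy loss over the k rounds

print("busy rounds:       ", int(busy.sum()))
print("final |loss|:      ", abs(walk[-1]))
print("eps0 * sqrt(#busy):", eps0 * np.sqrt(busy.sum()))
```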
Formalizing the random walk: imagine the output of PMW is a 0/1 indicator vector v, where vt = 1 if round t is an update round and 0 otherwise. Recall: very few updates, so the vector is sparse. Theorem: the vector v is (ε,δ)-differentially private.
Let D, D' be neighboring DBs and let P, Q be the corresponding output distributions. Intuition: X = privacy loss. Approach: (1) sample v from P; (2) consider X = log(P(v)/Q(v)); (3) argue Pr{ |X| > ε } ≤ δ. Lemma: (3) implies (ε,δ)-differential privacy.
Total privacy loss X = X1 + ... + Xk, where Xt is the privacy loss in round t (as in DRV'10). We'll show: Xt = 0 if round t is not busy, and |Xt| ≤ ε0 if it is busy; the number of busy rounds is O(#updates); and E[X1 + ... + Xk] ≤ O(ε0² · #updates). Azuma then gives strong concentration around the expectation.
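A sketch of how these pieces combine via Azuma's inequality (standard form, with indicative constants rather than the talk's):

```latex
\Pr\Big[\Big|\sum_t X_t - \mathbb{E}\sum_t X_t\Big| > \lambda\Big]
\;\le\; 2\exp\!\Big(-\frac{\lambda^2}{2\,\varepsilon_0^2\cdot\#\mathrm{busy}}\Big),
\qquad\text{so w.h.p.}\quad
\Big|\sum_t X_t\Big| \;\le\; O\!\big(\varepsilon_0^2\cdot\#\mathrm{busy}\big)
+ O\!\big(\varepsilon_0\sqrt{\#\mathrm{busy}\cdot\log(1/\delta)}\big).
```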
Defining the "busy" event. Update condition: the noisy violation exceeds the threshold, i.e. |qt(Dt) − qt(D)| + Lap(1/ε0n) ≥ α/2. Busy event: the true violation |qt(Dt) − qt(D)| is already close to the threshold α/2, so the noise could plausibly trigger an update.
Offline (non-interactive): ✔ Online (interactive): ✔ (both settings handled)
What we can do
• Offline/batch setting: every set of linear queries
• Online/interactive setting: every sequence of adaptive and adversarial linear queries
• Theoretical performance: nearly optimal in the worst case; for instance-by-instance guarantees see H-Talwar'10 and Nikolov-Talwar (upcoming!), which use different techniques
• Practical performance: compares favorably to previous work! See Katrina's talk.
Are we done?
What we would like to do
Running time: linear dependence on |U|, and |U| is exponential in the number of attributes of the data. Can we get poly(n)?
• No, in the worst case for synthetic data [DNRRV09], even for simple query classes [Ullman-Vadhan'10]
• No, in the interactive setting without restricting the query class [Ullman'12]
What can we do about it?