Private Statistics: A TCS Perspective. Gautam Kamath Simons Institute University of Waterloo Data Privacy: Foundations and Applications Boot Camp January 29, 2019. Outline. Setting and Goals Hypothesis Testing Distribution Estimation (Some) Other Statistical Tasks.
Outline • Setting and Goals • Hypothesis Testing • Distribution Estimation • (Some) Other Statistical Tasks
Algorithms vs. Statistics • Algorithms: dataset $X$ → $M$ → “utility” • Statistics: distribution $p$ → (random sampling) → dataset $X$ → $M$ → “utility”
Privacy in Statistics • Desiderata: • Algorithm is accurate (with high probability over the sampled dataset $X \sim p$) • May require assumptions about $p$ to hold • Algorithm is private (always) • This talk: $\varepsilon$-differentially private (usually)
Privacy and Utility • [Figure: privacy versus utility]
Why Worst-Case Privacy? • Can violate privacy of outliers • Salaries: Noise, then release dataset
Why Worst-Case Privacy? • Can violate privacy of outliers • Statistics can be retracted, private information can’t
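Privacy? • $M$ is $\varepsilon$-DP if for all inputs $X, X'$ which differ on one entry, and all sets of outputs $S$: $\Pr[M(X) \in S] \le e^{\varepsilon} \Pr[M(X') \in S]$ • This talk: Less sensitive statistics are cheaper to privatize • Sensitivity of $f$: $\Delta_f = \max_{X, X' \text{ neighboring}} |f(X) - f(X')|$ • Biggest difference on two neighboring datasets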
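To make "less sensitive statistics are cheaper to privatize" concrete, here is a minimal sketch of the standard Laplace mechanism applied to a mean. The function names, the assumption that values lie in $[0, 1]$, and the choice of the mean as the statistic are illustrative and do not come from the talk.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(data, epsilon, rng):
    """epsilon-DP mean of values assumed (hypothetically) to lie in [0, 1]."""
    n = len(data)
    clipped = [min(1.0, max(0.0, x)) for x in data]
    # Replacing one entry moves the mean of [0, 1] values by at most 1/n.
    sensitivity = 1.0 / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
data = [rng.random() for _ in range(10_000)]
true_mean = sum(data) / len(data)
estimate = private_mean(data, epsilon=1.0, rng=rng)
```

The noise scale is sensitivity divided by $\varepsilon$, so a statistic whose sensitivity shrinks like $1/n$ needs noise that vanishes as the dataset grows.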
“Utility”? • How much data is needed to approximately infer a property of the underlying distribution with high probability? • With $n$ samples, the estimate is within error $\alpha$ of the truth, with high probability. • Probably Approximately Correct (PAC) Learning [L. Valiant ’84]
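The Cost of Privacy • How much more data is needed to guarantee privacy? • Sample complexity $n = n_{\text{non-private}} + n_{\text{private}}$ • $n_{\text{non-private}}$: non-private cost • $n_{\text{private}}$: additional cost due to privacy • In what situations is $n_{\text{private}} \ll n_{\text{non-private}}$?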
Asymptotic: One word, two meanings • Asymptotic statistics • Guarantees when sample size approaches infinity • Example: “As $n \to \infty$, the statistic $T_n$ converges to its limiting distribution.” • Doesn’t quantify “error” for finite $n$ • Asymptotics for computer scientists • Hide constant factors and lower order terms • Example: “To achieve accuracy $\alpha$, we require $O(f(\alpha))$ samples.” • Read: require $n \le C \cdot f(\alpha)$, for some fixed (known, but hidden) constant $C$ • Is “asymptotic optimality” good enough?
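“Sample Complexity” versus “Rates” • Theorem: Given $X_1, \dots, X_n$ i.i.d. from $p$, there exists an $\varepsilon$-private algorithm that, with high probability, satisfies: • Sample complexity form: With $n \ge n(\alpha, \varepsilon)$ samples, the error is at most $\alpha$. • Rate form: For each $n$, the error is at most $\alpha(n, \varepsilon)$.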
Hypothesis Testing • Given a dataset $X$, was it generated from a distribution $p$ which satisfies some hypothesis? • “Yes or no?” question • $H_0$: the null hypothesis • Today: $p = q$? • $q$: some model of interest • Statisticians: Goodness-of-fit testing, one-sample testing • CS Theorists: Identity testing • Also today: $q = U_k$: uniform distribution over integers $\{1, \dots, k\}$ • Multinomial data • “Uniformity testing”
Classical Hypothesis Testing • Goal: If the null $H_0$ holds ($p = q$), probability of “rejecting” $H_0$ is at most the significance level (false positive rate) • Generally the important constraint, easier to control • If $H_0$ doesn’t hold, probability of “rejecting” $H_0$ is the power • Problem: What if $p$ is very close to $q$? • Often consider an “alternative hypothesis” $H_1$ • Can often control significance, but have to measure power • When $H_0$ holds, the statistic’s distribution is predictable • May require asymptotic approximations
Non-private: Pearson’s Chi-squared Test • Statistic: $T = \sum_{i=1}^{k} \frac{(N_i - n/k)^2}{n/k}$ • $H_0$: Uniform distribution over $[k]$ • $N_i$: number of occurrences of domain element $i$ • $n$: number of samples
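Non-private: Pearson’s Chi-squared Test • Theorem: If $H_0$ holds, as $n \to \infty$, the chi-squared statistic $T \to \chi^2_{k-1}$. • Chi-squared distribution: $\chi^2_d = \sum_{i=1}^{d} Z_i^2$, where $Z_i \sim N(0, 1)$ • Intuition: If $H_0$ holds, each normalized count $(N_i - n/k)/\sqrt{n/k}$ is approximately Gaussian • Use quantiles of $\chi^2_{k-1}$ to determine test outcome • If $T$ exceeds the quantile for the chosen significance level, output “reject” • The resulting $p$-value is only exact when $n \to \infty$ • Doesn’t account for asymptotic approximation • No guarantees about power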
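The test can be sketched in a few lines. This is an illustrative sketch, not code from the talk: it uses the textbook Pearson statistic $\sum_i (N_i - n/k)^2/(n/k)$, and it approximates the $\chi^2_{k-1}$ quantile by simulation (summing squared standard Gaussians) rather than calling a statistics library. The helper names and parameter values are assumptions.

```python
import random

def pearson_statistic(samples, k):
    """Pearson's chi-squared statistic against the uniform distribution on {0, ..., k-1}."""
    n = len(samples)
    counts = [0] * k
    for x in samples:
        counts[x] += 1
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

def chi2_quantile(df, level, rng, trials=20_000):
    """Monte Carlo quantile of chi^2_df, simulated as a sum of df squared N(0, 1) draws."""
    draws = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(trials))
    return draws[int(level * trials)]

rng = random.Random(1)
k, n = 10, 5_000
threshold = chi2_quantile(k - 1, 0.95, rng)  # significance level 0.05

uniform = [rng.randrange(k) for _ in range(n)]
# Hypothetical alternative: 30% of the samples are forced onto element 0.
skewed = [rng.randrange(k) if rng.random() < 0.7 else 0 for _ in range(n)]

reject_uniform = pearson_statistic(uniform, k) > threshold
reject_skewed = pearson_statistic(skewed, k) > threshold
```

The threshold here controls significance only through the asymptotic approximation, which is exactly the caveat raised on the slide.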
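Privatizing the Chi-Squared Test [Gaboardi-Lim-Rogers-Vadhan ’16] • Add noise to each count: $\tilde{N}_i = N_i + Y_i$, with $Y_i$ i.i.d. noise calibrated to $\varepsilon$; $\tilde{T}$ is the chi-squared statistic computed from the $\tilde{N}_i$ • Lemma: As $n \to \infty$, $\tilde{T} \to \chi^2_{k-1}$ • But finite-sample significance guarantees are now bad! • Use Monte Carlo to determine new thresholds • Analytically understand distribution for finite $n$ • Significance is now accurate, but power could be improved • Some post-processing helps... • [Kifer-Rogers ’17]: try to “project out” noise • Can we rigorously reason about the required size of $n$?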
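A rough sketch of the Monte Carlo idea: simulate the noisy statistic under the null, noise included, and take its empirical quantile as the rejection threshold. The Laplace noise with scale $2/\varepsilon$, the function names, and the parameter values are assumptions made for illustration; the cited papers should be consulted for the exact noise distributions and procedures they analyze.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_chi2(counts, n, epsilon, rng):
    """Chi-squared statistic (uniform null) computed from Laplace-noised counts."""
    k = len(counts)
    expected = n / k
    # Replacing one sample changes two counts by 1 each: L1 sensitivity 2.
    noisy = [c + laplace_noise(2.0 / epsilon, rng) for c in counts]
    return sum((c - expected) ** 2 / expected for c in noisy)

def uniform_counts(n, k, rng):
    counts = [0] * k
    for _ in range(n):
        counts[rng.randrange(k)] += 1
    return counts

def monte_carlo_threshold(n, k, epsilon, level, rng, trials=1_000):
    """Empirical `level`-quantile of the noisy statistic under the uniform null."""
    sims = sorted(
        noisy_chi2(uniform_counts(n, k, rng), n, epsilon, rng) for _ in range(trials)
    )
    return sims[int(level * trials)]

rng = random.Random(2)
n, k, epsilon = 500, 10, 0.5
private_threshold = monte_carlo_threshold(n, k, epsilon, 0.95, rng)
```

For these parameters the simulated threshold sits well above the classical $\chi^2_9$ quantile (about 16.9), which is why reusing the non-private threshold gives poor finite-sample significance.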
Minimax Hypothesis Testing • Alternative hypothesis $H_1$: all distributions which are $\alpha$-far from $q$ (in total variation distance) • Parameterized by $\alpha$ • What $n$ is required to make error rates under $H_0$ and $H_1$ both small? • Can be boosted to “high probability” at low cost
A minimax-optimal non-private test • $Z = \sum_{i=1}^{k} \frac{(N_i - n q_i)^2 - N_i}{n q_i}$ [Acharya-Daskalakis-K. ’15] • Subtracting $N_i$ allows us to bound the variance • Separate mean of $Z$ under $H_0$ and $H_1$, apply Chebyshev’s • Sample complexity: $\Theta(\sqrt{k}/\alpha^2)$ [Paninski ’08, G. Valiant-P. Valiant ’14] • “Sub-linear” in domain size $k$ • How much does privacy cost?
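A small sketch of a test built on this kind of statistic, assuming the form $Z = \sum_i ((N_i - n q_i)^2 - N_i)/(n q_i)$. The rejection threshold $2 n \alpha^2$ is an illustrative choice sitting between the null mean (about 0) and the mean under an $\alpha$-far alternative (at least $4 n \alpha^2$ for uniform $q$); it is not the constant from the paper, and Poissonization is skipped for brevity.

```python
import random

def adk_statistic(counts, q, n):
    """Chi-squared-style statistic with the counts subtracted in the numerator."""
    return sum(((c - n * qi) ** 2 - c) / (n * qi) for c, qi in zip(counts, q))

def identity_test(samples, q, alpha):
    """Return True to reject H0: p = q. The threshold is an illustrative assumption."""
    n = len(samples)
    counts = [0] * len(q)
    for x in samples:
        counts[x] += 1
    return adk_statistic(counts, q, n) > 2 * n * alpha ** 2

rng = random.Random(3)
k, n, alpha = 100, 2_000, 0.25
q = [1.0 / k] * k

null_samples = [rng.randrange(k) for _ in range(n)]
# Alternative at total variation distance exactly alpha from uniform:
# even elements get mass (1 + 2*alpha)/k, odd elements (1 - 2*alpha)/k.
far = [(1 + 2 * alpha) / k if i % 2 == 0 else (1 - 2 * alpha) / k for i in range(k)]
far_samples = rng.choices(range(k), weights=far, k=n)

reject_null = identity_test(null_samples, q, alpha)
reject_far = identity_test(far_samples, q, alpha)
```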
Subsample and Aggregate [Nissim-Raskhodnikova-Smith ’07] • Split dataset into $m$ parts • Compute function non-privately on each part • “Aggregate” results privately • Theorem: Private decision problems: $O(1/\varepsilon) \times$ non-private sample complexity • Proof: “Aggregate” = pick one of the results at random • Grants $(0, \varepsilon)$-DP, which converts to $(O(\varepsilon), 0)$-DP for decision problems • More general and powerful framework • Privatizing “Normal-ish” statistics [Smith ’11] • PATE [Papernot-Song-Mironov-Raghunathan-Talwar-Erlingsson ’18] • Baseline for private hypothesis testing: $O(\sqrt{k} / (\alpha^2 \varepsilon))$ samples
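The proof idea for decision problems fits in a few lines. In this sketch (names and the example test are hypothetical), the data is split into about $1/\varepsilon$ blocks, a non-private yes/no test runs on each block, and one block's answer is returned at random. Changing one entry affects a single block, so each output probability moves by at most $1/m \le \varepsilon$ additively.

```python
import math
import random

def subsample_and_aggregate(data, test, epsilon, rng):
    """Run `test` on m disjoint blocks and return the answer of one block chosen at random."""
    m = max(1, math.ceil(1.0 / epsilon))
    blocks = [data[i::m] for i in range(m)]  # m disjoint, near-equal blocks
    answers = [test(block) for block in blocks]
    return rng.choice(answers)

def mostly_ones(block):
    # Hypothetical non-private decision rule: is the empirical mean above 1/2?
    return sum(block) / len(block) > 0.5

rng = random.Random(5)
answer_ones = subsample_and_aggregate([1] * 1_000, mostly_ones, epsilon=0.1, rng=rng)
answer_zeros = subsample_and_aggregate([0] * 1_000, mostly_ones, epsilon=0.1, rng=rng)
```

Each block holds only a $1/m$ fraction of the data, which is where the multiplicative $1/\varepsilon$ overhead in sample complexity comes from.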
A Sensitivity-Limited Chi-Squared Test [Cai-Daskalakis-K. ’17] • Sensitivity of $Z$ is determined by $\max_i |N_i - n q_i|$ • Sensitive if a count is much larger than its expectation • But then it can’t be the right distribution! • If $\max_i |N_i - n q_i|$ is large, output “reject” • Else, noisily threshold $Z$ • Sample complexity: see [Cai-Daskalakis-K. ’17]
Even Better Tests! • Other optimal non-private statistics are more natural for privacy! • Counting number of non-observed elements [Paninski ’08] • Privatized in [Aliakbarpour-Diakonikolas-Rubinfeld ’18] • Empirical $\ell_1$-distance [Diakonikolas-Gouleakis-Peebles-Price ’18] • Privatized in [Acharya-Sun-Zhang ’18] • Sample complexity: $\Theta\left(\frac{\sqrt{k}}{\alpha^2} + \frac{\sqrt{k}}{\alpha \sqrt{\varepsilon}} + \frac{k^{1/3}}{\alpha^{4/3} \varepsilon^{2/3}} + \frac{1}{\alpha \varepsilon}\right)$ • Lower bounds in [Acharya-Sun-Zhang ’18]
Private Distribution Estimation • Given samples from $p$, (privately) learn $\hat{p}$ such that $d(p, \hat{p}) \le \alpha$. • Choice of distance $d$ may vary... And it really matters!!
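Univariate Learning: Multinomials • Privately estimate a discrete distribution over $[k]$ • In TV-distance: $O\left(\frac{k}{\alpha^2} + \frac{k}{\alpha \varepsilon}\right)$ samples [folklore] • See, e.g., [Diakonikolas-Hardt-Schmidt ’15] • Cost of privacy: minimal • In Kolmogorov distance:* samples with a $2^{O(\log^* k)}$ dependence on the domain size [Beimel-Nissim-Stemmer ’13] • $\Omega(\log^* k)$ samples required! [Bun-Nissim-Stemmer-Vadhan ’15] • Cost of privacy: Hmm... • *$(\varepsilon, \delta)$-DP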
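The folklore approach for estimating a discrete distribution can be sketched as a noisy histogram: count, add Laplace noise to each bin, clip negatives, and renormalize. The function names and parameters below are illustrative assumptions, not code from the cited work.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_histogram(samples, k, epsilon, rng):
    """epsilon-DP estimate of a distribution over {0, ..., k-1} via a noisy histogram."""
    counts = [0] * k
    for x in samples:
        counts[x] += 1
    # Replacing one sample changes two bins by 1 each: L1 sensitivity 2.
    noisy = [max(0.0, c + laplace_noise(2.0 / epsilon, rng)) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        return [1.0 / k] * k  # degenerate case: fall back to uniform
    return [c / total for c in noisy]

def tv_distance(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(4)
p = [0.4, 0.3, 0.15, 0.1, 0.05]
samples = rng.choices(range(len(p)), weights=p, k=20_000)
p_hat = private_histogram(samples, len(p), epsilon=1.0, rng=rng)
error = tv_distance(p, p_hat)
```

Clipping and renormalizing are post-processing of the noisy counts, so they do not affect the privacy guarantee.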
Univariate Learning: Gaussians • Privately estimate a Gaussian $N(\mu, \sigma^2)$, given a priori bounds on $\mu$ and $\sigma$ • In TV-distance: sample complexity in [Karwa-Vadhan ’18] • Equivalently: estimate $\mu$ and $\sigma$ in “scale invariant” fashion • Cost of privacy: Mild dependence on the scale parameters
Multivariate Learning: Product Distributions • Privately estimate mean of a binary product distribution over $\{0, 1\}^d$ • In $\ell_1$-distance: [folklore] • In $\ell_2$-distance: [K.-Li-Singhal-Ullman ’18] • Corresponds to learning the distribution in TV-distance • In $\ell_\infty$-distance:* [Bun-Ullman-Vadhan ’14] • Cost of privacy: exponential! • *$(\varepsilon, \delta)$-DP
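Multivariate Learning: Gaussians* • Privately estimate a Gaussian $N(\mu, \Sigma)$ in $\mathbb{R}^d$, given a priori bounds on $\mu$ and $\Sigma$ • In TV-distance: sample complexity in [K.-Li-Singhal-Ullman ’18] • Equivalently: estimate $\mu$ and $\Sigma$ in “scale invariant” fashion • Cost of privacy: Mild dependence on the scale parameters • *$(\varepsilon, \delta)$-DP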
Distribution Learning vs. Reconstruction • Do reconstruction attacks give good private learning lower bounds? • Not really: • Weak parameters • Kobbi’s talk: Can’t answer $O(n)$ queries with accuracy $o(1/\sqrt{n})$ • Gives an $\Omega(1/\alpha^2)$ lower bound: trivial • Type mismatch • Throw out half your data: reconstruction is impossible • Throw out half your data: NBD, sample twice as much
Distribution Learning vs. Linear Queries • Distribution learning • Learn all queries, but for a simple class • E.g., product distributions: samples • Linear queries • Learn some queries, but for a complex class • For queries, samples • More from Gerome and Sasho tomorrow!
Distributional Functional Estimation • $p$ is a discrete distribution over $[k]$, privately estimate some $f(p)$ • Support size, distance to uniformity, entropy • “Estimating the unseen” [G. Valiant-P. Valiant ’11] • Sample complexity in [Acharya-K.-Sun-Zhang ’18] • Cost of privacy: Negligible! • Privatizing low-sensitivity methods [Orlitsky-Suresh-Wu ’16], [Wu-Yang ’16]
Simple Hypothesis Testing • Determine whether $X$ was generated from (known) $p$ or $q$ • Compute likelihood of data, see which one is bigger • Neyman-Pearson Lemma: “The log-likelihood ratio test is optimal.” • Sample complexity: $\Theta(1/H^2(p, q))$ samples • Hellinger distance: $H^2(p, q) = \frac{1}{2} \sum_x \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2$ • How to privatize?
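As a concrete non-private baseline, the log-likelihood ratio test and the squared Hellinger distance can be written as below. The distributions and sample size are made up for illustration, and the sketch assumes $p$ and $q$ have full support so that every log ratio is finite.

```python
import math
import random

def hellinger_sq(p, q):
    """Squared Hellinger distance between two discrete distributions."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def log_likelihood_ratio(samples, p, q):
    # Assumes p[x] > 0 and q[x] > 0 for every observed x.
    return sum(math.log(p[x] / q[x]) for x in samples)

def llr_test(samples, p, q):
    """Return "p" if the data is at least as likely under p as under q, else "q"."""
    return "p" if log_likelihood_ratio(samples, p, q) >= 0 else "q"

rng = random.Random(6)
p = [0.6, 0.4]
q = [0.4, 0.6]
samples_from_p = rng.choices(range(2), weights=p, k=1_000)
decision = llr_test(samples_from_p, p, q)
h2 = hellinger_sq(p, q)
```

The ratio $\log(p(x)/q(x))$ is unbounded in general, which is the obstacle to simply adding noise to this statistic.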
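Private Simple Hypothesis Testing • Natural approach: add Laplace noise to the log-likelihood ratio, scaled to its sensitivity $\Delta$ • But... what is $\Delta$? • Simple private hypothesis testing: not so simple • Theorem: The noisy clamped log-likelihood ratio test (ncLLR) has the optimal sample complexity (up to constants) • [Canonne-K.-McMillan-Smith-Ullman ’18] • Not quite Neyman-Pearson...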
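One way to picture a noisy clamped log-likelihood ratio test is the sketch below: clamp each sample's log ratio to a bounded interval so the sum has bounded sensitivity, then add Laplace noise. The symmetric clamp interval $[-c, c]$, the decision threshold at 0, and all names are assumptions for illustration; the cited paper specifies its own clamping range and analysis.

```python
import math
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def clamped_log_ratio(px, qx, clamp):
    """log(px/qx) clamped to [-clamp, clamp]; zero probabilities map to the endpoints."""
    if qx == 0.0:
        return clamp
    if px == 0.0:
        return -clamp
    return max(-clamp, min(clamp, math.log(px / qx)))

def noisy_clamped_llr(samples, p, q, epsilon, clamp, rng):
    total = sum(clamped_log_ratio(p[x], q[x], clamp) for x in samples)
    # Changing one sample moves the clamped sum by at most 2 * clamp.
    return total + laplace_noise(2.0 * clamp / epsilon, rng)

rng = random.Random(7)
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
from_p = rng.choices(range(3), weights=p, k=1_000)
from_q = rng.choices(range(3), weights=q, k=1_000)
stat_p = noisy_clamped_llr(from_p, p, q, epsilon=1.0, clamp=1.0, rng=rng)
stat_q = noisy_clamped_llr(from_q, p, q, epsilon=1.0, clamp=1.0, rng=rng)
```

Note that element 2 has probability zero under $p$, so the unclamped ratio would be $-\infty$ on data from $q$; clamping is what keeps the sensitivity finite.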
Private Simple Hypothesis Testing on Binomials • $H_0$: $X \sim \mathrm{Bin}(n, \theta_0)$, $H_1$: $X \sim \mathrm{Bin}(n, \theta_1)$ • Neyman-Pearson: Threshold $X$ • Uniformly most powerful (UMP): same threshold is optimal for all $\theta_1 > \theta_0$ simultaneously • A UMP private test for Binomial data • Noise using a “Truncated-Uniform-Laplace” (Tulap) distribution • [Awan-Slavković ’18] • Improves upon overlapping work by [Ghosh-Roughgarden-Sundararajan ’09] • UMPs can’t exist when the domain is larger than 2 • [Brenner-Nissim ’10]
Changepoint Detection • $X_1, \dots, X_{k^*-1} \sim p$, $X_{k^*}, \dots, X_n \sim q$ • Output $\hat{k}$ which minimizes $|\hat{k} - k^*|$ • Non-private: Cumulative Sum (CUSUM) • Based on log-likelihood ratio test • Private analysis by [Cummings-Krehbiel-Mei-Tuo-Zhang ’18] • Same drawbacks as LLR... • Reduction from changepoint detection to simple hypothesis testing • Apply test from before • [Canonne-K.-McMillan-Smith-Ullman ’18]
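A minimal non-private sketch of the log-likelihood-based estimator: for each candidate position $k$, sum $\log(q(x_i)/p(x_i))$ over the samples from $k$ onward and return the maximizer. The distributions, sample size, 0-based indexing, and function name are illustrative assumptions; the private variants in the cited papers add noise and are not reproduced here.

```python
import math
import random

def estimate_changepoint(samples, p, q):
    """Offline maximum-likelihood changepoint: argmax over k of sum_{i >= k} log(q/p).

    Uses 0-based indexing: samples[:k] are attributed to p, samples[k:] to q.
    Assumes p and q have full support so every log ratio is finite.
    """
    best_k, best_score, score = len(samples), 0.0, 0.0
    for k in range(len(samples) - 1, -1, -1):
        score += math.log(q[samples[k]] / p[samples[k]])  # suffix sum from k onward
        if score > best_score:
            best_k, best_score = k, score
    return best_k

rng = random.Random(8)
p = [0.9, 0.1]
q = [0.1, 0.9]
true_k = 200
samples = (rng.choices(range(2), weights=p, k=true_k)
           + rng.choices(range(2), weights=q, k=200))
k_hat = estimate_changepoint(samples, p, q)
```

If no suffix has positive score, the function returns `len(samples)`, meaning no change was detected.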
Local Privacy • Hypothesis Testing • [Gaboardi-Rogers ’18], [Sheffet ’18], [Acharya-Canonne-Freitag-Tyagi ’19] • Distribution Estimation • [Duchi-Jordan-Wainwright ’13] • Multinomials: [Kairouz-Bonawitz-Ramage ’16], [Acharya-Sun-Zhang ’18], [Ye-Barg ’18] • Gaussians: [Gaboardi-Rogers-Sheffet ’19], [Joseph-Kulkarni-Mao-Wu ’18]
Other related tasks • PCA • [Chaudhuri-Sarwate-Sinha ’12], [Dwork-Talwar-Thakurta-Zhang ’14] • Clustering • [Wang-Wang-Singh ’15], [Balcan-Dick-Liang-Mou-Zhang ’17] • Computing Robust Statistics • [Dwork-Lei ’09]