PrivBayes: Private Data Release via Bayesian Networks. Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao. Overview. The Problem: Private Data Release Differential Privacy Challenges The Algorithm: PrivBayes Bayesian Network Details of PrivBayes
PrivBayes: Private Data Release via Bayesian Networks • Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao
Overview • The Problem: Private Data Release • Differential Privacy • Challenges • The Algorithm: PrivBayes • Bayesian Network • Details of PrivBayes • Function $F$: Linear vs. Logarithmic • Experiments
Data Release • A company or institute holds a sensitive database and releases it to the public, which may include an adversary
Private Data Release • The company releases a synthetic database with properties similar to the sensitive database • accurate inference for the public, protection against the adversary • How can we design such a private data release algorithm?
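Differential Privacy [TCC’06] • Definition of $\varepsilon$-Differential Privacy • A randomized data release algorithm $A$ satisfies $\varepsilon$-differential privacy if, for any two neighboring datasets $D$, $D'$ and for any possible synthetic data $D^*$, $\Pr[A(D) = D^*] \le e^{\varepsilon} \cdot \Pr[A(D') = D^*]$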
Differential Privacy [TCC’06] • A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any individual! • More details in the Preliminaries part of the paper
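The Laplace noise injection mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper: the function names, the example counts, and the choice of sensitivity 1 (one individual changes one count by at most 1) are assumptions made for the example.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Add independent Laplace(sensitivity / epsilon) noise to each value."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

# Hypothetical counts; sensitivity=1.0 is an assumption about the query.
counts = [120.0, 45.0, 0.0, 310.0]
noisy_counts = laplace_mechanism(counts, sensitivity=1.0, epsilon=0.1,
                                 rng=random.Random(0))
print(noisy_counts)
```

A smaller $\varepsilon$ gives a larger noise scale, which is the privacy/accuracy trade-off the rest of the talk works around.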
Our Target • Design a data release algorithm with a differential privacy guarantee.
Challenges of Private Data Release • To build a synthetic database, we need to understand the tuple distribution of the sensitive data. • sensitive database → (convert) → full-dim tuple distribution → (+ noise) → noisy distribution → (sample) → synthetic database
Challenges of Private Data Release • Example: Database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute: • Scalability: full distribution has $20^{10} \approx 10^{13}$ cells • most of them have non-zero counts after noise injection • privacy is expensive (computation, storage) • Signal-to-noise: avg. information in each cell is $10^{7} / 10^{13} = 10^{-6}$; avg. noise is on the order of $1/\varepsilon$ (for privacy budget $\varepsilon$) • Previous solutions suffer from either the scalability or the signal-to-noise problem
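The arithmetic behind this example can be checked with a few lines. The database dimensions come from the slide; the value of `epsilon`, the sensitivity of 1, and the comparison with a 2-attribute marginal are assumptions added for illustration.

```python
n_tuples = 10_000_000        # 10M tuples (from the slide)
n_attributes = 10            # 10 attributes (from the slide)
values_per_attribute = 20    # 20 values per attribute (from the slide)

# Scalability: number of cells in the full-dimensional distribution.
full_cells = values_per_attribute ** n_attributes
avg_count_full = n_tuples / full_cells

# For contrast: a low-dimensional (2-attribute) marginal.
marginal_cells = values_per_attribute ** 2
avg_count_marginal = n_tuples / marginal_cells

# Assumed privacy budget and sensitivity, for illustration only.
epsilon = 0.1
noise_scale = 1.0 / epsilon

print(f"full distribution: {full_cells:.3e} cells, "
      f"avg count per cell {avg_count_full:.3e}")
print(f"2-way marginal:    {marginal_cells} cells, "
      f"avg count per cell {avg_count_marginal:.1f}")
print(f"Laplace noise scale per cell: {noise_scale}")
```

Under these assumptions the noise dwarfs the average full-dimensional cell count, while a 2-way marginal has counts far above the noise scale.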
PrivBayes: Dimension Reduction • Direct approach: sensitive database → (convert) → full-dim tuple distribution → (+ noise) → noisy distribution → (sample) → synthetic database • PrivBayes: full-dim tuple distribution → (approximate) → a set of low-dim distributions → (+ noise) → noisy low-dim distributions → (sample) → synthetic database
PrivBayes: Dimension Reduction • The advantages of using low-dimensional distributions • easy to compute • small domain -> high signal density -> robust against noise • But how do we find a set of low-dim distributions that provides a good approximation to the full distribution?
Bayesian Network • A 5-dimensional database with attributes: workclass, age, income, title, education
Bayesian Network • Attributes: workclass, age, income, title, education • The quality of the Bayesian network decides the quality of the approximation
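How a Bayesian network approximates the full distribution can be written out as a factorization. The notation is an assumption for this sketch: $X_1, \dots, X_d$ are the attributes and $\Pi_i$ is the parent set of $X_i$ in the network. The second line uses a hypothetical edge set over the five attributes above, chosen only to illustrate the shape of the product; the edges on the original slide did not survive extraction.

```latex
\Pr[X_1, \dots, X_d] \;\approx\; \prod_{i=1}^{d} \Pr[X_i \mid \Pi_i]

% Hypothetical 1-degree network: age -> workclass, age -> education,
% education -> title, title -> income
\Pr[\text{age}, \text{workclass}, \text{education}, \text{title}, \text{income}]
  \;\approx\; \Pr[\text{age}] \cdot \Pr[\text{workclass} \mid \text{age}]
  \cdot \Pr[\text{education} \mid \text{age}]
  \cdot \Pr[\text{title} \mid \text{education}]
  \cdot \Pr[\text{income} \mid \text{title}]
```

Each factor involves at most two attributes, so only low-dimensional distributions need to be computed and perturbed.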
Outline of the Algorithm • STEP 1: Choose a suitable Bayesian network $\mathcal{N}$ • must be done in a differentially private way • STEP 2: Compute conditional distributions implied by $\mathcal{N}$ • straightforward to do under differential privacy • inject noise – Laplace mechanism • STEP 3: Generate synthetic data by sampling from $\mathcal{N}$ • post-processing: no privacy issues
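Steps 2 and 3 can be sketched for a 1-degree network whose structure is already given. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the even split of $\varepsilon$ across attributes, the sensitivity of 2 for count tables (replacing one tuple changes two counts by 1 each), and clipping negative noisy counts to zero are all choices made for this example.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # Difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_joint(rows, cols, domains, epsilon, rng):
    """STEP 2: noisy counts over the joint domain of `cols`, clipped at 0."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    cells = [()]
    for c in cols:
        cells = [cell + (v,) for cell in cells for v in domains[c]]
    scale = 2.0 / epsilon  # assumed sensitivity of 2 for a count table
    return {cell: max(0.0, counts[cell] + laplace_noise(scale, rng))
            for cell in cells}

def sample_synthetic(rows, order, parent, domains, epsilon, n_out, rng):
    """STEP 3: sample tuples attribute by attribute from the noisy tables.

    `order` must list every parent before its children; `parent[a]` is the
    single parent of attribute `a`, or None for the root.
    """
    eps_each = epsilon / len(order)  # assumed even budget split
    tables = {}
    for attr in order:
        cols = ([parent[attr]] if parent[attr] is not None else []) + [attr]
        tables[attr] = noisy_joint(rows, cols, domains, eps_each, rng)
    synthetic = []
    for _ in range(n_out):
        row = {}
        for attr in order:
            prefix = (row[parent[attr]],) if parent[attr] is not None else ()
            weights = [tables[attr][prefix + (v,)] for v in domains[attr]]
            if sum(weights) <= 0:          # all counts clipped to zero
                weights = [1.0] * len(weights)
            row[attr] = rng.choices(domains[attr], weights=weights)[0]
        synthetic.append(row)
    return synthetic
```

Sampling only reads the noisy tables, never the original rows, which is why the slide treats it as post-processing.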
Optimal Bayesian Network • Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of mutual information $I$ of its edges, $\sum_{(X,Y) \in \text{edges}} I(X,Y)$, where $I(X,Y) = \sum_{x \in dom(X)} \sum_{y \in dom(Y)} \Pr[x,y] \log \frac{\Pr[x,y]}{\Pr[x]\Pr[y]}$ • This is equivalent to finding the maximum spanning tree, where the weight of edge $(X,Y)$ is the mutual information $I(X,Y)$.
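The edge weight used here, mutual information, can be computed from data as in the sketch below. The function name and the use of base-2 logarithms are assumptions of this example; the slide does not state the base.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Cells with zero joint probability contribute nothing and are
        # never visited, since they do not appear in `joint`.
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly correlated attributes versus independent attributes.
correlated = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
print(mutual_information(correlated), mutual_information(independent))
```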
Build a Bayesian Network • Build a 1-degree BN for database $D$ with attributes A, B, C, D • Start from a random attribute • Repeatedly select the next tree edge by its mutual information; the candidates are the edges from an attribute already in the tree to an attribute not yet in it • DONE when every attribute has been added
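The greedy construction walked through above can be sketched in its non-private form. The function names and the row-as-dict data layout are assumptions of this example, and the selection here is the exact maximum, not the randomized private selection introduced later in the talk.

```python
import math
import random
from collections import Counter

def mutual_information(rows, a, b):
    """Empirical mutual information (bits) between attributes a and b."""
    n = len(rows)
    joint = Counter((r[a], r[b]) for r in rows)
    pa = Counter(r[a] for r in rows)
    pb = Counter(r[b] for r in rows)
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in joint.items())

def build_tree(rows, attributes, rng=None):
    """Grow a 1-degree network: start at a random attribute, then keep
    adding the candidate edge (in-tree -> not-in-tree) with the largest
    mutual information until all attributes are covered."""
    rng = rng or random.Random()
    in_tree = [rng.choice(attributes)]
    remaining = [a for a in attributes if a != in_tree[0]]
    edges = []
    while remaining:
        candidates = [(p, c) for p in in_tree for c in remaining]
        parent, child = max(
            candidates,
            key=lambda e: mutual_information(rows, e[0], e[1]))
        edges.append((parent, child))
        in_tree.append(child)
        remaining.remove(child)
    return edges

# Toy data: B copies A, D copies C, and A is independent of C.
rows = [{"A": a, "B": a, "C": c, "D": c} for a in (0, 1) for c in (0, 1)]
print(build_tree(rows, ["A", "B", "C", "D"], rng=random.Random(0)))
```

On the toy data the tree always contains the A–B and C–D edges, since those are the only pairs with non-zero mutual information.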
$k$-degree Bayesian Network • It is NP-hard to train the optimal $k$-degree Bayesian network when $k > 1$ [JMLR’04]. • Most approximation algorithms are too complicated to be converted into private algorithms. • In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases. • In this talk, we focus on 1-degree cases for simplicity.
Private Bayesian Network • Do it under Differential Privacy! • (Non-private) select the edge with maximum $I$ • (Private) $I$ is data-sensitive -> the best edge is also data-sensitive • Solution: randomized edge selection!
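Exponential Mechanism [FOCS’07] • Quality function $q : \text{Databases} \times \text{Edges} \to \mathbb{R}$ • $q(D, e)$ measures how good edge $e$ is as the result of selection, given database $D$ • Return $e$ with probability $\Pr[e] \propto \exp\left(\frac{\varepsilon \cdot q(D, e)}{2\Delta(q)}\right)$, where $\Delta(q) = \max_{D, D', e} |q(D, e) - q(D', e)|$ over neighboring databases $D$, $D'$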
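A minimal sketch of exponential-mechanism selection follows. The function name, the candidate list, and the example scores are hypothetical; the selection probability is the standard $\exp(\varepsilon \cdot q / (2\Delta))$ form.

```python
import math
import random

def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * quality(candidate) / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [quality(c) for c in candidates]
    top = max(scores)
    # Subtracting the max score avoids overflow and leaves the normalized
    # selection probabilities unchanged.
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(candidates, weights=weights)[0]

# Hypothetical candidate edges with made-up quality scores.
scores = {("A", "B"): 0.9, ("A", "C"): 0.5, ("A", "D"): 0.1}
edge = exponential_mechanism(list(scores), scores.get,
                             sensitivity=0.05, epsilon=1.0,
                             rng=random.Random(0))
print(edge)
```

The larger the sensitivity relative to the gaps between scores, the closer the selection is to uniform, which is the difficulty raised on the next slide.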
Private Bayesian Network • Do it under Differential Privacy! • Select edges with exponential mechanism • define $q(\text{edge}) = I(\text{edge})$ • we prove $\Delta(I) = \Theta(\log n / n)$, where $n = |D|$. (Lemma 1) • Problem solved? NO: the sensitivity (noise scale) is too large for $I$
Basic Facts • $I$ and $F$ have a strong positive correlation
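Function $F$ • IDEA: define score $F$ to agree with $I$ at maximum values and interpolate linearly in-between • $\Pr^\diamond$: “optimal” distributions over $(X, Y)$ that maximize $I$ • $F$ measures how far the actual distribution is from the nearest optimal one • Range of $F$: $\Theta(1)$ • Sensitivity of $F$: $\Theta(1/n)$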
Function $F$ • [Figure: illustrative example with values 0.4 and 1.6]
$F$ vs. $I$ • [Figure: $F$ and $I$ of random distributions, with their correlation coefficient]
$F$ vs. $I$ • [Figure: results on the Adult dataset]
Dataset • We use four datasets in our experiments • Adult, NLTCS, TPC-E, BR2000 • Adult dataset • census data of 45,222 individuals • 15 attributes: age, workclass, education, marital status, etc. • tuple domain size (full-dimensional): about
Counting Queries • [Figure: two panels, each for the query "all $\alpha$-way marginals" at a different $\alpha$]
Multiple SVMs • Query: build 4 classifiers • [Figure panels: Adult, education; Adult, gender]
Concluding Remarks • Differential privacy can be applied effectively for data release • Key ideas of the solution: • Bayesian networks for dimension reduction • carefully designed linear quality function for the exponential mechanism • Many open problems remain: • extend to other forms of data: graph data, mobility data • obtain alternate (workable) privacy definitions • Thanks!
Previous Work • Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS’07] • incurs an exponential running time • only optimized for low-dimensional marginals • Differentially private publication of sparse data [ICDT’12] • achieves scalability, but does not help with the signal-to-noise problem • Differentially private spatial decompositions [ICDE’12] • coarsens the histogram H to control the number of cells • has some limits, e.g., range queries, ordinal domain
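$F$: Optimal Distributions • Assume that $|dom(X)| \le |dom(Y)|$. A distribution $\Pr[X, Y]$ maximizes the mutual information between $X$ and $Y$ if and only if • $\Pr[x] = 1/|dom(X)|$, for any $x \in dom(X)$; • For each $y \in dom(Y)$, there is at most one $x \in dom(X)$ with $\Pr[x, y] > 0$.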
Analogy: Logarithmic vs. Linear • two score functions for real and • neighboring databases and • Sensitivity (noise) max of derivative and