Part 2: Basics


Presentation Transcript


  1. Part 2: Basics. Marianne Winslett1,4, Xiaokui Xiao2, Gerome Miklau3, Yin Yang4, Zhenjie Zhang4. 1 University of Illinois at Urbana-Champaign, USA; 2 Nanyang Technological University, Singapore; 3 University of Massachusetts Amherst, USA; 4 Advanced Digital Sciences Center, Singapore

  2. Agenda • Basics: Xiaokui • Query Processing (Part 1): Gerome • Query Processing (Part 2): Yin • Data Mining and Machine Learning: Zhenjie

  3. Formulation of Privacy • What information can be published? • Average height of US people • Height of an individual • Intuition: • If something is insensitive to the change of any individual tuple, then it should not be considered private • Example: • Assume that we arbitrarily change the height of an individual in the US • The average height of US people would remain roughly the same • i.e., The average height reveals little information about the exact height of any particular individual

  4. ε-Differential Privacy • Motivation: • It is OK to publish information that is insensitive to the change of any particular tuple in the dataset • Definition: • Neighboring datasets: Two datasets D1 and D2, such that D2 can be obtained by changing one single tuple in D1 • A randomized algorithm A satisfies ε-differential privacy, iff for any two neighboring datasets D1 and D2 and for any output O of A, Pr[A(D1) = O] ≤ e^ε · Pr[A(D2) = O]

  5. ε-Differential Privacy • Intuition: • It is OK to publish information that is insensitive to changes of any particular tuple • Definition: • Neighboring datasets: Two datasets D1 and D2, such that D2 can be obtained by changing one single tuple in D1 • A randomized algorithm A satisfies ε-differential privacy, iff for any two neighboring datasets D1 and D2 and for any output O of A, Pr[A(D1) = O] ≤ e^ε · Pr[A(D2) = O] • The value of ε decides the degree of privacy protection: the smaller ε is, the more tightly the ratio between the two output distributions is bounded, i.e., the stronger the privacy

  6. Achieving ε-Differential Privacy • Example: • Dataset: A set of patients • Objective: Release the number of diabetes patients with ε-differential privacy • It won’t work if we release the number directly: • D1: the original dataset • D2: modify an arbitrary patient in D1 • Pr[A(D1) = O] ≤ e^ε · Pr[A(D2) = O] does not hold for any ε, since the exact counts for D1 and D2 can differ • [Figure: the exact count places 100% of the probability mass on a single value of the # of diabetes patients]

  7. Achieving ε-Differential Privacy • Example: • Dataset: A set of patients • Objective: Release the number of diabetes patients with ε-differential privacy • Idea: • Perturb the number of diabetes patients to obtain a smooth distribution over the # of diabetes patients • [Figure: the released count’s probability mass is spread across the # of diabetes patients axis]

  9. Achieving ε-Differential Privacy • Example: • Dataset: A set of patients • Objective: Release the number of diabetes patients with ε-differential privacy • Idea: • Perturb the number of diabetes patients to obtain a smooth distribution • [Figure: the output distributions for neighboring datasets overlap, with their ratio bounded at every value of the # of diabetes patients]

  10. Laplace Distribution • Lap(λ): pdf(x) = 1/(2λ) · exp(−|x|/λ) • If x increases/decreases by Δ, then pdf(x) changes by a factor of at most exp(Δ/λ) • λ is referred to as the scale of the distribution
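
A minimal sketch of this ratio property (the scale and shift values are chosen arbitrarily for illustration):

```python
import math

def laplace_pdf(x, scale):
    """Density of the zero-mean Laplace distribution with the given scale."""
    return math.exp(-abs(x) / scale) / (2 * scale)

# Shifting x by delta changes the density by a factor of at most exp(delta / scale).
scale, x, delta = 2.0, 1.5, 1.0
ratio = laplace_pdf(x, scale) / laplace_pdf(x + delta, scale)
assert ratio <= math.exp(delta / scale) + 1e-12
print(ratio, math.exp(delta / scale))
```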

  11. Differential Privacy via Laplace Noise • Dataset: A set of patients • Objective: Release # of diabetes patients with ε-differential privacy • Method: Release the number + Laplace noise • Rationale: • D1: the original dataset; # of diabetes patients = n • D2: modify a patient in D1; # of diabetes patients = n′, with |n − n′| ≤ 1 • [Figure: the noisy counts for D1 and D2 follow two overlapping Laplace distributions whose ratio is bounded at every value of the # of diabetes patients]

  19. Differential Privacy via Laplace Noise • We aim to ensure ε-differential privacy • How large should λ be? • λ = Δ/ε, where Δ bounds how much the count can change across neighboring datasets • How large can Δ be? • Change of a patient’s data would change the number of diabetes patients by at most 1, i.e., Δ = 1 • Conclusion: Setting λ = 1/ε would ensure ε-differential privacy
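
A minimal sketch of this noisy-count release; the record format, the `has_diabetes` field, and the use of numpy are assumptions for illustration:

```python
import numpy as np

def noisy_count(records, predicate, epsilon):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    Changing one record changes the true count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical usage: patients is a list of dicts with a 'has_diabetes' flag.
patients = [{"has_diabetes": True}, {"has_diabetes": False}, {"has_diabetes": True}]
print(noisy_count(patients, lambda r: r["has_diabetes"], epsilon=0.1))
```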

  20. Differential Privacy via Laplace Noise • In general, if we want to release a value v • Add Laplace noise into v • To decide the scale λ of the Laplace noise • Look at the maximum change that can occur in v (when we change one tuple in the dataset) • Set λ to be proportional to the maximum change

  21. Differential Privacy via Laplace Noise • What if we have multiple values? • Add Laplace noise to each value • How do we decide the noise scale? • Look at the total change that can occur in the values when we modify one tuple in the data • Total change: sum of the absolute changes in the values (i.e., their difference in L1 norm) • Set the scale of the noise to be proportional to the maximum total change • The maximum total change is referred to as the sensitivity S of the values • Theorem [Dwork et al. 2006]: Adding Laplace noise of scale λ to each value ensures ε-differential privacy, if λ = S/ε
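
A sketch of the multi-value (vector) form; the function and parameter names are illustrative, and the L1 sensitivity must be supplied by the analyst:

```python
import numpy as np

def laplace_mechanism(values, l1_sensitivity, epsilon):
    """Add i.i.d. Laplace noise of scale l1_sensitivity/epsilon to each value.

    l1_sensitivity: maximum L1-norm change of the value vector when one
    tuple of the underlying dataset is modified.
    """
    values = np.asarray(values, dtype=float)
    scale = l1_sensitivity / epsilon
    return values + np.random.laplace(loc=0.0, scale=scale, size=values.shape)

# Hypothetical usage: two counts whose total (L1) change is at most 2.
print(laplace_mechanism([120, 57], l1_sensitivity=2.0, epsilon=0.5))
```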

  22. Sensitivity of Queries • Histogram • Sensitivity of the bin counts: 2 • Reason: When we modify a tuple in the dataset, at most two bin counts would change; furthermore, each bin count would change by at most 1 • Scale of Laplace noise required: 2/ε • For more complex queries, the derivation of sensitivity can be much more complicated • Example: Parameters of a logistic model
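
A sketch of a noisy histogram release using the sensitivity of 2 discussed above (the bin edges and example data are illustrative):

```python
import numpy as np

def noisy_histogram(data, bin_edges, epsilon):
    """Release histogram bin counts with epsilon-differential privacy.

    Under bounded DP, changing one tuple moves it between at most two bins,
    so the L1 sensitivity of the count vector is 2 and the noise scale is 2/epsilon.
    """
    counts, _ = np.histogram(data, bins=bin_edges)
    return counts + np.random.laplace(loc=0.0, scale=2.0 / epsilon, size=counts.shape)

# Hypothetical usage: heights in centimetres, bucketed into 10 cm bins.
heights = [162, 175, 158, 181, 169, 190]
print(noisy_histogram(heights, bin_edges=range(150, 201, 10), epsilon=0.5))
```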

  23. Geometric Mechanism • Suitable for queries with integer outputs [Ghosh et al. 2009] • Adds noise to each query answer; the noise follows a two-sided geometric distribution: Pr[noise = x] ∝ α^|x|, for integer x and some α ∈ (0, 1) • Queries with sensitivity Δ require geometric noise with α = exp(−ε/Δ) • [Figure: the two-sided geometric distribution is a discrete analogue of the Laplace distribution]
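
A sketch of a two-sided geometric noise sampler, drawn as the difference of two geometric variables (one of several equivalent constructions; the function names are illustrative):

```python
import math
import numpy as np

def two_sided_geometric_noise(epsilon, sensitivity=1):
    """Sample noise with Pr[noise = x] proportional to alpha**abs(x), alpha = exp(-epsilon/sensitivity).

    The difference of two i.i.d. geometric variables has exactly this
    two-sided geometric distribution.
    """
    alpha = math.exp(-epsilon / sensitivity)
    p = 1.0 - alpha                   # success probability of the geometric draws
    g1 = np.random.geometric(p) - 1   # number of failures before the first success
    g2 = np.random.geometric(p) - 1
    return g1 - g2

def geometric_mechanism(true_count, epsilon):
    """Integer-valued, epsilon-differentially private release of a count (sensitivity 1)."""
    return true_count + two_sided_geometric_noise(epsilon, sensitivity=1)

print(geometric_mechanism(42, epsilon=0.5))
```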

  24. Exponential Mechanism • Suitable for queries with non-numeric outputs [McSherry and Talwar 2007] • Example: The most frequent (top-1) itemset from a transaction database D • Basic Idea: • Sample one itemset from the set of all possible itemsets • Itemsets with higher frequencies are more likely to be sampled • Any change of a single transaction in D should lead to only a bounded change in the sampling probability

  25. Exponential Mechanism (cont.) • Details: • Denote the frequency of an itemset x as f(x) • Sampling probability of x: proportional to exp(β · f(x)), for a parameter β > 0 • Why this ensures ε-differential privacy: • When we change one transaction in the dataset • f(x) changes by at most 1 • exp(β · f(x)) changes by a factor of at most exp(β) • For any other itemset y, f(y) changes by at most 1 • The normalizing term Σy exp(β · f(y)) changes by a factor of at most exp(β) • Thus, the sampling probability of x changes by a factor of at most exp(2β) • We can achieve ε-differential privacy by setting β = ε/2

  26. Exponential Mechanism (cont.) • General case: • A dataset D • An output space O • A score function q(D, o), such that q(D, o) measures the “goodness” of output o given dataset D • Sample any o ∈ O with probability proportional to exp(β · q(D, o)) • Theorem [McSherry and Talwar 2007]: • Achieve ε-differential privacy by setting β = ε / (2S), where S denotes the sensitivity of the score function
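
A sketch of the general exponential mechanism; the candidate outputs, the score function, and the example data are supplied by the caller and are illustrative:

```python
import numpy as np

def exponential_mechanism(dataset, candidates, score_fn, sensitivity, epsilon):
    """Sample one candidate with probability proportional to exp(epsilon * score / (2 * sensitivity)).

    score_fn(dataset, candidate) is the 'goodness' score; sensitivity is the
    maximum change of that score when one tuple of the dataset is modified.
    """
    scores = np.array([score_fn(dataset, c) for c in candidates], dtype=float)
    # Subtracting the max score before exponentiating improves numerical
    # stability and does not change the sampling distribution.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probabilities = weights / weights.sum()
    index = np.random.choice(len(candidates), p=probabilities)
    return candidates[index]

# Hypothetical usage: pick a frequent item, with frequency as the score
# (sensitivity 1, since one transaction changes an item's frequency by at most 1).
transactions = [{"bread", "milk"}, {"bread"}, {"milk", "eggs"}, {"bread", "eggs"}]
items = ["bread", "milk", "eggs"]
freq = lambda data, item: sum(item in t for t in data)
print(exponential_mechanism(transactions, items, freq, sensitivity=1, epsilon=1.0))
```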

  27. Composition of Differential Privacy • What if we want to compute the top-k frequent itemsets with ε-differential privacy? • Solution: Apply the previous exponential mechanism k times, each with (ε/k)-differential privacy • Corollary from [McSherry and Talwar 2008]: • The sequential application of algorithms, each giving εi-differential privacy, would ensure (Σi εi)-differential privacy.
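
A sketch of this sequential composition with an even budget split across the k selections, reusing the `exponential_mechanism`, `transactions`, and `items` names from the sketch above; the without-replacement handling is an illustrative choice:

```python
def top_k_frequent_items(transactions, items, k, epsilon):
    """Select k items under epsilon-differential privacy by sequential composition:
    each of the k exponential-mechanism invocations gets a budget of epsilon / k."""
    remaining = list(items)
    selected = []
    per_step_epsilon = epsilon / k
    freq = lambda data, item: sum(item in t for t in data)
    for _ in range(k):
        choice = exponential_mechanism(transactions, remaining, freq,
                                       sensitivity=1, epsilon=per_step_epsilon)
        selected.append(choice)
        remaining.remove(choice)
    return selected

print(top_k_frequent_items(transactions, items, k=2, epsilon=1.0))
```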

  28. Variants of Differential Privacy • Alternative definition of neighboring datasets: • Two datasets D1 and D2, such that D2 is obtained by adding/deleting one tuple in D1 • Even if a tuple is added to or removed from the dataset, the output distribution of the algorithm is roughly the same • i.e., the output of the algorithm does not reveal the presence of a tuple • Refer to this version as “unbounded” differential privacy, and the previous version as “bounded” differential privacy

  29. Variants of Differential Privacy • Bounded: D2 is obtained by changing the values of one tuple in D1 • Unbounded: D2 is obtained by adding/removing one tuple in D1 • Observation 1 • Change of a tuple can be regarded as removing a tuple from the dataset and then inserting a new one • Indication: Unbounded ε-differential privacy implies bounded 2ε-differential privacy • Proof sketch: Let D′ be D1 with the modified tuple removed; then Pr[A(D1) = O] ≤ e^ε · Pr[A(D′) = O] ≤ e^(2ε) · Pr[A(D2) = O]

  30. Variants of Differential Privacy • Bounded: D2 is obtained by changing the values of one tuple in D1 • Unbounded: D2 is obtained by adding/removing one tuple in D1 • Observation 2 • Bounded differential privacy allows us to directly publish the number of tuples in the dataset (it is identical for all neighboring datasets) • Unbounded differential privacy does not allow this

  31. Variants of Differential Privacy • (ε, δ)-differential privacy • Allows a small probability δ of failure • For any two neighboring datasets D1 and D2, and for any set S of outputs, Pr[A(D1) ∈ S] ≤ e^ε · Pr[A(D2) ∈ S] + δ • Relaxation of privacy provides more room for utility • Example: Use of Gaussian noise instead of Laplace noise • Composition: a set of (εi, δi)-differentially private algorithms → (Σi εi, Σi δi)-differential privacy • [Figure: the δ term covers the region where the e^ε ratio bound on the output distributions is not satisfied]
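
A sketch of the Gaussian-noise example. The calibration σ = √(2 ln(1.25/δ)) · Δ2/ε (valid for ε < 1, with Δ2 the L2 sensitivity) is the classical analysis from the differential-privacy literature, not something stated on this slide; the names and example values are illustrative:

```python
import math
import numpy as np

def gaussian_mechanism(values, l2_sensitivity, epsilon, delta):
    """Release values with (epsilon, delta)-differential privacy via Gaussian noise.

    Uses the classical calibration sigma = sqrt(2 * ln(1.25/delta)) * l2_sensitivity / epsilon,
    which is valid for epsilon < 1.
    """
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon
    values = np.asarray(values, dtype=float)
    return values + np.random.normal(loc=0.0, scale=sigma, size=values.shape)

# Hypothetical usage: a vector of counts whose L2 sensitivity is 1.
print(gaussian_mechanism([17, 42, 8], l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))
```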

  32. Limitations of Differential Privacy • Differential privacy tends to be less effective when there exist correlations among the tuples • Example (from [Kifer and Machanavajjhala 2011]): • Bob’s family includes 10 people, and all of them are in a database • There is a highly contagious disease, such that if one family member contracts the disease, then the whole family will be infected • Differential privacy would underestimate the risk of disclosure in this case • Summary: The amount of noise needed depends on the correlations among the tuples, which are not captured by differential privacy
