Privacy vs. Utility Xintao Wu University of North Carolina at Charlotte Nov 10, 2008
Privacy • Legal interpretation • Views privacy in terms of the access that others have to us and our information. • A general definition of privacy must be one that is measurable, of value, and actionable. • Measuring privacy • Secrecy: concerns the information that others may gather about us. • The probability of a data item being accessed • The change in an adversary's knowledge upon seeing the data • Anonymity: addresses how much we are in the public gaze. • Privacy leakage is measured in terms of the size of the blurring that accompanies the release of data. • Solitude: measures the degree to which others have physical access to us.
Privacy vs. Utility • Encryption does not work in the publishing scenario. • Utility • The goal of privacy-preservation measures is to restrict access to confidential information while at the same time releasing aggregate information to the public.
Data anonymization methods • Random perturbation • Input perturbation • Output perturbation • Generalization • The data domain has a natural hierarchical structure. • The degree of perturbation can be measured by the height of the resulting generalization above the leaf values. • Suppression • Permutation • Destroys the link between identifying and sensitive attributes that could otherwise lead to a privacy leak.
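Two of these methods can be illustrated with a minimal sketch. The 5-/10-year age bands and the noise range below are illustrative assumptions, not from the slides:

```python
import random

def perturb(value, noise_range=1.0):
    """Input perturbation: add uniform random noise to a numeric value."""
    return value + random.uniform(-noise_range, noise_range)

def generalize(age, level):
    """Generalize an age up a simple hierarchy.

    level 0: exact age; level 1: 5-year band; level 2: 10-year band;
    higher levels: fully suppressed ('*'). The height of the chosen level
    is exactly the "degree of perturbation" mentioned above.
    """
    if level == 0:
        return str(age)
    if level == 1:
        lo = (age // 5) * 5
        return f"{lo}-{lo + 4}"
    if level == 2:
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    return "*"
```

For example, `generalize(37, 1)` yields the band `"35-39"`, while level 2 widens it to `"30-39"` at the cost of more information loss.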
Statistical measures of anonymity • Query restriction • For a database of size N and a fixed parameter k, all queries that return either fewer than k or more than N-k records are rejected. • Can be subverted by requesting a specific sequence of queries • Anonymity via variance • Lower-bound the variance of estimators of sensitive attributes • Utility is measured (by combining the perturbation scheme with a query restriction method) as the fraction of queries permitted after perturbation. • Confidence interval • How hard it is to reconstruct the original data distribution • Anonymity via multiplicity • K-anonymity
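The query-set-size restriction above reduces to a simple check; a sketch (the function name is hypothetical):

```python
def query_allowed(result_size, n, k):
    """Query-set-size restriction: for a database of N records and
    parameter k, reject queries returning fewer than k or more than
    N - k records, since both extremes pin down individual records."""
    return k <= result_size <= n - k
```

A sequence of individually permitted queries can still be intersected to isolate one record, which is why this check alone is subvertible.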
Probabilistic measures of anonymity • Assume the adversary knows aggregate information about the data as well as the method of perturbation. • If X is perturbed with a random value drawn from [-1, 1], the privacy achieved (interval width) is 2. • If the distribution of X is revealed, e.g., uniform on [0, 1] with prob. 0.5 and on [4, 5] with prob. 0.5, the privacy achieved is reduced to 1, since the perturbed value betrays which interval X came from. • Mutual information • P(A|B) = 1 - 2^H(A|B) / 2^H(A) = 1 - 2^(-I(A;B)) • H(A) encodes the amount of uncertainty (the degree of privacy) in a random variable. • H(A|B): the amount of privacy left in A after B is released. • I(A;B) = H(A) - H(A|B): the mutual information between A and B. • Utility • Statistical distance between the source distribution of the data and the perturbed distribution.
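The mutual-information measure can be computed directly from the joint distribution of the original attribute A and the released attribute B. A small sketch (helper names are my own):

```python
import math

def mi_privacy(joint):
    """joint[a][b]: joint probability of original value a and released value b.
    Returns (I(A;B), privacy loss P = 1 - 2**(-I(A;B)))."""
    pa = [sum(row) for row in joint]            # marginal of A
    pb = [sum(col) for col in zip(*joint)]      # marginal of B
    i_ab = sum(
        joint[a][b] * math.log2(joint[a][b] / (pa[a] * pb[b]))
        for a in range(len(pa)) for b in range(len(pb))
        if joint[a][b] > 0
    )
    return i_ab, 1 - 2 ** (-i_ab)
```

When A and B are independent, I(A;B) = 0 and no privacy is lost; when B fully reveals A, the loss approaches its maximum.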
Market basket data • A privacy breach occurs when the probability of some property of the input data, conditioned on the output perturbed data having certain properties, is high. (Evfimievski et al.) • Privacy is measured in terms of the probability of correctly reconstructing the original bit, given a perturbed bit. (Rizvi and Haritsa) • Utility is the problem of reconstructing itemset frequencies accurately.
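The bit-perturbation idea can be sketched as a randomized-response-style flip together with the standard frequency-reconstruction inverse. This is a simplified illustration of the Rizvi–Haritsa approach, not their exact implementation:

```python
import random

def mask_bit(bit, p=0.9):
    """Keep the bit with probability p, flip it with probability 1 - p.
    Smaller p means a lower chance of correctly guessing the original bit."""
    return bit if random.random() < p else 1 - bit

def reconstruct_frequency(observed_freq, p):
    """If f is the true fraction of 1s, the observed fraction is
    f*p + (1-f)*(1-p) = f*(2p-1) + (1-p); invert that to estimate f."""
    return (observed_freq - (1 - p)) / (2 * p - 1)
```

Utility here is exactly how accurately `reconstruct_frequency` recovers itemset frequencies from the perturbed bits, while privacy is bounded by `p`.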
Measuring information transfer • [Figure: a randomization operator maps inputs x1 and x2 to the same output y] Looking back from y, there is no easy way of telling whether the source was x1 or x2. • Limiting privacy breaches in privacy preserving data mining, PODS03
Measures based on generalization • K-anonymity • L-diversity • P-sensitive-k-anonymity • T-closeness • L-diversity may be difficult and unnecessary to achieve • Suppose the sensitive attribute is the test result for a virus, with 99% of records negative; positive and negative results have different degrees of sensitivity. • L-diversity is insufficient to prevent attribute disclosure • Skewness attack, e.g., an equivalence class with an equal number of positive and negative records leaks far more than the 1% base rate. • Similarity attack, when the sensitive attribute values in an equivalence class are distinct but semantically similar. • T-closeness: the distance between the distribution of a sensitive attribute in each class and that of the attribute in the whole table is no more than t.
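As an illustration of the t-closeness check: if categories are treated as equidistant, the Earth Mover's Distance used by t-closeness reduces to the total variation distance between the class distribution and the table distribution. A sketch (function names are mine):

```python
def variational_distance(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def satisfies_t_closeness(class_dist, table_dist, t):
    """t-closeness for one equivalence class, assuming equidistant categories."""
    return variational_distance(class_dist, table_dist) <= t
```

For the virus example: with a table distribution of 99% negative, a class that is 50/50 positive/negative sits at distance 0.49 from the table and fails any reasonable t, which is precisely the skewness attack above.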
Permutation • The goal for the k-anonymous blocks is that the diameter of the range of sensitive attribute values within each block is larger than a parameter e. • Permutation-based anonymization can answer aggregate queries more accurately than generalization-based anonymization.
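The permutation idea, shuffling sensitive values inside each anonymized block so the link to identifying attributes is destroyed while block-level aggregates are preserved exactly, can be sketched as follows (field names are illustrative):

```python
import random

def permute_within_groups(records, group_key, sensitive):
    """Break the QI -> sensitive link by shuffling sensitive values
    within each block; per-block aggregates are left untouched."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r)
    out = []
    for members in groups.values():
        values = [r[sensitive] for r in members]
        random.shuffle(values)
        for r, v in zip(members, values):
            out.append({**r, sensitive: v})
    return out
```

Because each block keeps its exact multiset of sensitive values, aggregate queries over blocks are answered without the coarsening error that generalization introduces.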
Anonymizing inference • Protect against the inferences that can be made from the data. • A privacy template is an inference on the data coupled with a confidence bound; the requirement is that in the anonymized data this inference must not hold with confidence larger than the provided bound. • Wang et al. Handicapping attacker's confidence: an alternative to k-anonymization
Measuring utility in generalization based anonymity • The precision of a generalization scheme is 1 - the average generalization height, measured over all cells. • Bayardo and Agrawal ICDE 05
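A normalized version of this precision metric might be computed as follows; the sketch assumes each cell's generalization height is divided by its attribute's total hierarchy height, so precision 1.0 means ungeneralized data and 0.0 means full suppression:

```python
def precision(heights, max_heights):
    """Prec = 1 - average normalized generalization height over all cells.

    heights[i]: generalization level applied to cell i;
    max_heights[i]: total height of that cell's attribute hierarchy.
    """
    n = len(heights)
    return 1 - sum(h / m for h, m in zip(heights, max_heights)) / n
```

Two cells generalized halfway up their (hypothetical) hierarchies, e.g. level 1 of 2 and level 2 of 4, give a precision of 0.5.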
Utility vs. privacy • Most schemes for ensuring data anonymity focus on defining measures of anonymity while using ad hoc measures of utility. • One approach: after performing a standard anonymization, publish carefully chosen marginals of the source data; from these marginals, construct a consistent maximum-entropy distribution and measure utility as the KL divergence between this distribution and the source. • Kifer & Gehrke. Injecting utility into anonymized datasets. SIGMOD06 • Rastogi et al. The boundary between privacy and utility in data publishing
Computational measures of anonymity • Privacy statements are phrased in terms of the power of an adversary, rather than the amount of background knowledge the adversary possesses. • Dinur & Nissim. Revealing information while preserving privacy. PODS03 • Measuring anonymity via information transfer • Indistinguishability • A database is private if anything learnable from it can be learned in the absence of the database.
Anonymity via isolation • A record is private if it cannot be singled out from its neighbors. • An adversary is defined as an algorithm that takes an anonymized database and some auxiliary information, and outputs a single point q. • An anonymization is successful if the adversary, combining the anonymization with auxiliary information, can do no better at isolation than a weaker adversary with no access to the anonymized data.
Metrics for quantifying data quality • Quality of the data resulting from the PPDM (privacy-preserving data mining) process • Accuracy • Completeness • Consistency • Quality of the data mining results • Chapter 8.4
Measures • Oliveira & Zaiane. Privacy preserving frequent itemset mining. 2002
Generalization based • The data quality metric is based on the height of the generalization hierarchies. • Data should be generalized in as few steps as possible to preserve maximum utility. • Not all generalization steps are equal in terms of information loss. • General loss metric • Classification metric • Iyengar KDD02 • Discernibility metric • Bayardo & Agrawal ICDE05
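The discernibility metric can be sketched as follows: each record in an equivalence class of size |E| incurs a penalty of |E| (it is indistinguishable from |E| records), and each suppressed record incurs a penalty equal to the table size. This is a common formulation; the exact suppression penalty here is an assumption:

```python
def discernibility(class_sizes, suppressed, total_records):
    """DM = sum over equivalence classes of |E|^2, plus
    (number of suppressed records) * (total table size)."""
    return sum(s * s for s in class_sizes) + suppressed * total_records
```

Lower DM means records stay more distinguishable, so minimizing DM is one concrete way of generalizing "in as few steps as possible".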