CS573 Data Privacy and Security: Anonymization methods • Li Xiong
Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases
Anonymization methods • Non-perturbative: don't distort the data • Generalization • Suppression • Perturbative: distort the data • Microaggregation/clustering • Additive noise • Anatomization and permutation • De-associate relationship between QID and sensitive attribute
Concept of the Anatomy Algorithm • Release two tables: a quasi-identifier table (QIT) and a sensitive table (ST) • Both use the same QI groups (each satisfying l-diversity); in the QIT, the sensitive attribute values are replaced with a Group-ID column • The ST then gives, for each group, the count of each Disease value
Specifications of Anatomy cont. • DEFINITION 3 (Anatomy). Given an l-diverse partition, anatomy creates a QIT table and an ST table • The QIT has schema (Aqi_1, Aqi_2, ..., Aqi_d, Group-ID) • The ST has schema (Group-ID, As, Count)
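To make the construction concrete, here is a minimal sketch (not the authors' implementation) of how a pre-computed l-diverse partition could be split into the QIT and ST tables; the column names and the `groups` input format are illustrative assumptions.

```python
from collections import Counter

def anatomize(groups, qi_attrs, sens_attr):
    """Split an l-diverse partition into a QIT and an ST.

    groups: list of groups; each group is a list of records (dicts)
            that already satisfies l-diversity on sens_attr.
    Returns (qit, st) where
      qit rows: (qi_1, ..., qi_d, group_id)
      st  rows: (group_id, sensitive_value, count)
    """
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        # QIT keeps the exact QI values but only the group id,
        # de-associating them from the sensitive value.
        for rec in group:
            qit.append(tuple(rec[a] for a in qi_attrs) + (gid,))
        # ST publishes, per group, how often each sensitive value occurs.
        for value, count in Counter(rec[sens_attr] for rec in group).items():
            st.append((gid, value, count))
    return qit, st

# Illustrative usage on a toy 2-diverse partition.
groups = [
    [{"Age": 23, "Zip": "11000", "Disease": "flu"},
     {"Age": 27, "Zip": "11200", "Disease": "pneumonia"}],
    [{"Age": 41, "Zip": "30300", "Disease": "gastritis"},
     {"Age": 45, "Zip": "30700", "Disease": "dyspepsia"}],
]
qit, st = anatomize(groups, ["Age", "Zip"], "Disease")
```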
Privacy properties • THEOREM 1. Given a pair of QIT and ST, the probability of inferring the sensitive value of any individual is at most 1/l
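A one-line sketch of the reasoning (under the usual reading of l-diversity, where each sensitive value accounts for at most a 1/l fraction of its group; the notation here is assumed, not copied from the paper):

```latex
% The adversary locates the individual's QI group g via the QIT; the ST only
% reveals per-group counts, so for any sensitive value v
\[
  \Pr[A^{s}=v \mid \text{QIT},\text{ST}]
  \;=\; \frac{\mathrm{cnt}(v,g)}{|g|} \;\le\; \frac{1}{l}.
\]
```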
Comparison with generalization • Compare with generalization under two assumptions: • A1: the adversary has the QI values of the target individual • A2: the adversary also knows that the individual is definitely in the microdata • If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds • If A1 is true and A2 is false, generalization is stronger • If A1 and A2 are both false, generalization is still stronger
Preserving Data Correlation • Examine the correlation between Age and Disease in T using the probability density function (pdf) • Example: the pdf of tuple t1
Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the generalization table:
Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the QIT and ST tables:
Preserving Data Correlation cont. • For a more rigorous comparison, calculate the "L2 distance" with the following equation: • The distance for anatomy is 0.5, while the distance for generalization is 22.5
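The equation itself is not reproduced on the slide; a plausible reconstruction, treating the error as the L2 distance between the pdf reconstructed from the published tables and the exact pdf of a tuple t, is:

```latex
\[
  \mathrm{Err}(t) \;=\; \Bigl(\sum_{x}\bigl(\tilde{p}_t(x)-p_t(x)\bigr)^{2}\Bigr)^{1/2}
\]
```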
Preserving Data Correlation cont. • Idea: measure the error for each tuple with the following formula: • Objective: minimize the re-construction error (RCE) over all tuples t in T: • Algorithm: Nearly-Optimal Anatomizing Algorithm
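Again hedging on the exact form used in the paper, the objective can be written as the sum of the per-tuple errors:

```latex
\[
  \mathrm{RCE}(T) \;=\; \sum_{t \in T} \mathrm{Err}(t)
\]
```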
Experiments • Dataset: CENSUS, containing the personal information of 500k American adults described by 9 discrete attributes • Created two sets of microdata tables • Set 1: 5 tables OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As • Set 2: 5 tables SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As
Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases • Differential privacy
Attacks on k-Anonymity • k-Anonymity does not provide privacy if • Sensitive values in an equivalence class lack diversity (homogeneity attack) • The attacker has background knowledge (background knowledge attack) • (Figure: a 3-anonymous patient table illustrating both attacks)
l-Diversity [Machanavajjhala et al. ICDE ‘06] Sensitive attributes must be “diverse” within each quasi-identifier equivalence class
Distinct l-Diversity • Each equivalence class has at least l well-represented sensitive values • Doesn't prevent probabilistic inference attacks: in a class of 10 records where 8 have HIV and 2 have other values, an adversary concludes HIV with 80% confidence
Other Versions of l-Diversity • Probabilistic l-diversity • The frequency of the most frequent value in an equivalence class is bounded by 1/l • Entropy l-diversity • The entropy of the distribution of sensitive values in each equivalence class is at least log(l) • Recursive (c,l)-diversity • r1 < c·(rl + r(l+1) + … + rm), where ri is the frequency of the i-th most frequent value • Intuition: the most frequent value does not appear too frequently
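As a concrete illustration of these definitions, here is a minimal sketch (the function and variable names are illustrative) that checks distinct, probabilistic, and entropy l-diversity for a single equivalence class, using the skewed HIV example from the previous slide:

```python
import math
from collections import Counter

def diversity_checks(sensitive_values, l):
    """Check one equivalence class against three l-diversity variants."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    freqs = [c / n for c in counts.values()]

    distinct = len(counts) >= l                      # distinct l-diversity
    probabilistic = max(freqs) <= 1.0 / l            # most frequent value bounded by 1/l
    entropy = -sum(f * math.log(f) for f in freqs)   # entropy of the class
    entropic = entropy >= math.log(l)                # entropy l-diversity

    return {"distinct": distinct, "probabilistic": probabilistic, "entropy": entropic}

# The skewed example: 8 HIV records and 2 other values in a class of 10.
print(diversity_checks(["HIV"] * 8 + ["flu", "cold"], l=3))
# distinct 3-diversity holds, but probabilistic and entropy 3-diversity fail,
# reflecting the probabilistic-inference weakness of the distinct variant.
```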
Neither Necessary, Nor Sufficient • Original dataset: 99% of the records have cancer
Neither Necessary, Nor Sufficient • Anonymization A: a quasi-identifier group with 50% cancer counts as "diverse", yet it leaks a great deal of information relative to the 99% baseline, so l-diversity is not sufficient
Neither Necessary, Nor Sufficient • Anonymization B: a quasi-identifier group with 99% cancer is not "diverse", yet it reveals essentially nothing beyond the baseline, so l-diversity is not necessary either
Limitations of l-Diversity • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Very different degrees of sensitivity! • l-diversity is unnecessary • 2-diversity is unnecessary for an equivalence class that contains only HIV- records • l-diversity is difficult to achieve • Suppose there are 10000 records in total • To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes
Skewness Attack • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Consider an equivalence class that contains an equal number of HIV+ and HIV- records • Diverse, but potentially violates privacy! • l-diversity does not differentiate: • Equivalence class 1: 49 HIV+ and 1 HIV- • Equivalence class 2: 1 HIV+ and 49 HIV- l-diversity does not consider overall distribution of sensitive values!
Sensitive Attribute Disclosure • Similarity attack on a 3-diverse patient table • Conclusion: Bob's salary is in [20k, 40k], which is relatively low, and Bob has some stomach-related disease • l-diversity does not consider semantics of sensitive values!
t-Closeness: A New Privacy Measure • Rationale • Let Q be the overall distribution of sensitive values (external knowledge) and Pi the distribution of sensitive values in each equivalence class • Observations: Q is public or can be derived; the gap between Q and Pi is the potential knowledge gain about specific individuals • Principle: the distance between Q and Pi should be bounded by a threshold t
t-Closeness [Li et al. ICDE ‘07] Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database
Distance Measures • P = (p1, p2, …, pm), Q = (q1, q2, …, qm) • Trace-distance • KL-divergence • Neither measure reflects the semantic distance among values • Q: {3K,4K,5K,6K,7K,8K,9K,10K,11K}, P1: {3K,4K,5K}, P2: {6K,8K,11K} • Intuitively, D[P1,Q] > D[P2,Q]
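The two formulas referenced on the slide are the standard ones; for P = (p1, …, pm) and Q = (q1, …, qm):

```latex
\[
  D_{\mathrm{trace}}[P,Q] \;=\; \tfrac{1}{2}\sum_{i=1}^{m} |p_i - q_i|,
  \qquad
  D_{\mathrm{KL}}[P,Q] \;=\; \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}.
\]
```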
Earth Mover's Distance • If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other • The cost is the amount of dirt moved multiplied by the distance it is moved • Assume the two piles have the same amount of dirt • Extensions exist for comparing distributions with different total masses: • Allow a partial match and discard leftover "dirt" without cost • Allow mass to be created or destroyed, but with a cost penalty
Earth Mover's Distance • Formulation • P = (p1, p2, …, pm), Q = (q1, q2, …, qm) • dij: the ground distance between element i of P and element j of Q • Find a flow F = [fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work subject to the constraints below:
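The objective and constraints are not reproduced on the slide; one standard way to write the transportation-problem formulation, assuming both distributions have total mass 1, is:

```latex
\[
  \min_{F}\; \mathrm{WORK}(P,Q,F) \;=\; \sum_{i=1}^{m}\sum_{j=1}^{m} d_{ij}\, f_{ij}
\]
\[
  \text{subject to}\quad
  f_{ij} \ge 0 \;\;(1 \le i,j \le m), \qquad
  p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i \;\;(1 \le i \le m),
\]
\[
  \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij} \;=\; \sum_{i=1}^{m} p_i \;=\; \sum_{i=1}^{m} q_i \;=\; 1.
\]
```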
How to Calculate EMD (cont.) • EMD for categorical attributes • Hierarchical distance • Hierarchical distance is a metric
Earth Mover's Distance • Example • P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k} • Move 1/9 probability mass for each of the following pairs • 3k->6k, 3k->7k, cost: 1/9·(3+4)/8 • 4k->8k, 4k->9k, cost: 1/9·(4+5)/8 • 5k->10k, 5k->11k, cost: 1/9·(5+6)/8 • Total cost: 1/9·27/8 = 0.375 • With P2 = {6k,8k,11k}, the total cost is 1/9·12/8 ≈ 0.167 < 0.375, which matches intuition better than the other two distance measures
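For numerical attributes over an equally spaced, ordered domain, EMD with ground distance |i-j|/(m-1) has the usual one-dimensional closed form (sum of absolute cumulative differences), which makes the numbers above easy to check; this small sketch reproduces 0.375 and roughly 0.167:

```python
def ordered_emd(p, q):
    """EMD between two distributions over an ordered domain of m values,
    with ground distance |i - j| / (m - 1), computed via the standard
    1-D closed form: sum of absolute cumulative differences, normalized."""
    m = len(p)
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total / (m - 1)

# Domain: {3k, 4k, ..., 11k} (m = 9); Q is uniform over the whole domain.
q  = [1 / 9] * 9
p1 = [1 / 3, 1 / 3, 1 / 3, 0, 0, 0, 0, 0, 0]   # {3k, 4k, 5k}
p2 = [0, 0, 0, 1 / 3, 0, 1 / 3, 0, 0, 1 / 3]   # {6k, 8k, 11k}

print(ordered_emd(p1, q))  # 0.375
print(ordered_emd(p2, q))  # ~0.167
```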
Experiments • Goal • To show l-diversity does not provide sufficient privacy protection (the similarity attack). • To show the efficiency and data quality of using t-closeness are comparable with other privacy measures. • Setup • Adult dataset from UC Irvine ML repository • 30162 tuples, 9 attributes (2 sensitive attributes) • Algorithm: Incognito
Experiments • Comparisons of privacy measurements • k-Anonymity • Entropy l-diversity • Recursive (c,l)-diversity • k-Anonymity with t-closeness
Experiments • Efficiency • The efficiency of using t-closeness is comparable with other privacy measurements
Experiments • Data utility • Discernibility metric; Minimum average group size • The data quality of using t-closeness is comparable with other privacy measurements
Anonymous, “t-Close” Dataset This is k-anonymous, l-diverse and t-close… …so secure, right?
What Does Attacker Know? Bob is Caucasian and I heard he was admitted to hospital with flu…
What Does Attacker Know? Bob is Caucasian and I heard he was admitted to hospital … And I know three other Caucasians admitted to hospital with Acne or Shingles …
k-Anonymity and Partition-based notions • Syntactic • Focuses on data transformation, not on what can be learned from the anonymized dataset • “k-anonymous” dataset can leak sensitive information • “Quasi-identifier” fallacy • Assumes a priori that attacker will not know certain information about his target
Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases • Definitions and early methods • Output perturbation and differential privacy
Statistical Data Release • Originated from the study of statistical databases • A statistical database is a database that provides statistics on subsets of records • OLAP vs. OLTP • Statistics may include SUM, MEAN, MEDIAN, COUNT, MAX, and MIN over subsets of records
Types of Statistical Databases • Static – built once and never changes; example: U.S. Census • Dynamic – changes continuously to reflect real-time data; example: most online research databases
Types of Statistical Databases • Centralized – a single database • Decentralized – multiple distributed databases • General purpose – e.g., census • Special purpose – e.g., bank, hospital, academia
Data Compromise • Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual • Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance • Positive compromise – determine an attribute has a particular value • Negative compromise – determine an attribute does not have a particular value • Relative compromise – determine the ranking of some confidential values
Statistical Quality of Information • Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate • Precision – variance of the estimators obtained by users • Consistency – lack of contradictions and paradoxes • Contradictions: different responses to same query; average differs from sum/count • Paradox: negative count
Methods • Query restriction • Data perturbation/anonymization • Output perturbation
Output Perturbation • The query is evaluated on the original data, and the true results are perturbed before being returned to the user (figure: query and perturbed results flowing between the user and the database)
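As a preview of the output-perturbation idea (and of the differential-privacy mechanisms listed earlier in the outline), here is a minimal, illustrative sketch that perturbs the result of a COUNT query with Laplace noise; the function name, parameters, and the choice of the Laplace mechanism are assumptions for illustration, not a specific system's API:

```python
import numpy as np

def noisy_count(records, predicate, epsilon):
    """Answer a COUNT query with Laplace output perturbation.

    The true count is computed on the unperturbed data; Laplace noise with
    scale sensitivity/epsilon (the sensitivity of a COUNT query is 1) is
    added before the answer is released to the querier.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative usage: a noisy count of HIV records, released with epsilon = 0.5.
records = [{"Disease": "HIV"}, {"Disease": "flu"}, {"Disease": "HIV"}]
print(noisy_count(records, lambda r: r["Disease"] == "HIV", epsilon=0.5))
```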
Statistical data release vs. data anonymization • Data anonymization is one technique that can be used to build a statistical database • Other techniques such as query restriction and output perturbation can be used to build a statistical database or to release statistical data • Different privacy principles can be used