CS573 Data Privacy and Security: Anonymization methods • Li Xiong
Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases
Anonymization methods • Non-perturbative: don't distort the data • Generalization • Suppression • Perturbative: distort the data • Microaggregation/clustering • Additive noise • Anatomization and permutation • De-associate relationship between QID and sensitive attribute
Concept of the Anatomy Algorithm • Release two tables: a quasi-identifier table (QIT) and a sensitive table (ST) • Both use the same QI groups (each satisfying l-diversity); in the QIT, the sensitive attribute values are replaced with a Group-ID column • The ST then gives, for each group, the count of each Disease value
Specifications of Anatomy cont. • DEFINITION 3 (Anatomy). Given an l-diverse partition, anatomy creates a QIT table and an ST table • The QIT has schema (Aqi_1, Aqi_2, ..., Aqi_d, Group-ID) • The ST has schema (Group-ID, As, Count)
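To make the construction concrete, here is a minimal sketch (not the authors' implementation) of how a pre-computed l-diverse partition could be split into the QIT and ST tables; the column names and the `groups` input format are illustrative assumptions.

```python
from collections import Counter

def anatomize(groups, qi_attrs, sens_attr):
    """Split an l-diverse partition into a QIT and an ST.

    groups: list of groups; each group is a list of records (dicts)
            that already satisfies l-diversity on sens_attr.
    Returns (qit, st) where
      qit rows: (qi_1, ..., qi_d, group_id)
      st  rows: (group_id, sensitive_value, count)
    """
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        # QIT keeps the exact QI values but only the group id,
        # de-associating them from the sensitive value.
        for rec in group:
            qit.append(tuple(rec[a] for a in qi_attrs) + (gid,))
        # ST publishes, per group, how often each sensitive value occurs.
        for value, count in Counter(rec[sens_attr] for rec in group).items():
            st.append((gid, value, count))
    return qit, st

# Illustrative usage on a toy 2-diverse partition.
groups = [
    [{"Age": 23, "Zip": "11000", "Disease": "flu"},
     {"Age": 27, "Zip": "11200", "Disease": "pneumonia"}],
    [{"Age": 41, "Zip": "30300", "Disease": "gastritis"},
     {"Age": 45, "Zip": "30700", "Disease": "dyspepsia"}],
]
qit, st = anatomize(groups, ["Age", "Zip"], "Disease")
```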
Privacy properties • THEOREM 1. Given a pair of QIT and ST, the probability of inferring the sensitive value of any individual is at most 1/l
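A one-line sketch of the reasoning (under the usual reading of l-diversity, where each sensitive value accounts for at most a 1/l fraction of its group; the notation here is assumed, not copied from the paper):

```latex
% The adversary locates the individual's QI group g via the QIT; the ST only
% reveals per-group counts, so for any sensitive value v
\[
  \Pr[A^{s}=v \mid \text{QIT},\text{ST}]
  \;=\; \frac{\mathrm{cnt}(v,g)}{|g|} \;\le\; \frac{1}{l}.
\]
```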
Comparison with generalization • Compare with generalization under two assumptions: • A1: the adversary has the QI values of the target individual • A2: the adversary also knows that the individual is definitely in the microdata • If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds • If A1 is true and A2 is false, generalization is stronger • If A1 and A2 are both false, generalization is still stronger
Preserving Data Correlation • Examine the correlation between Age and Disease in T using the probability density function (pdf) • Example: the pdf of tuple t1
Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the generalization table:
Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the QIT and ST tables:
Preserving Data Correlation cont. • For a more rigorous comparison, calculate the "L2 distance" with the following equation: • The distance for anatomy is 0.5, while the distance for generalization is 22.5
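The equation itself is not reproduced on the slide; a plausible reconstruction, treating the error as the L2 distance between the pdf reconstructed from the published tables and the exact pdf of a tuple t, is:

```latex
\[
  \mathrm{Err}(t) \;=\; \Bigl(\sum_{x}\bigl(\tilde{p}_t(x)-p_t(x)\bigr)^{2}\Bigr)^{1/2}
\]
```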
Preserving Data Correlation cont. • Idea: measure the error for each tuple with the following formula: • Objective: minimize the re-construction error (RCE) over all tuples t in T: • Algorithm: Nearly-Optimal Anatomizing Algorithm
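Again hedging on the exact form used in the paper, the objective can be written as the sum of the per-tuple errors:

```latex
\[
  \mathrm{RCE}(T) \;=\; \sum_{t \in T} \mathrm{Err}(t)
\]
```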
Experiments • Dataset: CENSUS, containing the personal information of 500k American adults described by 9 discrete attributes • Created two sets of microdata tables • Set 1: 5 tables OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As • Set 2: 5 tables SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As
Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases • Differential privacy
Attacks on k-Anonymity • k-Anonymity does not provide privacy if • Sensitive values in an equivalence class lack diversity (homogeneity attack) • The attacker has background knowledge (background knowledge attack) • (Figure: a 3-anonymous patient table illustrating both attacks)
l-Diversity [Machanavajjhala et al. ICDE ‘06] Sensitive attributes must be “diverse” within each quasi-identifier equivalence class
Distinct l-Diversity • Each equivalence class has at least l well-represented sensitive values • Doesn't prevent probabilistic inference attacks: in a class of 10 records where 8 have HIV and 2 have other values, an adversary concludes HIV with 80% confidence
Other Versions of l-Diversity • Probabilistic l-diversity • The frequency of the most frequent value in an equivalence class is bounded by 1/l • Entropy l-diversity • The entropy of the distribution of sensitive values in each equivalence class is at least log(l) • Recursive (c,l)-diversity • r1 < c·(rl + r(l+1) + … + rm), where ri is the frequency of the i-th most frequent value • Intuition: the most frequent value does not appear too frequently
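As a concrete illustration of these definitions, here is a minimal sketch (the function and variable names are illustrative) that checks distinct, probabilistic, and entropy l-diversity for a single equivalence class, using the skewed HIV example from the previous slide:

```python
import math
from collections import Counter

def diversity_checks(sensitive_values, l):
    """Check one equivalence class against three l-diversity variants."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    freqs = [c / n for c in counts.values()]

    distinct = len(counts) >= l                      # distinct l-diversity
    probabilistic = max(freqs) <= 1.0 / l            # most frequent value bounded by 1/l
    entropy = -sum(f * math.log(f) for f in freqs)   # entropy of the class
    entropic = entropy >= math.log(l)                # entropy l-diversity

    return {"distinct": distinct, "probabilistic": probabilistic, "entropy": entropic}

# The skewed example: 8 HIV records and 2 other values in a class of 10.
print(diversity_checks(["HIV"] * 8 + ["flu", "cold"], l=3))
# distinct 3-diversity holds, but probabilistic and entropy 3-diversity fail,
# reflecting the probabilistic-inference weakness of the distinct variant.
```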
Neither Necessary, Nor Sufficient • Original dataset: 99% of the records have cancer
Neither Necessary, Nor Sufficient • Anonymization A: a quasi-identifier group with 50% cancer counts as "diverse", yet it leaks a great deal of information relative to the 99% baseline, so l-diversity is not sufficient
Neither Necessary, Nor Sufficient • Anonymization B: a quasi-identifier group with 99% cancer is not "diverse", yet it reveals essentially nothing beyond the baseline, so l-diversity is not necessary either
Limitations of l-Diversity • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Very different degrees of sensitivity! • l-diversity is unnecessary • 2-diversity is unnecessary for an equivalence class that contains only HIV- records • l-diversity is difficult to achieve • Suppose there are 10000 records in total • To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes
Skewness Attack • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Consider an equivalence class that contains an equal number of HIV+ and HIV- records • Diverse, but potentially violates privacy! • l-diversity does not differentiate: • Equivalence class 1: 49 HIV+ and 1 HIV- • Equivalence class 2: 1 HIV+ and 49 HIV- l-diversity does not consider overall distribution of sensitive values!
Sensitive Attribute Disclosure • Similarity attack on a 3-diverse patient table • Conclusion: Bob's salary is in [20k, 40k], which is relatively low, and Bob has some stomach-related disease • l-diversity does not consider semantics of sensitive values!
t-Closeness: A New Privacy Measure • Rationale • Let Q be the overall distribution of sensitive values (external knowledge) and Pi the distribution of sensitive values in each equivalence class • Observations: Q is public or can be derived; the gap between Q and Pi is the potential knowledge gain about specific individuals • Principle: the distance between Q and Pi should be bounded by a threshold t
t-Closeness [Li et al. ICDE ‘07] Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database
Distance Measures • P = (p1, p2, …, pm), Q = (q1, q2, …, qm) • Trace-distance • KL-divergence • Neither measure reflects the semantic distance among values • Q: {3K,4K,5K,6K,7K,8K,9K,10K,11K}, P1: {3K,4K,5K}, P2: {6K,8K,11K} • Intuitively, D[P1,Q] > D[P2,Q]
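The two formulas referenced on the slide are the standard ones; for P = (p1, …, pm) and Q = (q1, …, qm):

```latex
\[
  D_{\mathrm{trace}}[P,Q] \;=\; \tfrac{1}{2}\sum_{i=1}^{m} |p_i - q_i|,
  \qquad
  D_{\mathrm{KL}}[P,Q] \;=\; \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}.
\]
```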
Earth Mover's Distance • If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other • The cost is the amount of dirt moved multiplied by the distance it is moved • Assume the two piles have the same amount of dirt • Extensions exist for comparing distributions with different total masses: • Allow a partial match and discard leftover "dirt" without cost • Allow mass to be created or destroyed, but with a cost penalty
Earth Mover's Distance • Formulation • P = (p1, p2, …, pm), Q = (q1, q2, …, qm) • dij: the ground distance between element i of P and element j of Q • Find a flow F = [fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work subject to the constraints below:
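The objective and constraints are not reproduced on the slide; one standard way to write the transportation-problem formulation, assuming both distributions have total mass 1, is:

```latex
\[
  \min_{F}\; \mathrm{WORK}(P,Q,F) \;=\; \sum_{i=1}^{m}\sum_{j=1}^{m} d_{ij}\, f_{ij}
\]
\[
  \text{subject to}\quad
  f_{ij} \ge 0 \;\;(1 \le i,j \le m), \qquad
  p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i \;\;(1 \le i \le m),
\]
\[
  \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij} \;=\; \sum_{i=1}^{m} p_i \;=\; \sum_{i=1}^{m} q_i \;=\; 1.
\]
```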
How to Calculate EMD (cont.) • EMD for categorical attributes • Hierarchical distance • Hierarchical distance is a metric
Earth Mover's Distance • Example • P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k} • Move 1/9 probability mass for each of the following pairs • 3k->6k, 3k->7k, cost: 1/9·(3+4)/8 • 4k->8k, 4k->9k, cost: 1/9·(4+5)/8 • 5k->10k, 5k->11k, cost: 1/9·(5+6)/8 • Total cost: 1/9·27/8 = 0.375 • With P2 = {6k,8k,11k}, the total cost is 1/9·12/8 ≈ 0.167 < 0.375, which matches intuition better than the other two distance measures
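For numerical attributes over an equally spaced, ordered domain, EMD with ground distance |i-j|/(m-1) has the usual one-dimensional closed form (sum of absolute cumulative differences), which makes the numbers above easy to check; this small sketch reproduces 0.375 and roughly 0.167:

```python
def ordered_emd(p, q):
    """EMD between two distributions over an ordered domain of m values,
    with ground distance |i - j| / (m - 1), computed via the standard
    1-D closed form: sum of absolute cumulative differences, normalized."""
    m = len(p)
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total / (m - 1)

# Domain: {3k, 4k, ..., 11k} (m = 9); Q is uniform over the whole domain.
q  = [1 / 9] * 9
p1 = [1 / 3, 1 / 3, 1 / 3, 0, 0, 0, 0, 0, 0]   # {3k, 4k, 5k}
p2 = [0, 0, 0, 1 / 3, 0, 1 / 3, 0, 0, 1 / 3]   # {6k, 8k, 11k}

print(ordered_emd(p1, q))  # 0.375
print(ordered_emd(p2, q))  # ~0.167
```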
Experiments • Goal • To show l-diversity does not provide sufficient privacy protection (the similarity attack). • To show the efficiency and data quality of using t-closeness are comparable with other privacy measures. • Setup • Adult dataset from UC Irvine ML repository • 30162 tuples, 9 attributes (2 sensitive attributes) • Algorithm: Incognito
Experiments • Comparisons of privacy measurements • k-Anonymity • Entropy l-diversity • Recursive (c,l)-diversity • k-Anonymity with t-closeness
Experiments • Efficiency • The efficiency of using t-closeness is comparable with other privacy measurements
Experiments • Data utility • Discernibility metric; Minimum average group size • The data quality of using t-closeness is comparable with other privacy measurements
Anonymous, “t-Close” Dataset This is k-anonymous, l-diverse and t-close… …so secure, right?
What Does Attacker Know? Bob is Caucasian and I heard he was admitted to hospital with flu…
What Does Attacker Know? Bob is Caucasian and I heard he was admitted to hospital … And I know three other Caucasians admitted to hospital with Acne or Shingles …
k-Anonymity and Partition-based notions • Syntactic • Focuses on data transformation, not on what can be learned from the anonymized dataset • “k-anonymous” dataset can leak sensitive information • “Quasi-identifier” fallacy • Assumes a priori that attacker will not know certain information about his target
Today • Permutation based anonymization methods (cont.) • Other privacy principles for microdata publishing • Statistical databases • Definitions and early methods • Output perturbation and differential privacy
Statistical Data Release • Originated from the study of statistical databases • A statistical database is a database that provides statistics on subsets of records • OLAP vs. OLTP • Statistics may include SUM, MEAN, MEDIAN, COUNT, MAX, and MIN over subsets of records
Types of Statistical Databases • Static – built once and never changes; example: U.S. Census • Dynamic – changes continuously to reflect real-time data; example: most online research databases
Types of Statistical Databases • Centralized – a single database • Decentralized – multiple distributed databases • General purpose – e.g., census • Special purpose – e.g., bank, hospital, academia
Data Compromise • Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual • Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance • Positive compromise – determine an attribute has a particular value • Negative compromise – determine an attribute does not have a particular value • Relative compromise – determine the ranking of some confidential values
Statistical Quality of Information • Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate • Precision – variance of the estimators obtained by users • Consistency – lack of contradictions and paradoxes • Contradictions: different responses to same query; average differs from sum/count • Paradox: negative count
Methods • Query restriction • Data perturbation/anonymization • Output perturbation
Output Perturbation • The query is evaluated on the original data, and the true results are perturbed before being returned to the user (figure: query and perturbed results flowing between the user and the database)
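As a preview of the output-perturbation idea (and of the differential-privacy mechanisms listed earlier in the outline), here is a minimal, illustrative sketch that perturbs the result of a COUNT query with Laplace noise; the function name, parameters, and the choice of the Laplace mechanism are assumptions for illustration, not a specific system's API:

```python
import numpy as np

def noisy_count(records, predicate, epsilon):
    """Answer a COUNT query with Laplace output perturbation.

    The true count is computed on the unperturbed data; Laplace noise with
    scale sensitivity/epsilon (the sensitivity of a COUNT query is 1) is
    added before the answer is released to the querier.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative usage: a noisy count of HIV records, released with epsilon = 0.5.
records = [{"Disease": "HIV"}, {"Disease": "flu"}, {"Disease": "HIV"}]
print(noisy_count(records, lambda r: r["Disease"] == "HIV", epsilon=0.5))
```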
Statistical data release vs. data anonymization • Data anonymization is one technique that can be used to build a statistical database • Other techniques such as query restriction and output perturbation can be used to build a statistical database or to release statistical data • Different privacy principles can be used