This article explores the concept of K-Anonymity and other cluster-based methods for data publishing and privacy protection. It discusses different attacks against K-Anonymity and proposes solutions to mitigate them. The goal is to maximize data utility while limiting the risk of disclosure to an acceptable level.
K-Anonymity and Other Cluster-Based Methods. Ge Ruan, Oct. 11, 2007
Data Publishing and Data Privacy • Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. • This collected information is valuable in both research and business, so data sharing is common. • Publishing the data, however, may put the respondents' privacy at risk. • Objective: • Maximize data utility while limiting disclosure risk to an acceptable level
Related Works • Statistical Databases • The most common approach is to add noise while still maintaining some statistical invariants. Disadvantage: • it destroys the integrity of individual records
Related Works (Cont’d) • Multi-level Databases • Data is stored at different security classifications, and users have different security clearances. (Denning and Lunt) • Eliminating precise inference: sensitive information is suppressed, i.e. simply not released. (Su and Ozsoyoglu) Disadvantages: • It is impossible to consider every possible attack • Many data holders share the same data, but their concerns differ • Suppression can drastically reduce the quality of the data
Related Works (Cont’d) • Computer Security • Access control and authentication ensure that the right people have the right authority over the right object at the right time and place. • That is not what we want here. A general doctrine of data privacy is to release as much information as possible, as long as the identities of the subjects (people) are protected.
K-Anonymity • Sweeney proposed a formal protection model named k-anonymity • What is K-Anonymity? • A release satisfies k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. • Ex. Suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release share that birth date and gender, the release is k-anonymous with respect to those attributes.
Classification of Attributes • Key Attribute: • Name, Address, Cell Phone • which can uniquely identify an individual directly • Always removed before release. • Quasi-Identifier: • 5-digit ZIP code, Birth date, Gender • A set of attributes that can be potentially linked with external information to re-identify entities • 87% of the U.S. population can be uniquely identified from these attributes, according to 1990 Census summary data. • Suppressed or generalized
Classification of Attributes (Cont’d) • Linking example: joining the Hospital Patient Data with the Voter Registration Data on the quasi-identifier attributes re-identifies a patient: Andre has heart disease!
Classification of Attributes (Cont’d) • Sensitive Attribute: • Medical record, wage, etc. • Always released directly; these attributes are what the researchers need. What counts as sensitive depends on the requirement.
K-Anonymity Protection Model • PT: Private Table • RT, GT1, GT2: Released Tables • QI: Quasi-Identifier (Ai,…,Aj) • (A1,A2,…,An): Attributes • Lemma: if RT satisfies k-anonymity with respect to QI, then each sequence of values in RT[QI] appears with at least k occurrences in RT[QI].
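The lemma can be made concrete with a small Python sketch (not part of the original slides; the table, attribute names, and values below are hypothetical): a released table satisfies k-anonymity with respect to QI exactly when every combination of QI values occurs at least k times.

    from collections import Counter

    def satisfies_k_anonymity(table, quasi_identifier, k):
        """True iff every combination of quasi-identifier values appears in at least k tuples."""
        counts = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
        return all(c >= k for c in counts.values())

    # Hypothetical released table with generalized quasi-identifier values.
    released = [
        {"zip": "130**", "age": "<30",  "disease": "Heart Disease"},
        {"zip": "130**", "age": "<30",  "disease": "Flu"},
        {"zip": "148**", "age": ">=40", "disease": "Cancer"},
        {"zip": "148**", "age": ">=40", "disease": "Flu"},
    ]
    print(satisfies_k_anonymity(released, ["zip", "age"], k=2))  # True: each QI combination occurs twice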
Attacks Against K-Anonymity • Unsorted Matching Attack • This attack is based on the order in which tuples appear in the released table. • Solution: • Randomly sort the tuples before releasing.
Attacks Against K-Anonymity (Cont’d) • Complementary Release Attack • Different releases can be linked together to compromise k-anonymity. • Solution: • Consider all previously released tables before releasing a new one, and try to avoid linking. • Other data holders may release data that can be used in this kind of attack. In general, this kind of attack is hard to prevent completely.
Attacks Against K-Anonymity (Cont’d) • Temporal Attack • Adding or removing tuples over time may compromise k-anonymity protection.
Attacks Against K-Anonymity (Cont’d) • k-Anonymity does not provide privacy if: • Sensitive values in an equivalence class lack diversity (Homogeneity Attack) • The attacker has background knowledge (Background Knowledge Attack) • Illustrated with a 3-anonymous patient table. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
l-Diversity • Distinct l-diversity • Each equivalence class has at least l well-represented sensitive values • Limitation: • Doesn't prevent probabilistic inference attacks • Ex. One equivalence class contains ten tuples. In the "Disease" column, one is "Cancer", one is "Heart Disease", and the remaining eight are "Flu". This satisfies distinct 3-diversity, but the attacker can still conclude that the target person's disease is "Flu" with 80% confidence. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
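As a rough illustration (hypothetical code, reusing the example above), distinct l-diversity is easy to check per equivalence class, and so is the inference probability that it fails to bound:

    from collections import Counter

    def distinct_l_diversity(sensitive_values, l):
        """True iff the equivalence class contains at least l distinct sensitive values."""
        return len(set(sensitive_values)) >= l

    # The equivalence class from the example: eight "Flu", one "Cancer", one "Heart Disease".
    eq_class = ["Flu"] * 8 + ["Cancer", "Heart Disease"]
    print(distinct_l_diversity(eq_class, l=3))       # True: three distinct values
    top_count = Counter(eq_class).most_common(1)[0][1]
    print(top_count / len(eq_class))                 # 0.8: "Flu" can be inferred with 80% confidence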
l-Diversity (Cont’d) • Entropy l-diversity • Each equivalence class must not only have enough different sensitive values, but the sensitive values must also be distributed evenly enough. • Formally, the entropy of the distribution of sensitive values in each equivalence class must be at least log(l). • Sometimes this may be too restrictive: when some values are very common, the entropy of the entire table may be very low. This leads to the less conservative notion of recursive (c,l)-diversity. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
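A minimal sketch of the entropy check (illustrative only; the class below is the skewed one from the previous slide):

    import math
    from collections import Counter

    def entropy_l_diversity(sensitive_values, l):
        """True iff the entropy of the sensitive-value distribution is at least log(l)."""
        n = len(sensitive_values)
        entropy = -sum((c / n) * math.log(c / n) for c in Counter(sensitive_values).values())
        return entropy >= math.log(l)

    eq_class = ["Flu"] * 8 + ["Cancer", "Heart Disease"]
    print(entropy_l_diversity(eq_class, l=3))  # False: entropy ~0.64 < log(3) ~1.10, the skewed class fails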
l-Diversity (Cont’d) • Recursive (c,l)-diversity • The most frequent value does not appear too frequently: r_1 < c · (r_l + r_{l+1} + … + r_m), where r_1 ≥ r_2 ≥ … ≥ r_m are the counts of the sensitive values in the equivalence class, sorted in descending order. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
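A small hypothetical sketch of the recursive (c,l)-diversity condition, applied to the same skewed class:

    from collections import Counter

    def recursive_cl_diversity(sensitive_values, c, l):
        """True iff r_1 < c * (r_l + r_{l+1} + ... + r_m), with counts sorted in descending order."""
        r = sorted(Counter(sensitive_values).values(), reverse=True)
        if len(r) < l:
            return False
        return r[0] < c * sum(r[l - 1:])

    eq_class = ["Flu"] * 8 + ["Cancer", "Heart Disease"]
    print(recursive_cl_diversity(eq_class, c=3, l=2))  # False: r_1 = 8 is not below 3 * (1 + 1) = 6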
Limitations of l-Diversity l-diversity may be difficult and unnecessary to achieve. • A single sensitive attribute • Two values: HIV positive (1%) and HIV negative (99%) • Very different degrees of sensitivity • l-diversity is unnecessary to achieve • 2-diversity is unnecessary for an equivalence class that contains only negative records • l-diversity is difficult to achieve • Suppose there are 10000 records in total; only 10000 × 1% = 100 of them are positive • For distinct 2-diversity, every equivalence class needs at least one positive record, so there can be at most 100 equivalence classes
Limitations of l-Diversity (Cont’d) l-diversity is insufficient to prevent attribute disclosure. Skewness Attack • Two sensitive values • HIV positive (1%) and HIV negative (99%) • Serious privacy risk • An equivalence class with an equal number of positive and negative records satisfies diversity, yet anyone in it appears 50% likely to be positive, versus 1% in the overall population • l-diversity does not differentiate: • Equivalence class 1: 49 positive + 1 negative • Equivalence class 2: 1 positive + 49 negative • l-diversity does not consider the overall distribution of sensitive values
Limitations of l-Diversity (Cont’d) l-diversity is insufficient to prevent attribute disclosure. Similarity Attack • Example: a 3-diverse patient table • Conclusion: • Bob's salary is in [20k,40k], which is relatively low. • Bob has some stomach-related disease. • l-diversity does not consider semantic meanings of sensitive values
t-Closeness: A New Privacy Measure • Rationale • With a completely generalized table, an observer with external knowledge learns only the overall distribution Q of sensitive values.
t-Closeness: A New Privacy Measure • Rationale (Cont’d) • With a released (partially generalized) table, an observer with external knowledge additionally learns the distribution Pi of sensitive values in each equivalence class.
t-Closeness: A New Privacy Measure • Rationale (Cont’d) • Observations • Q should be treated as public information • Knowledge gain comes in two parts: • About the whole population (from belief B0 to B1, by learning Q) • About specific individuals (from B1 to B2, by learning the per-class distributions Pi) • We bound the knowledge gain between B1 and B2 instead • Principle • The distance between Q and each Pi should be bounded by a threshold t.
Distance Measures • P=(p1,p2,…,pm), Q=(q1,q2,…,qm) • Trace (variational) distance: D[P,Q] = (1/2) Σ_i |p_i − q_i| • KL-divergence: D[P,Q] = Σ_i p_i log(p_i / q_i) • Neither measure reflects the semantic distance among values. • Q = {3K,4K,5K,6K,7K,8K,9K,10K,11K}, P1 = {3K,4K,5K}, P2 = {5K,7K,10K} • Intuitively, D[P1,Q] > D[P2,Q], yet both measures assign the two pairs the same distance. • Instead, define a ground distance between every pair of values; D[P,Q] is then dependent upon these ground distances.
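A quick numerical check (hypothetical code) of why these measures fall short: both the trace distance and the KL-divergence assign P1 and P2 the same distance from Q, even though P2 is semantically much closer.

    import math

    def trace_distance(p, q):
        """Variational (trace) distance: 0.5 * sum |p_i - q_i|."""
        return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

    def kl_divergence(p, q):
        """KL-divergence: sum p_i * log(p_i / q_i), with 0*log(0) treated as 0."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    salaries = [3, 4, 5, 6, 7, 8, 9, 10, 11]                    # in thousands
    q  = [1 / 9] * 9                                            # Q: uniform over all nine salaries
    p1 = [1 / 3 if s in (3, 4, 5) else 0.0 for s in salaries]   # P1 = {3K,4K,5K}
    p2 = [1 / 3 if s in (5, 7, 10) else 0.0 for s in salaries]  # P2 = {5K,7K,10K}
    print(trace_distance(p1, q), trace_distance(p2, q))         # 0.667 and 0.667: identical
    print(kl_divergence(p1, q), kl_divergence(p2, q))           # log(3) and log(3): identical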
Earth Mover’s Distance • Formulation • P=(p1,p2,…,pm), Q=(q1,q2,…,qm) • dij: the ground distance between element i of P and element j of Q. • Find a flow F=[fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work: WORK(P,Q,F) = Σ_i Σ_j f_ij d_ij, subject to the constraints: f_ij ≥ 0 for all i, j; p_i − Σ_j f_ij + Σ_j f_ji = q_i for all i; and Σ_i Σ_j f_ij = Σ_i p_i = Σ_i q_i = 1. • D[P,Q] is the minimal work over all feasible flows.
Earth Mover’s Distance • Example • P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k} • Move 1/9 probability for each of the following pairs • 3k→6k, 3k→7k, cost: 1/9×(3+4)/8 • 4k→8k, 4k→9k, cost: 1/9×(4+5)/8 • 5k→10k, 5k→11k, cost: 1/9×(5+6)/8 • Total cost: 1/9×27/8 = 0.375 • With P2 = {6k,8k,11k}, the total cost is 0.167 < 0.375. This makes more sense than the other two distance measures.
How to calculate EMD • EMD for numerical attributes • Ordered distance: the ground distance between the i-th and j-th values is |i − j| / (m − 1) • Ordered distance is a metric • Non-negative, symmetric, satisfies the triangle inequality • Let r_i = p_i − q_i; then D[P,Q] is calculated as: D[P,Q] = (1/(m−1)) × ( |r_1| + |r_1+r_2| + … + |r_1+r_2+…+r_{m−1}| )
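A minimal sketch of this closed form (hypothetical code), reproducing the 0.375 and 0.167 values from the worked example above:

    def ordered_emd(p, q):
        """EMD with ordered ground distance |i - j| / (m - 1):
        the sum of the absolute cumulative differences r_1, r_1+r_2, ..., divided by m - 1."""
        m = len(p)
        total, cumulative = 0.0, 0.0
        for pi, qi in zip(p, q):
            cumulative += pi - qi
            total += abs(cumulative)
        return total / (m - 1)

    salaries = [3, 4, 5, 6, 7, 8, 9, 10, 11]
    q  = [1 / 9] * 9
    p1 = [1 / 3 if s in (3, 4, 5) else 0.0 for s in salaries]    # {3k,4k,5k}
    p2 = [1 / 3 if s in (6, 8, 11) else 0.0 for s in salaries]   # {6k,8k,11k}
    print(ordered_emd(p1, q))  # 0.375
    print(ordered_emd(p2, q))  # 0.1666..., the semantically closer distribution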
How to calculate EMD • EMD for categorical attributes • Equal distance: the ground distance between any two distinct values is 1 • Equal distance is a metric • D[P,Q] is calculated as: D[P,Q] = (1/2) Σ_i |p_i − q_i|
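Combining the equal-distance formula with the t-closeness principle stated earlier gives a compact check: every equivalence class's sensitive-value distribution must lie within EMD t of the overall distribution Q. The sketch below is illustrative only; the table, attribute names, and threshold are hypothetical.

    from collections import Counter

    def equal_emd(p_counts, q_counts, domain):
        """EMD under equal ground distance: 0.5 * sum |p_i - q_i| over the sensitive domain."""
        n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
        return 0.5 * sum(abs(p_counts[v] / n_p - q_counts[v] / n_q) for v in domain)

    def satisfies_t_closeness(table, qi, sensitive, t):
        """True iff every equivalence class's distribution is within EMD t of the overall one."""
        overall = Counter(row[sensitive] for row in table)
        classes = {}
        for row in table:
            classes.setdefault(tuple(row[a] for a in qi), []).append(row[sensitive])
        return all(equal_emd(Counter(vals), overall, set(overall)) <= t
                   for vals in classes.values())

    table = [
        {"zip": "476**", "disease": "Flu"},
        {"zip": "476**", "disease": "Cancer"},
        {"zip": "479**", "disease": "Flu"},
        {"zip": "479**", "disease": "Flu"},
    ]
    print(satisfies_t_closeness(table, ["zip"], "disease", t=0.2))  # False: both classes are 0.25 away from Q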
How to calculate EMD (Cont’d) • EMD for categorical attributes • Hierarchical distance: the ground distance between two values is level(v1,v2)/H, where level(v1,v2) is the height of their lowest common ancestor in the generalization hierarchy and H is the height of the hierarchy • Hierarchical distance is a metric
How to calculate EMD (Cont’d) • EMD for categorical attributes • D[P,Q] is calculated by summing, over the internal nodes N of the hierarchy, the cost of the probability mass that must move through N: cost(N) = (height(N)/H) · min(pos_extra(N), neg_extra(N)), where extra(N) is p_i − q_i for a leaf and the sum of its children's extras otherwise, and pos_extra(N) and neg_extra(N) collect the positive and negative extras among N's children.
Experiments • Goal • To show l-diversity does not provide sufficient privacy protection (the similarity attack). • To show the efficiency and data quality of using t-closeness are comparable with other privacy measures. • Setup • Adult dataset from UC Irvine ML repository • 30162 tuples, 9 attributes (2 sensitive attributes) • Algorithm: Incognito
Experiments • Similarity attack (Occupation) • 13 of 21 entropy 2-diversity tables are vulnerable • 17 of 26 recursive (4,4)-diversity tables are vulnerable • Comparison of privacy measures • k-Anonymity • Entropy l-diversity • Recursive (c,l)-diversity • k-Anonymity with t-closeness
Experiments • Efficiency • The efficiency of using t-closeness is comparable with that of other privacy measures
Experiments • Data utility • Discernibility metric; Minimum average group size • The data quality obtained with t-closeness is comparable with that of other privacy measures
Conclusion • Limitations of l-diversity • l-diversity is difficult and unnecessary to achieve • l-diversity is insufficient in preventing attribute disclosure • t-Closeness as a new privacy measure • The overall distribution of sensitive values should be public information • The separation of the knowledge gain • EMD to measure distance • EMD captures semantic distance well • Simple formulas for three ground distances
Questions? Thank you!