Anonymization Algorithms - Other techniques, metrics, and extended scenarios Li Xiong CS573 Data Privacy and Anonymity
So far • k-anonymity (protect identity disclosure) • Anonymization algorithms • Generalization and suppression • Microaggregation and clustering • Privacy principles beyond k-anonymity • l-diversity, t-closeness (protect attribute disclosure) • m-invariance (protect continuous publishing)
Agenda • Other anonymization techniques • Anatomization • Information metrics • Extended scenarios
Anonymization methods • Non-perturbative: don't distort the data • Generalization • Suppression • Perturbative: distort the data • Microaggregation/clustering • Additive noise • Anatomization and permutation • De-associate relationship between QID and sensitive attribute
Problems with k-anonymity and l-diversity Query A: SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000
Querying generalized table • R1 and R2 are the anonymized QID groups • RQ is the query range • Assuming tuples are uniformly distributed within R1, the fraction of R1 covered by the query is p = Area(R1 ∩ RQ)/Area(R1) = (10*10)/(50*40) = 0.05 • Estimated answer for A: 2 × 0.05 = 0.1
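The estimate above can be sketched in a few lines of Python. This is only an illustration under the uniform-distribution assumption. The function names and the rectangle coordinates are hypothetical, chosen so that the side lengths match the slide (R1 is 50 × 40 and the overlap is 10 × 10).

```python
def overlap_fraction(group_box, query_box):
    """Fraction of the group's rectangle that the query rectangle covers.

    Each box is a list of (low, high) intervals, one per QI attribute.
    """
    fraction = 1.0
    for (g_lo, g_hi), (q_lo, q_hi) in zip(group_box, query_box):
        overlap = max(0.0, min(g_hi, q_hi) - max(g_lo, q_lo))
        fraction *= overlap / (g_hi - g_lo)
    return fraction


def estimate_count(matching_sensitive, group_box, query_box):
    """Estimated COUNT(*) contribution of one generalized QID group."""
    return matching_sensitive * overlap_fraction(group_box, query_box)


# Hypothetical coordinates: only the side lengths are taken from the slide.
r1 = [(0, 50), (0, 40)]
rq = [(0, 10), (0, 10)]
p = overlap_fraction(r1, rq)
estimate = estimate_count(2, r1, rq)  # 2 tuples in R1 have the queried disease
```

Because the generalized table hides where tuples fall inside R1, the estimate depends entirely on the uniformity assumption.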
Concept of the Anatomy Algorithm • Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST) • Use the same QI groups (satisfy l-diversity), replace the sensitive attribute values with a Group-ID column • Then produce a sensitive table with Disease statistics
Concept of the Anatomy Algorithm • Does it satisfy k-anonymity? l-diversity? • Query results? SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode BETWEEN 10001 AND 20000
Specifications of Anatomy • $T$ is the microdata table to be published • $T$ has $d$ QI attributes $A^{qi}_1, A^{qi}_2, \ldots, A^{qi}_d$ and a sensitive attribute $A^s$ • Each $A^{qi}_i$ ($1 \le i \le d$) is either numerical or categorical, but $A^s$ must be categorical, as required by l-diversity • For a tuple $t \in T$, $t[i]$ ($1 \le i \le d$) is its $A^{qi}_i$ value and $t[d+1]$ is its $A^s$ value • Each tuple $t$ can therefore be regarded as a point in a $(d+1)$-dimensional data space $DS$
Specifications of Anatomy cont. DEFINITION 1. (Partition/QI-group) A partition consists of several subsets of $T$ such that each tuple belongs to exactly one subset. The subsets are known as QI-groups and are denoted $QI_1, QI_2, \ldots, QI_m$
Specifications of Anatomy cont. DEFINITION 2. (l-diverse partition) A partition is l-diverse if every QI-group $QI_j$ satisfies $c_j(v)/|QI_j| \le 1/l$, where $v$ is the most frequent sensitive value in $QI_j$, $c_j(v)$ is the number of tuples in $QI_j$ with value $v$, and $|QI_j|$ is the number of tuples in $QI_j$. Example: $c_1(\text{dyspepsia}) = c_1(\text{pneumonia}) = 2$, $c_2(\text{flu}) = 2$ and $|QI_1| = |QI_2| = 4$, so both groups satisfy the condition for $l = 2$: $2/4 \le 1/2$
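A minimal sketch of the check in Definition 2, assuming each QI-group is given simply as the list of its sensitive values. The function name is hypothetical. The counts for dyspepsia, pneumonia and flu follow the slide, while the other two values in the second group are assumed for illustration.

```python
from collections import Counter


def is_l_diverse(partition, l):
    """Return True if every QI-group satisfies c_j(v) / |QI_j| <= 1 / l."""
    for group in partition:
        if not group:
            return False  # an empty group cannot be part of a valid partition
        most_frequent_count = max(Counter(group).values())
        # Integer form of c_j(v) / |QI_j| <= 1 / l, avoids float comparison.
        if most_frequent_count * l > len(group):
            return False
    return True


qi_1 = ["dyspepsia", "dyspepsia", "pneumonia", "pneumonia"]
# Only c_2(flu) = 2 and |QI_2| = 4 come from the slide; the rest is assumed.
qi_2 = ["flu", "flu", "gastritis", "bronchitis"]

two_diverse = is_l_diverse([qi_1, qi_2], l=2)
three_diverse = is_l_diverse([qi_1, qi_2], l=3)
```

With these groups the partition passes for l = 2 but not for l = 3, since 2/4 exceeds 1/3.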
Specifications of Anatomy cont. DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT table and an ST table. QIT has the schema $(A^{qi}_1, A^{qi}_2, \ldots, A^{qi}_d, \text{Group-ID})$. ST has the schema $(\text{Group-ID}, A^s, \text{Count})$
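The construction in Definition 3, and the way a count query can be answered from the two tables, can be sketched as follows. This is an illustration only: the function names and the toy microdata (Age, Zipcode, Disease) are assumptions made for the example, not the table from the slides.

```python
from collections import Counter


def build_qit_st(partition):
    """Build QIT rows (qi..., group_id) and ST rows (group_id, value, count)."""
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        for qi_values, _sensitive in group:
            qit.append((*qi_values, group_id))  # exact QI values are kept
        counts = Counter(sensitive for _qi, sensitive in group)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st


def estimate_count(qit, st, qi_predicate, sensitive_value):
    """Estimate COUNT(*) for 'qi_predicate AND sensitive = sensitive_value'."""
    estimate = 0.0
    for group_id, value, count in st:
        if value != sensitive_value:
            continue
        rows = [row for row in qit if row[-1] == group_id]
        matching = sum(1 for row in rows if qi_predicate(row))
        estimate += count * matching / len(rows)
    return estimate


# Hypothetical toy microdata: ((age, zipcode), disease), already partitioned.
partition = [
    [((23, 11000), "pneumonia"), ((27, 13000), "dyspepsia"),
     ((35, 59000), "dyspepsia"), ((59, 12000), "pneumonia")],
    [((61, 54000), "flu"), ((65, 25000), "gastritis"),
     ((65, 25000), "flu"), ((70, 30000), "bronchitis")],
]
qit, st = build_qit_st(partition)
answer = estimate_count(
    qit, st,
    lambda row: row[0] <= 30 and 10001 <= row[1] <= 20000,
    "pneumonia",
)
```

On this toy data the QI predicate is evaluated on exact values in QIT, so the estimate does not rely on a uniformity assumption inside a generalized region.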
Privacy properties THEOREM 1. Given a pair of QIT and ST tables, an adversary can correctly infer the sensitive value of any individual with probability at most 1/l
Comparison with generalization • Compare with generalization under two assumptions: • A1: the adversary has the QI-values of the target individual • A2: the adversary also knows that the individual is definitely in the microdata • If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound holds • If A1 is true and A2 is false, generalization is stronger • If A1 and A2 are both false, generalization is still stronger
Preserving Data Correlation • Examine the correlation between Age and Disease in T using a probability density function (pdf) • Example: t1
Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the generalization table:
Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the QIT and ST tables:
Preserving Data Correlation cont. • For a more rigorous comparison, compute the “L2 distance” between the re-constructed pdf of t1 and its actual pdf • The distance for anatomy is 0.5 while the distance for generalization is 22.5 • Anatomy provides better re-constructions of the probability density functions of all tuples.
Preserving Data Correlation cont. • Measure the re-construction error of each tuple's pdf with this distance • Objective: obtain a minimal re-construction error (RCE) over all tuples t in T
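The anatomy figure of 0.5 can be reproduced with a short sketch. It assumes that the L2 distance is the sum of squared differences between the two pdfs, that t1's actual disease is pneumonia, and that t1's QI-group has the ST counts from the earlier example (dyspepsia 2, pneumonia 2). The function names are hypothetical. Since QIT keeps t1's exact QI values, only the sensitive dimension contributes to the error.

```python
def reconstruct_from_st(st_counts):
    """Approximate pdf over sensitive values from one group's ST counts."""
    total = sum(st_counts.values())
    return {value: count / total for value, count in st_counts.items()}


def l2_distance(actual, reconstructed):
    """Sum of squared differences between two discrete pdfs."""
    values = set(actual) | set(reconstructed)
    return sum(
        (reconstructed.get(v, 0.0) - actual.get(v, 0.0)) ** 2 for v in values
    )


actual_t1 = {"pneumonia": 1.0}  # assumed actual sensitive value of t1
approx_t1 = reconstruct_from_st({"dyspepsia": 2, "pneumonia": 2})
error_t1 = l2_distance(actual_t1, approx_t1)  # (1 - 0.5)^2 + (0 - 0.5)^2
```

The generalization figure of 22.5 is not reproduced here, because it also depends on how the generalized Age interval is spread over the data space.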
Nearly-Optimal Anatomizing Algorithm • The Anatomy authors propose an efficient algorithm for anatomizing tables that aims to minimize the RCE • The resulting QIT and ST achieve an RCE that deviates from the lower bound by a factor of less than 1 + 1/n, where n is the size of T • The algorithm has linear I/O complexity O(n/b), where b is the page size
Nearly-Optimal Anatomizing Algorithm cont. PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple. PROPERTY 2. The set S' always includes at least one QI-group. PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values
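The slides name a group-creation phase and a residue-assignment phase without listing the steps. The sketch below is one bucket-based procedure consistent with the three properties, written from memory of the general approach rather than copied from the slides, so treat every detail as an assumption: tuples are hashed into buckets by sensitive value, each new group takes one tuple from each of the l largest buckets, and each leftover tuple joins a group that does not yet contain its sensitive value (the candidate set plays the role of S'). It runs in memory and ignores the I/O aspects.

```python
import random
from collections import defaultdict


def anatomize(tuples, l, seed=0):
    """Partition (qi_values, sensitive) tuples into l-diverse QI-groups."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for qi_values, sensitive in tuples:
        buckets[sensitive].append((qi_values, sensitive))
    # Assumed eligibility condition: no value held by more than n / l tuples.
    if any(len(b) * l > len(tuples) for b in buckets.values()):
        raise ValueError("table is not eligible for this value of l")

    groups = []
    # Group-creation phase: repeat while at least l buckets are non-empty.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [s for s in buckets if buckets[s]]
        largest = sorted(non_empty, key=lambda s: len(buckets[s]), reverse=True)[:l]
        group = []
        for s in largest:
            bucket = buckets[s]
            group.append(bucket.pop(rng.randrange(len(bucket))))
        groups.append(group)

    # Residue-assignment phase: place each leftover tuple in a group that
    # does not already contain its sensitive value.
    for s, bucket in buckets.items():
        for leftover in bucket:
            candidates = [g for g in groups if all(v != s for _qi, v in g)]
            if not candidates:
                raise ValueError("no group can accept the residue tuple")
            rng.choice(candidates).append(leftover)
    return groups


# Hypothetical sample: 9 tuples, most frequent value appears 3 times.
values = ["pneumonia"] * 2 + ["dyspepsia"] * 2 + ["flu"] * 3 + ["gastritis", "bronchitis"]
sample = [((20 + i, 10000 + 1000 * i), v) for i, v in enumerate(values)]
groups = anatomize(sample, l=2)
```

Every group produced this way holds pairwise distinct sensitive values and at least l tuples, which is what Property 3 asks for.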
Experiments • Dataset: CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes • Two sets of microdata tables were created • Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As • Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As
Conclusion • Anatomy was designed to overcome the heavy information loss of generalization while still preserving privacy • Anatomy has a significantly lower error rate than generalization • Several items require further research: • Multiple sensitive attributes • Effective mining of patterns in microdata
Agenda • Other anonymization techniques • Anatomization • Information metrics • Extended scenarios
Information Metrics • General purpose metrics • Special purpose metrics • Trade-off metrics
General Purpose Metrics • General idea: measure “similarity” between the original data and the anonymized data • Minimal distortion metric (Samarati 2001; Sweeney 2002; Wang and Fung 2006) • Charge a penalty to each instance of a value generalized or suppressed (independently of other records) • ILoss (Xiao and Tao 2006) • Charge a penalty when a specific value is generalized
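Both metrics can be sketched briefly. The minimal distortion version below charges one unit per changed cell, which is an assumption (some formulations charge per generalization step). The ILoss version uses the commonly cited form (|v_g| − 1)/|D_A|, where |v_g| is the number of domain values covered by the generalized value and |D_A| is the size of the attribute's domain. Function names and sample rows are hypothetical.

```python
def minimal_distortion(original_rows, anonymized_rows):
    """Count cells whose value was generalized or suppressed."""
    return sum(
        before != after
        for orig_row, anon_row in zip(original_rows, anonymized_rows)
        for before, after in zip(orig_row, anon_row)
    )


def iloss(covered_values, domain_size):
    """Penalty for one generalized value: (|v_g| - 1) / |D_A|."""
    return (covered_values - 1) / domain_size


original = [("23", "11000"), ("27", "13000")]
anonymized = [("[21-30]", "11000"), ("[21-30]", "1****")]
md = minimal_distortion(original, anonymized)

ungeneralized = iloss(covered_values=1, domain_size=50)
decade = iloss(covered_values=10, domain_size=50)
```

An ungeneralized value costs nothing under ILoss, and the penalty grows with the width of the generalized value.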
General Purpose Metrics cont. • Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity …) • Charge a penalty to each record for being indistinguishable from other records
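A minimal sketch of DM, assuming each record is charged the size of the group of records it is indistinguishable from, so that a group of size s contributes s². Suppressed records, which are usually charged a larger penalty, are left out here. The function name and sample rows are hypothetical.

```python
from collections import Counter


def discernibility(anonymized_qi_rows):
    """Sum over records of the size of the group sharing their QI values."""
    group_sizes = Counter(anonymized_qi_rows).values()
    return sum(size * size for size in group_sizes)


rows = [("[21-60]", "M")] * 4 + [("[61-70]", "F")] * 4
dm = discernibility(rows)  # 4*4 + 4*4
```

Smaller groups give a lower score, so DM favours anonymizations that keep records as distinguishable as the privacy requirement allows.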
Special Purpose Metrics • Classification: Classification metric (CM) (Iyengar 2002) • Charge a penalty for each record suppressed or generalized to a group in which the record’s class is not the majority class • Query • Query error: count queries • Query imprecision: overlapped range
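The classification metric can be sketched as a raw penalty count, assuming each record is given as (group key, class label, suppressed flag) and that majority classes are computed over unsuppressed records only. Both choices, the function name and the sample records are assumptions for illustration. The count can be divided by the number of records if a normalized value is wanted.

```python
from collections import Counter, defaultdict


def classification_metric(records):
    """Count records that are suppressed or not in their group's majority class."""
    labels_by_group = defaultdict(list)
    for group_key, label, suppressed in records:
        if not suppressed:
            labels_by_group[group_key].append(label)
    majority = {
        key: Counter(labels).most_common(1)[0][0]
        for key, labels in labels_by_group.items()
    }
    penalty = 0
    for group_key, label, suppressed in records:
        if suppressed or label != majority[group_key]:
            penalty += 1
    return penalty


records = [
    ("A", "yes", False), ("A", "yes", False), ("A", "no", False),
    ("B", "no", False), ("B", "no", False),
    ("C", "yes", True),  # suppressed record
]
cm = classification_metric(records)
```

Here one record in group A disagrees with the majority class and one record is suppressed, so the penalty is 2.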
Extended Scenarios • Multiple release publishing • Continuous release publishing • Collaborative/distributed publishing
Other types of data • High dimensional transaction data • Market basket, web queries • Moving objects data • Location based services • Textual data