Anonymization Algorithms - Other techniques, metrics, and extended scenarios

Li Xiong, CS573 Data Privacy and Anonymity

So far • k-anonymity (protect identity disclosure) • Anonymization algorithms • Generalization and suppression • Microaggregation and clustering • Privacy principles beyond k-anonymity • l-diversity, t-closeness (protect attribute disclosure) • m-invariance (protect continuous publishing)

Agenda • Other anonymization technique • Anatomization • Information metrics • Extended scenarios

Anonymization methods • Non-perturbative: don't distort the data • Generalization • Suppression • Perturbative: distort the data • Microaggregation/clustering • Additive noise • Anatomization and permutation • De-associate relationship between QID and sensitive attribute

Problems with k-anonymity and l-diversity Query A: SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001,20000]

Querying generalized table • R1 and R2 are the anonymized QID groups • Q is the query range • p = Area(R1 ∩ Q)/Area(R1) = (10*10)/(50*40) = 0.05 • Estimated answer for A: 2(0.05) = 0.1
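The estimate above can be sketched in a few lines of Python. This is a minimal illustration under the usual uniform-distribution assumption inside each generalized group; the rectangle coordinates are hypothetical and were chosen only so that R1 is 50 by 40 and its overlap with Q is 10 by 10, as in the arithmetic on the slide.

```python
def overlap_fraction(group_rect, query_rect):
    """Fraction of a group's rectangle that the query rectangle covers.

    Each rectangle is a tuple of (low, high) intervals, one per QI attribute.
    """
    fraction = 1.0
    for (g_lo, g_hi), (q_lo, q_hi) in zip(group_rect, query_rect):
        side = max(0.0, min(g_hi, q_hi) - max(g_lo, q_lo))
        fraction *= side / (g_hi - g_lo)
    return fraction


def estimate_count(groups, query_rect):
    """Sum over groups: (matching sensitive tuples) * (covered fraction)."""
    return sum(count * overlap_fraction(rect, query_rect)
               for rect, count in groups)


# Hypothetical coordinates: R1 is 50 x 40 and Q overlaps it in a 10 x 10 corner.
R1 = ((0, 50), (0, 40))
R2 = ((60, 100), (0, 40))      # does not intersect Q
Q = ((-10, 10), (-5, 10))

# Assumed: R1 holds 2 tuples with the queried disease, R2 holds none.
groups = [(R1, 2), (R2, 0)]
p = overlap_fraction(R1, Q)
estimate = estimate_count(groups, Q)
print(p, estimate)
```

The estimate is far from an integer count because generalization hides where the tuples actually lie inside R1.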

Concept of the Anatomy Algorithm • Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST) • Use the same QI groups (satisfy l-diversity), replace the sensitive attribute values with a Group-ID column • Then produce a sensitive table with Disease statistics

Concept of the Anatomy Algorithm • Does it satisfy k-anonymity? l-diversity? • Query results? SELECT COUNT(*) FROM Microdata WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001,20000]
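To make the query question concrete, here is a rough sketch of how a count query could be estimated from a QIT and an ST. The table contents are invented for illustration and are not taken from the slides; only the per-group disease counts mirror the example used later in Definition 2. The estimator assumes that, within a group, every tuple is equally likely to carry each sensitive value listed in the ST.

```python
from collections import Counter


def estimate_count(qit, st, qi_predicate, sensitive_value):
    """Estimate COUNT(*) for a query on QI values and one sensitive value.

    qit: list of (qi_values, group_id); st: dict {(group_id, value): count}.
    """
    group_sizes = Counter(gid for _, gid in qit)
    # Exact number of tuples per group that satisfy the QI predicate.
    hits = Counter(gid for qi, gid in qit if qi_predicate(qi))
    return sum(n * st.get((gid, sensitive_value), 0) / group_sizes[gid]
               for gid, n in hits.items())


# Hypothetical QIT rows: ((Age, Zipcode), Group-ID).
qit = [((23, 11000), 1), ((27, 13000), 1), ((35, 59000), 1), ((59, 12000), 1),
       ((61, 54000), 2), ((65, 25000), 2), ((65, 25000), 2), ((70, 30000), 2)]
# Hypothetical ST rows: (Group-ID, Disease) -> Count.
st = {(1, "dyspepsia"): 2, (1, "pneumonia"): 2,
      (2, "bronchitis"): 1, (2, "flu"): 2, (2, "gastritis"): 1}

answer = estimate_count(
    qit, st,
    qi_predicate=lambda qi: qi[0] <= 30 and 10001 <= qi[1] <= 20000,
    sensitive_value="pneumonia",
)
print(answer)
```

Because the QI values are published exactly, the predicate on Age and Zipcode is evaluated without any uniformity assumption; only the link to Disease is estimated.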

Specifications of Anatomy • T is the microdata table to be published • T has d QI attributes Aqi1, Aqi2, ..., Aqid and a sensitive attribute As • Each Aqii (1 ≤ i ≤ d) is either numerical or categorical, but As must be categorical, as l-diversity requires • For a tuple t in T, t[i] (1 ≤ i ≤ d) denotes its Aqii value and t[d + 1] its As value • Hence t can be regarded as a point in a (d + 1)-dimensional data space DS

Specifications of Anatomy cont. DEFINITION 1. (Partition/QI-group) A partition consists of several subsets of T such that each tuple belongs to exactly one subset. The subsets are known as QI-groups and are denoted QI1, QI2, ..., QIm

Specifications of Anatomy cont. DEFINITION 2. (l-diverse partition) A partition is l-diverse if every QI-group QIj satisfies cj(v)/|QIj| ≤ 1/l, where v is the most frequent sensitive value in QIj, cj(v) is the number of tuples in QIj with sensitive value v, and |QIj| is the number of tuples in QIj • Example: c1(dyspepsia) = c1(pneumonia) = 2, c2(flu) = 2, and |QI1| = |QI2| = 4, so both groups satisfy the condition for l = 2: 2/4 ≤ 1/2
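The condition in Definition 2 is easy to check mechanically. The sketch below represents each QI-group simply as a list of its sensitive values; the function name and the sample values beyond the counts given above (gastritis, bronchitis) are assumptions made for illustration.

```python
from collections import Counter


def is_l_diverse(partition, l):
    """True if, in every QI-group, the most frequent sensitive value
    accounts for at most a 1/l fraction of the group's tuples."""
    for sensitive_values in partition:
        most_frequent = max(Counter(sensitive_values).values(), default=0)
        # c_j(v) / |QI_j| <= 1/l, written without division.
        if most_frequent * l > len(sensitive_values):
            return False
    return True


# Two groups of 4 tuples with the counts used in the example:
# c1(dyspepsia) = c1(pneumonia) = 2 and c2(flu) = 2.
partition = [
    ["dyspepsia", "pneumonia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse(partition, 2), is_l_diverse(partition, 3))
```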

Specifications of Anatomy cont. DEFINITION 3. (Anatomy) Given an l-diverse partition, anatomy creates a QIT and an ST. QIT has the schema (Aqi1, Aqi2, ..., Aqid, Group-ID). ST has the schema (Group-ID, As, Count)
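As a small illustration of Definition 3, the following sketch derives both tables from a partition that is already l-diverse. The input layout (each tuple as a pair of QI values and a sensitive value) and the sample rows are assumptions for this example only.

```python
from collections import Counter


def build_qit_and_st(partition):
    """Build QIT rows (qi_1, ..., qi_d, group_id) and ST rows
    (group_id, sensitive_value, count) from a list of QI-groups.

    Each group is a list of (qi_values, sensitive_value) pairs.
    """
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        for qi_values, _ in group:
            qit.append((*qi_values, group_id))   # sensitive value is dropped
        counts = Counter(sensitive for _, sensitive in group)
        for sensitive, count in sorted(counts.items()):
            st.append((group_id, sensitive, count))
    return qit, st


# Hypothetical partition with QI attributes (Age, Zipcode).
partition = [
    [((23, 11000), "pneumonia"), ((27, 13000), "dyspepsia"),
     ((35, 59000), "dyspepsia"), ((59, 12000), "pneumonia")],
    [((61, 54000), "flu"), ((65, 25000), "gastritis"),
     ((65, 25000), "flu"), ((70, 30000), "bronchitis")],
]
qit, st = build_qit_and_st(partition)
for row in st:
    print(row)
```

Joining QIT and ST on Group-ID pairs every tuple with every sensitive value of its group, which is what hides the exact association.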

Privacy properties THEOREM 1. Given a pair of QIT and ST, an adversary can correctly infer the sensitive value of any individual with probability at most 1/l

Comparison with generalization • Compare with generalization under two assumptions: • A1: the adversary has the QI-values of the target individual • A2: the adversary also knows that the individual is definitely in the microdata • If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound holds • If A1 is true and A2 is false, generalization is stronger • If A1 and A2 are both false, generalization is still stronger

Preserving Data Correlation • Examine the correlation between Age and Disease in T using probability density function pdf • Example: t1

Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the generalization table:

Preserving Data Correlation cont. • To re-construct an approximate pdf of t1 from the QIT and ST tables:

Preserving Data Correlation cont. • For a more rigorous comparison, calculate the “L2 distance” between the actual and the re-constructed pdf with the following equation: • The distance for anatomy is 0.5 while the distance for generalization is 22.5 • Anatomy allows better re-construction of the probability density functions of all tuples.
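The anatomy figure of 0.5 can be reproduced with a short sketch, under several assumptions that the slides do not spell out: the distance is taken as a discrete sum of squared differences over (Age, Disease) points, t1 is assumed to have Age 23 and Disease pneumonia, and t1 is assumed to belong to a QI-group with 2 dyspepsia and 2 pneumonia tuples. The generalization figure depends on how the generalized pdf is modelled and is not reproduced here.

```python
def l2_distance(actual_pdf, approx_pdf):
    """Sum of squared differences between two pdfs given as dicts
    mapping a point (age, disease) to a probability."""
    points = set(actual_pdf) | set(approx_pdf)
    return sum((approx_pdf.get(p, 0.0) - actual_pdf.get(p, 0.0)) ** 2
               for p in points)


def anatomy_pdf(age, group_counts):
    """Re-constructed pdf of one tuple from QIT and ST: the age is exact,
    the disease follows the group's distribution in the ST."""
    total = sum(group_counts.values())
    return {(age, disease): count / total
            for disease, count in group_counts.items()}


# Assumed values for t1 (not given on the slide).
actual = {(23, "pneumonia"): 1.0}
approx = anatomy_pdf(23, {"dyspepsia": 2, "pneumonia": 2})
distance = l2_distance(actual, approx)
print(distance)
```

Here the two non-zero differences are (1 - 0.5) and (0 - 0.5), so the squared terms add up to 0.25 + 0.25.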

Preserving Data Correlation cont. • Measure the error of each re-constructed pdf using the following formula: • Objective: minimize the re-construction error (RCE) over all tuples t in T:

Nearly-Optimal Anatomizing Algorithm • The authors propose an efficient algorithm for anatomizing tables that minimizes the RCE • The resulting QIT and ST achieve an RCE that deviates from the lower bound only by a factor < 1 + 1/n, where n is the size of T • The algorithm has linear I/O complexity O(n/b), where b is the page size

Nearly-Optimal Anatomizing Algorithm cont. PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple. PROPERTY 2. The set S' always includes at least one QI-group. PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values
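The slides name the two phases but do not list the steps, so the following is only a sketch of one common way to realize them, not the authors' exact procedure. It assumes tuples are first bucketed by sensitive value, that each new group takes one tuple from each of the l largest buckets, and that S' denotes the groups not yet containing a leftover tuple's sensitive value. It also assumes no sensitive value occurs in more than n/l tuples. All names are hypothetical.

```python
import random
from collections import defaultdict


def anatomize_groups(tuples, l, seed=0):
    """Partition (qi_values, sensitive_value) pairs into QI-groups whose
    sensitive values are pairwise distinct, with at least l tuples each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for qi_values, sensitive in tuples:
        buckets[sensitive].append((qi_values, sensitive))

    groups = []
    # Group-creation phase: while at least l buckets are non-empty, draw
    # one random tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted((b for b in buckets.values() if b),
                         key=len, reverse=True)[:l]
        groups.append([b.pop(rng.randrange(len(b))) for b in largest])

    # Residue-assignment phase: place each leftover tuple in a group that
    # does not yet contain its sensitive value (the candidate set S').
    for sensitive, bucket in buckets.items():
        for leftover in bucket:
            candidates = [g for g in groups
                          if all(s != sensitive for _, s in g)]
            if not candidates:
                raise ValueError("input violates the assumed n/l bound")
            rng.choice(candidates).append(leftover)
    return groups


# Hypothetical microdata: 9 tuples, QI value is just a record number.
diseases = ["pneumonia", "pneumonia", "dyspepsia", "dyspepsia",
            "flu", "flu", "flu", "gastritis", "bronchitis"]
data = [((i,), d) for i, d in enumerate(diseases)]
result = anatomize_groups(data, l=2)
for group in result:
    print(group)
```

With 9 tuples and l = 2, four groups are created and one tuple is left over for the residue phase.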

Experiments • Dataset CENSUS contains the personal information of 500k American adults, with 9 discrete attributes • Two sets of microdata tables were created • Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As • Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As

Conclusion • Anatomy was designed to overcome the main drawback of generalization, which loses too much information, while still preserving privacy • Anatomy has a significantly lower error rate than generalization • Several items require further research • Multiple sensitive attributes • Effective mining of patterns in microdata

Agenda • Other anonymization technique • Anatomization • Information metrics • Extended scenarios

Information Metrics • General purpose metrics • Special purpose metrics • Trade-off metrics

General Purpose Metrics • General idea: measure “similarity” between the original data and the anonymized data • Minimal distortion metric (Samarati 2001; Sweeney 2002, Wang and Fung 2006) • Charge a penalty to each instance of a value generalized or suppressed (independently of other records) • ILoss (Xiao and Tao 2006) • Charge a penalty when a specific value is generalized
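As an illustration of a per-value penalty, ILoss is often stated as (|v_g| - 1)/|D_A|, where |v_g| is the number of domain values covered by the generalized value and |D_A| is the size of the attribute's domain. That formula is not given on the slide, so the sketch below should be read as an assumption; the function names and the optional attribute weights are hypothetical.

```python
def iloss_value(num_values_covered, domain_size):
    """Penalty for one generalized value: 0 for an original value,
    close to 1 when the value is generalized to the whole domain."""
    return (num_values_covered - 1) / domain_size


def iloss_record(covered_per_attribute, domain_sizes, weights=None):
    """Weighted sum of per-attribute penalties for one record."""
    if weights is None:
        weights = [1.0] * len(domain_sizes)
    return sum(w * iloss_value(c, d)
               for w, c, d in zip(weights, covered_per_attribute, domain_sizes))


# Hypothetical record: Age generalized to a range of 10 out of 100 values,
# Zipcode left at its original value (covers 1 out of 50 values).
loss = iloss_record([10, 1], [100, 50])
print(loss)
```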

General Purpose Metrics cont. • Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity …) • Charge a penalty to each record for being indistinguishable from other records
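The discernibility idea can be sketched as follows, assuming the commonly used form in which every record is charged the size of its group, so a group of size s contributes s squared. The optional charge for suppressed records (the full table size each) is also an assumption rather than something stated on the slide.

```python
def discernibility_metric(group_sizes, num_suppressed=0):
    """Sum of squared group sizes; each suppressed record is charged
    the total number of records."""
    total_records = sum(group_sizes) + num_suppressed
    return sum(size * size for size in group_sizes) + num_suppressed * total_records


# Two groups of 4 records cost more than four groups of 2 records.
coarse = discernibility_metric([4, 4])
fine = discernibility_metric([2, 2, 2, 2])
print(coarse, fine)
```

Lower values mean records are indistinguishable from fewer others, which is why finer partitions score better under this metric.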

Special Purpose Metrics • Classification: Classification metric (CM) (Iyengar 2002) • Charge a penalty for each record suppressed or generalized to a group in which the record’s class is not the majority class • Query • Query error: count queries • Query imprecision: overlapped range
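A minimal sketch of the classification metric described above: one unit of penalty for each suppressed record and for each record whose class label differs from the majority label of its group, normalized by the total number of records. The normalization and the input layout (each group as a list of class labels) are assumptions for illustration.

```python
from collections import Counter


def classification_metric(groups, num_suppressed=0):
    """Fraction of records that are suppressed or are in the minority
    class of their group. Assumes at least one record overall."""
    total = sum(len(labels) for labels in groups) + num_suppressed
    penalty = num_suppressed
    for labels in groups:
        if labels:
            majority_count = Counter(labels).most_common(1)[0][1]
            penalty += len(labels) - majority_count
    return penalty / total


# Hypothetical class labels per group, plus one suppressed record.
cm = classification_metric([["Y", "Y", "N"], ["N", "N"]], num_suppressed=1)
print(cm)
```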

Extended Scenarios • Multiple release publishing • Continuous release publishing • Collaborative/distributed publishing

Other types of data • High dimensional transaction data • Market basket, web queries • Moving objects data • Location based services • Textual data
