Data Anonymization (1)


Outline • Problem • Concepts • Algorithms on domain generalization hierarchies • Algorithms on numerical data

The Massachusetts Governor Privacy Breach • The Governor of MA was uniquely identified using Zip code, Birth date, and Sex, so his Name could be linked to his Diagnosis • 87% of the US population is uniquely identified by this combination • Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex • Voter List: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex • Quasi-identifier (shared by both tables): Zip, Birth date, Sex • Source: Sweeney, IJUFKS 2002

Definition • Table • Columns: attributes; rows: records • Quasi-identifier (QI) • A list of attributes that can potentially be used to identify individuals • K-anonymity • Every combination of QI values that occurs in the table occurs in at least k records
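The k-anonymity definition above can be checked mechanically. The following is a minimal sketch, assuming records are plain dictionaries; the function name `is_k_anonymous` and the sample values are hypothetical and only illustrate the definition.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Return True if every QI value combination occurs in at least k records."""
    # Count how many records share each combination of QI values.
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical sample table: the third record is alone in its QI group.
records = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "sex"], 2))      # False
print(is_k_anonymous(records[:2], ["zip", "sex"], 2))  # True
```

Non-QI columns such as the diagnosis are ignored by the check; only the QI columns decide group membership.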

Basic techniques • Generalization • Zip {02138, 02139} → 0213* • Domain generalization hierarchy • A0 → A1 → … → An • E.g. {02138, 02139} → 0213* → 021** → 02*** → 0**** • This hierarchy is a tree structure • Suppression
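A zip-code generalization hierarchy of this kind can be sketched in a few lines. This is an illustration only, assuming that each level masks one more trailing digit; the helper names `generalize_zip` and `hierarchy_chain` are hypothetical.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits with '*' (level 0 keeps the original value)."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level

def hierarchy_chain(zip_code):
    """List every generalization of a value, from most specific to fully masked."""
    return [generalize_zip(zip_code, lvl) for lvl in range(len(zip_code) + 1)]

print(hierarchy_chain("02138"))
# ['02138', '0213*', '021**', '02***', '0****', '*****']
```

Both 02138 and 02139 map to 0213* at level 1, which is what places them in the same group. The fully masked value at the top of the chain is equivalent to suppressing the attribute.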

Balance • Trade-off: a better privacy guarantee means lower data utility • There are many schemes satisfying the k-anonymity specification. • We want to minimize the distortion of the table in order to maximize data utility • Suppression is required if we cannot find a k-anonymity group for a record.

Criteria • Minimal generalization • The minimal generalization that satisfies the k-anonymization specification • Minimal table distortion • Minimal generalization with minimal utility loss • Use precision to evaluate the loss [Sweeney papers] • Application-specific utility

Complexity of finding the optimal generalization • NP-hard (Bayardo, ICDE05) • So all proposed algorithms are approximate algorithms

Shared features in different solutions • Always satisfy the k-anonymity specification • If some records cannot satisfy it, suppress them • Differences lie in the utility loss/cost function • Sweeney’s precision metric • Discernibility & classification metrics • Information-privacy metric • Algorithms • Assume the domain generalization hierarchy is given • Efficiency • Utility maximization

Metrics to be optimized • Two cost metrics that we want to minimize (Bayardo, ICDE05) • Discernibility • Penalizes each record by the number of items in its k-anonymity group • Classification • The dataset has a class label column; the goal is to preserve the classification model • Penalizes the number of records in minority classes within each group
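The two cost metrics can be made concrete with a small sketch. The slide's formulas did not survive, so the code below assumes the commonly cited forms: discernibility charges each record the size of its group (or the table size if the group is smaller than k and must be suppressed), and the classification metric is the fraction of records that are suppressed or carry a non-majority class label in their group. The function names and sample data are hypothetical.

```python
from collections import Counter

def discernibility(groups, k):
    """groups: list of groups, each a list of class labels (one per record)."""
    total = sum(len(g) for g in groups)
    cost = 0
    for g in groups:
        size = len(g)
        # Groups smaller than k are treated as suppressed: each record costs |table|.
        cost += size * size if size >= k else total * size
    return cost

def classification(groups, k):
    """Fraction of records that are suppressed or in a minority class of their group."""
    total = sum(len(g) for g in groups)
    penalty = 0
    for g in groups:
        if len(g) < k:
            penalty += len(g)  # every suppressed record is penalized
        else:
            majority = Counter(g).most_common(1)[0][1]
            penalty += len(g) - majority  # records outside the majority class
    return penalty / total

groups = [["y", "y", "n"], ["n", "n"], ["y"]]
print(discernibility(groups, 2))  # 3*3 + 2*2 + 6*1 = 19
print(classification(groups, 2))  # (1 + 0 + 1) / 6
```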

Metrics • A combination of information loss and anonymity gain (Wang, ICDE04) • Information loss, anonymity gain • Information-privacy metric

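Metrics • Information loss • The dataset has class labels • Entropy • For a set $S$ of records labeled with different classes, entropy measures the impurity of the labels: $\mathrm{Info}(S) = -\sum_i P_i \log_2 P_i$, where $P_i$ is the fraction of records in $S$ with label $i$ • Information loss of a generalization $G: \{c_1, c_2, \ldots, c_n\} \to p$: $I(G) = \mathrm{Info}(S_p) - \sum_{i=1}^{n} \frac{|S_{c_i}|}{|S_p|} \mathrm{Info}(S_{c_i})$, where $S_v$ is the set of records with value $v$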
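A worked example may help with the entropy-based information loss. The sketch below assumes the loss of a generalization is the parent's entropy minus the size-weighted entropy of the child values it merges; the names `info` and `information_loss` and the sample labels are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy (impurity) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_loss(children):
    """children: dict mapping each child value to the class labels of its records."""
    # Generalizing merges all child records under the parent value.
    parent = [lab for labs in children.values() for lab in labs]
    weighted = sum(len(labs) / len(parent) * info(labs) for labs in children.values())
    return info(parent) - weighted

# Each zip code is pure on its own, so merging them loses one full bit.
pure = {"02138": ["y", "y"], "02139": ["n", "n"]}
# Each zip code is already maximally mixed, so merging loses nothing.
mixed = {"02138": ["y", "n"], "02139": ["y", "n"]}
print(information_loss(pure), information_loss(mixed))
```

The first case shows why a generalization can hurt a classifier: the child values separated the classes perfectly, and the parent value no longer does.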

Anonymity gain • $A(VID)$: number of records with the VID • $A_G(VID) \ge A(VID)$: generalization improves or does not change $A(VID)$ • Anonymity gain: $P(G) = x - A(VID)$, where $x = A_G(VID)$ if $A_G(VID) \le k$, and $x = k$ otherwise • Once k-anonymity is satisfied, further generalization of the VID yields no additional gain

Information-privacy combined metric • $IP(G) = \text{information loss} / \text{anonymity gain} = I(G)/P(G)$ • We want to minimize IP • If $P(G) = 0$, use $I(G)$ only • Either a small $I(G)$ or a large $P(G)$ reduces IP • If the $P(G)$ values are the same, pick the generalization with the minimum $I(G)$
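The anonymity gain and the combined metric can be sketched as follows. This is an illustration under the definitions on the slides, assuming the table is not yet k-anonymous before the generalization; the function names and the candidate numbers are hypothetical.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at k."""
    # Anonymity beyond k earns no extra credit.
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(info_loss, gain):
    """Information loss per unit of anonymity gain; falls back to I(G) if P(G) is 0."""
    return info_loss if gain == 0 else info_loss / gain

# Hypothetical candidates: name -> (I(G), A(VID) before, A_G(VID) after), with k = 5.
k = 5
candidates = {"zip": (0.6, 2, 6), "birth_date": (0.5, 2, 3)}
scores = {
    name: ip_score(loss, anonymity_gain(before, after, k))
    for name, (loss, before, after) in candidates.items()
}
best = min(scores, key=scores.get)
print(best)  # zip: gain is capped at 5 - 2 = 3, so its loss per unit gain is lowest
```

Here the zip generalization loses more information in absolute terms but buys three units of anonymity, so it wins on the ratio.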

Domain-hierarchy based algorithms • Sweeney’s algorithm • Bayardo’s tree pruning algorithm • Wang’s top-down and bottom-up algorithms • They are all dimension-by-dimension methods

Multidimensional techniques • Categorical data? • Categories are mapped to numbers (numerize the categories) • Bayardo 05 paper • Does the order matter? (no research on that) • Numerical data • K-anonymization → n-dimensional space partitioning • Many existing techniques can be applied

Single-dimensional vs. multidimensional

The evolving procedure • Categorical (domain hierarchy) [Sweeney, top-down/bottom-up] → numerized categories, single-dimensional [Bayardo05] → numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

Method 1: Mondrian • Numerize categorical data • Apply a top-down partitioning process • [Figure: partitioning steps 1, 2.1, 2.2]

Allowable cut
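The top-down partitioning with allowable cuts can be sketched as a short recursive function. This is a simplified illustration rather than the published Mondrian algorithm: it assumes numeric tuples, treats a cut as allowable only when both sides keep at least k records, splits at the median, and tries the attribute with the widest raw range first (without normalizing by domain width). The function name `mondrian` and the sample data are hypothetical.

```python
def mondrian(records, k):
    """Partition numeric tuples into groups of at least k records each (assumes k >= 1)."""
    def spread(d):
        return max(r[d] for r in records) - min(r[d] for r in records)

    # Try attributes from widest to narrowest value range.
    for d in sorted(range(len(records[0])), key=spread, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        # Allowable cut: both halves must still hold at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No allowable cut on any attribute: this partition becomes one group.
    return [records]

# Hypothetical (age, salary in thousands) records.
data = [(25, 30), (27, 35), (30, 40), (45, 80), (50, 90), (52, 85)]
groups = mondrian(data, 2)
print(len(groups), [len(g) for g in groups])
```

With k = 2 the six records split once on salary; neither half of three records can be cut again without leaving a group of one, so partitioning stops there.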

Method 2: spatial indexing • Multidimensional spatial techniques • Kd-tree (similar to the Mondrian algorithm) • R-tree and its variations • [Figure: upper layer and leaf layer of an R-tree and an R+-tree]

Compacting bounds • Information is better preserved • Example: uncompacted: age[1-80], salary[10k-100k]; compacted: age[20-40], salary[10k-50k] • The original Mondrian does not consider compacting bounds • For the R+-tree, it is done automatically.
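Compacting amounts to publishing the tightest range that still covers every record in a group, instead of the range of the partition cell the group fell into. A minimal sketch, assuming groups of numeric tuples; the helper name `compact_bounds` and the records are hypothetical, chosen to reproduce the compacted ranges in the example.

```python
def compact_bounds(group):
    """Return the tight (min, max) range of each attribute in a group."""
    # zip(*group) turns a list of records into one column per attribute.
    return [(min(col), max(col)) for col in zip(*group)]

# Hypothetical (age, salary) records that fell into the cell age[1-80], salary[10k-100k].
group = [(20, 10_000), (35, 50_000), (40, 30_000)]
print(compact_bounds(group))  # [(20, 40), (10000, 50000)]
```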

Benefits of using the R+-tree • Scalable: originally designed for indexing large disk-based data • Multi-granularity k-anonymity: layers • Better performance • Better quality

Performance • [Figure: Mondrian]

Utility • Metrics • Discernibility penalty • KL divergence: describes the difference between a pair of distributions, here the original and the anonymized data distributions • Certainty penalty: compares each record's generalized range with the attribute's total range • T: table, t: record, m: number of attributes, t.Ai: generalized range, T.Ai: total range
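The certainty penalty can be illustrated with a short sketch. The slide's formula did not survive, so the code assumes one common form: sum, over every record t and attribute Ai, the width of the generalized range t.Ai divided by the width of the total range T.Ai, with no further normalization. The function name and the sample data are hypothetical.

```python
def certainty_penalty(groups, domain):
    """groups: lists of numeric tuples; domain: one (min, max) total range per attribute."""
    total = 0.0
    for g in groups:
        # Every record in a group is published with the group's range per attribute.
        for col, (d_lo, d_hi) in zip(zip(*g), domain):
            width = max(col) - min(col)
            total += len(g) * width / (d_hi - d_lo)  # assumes d_hi > d_lo
    return total

# One attribute (age) with total range 0-100 and two groups of two records each.
domain = [(0, 100)]
groups = [[(10,), (30,)], [(50,), (60,)]]
print(certainty_penalty(groups, domain))  # about 2 * 0.2 + 2 * 0.1
```

A lower value means records are published with narrower ranges, which is why compacted bounds improve this metric.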

Other issues • Sparse high-dimensionality • Transactional data → boolean matrix • “On the anonymization of sparse high-dimensional data”, ICDE08 • Relates to the clustering problem for transactional data • The above paper uses matrix-based clustering • Item-based clustering (?)

Other issues • Effect of numerizing categorical data • The ordering of categories may have an impact on quality • General-purpose utility metrics vs. task-specific utility metrics • Attacks on the k-anonymity definition
