310 likes | 672 Views
Data Anonymization (1). Outline. Problem concepts algorithms on domain generalization hierarchy Algorithms on numerical data. The Massachusetts Governor Privacy Breach. Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. Name linked to Diagnosis.
E N D
Outline • Problem • concepts • algorithms on domain generalization hierarchy • Algorithms on numerical data
The Massachusetts Governor Privacy Breach • Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. • Name linked to Diagnosis 87 % of US population • Name • SSN • Visit Date • Diagnosis • Procedure • Medication • Total Charge • Name • Address • Date Registered • Party affiliation • Date last voted • Zip • Birth date • Sex • Zip • Birth date • Sex Quasi Identifier Medical Data Voter List Sweeney, IJUFKS 2002 3
Definition • Table • Column: attributes, row: records • Quasi-identifier • A list of attributes that can potentially be used to identify individuals • K-anonymity • Any QI in the table appears at least k times
Basic techniques • Generalization • Zip {02138, 02139} 0213* • Domain generalization hierarchy • A0 A1…An • Eg. {02138, 02139} 0213* 021* 02*0** • This hierarchy is a tree structure suppression
Balance Better privacy guarantee Lower data utility • There are many schemes satisfying the k-anonymity specification. • We want to minimize the distortion of table, in order to maximize • data utility • Suppression is required if we cannot find a k-anonymity group for • a record.
Criteria • Minimal generalization • Minimal generalization that satisfy the k-anonymization specification • Minimal table distortion • Minimal generalization with minimal utility loss • Use precision to evaluate the loss [sweeny papers] • Application-specific utility
Complexity of finding optimal solution on generalization • NP-hard (bayardo ICDE05) • So all proposed algorithms are approximate algorithms
Shared features in different solutions • Always satisfy the k-anonymity specification • If some records not, suppress them • Differences are at the utility loss/cost function • Sweeney’s precision metric • Discernibility & classification metrics • Information-privacy metric • Algorithms • Assume the domain generalization hierarchy is given • Efficiency • Utility maximization
Metrics to be optimized • Two cost metrics – we want to minimize (bayardo ICDE05) • Discernibility • Classification • The dataset has a class label column – preserving the classification model # of items in the k-anony group # Records in minor classes in the group
metrics • A combination of information loss and anonymity gain (wang ICDE04) • Information loss, anonymity gain • Information-privacy metric
metrics • Information loss • Dataset has class labels • Entropy • a set S, labeled by different classes • Entropy is used to calculate the impurity of labels • Information loss of a generalization G {c1,c2,…cn} p I(G) = info(Sp) - info (Rci) Info(S)= Pi is the percentage of label i
Anonymity gain • A(VID) : # of records with the VID • AG(VID) >= A(VID): generalization improves or does not change A(VID) • Anonymity gain P(G) = x – A(VID) x = AG (VID) if AG (VID) <=K x = K, otherwise As long as k-anonymity is satisfied, further generalization of the VID does not gain
Information-privacy combined metric IP = info loss/anonymity gain = I(G)/P(G) We want to minimize IP If P(G) ==0, use I(G) only Either small I(G) or large P(G) will reduce IP… If P(G)s are same, pick one with minimum I(G)
Domain-hierarchy based algorithms • The sweeny’s algorithm • Bayardo’s tree pruning algorithm • Wang’s top-down and bottom up algorithms • They are all dimension-by-dimension methods
Multidimensional techniques • Categorical data? • Categories are mapped to • numerize the categories • Bayardo 95 paper • Order matters? (no research on that) • Numerical data • K-anonymization n-dim space partitioning • Many existing techniques can be applied
The evolving procedure Categorical(domain hierarchy)[sweeney, top-down/bottom-up] numerized categories, single dimensional [bayardo05] numerized/numerical multidimensional[Mondrian,spatial indexing,…]
Method 1: Mondrain • Numerize categorical data • Apply a top-down partioning process Step2.2 Step2.1 step1
Method 2: spatial indexing • Multidimensional spatial techniques • Kd-tree (similar to Mondrain algorithm) • R-tree and its variations Upper layer Leaf layer R+-tree R-tree
Compacting bounds Information is better preserved Example: uncompacted: age[1-80], salary[10k-100k] compacted: age[20-40], salary[10k-50k] Original Mondrain does not consider compacting bounds For R+-Tree, it is automatically done.
Benefits of using R+-Tree • Scalable: originally designed for indexing disk-based large data • Multi-granularity k-anonymity: layers • Better performance • Better quality
Performance Mondrain
Utility • Metrics • Discenibility penalty • KL divergence: describe the difference between a pair of distributions • Certainty penalty Anonymized data distribution T: table, t: record, m: # of attributes, t.Ai generaled range, T.Ai total range
Other issues • Sparse high-dimensionality • Transactional data boolean matrix “On the anonymization of sparse high-dimensional data” ICDE08 • Relate to the clustering problem of transactional data! • The above one uses matrix-based clustering • item based clustering (?)
Other issues • Effect of numerizing categorical data • Ordering of categories may have certain impact on quality • General-purpose utility metrics vs. special task oriented utility metrics • Attacks on k-anonymity definition