Anonymizing Sequential Releases Ke Wang Simon Fraser University wangk@cs.sfu.ca Benjamin C. M. Fung Simon Fraser University bfung@cs.sfu.ca ACM SIGKDD 2006
Motivation: Sequential Releases • Previous works address a single release only. • Data are released in multiple shots. • An organization makes a new release when: • New information becomes available. • A tailored view is needed for each data sharing purpose. • Sensitive information and identifying information are released separately. • Related releases sharpen the identification of individuals through a global quasi-identifier.
Do not want Name to be linked to Disease in the join of the two releases.
join sharpens identification: {Bob, HIV} has group size 1.
join weakens identification: {Alice, Cancer} has group size 4. A lossy join combats the join attack.
join enables inferences across tables: Alice → Cancer holds with 100% confidence.
Related Work • k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06] • Quasi-identifier (QID): a set of identifying attributes in the table. If some record is linked to an external source by a QID value, so are at least k-1 other records. • The database is made anonymous to itself. • In sequential releases, the database must be made anonymous to the combination of all releases thus far.
Related Work • l-diversity [MGK06] • Ensures that sensitive values are “well-represented” in each QID group, measured by entropy. • Confidence limiting [WFY05, WFY06]: for qid → s, confidence < h, where qid is a value on QID and s is a sensitive value.
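A confidence-limiting requirement of this form can be checked directly on a table of (qid, sensitive) pairs. A minimal sketch; the record layout and function name are illustrative, not from the paper:

```python
from collections import Counter

def violates_confidence_limit(records, h):
    """Return True if some sensitive value s can be inferred from
    some qid value with confidence >= h, i.e. conf(qid -> s) >= h.
    records: list of (qid_value, sensitive_value) pairs."""
    qid_counts = Counter(q for q, _ in records)
    pair_counts = Counter(records)
    return any(c / qid_counts[q] >= h
               for (q, _s), c in pair_counts.items())

rows = [("Banker/123", "HIV")] * 3 + [("Banker/123", "Flu")]
# conf(Banker/123 -> HIV) = 3/4 = 75%
print(violates_confidence_limit(rows, 0.80))  # False: 75% < 80%
print(violates_confidence_limit(rows, 0.50))  # True: 75% >= 50%
```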
Related Work • View releases • e.g., T1 and T2 are two views, both of which can be modified before release: more room for satisfying privacy and information requirements. • [MW04, DP05] measure the information disclosure of a view set wrt a secret view. • [YWJ05, KG06] detect privacy violations by a view set over a base table. • They measure or detect violations, but do not remove them.
Sequential Release • Sequential release: • Current release T1; previous release T2. • T1 was unknown when T2 was released. • T2, once released, cannot be modified when T1 is released. • Solution #1: k-anonymize all attributes in T1. • Excessive distortion. • Solution #2: generalize T1 based on T2. • Monotonically distorts each later release. • Solution #3: release a “complete” cohort of all potential releases anonymized at one time. • Requires predicting all future releases.
Intuition of Our Approach • A lossy join hides the true join relationship to cripple a global QID. • Generalizing the current release T1 so that the join with the previous release T2 becomes lossy enough to disorient the attacker. • Two general notions of privacy: (X,Y)-anonymity and (X,Y)-linkability, where X and Y are sets of attributes.
(X,Y)-Privacy • k-anonymity: # of distinct records for each QID value ≥ k. • (X,Y)-anonymity: # of distinct Y values for each X value ≥ k. • (X,Y)-linkability: the maximum confidence that a record contains y, given that it contains x, is ≤ k, where x and y are values on X and Y. • These generalize k-anonymity [SS98] and confidence limiting [WFY05, WFY06].
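Both measures can be computed directly over a table. A minimal sketch, with records as dicts and X, Y as attribute lists; the function names `a_y`/`l_y` and the record layout are assumptions for illustration:

```python
from collections import defaultdict

def a_y(records, X, Y):
    """(X,Y)-anonymity measure AY(X): the minimum number of
    distinct Y values over all X groups."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in X)].add(tuple(r[a] for a in Y))
    return min(len(ys) for ys in groups.values())

def l_y(records, X, Y):
    """(X,Y)-linkability measure LY(X): the maximum confidence
    conf(x -> y) over all value pairs (x, y)."""
    x_count, xy_count = defaultdict(int), defaultdict(int)
    for r in records:
        x = tuple(r[a] for a in X)
        x_count[x] += 1
        xy_count[x, tuple(r[a] for a in Y)] += 1
    return max(c / x_count[x] for (x, _y), c in xy_count.items())

rows = [{"Job": "Banker", "Pid": 1}, {"Job": "Banker", "Pid": 2},
        {"Job": "Clerk",  "Pid": 3}]
print(a_y(rows, ["Job"], ["Pid"]))  # 1: the Clerk group has one patient
print(l_y(rows, ["Job"], ["Pid"]))  # 1.0
```

(X,Y)-anonymity then requires `a_y(...) >= k`, and (X,Y)-linkability requires `l_y(...) <= k`.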
Example: (X,Y)-Anonymity • QID = {Job, Zip, PoB} is not a key. • k-anonymity fails to ensure that each value on QID is linked to at least k distinct patients.
Example: (X,Y)-Anonymity • With (X,Y)-anonymity, • specify the anonymity wrt patients by letting X = {Job, Zip, PoB} and Y = Pid • Each X group must be linked to at least k distinct values on Pid. • If X = {Job, Zip, PoB} and Y = Test, each X group is required to be linked to at least k distinct tests.
Example: (X,Y)-Linkability • {Banker, 123, Canada} → HIV (75% confidence). • With Y = Test, the (X,Y)-linkability states that no test can be inferred from a value on X with a confidence higher than a given threshold.
Problem Statement • The data holder has previously released T2 and wants to release T1, where T2 and T1 are projections of the same underlying table. • Want to ensure (X,Y)-privacy on the join of T1 and T2. • Sequential anonymization is to generalize T1 on X ∩ att(T1) so that the join of T1 and T2 preserves the (X,Y)-privacy and T1 remains as useful as possible.
Generalization / Specialization • (Taxonomy tree for Job: ANY at the root, with Professional, Admin, Engineer, Lawyer, Banker, Clerk below.) • Each generalization replaces all child values with the parent value. • A cut contains exactly one value on every root-to-leaf path. • Each specialization v → {v1,…,vc} replaces the value v in every record containing v with the child value vi that is consistent with the original domain value in the record.
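A specialization step can be sketched over a taxonomy held as a parent/child map. The grouping of job leaves under Professional/Admin below is an assumption for illustration, as is the record layout:

```python
# Illustrative taxonomy for Job (the grouping of the leaves is an
# assumption, not taken from the slide):
CHILDREN = {"ANY": ["Professional", "Admin"],
            "Professional": ["Engineer", "Lawyer"],
            "Admin": ["Banker", "Clerk"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def generalization_path(leaf):
    """Values on the path from a leaf up to the root ANY."""
    path = [leaf]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def specialize(records, originals, attr, v):
    """The specialization v -> {v1,...,vc}: replace v on attr by the
    child of v consistent with each record's original leaf value."""
    for rec, orig in zip(records, originals):
        if rec[attr] == v:
            rec[attr] = next(c for c in CHILDREN[v]
                             if c in generalization_path(orig[attr]))

recs  = [{"Job": "ANY"}, {"Job": "ANY"}]
origs = [{"Job": "Lawyer"}, {"Job": "Clerk"}]
specialize(recs, origs, "Job", "ANY")
print([r["Job"] for r in recs])  # ['Professional', 'Admin']
```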
Generalization / Specialization • An interval of a continuous attribute is split on-the-fly to maximize information utility. • e.g., age [30-40) → [30-37), [37-40). • The split at 37 maximizes the information gain. • A taxonomy tree is dynamically grown for each continuous (non-join) attribute.
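Choosing such a split can be sketched as a scan over candidate boundaries, picking the one that maximizes information gain (equivalently, minimizes the weighted class entropy of the two intervals). A sketch under the assumption that class labels are available per record:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def best_split(values, labels):
    """Split point of a continuous attribute that maximizes info
    gain, i.e. minimizes the weighted entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    best, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # can only split between distinct values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(pairs)
        if cost < best_cost:
            best, best_cost = pairs[i][0], cost
    return best

ages = [30, 31, 36, 37, 38, 39]
cls  = ["<=50K"] * 3 + [">50K"] * 3
print(best_split(ages, cls))  # 37: yields pure intervals [30-37), [37-40)
```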
Match Function • Given T1 and T2, the attacker may apply prior knowledge to match the records in T1 and T2. • So, the data holder applies the same prior knowledge for matching: • schema information of T1 and T2. • taxonomies for attributes. • an inclusion/exclusion principle for candidate matches.
Match Function • Let t1 ∈ T1 and t2 ∈ T2. • Consistency predicate: t1.A matches t2.A if they are on the same generalization path for attribute A. • e.g., Male matches Single Male. • Inconsistency predicate: t1.A matches t2.B only if t1.A and t2.B are not semantically inconsistent. • Excludes impossible matches. • e.g., Male and Pregnant are semantically inconsistent, and so are Married Male and 6-Month Pregnant.
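The consistency predicate amounts to an ancestor test in the attribute's taxonomy. A minimal sketch; the tiny taxonomy below is illustrative:

```python
# Consistency predicate: t1.A matches t2.A iff the two values lie
# on the same generalization path (one is an ancestor of the other).
# This small taxonomy is an illustrative assumption.
PARENT = {"Single Male": "Male", "Married Male": "Male",
          "Male": "ANY", "Female": "ANY"}

def ancestors(v):
    """The value itself plus all of its ancestors."""
    out = {v}
    while v in PARENT:
        v = PARENT[v]
        out.add(v)
    return out

def consistent(v1, v2):
    return v1 in ancestors(v2) or v2 in ancestors(v1)

print(consistent("Male", "Single Male"))  # True: same path
print(consistent("Male", "Female"))       # False: sibling values
```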
Algorithm Overview Top-Down Specialization for Sequential Anonymization Input: T1, T2, an (X,Y)-privacy requirement, a taxonomy tree for each attribute in X1, where X1 = X ∩ att(T1). Output: a generalized T1 satisfying the privacy requirement. • generalize every value of Aj to ANYj, where Aj ∈ X1; • while there is a valid candidate in ∪Cutj do • find the winner w of highest Score(w) from ∪Cutj; • specialize w on T1 and remove w from ∪Cutj; • update Score(v) and the valid status for all v in ∪Cutj; • end while • output the generalized T1 and ∪Cutj;
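The loop can be sketched as follows; `score(v)` and `valid(v)` stand in for the Score and validity bookkeeping (maintained with count statistics in the paper) and are abstract assumptions here:

```python
def top_down_specialization(roots, children, score, valid):
    """Skeleton of the top-down specialization loop.
    roots: the most general value ANYj of each taxonomy in X1;
    children(v): child values of v; score(v), valid(v): abstract
    stand-ins for the paper's Score(v) and validity checks."""
    cut = set(roots)                        # the union of the Cut_j
    while any(valid(v) for v in cut):
        w = max((v for v in cut if valid(v)), key=score)
        # ... specialize w on T1 here (update records/statistics) ...
        cut.remove(w)
        cut.update(children(w))
    return cut

# Toy taxonomy: ANY -> {A, B}, A -> {A1, A2}; B is never specialized.
CHILDREN = {"ANY": ["A", "B"], "A": ["A1", "A2"]}
cut = top_down_specialization(
    roots=["ANY"],
    children=lambda v: CHILDREN.get(v, []),
    score=lambda v: 1,                  # dummy Score(v)
    valid=lambda v: v in CHILDREN)      # dummy: v still has children
print(sorted(cut))  # ['A1', 'A2', 'B']
```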
Monotonic Privacy • Theorem 1: On a single table, (X,Y)-privacy is anti-monotone wrt specialization on X. • If violated, it remains violated after any further specialization. • AY(X) is non-increasing wrt specialization on X. • A specialization always shrinks the set of records that contain a given X value, and therefore the set of Y values that co-occur with it. • LY(X) is non-decreasing wrt specialization on X. • A specialization v → {v1,…,vc} transforms a value x on X into the specialized values x1,…,xc. • Since ly(x) is a weighted average of ly(x1),…,ly(xc), some xj must satisfy ly(xj) ≥ ly(x).
Monotonic Privacy • On the join of T1 and T2, in general, (X,Y)-anonymity is not anti-monotone wrt a specialization on X ∩ att(T1). • Specializing T1 may create dangling records. • Two tables are population-related if every record in each table has at least one matching record in the other table, i.e., there are no dangling records. • Lemma 1: If T1 and T2 are population-related, AY(X) is non-increasing wrt specialization on X ∩ att(T1).
Monotonic Privacy • Lemma 2: If Y contains attributes from T1 or T2, but not from both, LY(X) does not decrease after specialization of T1 on the attributes X ∩ att(T1). • Theorem 2: Assume that T1 and T2 are projections of the same underlying table; then (X,Y)-anonymity and (X,Y)-linkability on the join of T1 and T2 are anti-monotone wrt specialization of T1 on X ∩ att(T1).
Score Metric • Score(v) evaluates the “goodness” of a specialization v for preserving privacy and information. • Each specialization v gains some information and loses some privacy. We maximize the information gained per unit of privacy lost: Score(v) = InfoGain(v) / (PrivLoss(v) + 1). • InfoGain(v) is measured on T1. • PrivLoss(v) is measured on the join of T1 and T2.
Information Gain • If T1 is released for classification on a specified class column, InfoGain(v) can be the reduction of class entropy: InfoGain(v) = E(T1[v]) − Σi (|T1[vi]| / |T1[v]|) × E(T1[vi]). • T1[v] denotes the set of generalized records in T1 that contain v before the specialization. • T1[vi] denotes the set of records in T1 that contain vi after the specialization. • Alternatively, InfoGain(v) can be based on a distortion metric.
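The entropy-reduction form of InfoGain(v) can be computed directly from the class labels of T1[v] and of each T1[vi]; a minimal sketch:

```python
import math
from collections import Counter

def class_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def info_gain(parent_labels, child_label_lists):
    """InfoGain(v) = E(T1[v]) - sum_i |T1[vi]|/|T1[v]| * E(T1[vi]).
    parent_labels: class labels of records containing v;
    child_label_lists: one label list per child value vi."""
    n = len(parent_labels)
    return class_entropy(parent_labels) - sum(
        len(ls) / n * class_entropy(ls) for ls in child_label_lists)

# A specialization that separates the classes perfectly gains 1 bit:
print(info_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]))  # 1.0
```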
Privacy Loss • PrivLoss(v) is measured by the decrease of AY(X) or the increase of LY(X) due to the specialization of v: AY(X) - AY(Xv) for (X,Y)-anonymity LY(Xv) - LY(X) for (X,Y)-linkability where X and Xv represent the attributes before and after specializing v respectively.
Challenges • Each specialization on w affects the join matching, and thus the privacy checking. • Too expensive to rejoin the two tables for each specialization. • Materializing the join is impractical: a lossy join can be very large. • Our solution: incrementally maintain count statistics to update Score(v) without executing the join.
Data Structure • Expensive operations on specializing w: • accessing the records in T1 containing w; • matching the records in T1 with the records in T2. • Notation: X1 = X ∩ att(T1) and X2 = X ∩ att(T2); J1 and J2 denote the join attributes in T1 and T2.
Data Structure • Tree1: partition T1 records by the attributes X1 and J1-X1 in that order, one level per attribute. • Link[v] links up all nodes for v at the attribute level of v. • Tree2: partition T2 records by the attributes J2 and X2-J2 in that order. • Tree2 is static. • Probe the matching partitions in Tree2. • Match the last |J1| attributes in a partition in Tree1 with the first |J2| attributes in Tree2.
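The partition-level matching can be sketched with simple grouping; the flat-dict stand-in for the per-attribute partition tree and the attribute names are illustrative assumptions:

```python
from collections import defaultdict

def partition(records, attrs):
    """Group records by their values on attrs -- a flat stand-in
    for the per-attribute partition tree on the slide."""
    parts = defaultdict(list)
    for r in records:
        parts[tuple(r[a] for a in attrs)].append(r)
    return parts

T2 = [{"Sex": "Male",   "Job": "Banker"},
      {"Sex": "Male",   "Job": "Banker"},
      {"Sex": "Female", "Job": "Clerk"}]
# Tree2 is keyed on the join attributes (J2) first, so a T1
# partition probes its matching T2 partitions by join value alone,
# touching whole partitions rather than individual records:
tree2 = partition(T2, ["Sex", "Job"])
print(len(tree2[("Male", "Banker")]))  # 2 records in one partition
```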
Analysis • On specializing w, Link[w] provides direct access to the records involved in T1. • Tree2 provides direct access to the matching partitions in T2. • Matching is performed at the partition level, not at the record level. • The cost of each iteration has two parts: • Specialize the affected partitions on Link[w]. • Update the score and status of candidates using count statistics. • Each record in T1 is accessed at most |X ∩ att(T1)| × h times, where h is the maximum height of the taxonomies.
Empirical Study • The Adult data set: 45,222 records. • Two versions of (T1, T2): • Set A (categorical attributes only): • T1 contains the Class attribute, the 3 categorical attributes, and the 3 join attributes. • T2 contains the 2 categorical attributes and the 3 join attributes. • Set B (both categorical and continuous): • T1 additionally contains the 6 continuous attributes from the Taxation Department.
Schema for Set A • (Schema diagram; T1 contains the Class attribute.)
Empirical Study • Classification metric • Classification error on the generalized testing set of T1. • Distortion metric [SS98] • Categorical: 1 unit of distortion for each generalization. • Continuous: suppose v is generalized to interval [a, b). Unit of distortion = (b − a)/(f2 − f1), where [f1, f2) is the full range of the attribute. • Normalize total distortion by the number of records.
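The distortion metric can be sketched as follows; the taxonomy, record layout, and the convention of releasing continuous values as (a, b) pairs are illustrative assumptions:

```python
# Illustrative taxonomy for counting categorical generalizations.
PARENT = {"Engineer": "Professional", "Professional": "ANY"}

def steps_up(leaf, value):
    """Generalization steps from the original leaf value up to the
    released value: 1 unit of distortion per step."""
    steps, v = 0, leaf
    while v != value:
        v, steps = PARENT[v], steps + 1
    return steps

def avg_distortion(original, generalized, full_ranges):
    """Total distortion normalized by the number of records.
    Continuous values are released as (a, b) pairs meaning [a, b),
    costing (b - a)/(f2 - f1) within full range [f1, f2)."""
    total = 0.0
    for orig, gen in zip(original, generalized):
        for attr, value in gen.items():
            if attr in full_ranges:            # continuous attribute
                (a, b), (f1, f2) = value, full_ranges[attr]
                total += (b - a) / (f2 - f1)
            else:                              # categorical attribute
                total += steps_up(orig[attr], value)
    return total / len(original)

orig = [{"Job": "Engineer", "Age": 35}]
gen  = [{"Job": "Professional", "Age": (30, 40)}]
print(avg_distortion(orig, gen, {"Age": (17, 90)}))  # 1 + 10/73
```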
(X,Y)-Anonymity • TopN attributes: most important for classification. • Chosen by successively removing the top attribute in a decision tree. • Join attributes are the Top3 attributes. • If not important, simply remove them. • X contains • TopN attributes in T1 for a specified N (to ensure that the generalization is performed on important attributes), • all join attributes, • all attributes in T2 (to ensure X is global).
Distortion of (X,Y)-anonymity • Ki is a key in Ti. • XYD: produced by our method with Y = K1. • KAD: produced by k-anonymity on T1 with QID = att(T1). (Charts for Set A and Set B.)
Classification error of (X,Y)-anonymity • XYE: produced by our method with Y = K1. • XYE(row): produced by our method with Y = {K1, K2}. • BLE: produced by the unmodified data. • KAE: produced by k-anonymity on T1 with QID = att(T1). • RJE: produced by removing all join attributes from T1. (Charts for Set A and Set B.)
(X,Y)-Linkability • Y contains the TopN attributes. • If not important, simply remove them. • X contains the rest of the attributes in T1 and T2, except T2.Ra and T2.Nc because otherwise no privacy requirement can be satisfied. • Focus on the classification error because the distortion due to (X,Y)-linkability is not comparable with the distortion due to k-anonymity.
Classification error of (X,Y)-linkability • XYE: produced by our method with Y = TopN. • BLE: produced by the unmodified data. • RJE: produced by removing all join attributes from T1. • RSE: produced by removing all attributes in Y from T1. (Charts for Set A and Set B.)
Scalability • (X,Y)-anonymity (k = 40). • (X,Y)-linkability (k = 90%). (Runtime charts for both requirements.)
Conclusion • Previous works on k-anonymization focused on a single release of data. • Studied the sequential anonymization problem. • Extended the privacy notion to this model. • Introduced the lossy join as a way to hide the join relationship among releases. • Addressed the computational challenges due to the large size of the lossy join. • Extendable to more than one previously released table, T2,…,Tp.
References [BA05] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In IEEE ICDE, pages 217–228, 2005. [DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005. [FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, pages 205–216, April 2005. [KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.
References [LDR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD, 2005. [MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006. [MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004. [SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.
References [WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466–473, November 2005. [WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006. [WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.
References [WLFW06] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006. [YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.