Anonymizing Sequential Releases

Anonymizing Sequential Releases Ke Wang Simon Fraser University wangk@cs.sfu.ca Benjamin C. M. Fung Simon Fraser University bfung@cs.sfu.ca ACM SIGKDD 2006

Motivation: Sequential Releases • Previous works address single release only. • Data are typically released sequentially in multiple versions. • New information become available. • A tailored view for each data sharing purpose. • Separate releases for sensitive and identifying information.

Do not want Name to be linked to Disease in the join of the two releases.

join sharpens identification: {Bob, HIV} has groups size 1.

join weakens identification: {Alice, Cancer} has groups size 4. lossy join: combat join attack.

join enables inferences across tables: AliceCancer has 100% confidence.

Related Work • k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06] • Quasi-identifier (QID): e.g., {Job, birth date, Zip}. • The database is made anonymous to its local QID. • In sequential releases, the database must be made anonymous to a global QID spanning the join of all releases thus far.

Related Work • l-diversity [MGK06] • Sensitive values are “well-represented” in each QID group (measured by entropy). • Confidence limiting [WFY05, WFY06]: qid s, confidence < h where qid is a QID group, s is a sensitive value.

Related Work • View releases • T1 and T2 are two views in one release, both can be modified before the release. • [MW04, DP05] measures information disclosure of a view set wrt a secret view. • [YWJ05, KG06] detects privacy violation by a view set over a base table. • Detect, not eliminate, violations.

Sequential Release • Sequential release: • Current release T1. Previous release T2. • T1 was unknown when T2 was released. • T2 cannot be modified when T1 is released. • Solution #1: k-anonymize all attributes in T1 - excessive distortion. • Solution #2: generalize T1 based on T2 - monotonically distort the later release. • Solution #3: anonymize a complete cohort of all potential releases at one time – must predict all future releases

Intuition of Our Approach • A lossy join hides the true join relationship to cripple a global QID. • Generalize T1 so that the join with T2 becomes lossy enough to disorient the attacker. • Two general privacy notions: (X,Y)-anonymity and (X,Y)-linkability, where X and Y are sets of attributes.

(X,Y)-Privacy • k-anonymity: # of distinct records for each QID group ≥ k. • (X,Y)-anonymity: # of distinct Y values for each X group ≥ k. • (X,Y)-linkability:the maximum confidence of having a Y value given having a X value is ≤ k. • Generalize k-anonymity [SS98] and confidence limiting [WFY05, WFY06].

Example: (X,Y)-Anonymity • k-anonymity uses # of records as anonymity, fails to ensure kdistinctpatients.

Example: (X,Y)-Anonymity • Anonymity wrt patients (instead of records): • X = {Job, Zip, PoB} and Y = Pid • Each X group is linked to at least k distinct values on Pid. • Anonymity wrt tests: • X = {Job, Zip, PoB} and Y = Test • Each X group is linked to at least k distinct tests.

Example: (X,Y)-Linkability • {Banker,123,Canada}  HIV (75% confidence). • With Y = Test, (X,Y)-linkability states that no test can be inferred from a X group with confidence >a given threshold.

Problem Statement • The data holder made previous release T2 and now makes current release T1, where T2 and T1 are projections of the same underlying table. • Want to ensure (X,Y)-privacy on the join of T1 and T2, where X and Y are attribute sets on the join. • Sequential anonymization: generalize T1 on X∩ att(T1) so that the join satisfies (X,Y)-privacy and T1 remains as useful as possible.

Job ANY Professional Admin Engineer Lawyer Banker Clerk Generalization / Specialization • Each generalization replaces all child values with the parent value. • A cut contains exactly one value on every root-to-leaf path. • Alternatively, each specialization replaces the parent value with a consistent child value in the record.

Match Function • The attacker applies prior knowledge to match the records in T1 and T2. • So, the data holder applies such prior knowledge in sequential anonymization • We consider prior knowledge: • schema information of T1 and T2. • taxonomies for attributes. • the inclusion-exclusion principle.

Match Function • Let t1 T1 and t2 T2. • Inclusion Predicate: t1.Amatchest2.A if they are on the same generalization path for attribute A. • e.g., Male matches Single Male. • Exclusion Predicate: t1.Amatchest2.B only if they are not semantically inconsistent (based on common sense). • To exclude impossible matches. • e.g., Male and Pregnant are semantically inconsistent, so are Married Male and 6 Month Pregnant.

Algorithm Overview Top-Down Specialization Input: T1, T2, (X,Y)-privacy, a taxonomy tree for each attribute in X1=X ∩ att(T1). Output: a generalized T1 satisfying the privacy requirement. • generalize every value of Aj to ANYj where AjX1; • while there is a valid candidate in ỤCutjdo • find the winner w of highest Score(w) from ỤCutj; • specialize w on T1 and remove w from ỤCutj; • update Score(v) and the valid status for all v in ỤCutj; • end while • output the generalized T1 and ỤCutj;

Anti-Monotone Privacy • Theorem 1: On a single table, (X,Y)-privacy is anti-monotone wrt specialization on X: if violated, remains violated after a specialization. • On the join of T1 and T2, (X,Y)-privacy is not anti-monotone wrt specialization of T1. • Specializing T1 may create dangling records, e.g., by specializing “CA” into “LA” and “San Francisco”, “LA” records in T1 no longer match “San Francisco” records in T2.

Anti-Monotone Privacy • Theorem 2: Assume that T1 and T2 are projections of the same underlying table, (X,Y)-privacy on the join of T1 and T2 is anti-monotone wrt specialization of T1 on X∩ att(T1).

Score Metric • Each specialization gains some information and loses some privacy. We maximize gain per loss • InfoGain(v) is measured on T1. • PrivLoss(v) is measured on the join of T1 and T2.

Challenges Each specialization affects the matching of join, Score(v), and privacy checking. • rejoining T1 and T2 for each specialization is too expensive. • Materializing the join is impractical because a lossy join can be very large. Our solution: Incrementally maintains some count statistics without executing the join • extension of Top-Down Specialization [FWY05][WFY05]

Empirical Study • The Adult data set. 45222 records. Categorical attributes only.

Schema for T1 and T2 • T1 contains the Class Income level

Empirical Study • Classification metric • Classification error on the generalized testing set of T1. • Distortion metric [SS98] • 1 unit of distortion for generalization of each value in each record. • Normalized by the number of records.

(X,Y)-Anonymity • TopN attributes: most important for classification. • Join attributes are Top3 attributes. • X contains • TopN attributes in T1 (to ensure that the generalization is performed on important attributes), • all join attributes, • all attributes in T2 (to ensure X is global).

Distortion of (X,Y)-anonymity • Ki denotes the key in Ti. • XYD: our method with Y = K1. • KAD: k-anonymization on QID=att(T1).

Classification error of (X,Y)-anonymity • XYE: our method with Y = K1. • XYE(row): our method with Y={K1,K2}. • BLE: the unmodified data. • KAE: k-anonymization on QID=att(T1). • RJE: removing all join attributes from T1.

(X,Y)-Linkability • Y contains TopN attributes. • If not important, simply remove them. • X contains the rest of the attributes in T1 and T2. • Focus on classification error because no previous work studies distortion for (X,Y)-linkability.

Classification error of (X,Y)-linkability • XYE: our method with Y = TopN. • BLE: the unmodified data. • RJE: removing all join attributes from T1. • RSE: removing all attributes in Y from T1.

Scalability (X,Y)-anonymity (k=40) (X,Y)-linkability (k=90%)

Conclusion • Previous k-anonymization focused on a single release of data. • Studied the sequential anonymization problem when data are released sequentially and a global QID may span several releases. • Introduced lossy join to hide the join relationship and weaken the global QID. • Addressed challenges due to large size of lossy join. • Extendable to more than two releases T2,…,Tp.

References [BA05] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In IEEE ICDE, pages 217.228, 2005. [DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005. [FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, pages 205.216, April 2005. [KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.

References [LDR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efcient full-domain k-anonymity. In ACM SIGMOD, 2005. [MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006. [MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004. [SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.

References [WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466.473, November 2005. [WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's condence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006. [WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.

References [WLFW06] R. C. W. Wong, J. Li., A. W. C. Fu, and K. Wang. (,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006. [YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.

Anonymizing Sequential Releases