The 11th International Conference on Extending Database Technology (EDBT 2008)
Anonymity for Continuous Data Publishing
http://www.ciise.concordia.ca/~fung
Benjamin C. M. Fung, Concordia University, Montreal, QC, Canada
Ke Wang, Simon Fraser University, Burnaby, BC, Canada
Jian Pei, Simon Fraser University, Burnaby, BC, Canada
Ada Wai-Chee Fu, The Chinese University of Hong Kong
Privacy-Preserving Data Publishing: k-anonymity [SS98] (hospital example)
Privacy Requirement • k-anonymity [SS98] • Every QID group contains at least k records. • Confidence bounding [WFY05, WFY07] • Bound the confidence of inferring a sensitive value from a QID group within h%. • l-diversity [MGKV06] • Every QID group contains l well-represented distinct sensitive values.
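As a minimal illustration of the k-anonymity requirement above, the following sketch groups records by their QID values and checks the group sizes (the helper name and record layout are illustrative, not from the talk):

```python
from collections import Counter

def is_k_anonymous(records, qid_attrs, k):
    """Check that every QID group contains at least k records.

    `records` is a list of dicts; `qid_attrs` names the
    quasi-identifier attributes (illustrative interface).
    """
    groups = Counter(tuple(r[a] for a in qid_attrs) for r in records)
    return all(count >= k for count in groups.values())

# Tiny hospital-style table: {France, Lawyer} has 2 records,
# {UK, Lawyer} has only 1, so 2-anonymity fails.
table = [
    {"Country": "France", "Job": "Lawyer", "Disease": "Flu"},
    {"Country": "France", "Job": "Lawyer", "Disease": "HIV"},
    {"Country": "UK", "Job": "Lawyer", "Disease": "Flu"},
]
print(is_k_anonymous(table, ["Country", "Job"], 2))  # False
```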
Continuous Data Publishing Model • At time T1, • Collect a set of raw data records D1. • Publish a k-anonymous version of D1, denoted release R1. • At time T2, • Collect a new set of raw data records D2. • Want to publish all data collected so far. • Publish a k-anonymous version of D1 ∪ D2, denoted release R2.
Continuous Data Publishing Model (figure: R1 is generalized from D1; R2 is generalized from D1 ∪ D2)
Correspondence Attacks • An attacker could “crack” the k-anonymity by comparing R1 and R2. • Background knowledge: • QID of a target victim (e.g., Alice is born in France and is a lawyer.) • Timestamp of a target victim. • Correspondence knowledge: • Every record in R1 has a corresponding record in R2. • Every record timestamped T2 has a record in R2, but not in R1.
Our Contributions • What exactly are the records that can be excluded (cracked) based on R1 and R2? • Systematically characterize the set of records cracked by correspondence attacks. • Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records. • Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality. • Extend the proposed approach to deal with more than two releases and other privacy notions.
Problem Statements • Detection problem: • Determine the number of cracked records in the worst case by applying the correspondence knowledge to the k-anonymized R1 and R2. • Anonymization problem: • Given R1, D1, and D2, we want to generalize D1 ∪ D2 into a release R2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible w.r.t. a specified information metric.
Forward-Attack (F-Attack) Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R1. a1, a2, a3 cannot all originate from [France, Lawyer]. Otherwise, R2 would have at least three [France, Professional, Flu].
F-Attack (figure): the correspondence groups CG(qid1, qid2) = {(g1, g2), (g1', g2')} pair the groups g1, g1' under qid1 in R1 with the groups g2, g2' under qid2 in R2.
F-Attack Crack size of g1 wrt P: c = |g1| - min(|g1|, |g2|) = 3 - min(3, 2) = 1. Crack size of g1' wrt P: c = |g1'| - min(|g1'|, |g2'|) = 2 - min(2, 3) = 0. F(P, qid1, qid2) = sum of c over all pairs in CG(qid1, qid2).
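The F-attack crack-size computation above can be sketched directly from the group sizes; the function name and the (|g1|, |g2|) pair representation are my own:

```python
def f_crack_size(corr_groups):
    """F-attack crack size for a target P: sum of
    |g1| - min(|g1|, |g2|) over all pairs (g1, g2)
    in CG(qid1, qid2), given as (|g1|, |g2|) size pairs."""
    return sum(g1 - min(g1, g2) for g1, g2 in corr_groups)

# Slide example: CG = {(g1, g2), (g1', g2')} with sizes (3, 2) and (2, 3).
print(f_crack_size([(3, 2), (2, 3)]))  # 1 + 0 = 1
```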
Definition: F-Anonymity • F(qid1, qid2) denotes the maximum F(P, qid1, qid2) over all targets P that match (qid1, qid2). • F(qid1) denotes the maximum F(qid1, qid2) over all qid2 in R2. • The F-anonymity of (R1, R2), denoted FA(R1, R2), is the minimum of (|qid1| - F(qid1)) over all qid1 in R1.
Cross-Attack (C-Attack) Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R2. At least one of b4,b5,b6 must have timestamp T2. Otherwise, R1 would have at least three records [Europe, Lawyer, Diabetes]
C-Attack Crack size of g2 wrt P: c = |g2| - min(|g1|, |g2|) = 2 - min(3, 2) = 0. Crack size of g2' wrt P: c = |g2'| - min(|g1'|, |g2'|) = 3 - min(2, 3) = 1. C(P, qid1, qid2) = sum of c over all pairs in CG(qid1, qid2).
Definition: C-Anonymity • C(qid1, qid2) denotes the maximum C(P, qid1, qid2) over all targets P that match (qid1, qid2). • C(qid2) denotes the maximum C(qid1, qid2) over all qid1 in R1. • The C-anonymity of (R1, R2), denoted CA(R1, R2), is the minimum of (|qid2| - C(qid2)) over all qid2 in R2.
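The C-attack crack size is symmetric to the F-attack version, swapping the roles of g1 and g2; a sketch under the same assumed (|g1|, |g2|) size-pair representation:

```python
def c_crack_size(corr_groups):
    """C-attack crack size for a target P: sum of
    |g2| - min(|g1|, |g2|) over all pairs (g1, g2)
    in CG(qid1, qid2), given as (|g1|, |g2|) size pairs."""
    return sum(g2 - min(g1, g2) for g1, g2 in corr_groups)

# Slide example: sizes (3, 2) contribute 0 and (2, 3) contribute 1.
print(c_crack_size([(3, 2), (2, 3)]))  # 0 + 1 = 1
```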
Backward-Attack (B-Attack) Alice: {UK, Lawyer} with timestamp T2. Attempt to identify her record in R2. At least one of b1,b2,b3 must have timestamp T1. Otherwise, one of a1,a2,a3 would have no corresponding record in R2.
B-Attack Target person P {UK, Lawyer} with timestamp T2. Crack size of g2 wrt P: c = max(0, |G1| - (|G2| - |g2|)) g2 = {b1, b2, b3} G1 = {a1, a2, a3} G2 = {b1, b2, b3, b7, b8} c = max(0, 3 - (5 - 3)) = 1
B-Attack Crack size of g2' wrt P: c = max(0, |G1'| - (|G2'| - |g2'|)) g2' = {b9, b10} G1' = {a4, a5} G2' = {b4, b5, b6, b9, b10} c = max(0, 2 - (5 - 2)) = 0 B(P, qid2) = sum of c over all g2 in qid2.
Definition: B-Anonymity • B(qid2) denotes the maximum B(P, qid2) over all targets P that match qid2. • The B-anonymity of (R1, R2), denoted BA(R1, R2), is the minimum of (|qid2| - B(qid2)) over all qid2 in R2.
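The B-attack crack-size formula above can be sketched from the group sizes as well; the (|G1|, |G2|, |g2|) triple representation is an assumption for illustration:

```python
def b_crack_size(groups):
    """B-attack crack size B(P, qid2): sum over g2 in qid2 of
    max(0, |G1| - (|G2| - |g2|)), where each entry of `groups`
    is a (|G1|, |G2|, |g2|) size triple."""
    return sum(max(0, G1 - (G2 - g2)) for G1, G2, g2 in groups)

# Slide examples: (|G1|, |G2|, |g2|) = (3, 5, 3) cracks 1 record,
# while (2, 5, 2) cracks none.
print(b_crack_size([(3, 5, 3)]))  # max(0, 3 - (5 - 3)) = 1
print(b_crack_size([(2, 5, 2)]))  # max(0, 2 - (5 - 2)) = 0
```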
In brief… • Cracked records either do not originate from Alice's QID or do not have Alice's timestamp. • Such cracked records are not related to Alice; excluding them lets the attacker focus on a smaller set of candidate records.
Definition: BCF-Anonymity • A BCF-anonymity requirement states that BA(R1,R2) ≥ k, CA(R1,R2) ≥ k, and FA(R1,R2) ≥ k, where k is a user-specified threshold. • We now present an algorithm for anonymizing R2 = D1 ∪ D2.
BCF-Anonymizer
generalize every value of each attribute Aj ∈ QID in R2 to ANYj;
let the candidate list contain all ANYj;
sort the candidate list by Score in descending order;
while the candidate list is not empty do
  if the first candidate w in the candidate list is valid then
    specialize w into {w1, …, wz} in R2;
    compute Score for each wi and add them to the candidate list;
    sort the candidate list by Score in descending order;
  else
    remove w from the candidate list;
  end if
end while
output R2
(taxonomy figure: ANY → Europe, America, …; Europe → France, UK, …)
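The top-down specialization loop above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `children`, `score`, and `valid` are assumed caller-supplied interfaces, where `valid(w)` would check that specializing w preserves the BCF-anonymity requirement (omitted here):

```python
def bcf_anonymize(roots, children, score, valid):
    """Top-down specialization sketch of the BCF-Anonymizer.

    Starts from the fully generalized release (one ANYj per QID
    attribute, given in `roots`) and repeatedly performs the
    highest-scoring valid specialization. `children[w]` lists the
    taxonomy children of w; `score(w)` is the information metric;
    `valid(w)` checks the BCF-anonymity requirement.
    Returns the set of released (most specialized) values.
    """
    candidates = list(roots)
    released = set(roots)
    while candidates:
        candidates.sort(key=score, reverse=True)  # best candidate first
        w = candidates[0]
        if valid(w):
            kids = children.get(w, [])  # specialize w into {w1, ..., wz}
            released.discard(w)
            released.update(kids)
            candidates.remove(w)
            candidates.extend(kids)     # new candidates get scored next pass
        else:
            candidates.remove(w)        # w would violate the requirement
    return released

# Toy taxonomy: only specializing ANY is valid, so the output stops
# at the continent level.
children = {"ANY": ["Europe", "America"], "Europe": ["France", "UK"]}
scores = {"ANY": 3, "Europe": 2, "America": 1}
out = bcf_anonymize(["ANY"], children,
                    lambda w: scores.get(w, 0),
                    lambda w: w == "ANY")
print(out)  # {'Europe', 'America'}
```

Sorting the whole candidate list each pass mirrors the slide's pseudocode; a priority queue would be the more efficient choice in practice.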
Anti-Monotonicity of BCF-Anonymity • Theorem: Each of FA, CA, and BA is non-increasing with respect to a specialization on R2. • This guarantees that the produced BCF-anonymized R2 is maximally specialized (though possibly suboptimal): any further specialization leads to a violation.
Empirical Study • Study the threat of correspondence attacks. • Evaluate the information usefulness of a BCF-anonymized R2. • Adult dataset (US Census data) • 8 categorical attributes • 30,162 records in training set • 15,060 records in testing set
Experiment Settings • D1 contains all records in the testing set. • Three cases of D2 at timestamp T2: • 200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2. • 2000D2: D2 contains the first 2,000 records in the training set, modelling a medium-sized set of new records at T2. • allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.
Anonymization • BCF-Anonymized R2: Our method. • k-Anonymized R2: Not safe from correspondence attacks. • k-Anonymized D2: Anonymize D2 separately from D1.
Related Work • Byun et al. (VLDB-SDM06) is an early study of the continuous data publishing scenario. • Their anonymization relies on delaying the release of records, and the delay can be unbounded. • In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay. • Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication. • Their anonymization relies on generalization and on adding counterfeit records.
Related Work • Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases, where each subsequent release publishes a different subset of attributes for the same set of records. (figure: R1 and R2 each publish a different subset of attributes A, B, C, D)
Conclusion & Contributions • Systematically characterize different types of correspondence attacks and concisely compute their crack sizes. • Define the BCF-anonymity requirement. • Present an anonymization algorithm that achieves BCF-anonymity while preserving information usefulness. • Extendable to multiple releases.
For more information: http://www.ciise.concordia.ca/~fung Acknowledgement: • Reviewers of EDBT • Concordia University • Faculty Start-up Grants • Natural Sciences and Engineering Research Council of Canada (NSERC) • Discovery Grants • PGS Doctoral Award
References [BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006. [MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006. [PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.
References [SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998. [WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006, pp. 414-423. [WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466-473, November 2005.
References [WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems: An International Journal (KAIS), 11(3):345-368, April 2007. [XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.