The 11th International Conference on Extending Database Technology (EDBT 2008)
Anonymity for Continuous Data Publishing
http://www.ciise.concordia.ca/~fung
Benjamin C. M. Fung, Concordia University, Montreal, QC, Canada
Ke Wang, Simon Fraser University, Burnaby, BC, Canada
Jian Pei, Simon Fraser University, Burnaby, BC, Canada
Ada Wai-Chee Fu, The Chinese University of Hong Kong
Privacy-Preserving Data Publishing: k-anonymity [SS98] (hospital example)
Privacy Requirement • k-anonymity [SS98] • Every QID group contains at least k records. • Confidence bounding [WFY05, WFY07] • Bound the confidence of inferring a sensitive value from a QID group within h%. • l-diversity [MGKV06] • Every QID group contains l well-represented distinct sensitive values.
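As a minimal illustration of the k-anonymity requirement above, the following sketch groups records by their QID values and checks the group sizes (the helper name and record layout are illustrative, not from the talk):

```python
from collections import Counter

def is_k_anonymous(records, qid_attrs, k):
    """Check that every QID group contains at least k records.

    `records` is a list of dicts; `qid_attrs` names the
    quasi-identifier attributes (illustrative interface).
    """
    groups = Counter(tuple(r[a] for a in qid_attrs) for r in records)
    return all(count >= k for count in groups.values())

# Tiny hospital-style table: {France, Lawyer} has 2 records,
# {UK, Lawyer} has only 1, so 2-anonymity fails.
table = [
    {"Country": "France", "Job": "Lawyer", "Disease": "Flu"},
    {"Country": "France", "Job": "Lawyer", "Disease": "HIV"},
    {"Country": "UK", "Job": "Lawyer", "Disease": "Flu"},
]
print(is_k_anonymous(table, ["Country", "Job"], 2))  # False
```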
Continuous Data Publishing Model • At time T1, • Collect a set of raw data records D1. • Publish a k-anonymous version of D1, denoted release R1. • At time T2, • Collect a new set of raw data records D2. • Want to publish all data collected so far. • Publish a k-anonymous version of D1 ∪ D2, denoted release R2.
Continuous Data Publishing Model (figure: R1 is generalized from D1; R2 is generalized from D1 ∪ D2)
Correspondence Attacks • An attacker could “crack” the k-anonymity by comparing R1 and R2. • Background knowledge: • QID of a target victim (e.g., Alice is born in France and is a lawyer.) • Timestamp of a target victim. • Correspondence knowledge: • Every record in R1 has a corresponding record in R2. • Every record timestamped T2 has a record in R2, but not in R1.
Our Contributions • What exactly are the records that can be excluded (cracked) based on R1 and R2? • Systematically characterize the set of records cracked by correspondence attacks. • Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records. • Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality. • Extend the proposed approach to deal with more than two releases and other privacy notions.
Problem Statements • Detection problem: • Determine the number of cracked records in the worst case by applying the correspondence knowledge to the k-anonymized R1 and R2. • Anonymization problem: • Given R1, D1, and D2, we want to generalize D1 ∪ D2 into a release R2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible w.r.t. a specified information metric.
Forward-Attack (F-Attack) Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R1. a1, a2, a3 cannot all originate from [France, Lawyer]. Otherwise, R2 would have at least three [France, Professional, Flu].
F-Attack (figure): the correspondence groups CG(qid1, qid2) = {(g1, g2), (g1', g2')} pair the groups g1, g1' under qid1 in R1 with the groups g2, g2' under qid2 in R2.
F-Attack Crack size of g1 wrt P: c = |g1| - min(|g1|, |g2|) = 3 - min(3, 2) = 1. Crack size of g1' wrt P: c = |g1'| - min(|g1'|, |g2'|) = 2 - min(2, 3) = 0. F(P, qid1, qid2) = sum of c over all pairs in CG(qid1, qid2).
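The F-attack crack-size computation above can be sketched directly from the group sizes; the function name and the (|g1|, |g2|) pair representation are my own:

```python
def f_crack_size(corr_groups):
    """F-attack crack size for a target P: sum of
    |g1| - min(|g1|, |g2|) over all pairs (g1, g2)
    in CG(qid1, qid2), given as (|g1|, |g2|) size pairs."""
    return sum(g1 - min(g1, g2) for g1, g2 in corr_groups)

# Slide example: CG = {(g1, g2), (g1', g2')} with sizes (3, 2) and (2, 3).
print(f_crack_size([(3, 2), (2, 3)]))  # 1 + 0 = 1
```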
Definition: F-Anonymity • F(qid1, qid2) denotes the maximum F(P, qid1, qid2) over all targets P that match (qid1, qid2). • F(qid1) denotes the maximum F(qid1, qid2) over all qid2 in R2. • The F-anonymity of (R1, R2), denoted FA(R1, R2), is the minimum of (|qid1| - F(qid1)) over all qid1 in R1.
Cross-Attack (C-Attack) Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R2. At least one of b4,b5,b6 must have timestamp T2. Otherwise, R1 would have at least three records [Europe, Lawyer, Diabetes]
C-Attack Crack size of g2 wrt P: c = |g2| - min(|g1|, |g2|) = 2 - min(3, 2) = 0. Crack size of g2' wrt P: c = |g2'| - min(|g1'|, |g2'|) = 3 - min(2, 3) = 1. C(P, qid1, qid2) = sum of c over all pairs in CG(qid1, qid2).
Definition: C-Anonymity • C(qid1, qid2) denotes the maximum C(P, qid1, qid2) over all targets P that match (qid1, qid2). • C(qid2) denotes the maximum C(qid1, qid2) over all qid1 in R1. • The C-anonymity of (R1, R2), denoted CA(R1, R2), is the minimum of (|qid2| - C(qid2)) over all qid2 in R2.
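The C-attack crack size is symmetric to the F-attack version, swapping the roles of g1 and g2; a sketch under the same assumed (|g1|, |g2|) size-pair representation:

```python
def c_crack_size(corr_groups):
    """C-attack crack size for a target P: sum of
    |g2| - min(|g1|, |g2|) over all pairs (g1, g2)
    in CG(qid1, qid2), given as (|g1|, |g2|) size pairs."""
    return sum(g2 - min(g1, g2) for g1, g2 in corr_groups)

# Slide example: sizes (3, 2) contribute 0 and (2, 3) contribute 1.
print(c_crack_size([(3, 2), (2, 3)]))  # 0 + 1 = 1
```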
Backward-Attack (B-Attack) Alice: {UK, Lawyer} with timestamp T2. Attempt to identify her record in R2. At least one of b1,b2,b3 must have timestamp T1. Otherwise, one of a1,a2,a3 would have no corresponding record in R2.
B-Attack Target person P {UK, Lawyer} with timestamp T2. Crack size of g2 wrt P: c = max(0, |G1| - (|G2| - |g2|)) g2 = {b1, b2, b3} G1 = {a1, a2, a3} G2 = {b1, b2, b3, b7, b8} c = max(0, 3 - (5 - 3)) = 1
B-Attack Crack size of g2' wrt P: c = max(0, |G1'| - (|G2'| - |g2'|)) g2' = {b9, b10} G1' = {a4, a5} G2' = {b4, b5, b6, b9, b10} c = max(0, 2 - (5 - 2)) = 0 B(P, qid2) = sum of c over all g2 in qid2.
Definition: B-Anonymity • B(qid2) denotes the maximum B(P, qid2) over all targets P that match qid2. • The B-anonymity of (R1, R2), denoted BA(R1, R2), is the minimum of (|qid2| - B(qid2)) over all qid2 in R2.
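The B-attack crack-size formula above can be sketched from the group sizes as well; the (|G1|, |G2|, |g2|) triple representation is an assumption for illustration:

```python
def b_crack_size(groups):
    """B-attack crack size B(P, qid2): sum over g2 in qid2 of
    max(0, |G1| - (|G2| - |g2|)), where each entry of `groups`
    is a (|G1|, |G2|, |g2|) size triple."""
    return sum(max(0, G1 - (G2 - g2)) for G1, G2, g2 in groups)

# Slide examples: (|G1|, |G2|, |g2|) = (3, 5, 3) cracks 1 record,
# while (2, 5, 2) cracks none.
print(b_crack_size([(3, 5, 3)]))  # max(0, 3 - (5 - 3)) = 1
print(b_crack_size([(2, 5, 2)]))  # max(0, 2 - (5 - 2)) = 0
```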
In brief… • Cracked records either do not originate from Alice's QID or do not have Alice's timestamp. • Such cracked records are not related to Alice; excluding them lets the attacker focus on a smaller set of candidate records.
Definition: BCF-Anonymity • A BCF-anonymity requirement states that BA(R1,R2) ≥ k, CA(R1,R2) ≥ k, and FA(R1,R2) ≥ k, where k is a user-specified threshold. • We now present an algorithm for anonymizing R2 = D1 ∪ D2.
BCF-Anonymizer
generalize every value of each attribute Aj ∈ QID in R2 to ANYj;
let the candidate list contain all ANYj;
sort the candidate list by Score in descending order;
while the candidate list is not empty do
  if the first candidate w in the candidate list is valid then
    specialize w into {w1, …, wz} in R2;
    compute Score for each wi and add them to the candidate list;
    sort the candidate list by Score in descending order;
  else
    remove w from the candidate list;
  end if
end while
output R2
(taxonomy figure: ANY → Europe, America, …; Europe → France, UK, …)
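The top-down specialization loop above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `children`, `score`, and `valid` are assumed caller-supplied interfaces, where `valid(w)` would check that specializing w preserves the BCF-anonymity requirement (omitted here):

```python
def bcf_anonymize(roots, children, score, valid):
    """Top-down specialization sketch of the BCF-Anonymizer.

    Starts from the fully generalized release (one ANYj per QID
    attribute, given in `roots`) and repeatedly performs the
    highest-scoring valid specialization. `children[w]` lists the
    taxonomy children of w; `score(w)` is the information metric;
    `valid(w)` checks the BCF-anonymity requirement.
    Returns the set of released (most specialized) values.
    """
    candidates = list(roots)
    released = set(roots)
    while candidates:
        candidates.sort(key=score, reverse=True)  # best candidate first
        w = candidates[0]
        if valid(w):
            kids = children.get(w, [])  # specialize w into {w1, ..., wz}
            released.discard(w)
            released.update(kids)
            candidates.remove(w)
            candidates.extend(kids)     # new candidates get scored next pass
        else:
            candidates.remove(w)        # w would violate the requirement
    return released

# Toy taxonomy: only specializing ANY is valid, so the output stops
# at the continent level.
children = {"ANY": ["Europe", "America"], "Europe": ["France", "UK"]}
scores = {"ANY": 3, "Europe": 2, "America": 1}
out = bcf_anonymize(["ANY"], children,
                    lambda w: scores.get(w, 0),
                    lambda w: w == "ANY")
print(out)  # {'Europe', 'America'}
```

Sorting the whole candidate list each pass mirrors the slide's pseudocode; a priority queue would be the more efficient choice in practice.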
Anti-Monotonicity of BCF-Anonymity • Theorem: Each of FA, CA, and BA is non-increasing with respect to a specialization on R2. • This guarantees that the produced BCF-anonymized R2 is maximally specialized (though possibly suboptimal): any further specialization leads to a violation.
Empirical Study • Study the threat of correspondence attacks. • Evaluate the information usefulness of a BCF-anonymized R2. • Adult dataset (US Census data) • 8 categorical attributes • 30,162 records in training set • 15,060 records in testing set
Experiment Settings • D1 contains all records in the testing set. • Three cases of D2 at timestamp T2: • 200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2. • 2000D2: D2 contains the first 2,000 records in the training set, modelling a medium-sized set of new records at T2. • allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.
Anonymization • BCF-Anonymized R2: Our method. • k-Anonymized R2: Not safe from correspondence attacks. • k-Anonymized D2: Anonymize D2 separately from D1.
Related Work • Byun et al. (VLDB-SDM06) is an early study of the continuous data publishing scenario. • Their anonymization relies on delaying the release of records, and the delay can be unbounded. • In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay. • Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication. • Their anonymization relies on generalization and on adding counterfeit records.
Related Work • Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases, where each subsequent release publishes a different subset of attributes for the same set of records. (figure: R1 and R2 each publish a different subset of attributes A, B, C, D)
Conclusion & Contributions • Systematically characterize different types of correspondence attacks and concisely compute their crack sizes. • Define the BCF-anonymity requirement. • Present an anonymization algorithm that achieves BCF-anonymity while preserving information usefulness. • Extendable to multiple releases.
For more information: http://www.ciise.concordia.ca/~fung Acknowledgement: • Reviewers of EDBT • Concordia University • Faculty Start-up Grants • Natural Sciences and Engineering Research Council of Canada (NSERC) • Discovery Grants • PGS Doctoral Award
References [BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006. [MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006. [PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.
References [SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998. [WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006, pp. 414-423. [WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466-473, November 2005.
References [WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems: An International Journal (KAIS), 11(3):345-368, April 2007. [XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.