Algorithm-Safe Privacy-Preserving Data Publishing • Xin Jin (George Washington University) • Nan Zhang (George Washington University) • Gautam Das (University of Texas at Arlington)
Outline • Introduction • Algorithm-safe Data Publishing Model • Amendment Toolset: Look-ahead Partitioning and Stratified Pick-up • Experimental Results • Conclusion
Privacy-Preserving Data Publishing • Share individual records to enable analytical tasks (e.g., aggregate query answering, data mining) while protecting each individual's private information.
What is Algorithm-based Disclosure? • Algorithm-based disclosure in existing methods (e.g., [WFW+07] [LLV07] [MGK+07]). • An example using ℓ-diversity. • [Table: a 2-diversity table]
What If an Adversary Knows the Algorithm? • [Figure sequence over three slides: (1) published table, 1st conjectured original data, better output table; (2) published table, 2nd conjectured original data, better output table; (3) published table, 3rd conjectured original data]
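Algorithm-safe Data Publishing (ASP) • Q: How likely is it that Eve has HIV? • [Figure: a naïve user derives "my answer" from the published table and background knowledge; a smart user derives "my answer" from the published table, background knowledge, and the algorithm; the two answers are marked as equal]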
Algorithm-safe Data Publishing (ASP) • Problem Definition: For each tuple (i.e., row) $t_i = \langle q, s \rangle$ in the original data $T$, the following must hold for each $s'$ in the domain of SA: $\Pr\{t_i[SA] = s' \mid t_i[QI] = q, K\} = \Pr\{t_i[SA] = s' \mid t_i[QI] = q, K, A\}$, where $K$ is the background knowledge and $A$ is the data publishing algorithm.
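The two sides of the ASP definition can be compared by brute force on a toy instance. The sketch below is only an illustration under stated assumptions: the five-tuple dataset, the fixed QI groups, the uniform prior standing in for K, the use of distinct 2-diversity, and the toy algorithm `publish` are all hypothetical and are not taken from the slides. A naïve user keeps every dataset consistent with the published table; a smart user additionally discards datasets for which the algorithm would have produced a different table.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setup: tuples 0-1 share one QI region, tuples 2-4 another.
GROUP_A, GROUP_B = (0, 1), (2, 3, 4)
VALUES = ("HIV", "flu")
EVE = 2  # index of the tuple we ask about (assumption)

def bucket(world, members):
    """A published group: its members plus the sorted multiset of SA values."""
    return (members, tuple(sorted(world[i] for i in members)))

def is_diverse(world, members, ell=2):
    """Distinct l-diversity: at least `ell` different SA values in the group."""
    return len({world[i] for i in members}) >= ell

def publish(world):
    """Toy algorithm A: prefer the fine partition, else merge everything."""
    if is_diverse(world, GROUP_A) and is_diverse(world, GROUP_B):
        return (bucket(world, GROUP_A), bucket(world, GROUP_B))
    return (bucket(world, GROUP_A + GROUP_B),)

def consistent(world, table):
    """True if `world` matches the SA multiset of every published group."""
    return all(bucket(world, members) == (members, sa) for members, sa in table)

def posterior(worlds, target, value):
    """Pr{target's SA == value} under a uniform prior over `worlds`."""
    hits = sum(1 for w in worlds if w[target] == value)
    return Fraction(hits, len(worlds))

all_worlds = list(product(VALUES, repeat=5))
original = ("flu", "flu", "HIV", "HIV", "flu")
table = publish(original)

naive_worlds = [w for w in all_worlds if consistent(w, table)]
smart_worlds = [w for w in naive_worlds if publish(w) == table]

naive = posterior(naive_worlds, EVE, "HIV")   # left-hand side (K only)
smart = posterior(smart_worlds, EVE, "HIV")   # right-hand side (K and A)
print(naive, smart)  # 2/5 1/2
```

In this toy instance the two probabilities differ, so the toy algorithm is not algorithm-safe: knowing that the fine partition was rejected tells the smart user that both HIV values are more likely to sit in the larger QI region.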
Necessary Condition #1: QI*-Independence • [Figure: the data publisher sends "query SA by QI" queries to an oracle holding the original data and receives safe QI-SA correlations, from which the ASP published table is produced] • QI*-Independence: The generated QI* is conditionally independent of the original SA, given a combination of QI and the published SA*.
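One way to write this condition as a formula, assuming the standard definition of conditional independence and treating QI, SA, QI*, and SA* as random variables over the original and published tables (this notation is an assumption, not taken from the slides):

```latex
\Pr\{QI^* \mid QI,\ SA^*,\ SA\} \;=\; \Pr\{QI^* \mid QI,\ SA^*\}
```

Read this way, once QI and the published SA* are fixed, learning the original SA values does not change the distribution of the generalized QI* that the algorithm emits.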
Necessary Condition #2: SA*-Independence • [Figure: the data publisher sends "query SA by QI" queries to an oracle holding the original data; labels show the safe QI-SA correlation, the perturbed safe QI-SA correlation, the impossible QI-SA correlation, and the ASP published table] • SA*-Independence: The generated SA* is conditionally independent of the original SA, given a combination of QI, QI*, and the impossible QI-SA correlation.
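A possible formal reading, again assuming the standard definition of conditional independence; the symbol $I$ for the impossible QI-SA correlation is introduced here purely for illustration and does not appear in the slides:

```latex
\Pr\{SA^* \mid QI,\ QI^*,\ I,\ SA\} \;=\; \Pr\{SA^* \mid QI,\ QI^*,\ I\}
```

Under this reading, any perturbation applied to the sensitive attribute may depend on QI, QI*, and $I$, but not on the original SA values beyond what those already determine.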
How to Achieve the ASP Model? • Play the role of the oracle • Satisfy QI*-Independence • Never perturb SA • Worst-case eligibility test • Look-ahead partitioning
A Mondrian Method [LDR06] to Achieve ℓ-diversity (ℓ = 2) • [Figure sequence over four slides: tuples t1 to t8 with SA values S1 to S5, partitioned by the cuts x = 5 and y = 5]
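A Mondrian-style partitioning for ℓ-diversity can be sketched as a recursive median split. The code below is a simplified illustration, not the method of [LDR06] as published: it assumes two numeric QI dimensions, distinct ℓ-diversity, and a hypothetical eight-tuple dataset whose coordinates and SA assignments are invented for the example. Note that the eligibility check inspects the actual SA values on each side of the cut, which is the kind of data-dependent decision an algorithm-aware adversary can exploit.

```python
from statistics import median

def distinct_l_diverse(group, ell):
    """True if the group holds at least `ell` different SA values."""
    return len({sa for _, sa in group}) >= ell

def mondrian(group, ell, dims=2):
    """Recursively cut `group` (a list of (qi_tuple, sa)) at the median.

    Dimensions are tried from widest to narrowest value range; a cut is
    accepted only if both halves are non-empty and l-diverse.
    """
    def spread(d):
        vals = [qi[d] for qi, _ in group]
        return max(vals) - min(vals)

    for d in sorted(range(dims), key=spread, reverse=True):
        cut = median(qi[d] for qi, _ in group)
        left = [t for t in group if t[0][d] < cut]
        right = [t for t in group if t[0][d] >= cut]
        if (left and right
                and distinct_l_diverse(left, ell)      # looks at real SA values
                and distinct_l_diverse(right, ell)):
            return mondrian(left, ell, dims) + mondrian(right, ell, dims)
    return [group]  # no admissible cut: publish this group as is

# Hypothetical data: (x, y) are QI values, the string is the SA value.
records = [
    ((1, 7), "S1"), ((2, 3), "S2"), ((3, 8), "S3"), ((4, 2), "S4"),
    ((6, 6), "S1"), ((7, 1), "S2"), ((8, 9), "S5"), ((9, 4), "S3"),
]
groups = mondrian(records, ell=2)
for g in groups:
    print(sorted(g))
```

With this data the first accepted cut is at x = 5 and the next cuts are at y = 5, giving four groups of two tuples each.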
Look-Ahead Partitioning • [Figure sequence over three slides: tuples t1 to t8 with SA values S1 to S5 and the cut x = 5]
Amendment Toolset • Look-Ahead Partitioning: Execute the partitioning only if the worst (i.e., most skewed) scenario of QI-SA correlation is still eligible to achieve the given privacy guarantee (e.g., ℓ-diversity). • Can be extended to other algorithms such as Hilb [GKKM07], Incognito [LDR05], MASK [WFW+07], etc. • Limitation: May harm utility because it produces large groups. • Stratified Pick-up: Take the anonymous groups as input and attempt to further partition each of them iteratively, based solely on the distinctness of SA values.
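The worst-case eligibility test behind look-ahead partitioning might be sketched as follows. This is one interpretation under explicit assumptions: the guarantee is distinct ℓ-diversity, and "most skewed" is read as "the assignment of SA values to tuples that puts the fewest distinct values into one side of the cut". The function names are hypothetical. The point of the sketch is that the decision uses only the multiset of SA values in the group and the sizes of the two halves, never which tuple holds which value.

```python
from collections import Counter

def min_distinct(sa_counts, size):
    """Fewest distinct SA values that any `size` tuples of the group can hold.

    The most skewed choice fills the subset with the most frequent values
    first, so we take counts in descending order until `size` is covered.
    """
    taken, distinct = 0, 0
    for count in sorted(sa_counts.values(), reverse=True):
        if taken >= size:
            break
        taken += count
        distinct += 1
    return distinct

def look_ahead_eligible(sa_values, left_size, ell):
    """True if a cut leaving `left_size` tuples on one side is safe in the
    worst case, i.e. both halves are l-diverse under every SA assignment."""
    counts = Counter(sa_values)
    right_size = len(sa_values) - left_size
    return (left_size > 0 and right_size > 0
            and min_distinct(counts, left_size) >= ell
            and min_distinct(counts, right_size) >= ell)

# Hypothetical group: SA multiset {S1:2, S2:2, S3:2, S4:1, S5:1}.
sa = ["S1", "S2", "S3", "S4", "S1", "S2", "S5", "S3"]
print(look_ahead_eligible(sa, 4, ell=2))                          # True
print(look_ahead_eligible(sa, 2, ell=2))                          # False
print(look_ahead_eligible(["S1", "S1", "S2", "S2"], 2, ell=2))    # False
```

The last call shows the utility cost noted in the limitation: a 2/2 cut of {S1, S1, S2, S2} might well be 2-diverse for the actual data, but it is rejected because some assignment would put both S1 tuples on the same side.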
Stratified Pick-Up • [Figure: tuples t1 to t8 with SA values S1 to S5]
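A stratified pick-up step that splits a group using only the distinctness of SA values could look like the sketch below. It is an assumption-laden illustration, not the procedure from the talk: it uses distinct ℓ-diversity, groups tuples into strata by SA value, repeatedly picks one tuple from each of the ℓ largest strata, and attaches any leftovers to the last group formed. The tuple identifiers and SA assignments are hypothetical.

```python
from collections import defaultdict

def stratified_pick_up(group, ell):
    """Split `group` (a list of (tuple_id, sa)) into smaller groups that
    each contain at least `ell` distinct SA values, ignoring QI entirely."""
    strata = defaultdict(list)
    for tid, sa in group:
        strata[sa].append((tid, sa))

    result = []
    while True:
        # Non-empty strata, largest first, so frequent values are used up early.
        live = sorted((s for s in strata.values() if s), key=len, reverse=True)
        if len(live) < ell:
            break
        result.append([s.pop() for s in live[:ell]])

    if not result:
        return [group]  # fewer than `ell` distinct values: nothing to split

    # Leftovers cannot form a group on their own; adding them to an existing
    # group never reduces its number of distinct SA values.
    leftovers = [t for s in strata.values() for t in s]
    result[-1].extend(leftovers)
    return result

# Hypothetical anonymous group of eight tuples.
group = [
    ("t1", "S1"), ("t2", "S2"), ("t3", "S3"), ("t4", "S4"),
    ("t5", "S1"), ("t6", "S2"), ("t7", "S5"), ("t8", "S3"),
]
small_groups = stratified_pick_up(group, ell=2)
for g in small_groups:
    print(g)
```

Because the split never consults QI values or the order in which tuples were cut, it adds no dependence on the QI-SA correlation; the trade-off is that tuples in one small group may be far apart in the QI space.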
Experiment Setup • Adult dataset (http://archive.ics.uci.edu/ml/): 45,222 tuples; SA: Education • Census dataset (http://ipums.org): 300K tuples; SA: Occupation
Conclusion • We show that algorithm-based disclosure is far more significant than previously studied. • We rigorously define the Algorithm-Safe data Publishing (ASP) model. • We propose a screening tool for algorithm-based disclosure based on two necessary conditions. • We explore amendments to problematic methods (those "diagnosed" with algorithm-based disclosure).
References • [WFW+07] Wong, R. C., Fu, A. W., Wang, K., and Pei, J. Minimality Attack in Privacy-Preserving Data Publishing. • [LLV07] Li, N., Li, T., and Venkatasubramanian, S. t-Closeness: Privacy Beyond k-anonymity and ℓ-diversity. • [MGK+07] Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. ℓ-diversity: Privacy Beyond k-anonymity. • [ZJB07] Zhang, L., Jajodia, S., and Brodsky, A. Information Disclosure under Realistic Assumptions: Privacy versus Optimality. • [GKKM07] Ghinita, G., Karras, P., Kalnis, P., and Mamoulis, N. Fast Data Anonymization with Low Information Loss. • [LDR06] LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. Mondrian Multidimensional k-anonymity. • [LDR05] LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. Incognito: Efficient Full-domain k-anonymity. • [XT06] Xiao, X. and Tao, Y. Anatomy: Simple and Effective Privacy Preservation.