On the Anonymization of Sparse High-Dimensional Data 1 National University of Singapore {ghinitag,kalnis}@comp.nus.edu.sg 2 Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk
Publishing Transaction Data • Publishing transaction data • Retail chain-owned shopping cart data • Infer consumer spending patterns • Correlations among purchased items • e.g., 90% of cereal buyers also buy milk • What about privacy?
Privacy Threat • [Table: transactions split into Quasi-identifying Items and Sensitive Items]
Privacy Paradigm • ℓ-diversity • prevent association between quasi-identifier and sensitive attributes • Create groups of transactions • frequency of a sensitive attribute (SA) value in a group < 1/p • Objective • Enforce privacy • Preserve correlations among items • Challenge: high data dimensionality
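The group condition above can be checked mechanically. The following is a minimal sketch, assuming each transaction is a collection of item labels and that a sensitive item counts once per transaction; the function name `group_satisfies_privacy` and the `strict` flag are illustrative, not taken from the slides. The default follows the strict bound "< 1/p" as written on the slide.

```python
from collections import Counter
from fractions import Fraction


def group_satisfies_privacy(group, sensitive_items, p, strict=True):
    """Check that every sensitive item is rare enough within one group.

    group: list of transactions, each an iterable of item labels (assumed layout).
    sensitive_items: set of labels treated as sensitive.
    p: privacy degree; the frequency bound is 1/p.
    strict: True uses "< 1/p" as on the slide, False uses "<= 1/p".
    """
    if not group:
        return False
    counts = Counter()
    for transaction in group:
        # Count each sensitive item at most once per transaction.
        counts.update(set(transaction) & sensitive_items)
    limit = Fraction(1, p)
    for occurrences in counts.values():
        frequency = Fraction(occurrences, len(group))
        if frequency > limit or (strict and frequency == limit):
            return False
    return True
```

Exact fractions are used instead of floats so that the boundary case (frequency exactly 1/p) is decided reliably.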
Data Re-organization • Band Matrix Organization • PRESERVES CORRELATIONS!
Published Data • Summary of Sensitive Items
Contributions • Novel data representation • Preserves correlation among items • Efficient heuristic for group formation • Running time linear in the data size • Supports multiple sensitive items
State-of-the-art: Mondrian [FWR06] • Generalization-based • data-space partitioning • similar to k-d-trees • split recursively as long as the privacy condition still holds • constrained global recoding • [Figure: Mondrian partitioning for k = 2; axes: Age, Weight] • GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS [FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006
State-of-the-art: Anatomy [XT06] • Permutation-based method • discloses exact QID values • [Figure: “Anatomized” table; |G|! permutations] • RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS [XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006
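The permutation-based idea (exact QID values are released, while sensitive values are only summarized per group) can be illustrated with a small sketch. The table layout, the function name `anatomize`, and the representation of groups as lists of row indices are assumptions made for illustration; how the groups are chosen is left to the caller.

```python
from collections import Counter


def anatomize(transactions, groups, sensitive_items):
    """Split grouped transactions into a QID table and a sensitive summary.

    transactions: list of iterables of item labels (assumed layout).
    groups: list of lists of row indices into `transactions`.
    Returns (qid_table, sensitive_table):
      qid_table rows are (group_id, sorted quasi-identifying items),
      sensitive_table rows are (group_id, sensitive item, count).
    """
    qid_table = []
    sensitive_table = []
    for group_id, members in enumerate(groups):
        counts = Counter()
        for row in members:
            items = set(transactions[row])
            # Exact quasi-identifying items are disclosed per transaction.
            qid_table.append((group_id, sorted(items - sensitive_items)))
            # Sensitive items are only accumulated at group level.
            counts.update(items & sensitive_items)
        for item, count in sorted(counts.items()):
            sensitive_table.append((group_id, item, count))
    return qid_table, sensitive_table
```

Within a group, the published data no longer says which transaction carried which sensitive item, which is where the |G|! possible assignments come from.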
Band Matrix Representation • Bandwidth = U + L + 1 • Minimizing bandwidth is NP-hard
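To make the bandwidth formula concrete, here is a small sketch that measures it for a 0/1 matrix given as a list of rows. It assumes U and L denote the upper and lower bandwidths, i.e. the largest distance of a non-zero entry above and below the main diagonal; the function name `bandwidth` is illustrative.

```python
def bandwidth(matrix):
    """Return (U, L, U + L + 1) for a matrix given as a list of rows.

    U is the largest j - i over non-zero entries above the diagonal,
    L is the largest i - j over non-zero entries below it
    (assumed reading of the slide's U and L).
    """
    upper = 0
    lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)
                lower = max(lower, i - j)
    return upper, lower, upper + lower + 1
```

For example, a tridiagonal matrix has U = L = 1 and therefore bandwidth 3. This only measures the bandwidth of a given ordering; finding the row and column permutation that minimizes it is the NP-hard part.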
Reverse Cuthill-McKee (RCM) • Heuristic bandwidth minimization • Solves the corresponding graph labeling problem • Permutes rows and columns • Complexity O(N · D · log D) • N = matrix rows (# transactions) • D = maximum degree of any vertex
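The graph-labeling view can be sketched in plain Python. This is a simplified illustration, not the implementation used in the talk: it assumes two transactions are neighbours when they share at least one item, starts each breadth-first traversal from a minimum-degree vertex (production implementations usually pick a pseudo-peripheral vertex), and only reorders rows. The names `transaction_graph` and `reverse_cuthill_mckee` are illustrative. SciPy users can call `scipy.sparse.csgraph.reverse_cuthill_mckee` on a sparse matrix instead.

```python
from collections import deque


def transaction_graph(transactions):
    """Build adjacency sets: rows are neighbours if they share an item (assumption)."""
    rows_by_item = {}
    for row, transaction in enumerate(transactions):
        for item in set(transaction):
            rows_by_item.setdefault(item, []).append(row)
    adjacency = {row: set() for row in range(len(transactions))}
    for rows in rows_by_item.values():
        for a in rows:
            adjacency[a].update(b for b in rows if b != a)
    return adjacency


def reverse_cuthill_mckee(adjacency):
    """Return a vertex order: BFS by increasing degree, then reversed."""
    def by_degree(vertex):
        return (len(adjacency[vertex]), vertex)

    visited = set()
    order = []
    # Handle every connected component, lowest-degree start vertex first.
    for start in sorted(adjacency, key=by_degree):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            order.append(vertex)
            # Enqueue unvisited neighbours in order of increasing degree.
            for neighbour in sorted(adjacency[vertex] - visited, key=by_degree):
                visited.add(neighbour)
                queue.append(neighbour)
    order.reverse()
    return order
```

Sorting each vertex's neighbours by degree is the step that accounts for the log D factor in the stated complexity.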
Group Formation • Correlation-aware Anonymization of High-Dimensional Data (CAHD) • Use the order given by RCM • Consecutive transactions highly correlated • O(pN) complexity
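The idea of exploiting the RCM order can be illustrated with a greatly simplified nearest-neighbour grouping. This is not the CAHD heuristic itself, only a sketch of the principle under stated assumptions: each group is seeded by a transaction containing a sensitive item, it is filled with the p - 1 closest unassigned rows that do not repeat any sensitive item already in the group, and the frequency bound is read as "at most 1/p". Leftover rows are returned unchecked; a full method would also have to verify that they can still be anonymized. The name `greedy_groups` is hypothetical.

```python
def greedy_groups(ordered, sensitive_items, p):
    """Group rows of `ordered` (transactions in band-matrix row order).

    Returns (groups, leftovers): groups are sorted lists of p row indices in
    which each sensitive item occurs at most once; leftovers are the indices
    of rows that were not placed in any group.
    """
    n = len(ordered)
    sensitive = [set(t) & sensitive_items for t in ordered]
    assigned = [False] * n
    groups = []
    for i in range(n):
        if assigned[i] or not sensitive[i]:
            continue
        members = [i]
        used = set(sensitive[i])
        distance = 1
        # Walk outwards from row i, preferring the closest rows.
        while len(members) < p and (i - distance >= 0 or i + distance < n):
            for j in (i - distance, i + distance):
                if (0 <= j < n and not assigned[j] and len(members) < p
                        and not (sensitive[j] & used)):
                    members.append(j)
                    used |= sensitive[j]
            distance += 1
        if len(members) == p:
            for j in members:
                assigned[j] = True
            groups.append(sorted(members))
    leftovers = [j for j in range(n) if not assigned[j]]
    return groups, leftovers
```

Because neighbouring rows in the band order tend to share items, groups built from nearby rows keep more of the item correlations than randomly formed groups would.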
Experimental Setting • BMS dataset • Compare with hybrid PermMondrian (PM) • Combines Mondrian with Anatomy • Query Workload • Reconstruction Error
Conclusions • Anonymizing transaction data • High-dimensionality • Preserving correlation • Future work • Different encodings for data representation • Enhance correlation among consecutive rows