This presentation examines the challenge of anonymizing sparse high-dimensional transaction data while preserving correlations among items. It introduces a band-matrix data representation and an efficient group-formation heuristic for enforcing privacy, compares the approach experimentally with a hybrid of existing methods, and outlines future work on alternative data encodings and on strengthening correlation among consecutive rows.
On the Anonymization of Sparse High-Dimensional Data 1 National University of Singapore {ghinitag,kalnis}@comp.nus.edu.sg 2 Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk
Publishing Transaction Data • Publishing transaction data • Retail chain-owned shopping cart data • Infer consumer spending patterns • Correlations among purchased items • e.g., 90% of cereals buyers also buy milk • What about privacy?
Privacy Threat Quasi-identifying Items Sensitive Items
Privacy Paradigm • ℓ-diversity • prevent association between quasi-identifier and sensitive attributes • Create groups of transactions • freq. of an SA value in a group < 1/p • Objective • Enforce privacy • Preserve correlations among items • Challenge: high data dimensionality
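The group-level privacy condition can be sketched as a small check. This is a minimal illustration, not the authors' code: the function name `satisfies_privacy` and the item names are hypothetical. The slide writes the bound as "< 1/p"; the sketch assumes the bound is inclusive (at most 1/p), so that a group of p transactions may hold one occurrence of a sensitive item. Change `<=` to `<` for the strict reading.

```python
from collections import Counter
from fractions import Fraction


def satisfies_privacy(group, sensitive_items, p):
    """Check the per-group frequency bound for sensitive items.

    group: list of transactions, each an iterable of item names.
    sensitive_items: set of item names treated as sensitive.
    p: privacy degree (positive integer).
    Assumption: the bound is inclusive (frequency <= 1/p).
    """
    if not group:
        return False
    counts = Counter()
    for transaction in group:
        # Count each sensitive item at most once per transaction.
        counts.update(set(transaction) & sensitive_items)
    limit = Fraction(1, p)
    # Exact rational arithmetic avoids floating-point edge cases at the bound.
    return all(Fraction(c, len(group)) <= limit for c in counts.values())
```

For example, a group of three transactions in which one contains a sensitive item has frequency 1/3, which passes for p = 3 and fails for p = 4.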
Data Re-organization PRESERVES CORRELATIONS! Band Matrix Organization
Published Data Summary of Sensitive Items
Contributions • Novel data representation • Preserves correlation among items • Efficient heuristic for group formation • Linear time to data size • Supports multiple sensitive items
State-of-the-art: Mondrian [FWR06] • Generalization-based • data-space partitioning • similar to k-d-trees • split recursively as long as the privacy condition still holds • constrained global recoding • GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS [Figure: example partitioning for k = 2; axes: Age, Weight] [FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006
State-of-the-art: Anatomy[XT06] • Permutation-based method • discloses exact QID values “Anatomized” table RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS |G|! permutations [XT06] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation, Proceedings of the 32nd international conference on Very Large Data Bases (VLDB), 2006
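A permutation-based publication can be sketched as follows: the quasi-identifying items of every transaction are released exactly, tagged with a group id, while sensitive items appear only as per-group counts. This is an illustrative sketch under assumptions; `anatomize` and the item names are hypothetical, and the groups are taken as given rather than formed here.

```python
from collections import Counter


def anatomize(groups, sensitive_items):
    """Split grouped transactions into a QID table and a sensitive summary.

    groups: list of groups, each a list of transactions (iterables of items).
    Returns (qid_table, summary) where
      qid_table is a list of (group_id, sorted exact QID items) rows, and
      summary maps group_id -> {sensitive item: count within the group}.
    """
    qid_table = []
    summary = {}
    for gid, group in enumerate(groups):
        counts = Counter()
        for transaction in group:
            items = set(transaction)
            # Exact QID values are disclosed; only the link to the
            # sensitive items inside the group is hidden.
            qid_table.append((gid, sorted(items - sensitive_items)))
            counts.update(items & sensitive_items)
        summary[gid] = dict(counts)
    return qid_table, summary
```

Within a group, any of the published QID rows could be the one that carries a listed sensitive item, which is why the way groups are formed determines how much correlation survives.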
Band Matrix Representation • Bandwidth = U + L + 1 • Minimizing bandwidth is NP-hard
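The bandwidth formula can be made concrete with a short sketch. It assumes U and L are the upper and lower bandwidths, that is, the largest distance of a nonzero entry above and below the main diagonal. The function name `bandwidth` and the sparse row encoding are choices made for this illustration.

```python
def bandwidth(rows):
    """Return U + L + 1 for a sparse 0/1 matrix.

    rows: list where rows[i] is the set of column indices holding a
    nonzero entry in row i.
    """
    upper = 0  # U: largest (column - row) over nonzero entries
    lower = 0  # L: largest (row - column) over nonzero entries
    for i, cols in enumerate(rows):
        for j in cols:
            upper = max(upper, j - i)
            lower = max(lower, i - j)
    return upper + lower + 1
```

A tridiagonal pattern such as `[{0, 1}, {0, 1, 2}, {1, 2}]` has U = L = 1 and bandwidth 3, whereas the anti-diagonal pattern `[{2}, {1}, {0}]` has U = L = 2 and bandwidth 5.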
Reverse Cuthill-McKee (RCM) • Heuristic bandwidth minimization • Solves the corresponding graph labeling problem • Permutes rows and columns • Complexity O(N · D · log D) • N = matrix rows (# transactions) • D = maximum degree of any vertex
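The graph-labeling view of RCM can be sketched in plain Python: run a breadth-first traversal from a low-degree vertex, visit neighbours in order of increasing degree, and reverse the resulting order. This is a textbook-style sketch on an undirected graph given as an adjacency dictionary; how the transaction matrix is mapped to such a graph is not shown on the slide and is left out here. For real workloads, SciPy provides `scipy.sparse.csgraph.reverse_cuthill_mckee`.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """Return the vertices of an undirected graph in RCM order.

    adj: dict mapping each vertex to the set of its neighbours.
    Vertices must be mutually comparable (e.g. integers) so that
    ties in degree are broken deterministically.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    visited = set()
    order = []
    # Handle every connected component, lowest-degree start vertex first.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours by increasing degree.
            for n in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(n)
                queue.append(n)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order
```

On the path graph 0-3-1-2, whose original labels place the neighbours 0 and 3 three positions apart, the returned order puts every pair of neighbours in adjacent positions.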
Group Formation • Correlation-aware Anonymization of High-Dimensional Data (CAHD) • Use the order given by RCM • Consecutive transactions highly correlated • O(pN) complexity
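The idea of grouping along the RCM order can be sketched with a simplified greedy pass. This is not the CAHD algorithm itself: it only pairs each sensitive transaction with its nearest non-conflicting neighbours in the given order, and it does not verify that the leftover group meets the privacy condition, which a complete method would have to do. The names `greedy_groups` and `max_dist` are hypothetical; `max_dist` bounds the search window so that the work per sensitive transaction stays proportional to p.

```python
def greedy_groups(ordered, sensitive_items, p, max_dist=None):
    """Greedy grouping of transactions that are already in RCM order.

    ordered: list of transactions (iterables of items).
    Returns a list of groups, each a sorted list of indices into `ordered`.
    Simplification: leftovers form one final group that is NOT validated.
    """
    n = len(ordered)
    if max_dist is None:
        max_dist = 3 * p  # hypothetical default window size
    sens = [set(t) & sensitive_items for t in ordered]
    grouped = [False] * n
    groups = []
    for i in range(n):
        if grouped[i] or not sens[i]:
            continue
        group, used = [i], set(sens[i])
        # Walk outward from i: i-1, i+1, i-2, i+2, ...
        for dist in range(1, max_dist + 1):
            for j in (i - dist, i + dist):
                if len(group) == p:
                    break
                # Skip neighbours that share a sensitive item with the group.
                if 0 <= j < n and not grouped[j] and not (sens[j] & used):
                    group.append(j)
                    used |= sens[j]
            if len(group) == p:
                break
        if len(group) == p:
            for j in group:
                grouped[j] = True
            groups.append(sorted(group))
    leftovers = [j for j in range(n) if not grouped[j]]
    if leftovers:
        groups.append(leftovers)
    return groups
```

Because neighbours in the RCM order tend to share items, each sensitive transaction ends up grouped with transactions whose quasi-identifying items resemble its own.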
Experimental Setting • BMS dataset • Compare with hybrid PermMondrian(PM) • Combines Mondrian with Anatomy • Query Workload • Reconstruction Error
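The slide does not state how reconstruction error is measured. One common way to compare the true distribution of query answers with the distribution estimated from anonymized data is the Kullback-Leibler (KL) divergence; the sketch below assumes that choice and is only an illustration. It also assumes both inputs are probability vectors over the same cells and that the estimate is positive wherever the actual value is positive.

```python
import math


def kl_divergence(actual, estimated):
    """KL divergence between two discrete distributions.

    actual, estimated: sequences of probabilities over the same cells,
    each summing to 1. Cells with actual probability 0 contribute 0.
    Assumption: estimated[k] > 0 whenever actual[k] > 0.
    """
    total = 0.0
    for a, e in zip(actual, estimated):
        if a > 0:
            total += a * math.log(a / e)
    return total
```

The value is 0 when the estimate matches the actual distribution exactly and grows as the two diverge.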
Conclusions • Anonymizing transaction data • High-dimensionality • Preserving correlation • Future work • Different encodings for data representation • Enhance correlation among consecutive rows