Differentially Private Data Release for Data Mining
Noman Mohammed*, Rui Chen*, Benjamin C. M. Fung*, Philip S. Yu+
*Concordia University, Montreal, Canada
+University of Illinois at Chicago, IL, USA
Introduction

Privacy-preserving data publishing (PPDP) studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks.

Figure 1: Overview of privacy-preserving data publishing

Differential privacy is one of the strongest privacy definitions. A differentially private mechanism ensures that any output (the released data) is almost equally likely on any two input data sets that differ in a single record, and thus guarantees that all outputs are insensitive to any individual's data.

Current techniques:
Table 1: Raw data
Table 2: Contingency table

Our approach:
Table 3: Generalized contingency table
Figure 2: Taxonomy tree for Job

Anonymization Algorithm

The proposed method is the first generalization-based algorithm for differentially private data release that preserves information for classification analysis. Consider the raw data set of Table 1.
• Step 1: The algorithm creates a generalized root partition.
• Step 2: At each iteration, the algorithm uses the exponential mechanism to select a candidate for specialization.
• Step 3: The algorithm outputs each leaf partition along with its noisy count using the Laplace mechanism.

Figure 3: Tree for partitioning records

Experimental Evaluation

• We employ the publicly available Adult data set. Adult has 45,222 census records with 6 numerical attributes, 8 categorical attributes, and a binary class column representing two income levels, ≤50K or >50K.
• We observe two general trends from the experiments. First, the privacy budget has a direct impact on the classification accuracy. Second, the classification accuracy initially increases with the number of specializations.
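Steps 2 and 3 of the algorithm can be sketched in Python. This is a minimal illustration, not the authors' DiffGen implementation: the candidate scores (standing in for a utility function such as information gain) and the leaf counts are invented for demonstration.

```python
import math
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def exponential_mechanism(candidates, score, epsilon, sensitivity=1.0):
    """Select a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    weights = [math.exp(epsilon * score(c) / (2 * sensitivity)) for c in candidates]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# Step 2: choose which attribute to specialize; the scores below are
# made-up utility values, not the paper's actual scoring function.
scores = {"Job": 3.0, "Age": 1.0, "Education": 2.0}
chosen = exponential_mechanism(list(scores), scores.get, epsilon=1.0)

# Step 3: release each leaf partition's count plus Lap(1/epsilon) noise.
epsilon = 0.5
leaf_counts = {"Professional": 7, "Artist": 4}
noisy = {leaf: n + laplace_noise(1 / epsilon) for leaf, n in leaf_counts.items()}
```

Because both steps consume privacy budget, a full implementation would split ε across the iterations of Step 2 and the final counts of Step 3.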
However, after a certain threshold the accuracy decreases as the number of specializations grows further.
• We compare DiffGen with DiffP-C4.5 [2], a differentially private interactive algorithm for building a classifier, and with the top-down specialization (TDS) approach [4], which publishes k-anonymous data for classification analysis.

Note on the contingency-table baseline: each released cell count is the true count plus Lap(1/ε) noise, so an empty cell is published as 0 + Lap(1/ε); for high-dimensional data, this noise is too big relative to the true counts.

Figure 4: Classification accuracy for Max
Figure 5: Comparison

Conclusions

The proposed solution connects the classical generalization technique with output perturbation to effectively anonymize raw data. Experimental results suggest that the proposed solution provides better utility than the interactive approach and the k-anonymous data, and that it is more effective than publishing a contingency table.

References

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[2] A. Friedman and A. Schuster. Data mining with differential privacy. In SIGKDD, 2010.
[3] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4):1–53, June 2010.
[4] B. C. M. Fung, K. Wang, and P. S. Yu. Anonymizing classification data for privacy preservation. IEEE TKDE, 19(5):711–725, May 2007.

Acknowledgments: The research is supported in part by the Discovery Grants (356065-2008), Strategic Project Grants (130570020), and Canada Graduate Scholarships from the Natural Sciences and Engineering Research Council of Canada, and NSF grant IIS-0914934.
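The weakness of the contingency-table baseline noted above can be shown numerically. The sketch below uses an invented sparse table (most true cell counts 0 or 1, as is typical when a table spans many attribute combinations) and adds Lap(1/ε) noise per cell, as the baseline does; it is illustrative only.

```python
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

# Hypothetical sparse contingency table: most true cell counts are 0 or 1,
# so Lap(1/epsilon) noise is on the same order as the counts themselves.
true_counts = [0, 1, 0, 0, 2, 0, 1, 0]
epsilon = 1.0
released = [c + laplace_noise(1 / epsilon) for c in true_counts]
```

With ε = 1 the noise has standard deviation √2, larger than most of the true counts, which is why generalization before perturbation can preserve more utility.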