Top-Down Specialization for Information and Privacy Preservation. Benjamin C. M. Fung Simon Fraser University BC, Canada bfung@cs.sfu.ca. Ke Wang Simon Fraser University BC, Canada wangk@cs.sfu.ca. Philip S. Yu IBM T.J. Watson Research Center psyu@us.ibm.com. IEEE ICDE 2005.
Conference Paper • Top-Down Specialization for Information and Privacy Preservation. • Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005). • Conference paper and software: • http://www.cs.sfu.ca/~ddm • Software Privacy TDS 1.0
Outline • Problem: Anonymity for Classification • Our method: Top-Down Specialization (TDS) • Related works • Experimental results • Conclusions • Q & A
Motivation • Government and business have strong motivations for data mining. • Citizens have growing concern about protecting their privacy. • Can we satisfy both the data mining goal and the privacy goal?
Scenario • A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification without sacrificing the privacy of the individuals in the released data. [Figure: person-specific data flows from the data owner to the data recipients.]
Classification Goal • [Figure: a classification algorithm builds a decision tree over Education (Bachelors, Masters, Doctorate) and Sex (Female, Male) from the training data, with leaves labeled G or B; the unseen record (Masters, Female) is classified as Credit = Good.]
Privacy Threat • If a description on (Education, Sex) is so specific that few people match it, releasing the table lets an adversary link a unique individual, or a small number of individuals, to sensitive information. [Figure: data recipients, adversary.]
Privacy Goal: Anonymity • The privacy goal is specified by the anonymity on a combination of attributes called a Virtual Identifier (VID): each description on a VID must be shared by at least k records in the table. • Anonymity requirement • Consider VID_1, …, VID_p, e.g., VID = {Education, Sex}. • a(vid_i) denotes the number of data records in T that share the value vid_i on VID_i, e.g., vid = {Doctorate, Female}. • A(VID_i) denotes the smallest a(vid_i) for any value vid_i on VID_i. • A table T satisfies the anonymity requirement {<VID_1, k_1>, …, <VID_p, k_p>} if A(VID_i) ≥ k_i for 1 ≤ i ≤ p, where k_i is the anonymity threshold on VID_i specified by the data owner.
Anonymity Requirement • Example: VID_1 = {Education, Sex}, k_1 = 4. • A(VID_1) = 1, so the table violates the requirement because 1 < 4.
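The anonymity check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy table, the "Grad School" generalization, and the function names `anonymity` and `satisfies` are hypothetical and do not come from the slides or the released software.

```python
from collections import Counter

def anonymity(table, vid):
    """A(VID): the smallest a(vid) over all value combinations present in the table."""
    counts = Counter(tuple(row[attr] for attr in vid) for row in table)
    return min(counts.values(), default=0)

def satisfies(table, requirement):
    """Check A(VID_i) >= k_i for every <VID_i, k_i> pair in the requirement."""
    return all(anonymity(table, vid) >= k for vid, k in requirement)

def make_rows(education, sex, n):
    """Helper that builds n identical toy records (hypothetical data)."""
    return [{"Education": education, "Sex": sex} for _ in range(n)]

table = (make_rows("Bachelors", "Male", 4)
         + make_rows("Masters", "Female", 4)
         + make_rows("Doctorate", "Female", 1))
requirement = [(("Education", "Sex"), 4)]

a_raw = anonymity(table, ("Education", "Sex"))   # the lone (Doctorate, Female) record
ok_raw = satisfies(table, requirement)

# Generalizing Masters and Doctorate to a common parent merges the small group.
generalized = [
    dict(row, Education="Grad School")
    if row["Education"] in {"Masters", "Doctorate"} else row
    for row in table
]
ok_generalized = satisfies(generalized, requirement)
print(a_raw, ok_raw, ok_generalized)
```

In this toy table the single (Doctorate, Female) record is what drives A(VID_1) down, and generalizing its Education value is what restores the requirement.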
Problem Definition • Anonymity for Classification: • Given a table T, an anonymity requirement, and a taxonomy tree for each categorical attribute in ∪VID_j, generalize T to satisfy the anonymity requirement while preserving as much information as possible for classification. • A VID may contain both categorical and continuous attributes.
Solution: Generalization • Generalize values in ∪VID_j.
Intuition • Classification goal and privacy goal have no conflicts: • Privacy goal: mask sensitive information, usually specific descriptions that identify individuals. • Classification goal: extract general structures that capture trends and patterns. • A table contains multiple classification structures. Generalizations destroy some classification structures, but other structures emerge to help. If generalization is “carefully” performed, identifying information can be masked while still preserving trends and patterns for classification.
Algorithm: Top-Down Specialization (TDS)
    Initialize every value in T to the topmost value.
    Initialize Cut_i to include the topmost value.
    while some x ∈ ∪Cut_i is valid do
        Find the Best specialization, i.e., the one with the highest Score in ∪Cut_i.
        Perform the Best specialization on T and update ∪Cut_i.
        Update Score(x) and validity for x ∈ ∪Cut_i.
    end while
    return generalized T and ∪Cut_i.
[Figure: taxonomy tree for Age, with ANY [1-99) specialized into [1-37) and [37-99).]
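The loop above can be sketched in Python under heavy simplifications: a single VID, categorical attributes only, and every score recomputed from scratch rather than maintained incrementally. The taxonomy, the toy records, and all names (`tds`, `ancestor_in_cut`, `TAXONOMY`, `ROOTS`, `ROWS`) are hypothetical, and the score is assumed to be information gain divided by anonymity loss plus one.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def ancestor_in_cut(raw, cut_values, parent):
    """Walk up the taxonomy from a raw value until a value in the cut is reached."""
    v = raw
    while v not in cut_values:
        v = parent[v]
    return v

def anonymity(rows, cut, parent):
    """Size of the smallest group of records sharing a generalized VID value."""
    groups = Counter(
        tuple(ancestor_in_cut(r[a], cut[a], parent[a]) for a in cut) for r in rows)
    return min(groups.values())

def tds(rows, taxonomy, roots, k, cls="Class"):
    parent = {a: {c: p for p, cs in t.items() for c in cs} for a, t in taxonomy.items()}
    cut = {a: {roots[a]} for a in taxonomy}       # start fully generalized
    while True:
        before = anonymity(rows, cut, parent)
        best, best_score = None, -1.0
        for a in cut:
            for v in sorted(cut[a]):
                if v not in taxonomy[a]:
                    continue                       # leaf value: nothing to specialize
                cand = {b: set(vs) for b, vs in cut.items()}
                cand[a] = (cand[a] - {v}) | set(taxonomy[a][v])
                after = anonymity(rows, cand, parent)
                if after < k:
                    continue                       # invalid: would violate anonymity
                r_v = [r for r in rows if ancestor_in_cut(r[a], cut[a], parent[a]) == v]
                if not r_v:
                    continue
                by_child = {}
                for r in r_v:
                    child = ancestor_in_cut(r[a], cand[a], parent[a])
                    by_child.setdefault(child, []).append(r[cls])
                gain = entropy([r[cls] for r in r_v]) - sum(
                    len(g) / len(r_v) * entropy(g) for g in by_child.values())
                score = gain / ((before - after) + 1)   # assumed form of Score
                if score > best_score:
                    best, best_score = cand, score
        if best is None:
            return cut                             # no valid specialization remains
        cut = best

TAXONOMY = {
    "Education": {"ANY_Edu": ["Secondary", "University"],
                  "Secondary": ["11th", "12th"],
                  "University": ["Bachelors", "Masters"]},
    "Sex": {"ANY_Sex": ["Male", "Female"]},
}
ROOTS = {"Education": "ANY_Edu", "Sex": "ANY_Sex"}
ROWS = [dict(zip(("Education", "Sex", "Class"), t)) for t in [
    ("11th", "Male", "B"), ("11th", "Male", "B"), ("12th", "Female", "B"),
    ("12th", "Male", "B"), ("Bachelors", "Female", "G"), ("Bachelors", "Female", "G"),
    ("Masters", "Male", "G"), ("Masters", "Female", "G")]]

final_cut = tds(ROWS, TAXONOMY, ROOTS, k=2)
print(final_cut)
```

On this toy input with k = 2, Education ends up fully specialized while Sex stays at its topmost value, because splitting on Sex would leave a group with a single record. The sketch assumes the fully generalized table already contains at least k records.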
Search Criteria: Score • Consider a specialization v → child(v). To heuristically maximize the information of the generalized data for achieving a given anonymity, we favor the specialization on v that has the maximum information gain for each unit of anonymity loss: $Score(v) = \frac{InfoGain(v)}{AnonyLoss(v) + 1}$
Search Criteria: Score • R_v denotes the set of records having value v before the specialization. R_c denotes the set of records having value c after the specialization, where c ∈ child(v). • $InfoGain(v) = I(R_v) - \sum_{c} \frac{|R_c|}{|R_v|} I(R_c)$ • I(R_x) is the entropy of R_x [8]: $I(R_x) = -\sum_{cls} \frac{freq(R_x, cls)}{|R_x|} \log_2 \frac{freq(R_x, cls)}{|R_x|}$ • freq(R_x, cls) is the number of records in R_x having the class cls. • Intuitively, I(R_x) measures the impurity of classes for the data records in R_x. A good specialization reduces the impurity of classes. • InfoGain(v) is also used to determine the optimal binary split on a continuous interval [8].
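A small Python sketch of these quantities, computed directly from class labels. The function names are hypothetical, and the `+ 1` in the denominator of `score` is an assumption that guards against division by zero when a specialization loses no anonymity.

```python
import math
from collections import Counter

def entropy(labels):
    """I(R_x): class impurity of a set of records, given their class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_groups):
    """InfoGain(v): entropy of R_v minus the size-weighted entropy of each R_c."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups if g)
    return entropy(parent_labels) - weighted

def score(gain, anony_loss):
    """Information gain per unit of anonymity loss (assumed +1 in the denominator)."""
    return gain / (anony_loss + 1)

# Hypothetical example: a value v whose two children separate the classes perfectly.
r_v = ["G", "G", "B", "B"]
children = [["G", "G"], ["B", "B"]]
gain = info_gain(r_v, children)
print(gain, score(gain, anony_loss=3))
```

A specialization that separates the classes perfectly removes all impurity, so its gain equals the entropy of R_v; the score then discounts that gain by the anonymity it costs.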
Perform the Best Specialization • To perform the Best specialization Best → child(Best), we need to retrieve R_Best, the set of data records containing the value Best. • Taxonomy Indexed PartitionS (TIPS) is a tree structure with each node representing a generalized record over ∪VID_j, and each child node representing a specialization of the parent node on exactly one attribute. • Stored with each leaf node is the set of data records having the same generalized record.
Consider VID_1 = {Education, Sex}, VID_2 = {Sex, Age}. [Figure: TIPS example, showing the Age taxonomy (ANY [1-99) specialized into [1-37) and [37-99)) and the links Link_Secondary and Link_[37-99).]
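The leaf level of a TIPS-like structure can be sketched in Python as a dictionary from generalized records to the raw records they cover, plus a link index from each generalized value to the leaf partitions containing it. This is a simplification, not the authors' data structure: it keeps only the leaves, and `specialize_partitions`, `build_links`, and `child_of` are hypothetical names.

```python
def specialize_partitions(partitions, attr_index, v, child_of):
    """Split every leaf partition whose generalized record has value v at attr_index.

    partitions maps a generalized record (tuple) to the list of raw records it covers.
    child_of maps a raw attribute value to the child of v that covers it.
    """
    result = {}
    for gen_record, rows in partitions.items():
        if gen_record[attr_index] != v:
            result.setdefault(gen_record, []).extend(rows)   # untouched partition
            continue
        for row in rows:
            child = child_of(row[attr_index])
            new_record = gen_record[:attr_index] + (child,) + gen_record[attr_index + 1:]
            result.setdefault(new_record, []).append(row)
    return result

def build_links(partitions):
    """Link_x: for each generalized value x, the leaf partitions that contain x."""
    links = {}
    for gen_record in partitions:
        for value in gen_record:
            links.setdefault(value, []).append(gen_record)
    return links

# Hypothetical toy data: records are (Education, Sex) tuples of raw values.
partitions = {("ANY_Edu", "ANY_Sex"): [("11th", "Male"),
                                       ("Bachelors", "Female"),
                                       ("Masters", "Female")]}

def child_of(raw):
    return "Secondary" if raw in {"11th", "12th"} else "University"

after = specialize_partitions(partitions, 0, "ANY_Edu", child_of)
links = build_links(after)
print(after)
print(links["University"])
```

With such an index, the records containing a value can be reached by following its link instead of scanning the whole table, which is the role the links play in the figure.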
Count Statistics • While performing a specialization, collect the following count statistics. • |R_c|, the number of records containing c after specializing Best, where c ∈ child(Best). • |R_d|, the number of records containing d as if d is specialized, where d ∈ child(c). • freq(R_c, cls), the number of records in R_c having class cls. • freq(R_d, cls), the number of records in R_d having class cls. • |P_d|, the number of records in partition P_d, where P_d is a child partition under P_c as if c is specialized. • Score can be updated using the above count statistics without accessing data records.
Update the Score • Update InfoGain(v): Specialization does not affect InfoGain(v) except for each value c ∈ child(Best). • Update AnonyLoss(v): Specialization affects A_v(VID_j), which is the minimum a(vid_j) after specializing Best. If attribute(v) and attribute(Best) are contained in the same VID_j, then AnonyLoss(v) has to be updated.
Update the Score • We need to efficiently extract a(vid_j) from TIPS for updating A_v(VID_j). • Virtual Identifier TreeS (VITS): • VIT_j for VID_j = {D_1, …, D_w} is a tree of w levels. The level i > 0 represents the generalized values for D_i. Each root-to-leaf path represents an existing vid_j on VID_j in the generalized data, with a(vid_j) stored at the leaf node.
Consider VID_1 = {Education, Sex}, VID_2 = {Sex, Age}.
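One way to picture a VIT is as nested dictionaries, one level per attribute of the VID, with the count a(vid) stored at each leaf. The sketch below is an assumption about representation only; `build_vit` and `min_count` are hypothetical names, and the generalized rows are toy data.

```python
def build_vit(generalized_rows, vid):
    """Build a tree of len(vid) levels; each root-to-leaf path is an existing vid,
    and the leaf stores a(vid), the number of records sharing that vid."""
    root = {}
    for row in generalized_rows:
        node = root
        for attr in vid[:-1]:
            node = node.setdefault(row[attr], {})
        last = row[vid[-1]]
        node[last] = node.get(last, 0) + 1
    return root

def min_count(node):
    """A(VID): the smallest a(vid) stored at any leaf of the tree."""
    if isinstance(node, int):
        return node
    return min(min_count(child) for child in node.values())

# Hypothetical generalized records.
rows = [
    {"Education": "University", "Sex": "Female", "Age": "[37-99)"},
    {"Education": "University", "Sex": "Female", "Age": "[37-99)"},
    {"Education": "Secondary", "Sex": "Male", "Age": "[1-37)"},
]
vit_1 = build_vit(rows, ("Education", "Sex"))
vit_2 = build_vit(rows, ("Sex", "Age"))
print(vit_1, min_count(vit_1))
print(vit_2, min_count(vit_2))
```

Keeping one such tree per VID means the anonymity of each VID can be read off the leaf counts without rescanning the data records.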
Benefits of TDS • Handling multiple VIDs • Treating all VIDs as a single VID leads to overgeneralization. • Handling both categorical and continuous attributes • Dynamically generates a taxonomy tree for continuous attributes. • Anytime solution • The user may step through each specialization to determine a desired trade-off between privacy and accuracy. • The user may stop at any time and obtain a generalized table satisfying the anonymity requirement. The bottom-up approach does not support this feature. • Scalable computation
Related Works • The concept of anonymity was proposed by Dalenius [2]. • Sweeney [10] employed bottom-up generalization to achieve k-anonymity. • Single VID. Does not consider the specific use of the data. • Iyengar [6] proposed a genetic algorithm (GA) to address the problem of anonymity for classification. • Single VID. • The GA needs 18 hours, while our method needs only 7 seconds to generalize the same set of records (with comparable accuracy). • Wang et al. [12] recently proposed bottom-up generalization to address the same problem. • Only for categorical attributes. • Does not support an anytime solution.
Experimental Evaluation • Data quality. • A broad range of anonymity requirements. • Used C4.5 and Naïve Bayesian classifiers. • Compare with Iyengar’s genetic algorithm [6]. • Results were quoted from [6]. • Efficiency and Scalability.
Data set • Adult data set • Used in Iyengar [6]. • Census data. • 6 continuous attributes. • 8 categorical attributes. • Two classes. • 30162 recs. for training. • 15060 recs. for testing.
Data Quality • Include the TopN most important attributes into a SingleVID, which is more restrictive than breaking them into multiple VIDs.
Data Quality • Compare with the genetic algorithm using C4.5. • Only categorical attributes. Same taxonomy trees. [Figure legend: Our method.]
Efficiency and Scalability • Took at most 10 seconds for all previous experiments. • Replicate the Adult data set and substitute some random data.
Conclusions • Quality classification and privacy preservation can coexist. • An effective top-down method to iteratively specialize the data, guided by maximizing the information utility and minimizing privacy specificity. • Simple but effective data structures. • Great applicability to both public and private sectors that share information for mutual benefits.
Thank you. Questions?
References • [1] R. Agrawal and R. Srikant. Privacy preserving data mining. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000. ACM Press. • [2] T. Dalenius. Finding a needle in a haystack - or identifying anonymous census record. Journal of Official Statistics, 2(3):329–336, 1986. • [3] W. A. Fuller. Masking procedures for microdata disclosure limitation. Journal of Official Statistics, 9(2):383–406, 1993. • [4] S. Hettich and S. D. Bay. The UCI KDD Archive, 1999. http://kdd.ics.uci.edu. • [5] A. Hundepool and L. Willenborg. μ- and τ-argus: Software for statistical disclosure control. In the 3rd International Seminar on Statistical Confidentiality, Bled, 1996. • [6] V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288, Edmonton, AB, Canada, July 2002. • [7] J. Kim and W. Winkler. Masking microdata files. In ASA Proceedings of the Section on Survey Research Methods, pages 114–119, 1995. • [8] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. • [9] P. Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, volume 13, pages 1010–1027, 2001. • [10] L. Sweeney. Datafly: A system for providing anonymity in medical data. In International Conference on Database Security, pages 356–381, 1998.
References • [11] The House of Commons in Canada. The Personal Information Protection and Electronic Documents Act, April 2000. http://www.privcom.gc.ca/. • [12] K. Wang, P. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 2004.