Instituto de Investigación en Inteligencia Artificial, Consejo Superior de Investigaciones Científicas
ONN the use of Neural Networks for Data Privacy
Jordi Pont-Tuset, Pau Medrano Gracia, Jordi Nin, Josep Lluís Larriba Pey, Victor Muntés i Mulero
Presentation Schema • Motivation • Basic Concepts • Ordered Neural Networks (ONN) • Experimental Results • Conclusions and Future Work
Our Scenario: attribute classification • Classification of attributes • Identifiers (ID) • Quasi-identifiers • Confidential (C) • Non-Confidential (NC)
Data Privacy and Anonymization [Diagram: an intruder links the released data to an external data source through the shared non-confidential attributes (NC), recovering the identifiers (ID). Confidential data disclosure!]
Data Privacy and Anonymization • Anonymization Process • Goal: ensure protection while preserving statistical usefulness • Trade-Off: Accuracy vs. Privacy • Related fields: Privacy in Statistical Databases (PSD), Privacy Preserving Data Mining (PPDM) [Diagram: the perturbed attributes (NC') hinder record linkage with the external data source]
Presentation Schema • Motivation • Basic Concepts • Ordered Neural Networks (ONN) • Experimental Results • Conclusions and Future Work
Best Ranked Protection Methods [DT01] • Rank Swapping (RS-p) [Moore96] • Sorts the values of each attribute and swaps them randomly within a restricted range of size p • Microaggregation (MIC-vm-k) [DM02] • Builds small clusters of at least k elements from v variables • Then, it replaces each value by the centroid of the cluster to which it belongs [DT01] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, Elsevier Science (2001) 111-133 [Moore96] Moore, R.: Controlled data swapping techniques for masking public use microdata sets. U.S. Bureau of the Census (Unpublished manuscript) (1996) [DM02] Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. on KDE 14 (2002) 189-201
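As a concrete illustration, here is a minimal sketch of the Rank Swapping idea: sort an attribute's values by rank, then swap each one with a randomly chosen partner whose rank lies within a window of size p. All names are illustrative; the actual RS-p algorithm in [Moore96] differs in details (e.g., each value is typically swapped at most once).

```python
import random

def rank_swap(values, p, seed=0):
    """Rank Swapping sketch: swap each value with a partner whose rank
    is at most p positions away, then restore the original record order."""
    rng = random.Random(seed)
    # order[k] = index (in `values`) of the value with rank k
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranked = [values[i] for i in order]
    for r in range(len(ranked)):
        lo, hi = max(0, r - p), min(len(ranked) - 1, r + p)
        s = rng.randint(lo, hi)  # random partner within the rank window
        ranked[r], ranked[s] = ranked[s], ranked[r]
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = ranked[rank]
    return out
```

The output is a permutation of the input, so univariate statistics such as the mean are preserved exactly, while individual records are perturbed.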
Our contribution... • We propose a new perturbative protection method for numerical data based on the use of neural networks • Basic idea: learning a pseudo-identity function (quasi-learning ANNs) • Anonymizing numerical data sets
Artificial Neural Networks • Each neuron weights its inputs and applies an activation function (here, the sigmoid) • For our purpose, we assume ANNs without feedback connections and without layer-bypassing
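The weighted-sum-plus-activation behaviour of a single neuron can be sketched as follows (function and parameter names are illustrative):

```python
import math

def sigmoid(x):
    """Sigmoid activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """A single neuron: weighted sum of its inputs plus a bias term,
    passed through the sigmoid activation function."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(s)
```

For example, with zero weighted input the neuron sits at the midpoint of the sigmoid, where its slope (and hence its sensitivity during learning) is largest.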
Backpropagation Algorithm • Allows the ANN to learn from a predefined set of input-output example pairs • It adjusts the weights of the ANN iteratively • In each iteration, we calculate the error in the output layer as the sum of squared differences • Weights are updated using an iterative steepest descent method
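The steps above can be sketched for a tiny one-input, one-hidden-layer, one-output network. This is a generic backpropagation sketch, not ONN's actual training code; the hidden size, learning rate, and epoch count are illustrative.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(pairs, hidden=3, eta=1.0, epochs=4000, seed=1):
    """Train a 1-input / `hidden` / 1-output sigmoid network by
    backpropagation on (input, target) pairs, minimising the squared
    error with per-example steepest descent."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output
    b2 = 0.0
    for _ in range(epochs):
        for x, t in pairs:
            h = [sigmoid(w1[j] * x + b1[j]) for j in range(hidden)]
            y = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            dy = (y - t) * y * (1 - y)  # output delta for squared error
            for j in range(hidden):
                dh = dy * w2[j] * h[j] * (1 - h[j])  # hidden delta
                w2[j] -= eta * dy * h[j]
                w1[j] -= eta * dh * x
                b1[j] -= eta * dh
            b2 -= eta * dy

    def predict(x):
        h = [sigmoid(w1[j] * x + b1[j]) for j in range(hidden)]
        return sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
    return predict
```

Trained on a pair of identity examples, the network approximates (but never exactly reproduces) its inputs, which is precisely the "quasi-learning" behaviour ONN exploits.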
Presentation Schema • Motivation • Basic Concepts • Ordered Neural Networks (ONN) • Experimental Results • Conclusions and Future Work
Ordered Neural Networks (ONN) • Key idea: inaccurately learning the original data set, using ANNs, in order to reproduce a similar one: • Similar enough to preserve the properties of the original data set • Different enough not to reveal the original confidential values
Ordered Neural Networks (ONN) • How can we learn the original data set? • Trying to learn the original data set with a single neural network is TOO COMPLEX
Ordered Neural Networks (ONN) • How can we learn the original data set? • We could sort each attribute independently in order to simplify the learning process • But the pattern to be learnt may still be too complex
Ordered Neural Networks (ONN) • How can we learn the original data set? • Reordering each attribute separately means the concept of tuple is lost! • Why are we so keen on preserving the attribute semantics?
Ordered Neural Networks (ONN) • Different approach: • We ignore the attribute semantics, mixing all the values in the database • We sort them to make the learning process easier • We partition the values into several blocks in order to simplify the learning process
Vectorization • ONN ignores the attribute semantics to reduce the learning process cost
Sorting • Objective: simplify the learning process and reduce learning time
Partitioning • The size of the data set may be very large • A single ANN would make the learning process very difficult • ONN uses a different ANN for each partition
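The three preprocessing steps described above (vectorization, sorting, partitioning) can be sketched together. Function names and the equal-sized-block policy are illustrative assumptions, not ONN's exact implementation:

```python
import math

def preprocess(table, num_partitions):
    """ONN-style preprocessing sketch: flatten the table into one vector
    (ignoring attribute semantics), sort it, and split it into blocks.
    The sort order is kept so protected values can be mapped back later."""
    flat = [v for row in table for v in row]          # vectorization
    order = sorted(range(len(flat)), key=lambda i: flat[i])
    vec = [flat[i] for i in order]                    # sorting
    size = math.ceil(len(vec) / num_partitions)
    blocks = [vec[i:i + size] for i in range(0, len(vec), size)]
    return blocks, order                              # partitioning
```

Because each block holds a contiguous run of similar values, the function each ANN has to learn is much smoother than the raw data.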
Normalization • In order to make the learning process possible, it is necessary to normalize the input data • We normalize the values so that their images fit in the range where the slope of the activation function is relevant [FS91] [FS91] Freeman, J.A., Skapura, D.M.: Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley Publishing Company (1991) 1-106
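One possible realization of this step is a linear map into a symmetric interval [-B, B] around zero, where the sigmoid's slope is non-negligible; B plays the role of ONN's normalization-range-size parameter, but this exact scheme is an assumption, not the paper's formula:

```python
def normalize(values, B):
    """Map values linearly into [-B, B], the region where the sigmoid
    activation has a significant slope (assumed scheme)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    return [(2 * B * (v - lo) / span) - B for v in values]

def denormalize(values, lo, hi, B):
    """Inverse of normalize(), given the original range [lo, hi]."""
    span = (hi - lo) or 1.0
    return [(v + B) * span / (2 * B) + lo for v in values]
```

The pair is an exact round trip, so no information is lost in this step; all perturbation comes from the quasi-learning networks themselves.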
Learning Step • Given P partitions, we have one ANN per partition • Each ANN is fed with values coming from the P partitions in order to add noise [Diagram: backpropagation compares each network's output with the corresponding original value, e.g. a vs. a' and c vs. c']
Protection Step • First, we propagate the original data set through the trained ANNs • Finally, we de-normalize the generated values [Diagram: the de-normalized outputs replace the original values in the data set]
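Putting the pieces together, the protection step can be sketched as follows: propagate each sorted block through its trained, quasi-learning network, map the outputs back to the original value range, and undo the global sort so every protected value returns to its original record position. The parameter names and the linear de-normalization are assumptions for illustration:

```python
def protect(blocks, order, predict_fns, lo, hi):
    """Protection-step sketch. `blocks` and `order` come from the
    preprocessing, `predict_fns` are the trained per-partition networks,
    and [lo, hi] is the original value range used for normalization."""
    # propagate every (sorted) value through its partition's network
    outputs = [f(v) for block, f in zip(blocks, predict_fns) for v in block]
    # de-normalize the sigmoid-range outputs back to [lo, hi] (assumed)
    restored = [lo + y * (hi - lo) for y in outputs]
    # undo the global sort: order[rank] is the original index of rank
    out = [0.0] * len(restored)
    for rank, idx in enumerate(order):
        out[idx] = restored[rank]
    return out
```

With identity networks this reduces to de-normalization plus reordering; the privacy comes from the networks being only approximate.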
Presentation Schema • Motivation • Basic Concepts • Ordered Neural Networks (ONN) • Experimental Results • Conclusions and Future Work
Experiments Setup • Data used in the CASC Project (http://neon.vb.cbs.nl/casc) • Data from the US Census Bureau: • 1080 tuples × 13 attributes = 14040 values to be protected • We compare our algorithm with the 5 best parameterizations presented in the literature for: • Rank Swapping • Microaggregation • ONN is parameterized ad hoc
Experiments Setup • ONN parameterization: • P: Number of Partitions • B: Normalization Range Size • E: Learning Rate Parameter • C: Activation Function Slope Parameter • H: Number of neurons in the hidden layer
Score: Protection Methods Evaluation • We need a protection quality score that measures: • The difficulty for an intruder to reveal the original data • The information loss in the protected data set
Score = 0.5 · IL + 0.5 · DR
IL = 100 · (0.2 IL1 + 0.2 IL2 + 0.2 IL3 + 0.2 IL4 + 0.2 IL5)
DR = 0.5 · DLD + 0.5 · ID
IL1 = mean absolute error; IL2 = mean variation of the averages; IL3 = mean variation of the variances; IL4 = mean variation of the covariances; IL5 = mean variation of the correlations; DLD = number of links obtained using DBRL; ID = protected values near the original ones
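The score formulas above translate directly into code. This sketch only combines the components; computing IL1-IL5, DLD and ID themselves is out of scope here, and the units of the inputs are assumed, not taken from the paper:

```python
def score(il_components, dld, id_frac):
    """Combine the five information-loss measures (IL1..IL5) and the two
    disclosure-risk measures (DLD, ID) into the overall score, weighing
    information loss and disclosure risk equally."""
    assert len(il_components) == 5
    il = 100 * sum(0.2 * c for c in il_components)  # IL
    dr = 0.5 * dld + 0.5 * id_frac                  # DR
    return 0.5 * il + 0.5 * dr                      # Score
```

Lower scores are better: a method must keep both halves small simultaneously, which is exactly the accuracy-vs-privacy trade-off from the motivation.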
Results [Charts: score comparison of ONN against the best parameterizations of Rank Swapping and Microaggregation, for 7 variables and for 13 variables]
Presentation Schema • Motivation • Basic Concepts • Ordered Neural Networks (ONN) • Experimental Results • Conclusions and Future Work
Conclusions & Future Work • The use of ANNs combined with some preprocessing techniques is a promising basis for protection methods • In our experiments, ONN is able to improve on the protection quality of the best ranked protection methods in the literature • As future work, we would like to establish a set of criteria to automatically tune the parameters of ONN
Any questions? Contact e-mail: vmuntes@ac.upc.edu DAMA Group Web Site: http://www.dama.upc.edu