Randomization in Privacy Preserving Data Mining
Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD'00
The following slides include materials from this paper.
Privacy-Preserving Data Mining • Problem: How do we publish data without compromising individual privacy? • Solution: randomization, anonymization
Randomization • Adding random noise to the original dataset • Challenge: is the data still useful for further analysis?
Randomization • Model: data is distorted by adding random noise • Original data X = {x₁, …, x_N}; to each record x_i ∈ X a random value y_i from Y = {y₁, …, y_N} is added, so the published data is Z = {z₁, …, z_N} with z_i = x_i + y_i • y_i is a random value drawn from • Uniform on [−α, +α] • Gaussian N(0, σ²)
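A minimal sketch of the additive model in NumPy; the attribute values and noise parameters here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative original attribute (e.g. ages); assumed, not from the paper.
x = rng.normal(30, 8, size=100_000)

alpha, sigma = 10.0, 5.0   # assumed noise parameters

# Uniform perturbation: y_i ~ U[-alpha, +alpha]
z_uniform = x + rng.uniform(-alpha, alpha, size=x.shape)

# Gaussian perturbation: y_i ~ N(0, sigma^2)
z_gauss = x + rng.normal(0.0, sigma, size=x.shape)
```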
Reconstruction • Perturbed data hides the original data distribution, which must be reconstructed before data mining • Given • the perturbed values x₁+y₁, x₂+y₂, …, x_n+y_n • the probability distribution of Y • Estimate the probability distribution of X (Clifton, AusDM'11)
Reconstruction • Bayes' rule is used to estimate the density of X; the iterative reconstruction algorithm: • f_X⁰ := uniform distribution; j := 0 • Repeat the Bayes update

  f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y(z_i − a) · f_X^j(a) ] / [ ∫ f_Y(z_i − t) · f_X^j(t) dt ]

  and increment j, until a stopping criterion is met
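A sketch of the iterative update on a discrete grid, assuming NumPy; the fixed iteration count stands in for the paper's stopping criterion, and the names (reconstruct, f_y, grid) are mine:

```python
import numpy as np

def reconstruct(z, f_y, grid, n_iter=200):
    """Iterative Bayesian reconstruction of f_X on a discrete grid.

    z    : 1-D array of perturbed values z_i = x_i + y_i
    f_y  : vectorized noise density, e.g. lambda t: norm.pdf(t, scale=0.5)
    grid : equally spaced points a at which f_X is estimated
    """
    da = grid[1] - grid[0]
    f_x = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))  # f_X^0 = uniform
    fy = f_y(z[:, None] - grid[None, :])   # f_Y(z_i - a), shape (n, |grid|)
    for _ in range(n_iter):                # fixed count in place of a stop test
        denom = (fy * f_x).sum(axis=1) * da       # ∫ f_Y(z_i - t) f_X^j(t) dt
        f_x = f_x * (fy / denom[:, None]).mean(axis=0)
        f_x /= f_x.sum() * da                     # renormalize to a density
    return f_x
```

The grid should cover the support of Z; for uniform noise, swap in the corresponding density for f_y.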
[Figure: original, randomized, and reconstructed distributions for Gaussian noise N(0, 0.25) and uniform noise on (−0.5, 0.5)]
Privacy Metric • If a value x can be estimated to lie in the interval [α, β] with c% confidence, then the interval width (β − α) defines the amount of privacy at c% confidence • Example: ages in [20, 40], 95% confidence, 50% privacy with uniform noise requires 2α = 20 × 0.5 / 0.95 ≈ 10.5
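The example's arithmetic, spelled out: with uniform noise on [−α, +α], a 95% confidence interval for x has width 2α × 0.95, and 50% privacy on an attribute of range 20 means that width must equal 10.

```latex
\[
2\alpha \times 0.95 = 20 \times 0.5
\quad\Longrightarrow\quad
2\alpha = \frac{20 \times 0.5}{0.95} \approx 10.5
\]
```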
Training Decision Trees • Split points: interval boundaries of the reconstructed distribution • Reconstruction strategies • Global: reconstruct once over all training data • ByClass: reconstruct separately for each class before building the tree (see the sketch after this slide) • Local: reconstruct separately for each class at every node • Dataset • Synthetic dataset; training set of 100,000 records and test set of 5,000 records, split equally between two classes
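A minimal sketch of the ByClass strategy, reusing the reconstruct() helper from the earlier sketch; z, labels, f_y, and grid are assumed to be defined as above, and the names are mine:

```python
import numpy as np

# ByClass: reconstruct the attribute distribution separately per class,
# once, before the tree is built; Local would repeat this at every node.
f_by_class = {
    c: reconstruct(z[labels == c], f_y, grid)
    for c in np.unique(labels)
}
```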
[Figure: decision-tree accuracy of the Original, Randomized, Global, ByClass, and Local variants]
Extended Work • '02: proposed a method to quantify information loss based on mutual information • '07: evaluated randomization combined with public information • Gaussian noise is better than uniform • Datasets with inherent cluster patterns improve randomization performance • Varying density and outliers decrease performance
Multiplicative Randomization • Rotation randomization • Data distorted by an orthogonal matrix • Projection randomization • Projects a high-dimensional dataset into a low-dimensional space • Both (approximately) preserve Euclidean distances, so they can be combined with distance-based classification (KNN, SVM) and clustering (K-means); see the sketch below
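A sketch of both multiplicative schemes in NumPy; the data and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1_000, 100, 30            # records, original dims, projected dims
X = rng.normal(size=(n, d))         # assumed original data

# Rotation randomization: a random orthogonal matrix (via QR) preserves
# Euclidean distances exactly.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Z_rot = X @ Q

# Projection randomization: a scaled random Gaussian matrix preserves
# distances approximately (Johnson-Lindenstrauss).
R = rng.normal(size=(d, k)) / np.sqrt(k)
Z_proj = X @ R

i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]),         # original distance
      np.linalg.norm(Z_rot[i] - Z_rot[j]),   # identical
      np.linalg.norm(Z_proj[i] - Z_proj[j])) # approximately equal
```

Because pairwise distances survive, KNN, SVM, or K-means can run directly on the published Z.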
Summary • Pros: data and noise are independent, can be applied at data-collection time, useful for streaming data • Cons: information loss, curse of dimensionality