Technical Seminar on Privacy Preserving Data Mining
Under the guidance of Indraneel Mukhopadhyay
By Sarmila Acharya, Roll No.: 200157041, Branch: IT
Database Privacy
• Statistical approaches: alter the frequency (PRAN/DS/PERT) of particular features while preserving means; additionally, erase values that reveal too much.
• Query-based approaches: involve a permanent trusted third party. Query monitoring: disallow queries that breach privacy. Perturbation: add noise to the query output.
• Statistical perturbation + adversarial analysis: combine statistical techniques with analysis similar to the query-based approaches.
Growing Privacy Concerns
• Popular press: Economist, "The End of Privacy" (May 99); Time, "The Death of Privacy" (Aug 97).
• Government directives/commissions: European directive on privacy protection (Oct 98); Canadian Personal Information Protection Act (Jan 2001).
• Surveys of web users: 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99); 82% said having a privacy policy would matter.
Privacy Preserving Methods
• Two methods were used for modifying values:
• Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
• Value Distortion: return a value xi + r instead of xi, where r is a random value drawn from some distribution. Two random distributions were used:
• Uniform: the random variable is uniformly distributed between [-α, +α]; its mean is 0.
• Gaussian: the random variable has a normal distribution with mean μ = 0 and standard deviation σ.
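As a minimal sketch of value distortion with the two noise distributions (function names and the example values are illustrative, not from the slides):

```python
import random

def distort_uniform(x, alpha):
    """Value distortion: return x + r, with r ~ Uniform[-alpha, +alpha] (mean 0)."""
    return x + random.uniform(-alpha, alpha)

def distort_gaussian(x, sigma):
    """Value distortion: return x + r, with r ~ Normal(mean 0, std dev sigma)."""
    return x + random.gauss(0.0, sigma)

# Perturb a column of ages; only the distorted values are disclosed.
random.seed(0)
ages = [23, 35, 42, 58, 61]
perturbed = [distort_gaussian(a, sigma=10.0) for a in ages]
```

The miner sees only `perturbed`; the true ages never leave the client.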
Quantifying Privacy
For quantifying the privacy provided by a method, we use a measure based on how closely the original values of a modified attribute can be estimated: the width of the interval that contains the original value with a given confidence.

Confidence   Discretization   Uniform      Gaussian
50%          0.5 × W          0.5 × 2α     1.34 × σ
95%          0.95 × W         0.95 × 2α    3.92 × σ
99.9%        0.999 × W        0.999 × 2α   6.8 × σ

(W denotes the width of a discretization interval.)
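The 50% and 95% Gaussian entries can be reproduced from the normal quantile function. A small sketch, assuming the privacy measure is the width of the symmetric interval containing the true value with the given confidence (helper names are my own):

```python
from statistics import NormalDist

def gaussian_privacy(confidence, sigma=1.0):
    """Width of the symmetric interval around the estimate that contains
    the original value with the given confidence, for Gaussian noise."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided quantile
    return 2.0 * z * sigma

def uniform_privacy(confidence, alpha=1.0):
    """Same measure for Uniform[-alpha, +alpha] noise: confidence x 2*alpha."""
    return confidence * 2.0 * alpha

for c in (0.50, 0.95):
    print(f"{c:.0%}: Gaussian {gaussian_privacy(c):.2f} x sigma, "
          f"Uniform {uniform_privacy(c):.2f} x alpha")
```

This yields 1.34 × σ and 3.92 × σ, matching the table.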
Reconstruction Problem
Original values x1, x2, ..., xn are drawn from an unknown probability distribution X. To hide them, we add y1, y2, ..., yn drawn from a known probability distribution Y. Given x1+y1, x2+y2, ..., xn+yn and the probability distribution of Y, estimate the probability distribution of X.
Intuition (Reconstruct a Single Point)
Use Bayes' rule for density functions.
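For a single observed value $w_i = x_i + y_i$, Bayes' rule for density functions gives the posterior density of the original value (written here in its standard form):

```latex
f_{X \mid X+Y=w_i}(a) \;=\; \frac{f_Y(w_i - a)\, f_X(a)}{\displaystyle\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
```

Intuitively: candidate original values $a$ are weighted by how likely the known noise distribution is to have carried $a$ to the observed $w_i$.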
Reconstructing the Distribution
• Combine the estimates of where each point came from, over all the points:
• This gives an estimate of the original distribution.
Reconstruction Algorithm
fX0 := uniform distribution
j := 0   // iteration number
repeat
    fXj+1(a) := (1/n) Σi [ fY(wi − a) fXj(a) / ∫ fY(wi − z) fXj(z) dz ]   // Bayes' rule
    j := j + 1
until (stopping criterion met)
Converges to the maximum likelihood estimate.
Decision Tree Classification: Randomized Data
Algorithm Partition(Data S)
begin
    if (most points in S belong to the same class) return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
end
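A runnable Python rendering of Partition, assuming numeric attributes and Gini-based split evaluation (the slides do not name the split criterion, so Gini here is an assumption, as are the helper names):

```python
def gini(records):
    """Gini impurity of a list of (attributes, class_label) records."""
    n = len(records)
    if n == 0:
        return 0.0
    counts = {}
    for _, c in records:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(records):
    """Evaluate candidate splits on every attribute; return the best
    (score, attr_index, threshold) by weighted Gini, or None."""
    best, n = None, len(records)
    for i in range(len(records[0][0])):
        for x, _ in records:
            t = x[i]
            left = [r for r in records if r[0][i] <= t]
            right = [r for r in records if r[0][i] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, i, t)
    return best

def partition(records):
    """Recursive Partition(S): stop when all points share a class
    or no split separates the data; otherwise recurse on S1 and S2."""
    labels = {c for _, c in records}
    if len(labels) <= 1:
        return {"class": labels.pop() if labels else None}
    split = best_split(records)
    if split is None:                      # no useful split: majority class
        return {"class": max(labels, key=lambda c: sum(1 for _, y in records if y == c))}
    _, i, t = split
    return {"attr": i, "threshold": t,
            "left":  partition([r for r in records if r[0][i] <= t]),
            "right": partition([r for r in records if r[0][i] > t])}
```

On randomized data, the split evaluation and the partitioning step would operate on reconstructed distributions rather than the raw records, as the next slide describes.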
Training Using Randomized Data
Two key operations need to be modified: determining the split point and partitioning the data. For reconstructing the original distribution, we can reconstruct using the whole data (Global) or separately for each class (ByClass), and reconstruct once at the root node or at every node (Local).
Reconstructing the Original Distribution
We consider three algorithms that differ in when and how distributions are reconstructed:
• Global: reconstruct the distribution for each attribute once at the beginning, using the complete perturbed training data; induce the decision tree using the reconstructed data.
• ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately for each class; induce the decision tree using the reconstructed data.
• Local: as in ByClass, split the training data by class and reconstruct distributions separately for each class; however, instead of reconstructing only once, reconstruct at each node of the tree.
Experimental Methodology • Compare accuracy against • Original: unperturbed data without randomization. • Randomized: perturbed data but without making any corrections for randomization. • Test data not randomized. • Training set of 100,000 records, split equally between the two classes.
Inter-Enterprise Data Mining
Problem: two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.
• Horizontally partitioned: records (users) split across companies. Example: a credit-card fraud detection model.
• Vertically partitioned: attributes split across companies. Example: associations across websites.
Conclusion
In this paper, we studied the technical feasibility of realizing privacy-preserving data mining. The basic premise was that the sensitive values in a user's record are perturbed using a randomizing function, so that they cannot be estimated with sufficient precision. Randomization can be done using Gaussian or Uniform perturbations. For the specific case of decision-tree classification, we found two effective algorithms, ByClass and Local. The algorithms rely on a Bayesian procedure for correcting perturbed distributions.