Other Perturbation Techniques
Outline
• Randomized Responses
• Sketch
• Project ideas
Randomized Responses
• Problem description
  • A provides the answer to B’s question
  • A wants to preserve his/her privacy
  • The question/answer can be sensitive
• The method
  • Assume the answer can be “yes” or “no”
  • A answers honestly with probability θ and gives the opposite (randomized) answer with probability 1-θ
  • We can estimate the real probabilities of “yes” and “no” from the randomized responses
Notations:
• θ: probability that a response is honest
• O(yes): observed probability of yes from the randomized responses = (# of yes) / (total # of responses)
• P(yes): real probability of yes
Inference
• O(yes) = P(yes)·θ + P(no)·(1-θ) = P(yes)·θ + (1-P(yes))·(1-θ)
• P(yes) = (O(yes) + θ - 1) / (2θ - 1), which requires θ ≠ 1/2
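
The inference step can be checked with a small simulation. This is a minimal sketch, assuming the honest-answer probability is called theta (the function names and the simulated population are illustrative, not from the slides): each true answer is kept with probability theta and flipped otherwise, and P(yes) is then recovered from the observed fraction of "yes" responses.

```python
import random

def randomize(answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, otherwise the opposite."""
    return answer if rng.random() < theta else not answer

def estimate_p_yes(responses, theta: float) -> float:
    """Invert O(yes) = P(yes)*theta + (1 - P(yes))*(1 - theta)."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the responses uninformative")
    o_yes = sum(responses) / len(responses)  # observed fraction of yes
    return (o_yes + theta - 1) / (2 * theta - 1)

# Simulated population (assumed for illustration): about 30% true "yes".
rng = random.Random(0)
theta = 0.8
true_answers = [rng.random() < 0.3 for _ in range(100_000)]
responses = [randomize(a, theta, rng) for a in true_answers]
p_hat = estimate_p_yes(responses, theta)
```

The estimate is only as good as the sample: the closer theta is to 0.5, the more the division by (2·theta - 1) amplifies sampling noise.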
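Extend to multiple categories
• An answer ci is changed to cj with probability θij
• O = (O(c1), O(c2), …, O(cn)): O(ci) is the observed prob of ci
• P = (P(c1), P(c2), …, P(cn)): P(ci) is the real prob of ci
• The relationship between O and P: O(cj) = Σi θij·P(ci), i.e. O = M·P, where M is the n×n perturbation matrix with θij in row j, column i
Note: When M is invertible, use matrix inversion to solve for P. Otherwise, use iterative methods similar to those in Rakesh’s paper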
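
For the invertible case, the recovery of P from O can be sketched as follows. The example assumes the perturbation matrix is stored as M[j][i] = probability that a true ci is reported as cj (so each column sums to 1); the matrix values and the helper names are made up for illustration, and a plain Gaussian elimination stands in for whatever solver one would use in practice.

```python
def observed_from_real(m, p):
    """O(c_j) = sum_i M[j][i] * P(c_i)."""
    n = len(p)
    return [sum(m[j][i] * p[i] for i in range(n)) for j in range(n)]

def solve_linear(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(m)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("perturbation matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= f * a[col][c]
    return [a[i][n] / a[i][i] for i in range(n)]

# Illustrative 3-category matrix: keep the answer with prob 0.8,
# otherwise move to one of the other two categories with prob 0.1 each.
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
P = [0.5, 0.3, 0.2]
O = observed_from_real(M, P)   # what B would observe in expectation
P_hat = solve_linear(M, O)     # recovered real distribution
```

With finite samples O is only an estimate, so the recovered values can fall slightly outside [0, 1]; that is one reason iterative methods are sometimes preferred.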
Different perturbation matrices can be used. Which one is the best?
• Balance between privacy and utility?
• No perturbation (the perturbation matrix is the identity): zero privacy is preserved, while full data utility is preserved
• Uniform randomization (every answer is equally likely to be reported): privacy is fully preserved, while no data utility is left
Optimizing both privacy & utility
• Read paper 33
• Privacy: similar to previous discussion
  • Based on accuracy of estimation
• A Bayes method:
  • C = {c1, c2, …, cn}
  • Y is the perturbed value, X is the original value, and X^ is the estimated value
• Accuracy of estimation: it can be calculated by checking the original data, the perturbed data, and the estimated data
Privacy
• Average: 1 - (accuracy of estimation)
• Worst case:
Utility
• P(ci) is the original prob, O(ci) is the prob on the perturbed data, and P^(ci) is the estimated prob
• Utility depends on the difference between the original prob and the estimated prob
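
The average-privacy metric can be illustrated with a short sketch. It assumes that the Bayes estimate X^ is the maximum-a-posteriori guess (the category ci maximizing P(X = ci | Y = y)); that choice, the function names, and the example distribution are assumptions for illustration. The matrix is stored as m[y][i] = P(Y = cy | X = ci).

```python
def map_accuracy(m, p):
    """Probability that the MAP guess equals the original value.

    For each perturbed value y the adversary picks the c_i with the largest
    joint probability m[y][i] * p[i]; summing those maxima over y gives the
    overall probability of a correct guess.
    """
    n = len(p)
    return sum(max(m[y][i] * p[i] for i in range(n)) for y in range(n))

def average_privacy(m, p):
    """Average privacy = 1 - (accuracy of estimation)."""
    return 1.0 - map_accuracy(m, p)

P = [0.5, 0.3, 0.2]                                    # illustrative prior
identity = [[1.0 if i == j else 0.0 for i in range(3)] for j in range(3)]
uniform = [[1.0 / 3] * 3 for _ in range(3)]

privacy_identity = average_privacy(identity, P)  # no perturbation
privacy_uniform = average_privacy(uniform, P)    # uniform randomization
```

Under this assumed definition the identity matrix gives zero privacy, while uniform randomization leaves the adversary no better off than guessing the most common category.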
Optimization algorithm
• Find the perturbation that balances the two metrics
• The evolutionary algorithm
  • Start with a set of initial RR matrices
  • Repeat the following steps in each iteration
    • Mating: select two RR matrices from the pool
    • Crossover: exchange several columns between the two RR matrices
    • Mutation: change some values in an RR matrix
    • Meet the privacy bound: filter the resulting matrices
    • Evaluate the fitness value for the new RR matrices
Note: the fitness value is defined in terms of the privacy and utility metrics
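
The loop above can be sketched in a few dozen lines. This is only a skeleton under stated assumptions, not the algorithm from paper 33: the privacy metric is the MAP-based average privacy, the fitness is a hypothetical stand-in (average probability of keeping the true answer), and all names and parameters are illustrative. A matrix is stored as a list of columns, cols[i][j] = P(report cj | true ci).

```python
import random

def random_rr_matrix(n, rng):
    """Random column-stochastic RR matrix, stored as a list of columns."""
    cols = []
    for _ in range(n):
        w = [rng.random() for _ in range(n)]
        s = sum(w)
        cols.append([v / s for v in w])
    return cols

def privacy(cols, p):
    """Assumed metric: 1 - accuracy of a MAP guess of the original value."""
    n = len(p)
    return 1.0 - sum(max(cols[i][y] * p[i] for i in range(n)) for y in range(n))

def fitness(cols):
    """Hypothetical utility proxy: average probability of an honest report."""
    return sum(cols[i][i] for i in range(len(cols))) / len(cols)

def crossover(a, b, rng):
    """Exchange a random subset of columns between two RR matrices."""
    swap = [rng.random() < 0.5 for _ in a]
    c1 = [b[i] if s else a[i] for i, s in enumerate(swap)]
    c2 = [a[i] if s else b[i] for i, s in enumerate(swap)]
    return c1, c2

def mutate(cols, rng, scale=0.1):
    """Perturb one column and renormalize it so it still sums to 1."""
    out = [c[:] for c in cols]
    i = rng.randrange(len(out))
    col = [max(1e-9, v + rng.uniform(-scale, scale)) for v in out[i]]
    s = sum(col)
    out[i] = [v / s for v in col]
    return out

def evolve(p, privacy_bound, generations=200, pool_size=20, seed=0):
    rng = random.Random(seed)
    pool, attempts = [], 0
    while len(pool) < pool_size:            # initial RR matrices
        attempts += 1
        if attempts > 100_000:
            raise ValueError("privacy bound too strict for random start")
        m = random_rr_matrix(len(p), rng)
        if privacy(m, p) >= privacy_bound:
            pool.append(m)
    for _ in range(generations):
        a, b = rng.sample(pool, 2)          # mating
        children = [mutate(c, rng) for c in crossover(a, b, rng)]
        children = [c for c in children if privacy(c, p) >= privacy_bound]
        # Keep the fittest matrices for the next iteration.
        pool = sorted(pool + children, key=fitness, reverse=True)[:pool_size]
    return pool[0]

prior = [0.5, 0.3, 0.2]
best = evolve(prior, privacy_bound=0.3)
```

Filtering before selection means every matrix in the pool always meets the privacy bound, so the search only trades utility among admissible matrices.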
Summary
• Randomized response is the basic technique for perturbing categorical data
  • Boolean
  • Multi-category
Sketch
• Addresses the problem of high-dimensional sparse data
• Multiplicative perturbation
• Randomized responses
• Market basket data
• Bag of words
Definition of sketch
• Similar to projection perturbation
  • Map d-dimensional data to r-dimensional data, r << d
  • Difference: for each record the mapping matrix is different
• Definition
  • X = (x1, …, xd); its sketch is S = (s1, …, sr), where sj = Σi xi·rij and each rij is randomly drawn from {-1, +1}
Property
• The dot product of two original records X and Y can be approximated from their sketches
• The dot product is important in calculating Euclidean distances!
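
A minimal sketch of this property, under two assumptions that are not spelled out above: each sketch component is a ±1-weighted sum of the record's values, and both records are sketched with the same random signs (with independent signs the products would average to zero). The estimator is the mean of the component-wise products; names and data are illustrative.

```python
import random

def sign_matrix(d, r, seed=0):
    """r rows of d random signs drawn from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * sign[j][i], one component per row of signs."""
    return [sum(s * v for s, v in zip(row, x)) for row in signs]

def estimate_dot(sx, sy):
    """Average of component-wise products; unbiased for X . Y."""
    return sum(a * b for a, b in zip(sx, sy)) / len(sx)

d, r = 200, 2000
X = [1 if i < 10 else 0 for i in range(d)]        # items 0..9
Y = [1 if 5 <= i < 15 else 0 for i in range(d)]   # items 5..14, overlap of 5
signs = sign_matrix(d, r)
est = estimate_dot(sketch(X, signs), sketch(Y, signs))
true_dot = sum(a * b for a, b in zip(X, Y))
```

Here r is deliberately large so the estimate is tight; in the intended setting r << d, and the estimate is correspondingly noisier.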
Accuracy of the dot product estimation
• Larger r → smaller variance → better quality; however, lower privacy
Privacy
• The original data value can be estimated
• Sparse data
  • Most are canceled in sketch
• Estimate of xk :
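
To see why a sketch leaks information about individual values, one can try to estimate a single component back from it. The estimator below (average of sj times the sign used for position k) is an assumption: it is the natural unbiased estimator when the random signs are known, and may differ in form from the one in the referenced paper. All names and data are illustrative.

```python
import random

def sign_matrix(d, r, seed=0):
    """r rows of d random signs drawn from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * sign[j][i]."""
    return [sum(s * v for s, v in zip(row, x)) for row in signs]

def estimate_component(s, signs, k):
    """Assumed estimator: mean of s_j * sign[j][k].

    The x_k term survives because sign * sign = 1; the other terms carry
    random signs and cancel on average.
    """
    return sum(s_j * row[k] for s_j, row in zip(s, signs)) / len(s)

d, r = 200, 2000
X = [1 if i < 10 else 0 for i in range(d)]   # sparse record, 10 non-zeros
signs = sign_matrix(d, r)
S = sketch(X, signs)
est_present = estimate_component(S, signs, 3)    # true value 1
est_absent = estimate_component(S, signs, 150)   # true value 0
```

The variance of this estimate is the sum of the squares of the other components divided by r, so very sparse records and large r both make the original values easier to recover.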
Privacy
• δ-anonymity
  • Suppress the record if this condition is not satisfied…
• Another concept: k-variance
• See paper 29 for more details.
Applications:
• Dot product estimation
• Determine the length of a sparse transaction (# of non-zero items in a Boolean vector)
• Determine Euclidean distance
• Average of a set of records (centroid of a cluster)
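
These applications all follow from the dot-product property and the linearity of the sketch. The sketch below assumes that all records share the same random signs and uses illustrative names and data: the transaction length is estimated as the sketch's mean squared component, the squared Euclidean distance from the difference of two sketches, and the centroid's sketch as the component-wise mean of the sketches.

```python
import random

def sign_matrix(d, r, seed=0):
    """r rows of d random signs drawn from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(x, signs):
    """s_j = sum_i x_i * sign[j][i]."""
    return [sum(s * v for s, v in zip(row, x)) for row in signs]

def estimate_sq_norm(s):
    """Estimates X . X; for a Boolean vector this is its # of non-zero items."""
    return sum(v * v for v in s) / len(s)

def estimate_sq_distance(sx, sy):
    """sx - sy is the sketch of X - Y, so its squared norm estimates ||X - Y||^2."""
    return sum((a - b) ** 2 for a, b in zip(sx, sy)) / len(sx)

def mean_sketch(sketches):
    """Sketch of the centroid: component-wise mean of the sketches."""
    n = len(sketches)
    return [sum(col) / n for col in zip(*sketches)]

d, r = 200, 2000
X = [1 if i < 10 else 0 for i in range(d)]        # 10 items
Y = [1 if 5 <= i < 15 else 0 for i in range(d)]   # 10 items, 5 shared with X
signs = sign_matrix(d, r)
sx, sy = sketch(X, signs), sketch(Y, signs)

length_est = estimate_sq_norm(sx)            # true length: 10
dist_sq_est = estimate_sq_distance(sx, sy)   # true squared distance: 10
centroid = [(a + b) / 2 for a, b in zip(X, Y)]
centroid_sketch = mean_sketch([sx, sy])      # equals sketch(centroid, signs)
```

The centroid identity is exact rather than approximate, which is why cluster averages can be maintained directly on the sketches.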