
Li Xiong CS573 Data Privacy and Security






Presentation Transcript


  1. Privacy Preserving Data Mining – Secure multiparty computation and random response techniques Li Xiong CS573 Data Privacy and Security

  2. Outline • Privacy preserving two-party decision tree mining using SMC protocols (Lindell & Pinkas ’00) • Primitive SMC protocols • Secure sum • Secure union (encryption based) • Secure max (probabilistic random response based) • Secure union (probabilistic and randomization based) • Secure data mining using sub protocols • Random response for privacy preserving data mining or data sanitization

  3. Random response protocols • Multi-round probabilistic protocols • Randomization probability associated with each round • Random response with randomization probability

  4. Max Protocol – multi-round random response • Multiple rounds • Randomization probability at round r: • Pr(r) = • Local algorithm at round r and node i: [Figure: node i holds value vi, receives gi-1(r) from its predecessor, and outputs gi(r)]
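The formula for Pr(r) and the local algorithm did not survive in this transcript. The sketch below is one plausible reading, not the slide's definitive content: it assumes Pr(r) = p0 * d^(r-1) (the parameters p0 and d appear on later slides), and it assumes that node i passes the incoming value through when it is already at least vi, otherwise returns a random value in [gi-1(r), vi) with probability Pr(r) and vi itself with probability 1 - Pr(r). The function names, the starting value 0, and the default parameters are all illustrative.

```python
import random

def randomization_probability(r, p0=1.0, d=0.5):
    """Assumed schedule: Pr(r) = p0 * d**(r-1), decaying with the round number."""
    return p0 * d ** (r - 1)

def local_step(g_prev, v_i, pr, rng):
    """Assumed local rule at node i: compute g_i(r) from g_{i-1}(r) and v_i."""
    if g_prev >= v_i:
        return g_prev  # own value cannot raise the maximum; pass through
    if rng.random() < pr:
        # Hide v_i: report a random value in [g_prev, v_i)
        return g_prev + rng.random() * (v_i - g_prev)
    return v_i  # reveal the true value

def max_protocol(values, rounds, p0=1.0, d=0.5, seed=0):
    """Pass a running value around the ring of nodes for the given number of rounds."""
    rng = random.Random(seed)
    g = 0.0  # assumption: values are non-negative and the ring starts at 0
    for r in range(1, rounds + 1):
        pr = randomization_probability(r, p0, d)
        for v in values:
            g = local_step(g, v, pr, rng)
    return g
```

Under these assumptions the running value never exceeds the true maximum, and because Pr(r) shrinks each round, the node holding the maximum eventually reports it, after which every node passes it through unchanged.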

  5. Max Protocol - Illustration [Figure: example run of the max protocol, beginning at Start, across nodes including D2, D3, and D4]

  6. Min/Max Protocol - Correctness • Precision bound: • Converges with r • Smaller p0 and d provide faster convergence

  7. Min/Max Protocol - Cost • Communication cost • single round: O(n) • Minimum # of rounds given precision guarantee (1-e):

  8. Min/Max Protocol - Security • Probability/confidence based metric: P(C|IR,R) • Different types of exposures based on claim • Data value: vi=a • Data ownership: Vi contains a • Change of beliefs • P(C|IR,R) – P(C|R) • P(C|IR,R) / P(C|R) • Relationship to privacy in anonymization • Change of beliefs: P(C|D*, BR) – P(C|BR) [Figure: scale marked 0, 0.5, and 1, with the labels Absolute Privacy and Provable Exposure]

  9. Min/Max Protocol – Security (Analysis) • Upper bound for average expected change of beliefs: $\max_r \frac{1}{2^{r-1}} (1 - p_0 d^{r-1})$ • Larger p0 and d provide better privacy
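As a quick numeric check of the bound, the sketch below reads it as the maximum over rounds r of (1/2^(r-1)) * (1 - p0 * d^(r-1)). That reading is an interpretation of the flattened expression on the slide, and the function name and the cutoff `max_round` are illustrative.

```python
def belief_change_bound(p0, d, max_round=50):
    """Evaluate max over r of (1 / 2**(r-1)) * (1 - p0 * d**(r-1))."""
    terms = [
        (1 / 2 ** (r - 1)) * (1 - p0 * d ** (r - 1))
        for r in range(1, max_round + 1)
    ]
    return max(terms)

# With d fixed at 0.5, raising p0 from 0.5 to 1.0 lowers the bound
# from 0.5 to 0.25, matching the claim that larger p0 gives better privacy.
low_p0 = belief_change_bound(0.5, 0.5)
high_p0 = belief_change_bound(1.0, 0.5)
```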

  10. Min/Max Protocol – Security (Experiments) • Loss of privacy decreases with increasing number of nodes • Probabilistic protocol achieves better privacy (close to 0) • When n is large, anonymous protocol is actually okay!

  11. Union • Commutative encryption based approach • Number of rounds: 2 rounds • Each round: encryption and decryption • Multi-round random-response approach?

  12. Vector • Each database has a boolean vector of the data items • Union vector is a logical OR of all vectors [Figure: nodes p1, p2, …, pc each hold a bit vector over items b1 … bL; OR-ing the vectors position by position gives the group vector VG] (Privacy Preserving Indexing of Documents on the Network, Bawa, 2003)

  13. Group Vector Protocol • Processing of VG’ at ps in round r: Pex = 1/2^r, Pin = 1 - Pex • for (i=1; i<=L; i++): if (Vs[i]=1 and VG’[i]=0) set VG’[i]=1 with prob. Pin; if (Vs[i]=0 and VG’[i]=1) set VG’[i]=0 with prob. Pex • r=1: Pex=1/2, Pin=1/2 • r=2: Pex=1/4, Pin=3/4 [Figure: the group vector VG’ is passed around nodes p1, p2, …, pc, each of which updates it using its own vector v1, v2, …, vc]
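The per-node update on this slide can be sketched as follows. The update rule and the schedule Pex = 1/2^r come from the slide; the function names, the all-zero starting vector, and the fixed node order are assumptions made for illustration.

```python
import random

def process_group_vector(vg, vs, r, rng):
    """Update the group vector vg at a node whose own bit vector is vs, in round r."""
    p_ex = 1.0 / 2 ** r   # probability of clearing a bit the node does not hold
    p_in = 1.0 - p_ex     # probability of setting a bit the node does hold
    out = list(vg)
    for i in range(len(out)):
        if vs[i] == 1 and out[i] == 0:
            if rng.random() < p_in:
                out[i] = 1
        elif vs[i] == 0 and out[i] == 1:
            if rng.random() < p_ex:
                out[i] = 0
    return out

def group_vector_protocol(vectors, rounds, seed=0):
    """Pass the group vector through every node once per round."""
    rng = random.Random(seed)
    vg = [0] * len(vectors[0])  # assumption: start from an all-zero vector
    for r in range(1, rounds + 1):
        for vs in vectors:
            vg = process_group_vector(vg, vs, r, rng)
    return vg
```

Early rounds flip bits in both directions, which obscures who contributed what. As r grows, Pex shrinks toward 0 and Pin toward 1, so the vector settles on the logical OR of all node vectors. With an all-zero start, a bit that no node holds is never set.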

  14. Random Shares based Secure Union • Phase 1: random item addition • Multiple rounds with permuted ring • Each node sends a random share of its item set and a random share of a random item set • Phase 2: random item removal • Each node subtracts its random item set

  15. Random Shares based Secure Union - Analysis • Item exposure attack • An adversary makes a claim C on a particular item a node i contributes to the final result (C: vi in xi) • Set exposure attack • An adversary makes a claim C on the whole set of items a node i contributes to the final union result X (C: xi = ai). • Change of beliefs (posterior probability and prior probability) • P(C|IR,X) - P(C|X) • P(C|IR,X)/P(C|X)

  16. Exposure Risk – Set Exposure • Disclosure decreases with increasing number of generated random items and increasing number of participating nodes • Set exposure risk is at or close to 0 for the probabilistic and crypto approaches

  17. Exposure Risk – Item Exposure • Item exposure risk decreases with increasing number of generated random items and participating nodes • Item exposure risk for probabilistic approach is quite high

  18. Cost Comparison • Commutative protocol and anonymous communication protocol efficient but sensitive to union size • Probabilistic protocol efficient but sensitive to domain size • Estimated runtime for the general circuit-based protocol implemented by FairplayMP framework is 15 days, 127 days and 1.4 years for the domain sizes tested

  19. Open issues • Tradeoff between accuracy, efficiency, and security • How to quantify security • How to design adjustable protocols • Can we generalize the random-response algorithms and randomization algorithms for operators based on their properties • Operators: sum, union, max, min … • Properties: commutative, associative, invertible, randomizable

  20. Data Mining on Horizontally Partitioned Data • Data mining tasks: Association Rule Mining, Decision Trees, EM Clustering, Naïve Bayes Classifier • Specific Secure Tools: Secure Sum, Secure Comparison, Secure Union, Secure Logarithm, Secure Poly. Evaluation

  21. Data Mining on Vertically Partitioned Data • Data mining tasks: Association Rule Mining, Decision Trees, K-means Clustering, Naïve Bayes Classifier, Outlier Detection • Specific Secure Tools: Secure Comparison, Secure Set Intersection, Secure Dot Product, Secure Logarithm, Secure Poly. Evaluation

  22. Summary of SMC Based PPDDM • Mainly used for distributed data mining. • Efficient/specific cryptographic solutions for many distributed data mining problems have been developed. • Random response or randomization based protocols offer a tradeoff between accuracy, efficiency, and security • Mainly semi-honest assumption (i.e., parties follow the protocols)

  23. Ongoing research • New models that can trade-off better between efficiency and security • Game theoretic / incentive issues in PPDM

  24. Outline • Privacy preserving two-party decision tree mining using SMC protocols (Lindell & Pinkas ’00) • Primitive SMC protocols • Secure sum • Secure union (encryption based) • Secure max (probabilistic random response based) • Secure union (probabilistic and randomization based) • Secure data mining using sub protocols • Random response for privacy preserving data mining or data collection

  25. Data Collection Model • Data cannot be shared directly because of privacy concerns

  26. Randomized Response • Question: Do you smoke? • The true answer is “Yes” • Biased coin: Head → report “Yes”; Tail → report “No”
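A minimal sketch of how the collector can still recover the aggregate from such answers, assuming the classic Warner setup: each respondent reports the truth with probability theta (the coin's bias toward Head) and the opposite otherwise, and theta is public. The names `randomize`, `estimate_proportion`, and `theta` are illustrative.

```python
import random

def randomize(truth, theta, rng):
    """Report the true answer with probability theta, the opposite otherwise."""
    return truth if rng.random() < theta else (not truth)

def estimate_proportion(responses, theta):
    """Estimate the true fraction of 'yes' answers from the randomized responses.

    If pi is the true fraction, the expected observed fraction is
    lam = theta * pi + (1 - theta) * (1 - pi), so
    pi = (lam - (1 - theta)) / (2 * theta - 1).  Requires theta != 0.5.
    """
    lam = sum(responses) / len(responses)
    return (lam - (1 - theta)) / (2 * theta - 1)

# Simulated survey: 30% of respondents truly answer "yes", coin bias 0.8.
rng = random.Random(42)
truths = [i < 15000 for i in range(50000)]
reports = [randomize(t, 0.8, rng) for t in truths]
estimate = estimate_proportion(reports, 0.8)
```

No single report reveals a respondent's true answer with certainty, yet the estimate approaches the true proportion as the number of respondents grows. A coin with theta = 0.5 gives perfect privacy but makes the estimator undefined, which is the privacy/utility tension the later slides formalize.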

  27. Randomized Response • Multiple attributes encoded in bits • Biased coin: Head → true answer E: 110; Tail → false answer !E: 001 (Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003)

  28. Generalization for Multi-Valued Categorical Data [Figure: the true value Si is reported as Si, Si+1, Si+2, or Si+3 with probabilities q1, q2, q3, q4; these probabilities form the matrix M]

  29. A Generalization • RR Matrices [Warner 65], [R.Agrawal 05], [S. Agrawal 05] • RR Matrix can be arbitrary • Can we find optimal RR matrices? (OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008)
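To make the role of an RR matrix concrete, here is a small sketch under an assumed convention: M[i][j] is the probability of reporting category i when the true category is j, so every column sums to 1. The reported distribution is then M times the true distribution, and the collector can estimate the true distribution by solving that linear system. The helper names and the example matrix are illustrative, and the solver is a plain Gauss-Jordan elimination rather than anything taken from the cited papers.

```python
def apply_rr(M, p):
    """Distribution of reported values: p_star[i] = sum_j M[i][j] * p[j]."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    A = [list(M[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("RR matrix is singular; distribution not recoverable")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [x / pv for x in A[col]]
        for k in range(n):
            if k != col:
                f = A[k][col]
                A[k] = [x - f * y for x, y in zip(A[k], A[col])]
    return [A[i][n] for i in range(n)]

# Hypothetical 3-category matrix: keep the true value with probability 0.7.
M = [[0.70, 0.15, 0.15],
     [0.15, 0.70, 0.15],
     [0.15, 0.15, 0.70]]
true_dist = [0.5, 0.3, 0.2]
reported = apply_rr(M, true_dist)
recovered = solve(M, reported)
```

A matrix close to the identity makes recovery accurate but reveals individual values. A matrix with nearly equal entries hides individuals but is close to singular, so the recovered distribution becomes unreliable.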

  30. What is an optimal matrix? • Which of the following is better?

  31. What is an optimal matrix? • Which of the following is better? • Privacy: M2 is better • Utility: M1 is better • So, what is an optimal matrix?

  32. Optimal RR Matrix • An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M’s (i.e., no other matrix dominates M). • Privacy Quantification • Utility Quantification • A number of privacy and utility metrics have been proposed. • Privacy: how accurately one can estimate individual info. • Utility: how accurately we can estimate aggregate info.
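The dominance test behind this definition can be sketched as below. It assumes each matrix has already been scored as a (privacy, utility) pair with higher meaning better, and it uses the standard Pareto notion (at least as good on both objectives, strictly better on one), which is slightly weaker than the "both better" wording on the slide. The scores and names are made up for illustration.

```python
def dominates(a, b):
    """True if score pair a Pareto-dominates b (higher is better on both axes)."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(scores):
    """Keep only the entries that no other entry dominates."""
    return {
        name: s
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    }

# Hypothetical (privacy, utility) scores for three RR matrices.
scores = {"M1": (0.2, 0.9), "M2": (0.8, 0.3), "M3": (0.1, 0.2)}
front = pareto_front(scores)
```

Here M1 and M2 are both kept, since each wins on one objective, while M3 is dropped because M1 is better on both.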

  33. Optimization Methods • Approach 1: Weighted sum: w1 Privacy + w2 Utility • Approach 2 • Fix Privacy, find M with the optimal Utility. • Fix Utility, find M with the optimal Privacy. • Challenge: Difficult to generate M with a fixed privacy or utility. • Proposed Approach: Multi-Objective Optimization

  34. Optimization algorithm • Evolutionary Multi-Objective Optimization (EMOO) • The algorithm • Start with a set of initial RR matrices • Repeat the following steps in each iteration • Mating: selecting two RR matrices in the pool • Crossover: exchanging several columns between the two RR matrices • Mutation: change some values in an RR matrix • Meet the privacy bound: filtering the resultant matrices • Evaluate the fitness value for the new RR matrices. Note: the fitness value is defined in terms of privacy and utility metrics
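The crossover and mutation steps can be sketched as follows, assuming RR matrices are stored column-stochastically (each column is the distribution of reported values for one true value), so swapping whole columns keeps both matrices valid. The perturbation size `scale` and the renormalization after mutation are illustrative choices, not details taken from the slides.

```python
import random

def column(M, j):
    """Return column j of matrix M as a list."""
    return [row[j] for row in M]

def crossover(A, B, cols):
    """Swap the listed columns between A and B, returning two new matrices."""
    A2 = [list(r) for r in A]
    B2 = [list(r) for r in B]
    for j in cols:
        for i in range(len(A)):
            A2[i][j], B2[i][j] = B[i][j], A[i][j]
    return A2, B2

def mutate(M, j, rng, scale=0.1):
    """Perturb column j, then renormalize so it is still a probability distribution."""
    M2 = [list(r) for r in M]
    col = [max(1e-9, M[i][j] + rng.uniform(-scale, scale)) for i in range(len(M))]
    total = sum(col)
    for i in range(len(M)):
        M2[i][j] = col[i] / total
    return M2
```

A full iteration would follow these operators with the privacy-bound filter and the fitness evaluation, both of which depend on the chosen privacy and utility metrics.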

  35. Illustration

  36. Output of Optimization • The optimal set is often plotted in the objective space as a Pareto front. [Figure: RR matrices M1 to M8 plotted in the objective space with Privacy and Utility axes, marked from Better to Worse]

  37. For First attribute of Adult data

  38. Summary • Privacy preserving data mining • Secure multi-party computation protocols • Random response techniques for computation and data collection • Knowledge sensitive data mining
