
Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification


Presentation Transcript


  1. Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification Wenliang (Kevin) Du, Zhouxuan Teng, and Zutao Zhu. Department of Electrical Engineering & Computer Science Syracuse University, Syracuse, New York.

  2. Introduction • Privacy-Preserving Data Publishing. • The impact of background knowledge: • How does it affect privacy? • How to measure its impact on privacy? • Integrate background knowledge in privacy quantification. • Privacy-MaxEnt: A systematic approach. • Based on well-established theories. • Evaluation.

  3. Privacy-Preserving Data Publishing • Data disguise methods • Randomization • Generalization (e.g., Mondrian) • Bucketization (e.g., Anatomy) • Our Privacy-MaxEnt method can be applied to both Generalization and Bucketization. • We use Bucketization in this presentation.

  4. Data Sets • Attribute categories: Identifier, Quasi-Identifier (QI), Sensitive Attribute (SA).

  5. Bucketized Data • The bucketized table publishes the Quasi-Identifier (QI) and Sensitive Attribute (SA) columns, linked only through a bucket ID. • P(Breast Cancer | {female, college}, bucket = 1) = 1/4 • P(Breast Cancer | {female, junior}, bucket = 2) = 1/3
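A hedged reading of the two probabilities above (the example table itself is not reproduced in this transcript): without background knowledge, an adversary who knows a person's QI value and bucket can only count sensitive values inside that bucket. Assuming bucket 1 holds four records, exactly one of which has Breast Cancer:

```latex
% Baseline inference from bucketized data.
% n(s, b): number of records in bucket b with sensitive value s; n(b): bucket size.
% Assumption: n(b=1) = 4 and one record in bucket 1 has Breast Cancer.
P(s \mid q, b) \;=\; \frac{n(s, b)}{n(b)}, \qquad
P(\text{Breast Cancer} \mid \{\text{female, college}\}, b = 1) \;=\; \frac{1}{4}.
```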

  6. Impact of Background Knowledge • Background knowledge: it is rare for a male to have breast cancer. • This analysis is hard for large data sets.

  7. Previous Studies • Martin et al., ICDE'07: the first formal study on background knowledge. • Chen, LeFevre, Ramakrishnan, VLDB'07: improves on the previous work. • Both deal with rule-based knowledge, i.e., deterministic knowledge. • Background knowledge can be much more complicated: uncertain knowledge.

  8. Complicated Background Knowledge • Rule-based knowledge: • P (s | q) = 1. • P (s | q) = 0. • Probability-Based Knowledge • P (s | q) = 0.2. • P (s | Alice) = 0.2. • Vague background knowledge • 0.3 ≤ P (s | q) ≤ 0.5. • Miscellaneous types • P (s | q1) + P (s | q2) = 0.7 • One of Alice and Bob has “Lung Cancer”.

  9. Challenges • How do we analyze privacy in a systematic way for large data sets and complicated background knowledge? • What do we want to compute? • P(S | Q), given the background knowledge and the published data set. • P(S | Q) is the primitive for most privacy metrics. • Directly computing P(S | Q) is hard.

  10. Our Approach • Consider P(S | Q) as a variable x (a vector). • Background knowledge → constraints on x. • Published data (public information) → constraints on x. • Solve for x, taking the most unbiased solution.

  11. Maximum Entropy Principle • “Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information.” — E. T. Jaynes, 1957.

  12. The MaxEnt Approach • Background knowledge → constraints on P(S | Q). • Published data (public information) → constraints on P(S | Q). • Maximum Entropy estimate → estimate of P(S | Q).

  13. Entropy • Because H(S | Q, B) = H(Q, S, B) − H(Q, B), the constraints should use P(Q, S, B) as their variables.
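The entropy formulas themselves did not survive the transcript; the definitions below are a standard reconstruction consistent with the slide's identity. Because the published buckets fully determine P(Q, B), the term H(Q, B) is a constant, so maximizing H(S | Q, B) amounts to maximizing H(Q, S, B) over the variables P(Q, S, B):

```latex
H(Q, S, B) = -\sum_{q, s, b} P(q, s, b) \log P(q, s, b), \qquad
H(S \mid Q, B) = H(Q, S, B) - H(Q, B).
```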

  14. Maximum Entropy Estimate • Let vector x = P(Q, S, B). • Find the value for x that maximizes its entropy H(Q, S, B), while satisfying • h1(x) = c1, …, hu(x) = cu : equality constraints • g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints • A special case of Non-Linear Programming.
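A minimal sketch of this non-linear program, not the paper's implementation: scipy's SLSQP solver stands in for the tools named later (LBFGS, TOMLAB, KNITRO), and the constraints below are toy stand-ins for the real equality and inequality constraints.

```python
# Sketch of the Maximum Entropy estimate as a non-linear program.
# Assumptions (not from the slides): SLSQP as the solver, and toy examples
# of the linear constraints h_i(x) = c_i and g_j(x) <= d_j.
import numpy as np
from scipy.optimize import minimize

n = 6  # number of (q, s, b) cells, i.e., the length of x = P(Q, S, B)

def neg_entropy(x):
    # Minimize -H(Q, S, B) = sum_i x_i log x_i  (0 log 0 treated as 0).
    x = np.clip(x, 1e-12, 1.0)
    return np.sum(x * np.log(x))

constraints = [
    # Probabilities sum to 1 (always present).
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},
    # Toy "published data" constraint: the first three cells (one bucket) carry mass 0.5.
    {"type": "eq", "fun": lambda x: np.sum(x[:3]) - 0.5},
    # Toy "background knowledge" inequality g(x) <= d, written as d - g(x) >= 0.
    {"type": "ineq", "fun": lambda x: 0.2 - x[0]},
]

x0 = np.full(n, 1.0 / n)          # uniform starting point
bounds = [(0.0, 1.0)] * n         # each P(q, s, b) lies in [0, 1]
res = minimize(neg_entropy, x0, bounds=bounds,
               constraints=constraints, method="SLSQP")
print(res.x)  # the most unbiased P(Q, S, B) consistent with the constraints
```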

  15. Constraints from Knowledge • Linear model: quite generic. • Conditional probability: P(S | Q) = P(Q, S) / P(Q). • Background knowledge has nothing to do with B: P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m). • Background knowledge → constraints on P(Q, S, B).
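For example (my reconstruction, not the slide's own equation), a piece of knowledge P(s | q) = c is made linear in the variables P(Q, S, B) by multiplying through by P(q) and summing out the bucket attribute B:

```latex
P(s \mid q) = c
\;\Longrightarrow\;
\sum_{b=1}^{m} P(q, s, b) \;=\; c \sum_{s'} \sum_{b=1}^{m} P(q, s', b),
```

which is a linear equality constraint over the vector x = P(Q, S, B).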

  16. Constraints from Published Data • The constraints state the truth and only the truth: they hold absolutely for the original data set, with no inference. • Published data set D′ → constraints on P(Q, S, B).

  17. Assignment and Constraints • An assignment is one possible way of linking the QI values to the SA values within a bucket. • Observation: the original data is one of the assignments. • Constraint: must hold for all possible assignments.

  18. QI Constraint • Constraint and example (equations were shown on the slide; a reconstruction follows below).
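A hedged reconstruction from the bucketization setup (notation mine, not copied from the slide): if the QI value q appears f(q, b) times in bucket b of the N published records, then summing the joint probabilities over all sensitive values must match that observed frequency:

```latex
\sum_{s} P(q, s, b) \;=\; \frac{f(q, b)}{N}.
```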

  19. SA Constraint • Constraint and example (equations were shown on the slide; a reconstruction follows below).
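Analogously (again a hedged reconstruction with my notation): if the sensitive value s appears f(s, b) times in bucket b, then summing over all QI values in that bucket must match its observed frequency:

```latex
\sum_{q} P(q, s, b) \;=\; \frac{f(s, b)}{N}.
```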

  20. Zero Constraint • P(q, s, b) = 0, if q or s does not appear in Bucket b. • We can reduce the number of variables.

  21. Theoretic Properties • Soundness: are the constraints correct? Easy to prove. • Completeness: have we missed any constraint? See our theorems and proofs. • Conciseness: are there redundant constraints? Only one redundant constraint in each bucket. • Consistency: is our approach consistent with the existing methods (i.e., when background knowledge is Ø)?

  22. Completeness w.r.t. Equations • Have we missed any equality constraint? Yes: if F1 = C1 and F2 = C2 are constraints, then F1 + F2 = C1 + C2 is also a constraint; however, it is redundant. • Completeness Theorem: with U denoting our constraint set, every linear constraint can be written as a linear combination of the constraints in U.

  23. Completeness w.r.t. Inequalities • Have we missed any inequality constraint? Yes: if F = C holds, then F ≤ C + 0.2 is also valid (but redundant). • Completeness Theorem: our constraint set is also complete in the inequality sense.

  24. Putting Them Together • Background knowledge → constraints on P(S | Q); published data (public information) → constraints on P(S | Q); Maximum Entropy estimate → estimate of P(S | Q). • Tools: LBFGS, TOMLAB, KNITRO, etc.

  25. Inevitable Questions • Where do we get background knowledge? Do we have to be extremely knowledgeable? • For P(s | q) types of knowledge, all useful knowledge is in the original data set. • Association rules: positive (Q → S) and negative (Q → ¬S, ¬Q → S, ¬Q → ¬S). • We bound the knowledge in our study to the Top-K strongest association rules (see the sketch after this slide).
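A hedged sketch of the Top-K idea. The paper's exact rule-strength measure is not shown on this slide; confidence P(s | q) is used here purely as an illustrative choice, and only positive rules are extracted.

```python
# Hedged sketch: pick the K strongest positive association rules Q -> S from
# the original table, ranked by confidence P(s | q).  Each selected rule
# (q, s, conf) would become a background-knowledge constraint P(s | q) = conf.
from collections import Counter

def top_k_rules(records, k=5):
    """records: list of (q, s) pairs taken from the original data."""
    q_count = Counter(q for q, _ in records)          # occurrences of each QI value
    qs_count = Counter(records)                        # occurrences of each (q, s) pair
    rules = [(q, s, qs_count[(q, s)] / q_count[q]) for (q, s) in qs_count]
    rules.sort(key=lambda r: r[2], reverse=True)       # strongest confidence first
    return rules[:k]

print(top_k_rules([("female,college", "BreastCancer"),
                   ("female,college", "Flu"),
                   ("male,junior", "Flu")], k=2))
```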

  26. Knowledge about Individuals • Alice: (i1, q1); Bob: (i4, q2); Charlie: (i9, q5). • Knowledge 1: Alice has either s1 or s4. Constraint: see the reconstruction below. • Knowledge 2: two people among Alice, Bob, and Charlie have s4. Constraint: see the reconstruction below.
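The two constraints were images on the slide. A hedged reading, written at the level of individuals (each person's known QI value is then used to map these onto the P(Q, S, B) variables); the expected-count interpretation of Knowledge 2 is my assumption:

```latex
% Knowledge 1: Alice has either s1 or s4.
P(s_1 \mid \text{Alice}) + P(s_4 \mid \text{Alice}) = 1.
% Knowledge 2: two people among Alice, Bob, and Charlie have s4
% (read here as an expected-count constraint).
P(s_4 \mid \text{Alice}) + P(s_4 \mid \text{Bob}) + P(s_4 \mid \text{Charlie}) = 2.
```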

  27. Evaluation • Implementation: Lagrange multipliers turn the constrained optimization into an unconstrained optimization. • LBFGS: solves the unconstrained optimization problem. • Hardware: Pentium 3 GHz CPU with 4 GB memory.
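For the equality-constrained case, the standard maximum-entropy machinery (generic theory, not specific to this paper) turns the constrained problem into an unconstrained one over the Lagrange multipliers, which an unconstrained solver such as LBFGS can handle:

```latex
L(x, \lambda) = H(x) - \sum_{i} \lambda_i \bigl(h_i(x) - c_i\bigr),
\qquad
x_\omega \;\propto\; \exp\Bigl(\sum_i \lambda_i f_i(\omega)\Bigr)
\;\text{ when each } h_i(x) = \sum_\omega f_i(\omega)\, x_\omega .
```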

  28. Privacy versus Knowledge • Estimation accuracy: KL distance between P_MaxEnt(S | Q) and P_Original(S | Q).
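The slide does not spell out the direction or weighting of the KL distance; one common form (an assumption, not the slide's formula) averages over the QI values:

```latex
D_{\mathrm{KL}}\bigl(P_{\text{Original}} \,\|\, P_{\text{MaxEnt}}\bigr)
= \sum_{q} P(q) \sum_{s} P_{\text{Original}}(s \mid q)\,
  \log \frac{P_{\text{Original}}(s \mid q)}{P_{\text{MaxEnt}}(s \mid q)}.
```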

  29. Privacy versus # of QI attributes

  30. Performance vs. Knowledge

  31. Running Time vs. Data Size

  32. Iterations vs. Data Size

  33. Conclusion • Privacy-MaxEnt is a systematic method • Model various types of knowledge • Model the information from the published data • Based on well-established theory. • Future work • Reducing the # of constraints • Vague background knowledge • Background knowledge about individuals
