Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification. Wenliang (Kevin) Du, Zhouxuan Teng, and Zutao Zhu. Department of Electrical Engineering & Computer Science, Syracuse University, Syracuse, New York.
Introduction • Privacy-Preserving Data Publishing. • The impact of background knowledge: • How does it affect privacy? • How can we measure its impact on privacy? • Integrating background knowledge into privacy quantification. • Privacy-MaxEnt: a systematic approach. • Based on well-established theories. • Evaluation.
Privacy-Preserving Data Publishing • Data disguise methods: • Randomization • Generalization (e.g., Mondrian) • Bucketization (e.g., Anatomy) • Our Privacy-MaxEnt method applies to both Generalization and Bucketization. • We use Bucketization in this presentation.
Data Sets • Attribute categories: • Identifier • Quasi-Identifier (QI) • Sensitive Attribute (SA)
Bucketized Data • The Quasi-Identifier (QI) and Sensitive Attribute (SA) columns are published in separate buckets. • P(Breast Cancer | {female, college}, bucket = 1) = 1/4 • P(Breast Cancer | {female, junior}, bucket = 2) = 1/3
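The bucket probabilities above can be computed directly from a bucketized table. Below is a minimal sketch; the records and disease values are hypothetical, chosen so that the two probabilities on this slide come out.

```python
# Hypothetical bucketized records: (QI tuple, SA value, bucket id).
records = [
    (("female", "college"), "Breast Cancer", 1),
    (("female", "college"), "Flu",           1),
    (("female", "college"), "Flu",           1),
    (("female", "college"), "Diabetes",      1),
    (("female", "junior"),  "Breast Cancer", 2),
    (("female", "junior"),  "Flu",           2),
    (("female", "junior"),  "Diabetes",      2),
]

def p_sa_given_qi(records, qi, sa, bucket):
    """P(sa | qi, bucket): among the records in the bucket that match the
    QI group, the fraction carrying the sensitive value."""
    group = [r for r in records if r[0] == qi and r[2] == bucket]
    return sum(1 for r in group if r[1] == sa) / len(group)

p1 = p_sa_given_qi(records, ("female", "college"), "Breast Cancer", 1)  # 1/4
p2 = p_sa_given_qi(records, ("female", "junior"), "Breast Cancer", 2)   # 1/3
```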
Impact of Background Knowledge • Background knowledge: it is rare for males to have breast cancer. • This analysis is hard for large data sets.
Previous Studies • Martin et al., ICDE’07: the first formal study on background knowledge. • Chen, LeFevre, and Ramakrishnan, VLDB’07: improves on the previous work. • Both deal with rule-based, deterministic knowledge. • Background knowledge can be much more complicated: it can be uncertain.
Complicated Background Knowledge • Rule-based knowledge: • P(s | q) = 1. • P(s | q) = 0. • Probability-based knowledge: • P(s | q) = 0.2. • P(s | Alice) = 0.2. • Vague background knowledge: • 0.3 ≤ P(s | q) ≤ 0.5. • Miscellaneous types: • P(s | q1) + P(s | q2) = 0.7. • One of Alice and Bob has “Lung Cancer”.
Challenges • How do we analyze privacy systematically for large data sets and complicated background knowledge? • What do we want to compute? P(S | Q), given the background knowledge and the published data set. • P(S | Q) is a primitive for most privacy metrics. • Directly computing P(S | Q) is hard.
Our Approach • Consider P(S | Q) as a variable x (a vector). • Background knowledge → constraints on x. • Published data → constraints on x. • Public information → constraints on x. • Solve for x: take the most unbiased solution.
Maximum Entropy Principle • “Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information.” — E. T. Jaynes, 1957.
The MaxEnt Approach • Background knowledge → constraints on P(S | Q). • Published data → constraints on P(S | Q). • Public information → constraints on P(S | Q). • Maximum entropy estimate → estimate of P(S | Q).
Entropy • Because H(S | Q, B) = H(Q, S, B) − H(Q, B), the constraints should use P(Q, S, B) as variables.
Maximum Entropy Estimate • Let vector x = P(Q, S, B). • Find the value of x that maximizes the entropy H(Q, S, B), while satisfying • h1(x) = c1, …, hu(x) = cu : equality constraints • g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints • A special case of nonlinear programming.
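To make the estimate concrete, here is a minimal sketch of entropy maximization under one linear constraint, on a made-up four-point support. It is not the paper's solver: it exploits the standard fact that the maximizer has exponential form p_i ∝ exp(λ·v_i), and bisects on the Lagrange multiplier λ.

```python
import math

def maxent_with_mean(values, target_mean, tol=1e-12):
    """Maximize H(p) over distributions on `values` subject to the linear
    constraint sum_i p_i * values[i] = target_mean.  The maximizer has the
    form p_i ∝ exp(lam * values[i]); bisect on lam, since the resulting
    mean is monotone increasing in lam."""
    def mean_for(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    w = [math.exp(lo * v) for v in values]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_with_mean([0, 1, 2, 3], 1.2)
```

When the constraint is symmetric (target mean 1.5 on this support), λ = 0 and the estimate is exactly uniform, which is the "most unbiased" behavior the principle promises.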
Constraints from Background Knowledge • Goal: turn background knowledge into constraints on P(Q, S, B). • Linear model: quite generic. • Conditional probability: P(S | Q) = P(Q, S) / P(Q). • Background knowledge has nothing to do with the bucket B: P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m).
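A rule such as P(s | q) = c looks nonlinear, but dividing through turns it into a linear equation over the variables x = P(Q, S, B). A sketch with hypothetical QI/SA/bucket values, building the coefficient row of that equation:

```python
# Hypothetical variable layout: x is indexed by (q, s, b) triples.
qs = ["q1", "q2"]
sas = ["s1", "s2"]
buckets = [1, 2]
index = {(q, s, b): i for i, (q, s, b) in
         enumerate((q, s, b) for q in qs for s in sas for b in buckets)}

def rule_constraint(q, s, c):
    """Linearize P(s | q) = c.  Since P(s | q) = P(q, s) / P(q) and
    P(q, s) = sum_b P(q, s, b), the rule becomes the linear equation
    sum_b x[q,s,b] - c * sum_{s', b} x[q,s',b] = 0."""
    row = [0.0] * len(index)
    for b in buckets:
        row[index[(q, s, b)]] += 1.0          # sum_b P(q, s, b)
    for s2 in sas:
        for b in buckets:
            row[index[(q, s2, b)]] -= c       # -c * sum_{s', b} P(q, s', b)
    return row  # the constraint is dot(row, x) == 0

row = rule_constraint("q1", "s1", 0.2)
```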
Constraints from Published Data • Goal: turn the published data set D′ into constraints on P(Q, S, B). • Truth and only the truth: each constraint must be absolutely correct for the original data set, with no inference.
Assignment and Constraints • Observation: the original data set is one of the possible assignments of sensitive values to QI groups within each bucket. • Constraint: a valid constraint must hold for all possible assignments.
QI Constraint • For each QI value q and bucket b: Σs P(q, s, b) = (number of records in bucket b with QI value q) / N. • Example: if q appears in 4 of the N records, all in bucket 1, then Σs P(q, s, 1) = 4/N.
SA Constraint • For each sensitive value s and bucket b: Σq P(q, s, b) = (number of records in bucket b with sensitive value s) / N. • Example: if s appears 3 times in bucket 2 of an N-record table, then Σq P(q, s, 2) = 3/N.
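Both marginal constraints can be checked against the empirical joint of the original table, since the original table is one of the possible assignments. A sketch on a hypothetical six-record table:

```python
from collections import Counter

# Hypothetical bucketized table: (QI value, SA value, bucket).
records = [("q1", "s1", 1), ("q1", "s2", 1), ("q2", "s1", 1),
           ("q3", "s2", 2), ("q3", "s3", 2), ("q4", "s3", 2)]
N = len(records)

# Empirical joint P(q, s, b) of the original table.
p = {k: v / N for k, v in Counter(records).items()}

def qi_marginal(q, b):
    """QI constraint value: sum_s P(q, s, b), which must equal the fraction
    of records in bucket b with QI value q -- visible in the published data."""
    return sum(v for (q2, s, b2), v in p.items() if q2 == q and b2 == b)

def sa_marginal(s, b):
    """SA constraint value: sum_q P(q, s, b), which must equal the fraction
    of records in bucket b with sensitive value s."""
    return sum(v for (q, s2, b2), v in p.items() if s2 == s and b2 == b)
```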
Zero Constraint • P(q, s, b) = 0 if q or s does not appear in bucket b. • This lets us reduce the number of variables.
Theoretic Properties • Soundness: are the constraints correct? Easy to prove. • Completeness: have we missed any constraint? See our theorems and proofs. • Conciseness: are there redundant constraints? Only one redundant constraint in each bucket. • Consistency: is our approach consistent with existing methods (i.e., when background knowledge is Ø)?
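The consistency property can be illustrated numerically. With no background knowledge, the only constraints within a bucket are the QI row sums and SA column sums, and the maximum-entropy table with given row/column sums is the independent coupling (row × column / total). The sketch below checks this with iterative proportional fitting on made-up marginals; it is an illustration, not the paper's proof.

```python
def ipf(rows, cols, iters=50):
    """Max-entropy table with given row and column sums (totals must agree).
    Starting from a uniform table, alternately rescale rows and columns."""
    m, n = len(rows), len(cols)
    x = [[1.0 / (m * n)] * n for _ in range(m)]
    for _ in range(iters):
        for i in range(m):                       # match row sums
            s = sum(x[i])
            x[i] = [v * rows[i] / s for v in x[i]]
        for j in range(n):                       # match column sums
            s = sum(x[i][j] for i in range(m))
            for i in range(m):
                x[i][j] *= cols[j] / s
    return x

# One bucket with QI marginals [2/3, 1/3] and SA marginals [1/3, 2/3]:
t = ipf([2/3, 1/3], [1/3, 2/3])
```

The result equals rows[i] × cols[j], i.e., P(s | q, b) = count(s, b)/n_b, which is exactly what Anatomy-style bucketization assumes when background knowledge is empty.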
Completeness w.r.t. Equations • Have we missed any equality constraint? Yes, but only redundant ones: if F1 = C1 and F2 = C2 are constraints, then F1 + F2 = C1 + C2 is too; however, it is redundant. • Completeness Theorem: with U denoting our constraint set, all valid linear constraints can be written as linear combinations of the constraints in U.
Completeness w.r.t. Inequalities • Have we missed any inequality constraint? Yes, but only redundant ones: if F = C, then F ≤ C + 0.2 is also valid (but redundant). • Completeness Theorem: our constraint set is also complete in the inequality sense.
Putting Them Together • Background knowledge, published data, and public information → constraints on P(S | Q). • Maximum entropy estimate → estimate of P(S | Q). • Tools: LBFGS, TOMLAB, KNITRO, etc.
Inevitable Questions • Where do we get background knowledge? Do we have to be very knowledgeable? • For P(s | q)-type knowledge, all useful knowledge is in the original data set. • Association rules: • Positive: Q → S • Negative: Q → ¬S, ¬Q → S, ¬Q → ¬S • We bound the knowledge in our study using the top-K strongest association rules.
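The top-K idea can be sketched as ranking positive rules Q → S by their confidence P(s | q) in the original table. The records below are hypothetical:

```python
from collections import Counter

# Hypothetical (QI value, SA value) pairs from the original table.
records = [("q1", "s1"), ("q1", "s1"), ("q1", "s2"),
           ("q2", "s2"), ("q2", "s2"), ("q2", "s2"), ("q3", "s1")]

def top_k_rules(records, k):
    """Rank positive rules Q -> S by confidence P(s | q) and keep the top K."""
    pair = Counter(records)                    # count(q, s)
    qi = Counter(q for q, _ in records)        # count(q)
    rules = [((q, s), c / qi[q]) for (q, s), c in pair.items()]
    rules.sort(key=lambda r: -r[1])
    return rules[:k]

rules = top_k_rules(records, 2)
```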
Knowledge about Individuals • Alice: (i1, q1), Bob: (i4, q2), Charlie: (i9, q5). • Knowledge 1: Alice has either s1 or s4. Constraint: P(s1 | Alice) + P(s4 | Alice) = 1. • Knowledge 2: two people among Alice, Bob, and Charlie have s4. Constraint: P(s4 | Alice) + P(s4 | Bob) + P(s4 | Charlie) = 2.
Evaluation • Implementation: • Lagrange multipliers: constrained optimization → unconstrained optimization. • LBFGS: solves the unconstrained optimization problem. • Hardware: 3 GHz Pentium CPU with 4 GB of memory.
Privacy versus Knowledge • Estimation accuracy: KL distance between P_MaxEnt(S | Q) and P_Original(S | Q).
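One way to implement that accuracy score is an average, per-QI KL distance between the two conditionals, weighted by P(q). This is a sketch of one plausible formulation, not necessarily the exact metric used in the evaluation:

```python
import math

def kl_conditional(p_est, p_orig, q_dist):
    """Average KL distance sum_q P(q) * KL(P_orig(S|q) || P_est(S|q)) in bits.
    `p_est` and `p_orig` map q -> {s: probability}; `q_dist` maps q -> P(q)."""
    total = 0.0
    for q, pq in q_dist.items():
        for s, po in p_orig[q].items():
            if po > 0:
                total += pq * po * math.log(po / p_est[q][s], 2)
    return total
```

When the MaxEnt estimate matches the original conditionals exactly, the distance is 0; any mismatch makes it strictly positive.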
Conclusion • Privacy-MaxEnt is a systematic method: • It models various types of knowledge. • It models the information from the published data. • It is based on well-established theory. • Future work: • Reducing the number of constraints. • Vague background knowledge. • Background knowledge about individuals.