Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification

Wenliang (Kevin) Du, Zhouxuan Teng, and Zutao Zhu. Department of Electrical Engineering & Computer Science, Syracuse University, Syracuse, New York.

Introduction • Privacy-Preserving Data Publishing. • The impact of background knowledge: • How does it affect privacy? • How to measure its impact on privacy? • Integrate background knowledge in privacy quantification. • Privacy-MaxEnt: A systematic approach. • Based on well-established theories. • Evaluation.

Privacy-Preserving Data Publishing • Data disguise methods • Randomization • Generalization (e.g. Mondrian) • Bucketization (e.g. Anatomy) • Our Privacy-MaxEnt method can be applied to Generalization and Bucketization. • We pick Bucketization in our presentation.

Data Sets • Identifier • Quasi-Identifier (QI) • Sensitive Attribute (SA)

Bucketized Data • Quasi-Identifier (QI) • Sensitive Attribute (SA) • P( Breast cancer | {female, college}, bucket=1 ) = 1/4 • P( Breast cancer | {female, junior}, bucket=2 ) = 1/3
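Without background knowledge, the estimate inside a bucket is just the share of the bucket's sensitive values equal to s. A minimal sketch, assuming a hypothetical four-record bucket (the slide's table is not reproduced here, so the records are made up); `bucket_posterior` is an illustrative name:

```python
from collections import Counter
from fractions import Fraction

def bucket_posterior(sensitive_values):
    """P(s | q, bucket) for any q in the bucket, with no background knowledge."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    return {s: Fraction(c, n) for s, c in counts.items()}

# Hypothetical bucket 1: four records, exactly one with breast cancer.
bucket1 = ["breast cancer", "flu", "pneumonia", "flu"]
posterior = bucket_posterior(bucket1)
```

With these made-up records, `posterior["breast cancer"]` is 1/4 for every QI value in the bucket, which is the kind of figure the slide reports.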

Impact of Background Knowledge • Background knowledge: it is rare for males to have breast cancer. • This analysis is hard for large data sets.

Previous Studies • Martin, et al. ICDE’07. • First formal study on background knowledge • Chen, LeFevre, Ramakrishnan. VLDB’07. • Improves the previous work. • They deal with rule-based knowledge. • Deterministic knowledge. • Background knowledge can be much more complicated. • Uncertain knowledge

Challenges • How to analyze privacy in a systematic way for large data sets and complicated background knowledge? • What do we want to compute? • P( S | Q ), given the background knowledge and the published data set. • P( S | Q ) is a primitive for most privacy metrics. • Directly computing P( S | Q ) is hard.

Our Approach • Consider P( S | Q ) as a variable x (a vector). • Background Knowledge → Constraints on x • Published Data → Constraints on x • Public Information • Solve for x → Most unbiased solution

Maximum Entropy Principle • “Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information.” (E. T. Jaynes, 1957)

The MaxEnt Approach • Background Knowledge → Constraints on P( S | Q ) • Published Data → Constraints on P( S | Q ) • Public Information • Maximum Entropy Estimate → Estimate P( S | Q )

Entropy • Because H(S | Q, B) = H(Q, S, B) − H(Q, B), the constraints should use P(Q, S, B) as variables.
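The identity is the chain rule for entropy. A short derivation, using only the factorization P(q, s, b) = P(q, b) P(s | q, b):

```latex
\begin{align*}
H(Q,S,B) &= -\sum_{q,s,b} P(q,s,b)\,\log P(q,s,b) \\
         &= -\sum_{q,s,b} P(q,s,b)\,\bigl[\log P(q,b) + \log P(s \mid q,b)\bigr] \\
         &= -\sum_{q,b} P(q,b)\,\log P(q,b) \;-\; \sum_{q,s,b} P(q,s,b)\,\log P(s \mid q,b) \\
         &= H(Q,B) + H(S \mid Q,B).
\end{align*}
```

Assuming P(Q, B) can be read directly from the published data (QI values and bucket membership are both published under bucketization), H(Q, B) is a constant, so maximizing H(Q, S, B) over the joint variables also maximizes H(S | Q, B).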

Maximum Entropy Estimate • Let vector x = P(Q, S, B). • Find the value for x that maximizes its entropy H(Q, S, B), while satisfying • h1(x) = c1, …, hu(x) = cu : equality constraints • g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints • A special case of Non-Linear Programming.

Constraints from Knowledge • Linear model: quite generic. • Conditional probability: • P(S | Q) = P(Q, S) / P(Q). • Background knowledge has nothing to do with B: • P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m). • Background Knowledge → Constraints on P(Q, S, B)
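Combining the two identities above, a piece of knowledge P(s | q) = p becomes the linear equation P(q, s, B=1) + … + P(q, s, B=m) = p · P(q). A minimal sketch, assuming P(q) is known from the published data and representing a constraint as a hypothetical (coefficients, right-hand side) pair; the numbers are illustrative only:

```python
from fractions import Fraction

def knowledge_constraint(q, s, p_s_given_q, p_q, num_buckets):
    """Return (coefficients, rhs) for: sum_b P(q, s, b) = P(s | q) * P(q)."""
    coeffs = {(q, s, b): 1 for b in range(1, num_buckets + 1)}
    rhs = p_s_given_q * p_q
    return coeffs, rhs

# Hypothetical numbers: P(breast cancer | male) = 1/100, P(male) = 1/2, 2 buckets.
coeffs, rhs = knowledge_constraint(
    "male", "breast cancer", Fraction(1, 100), Fraction(1, 2), num_buckets=2)
```

The equation stays linear in the P(Q, S, B) variables because P(q) is treated as a known constant rather than as another unknown.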

Constraints from Published Data • Constraints • Truth and only the truth. • Absolutely correct for the original data set. • No inference. • Published Data Set D' → Constraints on P(Q, S, B)

Assignment and Constraints • Observation: the original data is one of the assignments. • Constraint: true for all possible assignments.

QI Constraint Constraint: Example:

SA Constraint Constraint: Example:

Zero Constraint • P(q, s, b) = 0, if q or s does not appear in Bucket b. • We can reduce the number of variables.
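A minimal sketch of how the published-data constraints for one bucket could be generated. It assumes the QI and SA constraints are the per-bucket marginal conditions Σ_s P(q, s, b) = P(q, b) and Σ_q P(q, s, b) = P(s, b); this is an assumed reading, since the constraint formulas are not spelled out in the text. The function name and the toy bucket are hypothetical:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def bucket_constraints(b, qi_values, sa_values, total_records):
    """Build equality constraints for bucket b as (coefficients, rhs) pairs.

    Variables are keyed by (q, s, b). Any (q, s, b) not listed in `variables`
    is fixed to 0 by the zero constraint, so it is never created.
    """
    q_counts, s_counts = Counter(qi_values), Counter(sa_values)
    variables = [(q, s, b) for q, s in product(q_counts, s_counts)]
    # Assumed QI constraint: sum over s of P(q, s, b) equals P(q, b).
    qi = [({(q, s, b): 1 for s in s_counts}, Fraction(n, total_records))
          for q, n in q_counts.items()]
    # Assumed SA constraint: sum over q of P(q, s, b) equals P(s, b).
    sa = [({(q, s, b): 1 for q in q_counts}, Fraction(n, total_records))
          for s, n in s_counts.items()]
    return variables, qi, sa

# Toy bucket 1 holding 3 of the 6 records in a made-up data set.
variables, qi, sa = bucket_constraints(
    1, ["q1", "q1", "q2"], ["s1", "s2", "s2"], total_records=6)
```

Only 4 variables are created for this bucket, no matter how many QI and SA values exist elsewhere in the data set, which is the variable reduction the zero constraint provides.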

Theoretic Properties • Soundness: Are they correct? • Easy to prove. • Completeness: Have we missed any constraint? • See our theorems and proofs. • Conciseness: Are there redundant constraints? • Only one redundant constraint in each bucket. • Consistency: Is our approach consistent with the existing methods (i.e., when background knowledge is Ø)?
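The conciseness claim can be checked on a toy bucket by computing the rank of the constraint matrix. A minimal sketch, assuming the QI and SA constraints are per-bucket marginal sums (an assumed form) and using a small hand-written Gaussian elimination rather than any particular library:

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Variables ordered (q1,s1), (q1,s2), (q2,s1), (q2,s2) for one bucket.
qi_rows = [[1, 1, 0, 0], [0, 0, 1, 1]]   # sum over s for q1 and for q2
sa_rows = [[1, 0, 1, 0], [0, 1, 0, 1]]   # sum over q for s1 and for s2
constraint_rank = rank(qi_rows + sa_rows)
```

Four constraints give rank 3 here: the QI rows and the SA rows both add up to the all-ones row, so exactly one constraint in the bucket is redundant.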

Completeness w.r.t. Equations • Have we missed any equality constraint? • Yes! • If F1 = C1 and F2 = C2 are constraints, F1 + F2 = C1 + C2 is too. However, it is redundant. • Completeness Theorem: • U: our constraint set. • Every valid linear constraint can be written as a linear combination of the constraints in U.

Completeness w.r.t. Inequalities • Have we missed any inequality constraint? • Yes! • If F = C, then F ≤ C + 0.2 is also valid (redundant). • Completeness Theorem: • Our constraint set is also complete in the inequality sense.

Putting Them Together • Tools: LBFGS, TOMLAB, KNITRO, etc. • Background Knowledge → Constraints on P( S | Q ) • Published Data → Constraints on P( S | Q ) • Public Information • Maximum Entropy Estimate → Estimate P( S | Q )

Inevitable Questions: • Where do we get background knowledge? • Do we have to be very knowledgeable? • For P(s | q) type of knowledge: • All useful knowledge is in the original data set. • Association rules: • Positive: Q ⇒ S • Negative: Q ⇒ ¬S, ¬Q ⇒ S, ¬Q ⇒ ¬S • Bound the knowledge in our study. • Top-K strongest association rules.
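A minimal sketch of bounding the knowledge by the Top-K positive rules q ⇒ s mined from the original data. Measuring rule strength by confidence P(s | q) is an assumption made here for illustration, and the records and the name `top_k_rules` are hypothetical:

```python
from collections import Counter
from fractions import Fraction

def top_k_rules(records, k):
    """Rank rules q => s by confidence P(s | q) estimated from (q, s) records."""
    pair_counts = Counter(records)
    q_counts = Counter(q for q, _ in records)
    rules = [((q, s), Fraction(n, q_counts[q])) for (q, s), n in pair_counts.items()]
    # Stable sort: rules with equal confidence keep their first-seen order.
    rules.sort(key=lambda rule: rule[1], reverse=True)
    return rules[:k]

records = [("male", "flu"), ("male", "flu"), ("male", "flu"), ("male", "cancer"),
           ("female", "flu"), ("female", "cancer")]
strongest = top_k_rules(records, k=2)
```

Each selected rule supplies one value of P(s | q), which can then be turned into a constraint on P(Q, S, B). Negative rules such as q ⇒ ¬s could be ranked the same way using 1 − P(s | q).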

Knowledge about Individuals • Alice: (i1, q1) • Bob: (i4, q2) • Charlie: (i9, q5) • Knowledge 1: Alice has either s1 or s4. Constraint: • Knowledge 2: Two people among Alice, Bob, and Charlie have s4. Constraint:

Evaluation • Implementation: • Lagrange multipliers: Constrained Optimization → Unconstrained Optimization • LBFGS: solving the unconstrained optimization problem. • 3 GHz Pentium CPU with 4 GB memory.
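A minimal sketch of the Lagrange-multiplier step on a toy problem. It maximizes the entropy of x subject to linear equalities A x = c by minimizing the unconstrained dual over the multipliers. Plain gradient descent stands in for LBFGS here to keep the example dependency-free; the matrix, step size, and iteration count are illustrative assumptions:

```python
import math

def maxent_dual_descent(A, c, step=0.5, iterations=5000):
    """Maximize -sum(x * log(x)) subject to A x = c via the Lagrangian dual.

    Stationarity of the Lagrangian gives x_i = exp(-1 - sum_j lam_j * A[j][i]).
    The dual is minimized over lam; its gradient in lam_j is c_j - (A x)_j.
    """
    rows, n = len(A), len(A[0])
    lam = [0.0] * rows
    for _ in range(iterations):
        x = [math.exp(-1.0 - sum(lam[j] * A[j][i] for j in range(rows)))
             for i in range(n)]
        residual = [c[j] - sum(A[j][i] * x[i] for i in range(n))
                    for j in range(rows)]
        lam = [lam[j] - step * residual[j] for j in range(rows)]
    return x

# One bucket, variables ordered (q1,s1), (q1,s2), (q2,s1), (q2,s2).
A = [[1, 1, 0, 0],   # P(q1) = 0.75
     [0, 0, 1, 1],   # P(q2) = 0.25
     [1, 0, 1, 0]]   # P(s1) = 0.50 (the P(s2) row is redundant and dropped)
c = [0.75, 0.25, 0.50]
x = maxent_dual_descent(A, c)
```

With no background knowledge the estimate converges to the product of the marginals, roughly (0.375, 0.375, 0.125, 0.125). A knowledge constraint would simply add another row to A and another entry to c.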

Privacy versus Knowledge • Estimation accuracy: KL distance between P_MaxEnt(S | Q) and P_Original(S | Q).
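A minimal sketch of the accuracy measure. The direction of the divergence (original relative to estimate), the weighting by P(q), and the use of base-2 logarithms are assumptions made for illustration; the estimate is assumed to be nonzero wherever the original is:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for two distributions given as dicts over the same values."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

def conditional_kl(original, estimate, q_weights):
    """Average of KL(original(.|q) || estimate(.|q)), weighted by P(q)."""
    return sum(w * kl_divergence(original[q], estimate[q])
               for q, w in q_weights.items())

# Hypothetical conditionals P(S | Q) for two QI values.
original = {"q1": {"s1": 1.0, "s2": 0.0}, "q2": {"s1": 0.5, "s2": 0.5}}
estimate = {"q1": {"s1": 0.5, "s2": 0.5}, "q2": {"s1": 0.5, "s2": 0.5}}
distance = conditional_kl(original, estimate, {"q1": 0.5, "q2": 0.5})
```

A distance of 0 means the MaxEnt estimate reproduces the original conditionals exactly; larger values mean the adversary's estimate is further from the truth.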

Privacy versus # of QI attributes

Performance vs. Knowledge

Running Time vs. Data Size

Iteration vs. Data size

Conclusion • Privacy-MaxEnt is a systematic method • Model various types of knowledge • Model the information from the published data • Based on well-established theory. • Future work • Reducing the # of constraints • Vague background knowledge • Background knowledge about individuals
