
Background Knowledge Attack for Generalization-Based Privacy-Preserving Data Mining






Presentation Transcript


  1. Background Knowledge Attack for Generalization-Based Privacy-Preserving Data Mining

  2. Discussion Outline
  • (sigmod08-4) Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
  • (kdd08-4) Composition Attacks and Auxiliary Information in Data Privacy
  • (vldb07-4) Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge

  3. Anonymization techniques
  • Generalization & suppression
    • Consistency property: multiple occurrences of the same value are always generalized the same way (all early methods and the more recent Incognito).
    • No consistency property (Mondrian).
  • Anatomy (Tao, VLDB'06)
  • Permutation (Koudas, ICDE'07)

  4. Anonymization through Anatomy
  • Anatomy: simple and effective privacy preservation. The quasi-identifier values and the sensitive values are published in two separate tables, linked only by a group ID.
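
As an illustration, here is a minimal Python sketch of the Anatomy idea on made-up records. The grouping below is naive (consecutive pairs), whereas the real algorithm forms groups that satisfy l-diversity; the point is only that exact QI values are published, but no longer joined one-to-one with sensitive values.

```python
# Minimal sketch of Anatomy-style publishing (hypothetical data).
# Each record: (identifier, quasi-identifier tuple, sensitive value).
records = [
    ("Alice", ("F", 29, "13053"), "flu"),
    ("Bob",   ("M", 31, "13068"), "hepatitis"),
    ("Carol", ("F", 35, "13053"), "bronchitis"),
    ("Dave",  ("M", 42, "13067"), "flu"),
]

def anatomize(records, group_size=2):
    """Split records into groups; publish two tables linked only by group ID."""
    qi_table, sens_table = [], []
    for gid, start in enumerate(range(0, len(records), group_size)):
        for _, qi, sens in records[start:start + group_size]:
            qi_table.append((qi, gid))      # exact QI values, no sensitive value
            sens_table.append((gid, sens))  # sensitive values, no QI values
    return qi_table, sens_table

qi_table, sens_table = anatomize(records)
print(qi_table)    # [(('F', 29, '13053'), 0), (('M', 31, '13068'), 0), ...]
print(sens_table)  # [(0, 'flu'), (0, 'hepatitis'), (1, 'bronchitis'), (1, 'flu')]
```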

  5. Anonymization through permutation

  6. Background knowledge
  • k-anonymity
    • The attacker has access to public databases, i.e., the quasi-identifier values of individuals.
    • The target individual is in the released database.
  • l-diversity
    • Homogeneity attack.
    • Background knowledge about some individuals' sensitive attribute values.
  • t-closeness
    • The distribution of the sensitive attribute in the overall table.
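
A small sketch, on made-up data, of the homogeneity attack that motivates l-diversity: a group can be perfectly k-anonymous yet carry a single sensitive value, which a simple distinct-values check exposes.

```python
# A 3-anonymous group that fails distinct l-diversity (hypothetical data).
group = [
    (("M", "130**", "age 20-29"), "heart disease"),
    (("M", "130**", "age 20-29"), "heart disease"),
    (("M", "130**", "age 20-29"), "heart disease"),
]

def distinct_l(group):
    """Number of distinct sensitive values in one anonymized group."""
    return len({sens for _, sens in group})

# k-anonymity holds (three identical QI tuples), yet an attacker who places
# the target in this group learns the sensitive value with certainty.
print(distinct_l(group))  # 1 -> fails l-diversity for any l > 1
```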

  7. Types of background knowledge
  • Known facts
    • A male patient cannot have ovarian cancer.
  • Demographic information
    • It is unlikely that a young patient from certain ethnic groups has heart disease.
  • In general: some combinations of quasi-identifier values cannot entail certain sensitive attribute values.

  8. Types of background knowledge
  • Adversary-specific knowledge
    • The target individual does not have a specific sensitive attribute value, e.g., Bob does not have flu.
    • Sensitive attribute values of some other individuals, e.g., Joe, John, and Mike (Bob's neighbors) have flu.
    • Knowledge about a same-value family.
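
A hedged sketch of how such adversary-specific knowledge sharpens the attacker's posterior within one anonymized group. It assumes the simplified "random worlds" view that the target's value is drawn uniformly from the group's remaining sensitive values; the data and helper names are hypothetical.

```python
from collections import Counter

# Sensitive values published for the target's group (hypothetical data).
group_values = Counter({"flu": 3, "hepatitis": 1})

def posterior(values, target_not=(), others_have=()):
    """Posterior over the target's value, under uniform assignment in the group."""
    values = values.copy()
    for s in others_have:        # values known to belong to other group members
        values[s] -= 1
    total = sum(c for s, c in values.items() if s not in target_not)
    return {s: c / total for s, c in values.items()
            if s not in target_not and c > 0}

print(posterior(group_values))                       # {'flu': 0.75, 'hepatitis': 0.25}
print(posterior(group_values, target_not={"flu"}))   # "Bob has no flu" -> {'hepatitis': 1.0}
print(posterior(group_values, others_have=["flu"]))  # "Joe has flu" -> {'flu': 0.67, 'hepatitis': 0.33}
```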

  9. Some extensions
  • Multiple sensitive values per individual, e.g., flu ∈ Bob[S].
  • Basic implications (adopted in Martin, ICDE'07) cannot practically express the above: |S| − 1 basic implications are needed.
  • Probabilistic knowledge vs. deterministic knowledge.

  10. Data Sets
  • Attributes: Identifier, Quasi-Identifier (QI), Sensitive Attribute (SA).
  • Question: how much can adversaries learn about an individual's sensitive attributes if they know the individual's quasi-identifiers?

  11. Background Knowledge
  • We need to measure P(SA | QI): the probability of a sensitive attribute value given the quasi-identifier values, in light of background knowledge.

  12. Impact of Background Knowledge
  • Example background knowledge: it is rare for a male to have breast cancer.

  13. [Martin, et al. ICDE'07]
  • The first formal study of the effect of background knowledge on privacy-preserving data publishing.

  14. Full identification information
  • Assumption: the attacker has complete information about individuals' non-sensitive data.

  15. Rule-based knowledge
  • Atom Ai: a predicate about a person and his/her sensitive values.
    • t_Jack[Disease] = flu says that Jack's tuple has the value flu for the sensitive attribute Disease.
  • Basic implication.
  • Background knowledge is formulated as a conjunction of k basic implications.
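
One plausible encoding of atoms and implications as data structures; the class names are hypothetical, and the implication form used here (a conjunction of atoms implying a single atom) is a simplification of the paper's actual language.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    person: str
    attribute: str
    value: str  # Atom("Jack", "Disease", "flu") ~ t_Jack[Disease] = flu

@dataclass(frozen=True)
class BasicImplication:
    body: tuple[Atom, ...]  # conjunction of atoms
    head: Atom              # the implied atom

# "If Joe and John have flu, then Bob has flu" as one basic implication;
# background knowledge is a conjunction of k such implications.
rule = BasicImplication(
    body=(Atom("Joe", "Disease", "flu"), Atom("John", "Disease", "flu")),
    head=Atom("Bob", "Disease", "flu"),
)
```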

  16. The idea
  • Use k to bound the background knowledge, and compute the maximum disclosure of a bucketized data set with respect to that background knowledge.

  17. [Bee-Chung, et al. VLDB'07] (vldb07-4)
  • Uses a triple (l, k, m) to specify the bound on background knowledge, rather than a single k.

  18. Introduction
  • [Martin, et al. ICDE'07]: the limitation of using a single number k to bound background knowledge.
  • This work quantifies an adversary's external knowledge by a novel multidimensional approach.

  19. Problem formulation
  • The data owner has a table of data (denoted D) and publishes the resulting release candidate D*.
  • S: a sensitive attribute; s: a target sensitive value; t: a target individual.
  • Quantity of interest: Pr(t has s | K, D*).
  • The new bound specifies that:
    • adversaries know l other people's sensitive values;
    • adversaries know k sensitive values that the target does not have;
    • adversaries know a group of m − 1 people who share the same sensitive value with the target.
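
A hedged formalization of the quantity the (l, k, m) triple controls; the knowledge family 𝒦(l, k, m) and the tolerance δ below are assumed notation, not taken verbatim from the paper:

```latex
\max_{K \,\in\, \mathcal{K}(l,\,k,\,m)} \Pr\bigl(\, t \text{ has } s \mid K,\, D^{*} \bigr) \;\le\; \delta
```

Here 𝒦(l, k, m) is the family of all adversarial knowledge sets within the stated budget, so the release D* is considered safe when even the best-informed such adversary cannot push the breach probability past δ.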

  20. Theoretical framework

  21. [Wenliang, et al. SIGMOD’08] (sigmod08-4)

  22. Introduction
  • The impact of background knowledge:
    • How does it affect privacy?
    • How can we measure its impact on privacy?
  • Integrate background knowledge into privacy quantification.
  • Privacy-MaxEnt: a systematic approach, based on a well-established theory: the maximum entropy estimate.

  23. Challenges
  • What do we want to compute?
    • P(S | Q), given the background knowledge and the published data set.
  • Directly computing P(S | Q) is hard.

  24. Our Approach
  • Consider P(S | Q) as a variable x (a vector).
  • Background knowledge, the published data, and public information each yield constraints on x.
  • Solve for x: take the most unbiased solution that satisfies all the constraints.
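
Concretely, the constraints on x might look as follows (a sketch; the simplex constraints always hold, and the knowledge constraints are the probability-based examples that reappear on the summary slides):

```latex
% Simplex constraints on x = P(S | Q):
\sum_{s} P(s \mid q) = 1 \ \ \text{for each } q, \qquad P(s \mid q) \ge 0.
% Example knowledge constraints:
P(s \mid q) = 0, \qquad 0.3 \le P(s \mid q) \le 0.5, \qquad P(s \mid q_1) + P(s \mid q_2) = 0.7.
```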

  25. Maximum Entropy Principle
  • "Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information." (E. T. Jaynes, 1957)

  26. The MaxEnt Approach
  • Background knowledge: constraints on P(S | Q).
  • Published data and public information: constraints on P(S | Q).
  • Maximum entropy estimate: the estimate of P(S | Q).

  27. Entropy
  • Because H(S | Q, B) = H(Q, S, B) − H(Q, B), the constraints should use P(Q, S, B) as their variables.
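
This identity is the chain rule for entropy. Since the published data pin down P(Q, B) (a reading of the slide, not a quote from the paper), H(Q, B) is fixed by the constraints, so maximizing H(S | Q, B) reduces to maximizing the joint entropy:

```latex
H(Q, S, B) \;=\; H(Q, B) + H(S \mid Q, B),
\qquad
H(Q, S, B) \;=\; -\sum_{q,\,s,\,b} P(q, s, b)\,\log P(q, s, b).
```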

  28. Maximum Entropy Estimate
  • Let the vector x = P(Q, S, B).
  • Find the value of x that maximizes the entropy H(Q, S, B) while satisfying
    • h1(x) = c1, …, hu(x) = cu: equality constraints
    • g1(x) ≤ d1, …, gv(x) ≤ dv: inequality constraints
  • This is a special case of non-linear programming.
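
A minimal sketch of this non-linear program in Python with scipy.optimize, on a toy 2 × 2 joint distribution with one published-data constraint and one knowledge constraint; this is illustrative only, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy joint P(Q, S): 2 QI groups x 2 sensitive values, flattened to
# x = [P(q1,s1), P(q1,s2), P(q2,s1), P(q2,s2)].
def neg_entropy(x):
    x = np.clip(x, 1e-12, 1.0)        # avoid log(0)
    return np.sum(x * np.log(x))      # minimizing -H maximizes H

constraints = [
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},    # total probability 1
    {"type": "eq", "fun": lambda x: x[0] + x[1] - 0.6},  # published data: q1 holds 60% of records
    # Knowledge: P(s1 | q2) <= 0.5, i.e. x[2] <= 0.5 * (x[2] + x[3]).
    {"type": "ineq", "fun": lambda x: 0.5 * (x[2] + x[3]) - x[2]},
]

res = minimize(neg_entropy, x0=np.full(4, 0.25), method="SLSQP",
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(res.x.reshape(2, 2))  # MaxEnt estimate, approx [[0.3, 0.3], [0.2, 0.2]]
```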

  29. Putting Them Together
  • The same pipeline as before: background knowledge, published data, and public information yield constraints on P(S | Q); the maximum entropy estimate produces the estimate of P(S | Q).
  • Tools: L-BFGS, TOMLAB, KNITRO, etc.

  30. Conclusion
  • Privacy-MaxEnt is a systematic method:
    • it models various types of knowledge,
    • it models the information from the published data,
    • and it is based on well-established theory.

  31. [Srivatsava, et al. KDD'08] (kdd08-4)

  32. Introduction
  • Reasons about privacy in the face of rich, realistic sources of auxiliary information.
  • Investigates the effectiveness of current anonymization schemes in preserving privacy when multiple organizations independently release anonymized data.
  • Presents composition attacks, in which an adversary uses independently anonymized releases to breach privacy.
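
A minimal sketch, on hypothetical releases, of the intersection step at the heart of a composition attack: each release may be l-diverse on its own, yet an individual present in both can be pinned down by intersecting her candidate sets.

```python
# Candidate sensitive values for the same target, taken from her anonymized
# group in two independently released data sets (hypothetical data).
candidates_release_a = {"flu", "hepatitis", "bronchitis"}   # hospital A's release
candidates_release_b = {"hepatitis", "diabetes", "asthma"}  # hospital B's release

# Each group is 3-diverse in isolation, but the target's true value must lie
# in both groups, so the releases compose into a breach.
print(candidates_release_a & candidates_release_b)  # {'hepatitis'}
```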

  33. Summary
  • What is background knowledge?
  • Probability-based knowledge
    • P(s | q) = 1.
    • P(s | q) = 0.
    • P(s | q) = 0.2.
    • P(s | Alice) = 0.2.
    • 0.3 ≤ P(s | q) ≤ 0.5.
    • P(s | q1) + P(s | q2) = 0.7.
  • Logic-based knowledge (propositional / first-order / modal logic)
    • One of Alice and Bob has "Lung Cancer".
  • Numerical data
    • 50K ≤ salary of Alice ≤ 100K.
    • Age of Bob ≤ age of Alice.
  • Linked data
    • Degree of a node.
    • Topology information.
    • ….
  • Domain knowledge
    • The mechanism or algorithm of anonymization for data publication.
    • Independently released anonymized data from other organizations.
  • And many others ….

  34. Summary [Wenliang, et al. SIGMOD'08]
  • How to represent background knowledge? (The taxonomy is the same as on the previous slide.)
  • Rule-based: [Martin, et al. ICDE'07].
  • A general knowledge framework covering the categories above: it is too hard to give a unified framework and a general solution [Raymond, et al. VLDB'07] [Srivatsava, et al. KDD'08].

  35. Summary
  • How to quantify background knowledge?
    • By the number of basic implications (association rules) [Martin, et al. ICDE'07].
    • By a novel multidimensional approach [Bee-Chung, et al. VLDB'07].
    • Formulated as linear constraints [Wenliang, et al. SIGMOD'08].
  • How can one reason about privacy in the presence of external knowledge?
    • Quantify the privacy [Wenliang, et al. SIGMOD'08].
    • Quantify the degree of randomization required [Charu, ICDE'07].
    • Quantify the precise effect of background knowledge [Martin, et al. ICDE'07].

  36. Questions? Thanks to Zhiwei Li.
