370 likes | 565 Views
Background Knowledge Attack for Generalization based Privacy-Preserving Data Mining. Discussion Outline. (sigmod08-4) Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification (kdd08-4) Composition Attacks and Auxiliary Information in Data Privacy
E N D
Background KnowledgeAttackfor Generalization based Privacy-Preserving Data Mining
Discussion Outline • (sigmod08-4)Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification • (kdd08-4) Composition Attacks and Auxiliary Information in Data Privacy • (vldb07-4) Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge
Anonymization techniques • Generalization & suppression • Consistency property: multiple occurrences of the same value are always generalized the same way. (all old methods and recent Incognito) • No consistency property (Mondrain) • Anatomy (Tao vldb06) • Permutation (Koudas ICDE07)
Anonymization through Anatomy Anatomy: simple and effective privacy preservation
Background knowledge • K-anonymity • Attacker has access to public databases, i.e., quasi-identifier values of the individuals. • The target individual is in the released database. • L-diversity • Homogeneity attack • background knowledge about some individuals’ sensitive attribute values • T-closeness • The distribution of sensitive attribute in the overall table
Type of background knowledge • Known facts • A male patient cannot have ovarian cancer • Demographical information • It is unlikely that a young patient of certain ethnic groups has heart disease • Some combination of the quasi-identifier values cannot entail some sensitive attribute values
Type of background knowledge • Adversary-specific knowledge • target individual has no specific sensitive attribute value , e.g., Bob does not have flu • Sensitive attribute values of some other individuals, Joe, John, and Mike (as Bob’s neighbor) have flu • Knowledge about same-value family
Some extension • Multiple sensitive values per individual • Flu \in Bob[S] • Basic implication (adopted in Martin ICDE07) cannot practically express the above --- |s|-1 basic implications are needed • Probabilistic knowledge vs. deterministic knowledge
Identifier Quasi-Identifier (QI) Sensitive Attribute (SA) Data Sets how much adversaries can know about an individual’s sensitive attributes if they know the individual’s quasi-identifiers
we need to measure P(SA|QI) Quasi-Identifier (QI) Sensitive Attribute (SA) Background Knowledge
Background Knowledge: It’s rare for male to have breast cancer. Impact of Background Knowledge
[Martin, et al. ICDE’07] first formal study of the effect of background knowledge on privacy-preserving
Full identification information • Assumption • the attacker has complete information about individuals’ non-sensitive data Full identification information
Rule based knowledge • Atom Ai • a predicate about a person and his/her sensitive values • tJack[Disease] = flusays that the Jack’s tuple has the value flu for the sensitive attribute Disease. • Basic implication • Background knowledge • formulated as conjunctions of k basic implications
The idea • use k to bound the background knowledge, and compute the maximum disclosure of a bucket data set with respect to the background knowledge.
[Bee-Chung, et al. VLDB’07] (vldb07-4) use a triple (l, k,m) to specify the bound of the background rather than a single k
Introduction • [Martin, et al. ICDE’07] limitation of using a single number k to bound background knowledge • quantifying an adversary’s external knowledge by a novel multidimensional approach
Problem formulation data owner has a table of data (denoted by D) data owner publishes the resulting release candidate D* S: a sensitive attribute s: a target sensitive value t: a target individual Pr(t has s | K, D*) • new bound specifies that • adversaries know lother people’s sensitive value; • adversaries know ksensitive values that the target does not have • adversaries know a group of m−1 people who share the same sensitive value with the target
[Wenliang, et al. SIGMOD’08] (sigmod08-4)
Introduction • The impact of background knowledge: • How does it affect privacy? • How to measure its impact on privacy? • Integrate background knowledge in privacy quantification. • Privacy-MaxEnt: A systematic approach. • Based on well-established theories. maximum entropy estimate
Challenges • What do we want to compute? • P( S | Q ), given the background knowledge and the published data set. • Directly computing P( S | Q ) is hard.
Our Approach Consider P( S | Q )as variable x (a vector). Background Knowledge Constraints on x Solve x Published Data Constraints on x Most unbiased solution Public Information
Maximum Entropy Principle • “Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is least biased estimate possible on the given information.” — by E. T. Jaynes, 1957.
The MaxEnt Approach Background Knowledge Constraints on P( S | Q ) Maximum Entropy Estimate Estimate P( S | Q ) Published Data Constraints on P( S | Q ) Public Information
Entropy Because H(S | Q, B) = H(Q, S, B) – H(Q, B) Constraint should use P(Q, S, B) as variables
Maximum Entropy Estimate • Let vector x = P(Q, S, B). • Find the value for x that maximizes its entropy H(Q, S, B), while satisfying • h1(x) = c1, …, hu(x) = cu : equality constraints • g1(x) ≤ d1, …, gv(x) ≤ dv : inequality constraints • A special case of Non-Linear Programming.
Putting Them Together Tools: LBFGS, TOMLAB, KNITRO, etc. Background Knowledge Constraints on P( S | Q ) Maximum Entropy Estimate Estimate P( S | Q ) Published Data Constraints on P( S | Q ) Public Information
Conclusion • Privacy-MaxEnt is a systematic method • Model various types of knowledge • Model the information from the published data • Based on well-established theory.
[Srivatsava, et al. KDD’08] (kdd08-2)
Introduction • reason about privacy in the face of rich, realistic sources of auxiliary information. • investigate the effectiveness of current anonymization schemes in preserving privacy when multiple organizationsindependently release anonymized data • present a composition attacks • an adversary uses independently anonymized releases to breach privacy
Summary • What is background knowledge? • Probability-Based Knowledge • P (s | q) = 1. • P (s | q) = 0. • P (s | q) = 0.2. • P (s | Alice) = 0.2. • 0.3 ≤ P (s | q) ≤ 0.5. • P (s | q1) + P (s | q2) = 0.7 • Logic-Based Knowledge (proposition/ first order/ modal logic) • One of Alice and Bob has “Lung Cancer”. • Numerical data • 50K ≤ salary of Alice ≤ 100K • age of Bob ≤ age of Alice • Linked data • degree of a node • topology information • …. • Domain Knowledge • mechanism or algorithm of anonymizationfor data publication • independently released anonymized data by other organizations • And many many others ….
Summary [Wenliang, et al. SIGMOD’08] • How to represent background knowledge? • Probability-Based Knowledge • P (s | q) = 1. • P (s | q) = 0. • P (s | q) = 0.2. • P (s | Alice) = 0.2. • 0.3 ≤ P (s | q) ≤ 0.5. • P (s | q1) + P (s | q2) = 0.7 • Logic-Based Knowledge (proposition/ first order/ modal logic) • One of Alice and Bob has “Lung Cancer”. • Numerical data • 50K ≤ salary of Alice ≤ 100K • age of Bob ≤ age of Alice • Linked data • degree of a node • topology information • …. • Domain Knowledge • mechanism or algorithm of anonymizationfor data publication • independently released anonymized data by other organizations • And many many others …. Rule-based [Martin, et al. ICDE’07] general knowledge framework too hard to give a unified framework and give a general solution [Raymond, et al. VLDB’07] [Srivatsava, et al. KDD’08]
Summary • How to quantify background knowledge? • by the number of basic implications(association rules) • by a novel multidimensional approach • formulated as linear constraints • How one can reason about privacy in the presence of external knowledge? • quantify the privacy • quantify the degree of randomization required • quantify the precise effect of background knowledge [Martin, et al. ICDE’07] [Bee-Chung, et al. VLDB’07] [Wenliang, et al. SIGMOD’08] [Wenliang, et al. SIGMOD’08] [Charu ICDE’07] [Martin, et al. ICDE’07]