320 likes | 449 Views
Probabilistic Inference Protection on Anonymized Data. Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology) Ada Wai-Chee Fu (the Chinese University of Hong Kong) Ke Wang (Simon Fraser University) Yabo Xu (Sun Yat-sen University) Jian Pei (Simon Fraser University)
E N D
Probabilistic Inference Protection on Anonymized Data Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology) Ada Wai-Chee Fu (the Chinese University of Hong Kong) Ke Wang (Simon Fraser University) Yabo Xu (Sun Yat-sen University) Jian Pei (Simon Fraser University) Philip S. Yu (Univerisity of Illinois at Chicago) Prepared by Raymond Chi-Wing Wong Presented by Raymond Chi-Wing Wong
Outline • Introduction • l-diversity • Background Knowledge • Proposed Model • Conclusion
Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 1. l-diversity Bucketization In other words, P(Alan is linked to Lung Cancer) is at most 1/2. Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer with probability=1/2. Knowledge 1 This dataset satisfies 2-diversity. QI Table Sensitive Table
Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 1. l-diversity This can be obtained from statistical reports from the US department of Health and Human Services and other statistical data sources discussed in previous studies Bucketization Knowledge 3 QI Based Distribution Knowledge 1 This dataset satisfies 2-diversity. QI Table Sensitive Table
Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 1. l-diversity It is more likely that a male patient is linked to Lung Cancer compared with a female patient. Bucketization Knowledge 3 QI Based Distribution Knowledge 1 Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2). This dataset satisfies 2-diversity. Why? QI Table Sensitive Table
Knowledge 2 I also know Alan with (Male, 41) Release the data set to public Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2 Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Bucketization Knowledge 3 QI Based Distribution We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer) ) according to Knowledge 1, 2 and 3 Knowledge 1 Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2). This dataset satisfies 2-diversity. QI Table Sensitive Table
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer) ) according to Knowledge 1, 2 and 3
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer) ) according to Knowledge 1, 2 and 3
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Most existing privacy studies involve some formulae which are monotonic. Thus, most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Most existing privacy studies involve some formulae which are monotonic. Thus, most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Related Work: There is a closely related work [LLZ09] for this problem. [LLZ09] T. Li, N. Li and J. Zhang, “Modeling and Integrating Background Knowledge in Data Anonymization”, ICDE 2009 [LLZ09] approximates the formula for this probability. Thus, there is no solid guarantee on the privacy protection.
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive. • Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Contributions: We propose a condition. If this condition is satisfied, we canguarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2 ) Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically, (1) Computing the condition is computationally cheap, and (2) The condition involves a monotonic function on the A-group size.
Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • The major idea of the condition includes some simple calculations based on the statistics of an A-group • The size of the A-group (N) • The privacy requirement (r) • The global probabilities of each tuple in the A-group to a sensitive value Contributions: We propose a condition. If this condition is satisfied, we canguarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2 ) Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically, (1) Computing the condition is computationally cheap, and (2) The condition involves a monotonic function on the A-group size.
N Satisfied/Not Satisfied r Global probabilities Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 • The major idea of the condition includes some simple calculations based on the statistics of an A-group • The size of the A-group (N) • The privacy requirement (r) • The global probabilities of each tuple in the A-group to a sensitive value Condition Check If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2)
4. Conclusion • Background Knowledge • QI-based Probability Distribution • Two Challenges • Challenge 1: The formula for the probability is computationally expensive • Challenge 2: The formula is not monotonic • Proposed Condition • overcomes Challenge 1 and Challenge 2
Release the data set to public 1. l-diversity There is another way to prevent this linkage called Generalization. The following principle to be discussed can also be applied to Generalization. A way to prevent this linkage. Bucketization These two tuples form an anonymized group (A-group) GID = L1 These two tuples form another A-group. GID = L2
Release the data set to public 1. l-diversity Bucketization GID = L1 GID = L2 QI Table Sensitive Table
Release the data set to public 1. l-diversity Bucketization QI Table Sensitive Table
Knowledge 2 I also know Alan with (Male, 41) Release the data set to public 1. l-diversity Knowledge 1 Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.
Merging Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity P(an individual is linked to a sensitive value) ≤ 0.5 • Monotonicity • Consider two A-groups P(an individual is linked to a sensitive value) = 0.5 An A-group with GID = L1 An A-group “merged”from these two A-groups An A-group with GID = L2 P(an individual is linked to a sensitive value) = 0.4 The probability is monotonically decreasing when the size of the A-gourp increases.
Merging Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity It is possible that P(an individual is linked to a sensitive value) > 0.5 • Non-Monotonicity • Consider two A-groups P(an individual is linked to a sensitive value) = 0.5 An A-group with GID = L1 An A-group “merged”from these two A-groups An A-group with GID = L2 P(an individual is linked to a sensitive value) = 0.4 The probability is notmonotonically decreasing when the size of the A-gourp increases.
Knowledge 2 I also know Alan with (Male, 41) N Satisfied/Not Satisfied r Global probabilities Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2). 1. l-diversity Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2 Knowledge 1 For the sake of illustration, we focus on attribute Gender only. Knowledge 3 QI Based Distribution Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2. 2 Condition Check 2 If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2) 0.1 0.003
N Satisfied/Not Satisfied r Global probabilities What is the condition check? In the condition check, there is an expression ceil in terms of N, r and global probabilities to compute. 2 Condition Check 2 0.1 0.003
What is the condition check? In the condition check, there is an expression ceil in terms of N, r and global probabilities to compute. • Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.
Theorem 2: Computing ceil can be done in O(1) time. This means that we overcome Challenge 1. Challenge 1: Calculating the probability is computationally expensive. • Theorem 3:ceil is a monotonically increasing function on N where N is the A-group size. This means that we overcome Challenge 2. Challenge 2: The formula for the original probability is not monotonic with respect to the A-group size.
ceil = (N-r)/fmax fmax(r-1)/(1-fmax) + (N-1) N Satisfied/Not Satisfied r Global probabilities What is the condition check? The greatest global probability fmax = max{f1, f2} = max{0.1, 0.003} = 0.1 The difference between the greatest global probability and the “current” global probability = 0 = 0.1 – 0.1 1 = fmax– f1 in terms of N, r and fmax. = 0.097 = 0.1 – 0.003 2 = fmax– f2 The condition is whether this difference 1 (and 2) is at most an expression ceil 2 Condition Check 2 f1 0.1 0.003 f2
ceil = (N-r)/fmax fmax(r-1)/(1-fmax) + (N-1) What is the condition check? The greatest global probability fmax = max{f1, f2} = max{0.1, 0.003} = 0.1 The difference between the greatest global probability and the “current” global probability = 0 = 0.1 – 0.1 1 = fmax– f1 = 0.097 = 0.1 – 0.003 2 = fmax– f2 The condition is whether this difference 1 (and 2) is at most an expression ceil • Theorem 1: If i≤ceil is satisfied, then the privacy requirement is satisfied.
Anonymization • The condition check gives hints for anonymization • Initially, each tuple forms an A-group. • Repeat the following until each A-group satisfies the condition. • If there is an A-group violating the condition, merge this A-group with some other A-group such that the “merged” A-group satisfies the condition.
Release the data set to public B.1.2 K-Anonymity Problem: to generate a data set such that each possible value appears at least TWO times. Two Kinds of Generalisations 1. ShatinNT 2. 16 July* “ShatinNT” causes LESS distortion than “16 July*” Question: how can we measure the distortion? This data set is 2-anonymous
HKG * NT Jan July Oct Feb KLN 16 July 21 Oct 8 Feb 29 Jan Fanling Mongkok Jordon Shatin B.1.2 K-Anonymity Measurement= 1/1=1.0 * Measurement= 2/2=1.0 Male Female Measurement= 1/2 =0.5 Conclusion: We propose a measurement of distortion of the modified/anonymized data.
HKG * NT Jan July Oct Feb KLN 16 July 21 Oct 8 Feb 29 Jan Fanling Mongkok Jordon Shatin B.1.2 K-Anonymity Measurement= 1/1=1.0 * Measurement= 2/2=1.0 Male Female Measurement= 1/2 =0.5 Can we modify the measurement? e.g. different weightings to each level