Preservation of Proximity Privacy in Publishing Numerical Sensitive Data J. Li, Y. Tao, and X. Xiao, SIGMOD '08 Presented by Hongwei Tian
Outline • What is PPDP • Existing Privacy Principles • Proximity Attack • (ε, m)-anonymity • Determine ε and m • Algorithm • Experiments and Conclusion
Privacy-Preserving Data Publishing (PPDP) • A true story from Massachusetts, 1997 • GIC released "anonymized" medical records of state employees • For 20 dollars, Latanya Sweeney bought the voter registration list and linked it to the GIC data by zip code, birth date, and sex • The result: the medical record of Governor Weld was re-identified
PPDP • Privacy • Sensitive information about individuals must be protected in the published data • Pushes toward more heavily anonymized data • Utility • The published data should remain useful • Pushes toward more accurate data
PPDP • Anonymization Techniques • Generalization • Specific value -> general value • Maintains the semantic meaning • 78256 -> 7825*, UTSA -> University, 28 -> [20, 30] (sketched below) • Perturbation • One value -> another random value • Huge information loss -> poor utility
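To make the generalization idea concrete, here is a minimal Python sketch; the mapping rules and the values are illustrative, not from the paper:

```python
# Generalization replaces a specific value with a semantically
# consistent, more general one. Illustrative rules only.

def generalize_zip(zipcode: str, keep: int = 4) -> str:
    """78256 -> 7825*: mask the trailing digits of a zip code."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

def generalize_age(age: int, width: int = 10) -> str:
    """28 -> [20, 30]: map an age to a coarser interval."""
    lo = (age // width) * width
    return f"[{lo}, {lo + width}]"

print(generalize_zip("78256"))  # 7825*
print(generalize_age(28))       # [20, 30]
```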
PPDP • Example of Generalization
Some Existing Privacy Principles • Generalization • SA – Categorical • k-anonymity • l-diversity, (α, k)-anonymity, m-invariance, … • (c, k)-safety, Skyline-privacy • … • SA – Numerical • (k, e)-anonymity, Variance Control • t-closeness • δ-presence • …
Next… • What is PPDP • Existing Privacy Principles • Proximity Attack • (ε, m)-anonymity • Determine εand m • Algorithm • Experiments and Conclusion
(ε, m)-anonymity • I(t) • the private neighborhood of tuple t • absolute: I(t) = [t.SA − ε, t.SA + ε] • relative: I(t) = [t.SA·(1 − ε), t.SA·(1 + ε)] • P(t) • the risk of a proximity breach for tuple t • P(t) = x / |G|, where x is the number of tuples in t's equivalence class G whose SA values fall in I(t)
(ε, m)-anonymity • Example (absolute, t1.SA = 1000): ε = 20 • I(t1) = [980, 1020] • x = 3, |G| = 4 • P(t1) = 3/4 (see the sketch below)
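A small Python sketch of the absolute definitions above; the SA values are made up, chosen so the numbers match the slide:

```python
def neighborhood(sa, eps):
    """Absolute private neighborhood I(t) = [t.SA - eps, t.SA + eps]."""
    return (sa - eps, sa + eps)

def breach_probability(t_sa, group_sa, eps):
    """P(t) = x / |G|, where x counts tuples of G whose SA falls in I(t)."""
    lo, hi = neighborhood(t_sa, eps)
    x = sum(1 for v in group_sa if lo <= v <= hi)
    return x / len(group_sa)

G = [1000, 990, 1010, 1100]                 # hypothetical SA values of G
print(neighborhood(1000, eps=20))           # (980, 1020)
print(breach_probability(1000, G, eps=20))  # 0.75, i.e. x = 3, |G| = 4
```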
(ε, m)-anonymity • Principle • Given a real value ε and an integer m ≥ 1, a generalized table T∗ fulfills absolute (relative) (ε, m)-anonymity if P(t) ≤ 1/m for every tuple t ∈ T. • Larger ε and m mean a stricter privacy requirement (a checker is sketched below)
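Checking the principle over a set of published groups is then a direct loop; a self-contained sketch using absolute neighborhoods and the hypothetical values from the previous example:

```python
def satisfies_em_anonymity(groups, eps, m):
    """True iff P(t) <= 1/m for every tuple t of every published group
    (absolute neighborhoods [t.SA - eps, t.SA + eps])."""
    for sa_values in groups:
        n = len(sa_values)
        for t in sa_values:
            x = sum(1 for v in sa_values if abs(v - t) <= eps)
            if x * m > n:  # equivalent to P(t) = x / n > 1 / m
                return False
    return True

# The group from the previous slide violates (20, 2)-anonymity: P = 3/4.
print(satisfies_em_anonymity([[1000, 990, 1010, 1100]], eps=20, m=2))  # False
```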
(ε, m)-anonymity • What is the meaning of m? • |G| ≥ m • The best situation: for any two tuples ti and tj in G, tj.SA ∉ I(ti) and ti.SA ∉ I(tj) • Similar to l-diversity when the equivalence class has l tuples with distinct SA values
(ε, m)-anonymity • How to ensure that tj.SA does not fall in I(ti)? • Sort all tuples in G in ascending order of their SA values • Then it suffices that | j – i | ≥ max{ |left(tj,G)|, |right(ti,G)| }, where left(t,G) / right(t,G) denote the tuples on t's left / right in the sorted order whose SA values fall in I(t)
(ε, m)-anonymity • Let maxsize(G) = max∀t∈G { max{ |left(t,G)|, |right(t,G)| } } • | j – i | ≥ maxsize(G)
(ε, m)-anonymity • Partitioning • Sort the tuples in G in ascending order of their SA values • Hash the i-th tuple into the j-th bucket using j = (i mod maxsize(G)) + 1 • Thus no two tuples (SA values) in the same bucket fall into each other's neighborhood (sketched below)
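A runnable sketch of the sort-and-hash partitioning. One caveat: for the slide's hash j = (i mod maxsize(G)) + 1 to guarantee separation at the boundaries, I read |left(t,G)| and |right(t,G)| as counting t itself; whether that matches the paper's exact definition is an assumption.

```python
def maxsize(sa_sorted, eps):
    """maxsize(G): the largest number of tuples (counting t itself) that
    lie inside I(t) on one side of t in the SA-sorted order."""
    g = 1
    for i, t in enumerate(sa_sorted):
        left = sum(1 for v in sa_sorted[:i + 1] if v >= t - eps)  # incl. t
        right = sum(1 for v in sa_sorted[i:] if v <= t + eps)     # incl. t
        g = max(g, left, right)
    return g

def partition(sa_values, eps):
    """Sort G by SA, then hash the i-th tuple into bucket
    j = (i mod maxsize(G)) + 1; same-bucket tuples end up at least
    maxsize(G) positions apart and never fall in each other's I(t)."""
    sa_sorted = sorted(sa_values)
    g = maxsize(sa_sorted, eps)
    buckets = {}
    for i, v in enumerate(sa_sorted, start=1):
        buckets.setdefault(i % g + 1, []).append(v)
    return buckets

print(partition([10, 40, 20, 25, 50, 30], eps=6))
# {2: [10, 25, 40], 1: [20, 30, 50]} -> each bucket gives P(t) = 1/3
```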
(ε, m)-anonymity • (6, 2)-anonymity • Privacy is breached: P(t3) = 3/4 > 1/m = 1/2 • Partitioning is needed • An ascending order by SA values is already in place • g = maxsize(G) = 2 • j = (i mod 2) + 1 • New P(t3) = 1/2
Determine ε and m • Given ε and m • Check whether an equivalence class G satisfies (ε, m)-anonymity • Theorem: G has at least one (ε, m)-anonymous generalization iff m · maxsize(G) ≤ |G| • Scan the sorted tuples in G once to find maxsize(G) • This predicts whether G can be partitioned or not (see the sketch below)
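With maxsize from the partitioning sketch above, the feasibility test becomes a one-liner. The slide leaves the iff-condition blank; m · maxsize(G) ≤ |G| is my reconstruction (it makes every hash bucket hold at least m tuples, and it is tight on the (6, 2) example, where 2 · 2 ≤ 4):

```python
def can_partition(sa_values, eps, m):
    """Reconstructed theorem: G admits an (eps, m)-anonymous partitioning
    iff m * maxsize(G) <= |G| (each bucket then has >= m tuples)."""
    return m * maxsize(sorted(sa_values), eps) <= len(sa_values)

print(can_partition([10, 40, 20, 25, 50, 30], eps=6, m=2))  # True
print(can_partition([10, 40, 20, 25, 50, 30], eps=6, m=4))  # False: 4*2 > 6
```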
Algorithm • Step 1: Splitting • Mondrian, ICDE 2006 • Splitting is based only on QI attributes • Iteratively find the median of the frequency sets on one selected QI dimension to cut G into G1 and G2, making sure G1 and G2 both remain legal to partition (a simplified sketch follows)
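A simplified sketch of the splitting step, reusing can_partition from above; tuples are (qi_vector, sa_value) pairs and dims lists the QI dimension indices. The real Mondrian chooses dimensions and frequency-set medians more carefully; this version just tries each dimension in turn:

```python
def split(tuples, eps, m, dims):
    """Top-down Mondrian-style splitting (simplified). Cut the group at
    the median of a QI dimension as long as both halves stay legal,
    i.e. both can still be partitioned into (eps, m)-anonymous buckets."""
    for d in dims:
        tuples = sorted(tuples, key=lambda t: t[0][d])  # order by QI dim d
        mid = len(tuples) // 2
        g1, g2 = tuples[:mid], tuples[mid:]
        if g1 and g2 \
           and can_partition([t[1] for t in g1], eps, m) \
           and can_partition([t[1] for t in g2], eps, m):
            return split(g1, eps, m, dims) + split(g2, eps, m, dims)
    return [tuples]  # no legal cut on any dimension: stop splitting
```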
Algorithm • Splitting ((6, 2)-anonymity) [figure: example tuples with SA values 10, 40, 20, 25, 50, 30]
Algorithm • Step 2: Partitioning • After Step 1 stops • Check every group G produced by splitting • Release G directly if it satisfies (ε, m)-anonymity • Otherwise partition G, then release the new buckets (the two steps are combined in the sketch below)
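Putting the two steps together, sketched with the helpers defined earlier (satisfies_em_anonymity, can_partition, split, partition); the micro-table is hypothetical:

```python
def publish(table, eps, m, dims):
    """Step 1: split on QI attributes. Step 2: for each resulting group,
    release its SA values directly if the group already satisfies
    (eps, m)-anonymity; otherwise partition it and release the buckets."""
    released = []
    for g in split(table, eps, m, dims):
        sa = [t[1] for t in g]
        if satisfies_em_anonymity([sa], eps, m):
            released.append(sa)
        else:
            released.extend(partition(sa, eps).values())
    return released

# Hypothetical micro-table: ((age,), income) pairs, (6, 2)-anonymity.
table = [((25,), 10), ((31,), 40), ((27,), 20), ((40,), 25),
         ((38,), 50), ((52,), 30)]
print(publish(table, eps=6, m=2, dims=[0]))  # groups/buckets of SA values
```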
Algorithm • Partitioning ((6, 2)-anonymity) [figure: the same example, SA values 10, 40, 20, 25, 50, 30, hashed into buckets]
Next… • What is PPDP • Existing Privacy Principles • Proximity Attack • (ε, m)-anonymity • Determine ε and m • Algorithm • Experiments and Conclusion
Experiments • Real database SAL (http://ipums.org) • Attributes: Age, Birthplace, Occupation, and Income, with domains [16, 93], [1, 710], [1, 983], and [1k, 100k], respectively • 500K tuples • Compared against a perturbation method (OLAP, SIGMOD 2005)
Experiments - Utility • Count queries, with a workload of 1000 queries (a sketch of the evaluation follows)
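The slides don't show the query details; below is a hedged sketch of how count-query utility is typically measured on generalized data, assuming a uniform-spread estimate over each generalized interval (a common convention, not necessarily the paper's exact estimator). All data and query ranges are made up:

```python
import random

random.seed(1)

# Hypothetical published groups: (generalized Age interval, group size).
groups = [((20, 30), 3), ((25, 45), 4), ((40, 60), 5)]
# Hypothetical original Age values behind those groups.
original = [21, 24, 29, 26, 33, 38, 44, 41, 47, 52, 55, 59]

def true_count(lo, hi):
    return sum(1 for a in original if lo <= a <= hi)

def est_count(lo, hi):
    """Uniform-spread estimate: each group contributes in proportion to
    the overlap between its generalized interval and the query range."""
    est = 0.0
    for (glo, ghi), size in groups:
        overlap = max(0, min(ghi, hi) - max(glo, lo))
        est += size * overlap / (ghi - glo)
    return est

# A workload of 1000 random count queries, as in the experiments.
errors = []
for _ in range(1000):
    lo = random.randint(16, 75)
    hi = lo + random.randint(1, 15)
    t = true_count(lo, hi)
    if t > 0:
        errors.append(abs(est_count(lo, hi) - t) / t)
print(f"average relative error over {len(errors)} answerable queries:",
      sum(errors) / len(errors))
```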
Conclusion • Discussed most of the existing privacy principles in PPDP • Identified the proximity attack and proposed (ε, m)-anonymity to prevent it • Verified experimentally that the method is effective and efficient