Privacy-Preserving Data Publishing

Presentation Transcript


  1. Privacy-Preserving Data Publishing Donghui Zhang Northeastern University Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

  2. motivation • several agencies, institutions, bureaus, organizations make (sensitive) data involving people publicly available • termed microdata (vs. aggregated macrodata) used for analysis • often required and imposed by law • to protect privacy microdata are sanitized • explicit identifiers (SSN, name, phone #) are removed • is this sufficient for preserving privacy? • no! susceptible to link attacks • publicly available databases (voter lists, city directories) can reveal the “hidden” identity

  3. link attack example • [Sweeney01] managed to re-identify the medical record of the governor of Massachusetts • MA collects and publishes sanitized medical data for state employees (microdata): left circle • voter registration list of MA (publicly available data): right circle • looking for the governor’s record, join the tables: • 6 people had his birth date • 3 were men • 1 in his zipcode • regarding the US 1990 census data • 87% of the population are unique based on (zipcode, gender, dob)
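
The join described on this slide can be sketched in a few lines. This is only an illustration: the two tables, the names, and the function `link_attack` are hypothetical and are not taken from the slides or from [Sweeney01].

```python
# Hypothetical toy data; none of these records come from the slides.
medical = [  # published microdata, explicit identifiers already removed
    {"zipcode": "02138", "gender": "M", "dob": "1950-01-15", "diagnosis": "flu"},
    {"zipcode": "02139", "gender": "F", "dob": "1962-07-04", "diagnosis": "asthma"},
]
voters = [  # publicly available list that still carries names
    {"name": "A. Smith", "zipcode": "02138", "gender": "M", "dob": "1950-01-15"},
    {"name": "B. Jones", "zipcode": "02139", "gender": "F", "dob": "1962-07-04"},
    {"name": "C. Brown", "zipcode": "02139", "gender": "M", "dob": "1971-03-22"},
]

QI = ("zipcode", "gender", "dob")  # attributes shared by both tables


def link_attack(published, public, qi=QI):
    """Join the two tables on the QI attributes and report unique matches."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[a] for a in qi), []).append(person)
    reidentified = []
    for record in published:
        matches = index.get(tuple(record[a] for a in qi), [])
        if len(matches) == 1:  # exactly one candidate: the identity is revealed
            reidentified.append((matches[0]["name"], record["diagnosis"]))
    return reidentified


leaks = link_attack(medical, voters)
```

In this toy example both published records match exactly one voter, so removing the explicit identifiers alone protects neither of them.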

  4. Microdata

  5. Inference Attack [Figure: a published table with its quasi-identifier (QI) attributes marked, and an adversary.]

  6. k-anonymity [Samarati and Sweeney02] • Transform the QI values into less specific forms (generalize)

  7. Generalization • Transform each QI value into a less specific form [Figure: a generalized table, and an adversary.]
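
A minimal sketch of generalization followed by a k-anonymity check, using the standard definition that every combination of QI values must be shared by at least k records. The records, the interval width, and the zipcode-prefix scheme are assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter


def generalize(record, age_width=10, zip_digits=1):
    """Coarsen age to an interval and zipcode to a prefix (hypothetical scheme)."""
    lo = (record["age"] // age_width) * age_width
    zipcode = record["zipcode"]
    return {
        "age": (lo, lo + age_width - 1),
        "zipcode": zipcode[:zip_digits] + "*" * (len(zipcode) - zip_digits),
        "disease": record["disease"],
    }


def is_k_anonymous(table, k, qi=("age", "zipcode")):
    """True if every combination of QI values is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in qi) for r in table)
    return all(c >= k for c in counts.values())


# Hypothetical microdata (values invented for the example).
raw = [
    {"age": 21, "zipcode": "12000", "disease": "flu"},
    {"age": 23, "zipcode": "14000", "disease": "pneumonia"},
    {"age": 24, "zipcode": "18000", "disease": "dyspepsia"},
    {"age": 36, "zipcode": "25000", "disease": "flu"},
    {"age": 37, "zipcode": "27000", "disease": "gastritis"},
]
generalized = [generalize(r) for r in raw]
```

Here the raw table is not 2-anonymous because every record has unique QI values, while the generalized table forms one group of three records and one group of two.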

  8. Graphically… [Figure: the records plotted as points in two dimensions, with tick values 21 to 56 on one axis and 12000 to 35000 on the other; labels: Alice, Bob.]

  9. Why not… [Figure: plot with the same axis tick values as on the previous slide.] • How many people with age in [30, 50] contracted flu?

  10. k-anonymity • How many people with age in [30, 50] contracted flu? • generalization with low utility: answers queries less accurately: [0..3] • generalization with high utility: answers queries more accurately: 2
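
One way to see why coarse groups give the range [0..3] while finer groups give the exact answer 2 is to bound the count from the generalized groups alone. The sketch below assumes each group is summarized as (minimum age, maximum age, number of flu cases); the group boundaries are invented so that the numbers match the slide.

```python
def count_bounds(groups, lo, hi):
    """Bound the number of matching records with age in [lo, hi].

    groups: list of (age_min, age_max, matching_count) per QI group.
    A group fully inside the range contributes to both bounds; a group that
    only overlaps the range may contribute anything from 0 to its count.
    """
    lower = upper = 0
    for age_min, age_max, n in groups:
        if age_min > hi or age_max < lo:
            continue  # no overlap with the query range
        upper += n
        if lo <= age_min and age_max <= hi:
            lower += n
    return lower, upper


# Hypothetical groupings of the same records.
coarse = [(21, 56, 3)]                            # one large rectangle
fine = [(21, 24, 1), (36, 43, 2), (52, 56, 0)]    # smaller rectangles

coarse_answer = count_bounds(coarse, 30, 50)
fine_answer = count_bounds(fine, 30, 50)
```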

  11. k-anonymity with utility • Among all generalizations that enforce k-anonymity, we should maximize utility by minimizing the “rectangle” sizes! • Several measures exist, e.g. minimizing the maximum perimeter of the rectangles.
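
The maximum-perimeter measure mentioned above can be computed as follows. This is a sketch under the assumption that each QI group is a list of 2-D points and that its "rectangle" is the minimum bounding rectangle of those points; the helper names are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points as (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def perimeter(rect):
    x_min, y_min, x_max, y_max = rect
    return 2 * ((x_max - x_min) + (y_max - y_min))


def max_perimeter(groups):
    """Utility measure: the largest rectangle perimeter over all groups (lower is better)."""
    return max(perimeter(mbr(g)) for g in groups)


# Hypothetical partitioning into two groups.
groups = [[(0, 0), (1, 2)], [(5, 5), (8, 6)]]
worst = max_perimeter(groups)
```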

  12. Mondrian [LDR06] Recursive half-plane partitioning, alternating dimensions. let k=2
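
A simplified sketch of the recursive partitioning named on this slide: split at the median, alternate the split dimension, and stop when a further split would leave a side with fewer than k points. This is not the full Mondrian algorithm of [LDR06]; in particular, the strict alternation and the treatment of ties (points with equal coordinates may land on either side) are simplifying assumptions.

```python
def mondrian_sketch(points, k, dim=0):
    """Recursively split 2-D points at the median, alternating dimensions.

    Returns a list of groups, each holding at least k points
    (provided the input itself has at least k points).
    """
    if len(points) < 2 * k:
        return [points]  # splitting would leave one side with fewer than k points
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    left, right = ordered[:mid], ordered[mid:]
    return mondrian_sketch(left, k, 1 - dim) + mondrian_sketch(right, k, 1 - dim)


# Hypothetical points (age, zipcode) with k = 2 as on the slide.
pts = [(21, 12000), (22, 14000), (23, 25000), (24, 18000),
       (36, 27000), (37, 33000), (40, 20000), (41, 26000)]
parts = mondrian_sketch(pts, k=2)
```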

  13. Mondrian [LDR06] Unbounded approximation ratio! let k=4

  14. Our contributions [DXT+07] • Proved that finding the optimal partitioning is NP-hard. • Proved that finding a partitioning with approximation ratio less than 1.25 is also NP-hard. • Provided three algorithms with different tradeoffs between complexity and approximation ratio.

  15. Divide-And-Group (DAG) • Divide the space into square cells of a proper size • Find a set of non-overlapping tiles of 2 x 2 cells to cover the points, such that each tile covers at least k points • Assign the remaining (uncovered) points to the nearest tile

  16. Min-MBR-Group (MMG) • For each point p, find the smallest MBR which covers at least k points including p • Find a set of non-overlapping MBRs from the result of the previous step • Assign the points to the nearest MBR

  17. Nearest-Neighbor-Group (NNG) • For each point p, find the MBR which covers p and its k-1 nearest neighbors • Find a set of non-overlapping MBRs from the result of the previous step • Assign the points to the nearest MBR
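
The three NNG steps can be sketched as below. The slide does not say how the non-overlapping set is chosen, so the greedy smallest-perimeter-first selection here is an assumption, as are all helper names; this is not the authors' implementation from [DXT+07].

```python
import math


def mbr(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def perimeter(r):
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlaps(a, b):
    # Rectangles that merely touch are treated as overlapping.
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


def dist_to_rect(p, r):
    dx = max(r[0] - p[0], 0, p[0] - r[2])
    dy = max(r[1] - p[1], 0, p[1] - r[3])
    return math.hypot(dx, dy)


def nng_sketch(points, k):
    # Step 1: for each point, the MBR of the point and its k-1 nearest neighbors.
    candidates = set()
    for p in points:
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]  # includes p
        candidates.add(mbr(nearest))
    # Step 2 (assumed greedy rule): keep small MBRs that overlap no chosen one.
    chosen = []
    for r in sorted(candidates, key=perimeter):
        if not any(overlaps(r, c) for c in chosen):
            chosen.append(r)
    # Step 3: assign every point to its nearest chosen MBR.
    groups = {r: [] for r in chosen}
    for p in points:
        groups[min(chosen, key=lambda r: dist_to_rect(p, r))].append(p)
    return groups


# Hypothetical points: two tight clusters and one point in between.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 5)]
result = nng_sketch(pts, k=3)
```

Because each chosen MBR was built from k points that lie inside it and the chosen MBRs are disjoint, every resulting group holds at least k points.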

  18. Analysis

  19. Drawback of k-anonymity • In a QI group, if many records have the same sensitive attribute value... • If Bob is in this group, he must have pneumonia. [Table: quasi-identifier (QI) attributes and a sensitive attribute.]

  20. l-diversity [ICDE06] • A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m/l times in the QI-group. • A table is l-diverse, iff all of its QI-groups are l-diverse. • The above table is 2-diverse. [Table: quasi-identifier (QI) attributes and a sensitive attribute, in 2 QI-groups.]

  21. What l-diversity guarantees • From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l A 2-diverse generalized table A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
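
The l-diversity definition and the 1/l confidence bound translate directly into a short check. In this sketch a table is represented simply as a list of QI-groups, each a list of sensitive values; that representation and the example values are assumptions made for illustration.

```python
from collections import Counter


def is_l_diverse_group(sensitive_values, l):
    """Each sensitive value appears at most m/l times, where m is the group size."""
    m = len(sensitive_values)
    return all(count * l <= m for count in Counter(sensitive_values).values())


def is_l_diverse(table, l):
    """A table is l-diverse iff all of its QI-groups are l-diverse."""
    return all(is_l_diverse_group(group, l) for group in table)


def max_confidence(sensitive_values):
    """Best guess of an adversary without prior knowledge: the most frequent value."""
    return max(Counter(sensitive_values).values()) / len(sensitive_values)


# Hypothetical table with two QI-groups.
table = [
    ["flu", "flu", "pneumonia", "bronchitis"],
    ["dyspepsia", "gastritis"],
]
```

For this table every group is 2-diverse, and the adversary's confidence in any group is at most 1/2.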

  22. Problem with multi-publishing • A hospital keeps track of the medical records collected in the last three months. • The microdata table T(1), and its generalization T*(1), published in Apr. 2007. 2-diverse Generalization T*(1) Microdata T(1)

  23. Problem with multi-publishing • Bob was hospitalized in Mar. 2007 2-diverse Generalization T*(1)

  24. Problem with multi-publishing • One month later, in May 2007 Microdata T(1)

  25. Problem with multi-publishing • One month later, in May 2007 • Some obsolete tuples are deleted from the microdata. Microdata T(1)

  26. Problem with multi-publishing • Bob’s tuple stays. Microdata T(1)

  27. Problem with multi-publishing • Some new records are inserted. Microdata T(2)

  28. Problem with multi-publishing • The hospital published T*(2). 2-diverse Generalization T*(2) Microdata T(2)

  29. Problem with multi-publishing • Consider the previous adversary. 2-diverse Generalization T*(2)

  30. Problem with multi-publishing • What the adversary learns from T*(1). • What the adversary learns from T*(2). • So Bob must have contracted dyspepsia! • A new generalization principle is needed.

  31. m-invariance [SIGMOD07] • A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if • T*(1), …, T*(n) are m-unique, and • each individual has the same signature in every generalized table in which s/he is involved. • Explanation • m-unique: every QI group contains at least m tuples, all with different sensitive values • signature: the set of sensitive values in the individual’s QI group.

  32. m-unique • A generalized table T*(j) is m-unique, if and only if • each QI-group in T*(j) contains at least m tuples • all tuples in the same QI-group have different sensitive values. A 2-unique generalized table
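
A minimal check of the two m-uniqueness conditions, assuming a generalized table is given as a list of QI-groups and each group as a list of sensitive values (a representation chosen for this sketch, not one prescribed by the slides):

```python
def is_m_unique(table, m):
    """Every QI-group has at least m tuples, all with different sensitive values."""
    return all(len(g) >= m and len(set(g)) == len(g) for g in table)


# Hypothetical generalized tables.
table_ok = [["dyspepsia", "bronchitis"], ["flu", "gastritis", "dyspepsia"]]
table_dup = [["flu", "flu", "gastritis"]]  # repeated sensitive value in one group
```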

  33. Signature • The signature of Bob in T*(1) is {dyspepsia, bronchitis} • The signature of Jane in T*(1) is {dyspepsia, flu, gastritis} T*(1)

  34. The m-invariance principle • Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) <= 1/m
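
Signatures and m-uniqueness can be combined into a checker for m-invariance. The sketch below assumes each release is a dictionary with the sensitive values of every QI-group (`groups`) and the group of every individual (`member_of`); this layout, the group ids, and the individuals other than Bob and Alice are hypothetical.

```python
def signature(release, person):
    """Set of sensitive values in the QI-group that contains the person."""
    return frozenset(release["groups"][release["member_of"][person]])


def is_m_invariant(releases, m):
    seen = {}  # person -> signature in the first release that contains the person
    for release in releases:
        for values in release["groups"].values():
            if len(values) < m or len(set(values)) != len(values):
                return False  # the release is not m-unique
        for person in release["member_of"]:
            sig = signature(release, person)
            if seen.setdefault(person, sig) != sig:
                return False  # the person's signature changed between releases
    return True


t1 = {"groups": {"g1": ["dyspepsia", "bronchitis"]},
      "member_of": {"Bob": "g1", "Alice": "g1"}}
# Re-publication that changes Bob's signature.
t2_bad = {"groups": {"h1": ["dyspepsia", "gastritis"]},
          "member_of": {"Bob": "h1", "Carl": "h1"}}
# Re-publication that keeps Bob's signature; the bronchitis tuple in h1
# stands for a counterfeit and belongs to no individual.
t2_good = {"groups": {"h1": ["dyspepsia", "bronchitis"], "h2": ["gastritis", "flu"]},
           "member_of": {"Bob": "h1", "Carl": "h2", "Dana": "h2"}}
```

With `t2_bad`, intersecting Bob's two signatures leaves only dyspepsia, which is the kind of inference the principle is meant to rule out.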

  35. The m-invariance principle • Lemma: let {T*(1), …, T*(n-1)} be m-invariant. {T*(1), …, T*(n-1), T*(n)} is also m-invariant, if and only if {T*(n-1), T*(n)} is m-invariant • Only T*(n-1) is needed for the generation of T*(n). • In the sequence T*(1), T*(2), …, T*(n-2), T*(n-1), T*(n), the tables T*(1), …, T*(n-2) can be discarded.

  36. Solution idea • Goal: Given T(n) and T*(n-1), create T*(n) such that {T*(n-1), T*(n)} is m-invariant. • Idea: create counterfeits. • Optimization goal: impose as little generalization as possible.

  37. Microdata T(2) Counterfeited generalization T*(2) The auxiliary relation R(2) for T*(2)

  38. Generalization T*(1) Counterfeited Generalization T*(2) The auxiliary relation R(2) for T*(2)

  39. A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if • T*(1), …, T*(n) are m-unique, and • each individual has the same signature in every generalized table in which s/he is involved. Generalization T*(1) Generalization T*(2)

  42. In case of corruption… • If an adversary knows from Alice that she has bronchitis, he can conclude that Bob has dyspepsia. 2-diverse Generalization Microdata

  43. Anti-corruption publishing [ICDE08] • We formalized anti-corruption publishing, by modeling the degree of privacy preservation as a function of an adversary’s background knowledge. • We proposed a solution, by integrating generalization with • perturbation: switch selected records’ sensitive information. • stratified sampling: sample some records from each QI group.

  44. Summary • Introduced the problem of privacy-preserving publishing. • Two principles: • k-anonymity • l-diversity • Two extensions: • multi-publishing • corruption
