
Privacy and k-Anonymity

Guy Sagy, November 2008, Seminar in Databases (236826).


Presentation Transcript


  1. Privacy and k-Anonymity Guy Sagy November 2008 Seminar in Databases (236826)

  2. Outline • Introduction • k-Anonymity • Generalization & Suppression • MinGen – Theoretical Algorithm • Mondrian – A greedy partition algorithm

  3. What is Privacy? • Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. • Sharing this collected information is valuable in both research and business, but publishing the data may put personal privacy at risk. • Objective: maximize data utility while limiting disclosure risk to an acceptable level. • Note: • There is no clear definition of “disclosure” or of “acceptable level”. • This is not the traditional security of data, e.g. access control, theft, hacking, etc.

  4. Example • For medical research (e.g., genes, infectious diseases), a hospital has person-specific patient data which it wants to publish. • It wants to publish the data such that: • The information remains practically useful • The identity of an individual cannot be determined • An adversary might infer the secret/sensitive data from the published database

  5. Example – cont. • The data contains: • Identifiers - {name, ssn} • Non-Sensitive data - {zip-code, nationality, age} • Sensitive data - { medical condition, salary, location }

  6. Example – cont. [SW02-A] • Voter List:
       #  Name   Zip    Age  Nationality
       1  John   13053  28   American
       2  Bob    13067  29   American
       3  Chris  13053  23   American
     • Published Data • Do we have a privacy violation? Data leak!

  7. Example – cont. [SW02-A] • Medical data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge, Zip, Birthdate, Gender • Voter List: Name, Address, Date registered, Party affiliation, Date last voted, Zip, Birthdate, Gender • Quasi-Identifier (QI): the attributes that appear in both, i.e. Zip, Birthdate, Gender • The Group Insurance Commission (GIC) in Massachusetts sold data on state employees' health that was believed to be anonymous. • The voter registration list for Cambridge, Massachusetts was sold for $20. • William Weld was governor of Massachusetts: • He lived in Cambridge, Massachusetts • Six people had his particular birth date • Three of them were men • He was the only one in his 5-digit ZIP code.

  8. Example-2 – AOL (2006)

  9. Example-2 – cont.

  10. Example-3

  11. k-Anonymity [SW02-A] • Change the data in such a way that for each tuple in the resulting table there are at least (k-1) other tuples with the same value for the quasi-identifier. The result is a k-anonymized table. • This is a 4-anonymized table. Why?
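The definition on this slide can be checked mechanically. The following is a minimal Python sketch, not taken from the slides: the function name `is_k_anonymous`, the dictionary row format, and the sample values are all assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Return True if every quasi-identifier value combination
    occurs in at least k rows of the table."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical released table (values invented for illustration).
table = [
    {"zip": "130**", "age": "<30", "condition": "Heart disease"},
    {"zip": "130**", "age": "<30", "condition": "Flu"},
    {"zip": "130**", "age": "3*", "condition": "Cancer"},
    {"zip": "130**", "age": "3*", "condition": "Flu"},
]

print(is_k_anonymous(table, ["zip", "age"], 2))  # True
print(is_k_anonymous(table, ["zip", "age"], 3))  # False
```

Only the quasi-identifier columns are grouped; the sensitive column plays no part in the check.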

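  12. K-Anonymity – Formal Definition • $RT$ - Released Table • $(A_1, A_2, \ldots, A_n)$ - Attributes • $QI_{RT}$ - Quasi-Identifier • $RT[QI_{RT}]$ – Projection of $RT$ on $QI_{RT}$ • $RT$ satisfies k-anonymity if and only if each sequence of values in $RT[QI_{RT}]$ appears at least $k$ times in $RT[QI_{RT}]$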

  13. K-Anonymity Example [SW02-B] Example of k-anonymity, where k=2 and QI={Country, Birth, Gender, ZIP}

  14. K-Anonymity – The challenge • Theorem 1 in [SW02-B] claims: Let $RT(A_1, \ldots, A_n)$ be a table, $QI_{RT} = (A_i, \ldots, A_j)$ be the quasi-identifier associated with $RT$, $\{A_i, \ldots, A_j\} \subseteq \{A_1, \ldots, A_n\}$, and let $RT$ satisfy k-anonymity. Then each sequence of values in $RT[A_x]$ appears with at least $k$ occurrences in $RT[QI_{RT}]$ for $x = i, \ldots, j$. • Can we use this property to easily build a k-anonymous table? (Does the converse hold: if each sequence of values in $RT[A_x]$ appears with at least $k$ occurrences, is the table k-anonymous?)

  15. K-Anonymity – The challenge – cont. No !!!
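A small counterexample makes the "No" concrete. The Python sketch below uses an invented four-row table (the attribute names and values are assumptions, not from the slides): every single attribute value occurs twice, yet every combination of the two attributes occurs only once, so the table is not 2-anonymous.

```python
from collections import Counter

def min_count(rows, attrs):
    """Smallest number of rows sharing one value combination of `attrs`."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return min(counts.values())

# Hypothetical table: each attribute alone looks "2-anonymous".
rows = [
    {"sex": "M", "zip": "1305*"},
    {"sex": "M", "zip": "1306*"},
    {"sex": "F", "zip": "1305*"},
    {"sex": "F", "zip": "1306*"},
]

per_attribute = {a: min_count(rows, [a]) for a in ("sex", "zip")}
joint = min_count(rows, ["sex", "zip"])

print(per_attribute)  # {'sex': 2, 'zip': 2}
print(joint)          # 1, so the table is not 2-anonymous over (sex, zip)
```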

  16. How to create k-Anonymity? • Generalization • Replace the original value by a semantically consistent but less specific value • Suppression • Data is not released at all • Can be viewed as the top level of a generalization hierarchy

  17. Generalization & Hierarchies • ZIP: Z0 = {13053, 13058, 13063, 13067} → Z1 = {1305*, 1306*} → Z2 = {130**} → Z3 = {*****} • Nationality: {Canadian, US} → American, {Indian, Japanese} → Asian, {American, Asian} → * • Age: {28, 29} → < 30, {36, 35} → 3*, {< 30, 3*} → < 40 → *
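The ZIP and Nationality hierarchies of this slide can be encoded directly. This is a hedged Python sketch: the names `ZIP_MASKED_DIGITS`, `generalize_zip`, `NATIONALITY_PARENT`, and `generalize_nationality` are hypothetical, and the parent links are read off the slide's hierarchy diagram.

```python
# Trailing digits masked at each level of the slide's ZIP hierarchy:
# Z0 = full code, Z1 = 1305*, Z2 = 130**, Z3 = ***** (full suppression).
ZIP_MASKED_DIGITS = [0, 1, 2, 5]

def generalize_zip(zip_code, level):
    """Generalize a 5-digit ZIP code to hierarchy level 0..3."""
    masked = ZIP_MASKED_DIGITS[level]
    return zip_code[: len(zip_code) - masked] + "*" * masked

# Parent of each value in the Nationality hierarchy (assumed from the slide).
NATIONALITY_PARENT = {
    "Canadian": "American", "US": "American",
    "Indian": "Asian", "Japanese": "Asian",
    "American": "*", "Asian": "*",
}

def generalize_nationality(value, level):
    """Climb `level` steps up the hierarchy; '*' is the suppressed top."""
    for _ in range(level):
        value = NATIONALITY_PARENT.get(value, "*")
    return value

print([generalize_zip("13053", lv) for lv in range(4)])
# ['13053', '1305*', '130**', '*****']
print(generalize_nationality("US", 1), generalize_nationality("Indian", 2))
# American *
```

Suppression appears here as the last level of each hierarchy, which is one way to read the relationship between the two operations.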

  18. Generalization & Hierarchies • The number of generalized tables is: ($DGH_i$ = maximum generalization level of $A_i$) • Note: not every generalization creates a k-anonymous table
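To get a feel for the size of the search space, the sketch below counts generalizations under one simplifying assumption that the slide does not state: every value of an attribute is generalized to the same level (full-domain generalization), so attribute $A_i$ contributes $DGH_i + 1$ choices. Cell-level generalization gives a far larger count. The maximum levels (3, 3, 2) are read off the ZIP, Age, and Nationality hierarchies of the previous slide.

```python
from itertools import product
from math import prod

# Maximum generalization level per attribute (assumed from the hierarchies).
max_levels = {"zip": 3, "age": 3, "nationality": 2}

# One choice of level per attribute, levels 0..DGH_i inclusive.
count = prod(h + 1 for h in max_levels.values())

# Enumerating the level vectors explicitly gives the same number.
level_vectors = list(product(*(range(h + 1) for h in max_levels.values())))

print(count)               # 48
print(len(level_vectors))  # 48
```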


  20. K-minimal Generalizations • Intuition: the one that does not generalize the data more than needed (over-generalizing decreases the utility of the published dataset!) • K-minimal generalization: $T_m$ is said to be a minimal generalization of $RT$ if • $T_m$ satisfies the k-anonymity requirement with respect to $QI_{RT}$ • $\forall T_z$: $RT \le T_z$, $T_z \le T_m$, $T_z$ satisfies the k-anonymity requirement with respect to $QI_{RT}$ $\Rightarrow$ $T_z = T_m$ (here $T \le T'$ means that $T'$ is a generalization of $T$)

  21. 2-minimal Generalizations There are many k-minimal anonymized tables – which one to pick? NOT a 2-minimal Generalization

  22. K-minimal Generalizations • There are many k-minimal generalizations – which one is preferred then? • No clear and “correct” answer: • The one that creates minimum distortion of the data • The one with the best normalized average equivalence class size metric • The one with minimum suppression • The one that best supports the research (least damage to the “interesting” attributes)
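The normalized average equivalence class size metric from [LDR06] is $C_{AVG} = (\text{total records} / \text{number of equivalence classes}) / k$; the ideal value is 1. A minimal Python sketch follows, where the function name `c_avg` and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def c_avg(rows, quasi_identifier, k):
    """Normalized average equivalence class size:
    (total records / number of equivalence classes) / k."""
    classes = Counter(tuple(r[a] for a in quasi_identifier) for r in rows)
    return (len(rows) / len(classes)) / k

# Hypothetical 3-anonymous tables with six records each.
fine = [{"zip": "1305*"}] * 3 + [{"zip": "1306*"}] * 3   # two classes of 3
coarse = [{"zip": "130**"}] * 6                          # one class of 6

print(c_avg(fine, ["zip"], 3))    # 1.0
print(c_avg(coarse, ["zip"], 3))  # 2.0
```

A lower score means the equivalence classes are closer to the minimum size k, so less information was generalized away.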

  23. Algorithm for finding minimal generalization [SW02-B] • Theoretical Model (MinGen) • Store the set of all possible generalizations of RT over QI in allgens • Store in protected all the tables from allgens that satisfy k-anonymity • Define a comparison measure, score • From protected, choose the table with the best score

  24. Algorithm for finding minimal generalization • The search space is exponential • The problem is NP-hard! • We present one proposed algorithm: the multidimensional algorithm Mondrian [LDR06] (K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006)

  25. Single Dimensional Partitioning • A single-dimensional partitioning defines, for each attribute $A_i$, a set of non-overlapping single-dimensional intervals that cover $D_{X_i}$ (the domain of $A_i$).

  26. Single Dimensional Partitioning [Figure: Age (ticks 20, 24, 26, 31, 38, 44) against Zip Code (intervals 2120–2129, 2130–2139, 2140–2149); label: 12 Areas of Partitioning]

  27. Multidimensional Partitioning • Assume all attributes are from a discrete numeric domain (every set can be mapped to one) • The domain of $A_i$ is denoted by $D_{X_i}$ • Each tuple can be represented as $(v_1, v_2, \ldots, v_d) \in D_{X_1} \times D_{X_2} \times \cdots \times D_{X_d}$ • A multidimensional partitioning defines a set of multidimensional regions.

  28. Multidimensional Partitioning – cont. Attributes = {ZipCode, Age}

  29. Multidimensional Partitioning – Why is it good ? Voter Registration Data Patient Data

  30. Multidimensional Partitioning – cont.
      Single Dimensional:
        Age    Sex     Zipcode   Disease
        25-28  Male    53710-11  Flu
        25-28  Male    53710-11  Bronchitis
        25-28  Male    53710-11  Broken Arm
        25-28  Male    53710-11  Bronchitis
        25-28  Female  53712     Hepatitis
        25-28  Female  53712     AIDS
      Multi Dimensional:
        Age    Sex     Zipcode   Disease
        25-26  Male    53710-11  Flu
        25-26  Male    53710-11  Bronchitis
        27-28  Male    53710-11  Broken Arm
        27-28  Male    53710-11  Bronchitis
        25-27  Female  53712     Hepatitis
        25-27  Female  53712     AIDS

  31. Finding k-Anonymous Multidimensional Partitioning • Given a set $P$ of unique (point, count) pairs, with points in d-dimensional space, is there a multidimensional partitioning for $P$ such that: • For every region $R_i$, $\sum_{p \in R_i} \mathrm{count}(p) \ge k$ or $\sum_{p \in R_i} \mathrm{count}(p) = 0$ (k-anonymity) • $C_{AVG} \le c$ for a positive constant $c$ (average number of records in each partition)? • This problem is NP-complete • Proof: reduction from Partition

  32. Mondrian - A Greedy Partitioning Algorithm [LDR06] Mondrian(partition) • if (no allowable multidimensional cut for partition) return φ : partition → summary • else • dim ← choose_dimension() • fs ← frequency_set(partition, dim) • splitVal ← find_median(fs) • lhs ← {t ∈ partition : t.dim ≤ splitVal} • rhs ← {t ∈ partition : t.dim > splitVal} • return Mondrian(rhs) ∪ Mondrian(lhs) [Figure: partitioning of points in the Age × Weight plane; k-anonymity, k = 3]
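The recursion on this slide translates almost line by line into Python. The sketch below is an illustration rather than the reference implementation of [LDR06]: trying dimensions in order of decreasing value range is an assumed heuristic for choose_dimension, a cut counts as allowable when both halves keep at least k points, and the summary of a region is taken to be the (min, max) range per dimension. The sample points are invented.

```python
def mondrian(partition, k, dims):
    """Greedy median partitioning of a list of numeric tuples.
    Returns a list of (summary, size) pairs, one per final region."""
    def span(d):
        vals = [t[d] for t in partition]
        return max(vals) - min(vals)

    # Assumed heuristic: try the dimension with the widest range first.
    for dim in sorted(dims, key=span, reverse=True):
        values = sorted(t[dim] for t in partition)
        split_val = values[(len(values) - 1) // 2]           # median
        lhs = [t for t in partition if t[dim] <= split_val]
        rhs = [t for t in partition if t[dim] > split_val]
        if len(lhs) >= k and len(rhs) >= k:                  # allowable cut
            return mondrian(rhs, k, dims) + mondrian(lhs, k, dims)

    # No allowable cut: summarize the region by its range per dimension.
    summary = tuple((min(t[d] for t in partition),
                     max(t[d] for t in partition)) for d in dims)
    return [(summary, len(partition))]

# Hypothetical (age, weight) points.
points = [(35, 50), (36, 55), (40, 60), (45, 62),
          (50, 70), (55, 75), (60, 80), (65, 85)]
regions = mondrian(points, k=3, dims=(0, 1))
for summary, size in regions:
    print(summary, size)
```

Each record would then be published with its region's ranges in place of its exact quasi-identifier values.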

  33. Mondrian – Example [LDR06] Anonymizations for two attributes with a discrete normal distribution ($\mu = 25$, $\sigma = 2$)

  34. Mondrian Quality • By definition of k-Anonymity: • From Theorem 2 in [LDR06]: the maximum number of points in any region ($R_i$) is $2d(k-1)+m$, where $m$ is the maximum number of copies of any distinct point in $P$ • For constant $d$, $m$, $k$: $C_{AVG} \le 2 \cdot C_{AVG}^{*}$

  35. Piet Mondrian (1872-1944) (*) wikipedia

  36. Privacy – Last Example

  37. Bibliography • [SW02-A] “k-Anonymity: A Model for Protecting Privacy”, L. Sweeney, 2002 • [SW02-B] “Achieving k-Anonymity Privacy Protection Using Generalization and Suppression”, L. Sweeney, 2002 • [LDR06] “Mondrian Multidimensional k-Anonymity”, K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006 • http://en.wikipedia.org/wiki/Piet_Mondrian • Presentations: • “Privacy In Databases”, B. Aditya Prakash • “K-Anonymity and Other Cluster-Based Methods”, Ge. Ruan
