460 likes | 1.01k Views
Privacy and k-Anonymity. Guy Sagy November 2008 Seminar in Databases (236826) . Outline. Introduction k-Anonymity Generalization & Suppression MinGen – Theoretical Algorithm Mondrian – A greedy partition algorithm. What is Privacy ?.
E N D
Privacy and k-Anonymity Guy Sagy November 2008 Seminar in Databases (236826)
Outline • Introduction • k-Anonymity • Generalization & Suppression • MinGen – Theoretical Algorithm • Mondrian – A greedy partition algorithm
What is Privacy ? • Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. • Sharing these collected information is valuable both in research and business. Publishing the data may put person privacy in risk. • Objective: Maximize data utility while limiting disclosure risk to an acceptable level • Note : • There is no clear definition for disclosure and acceptable level • Not the traditional security of data e.g. access control, theft, hacking etc.
Example • For medical research (e.g., Gene, infection diseases) a hospital has some person-specific patient data which it wants to publish • It wants to publish such that: • Information remains practically useful • Identity of an individual cannot be determined • Adversary might inferthe secret/sensitive data from the published database
Example – cont. • The data contains: • Identifiers - {name, ssn} • Non-Sensitive data - {zip-code, nationality, age} • Sensitive data - { medical condition, salary, location }
Data leak! # Name Zip Age Nationality Voter List 1 John 13053 28 American 2 Bob 13067 29 American 3 Chris 13053 23 American Example – cont [SW02-A] Published Data Do we have a privacy violation ?
Name Address Date registered Party affiliation Date last voted Ethnicity Visit date Diagnosis Procedure Medication Total charge Zip Birthdate Gender Medical data Voter List Example – cont[SW02-A] • The Group Insurance Commission (GIC) in Massachusetts sold a believed to be anonymous data of state employees health. • Voter registration list for Cambridge Massachusetts – sold for 20$ • William Weld was governor of Massachusetts- • Lived in Cambridge Massachusetts • Six people had his particular birth date • Three of them were men • He was the only with 5-digit ZIP code. Quasi Identifier)QI)
k-Anonymity [SW02-A] • Change data in such a way that for each tuple in the resulting table there are at least (k-1) other tuples with the same value for the quasi-identifier – k-Anonymized table This is a 4-anonymized Table. Why ?
K-Anonymity – Formal Definition • RT - Released Table • (A1,A2,…,An) - Attributes • QIRT - Quasi Identifier • RT[QIRT] – Projection of RT on QIRT
K-Anonymity Example [SW02-B] Example of k-anonymity, where k=2 and QI={Country, Birth, Gender, ZIP}
K-Anonymity – The challenge • Theorem 1 in [SW02-B] claims :Let RT(A1,...,An) be a table, QIRT=(Ai,…, Aj) be the quasi-identifier associated with RT, Ai,…,AjA1,…,An, and RT satisfy k-anonymity. Then, each sequence of values in RT[Ax] appears with at least k occurrences in RT[QIRT] for x=i,…,j. • Can we use this property for easily building of a k-Anonymity table ? (Can we claim the opposite ?)(each sequence of values in RT[Ax] appears with at least k occurrences then the table is k-anonymity?)
How to create k-Anonymity ? • Generalization • Replace the original value by a semantically consistent but less specific value • Suppression • Data not released at all • Can be viewed as first level of generalization Generalization Suppression
ZIP Z3={*****} 130 Z2={130**} 1305 1306 Z1={1305*,1306*} 13053 13058 13063 13067 Z0={13053,13058,13063,13067} Nationality Age * * < 40 American Asian < 30 3* Canadian US Indian Japanese 28 29 36 35 Generalization & Hierarchies Z3 Z2 Z2 Z1 Z1 Z0 Z0
Generalization & Hierarchies • The number of generalized tables is : (DGHi = Maximum generalization level of Ai) (note, not all generalization creates a k-anonymity table)
K-minimal Generalizations • Intuition: The one that does not generalize the data more than needed (decrease in utility of the published dataset!) • K-minimal generalization: Tm is said to be a minimal generalization of RT if • Tm satisfies the k-anonymity requirement with respect to QIRT • Tz: RTTz ,Tz Tm, Tz satisfies the k-anonymity requirement with respect to QIRT Tz=Tm
2-minimal Generalizations There are many k-minimal anonymized tables – which one to pick? NOT a 2-minimal Generalization
K-minimal Generalizations • There are many k-minimal generalizations – which one is preferred then? • No clear and “correct” answer : • The one that creates min. distortion to data, where distortion • Normalized average equivalence class size metric • The one with min. suppression • Best support the research (less damaging the “interesting” attributes)
Algorithm for finding minimal generalization [SW02-B] • Theoretical Model (MinGen) • Store the set of all possible generalizations of RT over QI into allgens • Store from allgensall the tables which satisfied k-anonymity into protected • Define comparing measure score • From protected choose the table with best score
Algorithm for finding minimal generalization • The search space is exponential • The problem is NP-Hard! • We present one proposed algorithm[LDR06]-LeFevre, D.J. DeWitt, R. Ramakrishnan,2006 -Multi-dimensional algorithm (Mondrian)
Single Dimensional Partitioning • A single dimensional partitioning defines for each attribute Ai , a set of non overlapping single-dimensional intervals that cover DXi. Data Partitioning
Single Dimensional Partitioning Age 20 24 26 12 Areas of Partitioning 31 38 44 Zip Code 2120 2129 2130 2139 2140 2149 26
Multidimensional Partitioning • Assume all attributes are from discrete numeric domain (every set can be mapped to a one) • The domain of Ai is denoted by DXi • Each tuple can be presented as (v1,v2,…,vd)DX1 DX2… DXn • A multidimensional partitioning defines a set of multidimensional regions.
Multidimensional Partitioning – cont. Attributes = {ZipCode,Age)
Multidimensional Partitioning – Why is it good ? Voter Registration Data Patient Data
Age Sex Zipcode Disease Age Sex Zipcode Disease 25-26 Male 53710-11 Flu 25-28 Male 53710-11 Flu 25-26 Male 53710-11 Bronchitis 25-28 Male 53710-11 Bronchitis 27-28 Male 53710-11 Broken Arm 25-28 Male 53710-11 Broken Arm 27-28 Male 53710-11 Bronchitis 25-28 Male 53710-11 Bronchitis Multidimensional Partitioning –cont. Female 53712 Hepatitis 25-27 25-28 Female 53712 Hepatitis Female 53712 AIDS 25-27 25-28 Female 53712 AIDS Single Dimensional Multi Dimensional
Finding k-Anonymous Multidimensional Partitioning • Given a set P of unique (point,count), with points in d-dimensional space, is there a multidimensional partitioning for P such that: • For every region Ri, pRicount(p)k or pRicount(p) =0 (k-anonymity) • CAVG c (positive constant)?(average number of records in each partition) • This problem is NP-Complete • Proof : reduction from partition
35 40 45 50 55 60 65 70 50 55 Age 60 65 70 75 Weight 80 85 Mondrian - A Greedy Partitioning Algorithm [LDR06] Mondrian(partition) • if (no allowable multidimensional cut for partition) return : partition summary • else • dim choose dimension() • fs frequency set(partition, dim) • splitVal find median(fs) • lhs {t partition : t.dim splitVal} • rhs {t partition : t.dim > splitVal} • return Mondrian(rhs) Mondrian(lhs) k-anonymity, k = 3
Mondrian – Example[LDR06] Anonymizations for two attributes with a discrete normal distribution (= 25, = 2)
Mondrian Quality • By definition of k-Anonymity: • From Theorem 2 in [LeFevre et al. 06’]:The maximum number of points in any region (Ri) is 2d*(k-1)+m, where m is the maximum number of copy of any distinct point in P • For constant d,m,k - CAVG2*CAVG*
Piet Mondrian (1872-1944) (*) wikipedia
Bibliography • [SW02-A] “k-ANONYMITY: A Mode for Protecting privacy”, L. Sweeney,2002 • [SW02-B] “Achieving k-Anonymity Privacy Protection Using Generalization and Suppression”, L. Sweeney, 2002 • [LDR06] “Mondrian Multidimensional k-Anonymity”,K. LeFevre, D.J. DeWitt, R. Ramakrishnan,2006 • http://en.wikipedia.org/wiki/Piet_Mondrian • Presentations: • “Privacy In Databases”, B. Aditya Prakash • “K-Anonymity and Other Cluster-Based Methods”, Ge. Ruan