Survey of Privacy Protection for Medical Data Sumathie Sundaresan Advisor : Dr. Huiping Guo
Abstract • Expanded scientific knowledge, combined with the growth of the Internet and the widespread use of computers, has increased the need for strong privacy protection for medical records. We have all heard stories of harassment that resulted from the lack of adequate privacy protection of medical records. • "...medical information is routinely shared with and viewed by third parties who are not involved in patient care .... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."
Methods • Generalization • k-anonymity • l-diversity • t-closeness • m-invariance • Personalized Privacy Preservation • Anatomy
Privacy-preserving data publishing • Microdata: the original, person-specific table (e.g., patient records)
Classification of Attributes • Key Attribute: • Name, address, cell phone number • Can uniquely identify an individual directly • Always removed before release. • Quasi-Identifier: • 5-digit ZIP code, birth date, gender • A set of attributes that can potentially be linked with external information to re-identify individuals • 87% of the U.S. population can be uniquely identified from these three attributes, according to 1990 Census summary data. • Suppressed or generalized
Classification of Attributes (Cont’d) • Sensitive Attribute: • Medical record, wage, etc. • Always released directly, since these attributes are what researchers need • Which attributes are treated as sensitive depends on the application requirement.
Inference attack • An adversary links the quasi-identifier (QI) attributes of a published table with external information to re-identify individuals.
Generalization • Transform the QI values into less specific forms
Generalization • Transform each QI value into a less specific form, producing a generalized table that an adversary can no longer link to a unique individual
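To make the idea concrete, here is a minimal sketch (not taken from the survey; attribute names and bucket widths are illustrative) of how QI values might be generalized in code, truncating ZIP codes and coarsening ages into ranges:

```python
def generalize_zip(zip_code: str, digits_kept: int = 3) -> str:
    """Replace the trailing digits of a ZIP code with '*' (e.g., 90032 -> 900**)."""
    return zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)

def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range such as '[20, 30)'."""
    low = (age // width) * width
    return f"[{low}, {low + width})"

record = {"zip": "90032", "age": 27, "disease": "flu"}
generalized = {
    "zip": generalize_zip(record["zip"]),   # '900**'
    "age": generalize_age(record["age"]),   # '[20, 30)'
    "disease": record["disease"],           # the sensitive value is kept as-is
}
print(generalized)
```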
K-Anonymity Sweeney proposed a formal protection model named k-anonymity • What is k-Anonymity? • The information for each person contained in the release cannot be distinguished from that of at least k−1 other individuals whose information also appears in the release. • Example: if you try to identify a man from a release, and the only information you have is his birth date and gender, then at least k people share those values, so you cannot tell which of them he is. This is k-anonymity.
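As an illustration (a minimal sketch, not code from the cited work), a table can be tested for k-anonymity by grouping rows on their QI values and checking every group's size:

```python
from collections import Counter

def is_k_anonymous(table, qi_attrs, k):
    """True if every combination of QI values occurs in at least k rows."""
    groups = Counter(tuple(row[a] for a in qi_attrs) for row in table)
    return all(size >= k for size in groups.values())

table = [
    {"zip": "900**", "age": "[20, 30)", "disease": "flu"},
    {"zip": "900**", "age": "[20, 30)", "disease": "gastritis"},
    {"zip": "914**", "age": "[30, 40)", "disease": "dyspepsia"},
]
print(is_k_anonymous(table, ["zip", "age"], 2))  # False: the last QI group has only 1 tuple
```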
Attacks Against K-Anonymity • Unsorted Matching Attack • This attack is based on the order in which tuples appear in the released table. • Solution: • Randomly sort the tuples before releasing.
Attacks Against K-Anonymity (Cont’d) • k-Anonymity does not provide privacy if: • Sensitive values in an equivalence class lack diversity (homogeneity attack) • The attacker has background knowledge (background knowledge attack) • Both are illustrated on a 3-anonymous patient table. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
l-Diversity • Distinct l-diversity • Each equivalence class has at least l well-represented sensitive values • Limitation: • Example: in one equivalence class there are ten tuples. In the “Disease” attribute, one is “Cancer”, one is “Heart Disease”, and the remaining eight are “Flu”. This satisfies distinct 3-diversity, but the attacker can still conclude that the target person’s disease is “Flu” with 80% accuracy. A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
l-Diversity (Cont’d) • Entropy l-diversity • Each equivalence class must not only have enough different sensitive values, but the sensitive values must also be distributed evenly enough: the entropy of the sensitive-value distribution in each equivalence class must be at least log(l). • Sometimes this may be too restrictive: when some values are very common, the entropy of the entire table may be very low. This leads to the less conservative notion below. • Recursive (c, l)-diversity • The most frequent value does not appear too frequently A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
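The following sketch (illustrative only; function and attribute names are assumptions, not the paper's code) checks both distinct l-diversity and entropy l-diversity, using the requirement that each equivalence class's entropy be at least log(l):

```python
import math
from collections import Counter, defaultdict

def equivalence_classes(table, qi_attrs):
    """Group rows that share the same (generalized) QI values."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in qi_attrs)].append(row)
    return groups.values()

def is_distinct_l_diverse(table, qi_attrs, sa, l):
    """Each equivalence class contains at least l distinct sensitive values."""
    return all(len({row[sa] for row in group}) >= l
               for group in equivalence_classes(table, qi_attrs))

def is_entropy_l_diverse(table, qi_attrs, sa, l):
    """The entropy of the SA distribution in each class is at least log(l)."""
    for group in equivalence_classes(table, qi_attrs):
        counts = Counter(row[sa] for row in group)
        n = len(group)
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        if entropy < math.log(l):
            return False
    return True
```

On the ten-tuple class from the previous slide (one Cancer, one Heart Disease, eight Flu), distinct 3-diversity holds, but the entropy is about 0.64 < log 3 ≈ 1.10, so entropy 3-diversity fails, reflecting the skew toward Flu.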
Limitations of l-Diversity l-diversity may be difficult and unnecessary to achieve. • A single sensitive attribute • Two values: HIV positive (1%) and HIV negative (99%) • Very different degrees of sensitivity • l-diversity is unnecessary to achieve • 2-diversity is unnecessary for an equivalence class that contains only negative records • l-diversity is difficult to achieve • Suppose there are 10000 records in total • To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes
Limitations of l-Diversity (Cont’d) l-diversity is insufficient to prevent attribute disclosure. Skewness Attack • Two sensitive values • HIV positive (1%) and HIV negative (99%) • Serious privacy risk • Consider an equivalence class that contains an equal number of positive and negative records: anyone in it is deemed 50% likely to be HIV positive, versus 1% in the overall population • l-diversity does not differentiate: • Equivalence class 1: 49 positive + 1 negative • Equivalence class 2: 1 positive + 49 negative • Yet the disclosure risks of the two classes are very different l-diversity does not consider the overall distribution of sensitive values
Limitations of l-Diversity (Cont’d) l-diversity is insufficient to prevent attribute disclosure. Similarity Attack (on a 3-diverse patient table) • Conclusion: • Bob’s salary is in [3k, 5k], which is relatively low • Bob has some stomach-related disease l-diversity does not consider semantic meanings of sensitive values
t-Closeness: A New Privacy Measure • Rationale • From a completely generalized table, an adversary with external knowledge learns only the overall distribution Q of sensitive values
t-Closeness: A New Privacy Measure • Rationale • From a released table, the adversary with external knowledge additionally learns the distribution Pi of sensitive values in each equi-class
t-Closeness: A New Privacy Measure • Rationale • Observations • Q should be treated as public • Knowledge gain comes in two parts: • About the whole population (from belief B0 to B1, by learning Q) • About specific individuals (from B1 to B2, by learning each Pi) • We bound the knowledge gain between B1 and B2 instead • Principle • The distance between Q and each Pi should be bounded by a threshold t.
How to calculate EMD • EMD for numerical attributes • Ordered distance is a metric: non-negative, symmetric, and satisfies the triangle inequality • Let ri = pi − qi; for an attribute with m ordered values, D[P, Q] = ( |r1| + |r1 + r2| + … + |r1 + r2 + … + r(m−1)| ) / (m − 1)
Earth Mover’s Distance • Example • P1 = {3k, 4k, 5k} and Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k} • Move 1/9 probability for each of the following pairs • 3k→5k, 3k→4k: cost 1/9·(2+1)/8 • 4k→8k, 4k→7k, 4k→6k: cost 1/9·(4+3+2)/8 • 5k→11k, 5k→10k, 5k→9k: cost 1/9·(6+5+4)/8 • Total cost: 1/9·27/8 = 0.375 • With P2 = {6k, 8k, 11k}, the total cost is 0.167 < 0.375. This makes more sense than the other two distance calculation methods, which would assign P1 and P2 the same distance to Q.
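The ordered-distance EMD can be computed with a single pass over the cumulative differences. The sketch below (illustrative, not the paper's code) reproduces the two worked examples; a table is then t-close when this distance between Q and every equi-class distribution Pi is at most t:

```python
def ordered_emd(p, q):
    """EMD over an ordered domain of m values:
    D[P, Q] = (1/(m-1)) * sum_i |r_1 + ... + r_i|, where r_i = p_i - q_i."""
    m = len(p)
    cumulative, total = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cumulative += p_i - q_i
        total += abs(cumulative)
    return total / (m - 1)

# Salary domain {3k, 4k, ..., 11k}; Q is the overall (uniform) distribution.
q  = [1/9] * 9
p1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]   # equi-class {3k, 4k, 5k}
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]   # equi-class {6k, 8k, 11k}
print(round(ordered_emd(p1, q), 3))  # 0.375
print(round(ordered_emd(p2, q), 3))  # 0.167
```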
Motivating Example • A hospital keeps track of the medical records collected in the last three months. • The microdata table T(1), and its generalization T*(1), published in Apr. 2007. 2-diverse Generalization T*(1) Microdata T(1)
Motivating Example • Bob was hospitalized in Mar. 2007 2-diverse Generalization T*(1)
Motivating Example • One month later, in May 2007 Microdata T(1)
Motivating Example • One month later, in May 2007 • Some obsolete tuples are deleted from the microdata. Microdata T(1)
Motivating Example • Bob’s tuple stays. Microdata T(1)
Motivating Example • Some new records are inserted. Microdata T(2)
Motivating Example • The hospital published T*(2). 2-diverse Generalization T*(2) Microdata T(2)
Motivating Example • Consider the previous adversary. 2-diverse Generalization T*(2)
Motivating Example • What the adversary learns from T*(1). • What the adversary learns from T*(2). • Intersecting the two sets of diseases that are possible for Bob leaves a single value. • So Bob must have contracted dyspepsia! • A new generalization principle is needed.
The critical absence phenomenon • A sensitive value needed to keep Bob's earlier candidate set consistent with what the adversary learned from T*(1) no longer appears in the current microdata T(2). • We refer to this as the critical absence phenomenon. • A new generalization method is needed.
Solution: counterfeit tuples • The microdata T(2) is published as a counterfeited generalization T*(2), accompanied by an auxiliary relation R(2) that describes the counterfeit tuples in T*(2) (contrast with the original Generalization T*(1)).
m-uniqueness • A generalized table T*(j) is m-unique, if and only if • each QI-group in T*(j) contains at least m tuples • all tuples in the same QI-group have different sensitive values. A 2-unique generalized table
Signature • The signature of Bob in T*(1) is {dyspepsia, bronchitis} • The signature of Jane in T*(1) is {dyspepsia, flu, gastritis} T*(1)
The m-invariance principle • A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if • T*(1), …, T*(n) are m-unique, and • each individual has the same signature in every generalized table s/he is involved.
Example • Generalization T*(1) and Generalization T*(2): every individual who appears in both releases keeps the same signature, so the sequence is m-invariant.
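As a rough sketch of the definitions above (not the paper's algorithm; the tuple layout (person_id, group_id, sensitive_value) is an assumption), m-uniqueness and m-invariance can be checked by computing each individual's signature per release and comparing signatures across releases:

```python
from collections import defaultdict

def signatures(release, m):
    """Return {person: signature} for one release, or None if the release
    is not m-unique. `release` is a list of (person_id, group_id, sa_value)."""
    groups = defaultdict(list)
    for person, group, value in release:
        groups[group].append((person, value))
    sig = {}
    for members in groups.values():
        values = {v for _, v in members}
        # m-uniqueness: at least m tuples, all with distinct sensitive values
        if len(members) < m or len(values) < len(members):
            return None
        for person, _ in members:
            sig[person] = frozenset(values)
    return sig

def is_m_invariant(releases, m):
    """True if every release is m-unique and each individual keeps the same
    signature in every release in which s/he appears."""
    seen = {}
    for release in releases:
        sig = signatures(release, m)
        if sig is None:
            return False
        for person, s in sig.items():
            if person in seen and seen[person] != s:
                return False
            seen[person] = s
    return True

# Bob keeps the signature {dyspepsia, bronchitis} in both releases.
t1 = [("Bob", "g1", "dyspepsia"), ("Alice", "g1", "bronchitis")]
t2 = [("Bob", "g1", "dyspepsia"), ("c1", "g1", "bronchitis")]  # c1: a counterfeit tuple
print(is_m_invariant([t1, t2], 2))  # True
```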
Motivation 1: Personalization • Andy does not want anyone to know that he had a stomach problem • Sarah does not mind at all if others find out that she had flu (illustrated with a 2-diverse table and an external database)
Motivation 2: SA generalization • How many female patients are there with age above 30? • Estimate from the generalized table, assuming ages are uniform within the generalized range [20, 60]: 4 ∙ (60 − 30) / (60 − 20) = 3 • Real answer: 1 (illustrated with an external database and a generalized table)
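The estimate on this slide follows from assuming that values are spread uniformly over the generalized interval; a tiny sketch with illustrative names:

```python
def estimated_count(group_size, low, high, threshold):
    """Estimate how many of `group_size` tuples whose age was generalized to
    [low, high] have age above `threshold`, assuming a uniform spread."""
    return group_size * (high - threshold) / (high - low)

print(estimated_count(4, 20, 60, 30))  # 3.0, while the true count is only 1
```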
Motivation 2: SA generalization (cont.) • Generalization of the sensitive attribute is beneficial in this case (a better generalized table, compared against the external database)
Personalized anonymity • We propose • a mechanism to capture personalized privacy requirements • criteria for measuring the degree of security provided by a generalized table
Guarding node • Andy does not want anyone to know that he had a stomach problem • He can specify “stomach disease” as the guarding node for his tuple • The data publisher should prevent an adversary from associating Andy with “stomach disease”
Guarding node • Sarah is willing to disclose her exact symptom • She can specify Ø as the guarding node for her tuple
Guarding node • Bill does not have any special preference • He can specify the guarding node for his tuple to be his sensitive value itself
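A minimal sketch of how guarding nodes might be evaluated (the taxonomy and function names are hypothetical, not from the paper): a requirement is breached if the adversary can associate the tuple's owner with any value in the subtree rooted at its guarding node.

```python
# Hypothetical disease taxonomy: internal node -> children.
TAXONOMY = {
    "any-disease":         ["respiratory-disease", "stomach-disease"],
    "respiratory-disease": ["flu", "pneumonia"],
    "stomach-disease":     ["gastritis", "dyspepsia", "ulcer"],
}

def subtree_leaves(node):
    """All leaf values in the subtree rooted at `node`."""
    children = TAXONOMY.get(node)
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= subtree_leaves(child)
    return leaves

def association_breaches(sensitive_value, guarding_node):
    """True if associating the individual with `sensitive_value` violates the
    guarding node (an empty guarding node means no restriction at all)."""
    if guarding_node is None:            # Sarah's case: guarding node = Ø
        return False
    return sensitive_value in subtree_leaves(guarding_node)

print(association_breaches("gastritis", "stomach-disease"))  # True  (Andy's requirement)
print(association_breaches("flu", None))                     # False (Sarah's requirement)
```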
Personalized anonymity • A table satisfies personalized anonymity with a parameter pbreach iff no adversary can breach the privacy requirement of any tuple with a probability above pbreach • If pbreach = 0.3, then any adversary should have at most a 30% probability of finding out that: • Andy had a stomach disease • Bill had dyspepsia • etc.