Privacy in Databases Umur Türkay 2006103319
Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)
Defining Privacy in DB Publishing
If the attacker uses only legitimate methods,
- can she infer the data I want to keep private? (the Decision Problem)
- how can I keep some data private while still publishing useful information? (the Optimization Problem)
(Diagram: Alice either modifies the data before publishing it, or publishes views V1, V2; the attacker combines the published information with external knowledge to learn the secret.)
Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)
Need for Privacy in DB Publishing
• Alice is an owner of person-specific data
• Public health agency, telecom provider, financial organization
• The person-specific data contains
• attribute values that can uniquely identify an individual
• { zip-code, gender, date-of-birth } and/or { name } and/or { SSN }
• sensitive information corresponding to individuals
• medical condition, salary, location
• Great demand for sharing person-specific data
• Medical research, new telecom applications
• Alice wants to publish this person-specific data such that
• the information remains practically useful
• the identity of the individuals cannot be determined
The Optimization Problem Motivating Example
Secret: Alice wants to publish hospital data, while the correspondence between name & disease stays private
The Optimization Problem Motivating Example (continued)
Published Data: Alice publishes the data without the Name
Attacker's Knowledge: Voter registration list

# | Name  | Zip   | Age | Nationality
1 | John  | 13067 | 45  | US
2 | Paul  | 13067 | 22  | US
3 | Bob   | 13067 | 29  | US
4 | Chris | 13067 | 23  | US

Joining the published data with the voter list on the quasi-identifier re-identifies the individuals: Data Leak!
The Optimization Problem Source of the Problem
Even if we do not publish the identifying attributes:
• Some combination of the remaining fields (a quasi-identifier) may still uniquely identify an individual
• The attacker can use it to join with other sources (e.g. the voter registration list) and identify the individuals
Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)
The Optimization Problem First-Cut Solution: k-Anonymity
Instead of returning the original data:
• Change the data such that for each tuple in the result there are at least k-1 other tuples with the same value for the quasi-identifier
e.g. a 4-anonymous table:

# | Zip   | Age  | Nationality | Condition
1 | 130** | < 40 | *           | Heart Disease
2 | 130** | < 40 | *           | Heart Disease
3 | 130** | < 40 | *           | Cancer
4 | 130** | < 40 | *           | Cancer
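As a rough illustration of the definition above, here is a minimal Python sketch that checks whether a published table satisfies k-anonymity for a given quasi-identifier. The function name `is_k_anonymous` and the dict-based table encoding are illustrative choices, not taken from the paper.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Check that every quasi-identifier combination occurs at least k times.

    rows: list of dicts (one per tuple); quasi_identifier: list of column names.
    This is an illustrative check, not the anonymization algorithm itself.
    """
    groups = Counter(tuple(row[col] for col in quasi_identifier) for row in rows)
    return all(count >= k for count in groups.values())

# The 4-anonymous table from the slide: all four tuples share the same
# generalized quasi-identifier (Zip, Age, Nationality).
table = [
    {"Zip": "130**", "Age": "<40", "Nationality": "*", "Condition": "Heart Disease"},
    {"Zip": "130**", "Age": "<40", "Nationality": "*", "Condition": "Heart Disease"},
    {"Zip": "130**", "Age": "<40", "Nationality": "*", "Condition": "Cancer"},
    {"Zip": "130**", "Age": "<40", "Nationality": "*", "Condition": "Cancer"},
]
print(is_k_anonymous(table, ["Zip", "Age", "Nationality"], k=4))  # True
```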
The Optimization Problem > k-Anonymity Generalization & Suppression
Different ways of modifying data:
• Randomization
• Data swapping
• …
• Generalization: replace a value with a less specific but semantically consistent value
• Suppression: do not release a value at all
The Optimization Problem > k-Anonymity Generalization Hierarchies
• Generalization Hierarchies: the data owner defines how values can be generalized
Zip: 13053, 13058 → 1305* ; 13063, 13067 → 1306* ; 1305*, 1306* → 130** ; 130** → *
Age: 28, 29 → < 30 ; 36, 37 → 3* ; < 30, 3* → < 40 ; < 40 → *
Nationality: Brazilian, US → American ; Indian, Japanese → Asian ; American, Asian → *
• Table Generalization: a table generalization is created by generalizing all values in a column to a specific level of generalization
e.g. the table

# | Zip   | Age  | Nationality | Condition
1 | 13053 | < 30 | American    | Heart Disease
2 | 13067 | < 30 | American    | Heart Disease
3 | 13053 | 3*   | Asian       | Cancer
4 | 13067 | 3*   | Asian       | Cancer

generalized to the 2-anonymization

# | Zip   | Age  | Nationality | Condition
1 | 130** | < 40 | *           | Heart Disease
2 | 130** | < 40 | *           | Heart Disease
3 | 130** | < 40 | *           | Cancer
4 | 130** | < 40 | *           | Cancer
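The hierarchies above can be encoded as simple value-to-parent maps. The sketch below is a minimal illustration, assuming the leaf values shown in the hierarchies; the helper names (`generalize`, `generalize_column`) are hypothetical.

```python
# Owner-defined generalization hierarchies, encoded as value -> parent maps.
# Level 0 is the original value; applying the map once moves up one level.
ZIP = {"13053": "1305*", "13058": "1305*", "13063": "1306*", "13067": "1306*",
       "1305*": "130**", "1306*": "130**", "130**": "*"}
AGE = {"28": "<30", "29": "<30", "36": "3*", "37": "3*",
       "<30": "<40", "3*": "<40", "<40": "*"}
NATIONALITY = {"Brazilian": "American", "US": "American",
               "Indian": "Asian", "Japanese": "Asian",
               "American": "*", "Asian": "*"}

def generalize(value, hierarchy, level):
    """Walk `level` steps up the hierarchy (a value stays put once it reaches '*')."""
    for _ in range(level):
        value = hierarchy.get(value, value)
    return value

def generalize_column(rows, column, hierarchy, level):
    """A table generalization fixes one level per column and applies it to every row."""
    return [{**row, column: generalize(row[column], hierarchy, level)} for row in rows]

rows = [{"Zip": "13053", "Age": "28", "Nationality": "Brazilian"},
        {"Zip": "13067", "Age": "29", "Nationality": "US"}]
rows = generalize_column(rows, "Zip", ZIP, level=2)   # 13053 -> 1305* -> 130**
print(rows[0]["Zip"])  # 130**
```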
The Optimization Problem > k-Anonymity k-minimal Generalizations
• There are many k-anonymizations. Which to pick? The ones that do not generalize the data more than needed.
k-minimal Generalization: a k-anonymization that is not a generalization of another k-anonymization
(e.g. the slide contrasts two 2-minimal generalizations with a non-minimal 2-anonymization)
The Optimization Problem > k-Anonymity k-minimal Distortions
• There are many k-minimal generalizations. Which to pick? The ones that create the minimum distortion to the data.
k-minimal Distortion: a k-minimal generalization that has the least distortion
D = (1 / number of attributes) × Σ_i (current level of generalization for attribute i / max level of generalization for attribute i)
e.g. with max levels (Zip: 3, Age: 3, Nationality: 2):
D = (0/3 + 2/3 + 2/2) / 3 ≈ 0.56    D = (2/3 + 1/3 + 1/2) / 3 = 0.5
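A small sketch of the distortion metric, reproducing the two example values under the assumption that the per-attribute maximum levels are (Zip: 3, Age: 3, Nationality: 2), as in the hierarchies above. The function name `distortion` is illustrative.

```python
def distortion(current_levels, max_levels):
    """D = average over attributes of (current generalization level / max level)."""
    ratios = [cur / mx for cur, mx in zip(current_levels, max_levels)]
    return sum(ratios) / len(ratios)

max_levels = [3, 3, 2]                               # Zip, Age, Nationality
print(round(distortion([0, 2, 2], max_levels), 2))   # 0.56
print(round(distortion([2, 1, 1], max_levels), 2))   # 0.5
```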
The Optimization Problem > k-Anonymity Complexity & Algorithms
Search Space:
• Number of table generalizations = Π_i (max level of generalization for attribute i + 1)
• If we allow generalization to a different level for each value of an attribute:
Number of generalizations = Π_i (max level of generalization for attribute i + 1)^#tuples
The problem is NP-hard! See paper for:
• a naïve brute-force algorithm
• heuristics: Datafly, μ-Argus
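For intuition about the size of the search space, a toy computation for the running example with maximum levels (3, 3, 2) and 4 tuples; the helper names are hypothetical.

```python
from math import prod

def num_table_generalizations(max_levels):
    # One generalization level chosen per column.
    return prod(m + 1 for m in max_levels)

def num_cell_generalizations(max_levels, n_tuples):
    # A level chosen independently for every value of every column.
    return prod((m + 1) ** n_tuples for m in max_levels)

max_levels = [3, 3, 2]                           # Zip, Age, Nationality
print(num_table_generalizations(max_levels))     # 4 * 4 * 3 = 48
print(num_cell_generalizations(max_levels, 4))   # 4^4 * 4^4 * 3^4 = 5_308_416
```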
The Optimization Problem > k-Anonymity k-Anonymity Drawbacks
k-Anonymity alone does not provide privacy if:
• the sensitive attributes lack diversity
• the attacker has background knowledge
The Optimization Problem > k-Anonymity k-Anonymity Attack Example
Original Data (names, quasi-identifiers and conditions). The attacker knows:
• the quasi-identifier values of the targets (e.g. Bob's and Umeko's zip code, age and nationality)
• other background knowledge: Japanese have a low incidence of heart disease
The Optimization Problem > k-Anonymity k-Anonymity Attack Example (continued)
In the 4-anonymization, every tuple in Bob's group has the same condition, so Bob has Cancer!
Umeko's group contains only Heart Disease and Viral Infection, and Japanese rarely have heart disease, so Umeko has Viral Infection!
Data Leak!
Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)
The Optimization Problem Second-Cut Solution: l-Diversity
Return a k-anonymization with the additional property that:
• for each distinct value of the quasi-identifier there exist at least l different values for the sensitive attribute
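A companion sketch to the k-anonymity check above, testing the simplest ("distinct values") form of l-diversity; the function and column names are illustrative, not from the paper.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifier, sensitive, l):
    """Each quasi-identifier group must contain at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in quasi_identifier)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

table = [
    {"Zip": "130**", "Age": "<40", "Condition": "Heart Disease"},
    {"Zip": "130**", "Age": "<40", "Condition": "Viral Infection"},
    {"Zip": "130**", "Age": "<40", "Condition": "Cancer"},
    {"Zip": "130**", "Age": "<40", "Condition": "Cancer"},
]
print(is_l_diverse(table, ["Zip", "Age"], "Condition", l=3))  # True
print(is_l_diverse(table, ["Zip", "Age"], "Condition", l=4))  # False
```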
The Optimization Problem > l-Diversity l-Diversity Example
3-diversified: the attack does not work!
Umeko has Viral Infection or Cancer. Bob has Viral Infection, Cancer or Heart Disease.
Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)
The Decision Problem Moving from practice to theory…
• k-anonymity & l-diversity make it harder for the attacker to figure out private associations…
• … but they still give away some knowledge & they do not give any guarantees on the amount of data being disclosed
• Alice wants to publish some views of her data and wants to know:
• Do her views disclose some sensitive data?
• If she adds a new view, will there be an additional data disclosure?
The Decision Problem Motivating Example
Secret: Alice wants to keep the correlation between Name & Condition secret: S = (name, condition)
Published Views: Alice publishes the views V1 = (zip, name) and V2 = (zip, condition)
The Decision Problem Motivating Example (continued)
Attacker's Knowledge: Before seeing the views (assuming he knows the domain), Ronaldo could be associated with any condition.
After seeing the views and joining V1 and V2 on zip, the attacker narrows Ronaldo's condition down (e.g. to Heart Disease or Viral Infection): Data Leak!
The Decision Problem > Model for attacker's knowledge Probability of possible tuples
• Domain of possible values for all attributes: D = {Bob, Mary}
• Set of possible tuples of relation R (e.g. cooksFor): (Bob, Bob), (Bob, Mary), (Mary, Bob), (Mary, Mary)
• The attacker assigns a probability to each tuple: x1 = x2 = x3 = x4 = 1/2
The Decision Problem > Model for attacker's knowledge Probability of possible Databases
• This implies a probability for each of the 2^4 = 16 possible database instances:
P(I) = Π_{t ∈ I} x_t × Π_{t ∉ I} (1 − x_t)
e.g. with all x_i = 1/2, every instance has probability (1/2)^4 = 1/16
The Decision Problem > Model for attacker's knowledge Probability of possible Secrets
• This implies a probability for each possible secret value: the probability that the secret S(y) :- R(x,y) equals s = {(Bob)} is the sum of the probabilities of the instances that return this query result: P[S(I) = s] = 3/16
Similarly for the probability that a view V equals v: P[V(I) = v]
The Decision Problem > Model for attacker's knowledge Prior & Posterior Probability
• Prior Probability: probability before seeing the view instance
Secret S(y) :- R(x,y): P[S(I) = {(Bob)}] = 3/16
• Posterior Probability: probability after seeing the view instance
View V(x) :- R(x,y); if V(I) = {(Mary)}:
P[S(I) = {(Bob)} | V(I) = {(Mary)}] = P[S(I) = {(Bob)} AND V(I) = {(Mary)}] / P[V(I) = {(Mary)}] = (1/16) / (3/16) = 1/3
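The prior and posterior above can be reproduced by brute-force enumeration of the 16 possible instances. This is only an illustrative sketch; names like `all_instances` and `prob` are not from the paper.

```python
from itertools import chain, combinations

DOMAIN = ["Bob", "Mary"]
TUPLES = [(x, y) for x in DOMAIN for y in DOMAIN]      # 4 possible tuples of R

def all_instances(tuples):
    """All 2^4 = 16 possible instances."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(tuples, r) for r in range(len(tuples) + 1))]

S = lambda I: frozenset(y for (_, y) in I)             # S(y) :- R(x, y)
V = lambda I: frozenset(x for (x, _) in I)             # V(x) :- R(x, y)

instances = all_instances(TUPLES)
p = 1 / len(instances)                                 # each instance has probability 1/16

def prob(event):
    return sum(p for I in instances if event(I))

s, v = frozenset({"Bob"}), frozenset({"Mary"})
prior = prob(lambda I: S(I) == s)                                           # 3/16
posterior = prob(lambda I: S(I) == s and V(I) == v) / prob(lambda I: V(I) == v)
print(prior, posterior)   # 0.1875 (= 3/16) vs 0.333... (= 1/3): S is not secure w.r.t. V
```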
The Decision Problem Query-View Security
• A query S is secure w.r.t. a set of views V if
• for any possible answer s to S & for any possible answer v to V:
• P[S(I) = s] = P[S(I) = s | V(I) = v] (prior probability = posterior probability)
Intuitively, if some possible answer to S becomes more or less likely after publishing the views V, then S is not secure w.r.t. V
The Decision Problem From Probabilities to Logic
The probability distribution does not affect the security of a query.
• A possible tuple t is a critical tuple if
• for some possible instance I: Q[I] ≠ Q[I − {t}]
(the query result in the presence of t differs from the query result in its absence)
Intuitively, critical tuples are those of interest to the query
• A query S is secure w.r.t. a set of views V iff: crit(S) ∩ crit(V) = ∅
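The logical characterization can be checked by enumeration over the toy domain. The sketch below (illustrative names `crit` and `secure`) computes the critical tuples of a query and tests disjointness for the two examples that follow.

```python
from itertools import chain, combinations

DOMAIN = ["Bob", "Mary"]
TUPLES = [(x, y) for x in DOMAIN for y in DOMAIN]

def instances(tuples):
    return [frozenset(c) for c in chain.from_iterable(
        combinations(tuples, r) for r in range(len(tuples) + 1))]

def crit(query):
    """A tuple t is critical for Q if removing it from some instance changes Q's answer."""
    return {t for t in TUPLES
            if any(query(I) != query(I - {t}) for I in instances(TUPLES))}

def secure(secret, view):
    """Query-view security characterization: crit(S) and crit(V) must be disjoint."""
    return crit(secret).isdisjoint(crit(view))

S1 = lambda I: frozenset(y for (_, y) in I)                  # S(y) :- R(x, y)
V1 = lambda I: frozenset(x for (x, _) in I)                  # V(x) :- R(x, y)
print(secure(S1, V1))   # False: every tuple is critical for both S1 and V1

S2 = lambda I: frozenset(x for (x, y) in I if y == "Mary")   # S(x) :- R(x, 'Mary')
V2 = lambda I: frozenset(x for (x, y) in I if y == "Bob")    # V(x) :- R(x, 'Bob')
print(secure(S2, V2))   # True: the critical tuple sets are disjoint
```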
The Decision Problem Example of Non-Secure Query
Previous example revisited: Secret S(y) :- R(x,y), View V(x) :- R(x,y).
Every possible tuple is critical for S, e.g. S[{(Mary, Mary)}] = {(Mary)} ≠ S[∅] = {}; likewise every tuple is critical for V.
Since crit(S) ∩ crit(V) ≠ ∅, S is not a secure query w.r.t. V.
The Decision Problem Example of Secure Query
Example 2: Secret S(x) :- R(x,'Mary'), View V(x) :- R(x,'Bob').
crit(S) = {(Bob, Mary), (Mary, Mary)} and crit(V) = {(Bob, Bob), (Mary, Bob)}, so crit(S) ∩ crit(V) = ∅ and S is a secure query w.r.t. V.
The Decision Problem Example of Secure Query (continued)
Example 2 revisited using the probabilistic definition of security:
Secret S(x) :- R(x,'Mary'), View V(x) :- R(x,'Bob'):
P[S(I) = {(Mary)}] = 4/16 = 1/4 = P[S(I) = {(Mary)} | V(I) = {(Bob)}], so S is a secure query w.r.t. V.
The Decision Problem Properties of Query-View Security
• Symmetry
• If S is secure w.r.t. V, then V is secure w.r.t. S
• No obscurity
• view definitions, secret query and schema are not concealed
• Instance Independence
• If S is secure w.r.t. V, it stays secure even if the underlying database changes
• Probability Distribution Independence
• holds if S and V are monotone queries
• Domain Independence
• If S is secure w.r.t. V for a domain D0 with |D0| <= n(n+1), then S is secure w.r.t. V for all larger domains D
• Complexity of query-view security
• Π2^p-complete
The Decision Problem Prior Knowledge
• Prior knowledge
• other than the domain D and the probability distribution P
• e.g. a key or foreign key constraint
• Represented as a Boolean query K over the instance
• Query-view security becomes:
• P[S(I) = s | K(I)] = P[S(I) = s | V(I) = v ∧ K(I)]
The Decision Problem Measuring Disclosure
• Query-view security is very strong
• it rules out most views used in practice as insecure
• Applications are ready to tolerate some disclosures
• Disclosure examples:
• Positive disclosure: “Bob” has “Cancer”
• Negative disclosure: “Umeko” does not have “Heart Disease”
• Measure of positive disclosure:
Leak(S,V) = sup_{s,v} ( P[s ∈ S(I) | v ∈ V(I)] − P[s ∈ S(I)] ) / P[s ∈ S(I)]
• Disclosure is minute if: Leak(S,V) << 1
The Decision Problem Query-View Security Drawbacks
• Tuples are modeled as mutually independent
• This is not the case in the presence of constraints (e.g. foreign key constraints)
• Modeling prior or external knowledge as a Boolean predicate does not suffice
• Supporting only conjunctive queries is restrictive
• Guarantees are instance-independent
• There may not be a privacy breach given the current instance
Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)
The Decision Problem More general setting
• Alice has a database D which conforms to schema S
• D satisfies a set of constraints
• V is a set of views over D
• The attacker's belief is modeled as a probability distribution
• Views and queries are defined using unions of conjunctive queries (UCQ)
• Alice wants to publish an additional view N
• Does view N provide any new information to the attacker about the answer to query Q?
Motivating Example (w/o Constraints)
Secret: Alice wants to hide the reviewer of paper P1: S(r) :- RP(r, 'P1')
Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c)
New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p)
The new views reveal nothing about the secret
Motivating Example (with Constraint 1)
Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c)
New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p)
Constraint 1: Papers assigned to a committee can only be reviewed by committee members: ∀r ∀p RP(r, p) → ∃c RC(r, c) ∧ CP(c, p)
With the new views the set of possible secrets shrinks: data disclosure depends on the constraints
Motivating Example (with Constraint 2)
Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c)
New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p)
Constraint 1: Papers assigned to a committee can only be reviewed by committee members
Constraint 2: Each paper has exactly 2 reviewers
With both constraints the set of possible secrets shrinks further: data disclosure depends on the constraints
Motivating Example (different instance)
Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c)
New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p)
Constraint 1: Papers assigned to a committee can only be reviewed by committee members
On this instance the new views reveal nothing about the secret, since any subset of the reviewers in V1 may review paper ‘P1’: data disclosure depends on the instance
Probabilities Revisited: Plausible Secrets
• To allow correlation between tuples, the attacker assigns probabilities directly to the plausible secrets (the outcomes of query S that are possible given the published views)
e.g. in the previous example with Constraint 1 and secret S(r) :- RP(r, 'P1'):
Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c)
Plausible secrets: any subset of V1, e.g. P1 = 3/8, P2 = 1/8, P3 = 2/8, P4 = 2/8, Pi = 0 for i > 4
The Decision Problem Possible Worlds
• This induces a probability distribution on the set of possible worlds (possible instances that satisfy the constraints & the published views)
e.g. for the plausible secret S = {(R1)} (probability P1 = 3/8) there may be several possible worlds, with probabilities PG1, PG2, …
Probability Distribution on Possible Worlds
• This induced probability distribution can be:
• General: for each secret value s, the sum of the probabilities of the possible worlds consistent with s equals the probability of S = s, e.g. PG1 + PG2 + … = P1 = 3/8 for S = {(R1)}
Probability Distribution on Possible Worlds (continued)
• … or Equiprobable: each of the possible worlds for a secret value s is equally probable, i.e. equal to the probability of S = s divided by the number of possible worlds for s, e.g. PG1 = PG2 = … = P1 / (# of possible worlds for {(R1)})
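To tie the pieces together, here is a hedged sketch of the new-view question from the more general setting: given an explicit set of possible worlds with probabilities (general or equiprobable), it checks whether conditioning on the answer of an additional view N changes the induced distribution over the plausible secrets. All names (`new_view_is_safe`, the toy worlds and views) are hypothetical and the tiny example instance is invented for illustration; this is not the paper's decision procedure.

```python
from collections import defaultdict

def secret_distribution(worlds, secret):
    """Aggregate the probability of each secret value over the possible worlds.
    `worlds` is a list of (instance, probability) pairs."""
    dist = defaultdict(float)
    for world, p in worlds:
        dist[secret(world)] += p
    return dict(dist)

def new_view_is_safe(worlds, secret, new_view, observed_answer):
    """The extra view N is safe w.r.t. the secret if conditioning the possible
    worlds on N's observed answer leaves the secret's distribution unchanged."""
    prior = secret_distribution(worlds, secret)
    consistent = [(w, p) for w, p in worlds if new_view(w) == observed_answer]
    total = sum(p for _, p in consistent)
    posterior = secret_distribution([(w, p / total) for w, p in consistent], secret)
    return all(abs(prior.get(s, 0.0) - posterior.get(s, 0.0)) < 1e-9
               for s in set(prior) | set(posterior))

# Hypothetical toy setup: two possible worlds for RP (who reviews which paper),
# equally likely given the already-published views.
w1 = frozenset({("R1", "P1"), ("R2", "P1")})
w2 = frozenset({("R1", "P1"), ("R3", "P1")})
worlds = [(w1, 0.5), (w2, 0.5)]

secret = lambda I: frozenset(r for (r, p) in I if p == "P1")    # S(r) :- RP(r, 'P1')
count_view = lambda I: len(I)                    # reveals only how many assignments exist
names_view = lambda I: frozenset(r for (r, _) in I)   # reveals who reviews something

print(new_view_is_safe(worlds, secret, count_view, count_view(w1)))   # True: no new info
print(new_view_is_safe(worlds, secret, names_view, names_view(w1)))   # False: pins down the secret
```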