This study delves into the complexities of privacy in data sharing, highlighting the risks associated with identity and sensitive information disclosure. A framework based on statistical decision theory is proposed, offering a comprehensive assessment of privacy risks through disclosure rules and loss functions. The research presents an algorithm to minimize privacy risk, emphasizing the importance of risk estimation and sensitivity analysis. The text explores how data can be linked and the implications for privacy, ultimately aiming to enhance privacy protection in both public and private sectors.
Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk M. Scannapieco, G. Lebanon, M. R. Fouad and E. Bertino
Introduction • Release of data • Private organizations can benefit from sharing data with others • Public organizations see data as a value for the society • Privacy preservation • Data disclosure can lead to economic damages, threats to national security, etc. • Regulated by law in both private and public sectors
Two Facets of Data Privacy • Identity disclosure • Uncontrolled data release: identifiers may even be present • Anonymous data release: identifiers suppressed, but no control over possible linking with other sources
Linkage of Anonymous Data • [Figure: two anonymized tables T1 and T2 linked through their shared QUASI-IDENTIFIER attributes]
Two Facets of Data Privacy (cont.) • Sensitive information disclosure • Once identity disclosure occurs, the loss due to such disclosure depends on how sensitive the related data are • Data sensitivity is subjective • E.g.: age is in general more sensitive for women than for men
Our proposal • A framework for assessing privacy risk that takes into account both facets of privacy • based on statistical decision theory • Definition and analysis of: disclosure policies modelled by disclosure rules, and several privacy risk functions • Estimated risk as an upper bound on the true risk, and related complexity analysis • Algorithm for finding the disclosure rule that minimizes the privacy risk
Disclosure rules • A disclosure rule δ is a function that maps a record x to a new record y = δ(x) in which some attributes may have been suppressed: δ(x)j = ⊥ if the j-th attribute is suppressed, xj otherwise
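As a minimal sketch of this definition (the function names and the use of `None` for the suppression symbol ⊥ are illustrative assumptions, not taken from the paper):

```python
# Minimal sketch of a disclosure rule: a function mapping a record to a
# copy in which the chosen attributes are suppressed. None stands in for
# the suppression symbol; all names here are illustrative.

def make_disclosure_rule(suppressed):
    """Return a rule delta that suppresses the given attribute indices."""
    def delta(record):
        return tuple(None if j in suppressed else v
                     for j, v in enumerate(record))
    return delta

delta = make_disclosure_rule({2})  # suppress the third attribute
print(delta(("Alice", "Smith", "555-0100")))  # ('Alice', 'Smith', None)
```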
Loss function • Let θ be the side information used by the attacker in the identification attempt • The loss function ℓ(δ(x), θ) measures the loss incurred by disclosing the data δ(x) due to possible identification based on θ • Empirical distribution p associated with records x1…xn
Risk Definition • The risk of the disclosure rule δ in the presence of the side information θ is the average loss of disclosing x1…xn: R(δ, θ) = (1/n) Σi ℓ(δ(xi), θ)
Putting the pieces together so far… • A hypothetical attacker performs an identification attempt on a disclosed record y = δ(x) on the basis of side information θ, which can be a dictionary • The dictionary is used to link y with some entry present in the dictionary • Example: • y has the form (name, surname, phone#), θ is a phone book • If all attributes are revealed, y is likely linked with one entry • If phone# is suppressed (or missing), y may or may not be linked to a single entry, depending on the popularity of (name, surname)
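The phone-book example can be sketched as follows; treating suppressed attributes (`None`) as wildcards when matching against the dictionary is an illustrative assumption, and the data is invented:

```python
# Sketch of the linkage step: the attacker matches the disclosed record y
# against a dictionary (the side information). Identification succeeds
# when exactly one dictionary entry is consistent with y.

def consistent(y, entry):
    """True if disclosed record y (None = suppressed) matches entry."""
    return all(a is None or a == b for a, b in zip(y, entry))

def matches(y, dictionary):
    return [e for e in dictionary if consistent(y, e)]

phone_book = [
    ("Alice", "Smith", "555-0100"),
    ("Alice", "Smith", "555-0199"),
    ("Bob",   "Jones", "555-0123"),
]

# Full disclosure: a unique match, identity is revealed.
print(len(matches(("Alice", "Smith", "555-0100"), phone_book)))  # 1
# Phone number suppressed: two candidates share (name, surname).
print(len(matches(("Alice", "Smith", None), phone_book)))        # 2
```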
Risk formulation • Let's decompose the loss function into an identification part and a sensitivity part • Identification part: formalized by the random variable Z: Z = 1 if the attacker links δ(x) to the correct identity, 0 otherwise
Risk formulation (cont.) • Sensitivity part: a function Φ(x), where higher values indicate higher sensitivity • Therefore the loss is: ℓ(δ(x), θ) = Φ(x) · P(Z = 1 | δ(x), θ)
Risk formulation (cont.) • Risk: R(δ, θ) = (1/n) Σi Φ(xi) · P(Z = 1 | δ(xi), θ)
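A hedged sketch of this empirical risk, modeling the identification probability as 1/|dictionary entries consistent with δ(x)| — an illustrative choice for the sketch, not necessarily the paper's exact formulation:

```python
# Empirical risk: average over the records of sensitivity times the
# (modeled) probability of identification. Names are illustrative.

def consistent(y, entry):
    """True if disclosed record y (None = suppressed) matches entry."""
    return all(a is None or a == b for a, b in zip(y, entry))

def empirical_risk(records, delta, dictionary, phi):
    """Average of Phi(x) * P(identification | delta(x)) over the records."""
    total = 0.0
    for x in records:
        y = delta(x)
        m = sum(1 for e in dictionary if consistent(y, e))
        total += phi(x) * (1.0 / m if m else 0.0)
    return total / len(records)

records = [("Alice", "Smith", "555-0100"),
           ("Alice", "Smith", "555-0199"),
           ("Bob",   "Jones", "555-0123")]
suppress_phone = lambda x: (x[0], x[1], None)
# Two Alices share (name, surname), Bob is unique: (1/2 + 1/2 + 1) / 3
print(empirical_risk(records, suppress_phone, records, lambda x: 1.0))  # ≈ 0.667
```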
Disclosure Rule vs. Privacy Risk • Suppose that θtrue is the true attacker's dictionary, which is publicly available, and that θ* is the actual database starting from which data will be published • Under the following assumptions: • θtrue contains more records than θ* (θ* ≤ θtrue) • The non-⊥ information in θtrue will be more limited than the non-⊥ information in θ* • Theorem: If θ* contains records that correspond to x1, …, xn and θ* ≤ θtrue, then: R(δ, θtrue) ≤ R(δ, θ*)
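A toy numerical check of the bound, using an illustrative 1/|matches| identification model and unit sensitivity (assumptions of this sketch, not the paper's exact loss): enlarging the attacker's dictionary can only add candidate matches for each disclosed record, so the risk computed against the smaller θ* upper-bounds the risk against θtrue.

```python
# Toy demonstration that R(delta, theta_true) <= R(delta, theta_star)
# when theta_star is contained in the larger dictionary theta_true.

def consistent(y, entry):
    return all(a is None or a == b for a, b in zip(y, entry))

def risk(records, delta, dictionary):
    """Average 1/|consistent entries| over the records (unit sensitivity)."""
    total = 0.0
    for x in records:
        m = sum(1 for e in dictionary if consistent(delta(x), e))
        total += 1.0 / m if m else 0.0
    return total / len(records)

theta_star = [("Alice", "Smith"), ("Bob", "Jones")]          # published database
theta_true = theta_star + [("Ann", "Smith"), ("Al", "Roe")]  # larger dictionary
delta = lambda x: (None, x[1])                               # disclose surname only

r_true = risk(theta_star, delta, theta_true)
r_star = risk(theta_star, delta, theta_star)
print(r_true <= r_star)  # True: the estimated risk bounds the true risk
```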
Disclosure Rule vs. Privacy Risk (cont.) • The theorem proves that the true risk is bounded by R(δ, θ*) • Under the hypothesis that the distribution underlying θ factorizes into a product form: • Theorem: The rule that minimizes the risk, δ* = arg minδ R(δ, θ), can be found in O(nNm) computation
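The efficient O(nNm) algorithm relies on the product-form assumption. As a naive baseline only (exponential in the number of attributes m; the names and the 1/|matches| identification model are illustrative assumptions of this sketch, not the paper's algorithm), one can enumerate every suppression pattern:

```python
from itertools import chain, combinations

def consistent(y, entry):
    return all(a is None or a == b for a, b in zip(y, entry))

def risk(records, suppressed, dictionary, phi):
    """Empirical risk of suppressing the given attribute indices."""
    total = 0.0
    for x in records:
        y = tuple(None if j in suppressed else v for j, v in enumerate(x))
        m = sum(1 for e in dictionary if consistent(y, e))
        total += phi(x) * (1.0 / m if m else 0.0)
    return total / len(records)

def minimize_risk_bruteforce(records, dictionary, phi, num_attrs):
    """Try all 2^num_attrs suppression patterns; return the best one."""
    patterns = chain.from_iterable(
        combinations(range(num_attrs), k) for k in range(num_attrs + 1))
    return min(patterns, key=lambda s: risk(records, set(s), dictionary, phi))
```

In this toy setting, a pure privacy loss is unsurprisingly minimized by suppressing everything; a realistic deployment would also weigh the utility of the disclosed data.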
K-anonymity • K-anonymity is simply a special case of our framework in which: • θtrue = T • Φ is a constant • the loss is underspecified • Our framework makes explicit some questionable hypotheses underlying k-anonymity!
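For comparison, the k-anonymity condition itself is easy to state in code (a generic illustration with invented data, not tied to the paper's notation): every combination of quasi-identifier values must occur in at least k records.

```python
from collections import Counter

def is_k_anonymous(table, qi_indices, k):
    """True if every quasi-identifier value combination occurs >= k times."""
    counts = Counter(tuple(row[j] for j in qi_indices) for row in table)
    return all(c >= k for c in counts.values())

table = [
    ("Alice", "Smith", "Oncology"),
    ("Anna",  "Smith", "Cardiology"),
    ("Bob",   "Jones", "Oncology"),
]
# Using surname alone as the quasi-identifier:
print(is_k_anonymous(table, [1], 2))  # False: Jones appears only once
```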
Conclusions • New framework for privacy risk taking sensitivity into account • Risk estimation as an upper bound for the true privacy risk • Efficient algorithm for risk computation • k-anonymity generalization