Personal identity protection solutions in the presence of low copy number fields EUCCONET Record Linkage Workshop 15th to 17th June 2011 Dr Kerina Jones HIRU, Swansea University
Overview • Nature of the issue – when anonymisation (de‑identification) may not be enough to protect privacy • Risks associated with low copy number fields • Origins and descriptions of the kinds of solutions in use • Utility versus privacy • Current viewpoints
Anonymised (de-identified) data • Linked-anonymised: separation of the common identifiers from the clinical data, with capability to re-join via a key • Pseudonymised and anonymously-linked data: replacement of common identifiers with a unique anonymous identifier and measures to prevent capability to re-join • Unlinked-anonymised data: permanent removal of common identifiers and no capability to re-join • (Diagram: risk of re-identification, highest for linked-anonymised and lowest for unlinked-anonymised)
Nature of the issue • Not focussed on - • Using the linkage key to the demographics • Cracking the anonymisation or encryption codes • How robust the anonymisation is • It’s about - • Risk of re-identification of individuals in an anonymised dataset due to their records being present as unique combinations of variables or in low copy number • Can be accidental or intentional
Risk issues • Presence of unusual variables • Rare conditions • Extremes of age • Multiple births – triplets, etc. • Large families • Minority groups • Unusual combinations • Increases with increasing number of variables in dataset
Risk areas • Authorised: • Data access and analysis • Data sharing between individuals/organisations • Release of results • Data publication • Unauthorised – for example: • Security breach by intruder - intentional • Loss of data, release of wrong data - accidental
Origins • Privacy preservation in anonymised datasets • 3 main origins: • Database community – database management • Cryptographers – cryptographic protocols • Statistical disclosure control (SDC) – national statistics • Variety of techniques • Often parallel developments – similar in outcome • Fusion of ideas • ‘Re-identification science’
Why is it needed? • Reasons for privacy preservation in anonymised datasets • Demonstrated to be relatively easy to re-identify individuals from some anonymous datasets • 87% of people in the US have a unique combination of ZIP code, birth date and gender • Netflix – anonymised customer movie ratings dataset • AOL anonymous internet searches • Sweeney – re-identification of a US Governor after he had stated his confidence in a de-identified health dataset • Clearly, removal of commonly-recognised identifiers is not enough to prevent re-identification
Why is it needed? • Linkage attack: using publicly-available information in combination with an anonymised dataset to attempt to re-identify individuals • Prosecutor risk – the re-identification of a given individual • Journalist risk – the re-identification of any individual • Marketer risk – the re-identification of as many individuals as possible • Purposeful attempts to prove it can be done • Risks to data linkage units: legislation, litigation, cost and reputation
Some definitions • Suppression: removal of certain variables or records from a dataset • Aggregation/Generalisation: grouping data items (such as ages) into bands • Encryption: transforming via an algorithm • Masking: obscuring values – functionally similar to original • Perturbation: introducing noise into a dataset • Data swapping: exchange of values between records • Synthetic data: generated to retain certain statistics
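As a rough illustration of some of these operations, the Python sketch below applies suppression, generalisation (age banding), perturbation and data swapping to a toy set of records; the field names, banding choices and noise levels are invented for the example.

```python
import random

# Toy records: each is a dict of quasi-identifiers and a sensitive value.
records = [
    {"age": 34, "postcode": "SA2 8PP", "weight_kg": 81.2, "diagnosis": "asthma"},
    {"age": 37, "postcode": "SA2 8QQ", "weight_kg": 64.5, "diagnosis": "diabetes"},
    {"age": 71, "postcode": "CF10 1AA", "weight_kg": 90.1, "diagnosis": "asthma"},
]

def generalise_age(age, band=10):
    """Aggregation/generalisation: replace exact age with a 10-year band."""
    lower = (age // band) * band
    return f"{lower}-{lower + band - 1}"

def suppress(record, fields):
    """Suppression: drop the named fields entirely."""
    return {k: v for k, v in record.items() if k not in fields}

def perturb(value, noise=1.0):
    """Perturbation: add random noise to a continuous value."""
    return round(value + random.uniform(-noise, noise), 1)

def swap_values(recs, field):
    """Data swapping: exchange a field's values between two random records."""
    i, j = random.sample(range(len(recs)), 2)
    recs[i][field], recs[j][field] = recs[j][field], recs[i][field]

anonymised = []
for r in records:
    r = suppress(r, {"postcode"})             # remove a direct geographic identifier
    r["age"] = generalise_age(r["age"])        # band the age
    r["weight_kg"] = perturb(r["weight_kg"])   # add noise to a continuous variable
    anonymised.append(r)

swap_values(anonymised, "diagnosis")           # swap a sensitive value between records
print(anonymised)
```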
Privacy preservation methods Vary with access model: • Restricted data • Altered data – important for researchers to know the data have been altered • Data views – data cannot be taken away • Meta-data • Results/test statistics only
Privacy preservation methods • Methods to quantify anonymity in a de-identified dataset: • k-anonymisation • Early work by Samarati and Sweeney (1998) • k-anonymisation – in a k-anonymised dataset a given record cannot be distinguished from at least (k - 1) other records • For example in a dataset where k = 3, a given record will be identical to at least 2 other records • Minimum value of k is 2 • Higher values of k are considered less risky
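A minimal sketch of what a k-anonymity check looks like in practice; the records, quasi-identifier names and value of k below are illustrative only.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """A dataset is k-anonymous if every combination of quasi-identifier
    values appears in at least k records, i.e. each record is
    indistinguishable from at least k - 1 others."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

rows = [
    {"age_band": "30-39", "sex": "F", "area": "SA2"},
    {"age_band": "30-39", "sex": "F", "area": "SA2"},
    {"age_band": "30-39", "sex": "F", "area": "SA2"},
    {"age_band": "70-79", "sex": "M", "area": "CF10"},  # a unique record
]

print(is_k_anonymous(rows, ["age_band", "sex", "area"], k=3))  # False: one record is unique
```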
Privacy preservation algorithms • Algorithms to achieve the desired level of k-anonymisation in microdata • Set a threshold for k depending on perception of risk • Ubiquitous rule or a rule of thumb • Specific to a particular dataset • Vary according to levels of trust • Differ depending on dataset destination • Linked to risk appetite of organisation/unit
Argus algorithm • User specification on level of generalisation • Criticisms – • Only checks for low copy numbers at 2 and 3 • There may be sensitive combinations at copy level 4 • Checking all combinations would be computationally challenging (1996) • Unable to offer solution quality guarantees
Datafly algorithm • Not limited to equivalence classes of 2 or 3 • Criticisms – • Distortions and generalisations not necessarily k-minimal • Makes crude decisions on generalisation and suppression • Unable to offer solution quality guarantees
MinGen algorithm • Designed to provide minimal distortion • Deliver maximum quality • Criticisms – • Impractical for large datasets • Inefficient
Other k-anonymisation algorithms • Numerous more k-anonymisation algorithms • General pattern – develop, criticise, improve, etc. • Problem of non-global changes, local recodes • Different levels of generalisation on different variables • Loss of data due to suppression or over-generalisation • Introduction of other measures alongside k • l-diversity - designed to avoid inference of sensitive values • t-closeness – enhances l-diversity
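As a rough sketch of what l-diversity adds on top of k-anonymity: within each equivalence class (records sharing the same quasi-identifier values), the sensitive attribute should take at least l distinct values. The data and the simple "distinct l-diversity" variant used here are for illustration.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every equivalence class must contain at least
    l distinct values of the sensitive attribute, so membership of a class
    alone does not reveal the sensitive value (the homogeneity attack)."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())

rows = [
    {"age_band": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age_band": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age_band": "30-39", "sex": "F", "diagnosis": "asthma"},  # k = 3 but homogeneous
    {"age_band": "40-49", "sex": "M", "diagnosis": "diabetes"},
    {"age_band": "40-49", "sex": "M", "diagnosis": "asthma"},
]

print(is_l_diverse(rows, ["age_band", "sex"], "diagnosis", l=2))  # False: first class has one value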
Globally optimal method • Globally optimal k-anonymity method • Optimal Lattice Anonymisation (OLA) • Produces a lattice of solutions using different generalisation strategies • Comparison with other known k-anonymisation algorithms • Uses a set of metrics to measure information loss and for evaluation
Metrics Metrics – • Precision (Prec): a measure of the loss of precision due to generalisation • Discernability Metric (DM): assigns a penalty to each record relative to the number of records identical to it • Non-uniform entropy: a measure to quantify differences in loss of information when generalising a given variable whose distribution is different between datasets • E.g. gender is 60M:40F in dataset1, 1M:99F in dataset2
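A simplified sketch of one of these information-loss measures, the Discernability Metric: each record is penalised by the size of its equivalence class, and suppressed records by the full dataset size in the usual formulation. The rows below are invented for the example.

```python
from collections import Counter

def discernability_metric(rows, quasi_identifiers, suppressed=0):
    """DM: each released record incurs a penalty equal to the number of records
    sharing its quasi-identifier combination; each suppressed record incurs a
    penalty equal to the total dataset size."""
    total = len(rows) + suppressed
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    penalty = sum(size * size for size in combos.values())  # 'size' records, each penalised by 'size'
    penalty += suppressed * total
    return penalty

rows = [
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "40-49", "sex": "M"},
]
print(discernability_metric(rows, ["age_band", "sex"]))  # 2*2 + 1*1 = 5
```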
OLA illustration: a lattice of generalisation and suppression levels, with the k-minimal node highlighted
OLA method • k-minimal node is the lowest node on a given generalisation strategy that satisfies k-anonymisation • Generalisation strategy – the systematic approach taken to k-anonymise the dataset, e.g. successive banding on age until k is reached • If a node in a strategy is k-anonymous, all nodes above it in same strategy will be k-anonymous • Therefore nodes above, in the same strategy, are discarded as they have greater information loss
How OLA operates How OLA operates – • A threshold for k is set • A maximum level of suppression (MaxSup) is chosen • Find: a binary search locates the k-anonymous nodes in the lattice • Discard: on each generalisation strategy, k-anonymous nodes above the lowest one are discarded, so only the lowest is retained • Compare: the remaining k-anonymous nodes are compared in terms of information loss (using the metrics) and the optimal node is chosen
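A greatly simplified sketch of the idea behind this search. The generalisation hierarchies, suppression budget and records are invented, and the lattice is enumerated exhaustively rather than binary-searched; the final choice uses "least generalisation" as a crude stand-in for OLA's information-loss metrics.

```python
from collections import Counter
from itertools import product

# Generalisation hierarchies for two quasi-identifiers (illustrative only):
# each level maps a raw value to a coarser representation.
AGE_LEVELS = [
    lambda a: str(a),                      # level 0: exact age
    lambda a: f"{(a // 10) * 10}s",        # level 1: 10-year band
    lambda a: "*",                         # level 2: fully generalised
]
AREA_LEVELS = [
    lambda p: p,                           # level 0: full postcode district
    lambda p: p[:2],                       # level 1: postcode area
    lambda p: "*",                         # level 2: fully generalised
]

def apply_node(rows, node):
    """Apply one lattice node (a generalisation level per variable)."""
    age_f, area_f = AGE_LEVELS[node[0]], AREA_LEVELS[node[1]]
    return [(age_f(r["age"]), area_f(r["area"])) for r in rows]

def suppression_needed(generalised, k):
    """Number of records that would have to be suppressed to reach k-anonymity."""
    counts = Counter(generalised)
    return sum(c for c in counts.values() if c < k)

rows = [
    {"age": 34, "area": "SA2"}, {"age": 36, "area": "SA2"},
    {"age": 37, "area": "SA3"}, {"age": 71, "area": "CF1"},
]
k, max_sup = 2, 1   # threshold for k and maximum tolerated suppression

# Enumerate the lattice (OLA binary-searches it) and keep the passing nodes.
passing = [node for node in product(range(3), range(3))
           if suppression_needed(apply_node(rows, node), k) <= max_sup]

# Choose the node with the least total generalisation.
print(min(passing, key=sum))
```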
How OLA operates • Evaluation found it to be better in terms of efficiency and information loss than some other k-anonymisation algorithms • Limitations – • It is possible that an optimal solution will not be found • Works on principle that suppression is better than generalisation • Based on metrics that are monotonic with respect to generalisation strategies • Does it effectively solve attack strategies?
Angles of attack • Linkage attack: using publicly-available information in combination with an anonymised dataset to attempt to re‑identify individuals • Homogeneity attack: can occur where the value for a sensitive attribute in an anonymised dataset is the same for a number of records • Can occur in equivalence classes • Disclosure by being in the class • Additional metrics to assess vulnerability
Other types of algorithm • Other types of algorithm – SUDA and SUDA2 • Special Uniques Detection Algorithm • ONS in London and Australian Bureau of Statistics • Set of algorithms and software system • Special unique – • A record which is unique on coarser-grained variables is more risky than one unique on fine-grained variables • Unique on a set of variables and also unique on a sub-set of those variables
SUDA and SUDA2 • Takes into account Minimal Sample Uniques (MSUs) – the size and number of sub-sets of variables within the dataset on which a record is itself unique • Use this information to estimate underlying risk • Recognises that some of the metrics used with other algorithms, such as measures of distance in a dataset, may not work for categorical variables • SUDA2 improves on SUDA methodologically to be more computationally efficient
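A brute-force sketch of finding Minimal Sample Uniques, to illustrate the concept: for each record, variable subsets on which it is unique in the sample and which contain no smaller unique subset. SUDA2 prunes this search far more efficiently; the records and variable names are invented.

```python
from collections import Counter
from itertools import combinations

def minimal_sample_uniques(rows, variables, max_size=3):
    """For each record, list its MSUs: variable subsets on which the record is
    unique and no proper subset is unique (smaller/coarser MSUs imply higher risk)."""
    msus = {i: [] for i in range(len(rows))}
    for size in range(1, max_size + 1):
        for subset in combinations(variables, size):
            counts = Counter(tuple(row[v] for v in subset) for row in rows)
            for i, row in enumerate(rows):
                key = tuple(row[v] for v in subset)
                if counts[key] == 1:
                    # keep only if no already-found MSU is a proper subset of this one
                    if not any(set(m) < set(subset) for m in msus[i]):
                        msus[i].append(subset)
    return msus

rows = [
    {"age_band": "30-39", "sex": "F", "area": "SA2"},
    {"age_band": "30-39", "sex": "F", "area": "SA2"},
    {"age_band": "70-79", "sex": "F", "area": "SA2"},  # unique on age_band alone: high risk
]
print(minimal_sample_uniques(rows, ["age_band", "sex", "area"]))
```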
PARAT • Privacy Analytics Risk Assessment Tool (PARAT) • Electronic Health Information Laboratory – Ontario • http://www.privacyanalytics.ca/products/products.html • A Windows-based application compatible with a number of databases • Risk-based approach to de-identification • NB – will only be applicable for certain data linkage unit models
Using PARAT • Select variables at risk of re-identification • These can be ranked for importance • Ranking used in de-identification process • Balance risk and data utility
Using PARAT • Set the acceptable re‑identification risk • User input • Level of trust • Accounting for nature of dataset
Using PARAT • Carry out the risk assessment for prosecutor, journalist and marketer • Risk is high (>0.2 for all these) • Many potential uniques
Using PARAT • Automatically de‑identifies the data • Suppression and generalisation to reduce risk to acceptable level • Before dataset is made available to researcher
Evaluation of PARAT • Which models does it apply to? • Repositories/DL units that prepare data views? • Repositories/DL units that release linked datasets? • DL units that provide links, but data comes from providers? • Federated queries/distributed systems via a co-ordination centre? • Others?
NEMO Numerical Evaluation of Multiple Outputs (NEMO) • SQL-based algorithm • Counts unique and low-copy number records • Allows the judicious application of suppression and/or aggregation • Project-by-project basis • Can apply at dataview and at results stages • Also – may use sequential analysis to limit views
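NEMO is described here only as an SQL-based algorithm; the sketch below is a guess at the kind of counting query involved, using an in-memory SQLite table and an invented table/column layout to flag unique and low copy number combinations in a proposed data view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataview (age_band TEXT, sex TEXT, area TEXT)")
conn.executemany(
    "INSERT INTO dataview VALUES (?, ?, ?)",
    [("30-39", "F", "SA2"), ("30-39", "F", "SA2"),
     ("30-39", "F", "SA2"), ("70-79", "M", "CF10")],
)

# Count the combinations falling below a copy-number threshold, so a reviewer
# can decide where suppression and/or aggregation should be applied.
threshold = 3
low_copy = conn.execute(
    """
    SELECT age_band, sex, area, COUNT(*) AS n
    FROM dataview
    GROUP BY age_band, sex, area
    HAVING COUNT(*) < ?
    """,
    (threshold,),
).fetchall()

print(low_copy)   # combinations occurring fewer than `threshold` times
```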
Privacy in free text Free-text/Narrative text - • Review – Automatic de-identification of textual documents in the electronic health record: a review of recent research (Meystre et al., 2010, Uni of Utah) • Categorised two main approaches – • Pattern matching – rule-based via constructed dictionaries • Machine learning – data-mining techniques using training of algorithms • Some use a combination to improve efficiency
Privacy in free text • Performance assessment: • Recall (Sensitivity) – proportion of the identifying information present that is correctly identified • Precision (Positive predictivity) – proportion of true positives among all the terms flagged as identifying • Fallout (False positive rate) – proportion of non-identifying terms mistakenly flagged as identifying
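A small sketch of how these three measures are computed from counts of true positives, false negatives, false positives and true negatives; the counts below are invented.

```python
def deid_performance(tp, fn, fp, tn):
    """Recall (sensitivity), precision (positive predictivity) and fallout
    (false positive rate) for a de-identification run, where 'positive'
    means a term flagged as identifying information."""
    recall = tp / (tp + fn)        # identifying terms correctly found / all identifying terms
    precision = tp / (tp + fp)     # true identifiers among everything flagged
    fallout = fp / (fp + tn)       # non-identifying terms wrongly flagged
    return recall, precision, fallout

print(deid_performance(tp=95, fn=5, fp=20, tn=880))
# recall 0.95, precision ~0.83, fallout ~0.022
```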
Challenges • Some challenges – • Words such as ‘brown’, ‘grey’ or ‘white’ may be names or adjectives • Drug names can be mistaken for person names and omitted • Time consuming to generate dictionaries • Domain-specific knowledge needed • Computationally challenging • Particular concerns and sensitivities around free-text data
Privacy in geodata • Studying the relationships between health and environment • Residential Anonymous Linking Fields (RALFs) in SAIL • Common disclosure control practice is to aggregate and/or suppress values in population areas of specified size • Some methods – distortions, loss of spatial relationships • Risk-based approaches using uniqueness thresholds to manage the risk of re-identification
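A rough sketch of the aggregation side of this: snapping point locations to grid cells and suppressing cells whose count falls below a threshold. The grid size, threshold and coordinates are illustrative, not taken from SAIL practice.

```python
from collections import Counter

def to_grid_cell(easting, northing, cell_size=1000):
    """Aggregate an exact location to a grid cell (e.g. 1 km squares)."""
    return (easting // cell_size, northing // cell_size)

def aggregate_locations(points, cell_size=1000, min_count=5):
    """Return counts per grid cell, suppressing cells below the threshold."""
    cells = Counter(to_grid_cell(e, n, cell_size) for e, n in points)
    return {cell: n for cell, n in cells.items() if n >= min_count}

points = [(258432, 192876)] * 6 + [(259990, 193210)] * 2   # two small clusters
print(aggregate_locations(points))  # the 2-person cell is suppressed
```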
Privacy in outputs and results Privacy preservation at output of results and/or publication • Differs with model – • Repositories/DL units that prepare data views? • Menu-driven query servers/meta-data views? • Federated queries/distributed systems via a co-ordination centre? • Repositories/DL units that release linked datasets? • DL units that provide links, but data comes from providers?
Privacy in outputs and results Privacy preservation at output of results and/or publication • Types of outputs • Descriptive statistics, distributions, mean, median, SD, etc. • Tables of values • Contingency tables • Plots – scatter, box, bar charts • Single statistics, regression coefficients
Privacy preservation models • SAIL model - repository providing views only • Scrutinise the results before release • Numerical Evaluation of Multiple Outputs • Always conduct manual review • No results can leave SAIL without assessment – release is dependent on authorisation • Risk remaining – alteration of results post-release
Privacy preservation models • Australian Bureau of Statistics (ABS) • Confidentialised Unit Record Files (CURFs) • Removal of name and address, etc. • Control on level of detail • Changes to some values • Addition of noise to continuous variables • Some degree of suppression • Limited access to sensitive variables, company names, etc
Privacy preservation models • Stringently confidentialised data can be released on CD-ROM • Data enclave – limited access to more detailed data – secure on‑site facility - trusted researcher status • Remote Analysis Server (RAS) • User does not access the data themselves • Queries submitted in SAS, SPSS or STATA • Results checked for confidentiality and sent to user
Privacy preservation models Protection of privacy in ABS Remote Analysis System – • CSIRO (Commonwealth Scientific and Industrial Research Organisation) – Privacy Preserving Analytics • Uses risk mitigation in outputs of descriptive and inferential statistics, e.g. • Limiting extreme values in EDA output • Replacement of table cell counts with Correspondence Analysis (CA) of cell counts • Limited transformations allowed and covariance matrix not supplied in linear regression
Stated advantages of RAS model Stated advantages of ABS Remote Analysis System – • No information loss due to data perturbation • No need for special statistics to deal with perturbed data • Can be easier to confidentialise the output than the data • Fitted models on RAS should be better than on confidentialised data • Less risky as researcher only receives confidentialised output, not records • Users can be given different levels of access