Statistical Disclosure Control Mark Elliot Confidentiality and Privacy Group CCSR University of Manchester
Overview • CAPRI – who we are / what we do • SDC – some basics • SD Risk Assessment and Microdata • General Concepts • Our Approach • SD Risk Assessment and Aggregate Data • General Concepts • Our Approach • Statistical Disclosure and the Grid
Confidentiality And PRIvacy group www.ccsr.ac.uk/capri University of Manchester
Purpose To investigate the Confidentiality and Privacy issues that arise from the collection, dissemination and analysis of data.
Multidisciplinary Approach • Mark Elliot, Knowledge and Data Engineering • Kingsley Purdam, Politics and Information Society • Anna Manning, Data Mining and HPC • Elaine Mackey, Social Policy • Duncan Smith, Statistics and Stochastic Systems • Karen McCullagh, the Law and Social Policy
Associate Members in Manchester Computer Science: Alan Rector, John Gurd, Len Freeman, Adel Taweel. Computation: John Keane. Psychology: Karen Lander, Lee Wickham. Medicine: Iain Buchan. Manchester Computing Centre: Stephen Pickles. Law: Joseph Jakaneli, John Harris.
Research Programmes • The Social and Political Aspects of Confidentiality and Privacy • The Detection of Risky Records: Special Uniqueness • The Disclosure Risk Issues Posed by the Grid • High Performance Computing and Statistical Disclosure • Medical Records: Clinical E-Science Framework • The SAMDIT Methodology: Data Monitoring Centre
Consultancy • ONS: Census, Social Survey, Neighbourhood Statistics • US Census Bureau • Australian Bureau of Statistics • Statistics New Zealand
Sub Fields • Disclosure risk assessment. • Disclosure control methodology. • Analytical validity. • Microdata and Aggregate data. • Business and Personal data. • Intentional and Consequential data.
Our General Approach:The SAMDIT method • Scenario Analysis (Elliot and Dale 1999) • Metric Development • Implementation • Testing
The Microdata Disclosure Risk Problem: An Example [Diagram: an identification file (Name, Address, Sex, Age, …) is matched to a target file (Sex, Age, …, Income, …) on shared key variables such as Sex and Age; Name and Address are ID variables, Income is a target variable.]
Risk Assessment Methods • File level • Population uniqueness, e.g. Bethlehem (1990), Samuels (1998) • DIS: Skinner and Elliot (2002) • Record level • Statistical modelling (Fienberg and Makov 1998, Skinner and Holmes 1998) • Computational search: Elliot et al. (2002)
Data Intrusion Simulation • Uses microdata set (or table) itself to estimate risk - no population data. • An estimate of the probability of a correct match (given a unique match). • Special method: sub-sampling and re-sampling. • General method: derivation from the equivalence class structure.
The DIS Method Remove a small number of records from the microdata sample
The DIS Method II Copy back a random number of the removed records (at a probability equivalent to the original sampling fraction)
The DIS Method III Match the removed fragment against the truncated microdata file
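The slide deck also mentions a "general method" that derives the same estimate from the equivalence-class structure alone. A minimal sketch of that estimator is below; it uses the Skinner and Elliot (2002) formula Pr(cm|um) ≈ πU₁ / (πU₁ + 2(1−π)U₂), where π is the sampling fraction, U₁ the number of equivalence classes of size 1 (sample uniques) and U₂ the number of classes of size 2. The function name and data representation are illustrative, not from the original software.

```python
from collections import Counter

def dis_general(records, pi):
    """DIS 'general method': estimate Pr(correct match | unique match)
    for a sample of fraction pi, using only the equivalence-class
    structure of the key variables.

    records: iterable of hashable key-variable tuples.
    Estimator (Skinner & Elliot 2002): pi*U1 / (pi*U1 + 2*(1-pi)*U2),
    where U1 = number of classes of size 1, U2 = classes of size 2.
    """
    class_sizes = Counter(Counter(records).values())
    u1 = class_sizes.get(1, 0)
    u2 = class_sizes.get(2, 0)
    denom = pi * u1 + 2 * (1 - pi) * u2
    return pi * u1 / denom if denom else 0.0
```

For example, a file with two identical records and two uniques under a 50% sampling fraction gives U₁ = 2, U₂ = 1 and an estimate of 0.5.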
Validation • Empirical validation studies comparing against results obtained using population data show no bias and small error: Elliot (2001) • Mathematical proof: Skinner and Elliot (2002)
[Chart: Pr(cm|um) for a 2% sample with a basic key (age, sex, marital status)]
Levels of Risk Analysis • DIS • Works at the file level • Very good for comparative analyses • e.g. SAMs
Levels of Risk Analysis • Record level risk is important • Variations in risk topography • Risky records
Special Uniques • Original concept: a counterintuitive geographical effect indicated two types of sample uniques • Random and Special • Special: epidemiological peculiarity • Random: an effect of sampling and variable definition
Special Uniques • Changing definition: • Sample uniques which remain unique despite geographical aggregation • Sample uniques which remain unique through any variable aggregation • Sample uniques on subset of key variables • Dichotomy to Dimension
Minimal Sample Unique • A sample-unique set of variable values for which no proper subset is also unique.
Risk Signatures: combinations of minimal uniques • Example • Unique pairs 0 • Unique triples 5 • Unique fourfolds 1 • Unique fivefolds 3 • Unique sixfolds 0 • Unique sevenfolds 0 • ………
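A brute-force sketch of finding a record's minimal sample uniques is below; it illustrates the definition on the previous slide, not the actual SUDA search (which uses smarter pruning to cope with the combinatorial explosion discussed next). Function names and the dict-of-records representation are assumptions for illustration.

```python
from itertools import combinations

def minimal_uniques(data, row, keys):
    """Return the minimal sample uniques for record `row`: subsets of
    `keys` on which the record is unique in `data` and no proper
    subset is also unique. Brute force - illustrative only.

    data: list of dicts mapping variable name -> value.
    """
    def unique_on(vs):
        target = tuple(data[row][v] for v in vs)
        return sum(tuple(rec[v] for v in vs) == target for rec in data) == 1

    found = []
    for size in range(1, len(keys) + 1):
        for vs in combinations(keys, size):
            # if a subset is already unique, vs cannot be minimal
            if any(set(m) <= set(vs) for m in found):
                continue
            if unique_on(vs):
                found.append(vs)
    return found
```

Counting the minimal uniques found at each size gives exactly the risk-signature profile shown above (so many unique pairs, triples, and so on).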
Special Uniques • Problem: how to look at all the variables? • File may contain hundreds • Even with scenario keys individual records can contain hundreds of minimal sample uniques • Combinatorial explosion
HIPERSTAD Projects • Funded by ESRC, ONS and EPSRC • Use of high performance computing • Enables comprehensive analysis of patterns of uniqueness within each record • Has allowed investigation of more complex grading systems
Risk Signatures II • Allow grading and classification of records • Differential treatment • Low impact high efficacy disclosure control
Combining DIS and SUDA • A heuristic combining the two measures to give a per-record matching confidence has proved very effective • ONS evaluation studies show that the combined method picks out high-probability risk very well
SUDA software • Available free under licence • Used at ONS, ABS and Statistics New Zealand
Introduction • Measurement of Disclosure Risk is an important precursor for its control • Intruder/scenario based metrics are better than abstract ones • Such metrics are available for microdata but not for aggregate data
Overview • Overview of the issues and introducing the method on a conceptual level • Details of the algorithms • Ongoing and Future Work
The Issues • Aggregate data is usually 100% data, so measures based on identification disclosure and sampling are meaningless • A better approach is to evaluate what can be inferred through attribute disclosure
The Approach • Rather than assess the risk of actual attribute disclosure, we propose estimating the probability of producing a potentially disclosive table, which we define as any table containing at least one zero • The method/measure we propose can be applied to: • Single tables • Groups of tables • Unperturbed and perturbed tables • Unpublished tables
The Bounds Problem • In a general sense, any set of released tables can be viewed as a set of bounds on the full table. For example, if we release two one-way frequency tables:
The Bounds Problem We are effectively releasing the marginals of a two-way frequency table whose entire joint distribution has been suppressed
The cells in the joint distribution can be expressed as a set of bounds (or ranges of feasible values)
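For two one-way marginals, those feasible ranges are the classical Fréchet bounds: cell (i, j) lies between max(0, rᵢ + cⱼ − N) and min(rᵢ, cⱼ). A minimal sketch (function name is illustrative):

```python
def frechet_bounds(row_marg, col_marg):
    """Cell-wise Fréchet bounds for a two-way frequency table whose
    interior is suppressed but whose one-way marginals are released.

    lower[i][j] = max(0, r_i + c_j - N), upper[i][j] = min(r_i, c_j).
    """
    n = sum(row_marg)
    lower = [[max(0, r + c - n) for c in col_marg] for r in row_marg]
    upper = [[min(r, c) for c in col_marg] for r in row_marg]
    return lower, upper
```

A zero upper bound in any cell is exactly the "potentially disclosive" condition defined earlier, which is what the SAP method below tests for after subtraction.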
The Subtraction – Attribution Probability (SAP) Method • The risk associated with a table release depends on the set of tables jointly, rather than on the individual tables. • SAP can be used on single tables, groups of tables, perturbed or unperturbed tables. • Bounds are calculated, and then the probability of an intruder producing one or more upper bounds of zero by subtracting k random individuals from the table is calculated • The output can be set for user-defined levels of k
Original cell counts can be recovered from the marginal tables
Subtraction • We consider that an intruder might have knowledge of the relevant population, as well as information in the table release • We assume (at least initially) that the intruder has perfect knowledge of k randomly selected individuals
Single exact tables • The lower / upper bounds are equal to the published counts • The probability of an intruder recovering at least one zero by subtracting known individuals is found by calculating Hypergeometric probabilities and applying the inclusion / exclusion principle
• The marginal probability of observing all individuals in a cell is calculated for each individual cell, and the sum is added to a total (initially zero) • The marginal probability of observing all individuals in a pair of cells is calculated for each pair of cells, and subtracted from the total • The marginal probability of observing all individuals in a ‘triple’ of cells is calculated for each triple of cells, and added to the total • And so on, until we have considered the table total, or all subsequent probabilities are zero
For example, for k = 3 and the following table (zero-probability terms not shown),
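The inclusion-exclusion procedure above can be sketched for a single exact table as follows. A subset of cells with combined count s is fully known to the intruder with hypergeometric probability C(N−s, k−s)/C(N, k); alternating these terms over subsets of cells gives the probability of recovering at least one zero. This is a brute-force illustration (the function name is assumed), exponential in the number of cells, so it is only suitable for small tables.

```python
from itertools import combinations
from math import comb

def sap_single_table(cells, k):
    """Probability that an intruder with perfect knowledge of k
    individuals drawn at random from a table of N = sum(cells) can
    produce at least one zero upper bound by subtraction.

    Inclusion-exclusion over non-empty subsets of cells; a subset with
    combined count s is fully subtracted with hypergeometric
    probability C(N-s, k-s) / C(N, k). Brute force - small tables only.
    """
    n = sum(cells)
    total, sign = 0.0, 1
    for size in range(1, len(cells) + 1):
        for subset in combinations(cells, size):
            s = sum(subset)
            if s <= k:  # terms with s > k have zero probability
                total += sign * comb(n - s, k - s) / comb(n, k)
        sign = -sign
    return total
```

For instance, with cells (1, 2) and k = 1 only the singleton cell can be emptied, giving probability 1/3; with cells (1, 1) and k = 1 some cell is always emptied, giving probability 1.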
Confidentiality and the Grid • What new data possibilities does the Grid provide and what confidentiality implications do they have? • How could the Grid (or a Grid) be used to enable disclosure risk assessment and control? • How could a grid enable a data intruder? • What are the possibilities and issues provided by remote access?