WP 15 Experience of using a Post Randomisation Method (PRAM) at ONS. Christine Bycroft, Katherine Merrett, Office for National Statistics, UK
Outline • What is PRAM • Why we needed to adapt the PRAM method • Adapted PRAM Methodology • Disclosure risks • Effect on Data Quality • Conclusions
What is PRAM • PRAM is a disclosure control technique for categorical data in microdata files. • The values of a categorical variable are changed according to a prescribed probability. • Each new perturbed value may or may not be different from the original value. • For example, a person who is classified as a widow may be re-classified as single.
Probability mechanism for PRAM • The probability mechanism is described by an invertible transition matrix P • One P matrix for each variable • Let P = (p_ij) be an L x L matrix for a variable having L categories. Each entry p_ij is the conditional probability that a record whose original category is i is assigned category j after perturbation, so every row of P sums to 1. • p_ii is the probability of no change
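As a minimal sketch of the mechanism described above, the following draws a perturbed category for each record from the row of P that matches its original category. The category labels, the matrix entries and the function name `pram` are illustrative assumptions, not values from the ONS application.

```python
import random

CATEGORIES = ["single", "married", "widowed"]

# Hypothetical transition matrix: row i holds P(perturbed = j | original = i).
# The diagonal entries p_ii are the probabilities of no change.
P = [
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
]

def pram(values, categories, matrix, rng):
    """Draw one perturbed category per value, using that value's row of the matrix."""
    index = {c: i for i, c in enumerate(categories)}
    return [
        rng.choices(categories, weights=matrix[index[v]], k=1)[0]
        for v in values
    ]

rng = random.Random(0)
original = ["single"] * 5 + ["married"] * 5 + ["widowed"] * 5
perturbed = pram(original, CATEGORIES, P, rng)
changed = sum(o != p for o, p in zip(original, perturbed))
print(f"{changed} of {len(original)} values changed")
```

Because each value is drawn independently, a perturbed value may or may not differ from the original, and the category counts are only preserved on average, not exactly.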
Risk and data utility for PRAM • Disclosure risk: PRAM offers protection through inflow and outflow: inflow from safe combinations of values to risky combinations, and outflow from risky combinations to safe combinations. • Data utility: the invariant PRAM method preserves univariate frequencies in expectation. • There is no control over joint distributions, so PRAM may create edit failures (e.g. a 14-year-old doctor) or highly unusual combinations (e.g. a 17-year-old widow).
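The invariance property can be checked numerically: if t is the vector of category counts, P is invariant when the expected counts after perturbation, t P, equal t. The sketch below builds one simple invariant matrix by mixing the identity with the observed distribution. This construction, the parameter `alpha` and the example counts are assumptions chosen for illustration, not the matrix used at ONS.

```python
def expected_counts(t, P):
    """Expected category counts after PRAM: the vector-matrix product t P."""
    L = len(t)
    return [sum(t[i] * P[i][j] for i in range(L)) for j in range(L)]

def invariant_matrix(t, alpha):
    """Return (1 - alpha) * I + alpha * (each row equal to the distribution of t).

    With probability 1 - alpha a value is kept; otherwise it is redrawn from
    the observed distribution. This is one easy way to satisfy t P = t.
    """
    total = sum(t)
    L = len(t)
    return [
        [(1 - alpha) * (1 if i == j else 0) + alpha * t[j] / total for j in range(L)]
        for i in range(L)
    ]

t = [50, 30, 20]                 # hypothetical univariate frequencies
P = invariant_matrix(t, 0.4)
print(expected_counts(t, P))     # matches t up to floating-point rounding
```

Invariance only guarantees the frequencies in expectation. Any single realisation of the random draws can still deviate from t.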
Why adapt the PRAM method? • Applied to the 2001 Individual Sample of Anonymised Records (SARs) drawn from the Census (population uniques are known from the Census records) • Recoding was used as the first method to reduce risk • PRAM is not applied to the whole file • Only the remaining high-risk records are perturbed (a small proportion of all records) • We wish to preserve exact univariate frequencies, not just expected values • We wish to control joint distributions to minimise edit failures and unusual combinations
Adapted PRAM Methodology • Perturb only those records which are high risk • For the transition matrix P, we want to: • Maximise the probability of changing values • Preserve frequencies (i.e. P is invariant) • Create perturbed records that are feasible and will not result in highly unusual combinations • Define a linear programming (LP) problem
Adapted PRAM Methodology • The LP routine minimises the objective function, subject to constraints. • The objective function is [formula not reproduced in this transcript] • We have set up a Weight Matrix to avoid extreme transitions. • Rather than having extreme changes that might create highly unusual individuals or invalid combinations, we prefer to keep the values as they are.
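The slide's objective function is not available here, so the sketch below assumes a common weighted form, the sum over i and j of w_ij * p_ij, where the Weight Matrix W gives a large penalty to extreme transitions, a moderate penalty to staying put, and no penalty to moves between neighbouring categories. The weights, counts and candidate matrices are all hypothetical. The code only scores and checks candidate matrices; it does not solve the LP.

```python
def weighted_cost(W, P):
    """Assumed objective: sum over all cells of w_ij * p_ij (lower is better)."""
    return sum(w * p for w_row, p_row in zip(W, P) for w, p in zip(w_row, p_row))

def is_feasible(P, t, tol=1e-9):
    """Check the LP constraints: p_ij >= 0, rows sum to 1, and invariance t P = t."""
    L = len(t)
    non_negative = all(p >= -tol for row in P for p in row)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in P)
    invariant = all(
        abs(sum(t[i] * P[i][j] for i in range(L)) - t[j]) <= tol for j in range(L)
    )
    return non_negative and rows_ok and invariant

t = [40, 40, 20]  # hypothetical counts for three ordered bands

# Hypothetical weights: staying costs 1, a neighbouring band costs 0,
# and jumping across two bands (an extreme transition) costs 10.
W = [[1, 0, 10],
     [0, 1, 0],
     [10, 0, 1]]

no_change = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
neighbour_moves = [[0.5, 0.5, 0.0],
                   [0.5, 0.25, 0.25],
                   [0.0, 0.5, 0.5]]

for name, P in [("no change", no_change), ("neighbour moves", neighbour_moves)]:
    print(name, is_feasible(P, t), weighted_cost(W, P))
```

Under these assumed weights the candidate that moves values between neighbouring bands scores lower than leaving everything unchanged, while any matrix with mass on the extreme cells would be penalised heavily. An LP solver would search over all feasible P for the lowest cost.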
Implementation • Apply PRAM to the variables sequentially, starting with the variable that contributes most to risk • Define a weight matrix for each variable • Solve the LP in SAS to obtain the transition matrix P • Apply PRAM within control variables (e.g. PRAM age within marital status categories) • The p_ij probabilities are implemented in a way that preserves exact frequencies • Check for edit failures, and correct them • Perturbed records are flagged as being imputed (whether changed or not)
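The slides do not say how the p_ij probabilities were turned into exact frequencies. One possible reading, sketched here as an assumption, is to move exactly n_i * p_ij records from category i to category j instead of drawing each record independently. The sketch requires P to be invariant and every n_i * p_ij to be a whole number; otherwise a controlled rounding step (not shown) would be needed. The function name, band labels and matrix are hypothetical.

```python
import random
from collections import Counter

def exact_pram(values, categories, P, rng):
    """Reassign exactly n_i * p_ij randomly chosen records from category i to j."""
    positions = {c: [] for c in categories}
    for pos, v in enumerate(values):
        positions[v].append(pos)
    out = list(values)
    for i, ci in enumerate(categories):
        pos_i = positions[ci]
        rng.shuffle(pos_i)              # random choice of which records move
        start = 0
        for j, cj in enumerate(categories):
            expected = len(pos_i) * P[i][j]
            count = round(expected)
            if abs(expected - count) > 1e-9:
                raise ValueError("n_i * p_ij is not a whole number")
            for pos in pos_i[start:start + count]:
                out[pos] = cj
            start += count
    return out

bands = ["16-24", "25-44", "45+"]                     # hypothetical age bands
values = ["16-24"] * 40 + ["25-44"] * 40 + ["45+"] * 20
P = [[0.5, 0.5, 0.0],
     [0.5, 0.25, 0.25],
     [0.0, 0.5, 0.5]]                                 # invariant for these counts

perturbed = exact_pram(values, bands, P, random.Random(0))
changed = sum(o != p for o, p in zip(values, perturbed))
print(Counter(values) == Counter(perturbed), changed)
```

To perturb within control variables, the same function could be called separately on the records in each control category (for example, each marital status group), so that frequencies are preserved within every group.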
Results: Disclosure risks • Our aim was to protect only against attempts at exact matching. We assumed that perturbing the value of one variable in a high-risk record provides sufficient protection • Protection comes from high outflow, but low inflow • Results showed high proportions changed, except for the last variables in the sequence • This is acceptable, since these variables had the lowest overall contribution to disclosure risk, and only a small number of records were affected
Results: Data Quality • Preservation of the univariate frequencies: excellent results • Preservation of the multivariate frequencies: • very few records failed the edit checks • tables were compared before and after PRAM • for each cell: the ratio of the relative error due to PRAM to the relative sampling error
Effect on Data Quality • Results from 15 tables (nearly 3,000 cells) • The effect of perturbation relative to sampling error decreases as the cell size increases. Thus the damage done by PRAM is greater for cells with low frequencies. Table 1: Percentage of cells across all tables for which the ratio of the error due to PRAM to the sampling error is greater than 1 and greater than 2
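The per-cell measure can be sketched as follows. The slides do not give the sampling error formula, so this example assumes a binomial standard error for a cell count under simple random sampling, with no finite population correction. The function names and the example counts are hypothetical.

```python
import math

def relative_sampling_error(count, sample_size):
    """Assumed binomial standard error of a cell count, divided by the count."""
    p = count / sample_size
    return math.sqrt(sample_size * p * (1 - p)) / count

def pram_to_sampling_ratio(before, after, sample_size):
    """Relative error due to PRAM divided by the relative sampling error."""
    relative_pram_error = abs(after - before) / before
    return relative_pram_error / relative_sampling_error(before, sample_size)

sample_size = 10_000
small_cell = pram_to_sampling_ratio(before=4, after=2, sample_size=sample_size)
large_cell = pram_to_sampling_ratio(before=100, after=98, sample_size=sample_size)
print(round(small_cell, 2), round(large_cell, 2))
```

In this example both cells lose two records to perturbation, yet the ratio is about five times larger for the small cell. A ratio above 1 means the error introduced by PRAM exceeds the sampling error already present in the cell.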
Conclusions • As used in this context on targeted records, PRAM is an efficient and readily controllable method of data perturbation. • Applying PRAM to a small proportion of the file has allowed us to strike a good balance between recoding and minimising the damage from perturbation.