
WP 10

WP 10: On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation. Yosi Rinott, Hebrew University. Natalie Shlomo, Hebrew University and Southampton University.


Presentation Transcript


  1. WP 10 On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation. Yosi Rinott, Hebrew University. Natalie Shlomo, Hebrew University and Southampton University.

  2. Disclosure Risk Measures Notation: Sample (size n): Population (size N): Tables with K cells: m-way table Risk Measures: = expected number of correct matches of sample uniques Estimates:
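
The two risk measures named on this slide are commonly written as follows in the disclosure risk literature. This is a sketch in assumed notation: the symbols $f_k$ and $F_k$ (sample and population counts in cell $k$, with $\sum_k f_k = n$ and $\sum_k F_k = N$) and the names $\tau_1$, $\tau_2$ do not appear in the transcript.

```latex
\begin{align*}
\tau_1 &= \sum_{k=1}^{K} I(f_k = 1,\ F_k = 1)
  && \text{(number of sample uniques that are population uniques)} \\
\tau_2 &= \sum_{k=1}^{K} I(f_k = 1)\,\frac{1}{F_k}
  && \text{(expected number of correct matches of sample uniques)} \\
\hat{\tau}_1 &= \sum_{k=1}^{K} I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1), \qquad
\hat{\tau}_2 = \sum_{k=1}^{K} I(f_k = 1)\,\hat{E}\!\left[\frac{1}{F_k} \,\middle|\, f_k = 1\right]
\end{align*}
```

Under this reading, the estimates replace the unknown population quantities with conditional probabilities and expectations computed from a model for $F_k$ given $f_k$.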

  3. On Definitions of Disclosure Risk • In the statistics literature, we present examples of risk measures, but we lack formal definitions of when a file is safe • In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5); Adam and Wortmann (1989), who write “it may be argued that elimination of disclosure is possible only by elimination of statistics”) • In some of the CS literature, any data must be released with noise • The noise must be small enough that legitimate information on large subsets of the data remains useful, and large enough that information on small subsets or individuals is too noisy to be useful (regardless of whether it is obtained by direct queries, differencing, etc.)

  4. On Definitions of Disclosure Risk The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped) simplifies definitions: there is no need to consider other, more realistic but more complicated scenarios. But would statistics bureaus and statisticians agree to adding noise to any data? Other approaches, like query restriction or query auditing, do not lead to formal definitions.

  5. Definition of Disclosure Risk Numerical Data Base , A Query is a sum over a subset of . Query is perturbed by adding some noise of magnitude Proven that almost all can be reconstructed if and none of them can be reconstructed if Adding noise of order hides information on individuals and small groups, but yields meaningful information about sums of O(n) units for which noise of order is natural. Work further expanded to lessen the magnitude of the noise by limiting the number of queries.
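
The trade-off described on this slide can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the database is taken to be binary, the noise is Gaussian, and its scale is set to the square root of n (the transcript does not show the noise order, so this choice is an assumption suggested by the remark about sums of O(n) units). The function name `noisy_sum` is hypothetical.

```python
import math
import random

def noisy_sum(db, subset, noise_sd, rng):
    """Return the sum of db over `subset`, perturbed by Gaussian noise."""
    exact = sum(db[i] for i in subset)
    return exact + rng.gauss(0.0, noise_sd)

rng = random.Random(0)
n = 10_000
db = [rng.randint(0, 1) for _ in range(n)]  # hypothetical binary database
noise_sd = math.sqrt(n)                     # assumed noise scale: sqrt(n)

# Query on a single individual: the true answer is 0 or 1, while the
# noise has standard deviation sqrt(n), so the answer is uninformative.
single_answer = noisy_sum(db, [0], noise_sd, rng)

# Query on n/2 individuals: the true answer is of order n, so noise of
# order sqrt(n) is a small relative perturbation.
large = list(range(n // 2))
large_exact = sum(db[i] for i in large)
large_answer = noisy_sum(db, large, noise_sd, rng)
rel_err = abs(large_answer - large_exact) / large_exact

print(f"single query (true value {db[0]}): {single_answer:.1f}")
print(f"large query relative error: {rel_err:.3f}")
```

The same noise that swamps a query about one unit changes the large sum by only a few percent, which is the sense in which such noise hides individuals but keeps aggregate statistics usable.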

  6. Definition of Disclosure Risk • Collaboration between the CS and statistical communities, where: • In the statistical community, there is a need for more formal and clear definitions of disclosure risk • In the CS community, there is a need for statistical methods to preserve the utility of the data: - allow sufficient statistics to be released without perturbation - methods for adding correlated noise - sub-sampling and other methods for data masking • Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?

  7. Probabilistic Models Focus on sample microdata and not whole population (sampling provides a priori protection against disclosure) Standard (natural) Assumptions ind. Bernoulli or Poisson sampling In particular the size biased Poisson distribution
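
A sketch of how a Poisson model of this kind yields per-cell risks for a sample unique. It assumes a common formulation that the transcript does not spell out: population counts F_k ~ Poisson(N * gamma_k) and Bernoulli sampling with probability pi_k, so that given f_k = 1 the count F_k is distributed as 1 + Poisson(lam) with lam = N * gamma_k * (1 - pi_k). The names `gamma_k`, `pi_k`, `lam` and both function names are assumptions.

```python
import math

def residual_mean(N, gamma_k, pi_k):
    """Mean of the unsampled count F_k - f_k under the assumed Poisson model."""
    return N * gamma_k * (1.0 - pi_k)

def poisson_unique_risks(lam):
    """Per-cell risks for a sample unique when F_k - 1 ~ Poisson(lam).

    Returns (P(F_k = 1 | f_k = 1), E[1 / F_k | f_k = 1]).
    """
    p_pop_unique = math.exp(-lam)
    if lam == 0.0:
        expected_inv_F = 1.0
    else:
        # E[1/(1+X)] for X ~ Poisson(lam) equals (1 - exp(-lam)) / lam.
        expected_inv_F = -math.expm1(-lam) / lam
    return p_pop_unique, expected_inv_F

# Hypothetical cell: N = 1000, gamma_k = 0.002, pi_k = 0.02.
lam = residual_mean(1000, 0.002, 0.02)
print(poisson_unique_risks(lam))
```

Summing the first component over all sample uniques gives an estimate of the number of population uniques among them, and summing the second gives an estimate of the expected number of correct matches.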

  8. Probabilistic Models Add iid As ( ) we obtain the mu-argus assumption As ( ) we obtain the above Poisson Model

  9. Mu-Argus Model (Benedetti, Capobianchi, Franconi (1998)) is the sampling weight of individual i obtained from design or post-stratification where If then but are underestimated risk is under estimated Monotonicity: if we replace by some , risk estimates increase to the correct level in , but how to estimate ?
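
A sketch of the per-cell risks under a common formulation of the Argus model, stated here as an assumption since the slide's formulas are not in the transcript: given f_k, the unsampled count F_k - f_k is negative binomial with parameters (f_k, pi_k), and pi_k is estimated by f_k divided by the sum of the sampling weights of the sampled units in cell k. For a sample unique this gives P(F_k = j) = pi_k * (1 - pi_k)^(j - 1) for j = 1, 2, ... The function names are hypothetical.

```python
import math

def argus_pi_hat(weights_in_cell):
    """Estimate pi_k as f_k / (sum of sampling weights of sampled units in cell k)."""
    return len(weights_in_cell) / sum(weights_in_cell)

def argus_unique_risks(pi_k):
    """Per-cell risks for a sample unique when P(F_k = j) = pi_k * (1 - pi_k)**(j - 1).

    Returns (P(F_k = 1 | f_k = 1), E[1 / F_k | f_k = 1]).
    """
    if pi_k >= 1.0:
        return 1.0, 1.0
    p_pop_unique = pi_k
    # Sum over j >= 1 of pi * (1 - pi)**(j - 1) / j = pi / (1 - pi) * log(1 / pi).
    expected_inv_F = pi_k / (1.0 - pi_k) * math.log(1.0 / pi_k)
    return p_pop_unique, expected_inv_F

# Hypothetical sample unique whose single sampled unit has weight 50.
pi_hat = argus_pi_hat([50.0])
print(pi_hat, argus_unique_risks(pi_hat))
```

Because the estimate of pi_k depends only on the weights within the cell, no information from other cells enters the risk estimate.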

  10. Poisson Log-linear Models (Skinner, Holmes (1998), Elamir, Skinner (2005), Skinner, Shlomo (2005)) Monotonicity in the size of the model (number of parameters): Saturated (“big” model): data overfitted, risk underestimated. Independence (“small” model): data underfitted, risk overestimated. Intermediate models with conditional independence involve smaller products of marginal proportions, and therefore we expect monotonicity across the models, so, as with the choice of the parameter in the Mu-Argus model, there will be a model which gives a good risk estimate.
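
The effect of model size can be sketched on a toy two-way table by comparing the independence fit with the saturated fit. This assumes a Poisson formulation with equal sampling fraction `pi`, in which a fitted sample mean `mu` for a sample unique gives an unsampled count that is Poisson with mean `mu * (1 - pi) / pi`. The table, the value of `pi`, and the function names are hypothetical.

```python
import math

def independence_fit(table):
    """Fitted counts under independence: row total * column total / n."""
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    return [[row[i] * col[j] / n for j in range(len(col))]
            for i in range(len(row))]

def risk_estimates(table, fitted, pi):
    """Return (tau1_hat, tau2_hat) summed over the sample uniques of `table`."""
    tau1 = tau2 = 0.0
    for obs_row, fit_row in zip(table, fitted):
        for f, mu in zip(obs_row, fit_row):
            if f != 1:
                continue
            lam = mu * (1.0 - pi) / pi  # mean of the unsampled count
            tau1 += math.exp(-lam)
            tau2 += -math.expm1(-lam) / lam if lam > 0 else 1.0
    return tau1, tau2

sample = [[1, 1, 0],
          [0, 1, 0],
          [1, 0, 2]]
pi = 0.1

tau_ind = risk_estimates(sample, independence_fit(sample), pi)
tau_sat = risk_estimates(sample, sample, pi)  # saturated fit: mu = observed count
print("independence:", tau_ind)
print("saturated:   ", tau_sat)
```

In this toy table the independence fit gives higher values of both estimates than the saturated fit, which matches the direction stated on the slide; the ordering is not guaranteed for every table.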

  11. Neighborhood of a Log-linear Model Log-linear models take into account a neighborhood of cells when making inference for a cell in order to determine its risk. For example, the independence neighborhood of cell k=(i,j): the estimate is the product of marginal proportions obtained by fixing one attribute at a time. Thus, if one attribute is income group, then inference on the very rich involves information on the very poor, provided there is another attribute in common, such as marital status.

  12. Discussion of Neighborhoods • How likely is a sample unique to be a population unique? • If a sample unique has mostly small or empty neighboring cells, it is more likely to be a population unique. • Argus is based on weights and involves no learning from other cells. • The log-linear Poisson model takes neighborhoods into account, reduces the number of parameters, and thereby reduces their standard deviation and hence that of the risk measures (provided that the model is valid). • Are there other types of neighborhoods which may be more natural? • We focus on ORDINAL variables

  13. Proposed Neighborhoods • Local smoothers for large sparse (ordinal) tables, e.g. Bishop, Fienberg, Holland (1975), Simonoff (1998) • Use local neighborhoods to fit a simple smooth function to or to estimate smoothly • Construct neighborhood of cells of k, by varying the coordinates of ordinal attributes, and fixing non-ordinal attributes • Neighborhood of cell k at distance c from cell k

  14. Proposed Neighborhoods Regressors, for cell k: Define structural zeros: a cell is a structural zero if all neighborhoods of the cell which are used in the regression contain only empty cells
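
The neighborhood construction and the structural zero rule can be sketched as follows. The exact definitions are not in the transcript, so this sketch makes two assumptions: the neighborhood at distance c consists of cells whose ordinal coordinates differ from those of cell k by a total absolute amount of exactly c (non-ordinal coordinates stay fixed), and the regressors are average sample counts over each neighborhood. All function names are hypothetical.

```python
from itertools import product

def neighbourhood(cell, shape, ordinal_axes, c):
    """Cells at L1 distance exactly c from `cell`, moving only along ordinal axes."""
    ranges = []
    for axis, (coord, size) in enumerate(zip(cell, shape)):
        if axis in ordinal_axes:
            ranges.append(range(max(0, coord - c), min(size, coord + c + 1)))
        else:
            ranges.append(range(coord, coord + 1))  # non-ordinal: fixed
    return [k for k in product(*ranges)
            if sum(abs(a - b) for a, b in zip(k, cell)) == c]

def regressors(counts, cell, shape, ordinal_axes, distances=(1, 2)):
    """Average sample count in the neighbourhood at each distance (assumed form)."""
    out = []
    for c in distances:
        cells = neighbourhood(cell, shape, ordinal_axes, c)
        total = sum(counts.get(k, 0) for k in cells)
        out.append(total / len(cells) if cells else 0.0)
    return out

def is_structural_zero(counts, cell, shape, ordinal_axes, distances=(1, 2)):
    """True if every neighbourhood used in the regression contains only empty cells."""
    return all(x == 0 for x in regressors(counts, cell, shape, ordinal_axes, distances))

# Hypothetical table: axis 0 is sex (non-ordinal), axes 1 and 2 are ordinal.
shape = (2, 4, 4)
ordinal_axes = {1, 2}
counts = {(0, 1, 1): 1, (0, 1, 2): 2, (1, 1, 2): 5}  # sparse sample counts

print(regressors(counts, (0, 1, 1), shape, ordinal_axes))
print(is_structural_zero(counts, (1, 3, 3), shape, ordinal_axes))
```

Storing the counts in a dictionary keyed by cell keeps the sketch usable for large sparse tables, where most of the K cells are empty.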

  15. Example Population from the 1995 Israeli Census File, Age>15, N=746,949, n=14,939, and K=337,920. Key: Sex (2), Age groups (16), Groups of years of study (10), Number of years in Israel (11), Income groups (12), Number of persons in household (8). Sex is not ordinal and is fixed. Weights for Argus are obtained by post-stratification on weighting classes: sex, age and geographical location.

  16. Example

  17. Results of Example • The independence log-linear model and the neighborhoods overestimate the two risk measures • The Argus model underestimates • The all 2-way interaction log-linear Poisson model has the best estimates • Taking into account the structural zeros in the neighborhoods yields more reasonable estimates

  18. Conclusions • Need to refine the neighborhood approach, define the model better and develop MLE theory • We expect the new model to work well in multi-way tables when simple log-linear models are not valid • Incorporate the approach into a more general regression model, the Negative Binomial Regression, which subsumes both the Poisson Risk Model and the Argus Model
