
WP 33 Information Loss Measures for Frequency Tables

Caroline Young (University of Southampton; Office for National Statistics), cjy@soton.ac.uk. Natalie Shlomo (University of Southampton; Office for National Statistics), n.shlomo@soton.ac.uk.



Presentation Transcript


  1. WP 33 Information Loss Measures for Frequency Tables
     Caroline Young, University of Southampton, Office for National Statistics, cjy@soton.ac.uk
     Natalie Shlomo, University of Southampton, Office for National Statistics, n.shlomo@soton.ac.uk

  2. Topics of Discussion
     • Introduction
     • Methods for perturbing frequency tables containing whole population counts
     • Information loss measures for assessing the impact of SDC methods on utility and quality
     • Data description and definition of tables
     • Examples and analysis of results
     • Conclusions and future research

  3. Introduction
     1. Focus on frequency tables containing whole population counts: the UK Neighbourhood Statistics (NeSS) website, which disseminates small area statistics from census and administrative data
     2. Tables are intentionally perturbed for statistical disclosure control (SDC), causing information loss
     3. Develop quantitative information loss measures for choosing optimal SDC methods that preserve high utility in the tables
     4. Information loss depends on the SDC method, the characteristics of the table and the use of the data

  4. SDC Methods for Frequency Tables
     SDC for frequency tables containing population counts:
     • Small Cell Adjustments (SCA): random rounding to base 3 of small cells. The perturbation has a mean of zero and a variance of 2. Marginal totals are obtained by adding perturbed and non-perturbed cells.
     • Full Random Rounding (RaRo): random rounding to base 3 for all entries, using the same method as above after converting all entries to their residuals modulo 3. Marginal totals are rounded separately, so the tables are not additive. Utility can be improved by semi-controlling for marginal totals.
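The random rounding described on this slide can be sketched as follows. This is a minimal illustration, not the ONS implementation: it assumes unbiased rounding (round up with probability residual/base), and it assumes that "small cells" means counts strictly between 0 and the base, which the slide does not state explicitly. The function names are hypothetical.

```python
import random
from fractions import Fraction


def random_round(count, base=3, rng=random):
    """Unbiased random rounding of a non-negative integer to a multiple of `base`."""
    residual = count % base
    if residual == 0:
        return count  # already a multiple of the base, left unchanged
    lower = count - residual
    # Round up with probability residual/base so the expected value equals `count`.
    if rng.random() < residual / base:
        return lower + base
    return lower


def small_cell_adjust(count, base=3, rng=random):
    """Round only small cells (assumption: 0 < count < base); leave others as they are."""
    if 0 < count < base:
        return random_round(count, base, rng)
    return count


def perturbation_moments(residual, base=3):
    """Exact mean and variance of the perturbation for a non-zero residual."""
    p_up = Fraction(residual, base)
    up, down = base - residual, -residual
    mean = p_up * up + (1 - p_up) * down
    variance = p_up * up**2 + (1 - p_up) * down**2 - mean**2
    return mean, variance
```

Under these assumptions, `perturbation_moments(1)` and `perturbation_moments(2)` both give a mean of 0 and a variance of 2, which agrees with the figures quoted on the slide.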

  5. SDC Methods for Frequency Tables
     SDC for frequency tables containing population counts (cont.):
     • Controlled Rounding (CR3): all entries are rounded to base 3 according to the solution of a linear programming problem, ensuring that the aggregated rounded internal cells equal the rounded margins. Controlled rounding is carried out via Tau-Argus (the standard tool for NeSS tables).
     • Cell suppression: small cells (ones and twos) are suppressed, and secondary suppressions are found to protect against recalculation through the margins. Cell suppression is carried out via Tau-Argus and the hyper-cube method.

  6. SDC Methods for Frequency Tables
     SDC for frequency tables containing population counts (cont.):
     • Imputation methods for cell suppression: the margins are known, so the total of the suppressed cells is known
       - S-A: impute by the average of the total of the suppressed cells in each row
       - S-WA: impute by the weighted average of the total of the suppressed cells in each row, where the weights are the column totals
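The two imputation rules can be sketched for a single row as below. This is an illustrative reading of the slide, not the authors' code: suppressed cells are marked with `None`, the function name and arguments are hypothetical, and the weighted variant assumes that the weights are the column totals of the suppressed columns only, normalised to sum to one.

```python
def impute_row(row, row_total, col_totals, weighted=False):
    """Fill suppressed cells (None) in one row of a frequency table.

    weighted=False: S-A, the suppressed total is shared equally.
    weighted=True:  S-WA, the suppressed total is shared in proportion
                    to the column totals of the suppressed columns (assumption).
    """
    suppressed = [j for j, value in enumerate(row) if value is None]
    if not suppressed:
        return list(row)
    # The row margin is published, so the suppressed total is recoverable.
    published = sum(value for value in row if value is not None)
    suppressed_total = row_total - published
    if weighted:
        weight_sum = sum(col_totals[j] for j in suppressed)
        shares = {j: col_totals[j] / weight_sum for j in suppressed}
    else:
        shares = {j: 1 / len(suppressed) for j in suppressed}
    return [
        suppressed_total * shares[j] if value is None else value
        for j, value in enumerate(row)
    ]
```

For example, with `row = [10, None, None, 5]`, `row_total = 18` and `col_totals = [100, 30, 10, 50]`, the suppressed total is 3. S-A assigns 1.5 to each suppressed cell, while S-WA assigns 2.25 and 0.75.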

  7. Information Loss Measures
     • Measuring distortion to distributions: distance metrics between original and perturbed cells in each geography (i.e., ward (NUTS5)), averaged across all wards
     • The metrics are defined in terms of the table for ward k, the number of cells in the ward, the number of wards, and the cell frequency for cell c:
       - Hellinger's Distance (HD)
       - Relative Absolute Distance (RAD)
       - Average Absolute Distance per Cell (AAD)
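The formulas for these three metrics did not survive in the transcript. The sketch below uses commonly cited forms and should be read as an assumption, not as the slide's exact definitions: HD is the square root of half the sum of squared differences of square-rooted cell counts, RAD is the sum of absolute differences relative to the original count (cells with an original count of zero are skipped here), and AAD is the sum of absolute differences divided by the number of cells. Each is computed per ward and then averaged across wards. All function names are hypothetical.

```python
import math


def hellinger(orig, pert):
    """Hellinger-type distance between two lists of cell counts for one ward."""
    squared = sum((math.sqrt(p) - math.sqrt(o)) ** 2 for o, p in zip(orig, pert))
    return math.sqrt(0.5 * squared)


def relative_absolute_distance(orig, pert):
    """Sum of |pert - orig| / orig; zero original cells are skipped (assumption)."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o > 0)


def average_absolute_distance(orig, pert):
    """Sum of |pert - orig| divided by the number of cells in the ward."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)


def average_over_wards(metric, orig_wards, pert_wards):
    """Apply a per-ward metric to every ward and average the results."""
    values = [metric(o, p) for o, p in zip(orig_wards, pert_wards)]
    return sum(values) / len(values)
```

A call such as `average_over_wards(hellinger, orig_wards, pert_wards)` takes two lists of wards, where each ward is a list of cell counts in the same cell order.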

  8. Information Loss Measures
     • Aggregation of perturbed cells and effects on sub-totals:
       - Users aggregate perturbed lower-level geographies to obtain non-standard geographies
       - Calculate the sub-totals and compare the perturbed sub-totals with the original ones
     • Impact on tests for independence: Cramer's V measure of association, which is based on the Pearson chi-square statistic
     • Information loss measure: the percent relative difference in Cramer's V between the perturbed and original tables
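The association measure can be sketched with the standard definition of Cramer's V for a two-way table, V = sqrt(chi-square / (n * min(r - 1, c - 1))). Treating the information loss as a percent relative difference follows the wording of the results slide; the exact formula used by the authors is not in the transcript, so the last function is an assumption. The function names are hypothetical, and the sketch assumes a table with at least two rows and two columns and no empty rows or columns.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and grand total for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    chi2, n = pearson_chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


def percent_relative_difference(orig_value, pert_value):
    """100 * (perturbed - original) / original; positive means an increase."""
    return 100.0 * (pert_value - orig_value) / orig_value
```

With this sign convention, a positive value of `percent_relative_difference(cramers_v(orig), cramers_v(pert))` indicates that perturbation increased the measured association, and a negative value indicates a decrease.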

  9. Information Loss Measures
     • Impact on variance:
       - Little impact on the variance of cell counts
       - "Between" variance of target variables for proportions in wards, defined from the proportion in each ward k and the overall proportion
     • Information loss measure: compares the between variance of the perturbed table with that of the original table
     • Results are mixed for this information loss measure
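The between variance can be sketched as the sample variance of the ward proportions around the overall proportion. The formulas are missing from the transcript, so both the divisor (number of wards minus one) and the use of a percent change as the information loss measure are assumptions made for illustration. The function names are hypothetical.

```python
def between_variance(counts, totals):
    """Between-ward variance of a proportion.

    counts[k]: number of persons in ward k with the target characteristic
    totals[k]: total number of persons in ward k
    """
    proportions = [c / t for c, t in zip(counts, totals)]
    overall = sum(counts) / sum(totals)
    n_wards = len(counts)
    # Divisor n_wards - 1 is an assumption; the slide's formula is not available.
    return sum((p - overall) ** 2 for p in proportions) / (n_wards - 1)


def percent_change(orig_value, pert_value):
    """Percent change of the perturbed value relative to the original value."""
    return 100.0 * (pert_value - orig_value) / orig_value
```

One possible use is `percent_change(between_variance(orig_counts, totals), between_variance(pert_counts, totals))`, where a negative result would indicate that perturbation pulled the ward proportions closer together.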

  10. Information Loss Measures
      • Impact on rank correlations:
        - Sort the original cell counts and define deciles; repeat on the perturbed cell counts
        - Information loss measure: the percentage of cells in a different decile after perturbation, computed with the indicator function I and the number of wards
      • Log-linear analysis:
        - Information loss measure based on the ratio of the deviance (likelihood ratio test statistic) between the perturbed table and the original table for a given model
        - Different models also need to be compared, since the model for the original table may differ from the model for the perturbed table
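The decile-based measure can be sketched as below. The transcript does not say how ties are handled, so this sketch assigns deciles by rank and breaks ties by position in the list, which is an assumption that matters for sparse columns with many equal counts. The function names are hypothetical.

```python
def decile_groups(values):
    """Assign each value a decile index (0 to 9) by its rank in the sorted list.

    Ties are broken by original position because sorted() is stable (assumption).
    """
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    groups = [0] * n
    for rank, k in enumerate(order):
        groups[k] = rank * 10 // n
    return groups


def percent_changed_decile(orig, pert):
    """Percentage of cells whose decile differs between original and perturbed counts."""
    g_orig = decile_groups(orig)
    g_pert = decile_groups(pert)
    changed = sum(1 for a, b in zip(g_orig, g_pert) if a != b)
    return 100.0 * changed / len(orig)
```

Here `orig` and `pert` hold the same cell (for example one column of the table) across all wards, in the same ward order, so the denominator is the number of wards.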

  11. Data Used
      • Estimation Area, Southwest England: 437,744 persons, 182,337 households, 70 wards (on average 6,250 persons per ward)
      • The tables were the following:
        - Tenure (3) * Age (7) * Health (4) * Ward
        - Ethnicity (17) * Ward
        - Economic Activity (9) * Sex (2) * Long-Term Illness (2) * Ward

  12. Data Used

  13. [Figure: Distance metrics by perturbation method (RaRo, CR3, SA, SCA, SWA). Left: Hellinger's Distance; centre: Relative Absolute Difference; right: Absolute Distance per cell.]

  14. [Figure: Box plots of the difference between perturbed and original subtotals of three consecutive wards (ADs) PAs for the number of unemployed females with long-term illness (internal cells). Horizontal axis: perturbation method.]

  15. [Figure: Change in Cramer's V measure of association after perturbation. Axis: percent relative difference, with regions labelled "Increase in association" and "Decrease in association". Data labels visible: 48.27 and 2.36.]

  16. [Figure: Percentage of cells in a different decile after perturbation, for male (column 1) and female (column 2) students with long-term illness, by perturbation method (CR3, RaRo, SA, SCA, SWA). Axis: percentage of cells.] N.B. The selected columns are very sparse, with approximately 70% of cells having counts < 4.

  17. Log-Linear Models: Effect of Perturbation on Model Selection Original Model: Choose a better model?

  18. Conclusions
      • Inconsistent results for some of the information loss measures (Cramer's V, "between" variance), showing that stochastic processes for SDC will have varying effects on the quality of the data
      • Emergence of some guidelines:
        - skewed tables (one or two large columns and the rest small columns): prefer rounding to cell suppression
        - uniform tables: less information loss due to SDC methods, so choose the method with the least changes to the table
        - sparse tables: need to have benchmarked totals, so use controlled rounding (if possible) or semi-controlled random rounding
      • Improve utility by: designing tables to avoid disclosive cells; controlling for totals when random or small cell rounding; giving clear guidance to users on how best to impute suppressed cells

  19. Future Research
      • Determine optimal methods of SDC depending on the use of the data and the characteristics of the table (skewed, sparse, uniform)
      • Generalize and expand information loss measures for all types of statistical data (tabular and microdata) and statistical analysis
      • Develop software to give to suppliers of data for assessing information loss under different SDC methods and choosing the optimal method, which gives high-utility tables
