Post-tabular Stochastic Noise

Post-tabular Stochastic Noise to Protect Skewed Business Data. Sarah GIESSING, Federal Statistical Office of Germany, Division Mathematical Statistical Methods.

Presentation Transcript


  1. Post-tabular Stochastic Noise to Protect Skewed Business Data
     Sarah GIESSING, Federal Statistical Office of Germany, Division Mathematical Statistical Methods

  2. Stochastic Noise
     • Input perturbation (pre-tabular) (Example: Evans, T., Zayatz, L. and Slanta, J. (1998)) or
     • Output perturbation (post-tabular)?
     • Challenges for post-tabular noise:
       • Between-tables consistency
         • Use micro-data seed when generating the noise (ABS (Fraser and Wooton, 2006))
       • Table additivity
         • Restoring additivity leads to between-tables inconsistency
         • Idea: enough to achieve near-additivity through Flexible Rounding

  3. Masking skewed business data using multiplicative noise
     • Pre-tabular approach (Höhne, 2008)
       • Multiply variable y_i (in record i) by (1 ± (μ + z_i)), z_i ~ N(0, σ²)
     • Post-tabular approach
       • T_post = T_orig - y_1 + y_1 (1 ± (μ + |z_c|))
       • Set μ to 0 for non-sensitive cells
       • μ = 2p/100 => T_post non-sensitive according to p%-rule
     • For between-tables consistency:
       • Attach "seed" variable to microdata
       • When making tables: add up seed (➙ U_c, e.g. Q_c := mod_100(U_c)) => consistent seed on the cell level (= pseudo-random numbers)
     • Both approaches: σ determines the strength of the perturbation
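The post-tabular recipe on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the default values of `p` and `sigma`, the reading of y_1 as the largest contribution to the cell, and the use of Q_c to seed Python's `random.Random` are all assumptions made for the example.

```python
import random


def cell_seed(record_seeds):
    """Add up the record-level seeds of a cell (U_c) and reduce them to Q_c = U_c mod 100."""
    return sum(record_seeds) % 100


def post_tabular_noise(values, record_seeds, sensitive, p=10.0, sigma=0.02):
    """Return T_post = T_orig - y1 + y1 * (1 +/- (mu + |z_c|)) for one cell.

    Assumptions for this sketch: y1 is the largest contribution, the sign is
    drawn with probability 1/2, and p and sigma are illustrative defaults.
    """
    t_orig = sum(values)
    y1 = max(values)
    mu = 2.0 * p / 100.0 if sensitive else 0.0
    # Seeding with Q_c means the same set of records always gets the same noise,
    # whichever table the cell appears in.
    rng = random.Random(cell_seed(record_seeds))
    z_c = rng.gauss(0.0, sigma)
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return t_orig - y1 + y1 * (1.0 + sign * (mu + abs(z_c)))


values = [500.0, 120.0, 30.0]
seeds = [17, 88, 42]
print(post_tabular_noise(values, seeds, sensitive=False))
print(post_tabular_noise(values, seeds, sensitive=True))
```

Because the noise depends only on the summed seed, a cell built from the same records receives the same perturbed value in every table, which is the between-tables consistency the slide asks for.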

  4. Noise variances
     • Pre-tabular approach
       • V(t_pre) = Σ_{i=1,...,n} ( […] )
     • Post-tabular approach
       • V(t_post) = […] = […], for non-sensitive cells (I_s = 0) because of a² + b = 1
     • V(t_post) < V(t_pre) for non-sensitive cells
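The inequality on this slide can be checked with a short calculation. The sketch below is one possible derivation, not the slide's own formulas: it assumes the multiplicative noise model of slide 3 with an independent, symmetric random sign per record, writes μ and σ² for the offset and variance of the noise, and reads a² + b = 1 as the folded-normal moments E|z| = aσ and Var|z| = bσ².

```latex
\begin{align*}
V(t_{\mathrm{pre}})
  &= \sum_{i=1}^{n} y_i^2 \,\mathrm{E}\!\left[(\mu + z_i)^2\right]
   = (\mu^2 + \sigma^2) \sum_{i=1}^{n} y_i^2, \\
\mathrm{E}\!\left[|z_c|^2\right]
  &= \mathrm{Var}\,|z_c| + \left(\mathrm{E}\,|z_c|\right)^2
   = (b + a^2)\,\sigma^2 = \sigma^2,
   \qquad a = \sqrt{2/\pi},\; b = 1 - 2/\pi, \\
V(t_{\mathrm{post}})
  &= y_1^2 \,\mathrm{E}\!\left[|z_c|^2\right]
   = \sigma^2 y_1^2
   \qquad \text{(non-sensitive cell, offset set to } 0\text{)}, \\
V(t_{\mathrm{post}})
  &= \sigma^2 y_1^2
   \;\le\; \sigma^2 \sum_{i=1}^{n} y_i^2
   \;\le\; (\mu^2 + \sigma^2) \sum_{i=1}^{n} y_i^2
   = V(t_{\mathrm{pre}}).
\end{align*}
```

Under these assumptions the inequality is strict as soon as the pre-tabular offset μ is positive or the cell has more than one non-zero contribution.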

  5. Post-tabular noise: what about additivity?
     • Noisy tables are not additive
     • Restoring additivity (iterative methods, CTA) causes between-tables inconsistency
     • We want additivity, but how much?
       • Only a few users need exact additivity
       • For everyone else "approximate" additivity (subject to rounding errors) is enough
       • Rounding also provides a local information loss measure
     • What should be the rounding basis?
     • Idea:
       • Use width of confidence interval for T_orig to compute rounding basis B = 10^b
       • Require: Round_B(T_orig) ≈ Round_B(T_post)
       • Publish Round_B(T_post)

  6. Confidence Interval and Rounding Basis
     (Figure labels: B = 10 000, B = 100, B = 1 000, B = 10, B = 100)
     • Confidence interval for T_orig:
       • T_post ± […]
       • to model user's ambiguity about true parameters
       • […] = 3 for 99% interval
     • Rounding basis:
       • Require: Round_B(T_orig) = Round_B(T_post) (±1)
       • Example: T_orig = 156 764, T_post = 156 755, confidence interval [155 463; 158 047]
       • Choose […], publish 155 8XX
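The rounding step can be illustrated with a small Python sketch. The slide does not state how the basis is derived from the interval width, so `rounding_basis` below uses a purely hypothetical rule (a power of ten one order of magnitude below the interval half-width); the function names and the half-up rounding convention are likewise assumptions made for the example.

```python
def round_to_base(value, base):
    """Round a non-negative integer to the nearest multiple of base (halves round up)."""
    return (value + base // 2) // base * base


def rounding_basis(ci_low, ci_high):
    """Hypothetical rule: B = 10**b, one order of magnitude below the interval half-width."""
    half_width = (ci_high - ci_low) // 2
    b = max(len(str(half_width)) - 2, 0)
    return 10 ** b


def agrees(t_orig, t_post, base):
    """Check Round_B(T_orig) = Round_B(T_post) up to one rounding step (the ±1 tolerance)."""
    return abs(round_to_base(t_orig, base) - round_to_base(t_post, base)) <= base


t_orig, t_post = 156764, 156755
base = rounding_basis(155463, 158047)
print(base, round_to_base(t_post, base), agrees(t_orig, t_post, base))
```

Working in integers avoids floating-point ties when a value sits exactly halfway between two multiples of the basis.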

  7. Results: An Example
     • Turnover (in hundreds) by NACE x District x Size Class
     • Legend: sensitive

  8. Distribution of non-sensitive cells by relative deviation of the noise

  9. Disclosure Risks
     • Risk Type I: masked value too close to original value
       • Not very critical: users can't tell which values are actually close
     • Risk Type II: post-tabular masked data are not additive, i.e. considering the table relations, they are not "feasible"
       • Possible to compute a feasibility interval for sensitive cells considering the constraints given by
         • table relations and
         • rounding intervals
       • Feasibility interval too narrow = obvious case of disclosure
       • No such cases found in empirical tests
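A feasibility interval of the kind described for Risk Type II can be sketched with plain interval arithmetic over a single additive relation (total = sum of inner cells). This is only an illustration under stated assumptions: the function names are hypothetical, each published value is taken to allow the original to lie within 1.5 rounding steps (the ±1 tolerance of slide 6), and a real audit would consider all table relations jointly, for example with linear programming.

```python
def rounding_interval(published, base):
    """Range of original values compatible with a published rounded value.

    Assumption: the original may fall in an adjacent rounding step, so the
    interval extends 1.5 * base on either side of the published value.
    """
    half = 3 * base // 2
    return published - half, published + half


def feasibility_interval(total, parts, target):
    """Bounds on parts[target] implied by total = sum(parts) and all rounding intervals.

    total is a (lo, hi) pair, parts a list of (lo, hi) pairs, target an index.
    """
    others_lo = sum(lo for i, (lo, hi) in enumerate(parts) if i != target)
    others_hi = sum(hi for i, (lo, hi) in enumerate(parts) if i != target)
    lo = max(parts[target][0], total[0] - others_hi)
    hi = min(parts[target][1], total[1] - others_lo)
    return lo, hi


total = rounding_interval(1000, 10)
parts = [rounding_interval(600, 100),   # sensitive cell, coarse basis
         rounding_interval(300, 10),
         rounding_interval(100, 10)]
print(feasibility_interval(total, parts, target=0))
```

In this made-up example the additive relation narrows the sensitive cell's interval well below its own rounding interval; an attacker-side check like this flags a case as critical when the resulting interval becomes too narrow.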

  10. Conclusions
     • Flexible rounding of post-tabular noisy data is a promising new method for flexible table servers
     • Paradigms hold:
       • Exact between-tables consistency
       • Near-additivity (only "rounding" deviations)
     • Quality:
       • In tables with "usual" detail:
         • More than 95% of non-sensitive cells with less than 2% relative deviation
       • In tables with too much detail for usual cell suppression methods:
         • More than 95% of non-sensitive cells with less than 5% relative deviation
     • Transparency: influence of SDC on the data obvious to users
     • Risk: no "obvious" disclosure risk found in testing so far
     • Easy to implement, computational effort negligible

  11. Thanks for your attention
