Post-tabular Stochastic Noise to Protect Skewed Business Data. Sarah GIESSING, Federal Statistical Office of Germany, Division Mathematical Statistical Methods
Stochastic Noise
• Input perturbation (pre-tabular) (example: Evans, T., Zayatz, L. and Slanta, J., 1998) or output perturbation (post-tabular)?
• Challenges for post-tabular noise:
  • Between-tables consistency
    • Use a micro-data seed when generating the noise (ABS; Fraser and Wooton, 2006)
  • Table additivity
    • Restoring additivity leads to between-tables inconsistency
    • Idea: it is enough to achieve near-additivity through flexible rounding
Masking skewed business data using multiplicative noise
• Pre-tabular approach (Höhne, 2008)
  • Multiply variable yi (in record i) by (1 ± (μ + zi)), with zi ~ N(0, σ²)
• Post-tabular approach (see the code sketch below)
  • Tpost = Torig - y1 + y1 (1 ± (μ + |zc|))
  • Set μ to 0 for non-sensitive cells
  • μ = 2p/100 => Tpost non-sensitive according to the p%-rule
• For between-tables consistency:
  • Attach a "seed" variable to the microdata
  • When making tables: add up the seed (→ Uc, e.g. Qc := mod100(Uc)) => consistent seed on the cell level (= pseudo-random numbers)
• Both approaches: σ determines the strength of the perturbation
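As a concrete illustration of the post-tabular formula and the seed aggregation, here is a minimal Python sketch. It is an illustrative reading of the slide, not the paper's implementation: the function names, the half-normal draw, the random perturbation direction and the parameter defaults (p in percent, sigma) are assumptions not given on the slide.

```python
import numpy as np

def cell_seed(record_seeds):
    """Consistent cell-level seed: add up the record-level seed variable
    and reduce it modulo 100 (Qc := mod100(Uc)); the same contributing
    records therefore always yield the same seed."""
    return int(sum(record_seeds)) % 100

def post_tabular_noise(t_orig, y1, record_seeds, p=10, sigma=0.01, sensitive=True):
    """Hypothetical sketch of Tpost = Torig - y1 + y1 * (1 +/- (mu + |zc|)),
    with mu = 2p/100 for sensitive cells and mu = 0 otherwise.
    p is the parameter of the p%-rule in percent; sigma is illustrative."""
    rng = np.random.default_rng(cell_seed(record_seeds))
    mu = 2 * p / 100 if sensitive else 0.0   # mu = 2p/100 => p%-rule satisfied
    z_c = abs(rng.normal(0.0, sigma))        # half-normal deviate |zc|
    sign = rng.choice([-1, 1])               # assumed random perturbation direction
    return t_orig - y1 + y1 * (1 + sign * (mu + z_c))

# Usage: a cell with total 100 000, largest contribution 60 000,
# built from three records carrying (made-up) seeds 17, 42 and 5.
print(post_tabular_noise(100_000.0, 60_000.0, [17, 42, 5], sensitive=True))
```

Because the random number generator is seeded from the contributing records, any table that contains the same cell reproduces the same perturbed value.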
Noise variances
• Pre-tabular approach
  • V(tpre) = Σ i=1,...,n yi² (μ² + σ²)
• Post-tabular approach
  • V(tpost) = y1² σ² for non-sensitive cells (where μ = 0)
• => V(tpost) < V(tpre) for non-sensitive cells
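A small simulation can illustrate this comparison. The sketch below is not from the paper: the contributions y and sigma are made-up numbers, and μ is set to 0 throughout for simplicity, so the expected variances reduce to Σ yi²σ² (pre-tabular, every contribution perturbed) versus y1²σ² (post-tabular, only the largest contribution perturbed).

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([500.0, 120.0, 80.0, 40.0])   # hypothetical contributions, y1 largest
sigma, n_sim = 0.01, 200_000

# Pre-tabular: t_pre = sum_i y_i * (1 +/- z_i), z_i ~ N(0, sigma^2)
signs = rng.choice([-1, 1], size=(n_sim, y.size))
z = rng.normal(0.0, sigma, size=(n_sim, y.size))
t_pre = (y * (1 + signs * z)).sum(axis=1)

# Post-tabular: t_post = T_orig - y_1 + y_1 * (1 +/- |z_c|)
z_c = np.abs(rng.normal(0.0, sigma, size=n_sim))
t_post = y.sum() - y[0] + y[0] * (1 + rng.choice([-1, 1], size=n_sim) * z_c)

print(np.var(t_pre), (y**2).sum() * sigma**2)   # ~ sum_i y_i^2 sigma^2
print(np.var(t_post), y[0]**2 * sigma**2)       # ~ y_1^2 sigma^2 (smaller)
```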
Post-tabular noise – what about additivity?
• Noisy tables are not additive
• Restoring additivity (iterative methods, CTA) causes between-tables inconsistency
• We want additivity, but how much?
  • Only a few users need exact additivity
  • For everyone else, "approximate" additivity (subject to rounding errors) is enough
  • Rounding also provides a local information-loss measure
• What should the rounding basis be?
• Idea:
  • Use the width of a confidence interval for Torig to compute the rounding basis B = 10^b
  • Require: RoundB(Torig) ~ RoundB(Tpost)
  • Publish RoundB(Tpost)
Confidence Interval and Rounding Basis
[Diagram: confidence interval compared against candidate rounding bases B = 10, 100, 1 000, 10 000]
• Confidence interval for Torig:
  • Centred at Tpost
  • Its width models the user's ambiguity about the true parameters
  • Factor 3 (on the noise standard deviation) for a 99% interval
• Rounding basis:
  • Require: RoundB(Torig) = RoundB(Tpost) (±1)
  • Example: Torig = 156 764, Tpost = 156 755, confidence interval [155 463; 158 047]
  • Choose B = 100, publish 156 8XX (see the sketch below)
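The rounding-and-masking step of the worked example can be sketched as follows. The slide does not spell out the rule for deriving b from the confidence interval, so B = 100 is simply taken from the example; the function names are hypothetical.

```python
def round_to_base(t, base):
    """Round t to the nearest multiple of the rounding base B = 10^b."""
    return int(round(t / base)) * base

def publish(t_post, base):
    """Publish the rounded noisy value with the rounded-away digits masked,
    e.g. 156 755 with B = 100 -> '1568XX'."""
    digits = len(str(base)) - 1
    return str(round_to_base(t_post, base) // base) + "X" * digits

# Worked example from the slide: Torig = 156 764, Tpost = 156 755, B = 100.
t_orig, t_post, B = 156_764, 156_755, 100
# RoundB(Torig) and RoundB(Tpost) must agree up to +/- one rounding base
assert abs(round_to_base(t_orig, B) - round_to_base(t_post, B)) <= B
print(publish(t_post, B))   # -> '1568XX'
```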
Results: An Example
• Turnover (in hundreds) by NACE x District x Size Class
[Table figure; sensitive cells marked]
Distribution of non-sensitive cells by relative deviation of the noise
Disclosure Risks
• Risk type I: masked value too close to the original value
  • Not very critical: users cannot tell which values are actually close
• Risk type II: post-tabular masked data are not additive, i.e. considering the table relations, they are not "feasible"
  • Possible to compute a feasibility interval for a sensitive cell from the constraints given by
    • the table relations and
    • the rounding intervals
    (see the sketch below)
  • Feasibility interval too narrow (pinning down the sensitive value) = obvious case of disclosure
  • No such cases found in empirical tests
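The feasibility-interval audit can be posed as a pair of linear programs. The sketch below is a hypothetical toy example, not the audit used in the empirical tests: one row relation x1 + x2 + x3 = total, with x1 suppressed and the other cells published rounded to base B, solved with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

B = 100
published = {"x2": 3_400, "x3": 5_700, "total": 12_300}   # made-up rounded values

# Variables: [x1, x2, x3, total]; table relation x1 + x2 + x3 - total = 0
A_eq = np.array([[1.0, 1.0, 1.0, -1.0]])
b_eq = np.array([0.0])
# x1 only non-negative; published cells constrained to their rounding intervals
bounds = [(0, None)] + [(published[k] - B / 2, published[k] + B / 2)
                        for k in ("x2", "x3", "total")]

lo = linprog(c=[1, 0, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds)   # minimise x1
hi = linprog(c=[-1, 0, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds)  # maximise x1
print("feasibility interval for x1:", lo.fun, -hi.fun)
```

If the resulting interval is much narrower than the protection required for x1, the cell would count as an obvious disclosure; the slide reports that no such case was found in testing.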
Conclusions
• Flexible rounding of post-tabular noisy data is a promising new method for flexible table servers
• Paradigms hold:
  • Exact between-tables consistency
  • Near-additivity (only "rounding" deviations)
• Quality:
  • In tables with "usual" detail: more than 95% of non-sensitive cells with less than 2% relative deviation
  • In tables with too much detail for usual cell suppression methods: more than 95% of non-sensitive cells with less than 5% relative deviation
• Transparency: the influence of SDC on the data is obvious to users
• Risk: no "obvious" disclosure risk found in testing so far
• Easy to implement, computational effort negligible