UN/ECE Work Session On Statistical Data Confidentiality (Geneva, 9-11 November 2005) WP30: Safety rules in statistical disclosure control for tabular data. Giovanni Merola, Winton Capital Management Ltd, g.merola@wintoncapital.com. Partially written while at ISTAT and partially supported by EU project CASC.
Plan of the Talk • SDC for Magnitude tables; • Existing safety rules; • Generalised p-rule; • Rational estimates; • Prior distribution; • U-estimates; • Comparison on real SBS data; • MU-rules; • Concluding remarks.
1. SDC for Magnitude Tables Tables showing the sums of non-negative contributions in each cell. Example: contributions in non-increasing order; Total 600 (Old Males). The total T is published; n is the number of contributions.
1. SDC for Magnitude Tables cont.d SDC policy: • If the categories are confidential, (likely) identification of respondents is disclosure; • else only the contributions of (likely) identifiable respondents cannot be disclosed (too precisely); • same rule for all cells, else microdata protection.
2. Existing Safety Rules • Rare respondents are identifiable • threshold rule: n > m. • Respondents with large contributions are identifiable • Dominance: (z1+···+zm)/T < k. • The largest contributor is identifiable, hence the second largest must not estimate z1 closely • p-rule: [(T-z2)-z1]/z1 > p.
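The three rules above can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function names and the example cell are hypothetical, and the dominance rule is assumed to be passed when the share of the m largest contributions is below k (consistent with k=1/(1+p) on the generalised p-rule slide).

```python
def threshold_safe(z, m):
    """Threshold rule: the cell is safe if it has more than m contributions."""
    return len(z) > m

def dominance_safe(z, m, k):
    """Dominance rule: safe if the m largest contributions are less than a share k of the total."""
    z = sorted(z, reverse=True)
    return sum(z[:m]) / sum(z) < k

def p_rule_safe(z, p):
    """p-rule: safe if the second largest contributor's upper bound T - z2
    overestimates z1 by more than a fraction p. Needs at least two contributions."""
    z = sorted(z, reverse=True)
    T = sum(z)
    return ((T - z[1]) - z[0]) / z[0] > p

# Hypothetical cell, used only for illustration.
cell = [50, 30, 15, 5]
print(threshold_safe(cell, 3))        # 4 contributions > 3
print(dominance_safe(cell, 2, 0.75))  # share of the two largest is 0.8
print(p_rule_safe(cell, 0.1))         # (100 - 30 - 50) / 50 = 0.4
```

Each function returns True when the cell passes the rule, so a cell would be published only if the chosen rule returns True.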
3. Generalised p-rule Includes the existence of groups of respondents • Group with largest sum identifiable; • group with second largest sum must not estimate largest sum too closely. (Figure: the total T, with the group sums t2 and R2,2.)
3. Generalised p-rule cont.d Same estimate as p-rule: the maximum possible value, ^tm = T-Rm,l • Gen. p-rule: ((T-Rm,l)-tm)/tm > p • With t1=z1 and R1,1=z2 this is the p-rule.
3. Generalised p-rule cont.d • If zero contributions are known (external intruder): Dominance rule with k=1/(1+p) • If no groups: simple p-rule; • If intruding group formed of (m-1) respondents: threshold rule n>m protects against exact estimation (p=0). Merola, G. M., 2003a. Generalized risk measures for tabular data. Proceedings of the 54th Session of the International Statistical Institute.
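A minimal sketch of the generalised p-rule and of two of the special cases listed above. It assumes, as a reading of the notation rather than a definition given on the slides, that tm is the sum of the m largest contributions and Rm,l is the sum of the next l largest (so that t1=z1 and R1,1=z2). The function name and the example cell are hypothetical.

```python
def gen_p_rule_safe(z, m, l, p):
    """Generalised p-rule: safe if ((T - R_ml) - t_m) / t_m > p.

    Assumption: t_m is the sum of the m largest contributions and
    R_ml the sum of the following l contributions.
    """
    z = sorted(z, reverse=True)
    T = sum(z)
    t_m = sum(z[:m])
    R_ml = sum(z[m:m + l])
    return ((T - R_ml) - t_m) / t_m > p

# Hypothetical cell, used only for illustration.
cell = [50, 30, 15, 5]

# m = l = 1 gives the simple p-rule: (100 - 30 - 50) / 50 = 0.4 > 0.1
print(gen_p_rule_safe(cell, 1, 1, 0.1))

# l = 0 (intruder knows only zero contributions) gives the dominance rule
# with k = 1 / (1 + p), since (T - t_m) / t_m > p  <=>  t_m / T < 1 / (1 + p).
for p in (0.1, 0.2, 0.3, 0.5):
    dominance = sum(cell[:2]) / sum(cell) < 1 / (1 + p)
    assert gen_p_rule_safe(cell, 2, 0, p) == dominance
```

The loop checks the equivalence numerically for m=2 on one cell; it is an illustration, not a proof.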
4. Rational Estimates • An intruder can compute a lower and an upper bound for the value of tm: • For example, if z2=40 and T=100: 40 = z2 ≤ z1 ≤ T-z2 = 60; • the bounds are different for different prior knowledge of the intruder.
4. Rational Estimates cont.d • tm can be estimated by minimising the Mean Square Error for some distribution F(tm); • by a well-known property, the MSE is minimised by the mean.
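The property used here is the standard decomposition of the mean square error. It is written out below for reference; the symbol a (a candidate estimate) is notation introduced for this sketch and does not appear on the slides.

```latex
\[
\mathrm{E}\big[(t_m - a)^2\big]
  = \mathrm{Var}(t_m) + \big(\mathrm{E}[t_m] - a\big)^2
\quad\Longrightarrow\quad
\hat{t}_m = \arg\min_a \mathrm{E}\big[(t_m - a)^2\big] = \mathrm{E}[t_m],
\]
```

where the expectation is taken with respect to the assumed distribution F(tm). The variance term does not depend on a, so the minimum is reached when the squared bias term is zero.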
5. Prior Distribution: Uniform • The ignorance about the distribution of tm can be modelled with a Uniform distribution: tm~U(tm-, tm+) • in this case the mean is simply the midpoint of the interval, (tm- + tm+)/2. • Note: same estimate for any symmetric F.
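As a small worked case, the sketch below applies the uniform prior to the earlier example with z2=40 and T=100, where the second largest contributor estimates z1. The function names are hypothetical, and the "true" value of z1 used to compare errors is an assumption made purely for illustration.

```python
def bounds_for_z1(T, z2):
    """Bounds on z1 for an intruder who knows T and its own contribution z2."""
    return z2, T - z2

def uniform_estimate(lower, upper):
    """Mean of a Uniform(lower, upper) prior, i.e. the midpoint of the interval."""
    return (lower + upper) / 2

lower, upper = bounds_for_z1(T=100, z2=40)   # (40, 60)
u_hat = uniform_estimate(lower, upper)       # 50.0
m_hat = upper                                # maximum possible value, 60

# Hypothetical true value of z1, chosen only to compare the two estimates.
z1_true = 45
print(abs(u_hat - z1_true) / z1_true)  # relative error of the midpoint
print(abs(m_hat - z1_true) / z1_true)  # relative error of the maximum value
```

Here the midpoint is closer to the assumed true value than the maximum-value estimate; the comparison would differ for a true z1 near the upper bound.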
5. Prior Distribution: maximising • The Generalised p-rule can be derived by assuming a prior concentrated on the maximum value. • We refer to the Gen. p-rule as the M-rule, and to the one derived using the Uniform as the U-rule.
6. U-estimates Different prior knowledge of the intruder: • knows T but not n (Dominance); • knows T and n; • knows T and L contributions (Gen. p-rule*); • knows T, L contributions and n: either as above or • * for m=L=1 the uniform p-rule is the same as the uniform dominance. Merola, G., 2003b. Safety rules in statistical disclosure control for tabular data. Contributi Istat 1, Istituto Nazionale di Statistica, Roma.
6. U-estimates cont.d Example C=(970,376,274,253,203,169,161,121,86,62,21,10), T=2706
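The slide's table of estimates is not reproduced here. As a hedged illustration on the same data, the sketch below computes only the simplest case: the second largest contributor, knowing T and its own contribution, estimates z1 using either the maximum possible value or the midpoint of the bounds z2 ≤ z1 ≤ T-z2. The variable and function names are hypothetical, and the slide's other cases (known n, known L contributions) are not covered.

```python
# Contributions from the example, in non-increasing order.
C = [970, 376, 274, 253, 203, 169, 161, 121, 86, 62, 21, 10]
T = sum(C)            # 2706, as stated in the example
z1, z2 = C[0], C[1]

# Bounds available to the second largest contributor: z2 <= z1 <= T - z2.
lower, upper = z2, T - z2

m_estimate = upper                  # maximum-value estimate: 2330
u_estimate = (lower + upper) / 2    # uniform-prior (midpoint) estimate: 1353.0

def relative_error(estimate, true_value):
    """Absolute estimation error as a fraction of the true value."""
    return abs(estimate - true_value) / true_value

print(relative_error(m_estimate, z1))  # about 1.40
print(relative_error(u_estimate, z1))  # about 0.39
```

For this cell the midpoint estimate has a markedly smaller relative error than the maximum-value estimate, which is the kind of difference the comparison on SBS data examines.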
7. Comparison on real SBS data We applied different rules to Italian SBS data, turnover by Region and SIC for the years '94 and '97. We considered the SIC with 2 and 3 digits.
7. Comparison on real SBS data cont.d Mean relative error for z1
7. Comparison on real SBS data cont.d Mean relative error for t2
8. U-rules • The values for are intervals: • Knowing only T (Dominance) • Knowing T and L contributions (gen p-rule)
9. MU-rules • Assuming both estimation approaches, we obtain subadditive rules, analogous to the p-rule but with stricter bounds.
9. MU-rules cont.d • Safety rule when only T known (Dominance) • Safety rule when T and L contributions known (gen p-rule)
10. Conclusions • The assumptions for the existing rules are unrealistic; • using a simple noninformative distribution gives a much smaller relative error of estimation; • the corresponding rules are not subadditive; • joining the assumptions leads to stricter rules; • identifiability of all largest respondents requires these rules; • different priors can be used.