SDC for continuous variables under edit restrictions Natalie Shlomo & Ton de Waal UN/ECE Work Session on Statistical Data Editing, Bonn, September 2006
Contents • The problem • Evaluation data • SDC techniques • Additive noise • Microaggregation • Rounding • Rank swapping • Conclusions
The problem • Statistical disclosure control (SDC): microdata need to be protected against disclosure before release • Several SDC techniques are available for continuous microdata • These techniques do not take edit constraints into account • Inconsistent microdata lead to loss of utility and point potential intruders to the protected data • Problem: extend SDC techniques for continuous microdata to take edit constraints into account • Micro edits – record-level inconsistencies • Macro edits – overall loss of utility (bias and variance)
Evaluation data • 2000 Israel Income Survey with three continuous variables (gross, net and tax) and one control variable (age) • 32,896 individuals, of which 16,232 earned income from salaries • Edits: • E1a: gross ≥ 0 • E1b: net ≥ 0 • E1c: tax ≥ 0 • E2: IF age ≤ 17 THEN gross ≤ 6,910 • E3: net + tax = gross
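The edits listed above can be written as a small record-level check. This is a minimal sketch: the function name `failed_edits` and the tolerance `tol` used for the balance edit E3 are assumptions for illustration, not part of the original evaluation.

```python
def failed_edits(age, gross, net, tax, tol=1e-9):
    """Return the names of the edits that a single record violates."""
    failures = []
    if gross < 0:
        failures.append("E1a")
    if net < 0:
        failures.append("E1b")
    if tax < 0:
        failures.append("E1c")
    if age <= 17 and gross > 6910:
        failures.append("E2")
    # E3 is an equality; tol (an assumption) absorbs floating-point error.
    if abs(net + tax - gross) > tol:
        failures.append("E3")
    return failures
```

Counting the records for which this list is non-empty gives a failure count per edit for any perturbed file.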
Additive noise • Generate random value and add to value to be protected • Random value can be drawn in several ways, depending on • Aiming to preserve variances or not • A single variable or multiple variables
Additive noise for a single variable using standard approach • Adding standard noise: perturb Y as follows • Y* = Y + e, e drawn from N(0, σ²) • Adding random noise to gross with σ² = 0.2 × Var(gross) resulted in 1,685 failures of E1 and 119 failures of E2 • Adding standard noise in groups • Define 5 equal groupings (quintiles) by sorting • Within each group, applying the above method resulted in 66 failures of E1 and no failures of E2
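Both variants on this slide can be sketched in a few lines. The function names, the `ratio` parameter (0.2 as on the slide) and the way a group remainder is handled are assumptions; the sketch uses the population variance, which may differ from the estimator used in the original study.

```python
import random
import statistics

def add_noise(values, ratio=0.2, rng=random):
    """Y* = Y + e with e drawn from N(0, ratio * Var(Y))."""
    sd = (ratio * statistics.pvariance(values)) ** 0.5
    return [y + rng.gauss(0.0, sd) for y in values]

def add_noise_in_groups(values, n_groups=5, ratio=0.2, rng=random):
    """Apply add_noise separately within equal-sized groups of the sorted values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    size = -(-len(values) // n_groups)  # ceiling division; last group may be smaller
    out = list(values)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        noisy = add_noise([values[i] for i in idx], ratio, rng)
        for i, v in zip(idx, noisy):
            out[i] = v
    return out
```

Because the noise variance inside a quintile is tied to the much smaller within-group variance, small values receive small noise, which is consistent with the drop in non-negativity failures reported above.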
Additive noise for a single variable using correlated noise Perturb value Y as follows (Natalie’s trick): • Y* = d₁Y + d₂e, • d₁ = (1 − δ²)^(1/2), d₂ = δ for positive parameter δ • e drawn from N((1 − d₁)/d₂ × mean(Y), Var(Y)) • Note that • E(Y*) = E(d₁Y) + E(d₂e) = E(Y) • Var(Y*) = (1 − δ²)Var(Y) + δ²Var(Y) = Var(Y) • Linear equations are preserved
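A minimal sketch of the univariate correlated-noise formula follows. The function name and the default δ = 0.3 are assumptions, and the sketch restricts δ to the open interval (0, 1) so that d₁ is real and non-zero; mean(Y) and Var(Y) are estimated from the data themselves.

```python
import random
import statistics

def correlated_noise(values, delta=0.3, rng=random):
    """Y* = d1*Y + d2*e with d1 = sqrt(1 - delta^2), d2 = delta and
    e drawn from N((1 - d1)/d2 * mean(Y), Var(Y))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    d1, d2 = (1.0 - delta ** 2) ** 0.5, delta
    mu_e = (1.0 - d1) / d2 * statistics.fmean(values)
    sd_e = statistics.pstdev(values)
    return [d1 * y + d2 * rng.gauss(mu_e, sd_e) for y in values]
```

The mean and variance are preserved in expectation only, so on a finite file the perturbed mean and variance agree with the originals up to sampling error.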
Additive noise for multiple variables and linear programming • Perturb each variable Yᵢ separately, resulting in Yᵢ* • Adjust perturbed values Yᵢ* slightly so that all edits become satisfied (LP-trick) • Minimize Σᵢ |Yᵢ* − Yᵢ,final| subject to edit constraints • Yᵢ,final are final perturbed values • Problem is simple linear programming problem
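For a general edit set the adjustment step needs an LP solver. The sketch below is not that general LP; it is a hand-solved special case, assuming only the balance edit net + tax = gross and the non-negativity edits apply. The function names are hypothetical. When the perturbed net and tax are already non-negative, moving gross alone costs |gross* − net* − tax*|, and no feasible adjustment can cost less in the sum-of-absolute-changes objective, because the imbalance has to be removed by changes whose absolute values add up to at least that amount.

```python
def reconcile_balance(gross, net, tax):
    """Return (gross, net, tax) satisfying net + tax = gross and non-negativity.

    Negative components are first raised to zero; the remaining imbalance
    is then absorbed entirely by gross.
    """
    net_f = max(net, 0.0)
    tax_f = max(tax, 0.0)
    gross_f = net_f + tax_f
    return gross_f, net_f, tax_f

def l1_cost(before, after):
    """Sum of absolute changes, the objective minimised on the slide."""
    return sum(abs(a - b) for a, b in zip(before, after))
```

Other optimal solutions exist (the imbalance could equally be absorbed by net or tax); putting it on gross is simply one convenient choice.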
Additive noise for multiple variables using correlated noise Perturb vector Y by applying Natalie’s trick • Y* = d₁Y + d₂e, • d₁ = (1 − δ²)^(1/2), d₂ = δ for positive parameter δ • e drawn from N((1 − d₁)/d₂ × mean(Y), Var(Y)) • mean(Y) mean vector of Y; Var(Y) covariance matrix of Y • Means, covariances and equations are again preserved • E(Y*) = E(Y) • E(Var(Y*)) = E(Var(Y)) • Linear equations are preserved
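Why the balance edit survives can be seen in a small sketch for Y = (gross, net, tax). Because gross = net + tax holds exactly in the original data, the covariance matrix of Y is singular, and any draw from N(·, Var(Y)) with a mean that satisfies the equation also satisfies it. The sketch therefore draws the noise for (net, tax) from their 2×2 covariance, using a hand-written Cholesky factor, and sets the gross noise to their sum. The function name and default δ are assumptions, and net is assumed to have non-zero variance.

```python
import random
import statistics

def correlated_noise_balanced(net, tax, delta=0.3, rng=random):
    """Jointly perturb (gross, net, tax) with gross = net + tax kept exact."""
    d1, d2 = (1.0 - delta ** 2) ** 0.5, delta
    shift = (1.0 - d1) / d2
    m_n, m_t = statistics.fmean(net), statistics.fmean(tax)
    v_n, v_t = statistics.pvariance(net), statistics.pvariance(tax)
    c_nt = sum((n - m_n) * (t - m_t) for n, t in zip(net, tax)) / len(net)
    # Cholesky factor [[a, 0], [b, c]] of the covariance matrix of (net, tax).
    a = v_n ** 0.5
    b = c_nt / a
    c = max(v_t - b * b, 0.0) ** 0.5
    rows = []
    for n, t in zip(net, tax):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e_n = shift * m_n + a * z1
        e_t = shift * m_t + b * z1 + c * z2
        e_g = e_n + e_t  # the noise vector itself obeys the balance equation
        rows.append((d1 * (n + t) + d2 * e_g, d1 * n + d2 * e_n, d1 * t + d2 * e_t))
    return rows  # list of (gross*, net*, tax*)
```

Inequality edits such as non-negativity are not guaranteed by this construction; only linear equalities carry over.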
Microaggregation • Replace value to be protected by average value in small group • Reduction in variance due to elimination of “within” variance • Microaggregation can be applied in several ways: • Standard version of microaggregation • Microaggregation followed by adding noise (to preserve original variance) and using linear programming to ensure preservation of linear equations (LP-trick) • Microaggregation followed by adding correlated noise to ensure preservation of linear equations (Natalie’s trick) • Avoids need for LP-trick but does not raise variance to expected level
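The standard version can be sketched as follows. The group size `k`, the function name and the choice to merge a short final group into the preceding one are assumptions; the slides do not state the group size used.

```python
import statistics

def microaggregate(values, k=3):
    """Replace each value by the mean of its group of k neighbours in sorted order."""
    order = sorted(range(len(values)), key=values.__getitem__)
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Assumption: a final group smaller than k is merged into the previous one.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    out = list(values)
    for idx in groups:
        mean = statistics.fmean(values[i] for i in idx)
        for i in idx:
            out[i] = mean
    return out
```

The overall mean is unchanged, while the variance falls by exactly the within-group component, which is the reduction the slide refers to.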
Rounding • Round value to be protected to multiple of rounding base • Rounding can be applied in several ways: • Random rounding • Controlling totals and additivity • Controlling totals and additivity, and selecting all rounded values within base of original value
Random rounding • Univariate rounding with rounding base b • res(X) = X − (largest multiple of b less than X) • Round X up with probability res(X)/b and down with probability 1 − res(X)/b • Expected rounding error is zero • In expectation totals are preserved
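A minimal sketch of unbiased random rounding is given below. The function name is an assumption. It takes the lower neighbour as the largest multiple of b not exceeding X, which gives the same results as the definition on the slide, since an exact multiple of b is returned unchanged either way.

```python
import math
import random

def random_round(x, base, rng=random):
    """Round x to a neighbouring multiple of base, up with probability res/base."""
    lower = math.floor(x / base) * base
    res = x - lower
    if res == 0:
        return lower  # already a multiple of the base
    return lower + base if rng.random() < res / base else lower
```

For example, with base 10 the value 13 becomes 20 with probability 0.3 and 10 with probability 0.7, so its expected rounded value is 13.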
Random rounding: controlling totals • Select fraction of res(X)/b random entries to be rounded upward and round the rest downward • total is exactly preserved • gross is calculated as sum of rounded tax and net • gross may jump a base • apply reshuffling algorithm to correct this
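One way to read the total-preserving variant is sketched below: round every entry down, then round up as many randomly chosen entries as are needed to restore the total. The function name and the uniform selection among entries with a non-zero residual are assumptions, and the reshuffling algorithm mentioned above is not shown. The total is reproduced exactly only when the original total is itself a multiple of the base; otherwise it is matched to the nearest multiple.

```python
import math
import random

def controlled_round(values, base, rng=random):
    """Round each value to an adjacent multiple of base while keeping the total."""
    lowers = [math.floor(v / base) * base for v in values]
    residuals = [v - lo for v, lo in zip(values, lowers)]
    # Number of entries that must go up for the rounded total to match.
    n_up = round(sum(residuals) / base)
    candidates = [i for i, r in enumerate(residuals) if r > 0]
    up = set(rng.sample(candidates, min(n_up, len(candidates))))
    return [lo + base if i in up else lo for i, lo in enumerate(lowers)]
```

Applying this to net and tax separately and then setting gross to their sum keeps E3 satisfied, at the price that gross may land more than one base away from its original value.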
Rank swapping • Sort variable to be protected and construct groupings, select random pairs in each group and swap values between pairs • Different group sizes lead to different results • Evaluation criteria: • AD = Σᵢ |Xᵢ,orig − Xᵢ,pert| / nr • where i is cell in age group (14) x sex (2) x income group (22) • BV = Σⱼ nⱼ (averageⱼ(X) − average(X))² / (p − 1) • with j = 1, ..., p in age group (14) x sex (2)
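Rank swapping and the BV criterion can be sketched as follows. The function names and the default group size are assumptions; with an odd-sized group, one value is left unswapped. AD is not shown because the slide does not define nr.

```python
import random
import statistics

def rank_swap(values, group_size=10, rng=random):
    """Swap values between random pairs inside groups of consecutive ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = list(values)
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        rng.shuffle(idx)
        # Pair up shuffled positions; a leftover element keeps its value.
        for i, j in zip(idx[0::2], idx[1::2]):
            out[i], out[j] = values[j], values[i]
    return out

def between_variance(values, groups):
    """BV = sum_j n_j * (mean_j - overall mean)^2 / (p - 1) over p group labels."""
    overall = statistics.fmean(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    p = len(by_group)
    return sum(len(vs) * (statistics.fmean(vs) - overall) ** 2
               for vs in by_group.values()) / (p - 1)
```

Swapping leaves the set of values, and hence the overall mean and variance, unchanged; comparing `between_variance` before and after swapping shows how much the relationship with the grouping variables (for example age group by sex) has been disturbed.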
Conclusion • Standard perturbation methods can be extended so they take (micro and macro) edit constraints into account • “Best” method to protect data set is to some extent subjective choice • Must provide protection against disclosure risk according to tolerable risk threshold • Must provide fit for purpose data according to needs of users