Confidentiality protection of large frequency data cubes
UNECE Workshop on Statistical Confidentiality, Ottawa, 28-30 October 2013
Johan Heldal and Svetlana Badina, Statistics Norway

Eurostat Census Hypercubes
• 60 Census 2011 frequency count hypercubes that all 32 EU+EEA countries must submit in 2014.
• Four to nine variables (breakdowns) in each cube.
• Each country is responsible for its own disclosure control method according to national legislation.
• Norway is the only country that wishes to use small count (1 and 2) rounding as the preferred disclosure control method.
• This presentation will show how.
• Hypercube 06 will be used for illustration.

Idea
• We want to create uncertainty about whether zeroes are real zeroes.
• Create more zeroes from small counts (1 and 2) by rounding them to 0 or 3 (unbiasedly).
• The rounding must be carried out so as to minimize the perturbation of given aggregate counts.
• Counts of 1 and 2 are not necessarily considered problematic in themselves, but they will be removed by the rounding.

Reduce the hypercube
STEP 1: Identifying small counts
• Reduce hypercube A by selecting a subset B consisting of
  • all interior cells in A with counts 1 or 2, or
  • all interior cells in A contributing to counts of 1 or 2 in the primary marginal distributions (PMDs) of A.
• Calculate C = A – B.
STEP 2: Rounding
• n_B = total value of B.
• Round [n_B/3] interior counts in B to 3 and the rest to 0, giving B*.
• IF the solution B* is good enough, STOP. ELSE, continue the search for a better B*.
STEP 3: Calculate A* = C + B*, the rounded cube.
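
The three steps can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation. It assumes a cube is stored as a dictionary from cell tuples to counts, covers only the first variant of B (all interior cells with count 1 or 2), and replaces the random search of Step 2 with a trivial placeholder. The names `reduce_cube` and `recombine` and the toy counts are hypothetical.

```python
def reduce_cube(A):
    """Step 1: B holds the interior cells of A with count 1 or 2; C = A - B."""
    B = {cell: n for cell, n in A.items() if n in (1, 2)}
    C = {cell: n - B.get(cell, 0) for cell, n in A.items()}
    return B, C


def recombine(C, B_star):
    """Step 3: A* = C + B*."""
    return {cell: n + B_star.get(cell, 0) for cell, n in C.items()}


# Toy 2 x 3 cube keyed by (sex, age group); the counts are invented.
A = {("F", "0-14"): 1, ("F", "15-64"): 7, ("F", "65+"): 2,
     ("M", "0-14"): 0, ("M", "15-64"): 9, ("M", "65+"): 2}

B, C = reduce_cube(A)
n_B = sum(B.values())
k = (n_B + 1) // 3  # nearest integer to n_B / 3: number of cells set to 3

# Step 2 placeholder (assumption): the first k cells of B are set to 3.
# The slides instead select the cells at random and search for a good B*.
B_star = {cell: (3 if i < k else 0) for i, cell in enumerate(B)}

A_star = recombine(C, B_star)
print(sum(A.values()), sum(A_star.values()))
```

In this toy cube n_B = 5, so two cells are set to 3 and the grand total moves from 21 to 22, a deviation of 1.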

Simple properties
• A* – A = B* – B, since A* = C + B* and A = C + B.
• A* is additive.
• |n_A – n_A*| = |n_B – 3[n_B/3]| ≤ 1, with [·] read as rounding to the nearest integer.
• All Primary Marginal Distributions will be consistently rounded.
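
The first and third properties follow directly from the definitions C = A – B and A* = C + B*. The short derivation below is a sketch that assumes [·] denotes rounding to the nearest integer, which is what the bound of 1 requires, and writes n_C and n_{B*} for the totals of C and B* (notation not used on the slides).

```latex
\begin{align*}
A^* - A &= (C + B^*) - (C + B) = B^* - B, \\
n_{A^*} &= n_C + n_{B^*} = n_C + 3\left[\tfrac{n_B}{3}\right], \\
|n_A - n_{A^*}| &= \left| n_B - 3\left[\tfrac{n_B}{3}\right] \right| \le 1 .
\end{align*}
```

For example, with n_B = 3199 the rounded total is 3 · 1066 = 3198, so the grand total changes by exactly 1.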

Rounding method used
• Let n_B = total count of B, e.g. n_B = 3 199.
• From the non-zero cells in B, select without replacement (WOR) [n_B/3] (= 1066) cells to be rounded to 3.
• Probabilities: P(2 → 3) = 2·P(1 → 3).
• Selection may be stratified.
• Calculate the distance m = max_{c ∈ M} |b*_c – b_c| across a control set M of marginal cells of B.
• The solution with the smallest value of m is selected.
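
A minimal sketch of this random search follows, under several assumptions that are not from the slides. The weighted selection uses Efraimidis-Spirakis random keys with the count as weight, so a cell with count 2 carries twice the weight of a cell with count 1 in each draw, and the final inclusion probabilities follow the 2:1 ratio only approximately. The control set M is reduced to the one-way margins, and stratification is omitted. All function names and the toy data are hypothetical.

```python
import random


def one_way_margins(cube):
    """Sum a cube (dict: cell tuple -> count) over all but one variable."""
    margins = {}
    for cell, n in cube.items():
        for var, category in enumerate(cell):
            margins[(var, category)] = margins.get((var, category), 0) + n
    return margins


def round_once(B, rng):
    """Set [n_B/3] weighted-random cells of B to 3 and the rest to 0.

    B must contain only non-zero cells (counts 1 or 2).
    """
    k = (sum(B.values()) + 1) // 3  # nearest integer to n_B / 3
    # Random keys u ** (1 / weight): a larger count tends to give a larger key.
    keyed = sorted(B, key=lambda c: rng.random() ** (1.0 / B[c]), reverse=True)
    chosen = set(keyed[:k])
    return {cell: (3 if cell in chosen else 0) for cell in B}


def distance(B, B_star):
    """m = max over control cells c of |b*_c - b_c| (one-way margins here)."""
    before, after = one_way_margins(B), one_way_margins(B_star)
    return max(abs(after[c] - before[c]) for c in before)


def best_rounding(B, runs=1000, seed=0):
    """Random search: keep the candidate B* with the smallest m."""
    rng = random.Random(seed)
    best, best_m = None, None
    for _ in range(runs):
        candidate = round_once(B, rng)
        m = distance(B, candidate)
        if best_m is None or m < best_m:
            best, best_m = candidate, m
    return best, best_m


# Invented small-count cells keyed by (sex, age group).
B = {("F", "0-14"): 1, ("F", "65+"): 2, ("M", "0-14"): 2,
     ("M", "15-64"): 1, ("M", "65+"): 2, ("F", "15-64"): 1}
B_star, m = best_rounding(B)
print(m, B_star)
```

In this toy example n_B = 9, so three cells are set to 3. The margin for sex = F is 4, which is not a multiple of 3, so no candidate can reach m = 0.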

Test experiment
• Control set M: all one- and two-way marginal counts generated from the eight variables spanning HC 06 (1985 cells).
• 10 000 runs are done.
• For full HC 06 and for the PMDs only.
• With stratified and unstratified sampling.
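
To make the control set concrete, the sketch below builds all one- and two-way marginal counts of a cube stored as a dictionary from cell tuples to counts. With eight variables this gives 8 one-way and 28 two-way tables. The function name `control_margins`, the key layout and the toy cube are assumptions for illustration, and only marginal cells that occur among the stored cells are produced.

```python
from itertools import combinations


def control_margins(cube, n_vars, max_order=2):
    """Marginal counts for every set of 1..max_order variables.

    Keys are (variable indices, category values); values are counts.
    """
    margins = {}
    for order in range(1, max_order + 1):
        for dims in combinations(range(n_vars), order):
            for cell, n in cube.items():
                key = (dims, tuple(cell[d] for d in dims))
                margins[key] = margins.get(key, 0) + n
    return margins


# Invented three-variable cube keyed by (sex, age group, region).
cube = {("F", "0-14", "N"): 1, ("F", "65+", "S"): 2,
        ("M", "0-14", "N"): 2, ("M", "65+", "N"): 1}

M = control_margins(cube, n_vars=3)
print(len(M))                    # number of control cells
print(M[((0,), ("F",))])         # one-way count for sex = F
print(M[((0, 2), ("M", "N"))])   # two-way count for sex = M, region = N
```

The distance m from the rounding slide can then be evaluated by computing these margins for B and for a candidate B* and taking the largest absolute difference.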

Discussion
• The method is not yet fully approved for the Census HCs.
• Is the method sufficient to prevent any kind of disclosure?
• The reduction of the problem (A → B) is absolutely required to make the method work.
• Advantage:
  • Can produce consistent results with acceptable (?) aggregate deviations for a number of linked cubes of some size.
• Problems:
  • With random search the result is subject to chance.
  • Diminishing returns from increasing the number of iterations.
  • We need to find better and more stable search engines.
  • Generalization to rounding bases larger than 3 will increase the deviations in aggregates.

Further work
• Try better sampling procedures (balanced sampling?).
• Try Mixed Integer Linear Programming software.
• Extend the experiment to round more hypercubes jointly.
• An idea: merge the reduced rounded cells back into microdata:
  • A method for perturbing some variables in relation to others.
  • How many variables must be perturbed this way to make all hypercubes safe?
  • Creates a microdata set that produces the rounded tables directly.