Disclosure scenario and risk assessment: Structure of Earnings Survey Daniela Ichim, Luisa Franconi Istat – DCMT – Methodology ichim@istat.it, franconi@istat.it
Outline
1. Objectives of the anonymisation
2. Disclosure scenarios
3. Risk assessment
4. Confidentiality protection
5. Information content analysis
Objectives
MICRODATA FILE FOR RESEARCH
Requirements:
Member States
    Dissemination policy (NACE, Citizenship, Number of Employees, etc.)
    Coherence
Users
    High-priority variables: NACE, NUTS, ISCO
    Minimum level of detail (NACE 2 digits, NUTS1, ISCO 2 digits …)
    Kinds of analysis
        Estimating the difference in Annual Earnings between two categories of the regional detail (estimating differences between regional policies)
        Weighted totals variation
Disclosure scenarios
MICRODATA FILE FOR RESEARCH
Mimic the intruder's knowledge and interest.
POSSIBLE INTRUDER = RESEARCHER.
No external register scenario
No nosy colleague scenario
ONLY SPONTANEOUS IDENTIFICATION
Enterprise spontaneous identification
Key variables
    Structural variables: NACE, NUTS, SIZE
A sampled enterprise is considered at risk when both its population frequency and its sample frequency are below the given threshold.
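The frequency rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`nace`, `nuts`, `size`), the record layout, and the threshold value are hypothetical, since the slides do not give them.

```python
from collections import Counter


def cell(enterprise):
    # Structural key: the combination NACE x NUTS x SIZE (field names are assumed).
    return (enterprise["nace"], enterprise["nuts"], enterprise["size"])


def flag_enterprises_at_risk(sample, population, threshold):
    """Return sampled enterprises whose cell is rare in BOTH sample and population."""
    pop_freq = Counter(cell(e) for e in population)
    sample_freq = Counter(cell(e) for e in sample)
    return [
        e for e in sample
        if sample_freq[cell(e)] < threshold and pop_freq[cell(e)] < threshold
    ]


# Toy data, invented for illustration only.
population = (
    [{"id": i, "nace": "24", "nuts": "ITC", "size": "250+"} for i in (1, 2)]
    + [{"id": i, "nace": "47", "nuts": "ITC", "size": "10-49"} for i in (3, 4, 5, 6)]
    + [{"id": 7, "nace": "47", "nuts": "ITF", "size": "10-49"}]
)
sample = [e for e in population if e["id"] in (1, 3, 7)]

at_risk_ids = [e["id"] for e in flag_enterprises_at_risk(sample, population, threshold=3)]
```

With the hypothetical threshold of 3, enterprise 3 is not flagged: it is alone in the sample, but its cell holds four enterprises in the population.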
Enterprise protection
Structural key variables are all categorical.
Protection is achieved by recoding classes of the categorical key variable with the lowest priority:
1. NACE 2 digits
2. NUTS1
3. SIZE
a) Recoding with respect to the population frequencies generates a lower information loss.
b) If needed, recode another variable.
Employees spontaneous identification
MICRODATA FILE FOR RESEARCH
Information on the enterprise (NACE x NUTS x Size)
Social variables (Gender x Age)
Extremely high earnings related to large enterprises
Employees at risk (use the scenario!)
High Annual Earnings: greater than the 99% quantile (T).
For each combination of NACE, NUTS, Size, Gender, Age, AnnEarn, the number of sampled employees with earnings greater than T was counted.
If there was a single employee with such characteristics, that employee was considered at risk of identification.
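A minimal sketch of this counting rule follows. The field names and the toy records are hypothetical, and computing T over the whole sample (rather than per stratum) is an assumption; the slides do not say how the quantile was taken.

```python
from collections import Counter
from statistics import quantiles

# Assumed field names for the categorical keys listed on the slide.
KEYS = ("nace", "nuts", "size", "gender", "age")


def earnings_threshold(records):
    # T: 99% quantile of annual earnings, taken over the whole sample (assumption).
    return quantiles([r["ann_earn"] for r in records], n=100)[98]


def employees_at_risk(records, t):
    """Return high earners (ann_earn > t) that are alone in their key combination."""
    high = [r for r in records if r["ann_earn"] > t]
    counts = Counter(tuple(r[k] for k in KEYS) for r in high)
    return [r for r in high if counts[tuple(r[k] for k in KEYS)] == 1]


# Toy records, invented for illustration; T is passed explicitly here.
records = [
    {"id": 1, "nace": "64", "nuts": "ITC", "size": "250+", "gender": "M", "age": "50-59", "ann_earn": 400_000},
    {"id": 2, "nace": "64", "nuts": "ITC", "size": "250+", "gender": "M", "age": "50-59", "ann_earn": 350_000},
    {"id": 3, "nace": "64", "nuts": "ITC", "size": "250+", "gender": "F", "age": "40-49", "ann_earn": 380_000},
    {"id": 4, "nace": "64", "nuts": "ITC", "size": "250+", "gender": "F", "age": "30-39", "ann_earn": 40_000},
]
risky_ids = [r["id"] for r in employees_at_risk(records, t=300_000)]
```

Employees 1 and 2 share the same combination, so neither is unique; only employee 3 is flagged.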
Employees: selective protection
MICRODATA FILE FOR RESEARCH
Only records of employees at risk of identification ought to be perturbed.
Only numerical key variables are perturbed.
Controlled perturbation
Weighted total variation below 0.5%.
Can be easily adapted to any stratification.
Constrained regression
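The 0.5% constraint can be checked after any perturbation step. The sketch below is not the constrained regression named on the slide; it only verifies the weighted-total condition, overall or per stratum. The field names (`weight`, `original`, `perturbed`, `nace`) are hypothetical, and the weighted totals are assumed to be non-zero.

```python
from collections import defaultdict


def weighted_total_variation(records, stratum_of=lambda r: "all"):
    """Relative change of the weighted total in each stratum (totals assumed non-zero)."""
    before = defaultdict(float)
    after = defaultdict(float)
    for r in records:
        s = stratum_of(r)
        before[s] += r["weight"] * r["original"]
        after[s] += r["weight"] * r["perturbed"]
    return {s: abs(after[s] - before[s]) / abs(before[s]) for s in before}


def within_tolerance(records, tolerance=0.005, stratum_of=lambda r: "all"):
    # True when every stratum's weighted total moved by less than the tolerance.
    variations = weighted_total_variation(records, stratum_of)
    return all(v < tolerance for v in variations.values())


# Toy data, invented for illustration: only the second record was perturbed.
records = [
    {"nace": "A", "weight": 1000, "original": 100.0, "perturbed": 100.0},
    {"nace": "B", "weight": 5, "original": 1000.0, "perturbed": 1006.0},
]
overall_ok = within_tolerance(records)
by_nace_ok = within_tolerance(records, stratum_of=lambda r: r["nace"])
```

In this toy case the overall total barely moves, but the total for stratum B changes by 0.6%, so the stratified check fails. That shows why the choice of stratification matters for the constraint.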
Information content
User requirements:
Information preservation
    Weighted totals
    Sampling weights
    Only key and confidential variables are modified.
Information loss
    Statistical indicators (correlations, summary statistics)
    Order relationships
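Two of the information-loss checks listed here, a summary statistic and the order relationships, could be compared before and after perturbation roughly as follows. The function names and the toy values are hypothetical, and the slides do not specify which indicators were actually computed.

```python
from statistics import mean


def relative_mean_change(original, perturbed):
    # Relative change of the mean, as one example of a summary statistic.
    return abs(mean(perturbed) - mean(original)) / abs(mean(original))


def order_preserved(original, perturbed):
    """True when sorting records by perturbed values gives the same record order."""
    def ranking(values):
        return sorted(range(len(values)), key=values.__getitem__)
    return ranking(original) == ranking(perturbed)


# Toy earnings, invented for illustration: only the last value is perturbed.
original = [10.0, 20.0, 30.0, 1000.0]
perturbed = [10.0, 20.0, 30.0, 990.0]

mean_change = relative_mean_change(original, perturbed)
same_order = order_preserved(original, perturbed)
```

A perturbation that pulled the top earner below another record would leave the mean check almost unaffected but would fail the order check, so the two indicators are complementary.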
CONCLUSIONS
Confidentiality ensured, minimize the information loss.
Consider the dissemination features.
Consider the data features.