Anonymisation of the EU-SILC database: result of the work of the EU-SILC TF. Jean-Marc Museux, The Statistical Office of the European Communities, Unit F3: Living conditions and social protection. Jean-Marc.Museux@cec.eu.int
Outline • EU-SILC Task Force on Anonymisation • EU-SILC instrument and database • Methodological issues • Implementation • Conclusions Eurostat - UNECE worksession
EU-SILC Task Force on Anonymisation • Objective • To come up with best practices and recommendations for anonymisation of EU-SILC databases • Participants • B. Benard (Eurostat), L. Coppola (Istat), P. Feuvrier (INSEE), Ph. Gublin/J. Longhurst (ONS), N. Jukic (Stat of Slovenia), H. Minkel (Destatis), JM Museux (Eurostat), E. Schulte Nordholt (CBS), H. Sauli (Stat Fin)
EU-SILC instrument • Instrument: - gathering ex post harmonised micro data - on income and living conditions - from 27 European States • Regulatory framework • Harmonised definitions • Minimum methodological requirements (probability sampling, fieldwork, …) • Methodological recommendations • Main source for EU (income) poverty indicators
EU-SILC instrument • Variables • Income (Canberra recommendations) • Demographic • Labour status • Living conditions – housing – deprivation - health • Measurement units • Households and individuals
EU-SILC instrument • Databases • Annual cross-sectional data from 2004 onwards (households and individuals) • Longitudinal data (subset of individual variables), minimum spell of 3 years (4 waves) • Data collection • Implementation under the responsibility of EU+ National Statistical Institutes • Flexibility • Rotational design, pure panel or independent components • Survey data and/or register data
Release policy • Interest of the database • Social and employment policy monitoring (EU Commission services and Study centres) • Social research (Universities, Research centres) • Legal issues • EU legislation allows micro data release for scientific purposes • Micro data have to be anonymised in order to minimise the risk of disclosure of individual information • EU-SILC regulation plans scientific release according to a strict timetable
Release policy • Eurostat main orientations • Right to information collected with public money • Maximise utility of data collected and social return of money invested (EUR 20 million/year) • Significant improvement of the quality through user feedback • Implementation • Encrypted CD-ROM with anonymised EU-SILC database released under licence to researchers • Centralised (Luxembourg) Safe Centre with limited capacity • Decentralised access under study • Remote access not yet developed
Anonymisation – Main issues • Heterogeneous environment in EU • Different perceptions of disclosure risk • No single European best practice • Various implementations of essentially the same common principles • Significant variations of disclosure risk (e.g. Norwegian income register available on the Web) • Harmonisation of procedures in order to ease international comparison
Anonymisation – Main issues • Methodological issues • Common disclosure/attacker scenarios for EU purpose • Measures of risk • Hierarchical files (household and individual levels) • Longitudinal aspects • Cross sectional and longitudinal files matching • Sampling design information • Register matching • Methods of protection
Methodological issues • Common disclosure/attacker scenarios • Broad band approach considering combinations of 3 types of identifying/key variables
Methodological issues • Common EU disclosure/attacker scenarios • 3 additional and more complex attacker scenarios • EU1 (Simple attack with HH information, individual and household level) • REGION x SEX x YEAR OF BIRTH x MARITAL STATUS x HH SIZE x HH TYPE • EU2 (Nosy neighbour individual attack) • REGION x URBANISATION x SEX x DATE OF BIRTH x BASIC ACTIVITY STATUS x BATH OR SHOWER x DO YOU HAVE A CAR? x EDUCATION x OCCUPATION x SECTOR OF ACTIVITY x HH SIZE x HH TYPE • EU3 (Occupational group address book individual attack) • REGION x URBANISATION x SEX x DATE OF BIRTH x EMPLOYMENT STATUS x OCCUPATION x SECTOR OF ACTIVITY
Methodological issues • Measure of risk and threshold • For broad band approach, thresholds are expressed in sample frequencies (heuristic developed by CBS-NL)
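A threshold expressed in sample frequencies can be illustrated with a short sketch. This is not the CBS heuristic itself: the record layout, the variable names and the threshold value of 2 are assumptions made for the example, since the slide does not state the actual thresholds.

```python
from collections import Counter

def unsafe_records(records, key_vars, threshold):
    """Return indices of records whose combination of key variables
    occurs fewer than `threshold` times in the sample."""
    keys = [tuple(rec[v] for v in key_vars) for rec in records]
    freq = Counter(keys)
    return [i for i, k in enumerate(keys) if freq[k] < threshold]

# Toy sample; the variable names are hypothetical stand-ins for EU-SILC keys.
sample = [
    {"region": "A", "sex": "F", "birth_year": 1950},
    {"region": "A", "sex": "F", "birth_year": 1950},
    {"region": "A", "sex": "M", "birth_year": 1950},
    {"region": "B", "sex": "F", "birth_year": 1981},
]

# Threshold of 2 is illustrative only: it flags sample uniques.
flagged = unsafe_records(sample, ["region", "sex", "birth_year"], threshold=2)
```

Records flagged this way are the candidates for recoding or suppression; a higher threshold gives more protection at the cost of more information loss.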
Methodological issues • Measure of risk and threshold for more complex scenario • Probability of a correct match, based on the key variables, between the survey database and the attacker’s database • Measure developed by Benedetti and Franconi and available in Mu-Argus • Takes into account the hierarchical structure of the files: individuals/households • In practice, due to software limitations, only six variables are handled simultaneously and various combinations using subsets of key variables are tested.
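Testing subsets of at most six key variables can be sketched as below. The sketch does not implement the Benedetti and Franconi measure or call Mu-Argus; it uses a count of sample uniques as a simple stand-in, and the lower-case variable names are hypothetical labels for the EU2 scenario keys.

```python
from collections import Counter
from itertools import combinations

# Hypothetical labels for the twelve EU2 key variables.
EU2_KEYS = [
    "region", "urbanisation", "sex", "birth_date", "activity_status",
    "bath_or_shower", "car", "education", "occupation", "sector",
    "hh_size", "hh_type",
]

def key_subsets(key_vars, size=6):
    """All subsets of `size` key variables, in a stable order."""
    return list(combinations(key_vars, size))

def sample_uniques(records, key_vars):
    """Number of key combinations that occur exactly once in the sample."""
    freq = Counter(tuple(r[v] for v in key_vars) for r in records)
    return sum(1 for n in freq.values() if n == 1)

def riskiest_subset(records, key_vars, size):
    """Subset with the most sample uniques (stand-in for a risk measure)."""
    return max(key_subsets(key_vars, size),
               key=lambda subset: sample_uniques(records, subset))

subsets = key_subsets(EU2_KEYS)  # every six-variable subset of the EU2 key

toy = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 2, "c": 2},
]
worst = riskiest_subset(toy, ["a", "b", "c"], size=2)
```

With twelve key variables there are several hundred six-variable subsets, which is why a systematic enumeration is more practical than choosing combinations by hand.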
Methodological issues • Hierarchical structure of information • Household and individual information are collected in EU-SILC • Household and individual records share common identifiers (linkable) • Possibility of linkage is required for many statistical studies • Increased risk of disclosure: individual information can be disclosed through household information and vice versa
Methodological issues • Measure of risk and threshold • In addition, external information on population uniques (ONS) is used to cross-check protection measures (for instance, households of 5 or more persons, described by the age and sex of their members, are often population uniques even at a high level of geographic aggregation)
Methodological issues • Longitudinal data • The follow-up of individuals through time generates rare transitions in some key variables. • These transitions are potentially disclosive if the attacker's database is updated with the same frequency • Corresponding risk is not easily estimated • Matching of longitudinal and cross-sectional data files • For rotational panel and pure panel designs, the longitudinal and cross-sectional files can be matched on the basis of common variables
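One way to make rare transitions visible is to count wave-to-wave changes in a key variable. The sketch below is an illustration only: the data layout (one list of states per person, ordered by wave), the marital status values and the cut-off `min_count` are assumptions, and the count is a screening aid rather than an estimate of disclosure risk.

```python
from collections import Counter

def rare_transitions(histories, min_count):
    """Count (state at wave t, state at wave t+1) pairs over all persons
    and return those seen fewer than `min_count` times."""
    counts = Counter()
    for states in histories.values():
        # Pair each wave with the following one.
        counts.update(zip(states, states[1:]))
    return {pair: n for pair, n in counts.items() if n < min_count}

# Hypothetical marital status histories over three waves.
histories = {
    "p1": ["married", "married", "married"],
    "p2": ["married", "married", "widowed"],
    "p3": ["single", "married", "married"],
}

rare = rare_transitions(histories, min_count=2)
```

Here the stable pair (married, married) is frequent, while the two actual changes of status each occur once and are reported as rare.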
Methodological issues • Sampling design information • Design weights and strata identifiers are potentially disclosive because they are correlated with disaggregated geographical information • Register information • Few variables (income components) in EU-SILC are obtained directly from registers • The availability of registers to attackers is limited except in rare situations (Income Register in Norway and Tax Register in Finland)
Methodological issues • Methods of protection • Global/top recoding • Usability of the database • Requires trade-offs between variables • Local suppressions • May make statistical analysis difficult • Only if they allow a significant gain in the global recoding of secondary variables
Experiments • Level of recoding significantly decreasing disclosure risk • Geographic information needs to be coarsened depending on the size of the country (for large countries, NUTS1 and degree of urbanisation could be released) • Country of birth and Citizenship should be coarsened into 4 broad categories • Age can be delivered in years but must be top coded (80+). This avoids the difficulty of ensuring coherence of protection of longitudinal and cross-sectional data • Number of rooms must be top coded (5+) • ISCED levels 5 and 6 must be regrouped • NACE is regrouped into 19 levels • ISCO 2-digit code can be released
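The top coding and regrouping rules above can be sketched as a small recoding function. The field names (`age`, `rooms`, `isced`) and the label `"5-6"` for the merged ISCED category are assumptions for the example; the caps of 80 and 5 come from the slide.

```python
def top_code(value, cap):
    """Replace any value above `cap` by `cap` (read as 'cap or more')."""
    return min(value, cap)

def recode(record):
    """Apply global recoding to one record; field names are hypothetical."""
    out = dict(record)
    out["age"] = top_code(record["age"], 80)     # 80 stands for 80+
    out["rooms"] = top_code(record["rooms"], 5)  # 5 stands for 5+
    # ISCED levels 5 and 6 are merged; other levels become string labels.
    out["isced"] = "5-6" if record["isced"] in (5, 6) else str(record["isced"])
    return out

protected = recode({"age": 93, "rooms": 7, "isced": 6})
unchanged = recode({"age": 40, "rooms": 3, "isced": 3})
```

Because the same rule is applied to every record, the recoded variables stay consistent between the cross-sectional and longitudinal files.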
Remaining risks • Identification of large households remains • Rare transitions in longitudinal data • Sampling design information • Specific national circumstances • Researcher needs • Household structure • Longitudinal data for longitudinal analysis • Design information for proper inference (not only variable but causal models) • Harmonisation and flexibility • Implementation
Implementation • ECHP experience • Large dissemination in research community through release under licence • Less protection • No observed breach of confidentiality • For EU-SILC • Developing responsible management of risk through controlled release and possibly audit provision and follow-up.
Implementation • Eurostat approach • Common rules for anonymisation of national databases • Residual flexibility is allowed to adapt to national situations following national assessment according to common standards (measure of risk and thresholds, …)
Conclusions • Anonymisation is a matter of trade-offs • Among national perceptions of disclosure risk • Between right to privacy and researcher needs • Between presence of risk and monitoring of risk • Value added of EU-SILC TF • These trade-offs have been debated and made explicit