Enhancing Data Integration Strategies: ESSnet Project Results

Results of a Project on Record Linkage, Statistical Matching and Micro Integration: The ESSnet on Data Integration
Mauro Scanu, Istat (scanu@istat.it)
ESSnet workshop: Roma, 4 December 2012

ESSnet “Data integration”
An ESSnet on data integration was launched in December 2009.
• Partners: Italy, Netherlands, Norway, Poland, Spain, Switzerland
• Length: 2 years
• Summary webpage: http://www.essnet-portal.eu/essnet-projects/ongoing-essnet-projects/data-integration
• Objective: to spread data integration know-how in the ESS
• WP1: state-of-the-art update
• WP2: methodological developments
• WP3: software issues
• WP4: case studies
• WP5: dissemination (on-the-job training courses on specific methods, one course, one final workshop, contacts with other ESSnets, ISI conference)

What kind of “data integration”?
• We focused on the statistical methods of data integration:
  • Record linkage: look for the same unit in two sources
  • Statistical matching: look for joint information on variables observed in two samples with no units in common
• and on the methods that make the integrated data set useful for statistical analysis:
  • Micro integration processing

Micro integration
• Micro integration is the method that aims at improving data quality in combined sources by searching for and correcting errors at the unit level, in such a way that:
  • the validity and reliability of the statistical outcomes are optimized,
  • only one figure on one phenomenon is published,
  • variables from different sources can be combined, and
  • accurate longitudinal outcomes can be published.

A model for possible errors in the joint analysis of two data sets

Methodological developments and case studies
• Methodological developments
  • Consistency at the micro level
  • Consistency at the macro level
  • Statistical matching
  • Record linkage
• Case studies
  • Constructing a register combining different sources on the topic of employment
  • Constructing the educational attainment variable

Consistency at the micro level
Example: business record with data from two sources
Problem: inconsistency among the figures with respect to the edit constraints

Consistency at the micro level
• Edit rules:
  • a1: x1 - x5 + x8 = 0 (Profit = Turnover - Total costs)
  • a2: x5 - x3 - x4 = 0 (Turnover = Turnover main + Turnover other)
  • a3: x8 - x6 - x7 = 0 (Total costs = Wages + Other costs)
• Objective: we need to combine the different pieces of information (survey values, register values and edit rules) to obtain a record such that
  • the record contains the register values for the variables for which these are available, and
  • the record satisfies the edit constraints.
• Results: different distance functions have been tested and applied, in the presence of both “hard” and “soft” constraints
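The objective above can be illustrated with a minimal sketch. It assumes a plain least-squares distance (only one of several possible distances, and not necessarily one used in the project), treats all three edit rules as hard constraints, and uses made-up survey and register values. The function name `adjust_record` and the choice of which variables come from the register (x5 and x8) are hypothetical.

```python
import numpy as np

# Edit rules a1..a3 as rows of A, so that a consistent record satisfies A @ x = 0.
# Columns correspond to x1..x8 (x2 appears in no rule).
A = np.array([
    [1, 0,  0,  0, -1,  0,  0, 1],   # a1: x1 - x5 + x8 = 0
    [0, 0, -1, -1,  1,  0,  0, 0],   # a2: x5 - x3 - x4 = 0
    [0, 0,  0,  0,  0, -1, -1, 1],   # a3: x8 - x6 - x7 = 0
], dtype=float)

def adjust_record(x0, fixed, A):
    """Least-squares adjustment of a record (hypothetical helper).

    x0    : survey values for all variables
    fixed : dict {column index: register value}; these values are kept as they are
    A     : edit-rule matrix, the adjusted record must satisfy A @ x = 0
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    fixed_idx = sorted(fixed)
    free_idx = [i for i in range(len(x0)) if i not in fixed]
    x[fixed_idx] = [fixed[i] for i in fixed_idx]

    A_free = A[:, free_idx]
    # Constraints on the free variables: A_free @ x_free = b
    b = -A[:, fixed_idx] @ x[fixed_idx]
    residual = A_free @ x0[free_idx] - b
    # Lagrange multipliers of the equality-constrained least-squares problem
    lam = np.linalg.lstsq(A_free @ A_free.T, residual, rcond=None)[0]
    x[free_idx] = x0[free_idx] - A_free.T @ lam
    return x

# Made-up example: survey values for x1..x8, register values for x5 and x8.
survey = [330, 20, 930, 100, 1030, 500, 200, 700]
register = {4: 950.0, 7: 800.0}   # zero-based column indices of x5 and x8
x_adj = adjust_record(survey, register, A)
print(x_adj)
```

The register values are copied into the record unchanged, and the remaining survey values are moved as little as possible (in the squared-error sense) so that all three edit rules hold. Soft constraints or other distances would require a different formulation.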

Consistency at the macro level
• Example: produce a set of hypercubes for the census, estimating each hypercube from a different data source (or combination of data sources), and imposing that each joint distribution available from two hypercubes is the same (Dutch virtual census).
• Available method: consistent repeated weighting. However, this method harmonizes one table at a time, and it becomes harder as the number of tables to reconcile (i.e. the number of constraints to impose) grows.
• Solution: the consistency problem is reduced to an optimization problem:
  • Compute each hypercube
  • Look for the hypercubes nearest to the observed ones, under the constraint that the joint distribution of the variables in common between each pair of hypercubes is equal
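A toy version of this optimization idea, reduced to two small tables that share one variable, might look as follows. This is only a sketch under strong assumptions: a squared-error distance, a single pair of tables, and no non-negativity constraint on the cells. The function name `reconcile_two_tables` and the numbers are illustrative and are not taken from the Dutch virtual census.

```python
import numpy as np

def reconcile_two_tables(t1, t2):
    """Find the tables nearest (in squared error) to t1 and t2 whose margins
    over the shared row variable coincide.

    t1 : observed table, shared variable A (rows) by variable B (columns)
    t2 : observed table, shared variable A (rows) by variable C (columns)
    """
    t1 = np.asarray(t1, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    r1 = t1.sum(axis=1)          # margin of A estimated from the first source
    r2 = t2.sum(axis=1)          # margin of A estimated from the second source
    # Closed-form solution: spread the margin discrepancy evenly over
    # all cells of the corresponding rows in both tables.
    lam = (r1 - r2) / (t1.shape[1] + t2.shape[1])
    return t1 - lam[:, None], t2 + lam[:, None]

# Illustrative tables with inconsistent margins for the shared variable.
t1 = [[10, 20], [30, 40]]
t2 = [[12, 9, 12], [20, 25, 30]]
a1, a2 = reconcile_two_tables(t1, t2)
print(a1.sum(axis=1), a2.sum(axis=1))
```

With many hypercubes and many shared variables the same problem no longer has such a simple closed form and would be handed to a numerical solver; the sketch only shows the structure of the constraint.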

Statistical matching
Example: produce an estimate of the joint distribution of a pair of variables observed in distinct data sets, with no units in common (e.g. expenditures in one survey and income in another)
Available method: statistical matching. The idea is to avoid creating a fictitious synthetic data set with joint observations of the variables at the unit level, and instead to estimate the distribution from the available data sets. The absence of joint information produces uncertainty (e.g. Fréchet bounds)
Objective: compare some of the alternatives available in the literature (file concatenation, calibration)
Results: file concatenation seems attractive, although it is sometimes difficult to apply when dealing with samples drawn according to complex survey designs
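The uncertainty mentioned above can be made concrete with the Fréchet bounds for two categorical variables. The sketch below assumes that both variables have been grouped into classes, that only their marginal distributions are known, and that no common matching variables are used to narrow the bounds. The function name `frechet_bounds` and the marginals are illustrative.

```python
import numpy as np

def frechet_bounds(p_y, p_z):
    """Fréchet bounds on the joint probabilities P(Y=j, Z=k)
    given only the marginal distributions of Y and Z."""
    p_y = np.asarray(p_y, dtype=float)
    p_z = np.asarray(p_z, dtype=float)
    lower = np.maximum(0.0, p_y[:, None] + p_z[None, :] - 1.0)
    upper = np.minimum(p_y[:, None], p_z[None, :])
    return lower, upper

# Illustrative marginals, e.g. two expenditure classes and two income classes.
p_expenditure = [0.7, 0.3]
p_income = [0.6, 0.4]
lower, upper = frechet_bounds(p_expenditure, p_income)
print(lower)
print(upper)
```

Any joint distribution compatible with the two marginals has cell probabilities between `lower` and `upper`; the width of this interval is one way to express how much the data sets leave undetermined.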

Record linkage
Example: a data set on enterprises with information on the enterprise's economic situation and patents
Available methods: record linkage
Results: we tested a new Bayesian approach (Tancredi and Liseo, Annals of Applied Statistics, 2011). A comparison on real data shows that the results are similar to those obtained with the Fellegi and Sunter approach. The main advantage is the possibility of directly estimating parameters related to variables observed separately in the two sources, without creating a linked data file
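For reference, the classical Fellegi and Sunter decision rule used as the benchmark can be sketched as below. This is not the Bayesian approach tested in the project. The m- and u-probabilities, the field names and the two thresholds are all hypothetical values chosen for illustration; in practice they would be estimated from the data.

```python
import math

def fs_weight(agreement, m, u):
    """Fellegi-Sunter composite weight for one candidate record pair.

    agreement : dict {field: True/False}, whether the pair agrees on the field
    m         : dict {field: P(agree | pair is a true match)}
    u         : dict {field: P(agree | pair is not a match)}
    """
    total = 0.0
    for field, agrees in agreement.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total

def classify(weight, lower, upper):
    """Three-way decision based on two thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Hypothetical parameters for two comparison fields.
m = {"name": 0.9, "city": 0.8}
u = {"name": 0.1, "city": 0.2}

w_agree = fs_weight({"name": True, "city": True}, m, u)
w_mixed = fs_weight({"name": True, "city": False}, m, u)
print(classify(w_agree, lower=-2.0, upper=4.0))
print(classify(w_mixed, lower=-2.0, upper=4.0))
```

Pairs falling between the two thresholds are usually sent to clerical review.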

Case study 1
• Objective:
  • constructing a register combining different sources on the topic “employment”.
  • quality evaluation of the register-based employment statistics of Norway by comparison with the LFS at small area level. The comparison was made at table level, rather than with the traditional approach of considering unit misclassification.
• Assumptions and tools: no bias in LFS and no variance in REG. A multilevel model was used to estimate the MSE of the register-based statistics. Using the empirical best linear unbiased predictor (EBLUP) estimators from the multilevel model, we are able to compare the MSEs, and find that the register-based method mostly outperforms LFS at municipality level.

Case study 2
• Objective:
  • construct the educational attainment variable with the use of register data (7), instead of using only a sample survey (e.g. LFS).
• Description of the case study:
  • Description of the sources to integrate
  • Micro integration (attention is given to the representativity of the files, the reference date and consistency)
  • Estimates (the estimator combines the part available from registers and the part from LFS; see picture). Kuijvenhoven and Scholtus (2011) show the conditions under which the combined estimator has a lower mean squared error (MSE) than a direct sample-based estimator.
  • Measurement of accuracy

Case study 2

Other case studies and methodological developments
• Case studies
  • Integration of small and medium enterprises with fiscal statements
  • First steps in profiling Italian patenting enterprises
• Methodological developments
  • Editing errors in the relations between units when linking economic data sets to a population frame
  • Handling incompleteness after linkage to a population frame: incoherence in unit types, variables and periods
  • Bootstrapping combined estimators based on register and survey data

Software issues
Record linkage: Relais, a software tool developed at Istat, has been improved with some pre-processing techniques
Statistical matching: StatMatch, an R package, has been improved with tools for assessing uncertainty, and with a vignette

Workshop and course
Some dissemination tools:
• Three on-the-job training courses (Poland, UK, Latvia). The data sets used for these on-the-job training courses are anonymous and can be used for training and testing purposes.
• One course on statistical matching, record linkage and micro integration (participants from 13 EU countries, 2 candidate countries, plus Eurostat and ECB)
• One workshop: Madrid, 25-26 November 2011, http://www.ine.es/e/essnetdi_ws2011.html
  • 5 invited speakers (W. E. Winkler, B. Liseo, P.L. Conti, M. Lenk, E. Golata), 15 contributed papers from Europe, Australia and the USA.
  • http://www.essnet-portal.eu/di/wp5-dissemination-towards-ess/final-workshop-madrid-november-2011

Project members
Cristina Casciano, Nicoletta Cibella, Paolo Consolini, Marco Di Zio, Marcello D’Orazio, Marco Fortini, Daniela Ichim, Filippo Oropallo, Laura Peci, Francesca Romana Pogelli, Mauro Scanu, Monica Scannapieco, Giovanni Seri, Tiziana Tuoto, Luca Valentino, Jeroen Pannekoek, Arnout van Delden, Bart Bakker, Paul Knottnerus, Léander Kuijvenhoven, Frank Linder, Nino Mushkudiani, Dominique van Roon, Eric Schulte Nordholt, Jean-Pierre Renfer, Daniel Kilchmann, Marcin Szymkowiak, Adam Ambroziak, Dehnel Grażyna, Tomasz Józefowski, Tomasz Klimanek, Jacek Kowalewski, Ewa Kowalka, Andrzej Młodak, Artur Owczarkowski, Jan Paradysz, Wojciech Roszka, Pietrzak Beata Rynarzewska, Magdalena Zakrzewska, Francisco Hernandez Jimenez, Gervasio-Luís Fernández Trasobares, Miguel Guigó Pérez, Johan Fosen, Li-Chun Zhang
