This presentation discusses the theory and practice of developing RELAIS, a record linkage toolkit. It covers the techniques and phases involved in record linkage and the experiences of using RELAIS in Italy and Spain, and it highlights the modular structure and open-source nature that make RELAIS a useful tool for the scientific community.

NTTS 2009, Brussels, 18-20 February 2009
Sharing Solution for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences
Nicoletta Cibella (1), Gervasio-Luis Fernandez (2), Marco Fortini (1), Miguel Guigò (2), Francisco Hernandez (2), Monica Scannapieco (1), Laura Tosco (1), Tiziana Tuoto (1)
(1) Italian National Statistical Institute – ISTAT – Italy
(2) Spanish National Statistical Institute – INE – Spain

Theory and Practice in Developing a Record Linkage Software
Outline
• Record Linkage
• The ESSnet on ISAD
• The Idea and the Features of the RELAIS Software
• The Italian and Spanish Experiences in Using RELAIS
• Towards RELAIS 2.0
• Conclusions
Nicoletta Cibella, Brussels, 19th February 2009

Record Linkage
The purpose of record linkage is to identify records that refer to the same real-world entity, which may be represented differently across data sources.
Different approaches to record linkage:
• Exact RL
• Deterministic RL
• Probabilistic RL (Fellegi and Sunter theory)
• Bayesian RL
• Machine Learning
• Knowledge Representation
• …
No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…).

Record Linkage Complexity
Record linkage techniques are a multidisciplinary set of methods and practices:
• PRE-PROCESSING
- Conversion of upper/lower cases
- Replacement of null strings
- Standardization
- Parsing
- …
• SEARCH SPACE REDUCTION
- Sorted Neighbourhood Method
- Blocking
- Hierarchical Grouping
- …
• COMPARISON FUNCTION CHOICE
- Exact
- Edit distance
- Smith-Waterman
- Q-grams
- Jaro string comparator
- Soundex code
- TF-IDF
- …
• DECISION MODEL CHOICE
- Fellegi & Sunter
- Deterministic
- Bayesian
- Knowledge-based
- Mixed
- …
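
To make two of the comparison functions listed above concrete, here is a minimal Python sketch of the edit (Levenshtein) distance and of a q-gram similarity. It is illustrative only and is not RELAIS code: the function names are hypothetical, and the q-gram score is computed as a Jaccard overlap of q-gram sets, which is just one of several common definitions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping substrings of length q (no padding, an assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        # Both strings are shorter than q: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

For example, `edit_distance("ROSSI", "ROSI")` is 1 (one deleted letter), and `qgram_similarity("ROSSI", "ROSI")` is 0.75, since three of the four distinct bigrams are shared.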

The Record Linkage Phases
Record linkage should be decomposed into its constituent phases as far as possible:
1. Pre-processing of the input files
2. Creation and reduction of the search space of candidate link pairs
3. Choice of the matching variables
4. Choice of the comparison function
5. Choice of the decision model
6. Selection of unique links
7. Record linkage evaluation
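
As an illustration of phase 1, the sketch below shows what a simple pre-processing step might look like in Python: case conversion, whitespace standardization and replacement of null strings. It is a hedged example rather than the RELAIS implementation; the function name, the list of null tokens and the decision to strip accents are all assumptions made for the example.

```python
import unicodedata

# Hypothetical set of strings treated as "null" after standardization.
NULL_TOKENS = {"", "NULL", "N/A", "NA", "-"}


def preprocess_value(value):
    """Standardize one field value; return None for missing values."""
    if value is None:
        return None
    # Decompose accented characters and drop the combining marks
    # (an assumption: accents are treated as noise for matching).
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Upper-case and collapse runs of whitespace.
    text = " ".join(text.upper().split())
    return None if text in NULL_TOKENS else text
```

Applied to every matching and blocking variable of both input files, a step like this makes later comparisons less sensitive to purely cosmetic differences.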

The ESSnet ISAD: Integration of Surveys and Administrative Data
The ESSnet and its focus
The aim of the project is to raise, across the whole ESS, knowledge and understanding of the statistical methodologies for integrating two (or more) data sources.
Partners
The ESSnet ISAD, co-financed by Eurostat, started in December 2006 and ended in June 2008. The project involved 5 countries:
• ISTAT – Italy (scientific coordinator)
• STAT – Austria
• CZSO – Czech Republic
• CBS – Netherlands
• INE – Spain

RELAIS: The Idea
• There is no single optimal solution to record linkage problems: for each phase, the most appropriate technique should be chosen according to the application and the data requirements, not only according to the practitioner's skill.
• An ad-hoc record linkage process (workflow) should be built dynamically.
• RELAIS (REcord Linkage At IStat) is a toolkit serving this purpose.

Record Linkage Workflows
[Diagram: two example record linkage workflows (Appl1 and Appl2), each assembled by selecting techniques for every phase. Techniques shown per phase:
• Preprocessing: UpperLowerCase, Normalization, Schema reconciliation
• Search Space Reduction: Blocking, SNM
• Comparison Function: Equality, Jaro, Edit Distance
• Decision Model: Probabilistic, Empirical]

RELAIS Features
• Modular structure: each phase is designed as a module of the toolkit, with an explicit interface to the other modules
• Top-down design: this makes it possible to omit and/or iterate modules (phases) of the record linkage process
• Advantages:
- dynamic composition of record linkage processes
- parallel development of various techniques
- design for Web service encapsulation, to permit remote invocation
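
The idea of composing a workflow out of interchangeable modules can be sketched in a few lines of Python. This is only an illustration of the design principle, not the RELAIS architecture: the shared "state" dictionary, the module names and `run_workflow` are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical common interface: every module receives the workflow
# state (input files "a" and "b", candidate "pairs", comparison
# "vectors") and returns the updated state.
State = Dict[str, Any]
Module = Callable[[State], State]


def to_upper(state: State) -> State:
    """Pre-processing module: upper-case every field of both files."""
    for name in ("a", "b"):
        state[name] = [{k: str(v).upper() for k, v in rec.items()}
                       for rec in state[name]]
    return state


def cross_product(state: State) -> State:
    """Search space module: keep every possible pair."""
    state["pairs"] = [(ra, rb) for ra in state["a"] for rb in state["b"]]
    return state


def blocking(key: str) -> Module:
    """Search space module: keep only pairs that agree on the block key."""
    def module(state: State) -> State:
        state["pairs"] = [(ra, rb) for ra in state["a"] for rb in state["b"]
                          if ra[key] == rb[key]]
        return state
    return module


def equality(fields: List[str]) -> Module:
    """Comparison module: 1 if a field agrees exactly, 0 otherwise."""
    def module(state: State) -> State:
        state["vectors"] = [tuple(int(ra[f] == rb[f]) for f in fields)
                            for ra, rb in state["pairs"]]
        return state
    return module


def run_workflow(modules: List[Module], state: State) -> State:
    """Run the chosen modules in order on a copy of the initial state."""
    state = dict(state)
    for module in modules:
        state = module(state)
    return state
```

Because every module shares the same interface, two applications can assemble different workflows from the same parts, for example `[to_upper, blocking("month"), equality(["name"])]` for one and `[cross_product, equality(["name"])]` for another, omitting or repeating phases as needed.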

RELAIS: An Open Source Project
• Results produced by the scientific community in recent years can be gathered and made available
- 175 000 papers mentioning “record linkage” (Google Scholar)
• Techniques for each phase can be implemented and maintained very rapidly by relying on a community of developers
• RELAIS implementation choices:
- Java
- R statistical language

RELAIS: the First Release
RELAIS 1.0 provides:
• SEARCH SPACE REDUCTION
- Cross Product
- Sorted Neighbourhood Method
- Blocking
• COMPARISON FUNCTION CHOICE
- Equality
• DECISION MODEL CHOICE
- Fellegi & Sunter
• 1:1 REDUCTION
- Optimised Transportation Problem
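
For readers unfamiliar with the Fellegi & Sunter decision model, the following Python sketch shows its core computation: each matching variable contributes a log-likelihood-ratio weight depending on whether the pair agrees on it, and the summed weight is compared with two thresholds. This is a textbook-style illustration, not the RELAIS implementation; the function names, the m- and u-probabilities and the thresholds in the usage note are hypothetical values (in practice the probabilities are estimated from the data).

```python
import math


def agreement_weights(m: float, u: float):
    """Weights for one matching variable.

    m = P(variable agrees | pair is a true match)
    u = P(variable agrees | pair is a true non-match)
    Returns (weight if the pair agrees, weight if it disagrees).
    """
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def composite_weight(vector, m_probs, u_probs) -> float:
    """Sum the per-variable weights for a 0/1 comparison vector."""
    total = 0.0
    for agrees, m, u in zip(vector, m_probs, u_probs):
        w_agree, w_disagree = agreement_weights(m, u)
        total += w_agree if agrees else w_disagree
    return total


def classify(weight: float, lower: float, upper: float) -> str:
    """Three-way decision based on two thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"
```

With the illustrative values `m_probs = [0.9, 0.9]` and `u_probs = [0.1, 0.1]`, a pair agreeing on both variables gets a weight of 2·log2(9) ≈ 6.34, and a pair disagreeing on both gets ≈ -6.34; pairs falling between the two thresholds would go to clerical review.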

RELAIS in the Italian and Spanish Experiences
• Common ideas and needs regarding the software (no ad-hoc solutions)
• Knowledge sharing and cooperation started within the ESSnet
• Evaluation of the adaptability of RELAIS, so that it can also solve Spanish data integration problems

RELAIS in the Italian Tests
A Scenario: the Data
• Individual-level data from the 2001 Italian Census and the Post Enumeration Survey (PES), about 180 000 records each.
• A capture-recapture model is used to estimate the Census coverage rate; it assumes no matching errors in linking Census and PES records.
• Linkage was a very complex operation:
- deterministic and probabilistic approaches and clerical review
- almost 15 matching variables
- several months of work.
• Owing to the accuracy of the matching procedures adopted, the true linkage status of all candidate pairs is known.

RELAIS in the Italian Tests
A focus on Rome
• Size of the PES and Census (CEN) files: about 8 000 units each
• Cartesian product CEN × PES: more than 72 250 000 pairs (expected link probability ≈ 0.0001)
1st Linkage Pass
• Blocking variable: month of birth of the household head
• Matching variables: name, surname, gender, day-month-year of birth
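
The size of the search space and the effect of blocking can be illustrated with a short Python sketch. Two assumptions are made loudly here: the file size of 8 500 units is inferred from the slide's figure of 72 250 000 pairs (8 500 × 8 500), since the slide itself only says "about 8 000"; and the link probability is a rough bound that assumes each unit appears at most once per file. The function names are hypothetical and this is not RELAIS code.

```python
from collections import defaultdict


def cross_product_size(n_a: int, n_b: int) -> int:
    """Number of candidate pairs when every record is compared with every other."""
    return n_a * n_b


def expected_link_probability(n_a: int, n_b: int) -> float:
    """Rough upper bound on the share of true links among all pairs,
    assuming at most min(n_a, n_b) links (one per unit)."""
    return min(n_a, n_b) / (n_a * n_b)


def block_pairs(file_a, file_b, key):
    """Yield only the pairs that share the same value of the blocking key."""
    index = defaultdict(list)
    for rec_b in file_b:
        if rec_b[key] is not None:      # records with a missing key are skipped
            index[rec_b[key]].append(rec_b)
    for rec_a in file_a:
        if rec_a[key] is None:
            continue
        for rec_b in index.get(rec_a[key], []):
            yield rec_a, rec_b
```

Under the 8 500-unit assumption, `cross_product_size(8500, 8500)` gives 72 250 000 and `expected_link_probability(8500, 8500)` gives about 0.00012, in line with the ≈ 0.0001 on the slide. Records with a missing blocking value never enter a block in this sketch, so they would have to be handled in a later pass.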

RELAIS in the Italian Tests
Results of the 1st Linkage Step
• Match rate: 88%
• False match rate: 0.5%
• False non-match rate: 12%
The software also provides results at the block level.
MATCH RATE TOO LOW IN A COVERAGE CONTEXT

RELAIS in the Italian Tests
2nd Linkage Pass
• Residuals of the 1st step: about 1 500 units in each file, mainly records with a missing value in the blocking variable used at the 1st step; expected link probability ≈ 0.0003
• Cartesian product: again not recommended …
• Search space reduction by means of the Sorted Neighbourhood Method
- Sorting variable: first letter of surname
- Window size = 450 (frequency of the most common first letter = 250)
• Matching variables: name, surname, day-month-year of birth
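
The Sorted Neighbourhood Method used in this pass can be sketched as follows: both files are merged, sorted on a key, and only records that fall within the same sliding window are paired. The Python below is a minimal illustration under stated assumptions, not the RELAIS implementation; the function name and its signature are hypothetical.

```python
def sorted_neighbourhood_pairs(file_a, file_b, key, window):
    """Return the set of (index_in_a, index_in_b) candidate pairs.

    key    : function mapping a record to its sorting value
    window : number of consecutive sorted records compared together
    """
    # Tag each record with its source file and original position.
    tagged = [("A", i, key(rec)) for i, rec in enumerate(file_a)]
    tagged += [("B", j, key(rec)) for j, rec in enumerate(file_b)]
    tagged.sort(key=lambda item: item[2])

    pairs = set()
    for pos, (source, idx, _) in enumerate(tagged):
        # Compare with the next (window - 1) records in sorted order.
        for other_source, other_idx, _ in tagged[pos + 1:pos + window]:
            if source != other_source:   # only cross-file pairs are candidates
                if source == "A":
                    pairs.add((idx, other_idx))
                else:
                    pairs.add((other_idx, idx))
    return pairs
```

With a coarse sorting key such as the first letter of the surname, many records tie on the key, so the window has to be large for records sharing a letter to meet; this is presumably why the slide pairs a window of 450 with a most-common-letter frequency of 250.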

RELAIS in the Italian Tests
Results of the Overall Linkage Procedure (1st plus 2nd steps)
• Match rate: 98.5%
• False match rate: 0.8%
• False non-match rate: 2.3%
• Working time: less than 2 hours

RELAIS in the Italian Tests
Rome PES Workflow (RELAIS 1.0)
[Workflow diagram. Options shown per phase:
• Search Space Reduction: Blocking, SNM, Cross Product
• Comparison Function: Edit Distance, Jaro-Winkler, Equality
• Decision Model: Probabilistic
• Linking Type: 1:1, Many:Many
Step 1: Blocking, Equality, Probabilistic, 1:1
Step 2: SNM, Equality, Probabilistic, 1:1]

RELAIS in the Spanish Tests
A Scenario: the Data
• Individual-level data from the Living Conditions Survey (LCS) and the Central Population Register (CPR)
• 1st main objective: obtain the ID number for LCS records
• 2nd main objective: compare the RELAIS results with ad-hoc procedures
• Linkage was a very complex operation:
- only “name” and geographical variables were available
- large amount of data.
• Blocking on geographic area variables

RELAIS in the Spanish Tests
• Weaknesses of RELAIS 1.0
- difficulty managing a large number of blocks
- difficulty dealing with different probability estimates in each block
- difficulty writing the largest output files
• Strengths of RELAIS 1.0
- efficacy of the implemented probabilistic method
- notable flexibility in modifying and adapting the implemented functionalities (reduction from M:N to 1:1)
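
To show what a reduction from M:N to 1:1 links involves, here is a deliberately simple Python sketch. RELAIS 1.0 formulates this step as an optimised transportation problem; the code below swaps in a greedy heuristic instead, purely for illustration, and the function name and input format are hypothetical.

```python
def greedy_one_to_one(scored_pairs):
    """Reduce many-to-many candidate links to one-to-one links.

    scored_pairs: iterable of (id_a, id_b, weight) tuples.
    Pairs are accepted in decreasing order of weight, skipping any pair
    whose record on either side has already been linked.
    """
    used_a, used_b, links = set(), set(), []
    for id_a, id_b, weight in sorted(scored_pairs,
                                     key=lambda pair: pair[2],
                                     reverse=True):
        if id_a not in used_a and id_b not in used_b:
            links.append((id_a, id_b, weight))
            used_a.add(id_a)
            used_b.add(id_b)
    return links
```

For the candidates `("a1","b1",5.0)`, `("a1","b2",4.0)`, `("a2","b1",3.0)` and `("a2","b2",1.0)`, the greedy rule keeps a1-b1 and a2-b2 for a total weight of 6.0, whereas the alternative assignment a1-b2 and a2-b1 totals 7.0. Gaps of this kind are one reason to prefer an optimisation formulation over a greedy pass.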

Towards RELAIS 2.0
• A relational database architecture, to optimize performance when managing huge amounts of data throughout the whole record linkage process (input, intermediate phases and output).
• Several distance functions for string and numerical comparisons (not only equality).
• Exact and deterministic decision models, to be used either as alternatives to or in conjunction with the probabilistic model.
• A data profiling phase to help the user in the critical task of choosing the best blocking or matching variables.
• One-shot execution to deal with a large number of blocks.
• RELAIS 2.0 is currently being tested and will be available from May 2009.

Concluding Remarks
• Profitable experience of cooperation between NSIs.
• The open-source philosophy and the move beyond ad-hoc approaches proved to be winning choices.
• NSIs share common problems and needs in data integration projects.
• New challenge:
- Add methods for evaluating record linkage quality to RELAIS.

RELAIS: Availability and Contacts
• RELAIS 1.0 is available on the website: www.istat.it
• RELAIS 2.0 will be available in May 2009
RELAIS contacts:
• Nicoletta Cibella, Statistician. E-mail: cibella@istat.it
• Tiziana Tuoto, Statistician. E-mail: tuoto@istat.it