Tiziana Tuoto, Nicoletta Cibella, Marco Fortini Italian National Statistical Institute

The relationship between error rates and parameter estimation in the probabilistic record linkage context
Tiziana Tuoto, Nicoletta Cibella, Marco Fortini
Italian National Statistical Institute (ISTAT), Rome, Italy

Outline
• Record linkage and quality
• Record linkage as a statistical problem
• Aim of the work: evaluate the linkage errors
• Results on a case study
• Further work

Record linkage at Quality2008
The purpose of record linkage is to identify, quickly and accurately, the records that refer to the same real-world entity, even when that entity is represented differently across data sources.
Widespread applications in official statistics:
• creation, updating and de-duplication of frames
• estimation of population size by capture-recapture models
• checking the confidentiality of public-use microdata
Record linkage procedures can substantially improve the quality and quantity of the available information.
Warning: when using data that come from a linkage, the quality of the linkage procedure must be taken into account.

Linkage Framework Preparatory activities Linkage Method adjustments Evaluation Usage

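Record linkage: formalization (1)
Two data sets A and B, of size $N_A$ and $N_B$ respectively.
Consider $\Omega = \{(a, b) : a \in A,\ b \in B\}$, of size $N = N_A \times N_B$.
The problem: classify the pairs in $\Omega$ into two mutually exclusive subsets M and U:
• M is the set of matches ($a = b$)
• U is the set of non-matches ($a \neq b$)
To classify the pairs, common identifiers (matching variables) are selected.
For each pair a comparison vector $\gamma$ is defined. For example, with binary comparisons, $\gamma = (\gamma_1, \dots, \gamma_K)$, where $\gamma_k = 1$ if the two records agree on the k-th matching variable and $\gamma_k = 0$ otherwise.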
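The construction of the comparison vectors can be sketched as follows. This is a minimal illustration, assuming binary agree/disagree comparisons (the coding used for the patterns in the case study); the field names and the toy records are hypothetical and are not taken from the slides.

```python
from itertools import product

# Hypothetical field names, used only for illustration.
FIELDS = ("surname", "name", "day", "year")


def comparison_vector(a, b, fields=FIELDS):
    """Return a tuple with 1 where records a and b agree on a field, 0 otherwise."""
    return tuple(int(a[f] == b[f]) for f in fields)


def all_pairs(file_a, file_b, fields=FIELDS):
    """Map every pair (i, j) of the cross product A x B to its comparison vector."""
    return {
        (i, j): comparison_vector(a, b, fields)
        for (i, a), (j, b) in product(enumerate(file_a), enumerate(file_b))
    }


# Toy records (invented values).
file_a = [
    {"surname": "rossi", "name": "maria", "day": 12, "year": 1970},
    {"surname": "bianchi", "name": "luca", "day": 3, "year": 1985},
]
file_b = [
    {"surname": "rossi", "name": "mara", "day": 12, "year": 1970},
    {"surname": "verdi", "name": "luca", "day": 3, "year": 1990},
]

gammas = all_pairs(file_a, file_b)
for pair, gamma in sorted(gammas.items()):
    print(pair, gamma)
```

With 2 records per file the cross product has N = 2 × 2 = 4 pairs, each carrying one comparison vector.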

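Record linkage: formalization (2)
The ratio of the distributions of $\gamma$ in the M and U subsets is used to classify the pairs:
$$r(\gamma) = \frac{m(\gamma)}{u(\gamma)} = \frac{P(\gamma \mid M)}{P(\gamma \mid U)}$$
The classification criterion is based on two thresholds $T_m$ and $T_u$ ($T_m > T_u$): pairs with $r \ge T_m$ are declared matches, pairs with $r \le T_u$ are declared non-matches, and the pairs in between are left as possible matches.
The thresholds are chosen so that the false match rate (FMR) and the false non-match rate (FNMR) are minimized.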
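The two-threshold rule can be sketched as follows. This is only an illustration under stated assumptions: the distributions `m_dist` and `u_dist` and the threshold values are invented, and the handling of ratios exactly equal to a threshold follows the usual Fellegi and Sunter convention, which the slides do not spell out.

```python
def classify(ratio, t_m, t_u):
    """Classify a pair from its ratio m(gamma)/u(gamma) and thresholds t_m > t_u."""
    if t_m <= t_u:
        raise ValueError("t_m must be greater than t_u")
    if ratio >= t_m:
        return "match"
    if ratio <= t_u:
        return "non-match"
    return "possible match"


# Invented distributions of the comparison vector over two binary variables.
m_dist = {(1, 1): 0.90, (1, 0): 0.05, (0, 1): 0.04, (0, 0): 0.01}
u_dist = {(1, 1): 0.01, (1, 0): 0.09, (0, 1): 0.10, (0, 0): 0.80}

# Invented thresholds, chosen only to show all three outcomes.
T_M, T_U = 10.0, 0.45

labels = {
    gamma: classify(m_dist[gamma] / u_dist[gamma], T_M, T_U)
    for gamma in m_dist
}
for gamma, label in sorted(labels.items(), reverse=True):
    print(gamma, round(m_dist[gamma] / u_dist[gamma], 4), label)
```

Moving the two thresholds apart enlarges the set of possible matches; moving them together trades that set for higher error rates.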

Choosing the threshold values
The Fellegi and Sunter approach depends heavily on the accuracy of the estimates of m(γ) and u(γ).
Misspecification of the model assumptions, lack of information and other problems can reduce the accuracy of the estimates and, as a consequence, increase both the false match and the false non-match errors.
For this reason the appropriate threshold values are often identified mainly through empirical methods.

[Figure: theoretical situation. Density functions of U* and M* plotted against increasing values of the ratio, with the thresholds Tu and Tm and the two error regions marked.]

Estimation of m() and u( ): the mixture model approach Armstrong and Mayda (1995) assume that the frequency distribution of the observed patterns is a mixture of the distributions of the matches m() and non-matches u() where p=P(M) is the match prior probability EM algorithm for the estimation

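Latent Class Analysis
The joint distribution of the observations $\gamma$ and the latent variable $C = c$, $c \in \{0, 1\}$, is given by:
$$P(\gamma, c) = [p\, m(\gamma)]^{c}\, [(1 - p)\, u(\gamma)]^{1 - c}$$
The likelihood function for $m_k(\gamma_k)$, $u_k(\gamma_k)$ ($k = 1, \dots, K$) and $p$, where $m_k$ and $u_k$ are the components of $m$ and $u$ for the k-th matching variable, is given by:
$$L = \prod_{(a,b) \in \Omega} [p\, m(\gamma_{ab})]^{c_{ab}}\, [(1 - p)\, u(\gamma_{ab})]^{1 - c_{ab}}$$
The parameters are estimated with the EM algorithm.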

Latent Class Analysis and model fitting
Under the local independence assumption:
$$m(\gamma) = \prod_{k=1}^{K} m_k(\gamma_k), \qquad u(\gamma) = \prod_{k=1}^{K} u_k(\gamma_k)$$
Warning: the local independence assumption is often not satisfied.
Some authors, Winkler (1989) and Thibaudeau (1989), introduce suitable constraints on the parameters of the latent class model in order to partially overcome the local independence assumption.
The aim of this work is to study the relationship between model fitting and the evaluation of linkage errors.
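The EM estimation of the two-class model under local independence can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary comparison vectors, works on pattern frequencies, and is run on synthetic expected counts generated from invented parameters (`m_true`, `u_true`, `p_true`); the starting values and the number of iterations are also arbitrary choices.

```python
from itertools import product

EPS = 1e-9  # keeps the estimates strictly inside (0, 1)


def _clamp(x):
    return min(max(x, EPS), 1.0 - EPS)


def pattern_prob(gamma, probs):
    """P(gamma | class) under local independence, probs[k] = P(gamma_k = 1 | class)."""
    out = 1.0
    for g, q in zip(gamma, probs):
        out *= q if g else 1.0 - q
    return out


def em_latent_class(counts, m, u, p, n_iter=200):
    """EM for a two-class latent class model on pattern frequencies `counts`."""
    total = sum(counts.values())
    n_vars = len(m)
    for _ in range(n_iter):
        # E-step: posterior probability that a pair with pattern gamma is a match.
        post = {}
        for gamma in counts:
            pm = p * pattern_prob(gamma, m)
            pu = (1.0 - p) * pattern_prob(gamma, u)
            post[gamma] = pm / (pm + pu)
        # M-step: weighted agreement frequencies within each class.
        weight_m = sum(n * post[g] for g, n in counts.items())
        weight_u = total - weight_m
        m = [_clamp(sum(n * post[g] * g[k] for g, n in counts.items()) / weight_m)
             for k in range(n_vars)]
        u = [_clamp(sum(n * (1.0 - post[g]) * g[k] for g, n in counts.items()) / weight_u)
             for k in range(n_vars)]
        p = _clamp(weight_m / total)
    return m, u, p


# Synthetic example: expected pattern counts from invented parameters.
m_true = [0.95, 0.90, 0.92, 0.97]
u_true = [0.05, 0.10, 0.08, 0.03]
p_true = 0.2
counts = {
    g: 10000 * (p_true * pattern_prob(g, m_true)
                + (1 - p_true) * pattern_prob(g, u_true))
    for g in product((0, 1), repeat=4)
}

m_hat, u_hat, p_hat = em_latent_class(counts, [0.8] * 4, [0.2] * 4, 0.1)
print([round(x, 3) for x in m_hat], [round(x, 3) for x in u_hat], round(p_hat, 3))
```

The synthetic data satisfy local independence by construction. When the assumption fails on real data, the same procedure still returns estimates, which is why the fit needs to be checked against the observed linkage errors.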

Case study: the data
From the 2001 Italian Post Enumeration Census Survey (PES)
• The true linkage status of all candidate pairs is known, thanks to the accuracy of the matching procedures adopted when estimating the census coverage rate through a capture-recapture model
• File A from the Census and file B from the PES, of about 650 records each
• 4 matching variables: Name, Surname, Day and Year of Birth. Blocking on Month of Birth
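Blocking on month of birth restricts the candidate pairs to records that share the same month. A minimal sketch of the idea, assuming each record is a dictionary with a hypothetical `month` key (the toy values are invented):

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, key="month"):
    """Return the index pairs (i, j) whose records share the same blocking key."""
    blocks = defaultdict(list)
    for j, b in enumerate(file_b):
        blocks[b[key]].append(j)
    return [(i, j) for i, a in enumerate(file_a) for j in blocks.get(a[key], [])]


# Toy files: only the blocking variable is shown.
file_a = [{"month": 1}, {"month": 5}, {"month": 5}]
file_b = [{"month": 5}, {"month": 1}, {"month": 7}, {"month": 5}]

pairs = candidate_pairs(file_a, file_b)
print(len(pairs), "candidate pairs instead of", len(file_a) * len(file_b))
```

The price of the smaller comparison space is that true matches with a discordant month of birth can never be linked.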

Results under local independence assumption
The probabilities P(X_k|M), P(X_k|U) and P(M) are computed for each of the 4 matching variables by means of the EM algorithm under the local independence assumption.

X               P(X=1|M)   P(X=1|U)
Surname         0.9853     0.0023
Name            0.9650     0.0074
Day of birth    0.9825     0.0327
Year of birth   0.9889     0.0127

P(M) = 0.0013
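To see what these estimates imply for individual agreement patterns, the likelihood ratio and the posterior match probability can be computed directly from the table. The sketch below assumes local independence and the variable order Surname, Name, Day of birth, Year of birth; the function names are illustrative only.

```python
# Estimates reported in the table (order: Surname, Name, Day of birth, Year of birth).
M = (0.9853, 0.9650, 0.9825, 0.9889)
U = (0.0023, 0.0074, 0.0327, 0.0127)
P_MATCH = 0.0013


def likelihood_ratio(gamma, m=M, u=U):
    """m(gamma) / u(gamma) under local independence."""
    ratio = 1.0
    for g, mk, uk in zip(gamma, m, u):
        ratio *= (mk / uk) if g else ((1.0 - mk) / (1.0 - uk))
    return ratio


def posterior_match(gamma, p=P_MATCH):
    """P(M | gamma) by Bayes' rule from the mixture p*m + (1-p)*u."""
    r = likelihood_ratio(gamma)
    return p * r / (p * r + (1.0 - p))


for gamma in [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (0, 0, 0, 0)]:
    print(gamma, f"ratio={likelihood_ratio(gamma):.4g}",
          f"posterior={posterior_match(gamma):.4f}")
```

Even with a prior as small as P(M) = 0.0013, full agreement gives a posterior match probability very close to 1, and a single disagreement on Name still leaves it above 0.9 under these estimates.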

Results under local independence assumption
[Figure: distributions of U* and M* with the threshold Tm marked.]
Only one threshold is fixed, Tm = 1, corresponding to an expected false match rate FMR = 0.001. The resulting expected false non-match rate is FNMR = 0.0001.

Results under local independence assumption
The linkage results are “appreciable”, but the linkage errors are not well estimated: the model underestimates both rates, by factors of 17 and 100 respectively.
• Observed FMR = 0.017 vs the expected 0.001
• Observed FNMR = 0.010 vs the expected 0.0001
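When the true linkage status is known, as in this case study, the observed error rates can be computed by comparing the declared matches with the true ones. The sketch below is only illustrative: the slides do not state the denominators, so it assumes that FMR is the share of declared matches that are false and that FNMR is the share of true matches that are missed. The pairs are invented.

```python
def observed_error_rates(declared, true_matches):
    """Return (FMR, FNMR) under the assumed definitions described above."""
    declared, true_matches = set(declared), set(true_matches)
    false_matches = declared - true_matches   # declared but not true
    missed = true_matches - declared          # true but not declared
    fmr = len(false_matches) / len(declared) if declared else 0.0
    fnmr = len(missed) / len(true_matches) if true_matches else 0.0
    return fmr, fnmr


# Invented example: pairs are (index in A, index in B).
declared = {(0, 0), (1, 1), (2, 3), (4, 4)}
true_matches = {(0, 0), (1, 1), (2, 2), (4, 4), (5, 5)}

fmr, fnmr = observed_error_rates(declared, true_matches)
print(f"observed FMR = {fmr:.2f}, observed FNMR = {fnmr:.2f}")
```

Other denominators are in use (for example, false matches over all true non-matches), and the choice changes the magnitude of the rates, so it should be fixed before comparing observed and expected values.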

Results using deterministic approach

X               P(X=1|M)   P(X=1|U)
Surname         0.9853     0.0023
Name            0.9650     0.0074
Day of birth    0.9825     0.0327
Year of birth   0.9889     0.0127

• 1st merge: (1,1,1,1)
• 2nd merge, on the residuals of the 1st merge: (1,0,1,1)
• 3rd merge, on the residuals: (1,1,0,1)
Observed FMR = 0.005
Observed FNMR = 0.06
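The sequence of merges on residuals can be sketched as follows. This is an illustration under stated assumptions: the comparison vectors are already available as a dictionary keyed by pair, records are linked one-to-one, and within each step pairs are scanned in sorted order; the slides do not describe how ties within a step are resolved. The toy pairs are invented.

```python
def sequential_merge(gammas, patterns):
    """Apply agreement patterns in order; each step uses only still-unlinked records."""
    linked_a, linked_b, links = set(), set(), []
    for step, pattern in enumerate(patterns, start=1):
        for (i, j), gamma in sorted(gammas.items()):
            if gamma == pattern and i not in linked_a and j not in linked_b:
                links.append((i, j, step))
                linked_a.add(i)
                linked_b.add(j)
    return links


# Patterns in the order used on the slide (Surname, Name, Day, Year).
PATTERNS = [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1)]

# Invented comparison vectors for a handful of candidate pairs.
gammas = {
    (0, 0): (1, 1, 1, 1),
    (1, 1): (1, 0, 1, 1),
    (1, 2): (1, 1, 0, 1),
    (2, 2): (1, 1, 0, 1),
    (3, 0): (1, 0, 1, 1),
}

links = sequential_merge(gammas, PATTERNS)
print(links)
```

Pair (3, 0) is skipped in the second step because record 0 of file B was already linked in the first, which is what working on the residuals means here.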

Relaxing the conditional independence assumption
Interactions between the matching variables, conditional on the latent variable, are introduced into the model.

The true match distributions

Further analyses
Improving model fitting:
• Distinguish between missing values and disagreement
• Explore further the models based on categorical and/or continuous comparisons (Winkler, 2001)
• Study the validity of the local independence assumption
• Perturb real data to introduce associated errors, in order to establish the relationship among model fitting, thresholds, linkage results and linkage errors
