The relationship between error rates and parameter estimation in the probabilistic record linkage context
Tiziana Tuoto, Nicoletta Cibella, Marco Fortini
Italian National Statistical Institute (ISTAT), Rome, Italy
Outline
• Record linkage and Quality
• Record linkage as a statistical problem
• Aim of the work: evaluate the linkage errors
• Results on a case study
• Further work
Record linkage at Quality2008
The purpose of record linkage is to identify, quickly and accurately, records that refer to the same real-world entity, which may be represented differently in different data sources.
Widespread applications in official statistics:
• creation, updating and de-duplication of frames
• estimation of population size by capture-recapture models
• checks on the confidentiality of public-use microdata
Record linkage procedures substantially improve the quality and quantity of the available information.
Warning: when using data that come from a linkage, the quality of the linkage procedure must be taken into account.
Linkage Framework
[Diagram: Preparatory activities · Linkage Method adjustments · Evaluation · Usage]
Record linkage: formalization (1)
Two data sets A and B, of size N_A and N_B respectively.
Consider Ω = {(a, b), a ∈ A and b ∈ B}, of size N = N_A × N_B.
The problem: classify the pairs in Ω into two mutually exclusive subsets M and U:
• M is the set of matches (a = b)
• U is the set of non-matches (a ≠ b)
To classify the pairs, common identifiers (matching variables) are selected.
For each pair a comparison vector γ is defined. For example:
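As a minimal sketch of how a comparison vector can be built, the Python snippet below compares every pair in the cross product of two toy files and records 1 for agreement and 0 for disagreement on each matching variable. The records, the field names and the helper `comparison_vector` are hypothetical and only serve the illustration. Binary agreement is one possible comparison function among many.

```python
from itertools import product

# Hypothetical toy records; the field names are assumptions for illustration.
file_a = [
    {"id": "a1", "surname": "ROSSI", "name": "MARIA", "day": 12, "year": 1970},
    {"id": "a2", "surname": "BIANCHI", "name": "LUCA", "day": 3, "year": 1985},
]
file_b = [
    {"id": "b1", "surname": "ROSSI", "name": "MARIA", "day": 12, "year": 1970},
    {"id": "b2", "surname": "BIANCHI", "name": "LUCIA", "day": 3, "year": 1985},
    {"id": "b3", "surname": "VERDI", "name": "ANNA", "day": 12, "year": 1962},
]
MATCHING_VARIABLES = ("surname", "name", "day", "year")


def comparison_vector(a, b, variables=MATCHING_VARIABLES):
    """Return a tuple of 0/1 flags: 1 where records a and b agree on a variable."""
    return tuple(int(a[v] == b[v]) for v in variables)


# Omega is the full cross product A x B, so it holds N_A * N_B pairs.
omega = {
    (a["id"], b["id"]): comparison_vector(a, b)
    for a, b in product(file_a, file_b)
}

for pair, gamma in omega.items():
    print(pair, gamma)
```

In practice the cross product is usually reduced by blocking before the comparison step, since N_A × N_B grows quickly with the file sizes.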
Record linkage: formalization (2)
The ratio of the distributions of the comparison vector γ in the M and U subsets, r = m(γ)/u(γ), is used to classify the pairs.
The classification criterion is based on two thresholds T_m and T_u (T_m > T_u).
The thresholds are chosen so that the false match rate, FMR, and the false non-match rate, FNMR, are minimized.
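The two-threshold rule can be sketched in Python as follows. In the usual Fellegi and Sunter decision rule, pairs whose ratio is at least T_m are declared matches, pairs whose ratio is at most T_u are declared non-matches, and pairs in between are left as possible matches for clerical review. The m and u values and the thresholds below are invented for the illustration and do not come from the case study.

```python
def classify(ratio, t_m, t_u):
    """Classify one pair from its ratio m(gamma)/u(gamma) and two thresholds."""
    if t_m < t_u:
        raise ValueError("t_m must not be smaller than t_u")
    if ratio >= t_m:
        return "match"
    if ratio <= t_u:
        return "non-match"
    return "possible match"


# Hypothetical distributions of a two-variable comparison vector (assumed values).
m = {(1, 1): 0.90, (1, 0): 0.05, (0, 1): 0.04, (0, 0): 0.01}
u = {(1, 1): 0.001, (1, 0): 0.049, (0, 1): 0.05, (0, 0): 0.90}

T_M, T_U = 100.0, 0.9  # illustrative thresholds, not estimated from data

decisions = {gamma: classify(m[gamma] / u[gamma], T_M, T_U) for gamma in m}
for gamma, label in decisions.items():
    print(gamma, round(m[gamma] / u[gamma], 4), label)
```

Setting T_m equal to T_u removes the clerical-review region, so every pair is declared either a match or a non-match.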
Choosing the threshold values
The Fellegi and Sunter approach depends heavily on the accuracy of the m(·) and u(·) estimates.
Misspecification of the model assumptions, lack of information and other problems can reduce the accuracy of the estimates and, as a consequence, increase both the false match and the false non-match errors.
For this reason the appropriate threshold values are often identified mainly through empirical methods.
[Figure: theoretical situation. Density functions of U* and M* plotted against increasing values of the ratio, with the thresholds T_u and T_m and the two error regions.]
Estimation of m(·) and u(·): the mixture model approach
Armstrong and Mayda (1995) assume that the frequency distribution of the observed comparison patterns γ is a mixture of the distributions of the matches, m(γ), and of the non-matches, u(γ):
P(γ) = p·m(γ) + (1 − p)·u(γ)
where p = P(M) is the prior probability of a match.
The parameters are estimated with the EM algorithm.
Latent Class Analysis
• The unobserved match status is treated as a latent variable C = c, with c ∈ {0, 1}, and the model specifies the joint distribution of the observations and C.
• The likelihood is a function of m_k(·), u_k(·) (k = 1, …, K) and p.
• The parameters are estimated with the EM algorithm.
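One common way to write the joint distribution and the likelihood in a two-class latent class model is sketched below. The notation is an assumption: γ_j denotes the comparison vector of pair j (j = 1, …, N), c_j is its latent class indicator, and m(·), u(·) and p are the match distribution, the non-match distribution and the match prior probability. This is the complete-data form on which the EM algorithm operates, and it is not necessarily the exact expression used by the authors.

```latex
% Assumed notation: \gamma_j = comparison vector of pair j, c_j \in \{0, 1\} = latent class of pair j
P(\gamma_j, C = c_j) = \bigl[p\, m(\gamma_j)\bigr]^{c_j} \bigl[(1 - p)\, u(\gamma_j)\bigr]^{1 - c_j}

% Complete-data likelihood over the N pairs
L(m, u, p) = \prod_{j=1}^{N} \bigl[p\, m(\gamma_j)\bigr]^{c_j} \bigl[(1 - p)\, u(\gamma_j)\bigr]^{1 - c_j}

% Posterior probability of the match class, used in the E-step
P(C = 1 \mid \gamma_j) = \frac{p\, m(\gamma_j)}{p\, m(\gamma_j) + (1 - p)\, u(\gamma_j)}
```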
Latent Class Analysis and model fitting
Under the local independence assumption, the matching variables are independent given the latent class: m(γ) = ∏_k m_k(γ_k) and u(γ) = ∏_k u_k(γ_k).
Warning: the local independence assumption is often not satisfied.
Some authors (Winkler, 1989; Thibaudeau, 1989) introduce suitable constraints on the parameters of the latent class models in order to partially overcome the local independence assumption.
The aim of this work is to study the relationship between model fitting and the evaluation of linkage errors.
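A minimal sketch of EM estimation for a two-class latent class model with binary comparisons under local independence is given below. The function names, the starting values, the simulated data and the true parameters used in the simulation are all assumptions made for the illustration; they are not the authors' implementation or data. The E-step computes the posterior match probability of each distinct pattern, and the M-step updates p, m_k and u_k as weighted frequencies.

```python
import random
from collections import Counter


def em_latent_class(patterns, n_iter=200, p=0.1, m_start=0.9, u_start=0.1):
    """Estimate p, m_k = P(X_k=1|M) and u_k = P(X_k=1|U) from 0/1 patterns."""
    counts = Counter(patterns)          # frequency of each distinct pattern
    n = sum(counts.values())
    k = len(next(iter(counts)))
    m, u = [m_start] * k, [u_start] * k
    eps = 1e-9

    def clamp(value):
        # Keep probabilities strictly inside (0, 1) to avoid degenerate products.
        return min(max(value, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability of the match class for each pattern.
        post = {}
        for g in counts:
            pm, pu = p, 1.0 - p
            for x, mk, uk in zip(g, m, u):
                pm *= mk if x else 1.0 - mk
                pu *= uk if x else 1.0 - uk
            post[g] = pm / (pm + pu)
        # M-step: weighted relative frequencies.
        exp_m = max(sum(counts[g] * post[g] for g in counts), eps)
        exp_u = max(n - exp_m, eps)
        p = clamp(exp_m / n)
        m = [clamp(sum(counts[g] * post[g] * g[i] for g in counts) / exp_m)
             for i in range(k)]
        u = [clamp(sum(counts[g] * (1.0 - post[g]) * g[i] for g in counts) / exp_u)
             for i in range(k)]
    return p, m, u


def simulate(n, p, m, u, seed=0):
    """Draw n patterns from the mixture with locally independent variables."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        data.append(tuple(int(rng.random() < q) for q in probs))
    return data


# Hypothetical true parameters, chosen only to exercise the estimator.
patterns = simulate(5000, p=0.2, m=[0.95, 0.90, 0.92], u=[0.05, 0.10, 0.08])
p_hat, m_hat, u_hat = em_latent_class(patterns)
print(round(p_hat, 3), [round(v, 3) for v in m_hat], [round(v, 3) for v in u_hat])
```

The simulated data satisfy local independence by construction, so the estimates are close to the true values. When the assumption fails on real data, the same procedure still converges, but the resulting m_k and u_k, and the error rates derived from them, can be misleading.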
Case study: the data
From the 2001 Italian Post Enumeration Census Survey (PES)
• The true linkage status of all candidate pairs is known, thanks to the accuracy of the matching procedures adopted when estimating the Census coverage rate through a capture-recapture model
• File A from the Census and file B from the PES, of about 650 records each
• 4 matching variables: Name, Surname, Day and Year of Birth. Blocking on Month of Birth
Results under local independence assumption
The probabilities P(X_k | M) and P(X_k | U) for each of the 4 matching variables, together with P(M), are estimated by means of the EM algorithm under the local independence assumption.

X               P(X=1|M)   P(X=1|U)
Surname         0.9853     0.0023
Name            0.9650     0.0074
Day of birth    0.9825     0.0327
Year of birth   0.9889     0.0127

P(M) = 0.0013
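The estimates above can be turned into a likelihood ratio and a posterior match probability for any agreement pattern, as in the following sketch. The probabilities are the ones reported in the table; the variable order inside a pattern (Surname, Name, Day of birth, Year of birth) and the function names are assumptions. The computation relies on local independence, so the ratio is a product of per-variable terms.

```python
# Estimates reported for the case study under local independence.
P_M = 0.0013
M_PROBS = {"surname": 0.9853, "name": 0.9650, "day": 0.9825, "year": 0.9889}
U_PROBS = {"surname": 0.0023, "name": 0.0074, "day": 0.0327, "year": 0.0127}
# Assumed position of each variable inside an agreement pattern.
ORDER = ("surname", "name", "day", "year")


def likelihood_ratio(gamma):
    """Return m(gamma)/u(gamma) as a product of per-variable ratios."""
    ratio = 1.0
    for x, var in zip(gamma, ORDER):
        m, u = M_PROBS[var], U_PROBS[var]
        ratio *= (m / u) if x else ((1.0 - m) / (1.0 - u))
    return ratio


def posterior_match_probability(gamma, p=P_M):
    """Return P(M | gamma) from the prior p and the likelihood ratio."""
    r = likelihood_ratio(gamma)
    return p * r / (p * r + (1.0 - p))


for gamma in [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (0, 0, 0, 0)]:
    print(gamma, likelihood_ratio(gamma), posterior_match_probability(gamma))
```

Because P(M) is very small, only patterns with a large likelihood ratio reach a high posterior probability, which is why small errors in the estimated m_k and u_k can shift the expected error rates noticeably.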
Results under local independence assumption
[Figure: distributions U* and M* with the threshold T_m]
Only one threshold is fixed, T_m = 1, corresponding to an expected false match rate FMR = 0.001. The resulting expected false non-match rate is FNMR = 0.0001.
Results under local independence assumption
The linkage results are “appreciable”, but the linkage errors are not well estimated:
• Observed FMR = 0.017 vs the expected 0.001 (17 times larger)
• Observed FNMR = 0.010 vs the expected 0.0001 (100 times larger)
Results using deterministic approach
• 1st merge: agreement pattern (1,1,1,1)
• 2nd merge, on the residuals of the 1st merge: agreement pattern (1,0,1,1)
• 3rd merge, on the remaining residuals: agreement pattern (1,1,0,1)
Observed FMR = 0.005
Observed FNMR = 0.06
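The sequence of merges can be sketched as a hierarchical deterministic rule: each step links the pairs showing a given agreement pattern, and the next step works only on the records left unlinked. The toy pairs, the function name and the one-to-one handling of residuals are assumptions made for the illustration, not a description of the procedure actually run on the case study.

```python
# Agreement patterns applied in sequence, as listed for the three merges.
MERGE_PATTERNS = [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1)]


def deterministic_link(pairs, merge_patterns=MERGE_PATTERNS):
    """Link pairs step by step; later steps use only still-unlinked records.

    pairs maps (a_id, b_id) to the comparison vector of that pair.
    Returns a list of (a_id, b_id, step) tuples.
    """
    linked_a, linked_b = set(), set()
    links = []
    for step, pattern in enumerate(merge_patterns, start=1):
        # Sorting makes the result reproducible when several pairs compete.
        for (a_id, b_id), gamma in sorted(pairs.items()):
            if gamma != pattern:
                continue
            if a_id in linked_a or b_id in linked_b:
                continue  # record already linked: not among the residuals
            links.append((a_id, b_id, step))
            linked_a.add(a_id)
            linked_b.add(b_id)
    return links


# Hypothetical candidate pairs with their comparison vectors.
toy_pairs = {
    ("a1", "b1"): (1, 1, 1, 1),
    ("a1", "b2"): (1, 0, 1, 1),
    ("a2", "b2"): (1, 0, 1, 1),
    ("a3", "b1"): (1, 1, 0, 1),
    ("a3", "b3"): (1, 1, 0, 1),
    ("a4", "b4"): (0, 1, 1, 1),
}
links = deterministic_link(toy_pairs)
print(links)
```

The pair ("a4", "b4") stays unlinked because its pattern is not among the three rules, which illustrates how a deterministic approach trades a lower false match rate for more missed matches.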
Relaxing the conditional independence assumption
Interaction terms between matching variables, conditional on the latent variable, are introduced into the model.
Further analyses
Improving model fitting:
• Distinguish between missing values and disagreement
• Explore models based on categorical and/or continuous comparisons (Winkler, 2001)
• Study the validity of the local independence assumption
• Perturb real data to introduce associated errors, in order to establish the relationship among model fitting, thresholds, linkage results and linkage errors