
Tiziana Tuoto, Nicoletta Cibella, Marco Fortini Italian National Statistical Institute

The relationship between error rates and parameter estimation in the probabilistic record linkage context. Tiziana Tuoto, Nicoletta Cibella, Marco Fortini Italian National Statistical Institute ISTAT – Rome, Italy. Outline. Record linkage and Quality Record linkage as a statistical problem




  1. The relationship between error rates and parameter estimation in the probabilistic record linkage context. Tiziana Tuoto, Nicoletta Cibella, Marco Fortini, Italian National Statistical Institute ISTAT – Rome, Italy

  2. Outline • Record linkage and Quality • Record linkage as a statistical problem • Aim of the work: evaluate the linkage errors • Results on a case study • Further work

  3. Record linkage at Quality2008 The purpose of record linkage is to identify, quickly and accurately, the same real-world entity when it is represented differently across data sources. Widespread applications in official statistics: • creation, updating and de-duplication of frames • estimation of population size through capture-recapture models • checking the confidentiality of public-use microdata. Record linkage procedures substantially improve the quality and quantity of the available information. Warning: anyone using linked data must take the quality of the linkage procedure into account.

  4. Linkage Framework • Preparatory activities • Linkage • Method adjustments • Evaluation • Usage

  5. Record linkage: formalization (1) Two data sets A and B, of size NA and NB respectively. Consider Ω = {(a,b), a∈A and b∈B}, of size N = NA×NB. The problem: classify the pairs in Ω into two mutually exclusive subsets M and U: M is the set of matches (a=b), U is the set of non-matches (a≠b). To classify the pairs, common identifiers (matching variables) are selected. For each pair a comparison vector γ is defined. For example:
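A minimal sketch of how such a comparison vector could be built, assuming a binary agree/disagree outcome on each matching variable. The record layout and the field names (`surname`, `name`, `day`, `year`) are hypothetical and only serve the illustration.

```python
def comparison_vector(rec_a, rec_b, keys):
    """Return a tuple with 1 where the two records agree on a key, 0 otherwise."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in keys)


# Hypothetical records: field names and values are illustrative only.
KEYS = ("surname", "name", "day", "year")
a = {"surname": "Rossi", "name": "Maria", "day": 12, "year": 1970}
b = {"surname": "Rossi", "name": "Mara", "day": 12, "year": 1970}

gamma = comparison_vector(a, b, KEYS)
print(gamma)
```

Exact equality is the simplest possible comparison; it treats a missing value like any other disagreement.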

  6. Record linkage: formalization (2) The ratio of the distributions of γ in the M and U subsets, m(γ)/u(γ), is used to classify the pairs. The classification criterion is based on two thresholds Tm and Tu (Tm>Tu). The thresholds are chosen so that the false match rate, FMR, and the false non-match rate, FNMR, are minimized.
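The two-threshold criterion can be sketched as a small decision function. This follows the usual Fellegi and Sunter reading, in which pairs falling between the thresholds are set aside as possible matches; the function name, the labels and the handling of ties at the thresholds are assumptions of this sketch.

```python
def classify_pair(ratio, t_m, t_u):
    """Assign a pair to a decision class from its ratio m(gamma)/u(gamma)."""
    if t_m <= t_u:
        raise ValueError("expected t_m > t_u")
    if ratio >= t_m:
        return "match"
    if ratio <= t_u:
        return "non-match"
    # Between the two thresholds the pair is left undecided (clerical review).
    return "possible match"


# Illustrative ratios and thresholds, not values from the case study.
decisions = [classify_pair(r, t_m=100.0, t_u=0.5) for r in (2500.0, 3.0, 0.01)]
print(decisions)
```

When a single threshold is used, as in the case study later in the deck, the middle class disappears and every pair is declared either a match or a non-match.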

  7. Choosing the threshold values The Fellegi and Sunter approach depends heavily on the accuracy of the m(γ) and u(γ) estimates. Misspecified model assumptions, lack of information and other problems can reduce the accuracy of the estimates and, as a consequence, increase both false match and false non-match errors. For this reason the appropriate threshold values are often identified mainly through empirical methods.

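  8. [Figure: theoretical situation. Density functions of the U* and M* distributions plotted against increasing values of the ratio, with the thresholds Tu and Tm and the two error regions marked.]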

  9. Estimation of m() and u( ): the mixture model approach Armstrong and Mayda (1995) assume that the frequency distribution of the observed patterns is a mixture of the distributions of the matches m() and non-matches u() where p=P(M) is the match prior probability EM algorithm for the estimation

  10. Latent Class Analysis The joint distribution of the observations γ and the latent variable C=c (c∈{0,1}) is given by: The likelihood function for mk(γk), uk(γk) (k=1,…,K) and p is given by: EM algorithm for the parameter estimation
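In the standard two-class latent class formulation, the two expressions can be sketched as follows. This is an assumed textbook form, not a transcription of the slide: $\gamma_j$ denotes the comparison vector of pair $j$, $N$ the number of pairs, and $c=1$ marks a match. The functions $m$ and $u$ are expressed through the component probabilities $m_k$, $u_k$ once a structure for them is assumed.

```latex
P(\gamma, C = c) = \bigl[p\, m(\gamma)\bigr]^{c}\,\bigl[(1-p)\, u(\gamma)\bigr]^{1-c},
\qquad c \in \{0, 1\}

L(m, u, p) = \prod_{j=1}^{N} \bigl[p\, m(\gamma_j) + (1-p)\, u(\gamma_j)\bigr]
```

Summing the joint distribution over $c$ gives the mixture of the matches and non-matches distributions weighted by $p$ and $1-p$.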

  11. Latent Class Analysis and model fitting Under the local independence assumption: m(γ) = ∏k mk(γk), u(γ) = ∏k uk(γk). Warning: the local independence assumption is often not satisfied. Some authors (Winkler, 1989; Thibaudeau, 1989) introduce suitable constraints on the parameters of the latent class models in order to partially overcome the local independence assumption. The aim of this work is to study the relationship between model fitting and the evaluation of linkage errors.
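A minimal sketch of EM estimation for the two-class model with binary comparisons under local independence. It is an illustration under stated assumptions, not the authors' implementation: the starting values, the fixed number of iterations and the synthetic parameters used for the check are all made up.

```python
from itertools import product


def pattern_prob(gamma, probs):
    """Probability of a binary pattern when its components are independent."""
    out = 1.0
    for x, q in zip(gamma, probs):
        out *= q if x else 1.0 - q
    return out


def em_two_class(patterns, counts, p=0.1, m0=0.8, u0=0.2, n_iter=500, eps=1e-9):
    """EM for a two-class mixture of independent binary comparisons.

    patterns: list of 0/1 tuples; counts: how often each pattern was observed.
    The starting values p, m0, u0 are arbitrary choices for this sketch.
    """
    k_vars = len(patterns[0])
    m = [m0] * k_vars
    u = [u0] * k_vars
    total = float(sum(counts))
    for _ in range(n_iter):
        # E-step: posterior probability that each pattern belongs to M.
        post = []
        for g in patterns:
            pm = p * pattern_prob(g, m)
            pu = (1.0 - p) * pattern_prob(g, u)
            post.append(pm / (pm + pu))
        # M-step: weighted relative frequencies of agreement in each class.
        w_m = sum(c * w for c, w in zip(counts, post))
        w_u = total - w_m
        p = w_m / total
        for k in range(k_vars):
            agree_m = sum(c * w * g[k] for g, c, w in zip(patterns, counts, post))
            agree_u = sum(c * (1.0 - w) * g[k] for g, c, w in zip(patterns, counts, post))
            # Clamp away from 0 and 1 to keep the next E-step well defined.
            m[k] = min(max(agree_m / w_m, eps), 1.0 - eps)
            u[k] = min(max(agree_u / w_u, eps), 1.0 - eps)
    return p, m, u


# Synthetic check: expected pattern counts generated from made-up parameters.
true_p, true_m, true_u = 0.2, [0.95, 0.90, 0.85, 0.90], [0.05, 0.10, 0.20, 0.10]
patterns = list(product((0, 1), repeat=4))
counts = [
    10000 * (true_p * pattern_prob(g, true_m) + (1 - true_p) * pattern_prob(g, true_u))
    for g in patterns
]
p_hat, m_hat, u_hat = em_two_class(patterns, counts)
print(round(p_hat, 3), [round(x, 3) for x in m_hat], [round(x, 3) for x in u_hat])
```

The synthetic data satisfy local independence exactly, so the check says nothing about how the estimates behave when that assumption fails, which is the question the slides go on to study.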

  12. Case study: the data From the 2001 Italian Post Enumeration Census Survey • The true linkage status of all candidate pairs is known, thanks to the accuracy of the matching procedures adopted when estimating the Census coverage rate through a capture-recapture model • File A from the Census and file B from the PES, of about 650 records each • 4 matching variables: Name, Surname, Day and Year of Birth. Blocking on Month of Birth
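Blocking restricts the comparison to pairs that share the blocking value, here the month of birth. A minimal sketch, assuming records are dictionaries; the field names `id` and `month` and the toy files are hypothetical.

```python
from collections import defaultdict


def blocked_pairs(file_a, file_b, block_key):
    """Yield only the pairs (a, b) whose records share the blocking value."""
    index = defaultdict(list)
    for rec in file_b:
        index[rec[block_key]].append(rec)
    for rec_a in file_a:
        for rec_b in index.get(rec_a[block_key], []):
            yield rec_a, rec_b


# Hypothetical toy files; "id" and "month" are illustrative field names.
file_a = [{"id": "a1", "month": 3}, {"id": "a2", "month": 7}]
file_b = [{"id": "b1", "month": 3}, {"id": "b2", "month": 3}, {"id": "b3", "month": 11}]

pairs = [(a["id"], b["id"]) for a, b in blocked_pairs(file_a, file_b, "month")]
print(pairs)
```

Indexing one file by the blocking value avoids enumerating the full cross product of the two files.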

  13. Results under local independence assumption The probabilities P(Xk=1|M), P(Xk=1|U) and P(M) are computed for each of the 4 matching variables by means of the EM algorithm under the local independence assumption. P(M)=0.0013
      X               P(X=1|M)   P(X=1|U)
      Surname         0.9853     0.0023
      Name            0.9650     0.0074
      Day of birth    0.9825     0.0327
      Year of birth   0.9889     0.0127
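As a worked illustration, the estimates above can be combined under local independence into the ratio m(γ)/u(γ) and the posterior match probability of a given agreement pattern. The ordering of the variables inside a pattern (Surname, Name, Day of birth, Year of birth) and the function names are assumptions of this sketch.

```python
# Estimates reported on the slide (order: Surname, Name, Day of birth, Year of birth).
M_PROBS = (0.9853, 0.9650, 0.9825, 0.9889)  # P(X=1|M)
U_PROBS = (0.0023, 0.0074, 0.0327, 0.0127)  # P(X=1|U)
P_MATCH = 0.0013                            # P(M)


def pattern_prob(gamma, probs):
    """Probability of a binary pattern when its components are independent."""
    out = 1.0
    for x, q in zip(gamma, probs):
        out *= q if x else 1.0 - q
    return out


def likelihood_ratio(gamma):
    """Ratio m(gamma)/u(gamma) under local independence."""
    return pattern_prob(gamma, M_PROBS) / pattern_prob(gamma, U_PROBS)


def posterior_match(gamma):
    """Posterior probability of a match given the pattern, by Bayes' rule."""
    num = P_MATCH * pattern_prob(gamma, M_PROBS)
    den = num + (1.0 - P_MATCH) * pattern_prob(gamma, U_PROBS)
    return num / den


for gamma in [(1, 1, 1, 1), (1, 0, 1, 1), (0, 0, 0, 0)]:
    print(gamma, likelihood_ratio(gamma), posterior_match(gamma))
```

Because P(M) is very small, the posterior is high only for patterns whose ratio is very large.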

  14. Results under local independence assumption Fix only one threshold, Tm=1, corresponding to an expected false match rate FMR = 0.001. The resulting expected false non-match rate is FNMR = 0.0001. [Figure: U* and M* distributions with the threshold Tm marked.]

  15. Results under local independence assumption The linkage results are “appreciable” but the linkage errors are underestimated: • Observed FMR=0.017 vs the expected 0.001 • Observed FNMR=0.010 vs the expected 0.0001
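When the true status of every pair is known, as in this case study, observed error rates can be computed by direct counting. The slides do not spell out the denominators, so this sketch assumes FMR is the share of declared matches that are false and FNMR is the share of true matches that are missed; other definitions are in use. The data are illustrative only.

```python
def observed_error_rates(true_status, declared):
    """Return (FMR, FNMR) under the definitions assumed in this sketch."""
    false_matches = sum(1 for t, d in zip(true_status, declared) if d and not t)
    missed_matches = sum(1 for t, d in zip(true_status, declared) if t and not d)
    n_declared = sum(1 for d in declared if d)
    n_true = sum(1 for t in true_status if t)
    # Assumed denominators: declared matches for FMR, true matches for FNMR.
    fmr = false_matches / n_declared if n_declared else 0.0
    fnmr = missed_matches / n_true if n_true else 0.0
    return fmr, fnmr


# Illustrative toy data: one flag per candidate pair.
true_status = [True, True, True, False, False, False, False, True]
declared = [True, True, False, True, False, False, False, True]

fmr, fnmr = observed_error_rates(true_status, declared)
print(fmr, fnmr)
```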

  16. Results using deterministic approach • 1st merge: agreement pattern (1,1,1,1) • 2nd merge, on the residuals of the 1st merge: (1,0,1,1) • 3rd merge, on the remaining residuals: (1,1,0,1). Observed FMR=0.005, observed FNMR=0.06
      X               P(X=1|M)   P(X=1|U)
      Surname         0.9853     0.0023
      Name            0.9650     0.0074
      Day of birth    0.9825     0.0327
      Year of birth   0.9889     0.0127
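The sequence of merges on residuals can be sketched as follows. The pair identifiers, the greedy first-come assignment within each merge and the one-to-one constraint are assumptions of this illustration, not details given on the slide.

```python
def sequential_merge(pairs, rules):
    """Link pairs rule by rule, each rule working on records still unlinked.

    pairs: list of (id_a, id_b, gamma); rules: agreement patterns in order.
    """
    used_a, used_b, links = set(), set(), []
    for rule in rules:
        for id_a, id_b, gamma in pairs:
            # Only residual records (not linked by an earlier merge) are eligible.
            if gamma == rule and id_a not in used_a and id_b not in used_b:
                links.append((id_a, id_b, rule))
                used_a.add(id_a)
                used_b.add(id_b)
    return links


# The three patterns from the slide, applied in the stated order.
RULES = [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1)]

# Hypothetical candidate pairs with their comparison vectors.
pairs = [
    ("a1", "b1", (1, 1, 1, 1)),
    ("a1", "b2", (1, 0, 1, 1)),
    ("a2", "b2", (1, 0, 1, 1)),
    ("a3", "b3", (0, 1, 1, 1)),
]
links = sequential_merge(pairs, RULES)
print(links)
```

In the toy data the pair (a1, b2) is skipped in the second merge because a1 was already linked in the first, which is the effect of working on residuals.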

  17. Relaxing the conditional independence assumption Introduce interactions between the matching variables, conditional on the latent variable

  18. The true match distributions

  19. Further analyses Improving model fitting: • Distinguish between missing values and disagreement • Explore models based on categorical and/or continuous comparisons (Winkler, 2001) • Study the validity of the local independence assumption • Perturb real data to introduce associated errors, in order to establish the relationship among model fitting, thresholds, linkage results and linkage errors
