
The Conditional Independence Assumption in Probabilistic Record Linkage Methods

The Conditional Independence Assumption in Probabilistic Record Linkage Methods. Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk. The record linkage problem.


Presentation Transcript


  1. The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk

  2. The record linkage problem
     • Given two files A and B, the aim is to find record pairs which refer to the same person.
     • This is done on the basis of linking fields common to the two files, such as first name, last name, date of birth and postcode.
     • The data matrix therefore looks like

  3. With four linking fields
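
As a rough illustration of such a data matrix, the sketch below builds one row per record pair, holding an agree/disagree indicator for each of four linking fields. The field names, the exact-match comparison rule and the sample records are all assumptions made for illustration; they are not taken from the presentation.

```python
from itertools import product

# Hypothetical linking fields; the presentation names these four as examples.
FIELDS = ("first_name", "last_name", "date_of_birth", "postcode")


def comparison_vector(rec_a, rec_b, fields=FIELDS):
    """Return a tuple of 1 (agree) / 0 (disagree), one entry per linking field."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)


def data_matrix(file_a, file_b, fields=FIELDS):
    """One row per record pair: index in A, index in B, agreement pattern."""
    return [
        (i, j, comparison_vector(a, b, fields))
        for (i, a), (j, b) in product(enumerate(file_a), enumerate(file_b))
    ]


# Made-up records, for illustration only.
file_a = [
    {"first_name": "ann", "last_name": "smith",
     "date_of_birth": "1970-01-02", "postcode": "EH1 1AA"},
    {"first_name": "bob", "last_name": "jones",
     "date_of_birth": "1982-05-17", "postcode": "G1 2BB"},
]
file_b = [
    {"first_name": "ann", "last_name": "smith",
     "date_of_birth": "1970-01-02", "postcode": "EH1 1AB"},
]

rows = data_matrix(file_a, file_b)
for row in rows:
    print(row)
```

With k binary linking fields each row carries one of 2^k possible agreement patterns, which is why the number of fields matters so much for the parameter counts discussed later.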

  4. What is the assumption of conditional independence?
     • The likelihood that the two records refer to the same person is measured by a log likelihood ratio
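
The transcript does not include the formula itself. In the standard notation for probabilistic linkage (an assumption here, not taken from the slide), write \gamma = (\gamma_1, \dots, \gamma_k) for the agreement pattern of a record pair over k linking fields, M for a match and N for a non-match. The log likelihood ratio then has the form:

```latex
W(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)}
```

Large positive values of W(\gamma) point towards a match and large negative values towards a non-match.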

  5. What is the assumption of conditional independence?
     • This is much easier to work out if the observations are independent conditional on match status because now
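
The transcript does not include the formula that follows "because now". A sketch of the standard result, using assumed notation (\gamma_i is the agree/disagree outcome on linking field i of k, M is match, N is non-match): under conditional independence the joint probabilities factorise, so the log likelihood ratio becomes a sum of per-field weights.

```latex
P(\gamma \mid M) = \prod_{i=1}^{k} P(\gamma_i \mid M), \qquad
P(\gamma \mid N) = \prod_{i=1}^{k} P(\gamma_i \mid N)
\quad\Longrightarrow\quad
\log \frac{P(\gamma \mid M)}{P(\gamma \mid N)}
  = \sum_{i=1}^{k} \log \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid N)}
```

Each term in the sum depends on one field only, so each field's weight can be estimated and tabulated separately.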

  6. Why is the assumption of conditional independence important?
     • It keeps the number of parameters manageable: the count grows linearly rather than exponentially with the number of linking fields.
     • It enables the use of frequency-based agreement weights.
     • It speeds up computing time.
     • It improves the stability of parameter estimation.
     • But it is almost always wrong, e.g. gender is almost wholly predictable from first name.
     • But does it matter?
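
The linear versus exponential point can be made concrete with a small count. The sketch below assumes binary agree/disagree outcomes on each field and ignores the overall match proportion; the function names are hypothetical.

```python
def n_params_independent(k):
    """Free parameters under conditional independence: one agreement
    probability per field in each of the two classes (match, non-match)."""
    return 2 * k


def n_params_saturated(k):
    """Free parameters with no independence assumption: each class has a
    full distribution over the 2**k agreement patterns, summing to 1."""
    return 2 * (2 ** k - 1)


for k in (4, 7, 10):
    print(k, n_params_independent(k), n_params_saturated(k))
```

With seven linkage fields, for example, this gives 14 parameters under independence against 254 for the unrestricted model.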

  7. Who adopts the conditional independence assumption?
     • Rec Link (US Census Bureau) – yes
     • Link Plus (US Centers for Disease Control and Prevention) – yes
     • GRLS/Fundy (Statistics Canada) – yes
     • ORLS – yes (probably)
     • RELAIS (Italian Statistical Institute) – no

  8. Two questions
     • To what extent is the assumption violated in real data sets?
     • How much effect does it have on the output of linkage software?

  9. What does the assumption look like in practice?
     A = Agree, D = Disagree; M = Match, N = Non-match
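
One way to picture the assumption for a pair of linking fields: within a single match class (M or N), the 2x2 table of agree/disagree counts should be close to the product of its margins. The sketch below computes those expected counts; the table layout and the counts are made up for illustration and are not figures from the presentation.

```python
def expected_under_independence(table):
    """table = [[AA, AD], [DA, DD]]: counts of Agree/Disagree on field 1
    (rows) against field 2 (columns) within one match class.
    Returns the counts expected if the two fields were independent."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_totals] for r in row_totals]


# Made-up counts for one class, e.g. pairs known to be matches.
observed = [[80, 10],
            [4, 6]]
expected = expected_under_independence(observed)
print(expected)
```

A large gap between observed and expected cells, such as the disagree/disagree cell here, is the kind of departure that a correlation measure then summarises.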

  10. Calculating the correlations between linkage fields
     • Run 1 – Rec Link – a 10% sample of the 2001 Scottish Census and the 2001 Census Coverage Survey (CCS), with one blocking field and seven linkage fields
     • Run 2 – Link Plus – a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales
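
The slides that follow report tetrachoric correlations between pairs of linkage fields. The presentation does not say how these were computed; an exact estimate requires fitting a bivariate normal distribution to the 2x2 table. As a minimal sketch, the code below uses the well-known cosine-pi approximation instead, which should be read as a stand-in rather than the author's method. The function name and cell labels are assumptions.

```python
import math


def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    Cells of the 2x2 agreement table for two linkage fields:
      a = agree/agree,    b = agree/disagree,
      c = disagree/agree, d = disagree/disagree.
    """
    if a * d == 0 and b * c == 0:
        raise ValueError("correlation undefined for this table")
    if b * c == 0:
        return 1.0   # limit as the odds ratio goes to infinity
    if a * d == 0:
        return -1.0  # limit as the odds ratio goes to zero
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))


print(tetrachoric_approx(25, 25, 25, 25))  # independent table
print(tetrachoric_approx(40, 10, 10, 40))  # strong positive association
```

A table that satisfies independence exactly has an odds ratio of 1 and an approximate correlation of zero, so values well away from zero indicate a violation of the assumption for that pair of fields.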

  11. Run 1 - tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only

  12. Run 1 - tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only

  13. Run 2 - tetrachoric correlations for matches in the NHSCR/HESA data – medium linkage scores only

  14. Run 2 - tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only

  15. So the assumption of independence is significantly violated. Does it matter?
     • Runs 3, 4 and 5 all use the Census/CCS data with Link Plus, but treat the date of birth differently.
     • Run 3 – date-specific comparison treating the date as one field (so not assuming independence) but with “intelligence”
     • Run 4 – day, month and year treated as three separate fields (and therefore as independent)
     • Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
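
The difference between the Run 4 and Run 5 treatments can be sketched as two comparison functions. This is an illustration under assumptions: dates are held as hypothetical (day, month, year) tuples and compared by exact match, and the date-specific "intelligence" of Run 3 is not modelled because the presentation does not describe it.

```python
def compare_separate(dob_a, dob_b):
    """Run 4 style: day, month and year as three agree/disagree fields,
    each of which would receive its own weight."""
    return tuple(int(x == y) for x, y in zip(dob_a, dob_b))


def compare_concatenated(dob_a, dob_b):
    """Run 5 style: the whole date as a single agree/disagree field."""
    return (int(dob_a == dob_b),)


# Made-up dates differing only in the year.
dob_a = (2, 1, 1970)
dob_b = (2, 1, 1971)

print(compare_separate(dob_a, dob_b))
print(compare_concatenated(dob_a, dob_b))
```

The separate-field treatment gives partial credit for the matching day and month, while the concatenated treatment records a single disagreement.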

  16. Is run 4 worse than runs 3 and 5?

  17. Run 6 – the Clackmannanshire data

  18. Conclusions
     • This is work in progress, and only limited amounts of data are currently available.
     • There is no evidence so far that the assumption of conditional independence has negative effects on output quality.
     • Future intentions include bringing in more packages, such as RELAIS v2.2, and a wider variety of data sets where training data are available.
     • For the moment, any views on the methods used and/or findings so far?

  19. The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk
