The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp, National Records of Scotland, Ladywell Road, Edinburgh EH12 7TF, stephen.sharp@gro-scotland.gsi.gov.uk

The record linkage problem • Given two files A and B, the aim is to find record pairs which refer to the same person. • This is done on the basis of linking fields common to the two files, such as first name, last name, date of birth and postcode. • The data matrix therefore looks like [table not reproduced], shown with four linking fields.
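As a rough illustration of such a data matrix, the sketch below builds one row per candidate record pair and one column per linking field. The field names, the exact-equality comparison and the records themselves are assumptions made up for this example; the slide's own table is not reproduced in the transcript.

```python
# Hypothetical field names, following the four examples named on the slide.
FIELDS = ("first_name", "last_name", "date_of_birth", "postcode")


def compare(rec_a, rec_b, fields=FIELDS):
    """Return one agreement indicator (1 = agree, 0 = disagree) per field."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)


# Made-up records, for illustration only.
file_a = [
    {"first_name": "ann", "last_name": "smith",
     "date_of_birth": "1975-03-12", "postcode": "EH1 1AA"},
    {"first_name": "john", "last_name": "brown",
     "date_of_birth": "1980-07-01", "postcode": "G1 2BB"},
]
file_b = [
    {"first_name": "ann", "last_name": "smyth",
     "date_of_birth": "1975-03-12", "postcode": "EH1 1AA"},
]

# One row per pair (a, b), one column per linking field.
data_matrix = [compare(a, b) for a in file_a for b in file_b]
```

Real linkage software normally restricts the pairs with a blocking field rather than comparing every record in A with every record in B.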

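What is the assumption of conditional independence? • The likelihood that the two records refer to the same person is measured by a log likelihood ratio: the log of the probability of the observed agreement pattern among matches divided by its probability among non-matches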

What is the assumption of conditional independence? • This is much easier to work out if the observations are independent conditional on match status, because the log likelihood ratio then becomes a sum of one term per linking field
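The formulas on these two slides are not reproduced in the transcript. In the usual Fellegi-Sunter style of notation, they would read roughly as follows. The symbols are assumptions chosen for this sketch: $\gamma = (\gamma_1, \dots, \gamma_k)$ is the agreement pattern over the $k$ linking fields, $M$ denotes a match, $N$ a non-match, and $W$ the linkage weight.

```latex
% Log likelihood ratio for an observed agreement pattern
\[
W(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)}
\]

% Conditional independence: the joint probabilities factorise by field ...
\[
P(\gamma \mid M) = \prod_{i=1}^{k} P(\gamma_i \mid M),
\qquad
P(\gamma \mid N) = \prod_{i=1}^{k} P(\gamma_i \mid N)
\]

% ... so the weight becomes a sum of per-field weights
\[
W(\gamma) = \sum_{i=1}^{k} \log \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid N)}
\]
```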

Why is the assumption of conditional independence important? • It keeps the number of parameters manageable – linear rather than exponential in the number of linking fields • Enables the use of frequency-based agreement weights • Speeds up computing time • Improves stability of parameter estimation • But it is almost always wrong, e.g. gender is almost wholly predictable from first name • But does it matter?
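The linear-versus-exponential point can be made concrete with a small count. The sketch assumes each linking field has only two outcomes (agree or disagree) and that the model also estimates one overall match proportion; both function names are hypothetical.

```python
def n_params_independent(k):
    """Parameters under conditional independence with k binary fields.

    One agreement probability per field among matches, one per field
    among non-matches, plus the overall match proportion.
    """
    return 2 * k + 1


def n_params_saturated(k):
    """Parameters when every agreement pattern has its own probability.

    There are 2**k patterns; their probabilities sum to 1, so 2**k - 1
    are free, separately for matches and non-matches, plus the overall
    match proportion.
    """
    return 2 * (2 ** k - 1) + 1


for k in (4, 7, 10):
    print(k, n_params_independent(k), n_params_saturated(k))
```

With seven linkage fields, as in Run 1, this gives 15 parameters under independence against 255 without it.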

Who adopts the conditional independence assumption? • Rec Link (US Census Bureau) – yes • Link Plus (US Centers for Disease Control and Prevention) – yes • GRLS/Fundy (Statistics Canada) – yes • ORLS – yes (probably) • RELAIS (Italian Statistical Institute) - no

Two questions • To what extent is the assumption violated in real data sets? • How much effect does it have on the output of linkage software?

What does the assumption look like in practice? A = Agree, D = Disagree; M = Match, N = Non-match
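For a pair of linking fields, the assumption says that within matches (and separately within non-matches) the probability that both fields agree equals the product of the two single-field agreement probabilities. A minimal sketch of that check follows; the counts are invented and the function name is hypothetical.

```python
def independence_gap(n_aa, n_ad, n_da, n_dd):
    """Compare P(both agree) with P(field 1 agrees) * P(field 2 agrees).

    The counts come from one match status only (all M or all N):
    n_aa = both fields agree, n_ad = field 1 agrees and field 2 disagrees,
    n_da = field 1 disagrees and field 2 agrees, n_dd = both disagree.
    """
    total = n_aa + n_ad + n_da + n_dd
    p_joint = n_aa / total
    p_field1 = (n_aa + n_ad) / total
    p_field2 = (n_aa + n_da) / total
    return p_joint, p_field1 * p_field2


# Made-up counts among matches, for illustration only.
joint, product = independence_gap(80, 5, 5, 10)
print(joint, product)
```

Under conditional independence the two returned values would be equal apart from sampling noise; here the joint probability is larger, so the two fields agree together more often than independence allows.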

Calculating the correlations between linkage fields • Run 1 – Rec Link – a 10% sample of the 2001 Scottish Census and the 2001 Census Coverage Survey (CCS) – one blocking field and seven linkage fields • Run 2 – Link Plus – a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales
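The slides do not say how the tetrachoric correlations were computed. As a rough stand-in, the sketch below uses the classical cosine-pi approximation based on the odds ratio of the 2x2 agree/disagree table. It is an approximation, not the maximum-likelihood estimate, and the function name is hypothetical.

```python
import math


def tetrachoric_approx(n_aa, n_ad, n_da, n_dd):
    """Cosine-pi approximation to the tetrachoric correlation.

    n_aa and n_dd are the concordant cells (both agree, both disagree);
    n_ad and n_da are the discordant cells.
    """
    concordant = n_aa * n_dd
    discordant = n_ad * n_da
    if concordant == 0 and discordant == 0:
        raise ValueError("association is undefined for this table")
    if discordant == 0:
        return 1.0  # limiting value as the odds ratio grows without bound
    odds_ratio = concordant / discordant
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))


# Made-up counts, for illustration only.
print(tetrachoric_approx(80, 5, 5, 10))
```

A value near zero is what conditional independence predicts for each pair of fields within matches and within non-matches.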

Run 1 - tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only

Run 1 - tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only

Run 2 - tetrachoric correlations for matches in the NHSCR/HESA data – medium linkage scores only

Run 2 - tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only

So the assumption of independence is significantly violated. Does it matter? • Runs 3, 4 and 5 all use the Census/CCS data with Link Plus but treat the date of birth differently • Run 3 – specific to date format, treating the date as one field (so not assuming independence) but with “intelligence” • Run 4 – day, month and year treated as three separate fields (and therefore as independent) • Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
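The difference between the Run 4 and Run 5 treatments can be sketched as below. The `YYYY-MM-DD` string format and both function names are assumptions for this example. Run 3's date-specific "intelligence" is not sketched because the slides do not describe it.

```python
def dob_as_one_field(dob_a, dob_b):
    """Run 5 style: the whole date agrees or disagrees as a single field."""
    return (int(dob_a == dob_b),)


def dob_as_three_fields(dob_a, dob_b):
    """Run 4 style: year, month and day are compared as separate fields."""
    parts_a = dob_a.split("-")
    parts_b = dob_b.split("-")
    return tuple(int(x == y) for x, y in zip(parts_a, parts_b))


# Made-up dates differing only by transposed day digits.
a, b = "1975-03-12", "1975-03-21"
print(dob_as_one_field(a, b))     # a single disagreement
print(dob_as_three_fields(a, b))  # year and month agree, day disagrees
```

Treating the parts separately gives partial credit for a near-miss date, but the three weights are then added as if the parts were independent.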

Is run 4 worse than runs 3 and 5?

Run 6 – the Clackmannanshire data

Conclusions • Work in progress, with limited amounts of data currently available • No evidence that the assumption of conditional independence has negative effects on output quality • Future intentions include bringing in more packages such as RELAIS v2.2 and a wider variety of data sets where training data is available • For the moment, any views on the methods used and/or findings so far?
