190 likes | 307 Views
The Conditional Independence Assumption in Probabilistic Record Linkage Methods. Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk. The record linkage problem.
E N D
The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk
The record linkage problem • Given two files A and B, the aim is to find record pairs which refer to the same person. • This is done on the basis of linking fields common to the two files such as first name, last name, date of birth and postcode • The data matrix therefore looks like
What is the assumption of conditional independence? • The likelihood that the two records refer to the same person is measured by a log likelihood ratio
What is the assumption of conditional independence? • This is much easier to work out if the observations are independent conditional on match status because now
Why is the assumption of conditional independence important? • It keeps the numbers of parameters manageable – linear rather than exponential relation to the number of linking fields • Enables the use of frequency based agreement weights • Speeds up computing time • Improves stability of parameter estimation • But is almost always wrong e.g. gender is almost wholly predictable from first name • But does it matter?
Who adopts the conditional independence assumption? • Rec Link (US Census Bureau) – yes • Link Plus (US Centers for Disease Control and Prevention) – yes • GRLS/Fundy (Statistics Canada) – yes • ORLS – yes (probably) • RELAIS (Italian Statistical Institute) - no
Two questions • To what extent is the assumption violated in real data sets? • How much effect does it have on the output of linkage software?
What does the assumption look like in practice?A = Agree D = DisagreeM = Match N = Non-match
Calculating the correlations between linkage fields • Run 1 – Rec Link - a 10% sample of the 2001 Scottish Census and the 2001 census coverage survey – one blocking field and seven linkage fields • Run 2 – Link Plus – a sample of the Scottish NHSCR data base and HESA records of Scottish students studying in England or Wales
Run 1 - tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only
Run 1 - tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only
Run 2 - tetrachoric correlations for matches in the NHSCR/HESA data – medium linkage scores only
Run 2 - tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only
So the assumption of independence is significantly violated. Does it matter? • Runs 3, 4 and 5. All using the census/CCS data and with Link Plus but different treatments of the date of birth • Run 3 – specific to date format treating the date as one field (so not assuming independence) but with “intelligence” • Run 4 – day, month and year treated as three separate fields (and therefore as independent) • Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
Conclusions • Work in progress and limited amounts of data currently available • No evidence that the assumption of conditional independence has negative effects on output quality • Future intentions include bringing in more packages such as RELAIS v2.2 and wider variety of data sets where training data is available • For the moment, any views on the methods used and/or findings so far?
The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk