Many-to-Many Linkage: Finding Maternal Siblings in Birth Registration Data
Charles Morris, Shelley Gammon, Christos Chatzoglou, Lynda Cooper, Julie Mills, Stephen Milner & Theodore Manasiss
Pregnancy Spine – Finding Maternal Siblings in Birth Registration Data
• Joint Data Linkage, Data as a Service, ADRCE, & Big Data project.
• Goal: To find and link maternal siblings within Birth Registration Data.
• Pre 2006: Mother’s NHS number not recorded.
• New Challenges with many-to-many linkage.
• Application to wider Health Outcomes, Child and Maternal Health Projects.
[Figure: example timeline for one mother's entry. Mother born 1976; Birth 1, 2001; Birth 2 (still), 2002; Birth 3, 2004; first baby starts school, 2006; first baby GCSE results, 2017. Other events shown: mother’s educational qualifications, first baby hospital visit, third baby congenital anomaly.]
Data Information
• England and Wales Birth Registration data:
  • 2000 – 2005 live and still births.
  • 3,722,737 total records.
• Attribute variables include:
  • Address of mother, Father’s name, Child’s DOB.
  • Child’s registration, Stillbirth indicator.
• Matching variables:
  • Mother’s first name, Maiden/surname, DOB, COB.
• Data Processing.
Linking Mothers to Identify Maternal Siblings
• Rachel Jones, dob – 03/07/1982: boy, George, b. 28/08/2000.
• Rachel Jones, dob – 03/07/1982: boy, Paul, b. 28/08/2000.
• Rachael Jones, dob – 03/07/1982: girl, Stella, b. 02/01/2003.
• What is record 3 linked to if we remove 1 and 2?
• Deterministic methods: a useful first step for one-to-one matching.
• For many-to-many linkage: probabilistic methods.
Computational Capability & Blocking
• Blocking used to reduce the comparison space.
• 3 Blocking Passes:
  • Block on Local Authority.
  • Block on surname trigram, YOB, COB.
  • Block on forename bigram, DOB, MOB.
• Combine records from 2nd & 3rd blocking passes and de-duplicate.
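The second and third blocking passes, and the combine-and-de-duplicate step, can be sketched as below. This is a minimal illustration, not the project's implementation: the records, field names, and key functions are hypothetical, "trigram" and "bigram" are assumed to mean the leading three and two characters, and "DOB, MOB" in the third pass is assumed to mean day and month of birth.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "forename": "RACHEL",  "surname": "JONES", "dob": "1982-07-03", "cob": "WALES"},
    {"id": 2, "forename": "RACHEL",  "surname": "JONES", "dob": "1982-07-03", "cob": "WALES"},
    {"id": 3, "forename": "RACHAEL", "surname": "JONES", "dob": "1982-07-03", "cob": "WALES"},
    {"id": 4, "forename": "RACHEL",  "surname": "JNOES", "dob": "1982-07-03", "cob": "WALES"},
    {"id": 5, "forename": "MARY",    "surname": "SMITH", "dob": "1975-01-15", "cob": "ENGLAND"},
]

def surname_key(r):
    # Assumed reading of "surname trigram, YOB, COB".
    return (r["surname"][:3], r["dob"][:4], r["cob"])

def forename_key(r):
    # Assumed reading of "forename bigram, DOB, MOB": day and month of birth.
    return (r["forename"][:2], r["dob"][8:10], r["dob"][5:7])

def candidate_pairs(recs, key_fn):
    """Group records by blocking key and emit every within-block pair."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[key_fn(r)].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

surname_pairs = candidate_pairs(records, surname_key)
forename_pairs = candidate_pairs(records, forename_key)

# Set union combines the two passes and de-duplicates in one step.
all_pairs = surname_pairs | forename_pairs
print(sorted(all_pairs))
```

In this toy data, record 4 has a transposed surname, so the surname pass misses it and only the forename pass proposes it as a candidate. That is the usual motivation for running more than one pass.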
Probabilistic Matching
• Identifies many-to-many links to create familial clusters.
• Model: Fellegi-Sunter probabilistic matching.
• Matching variables used:
  • Mother’s forename, maiden & surname, DOB, COB.
• Levenshtein Distance for partial agreement.
• For partial agreement, interpolate between agreement & disagreement weights.
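A rough sketch of Fellegi-Sunter scoring with Levenshtein-based partial agreement follows. The m and u probabilities are hypothetical placeholders, the choice of which fields are compared fuzzily is an assumption, and linear interpolation on a normalised similarity is only one possible way to interpolate between the agreement and disagreement weights; the slides do not specify the scheme used.

```python
import math

def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical (m, u) probabilities per field: m = P(agree | match),
# u = P(agree | non-match). Real values would be estimated from data.
PARAMS = {
    "forename": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.98, 0.0001),
    "cob": (0.99, 0.5),
}
FUZZY_FIELDS = {"forename", "surname"}  # assumed: other fields compared exactly

def field_weight(field, a, b):
    m, u = PARAMS[field]
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    s = similarity(a, b) if field in FUZZY_FIELDS else float(a == b)
    # Linear interpolation between disagreement (s = 0) and agreement (s = 1).
    return disagree + s * (agree - disagree)

def match_score(r1, r2):
    return sum(field_weight(f, r1[f], r2[f]) for f in PARAMS)

rachel = {"forename": "RACHEL", "surname": "JONES", "dob": "1982-07-03", "cob": "WALES"}
rachael = {"forename": "RACHAEL", "surname": "JONES", "dob": "1982-07-03", "cob": "WALES"}
mary = {"forename": "MARY", "surname": "SMITH", "dob": "1975-01-15", "cob": "ENGLAND"}

print(match_score(rachel, rachael), match_score(rachel, mary))
```

With these placeholder parameters the Rachel/Rachael pair scores only slightly below a perfect match, while the unrelated pair scores strongly negative. A threshold on the total score would then decide which pairs are kept as links.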
Graph Theory and Network Analysis
[Figure: map by Joachim Bering, 1613]
[Figure source: Compeau, P., Pevzner, P. & Tesler, G., Nature Biotechnology 29, 987–991 (2011), doi:10.1038/nbt.2023]
Network Analysis of Linkage Data – Representing Sibling Communities
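One way to read this representation is that each birth record is a node, each accepted link between two records is an edge, and each connected group of records is a candidate sibling community. The sketch below follows that reading using a small union-find; the function name, record ids, and links are hypothetical and not taken from the project.

```python
def sibling_clusters(record_ids, links):
    """Group records into connected components using union-find."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(clusters.values(), key=min)

# Records 1-3 stand for the three "Rachel/Rachael Jones" births; 4 is unrelated.
links = [(1, 2), (1, 3), (2, 3)]
print(sibling_clusters([1, 2, 3, 4], links))
```

Because record 3 is linked to both 1 and 2, it stays in the same cluster even if one of those links is dropped, which is the property a one-to-one match cannot give.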
Community Detection Algorithms
• Cluster Edge Betweenness.
• Betweenness Centrality.
• Cluster Modularity.
M. E. J. Newman, PNAS 103 (23) 8577–8582 (2006), doi:10.1073/pnas.0601602103
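To illustrate the idea behind edge-betweenness clustering, the sketch below performs a single Girvan-Newman-style step: it computes edge betweenness with Brandes' accumulation on an unweighted graph, removes the highest-scoring edge, and reads off the resulting components. The graph is a hypothetical example of two families joined by one spurious link; it is not the project's data, and a full algorithm would repeat the step until some stopping rule is met.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):    # accumulate from the farthest nodes back
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Each undirected pair was counted from both endpoints.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    seen, out = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        out.append(comp)
    return out

# Two hypothetical sibling groups (a, b, c) and (d, e, f) joined by one link c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)   # edge carrying the most shortest paths
u, v = tuple(bridge)
adj[u].discard(v)
adj[v].discard(u)
clusters = components(adj)
print(sorted(bridge), sorted(sorted(c) for c in clusters))
```

The link between the two groups carries every shortest path that crosses from one side to the other, so it receives the highest score and is removed first, leaving the two groups as separate communities.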
Drawbacks and Limitations
• Blocking on Local Authority.
• Linking on Neighbouring Local Authority.
• Tailoring Community Detection Algorithms.
• Recursion.
• High Performance Processing Requirements.
• 348 Local Authorities.
Future Implementations
• Custom Community Detection Algorithms with Neo4j based on:
  • Date of Birth,
  • Maximum Number of Nodes per Cluster.
• Incorporation of other data sources.
• Application to other data linkage problems within ONS and wider government.
• Application to wider domains including Health Outcomes, Child, and Maternal Health Projects.
Thank You
Any Questions?