
Many-to-Many Linkage: Finding Maternal Siblings in Birth Registration Data

Many-to-Many Linkage: Finding Maternal Siblings in Birth Registration Data. Charles Morris, Shelley Gammon, Christos Chatzoglou, Lynda Cooper, Julie Mills, Stephen Milner & Theodore Manasiss. Pregnancy Spine – Finding Maternal Siblings in Birth Registration Data.


Presentation Transcript


  1. Many-to-Many Linkage: Finding Maternal Siblings in Birth Registration Data Charles Morris, Shelley Gammon, Christos Chatzoglou, Lynda Cooper, Julie Mills, Stephen Milner & Theodore Manasiss.

  2. Pregnancy Spine – Finding Maternal Siblings in Birth Registration Data
     • Joint Data Linkage, Data as a Service, ADRCE, & Big Data project.
     • Goal: To find and link maternal siblings within Birth Registration Data.
     • Pre 2006: Mother’s NHS number not recorded.
     • New Challenges with many-to-many linkage.
     • Application to wider Health Outcomes, Child and Maternal Health Projects.

  3. Pregnancy Spine – Finding Maternal Siblings in Birth Registration Data (timeline figure)
     • Mother born: 1976.
     • Entry.
     • Birth 1: 2001.
     • Birth 2 (still): 2002.
     • Birth 3: 2004.
     • Linked events: Mother’s educational qualifications; First baby hospital visit; Third baby congenital anomaly; First baby starts school, 2006; First baby GCSE results, 2017.

  4. Data Information
     • England and Wales Birth Registration data:
       • 2000 – 2005 live and still births.
       • 3,722,737 total records.
     • Attribute variables include:
       • Address of mother, Father’s name, Child’s DOB.
       • Child’s registration, Stillbirth indicator.
     • Matching variables:
       • Mother’s first name, Maiden/surname, DOB, COB.
     • Data Processing.

  5. Linking Mothers to Identify Maternal Siblings
     • Rachel Jones, dob – 03/07/1982: boy, George, b. 28/08/2000.
     • Rachel Jones, dob – 03/07/1982: boy, Paul, b. 28/08/2000.
     • Rachael Jones, dob – 03/07/1982: girl, Stella, b. 02/01/2003.
     • What is record 3 linked to if we remove 1 and 2?
     • Deterministic methods: useful 1st step for 1-1 matching.
     • For many-to-many linkage: Probabilistic methods.
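The example on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the field names and the exact-match key (`exact_key`) are assumptions. It shows that an exact deterministic key groups records 1 and 2 but leaves the "Rachael" spelling on its own, and that removing records 1 and 2 once they are matched (as a one-to-one method would) leaves record 3 with nothing to link to.

```python
from collections import defaultdict

# Hypothetical records modelled on the three rows shown on the slide.
records = [
    {"id": 1, "forename": "Rachel", "surname": "Jones", "dob": "03/07/1982", "child": "George"},
    {"id": 2, "forename": "Rachel", "surname": "Jones", "dob": "03/07/1982", "child": "Paul"},
    {"id": 3, "forename": "Rachael", "surname": "Jones", "dob": "03/07/1982", "child": "Stella"},
]


def exact_key(rec):
    """Deterministic key: exact agreement on forename, surname and DOB (assumed rule)."""
    return (rec["forename"].lower(), rec["surname"].lower(), rec["dob"])


def deterministic_groups(recs):
    """Group record ids that share exactly the same key."""
    groups = defaultdict(list)
    for rec in recs:
        groups[exact_key(rec)].append(rec["id"])
    return sorted(groups.values())


groups = deterministic_groups(records)

# One-to-one style processing: once 1 and 2 are matched they leave the pool,
# so record 3 has no remaining candidates to be compared with.
matched_ids = {1, 2}
remaining = [r for r in records if r["id"] not in matched_ids]
candidates_for_3 = [r["id"] for r in remaining if r["id"] != 3]

print(groups)
print(candidates_for_3)
```

Keeping every record available for comparison, and scoring near-agreements such as Rachel/Rachael, is what the probabilistic many-to-many approach is meant to provide.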

  6. Computational Capability & Blocking
     • Blocking used to reduce the comparison space.
     • 3 Blocking Passes:
       • Block on Local Authority.
       • Block on surname trigram, YOB, COB.
       • Block on forename bigram, DOB, MOB.
     • Combine records from 2nd & 3rd blocking passes and de-duplicate.
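A minimal sketch of the second and third blocking passes, and of combining and de-duplicating their candidate pairs, might look like the following. Several things here are assumptions rather than details from the slides: "trigram" and "bigram" are taken to mean the first three and first two letters of the name, "DOB, MOB" in the third pass is read as day and month of birth, and the sample records and field names are invented. The Local Authority pass is left out for brevity.

```python
from collections import defaultdict
from itertools import combinations

# Invented sample records; dob is ISO formatted (YYYY-MM-DD) for easy slicing.
records = [
    {"id": 1, "forename": "Rachel", "surname": "Jones", "dob": "1982-07-03", "cob": "Wales"},
    {"id": 2, "forename": "Rchael", "surname": "Jones", "dob": "1982-07-03", "cob": "Wales"},
    {"id": 3, "forename": "Rachel", "surname": "Johns", "dob": "1982-07-03", "cob": "Wales"},
    {"id": 4, "forename": "Maria", "surname": "Smith", "dob": "1979-01-15", "cob": "England"},
]


def surname_key(rec):
    """Pass 2 (assumed form): first 3 letters of surname, year of birth, country of birth."""
    year = rec["dob"][:4]
    return (rec["surname"][:3].lower(), year, rec["cob"])


def forename_key(rec):
    """Pass 3 (assumed form): first 2 letters of forename, day of birth, month of birth."""
    _, month, day = rec["dob"].split("-")
    return (rec["forename"][:2].lower(), day, month)


def candidate_pairs(recs, key_fn):
    """Return the set of id pairs that fall in the same block under key_fn."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[key_fn(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs_surname = candidate_pairs(records, surname_key)
pairs_forename = candidate_pairs(records, forename_key)

# Taking the set union combines the passes and removes duplicate pairs.
all_pairs = pairs_surname | pairs_forename
print(sorted(all_pairs))
```

Each pass tolerates an error in a different field: the surname pass still pairs records 1 and 2 despite the forename typo, and the forename pass still pairs records 1 and 3 despite the surname variant.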

  7. Probabilistic Matching
     • Identifies many-to-many links to create familial clusters.
     • Model: Fellegi-Sunter probabilistic matching.
     • Matching variables used:
       • Mother’s forename, maiden & surname, DOB, COB.
     • Levenshtein Distance for partial agreement.
     • For partial agreement, interpolate between agreement & disagreement weights.
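The scoring idea on this slide can be sketched as below. The standard Fellegi-Sunter field weights are log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement. Everything else is an assumption made for illustration: the m and u probabilities in `PARAMS` are invented, the interpolation is taken to be linear in a normalised Levenshtein similarity, and the field names are hypothetical. The slides do not state which interpolation the project used.

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def field_weight(a, b, m, u):
    """Interpolate linearly (an assumption) between disagreement and agreement weights."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    s = similarity(a.lower(), b.lower())
    return disagree + s * (agree - disagree)


# Invented (m, u) probabilities per matching variable, for illustration only.
PARAMS = {
    "forename": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.98, 0.0001),
    "cob": (0.99, 0.3),
}


def match_score(rec_a, rec_b):
    """Sum of the per-field weights over all matching variables."""
    return sum(field_weight(rec_a[f], rec_b[f], m, u) for f, (m, u) in PARAMS.items())


rachel = {"forename": "Rachel", "surname": "Jones", "dob": "03/07/1982", "cob": "Wales"}
rachael = {"forename": "Rachael", "surname": "Jones", "dob": "03/07/1982", "cob": "Wales"}
maria = {"forename": "Maria", "surname": "Smith", "dob": "15/01/1979", "cob": "England"}

print(match_score(rachel, rachel), match_score(rachel, rachael), match_score(rachel, maria))
```

With these invented parameters the Rachel/Rachael pair scores slightly below an exact match and well above the unrelated pair, which is the behaviour partial-agreement weighting is meant to give.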

  8. Graph Theory and Network Analysis Map by Joachim Bering, 1613

  9. Graph Theory and Network Analysis Compeau, P., Pevzner, P. & Tesler, G. Nature Biotechnology 29, 987–991 (2011). doi:10.1038/nbt.2023

  10. Network Analysis of Linkage Data – Representing Sibling Communities

  11. Network Analysis of Linkage Data – Representing Sibling Communities

  12. Network Analysis of Linkage Data – Representing Sibling Communities

  13. Community Detection Algorithms
     • Cluster Edge Betweenness.
     • Betweenness Centrality.
     • Cluster Modularity.
     M. E. J. Newman, PNAS 103 (23), 8577–8582 (2006), doi: 10.1073/pnas.0601602103
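To make the modularity criterion concrete, here is a small sketch that computes Newman's modularity Q for a given partition of an undirected, unweighted graph, using the per-community form Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. The toy graph (two triangles joined by one bridging link) and the function name are assumptions for illustration. The sketch covers only the modularity score, not the edge-betweenness method or the software used in the project.

```python
def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (node_a, node_b) pairs, each edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    internal = {}  # community -> number of edges with both ends inside it
    degree = {}    # community -> sum of the degrees of its nodes
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Toy linkage graph: two tightly linked groups of records joined by one weak link.
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),   # first candidate sibling group
    ("D", "E"), ("E", "F"), ("D", "F"),   # second candidate sibling group
    ("C", "D"),                           # single bridging link between them
]

split = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
merged = {node: 0 for node in split}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

The partition that cuts the bridging link scores higher than treating all six records as one cluster, so a modularity-based method would prefer to report two sibling communities here.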

  14. Network Analysis of Linkage Data – Representing Sibling Communities

  15. Network Analysis of Linkage Data – Representing Sibling Communities

  16. Cluster Evaluation and QA

  17. Drawbacks and Limitations
     • Blocking on Local Authority.
     • Linking on Neighbouring Local Authority.
     • Tailoring Community Detection Algorithms.
     • Recursion.
     • High Performance Processing Requirements.
     • 348 Local Authorities.

  18. Future Implementations
     • Custom Community Detection Algorithms with Neo4j based on:
       • Date of Birth,
       • Maximum Number of Nodes per Cluster.
     • Incorporation of other data sources.
     • Application to other data linkage problems within ONS and wider government.
     • Application to wider domains including Health Outcomes, Child, and Maternal Health Projects.

  19. Thank You Any Questions?
