ALSPAC Record Linkage to External Databases
Andy Boyd, ALSPAC, Social Medicine, University of Bristol
The data sources and processes involved
• The processes involved in linkage projects
• Overview of ALSPAC’s existing data linkage projects
• National Pupil DB & Geographic linkage as examples
• Data Availability & Linkage Problems
Processes involved in linkage projects
• Find the contact
• Ethics – informed consent and/or Section 60 support
• Data Security
• HM Revenue & Customs
• Creating a linkage data set
• Data QC checks
• Identifiers
• Formats and data ‘normalisation’
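The ‘Formats and data normalisation’ step can be illustrated with a minimal Python sketch. The function names and the chosen target formats (upper-case ASCII names, postcodes with one space before the final three characters, ISO dates) are assumptions made for illustration, not ALSPAC's actual conventions.

```python
import re
from datetime import datetime

def normalise_name(name: str) -> str:
    """Upper-case a name; hyphens become spaces, other non-ASCII-letter characters are dropped."""
    cleaned = re.sub(r"[^A-Za-z ]", "", name.replace("-", " "))
    return " ".join(cleaned.upper().split())

def normalise_postcode(postcode: str) -> str:
    """Remove spacing differences, then put one space before the final three characters."""
    compact = re.sub(r"\s+", "", postcode).upper()
    return f"{compact[:-3]} {compact[-3:]}" if len(compact) > 3 else compact

def normalise_dob(text: str) -> str:
    """Convert a day-first or ISO date string to ISO YYYY-MM-DD."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next assumed format
    raise ValueError(f"unrecognised date format: {text!r}")
```

Applying the same normalisation to both parties' files before matching means that differences in case, spacing or date layout do not cause avoidable non-matches.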
Processes involved in linkage projects cont…
• Who links the data?
  – One of the two parties or an independent 3rd party
• Processing the data
• Anonymity vs sufficient data for research
• Ages in Months & Years
• First Half of Postcode
• Recode unusual outcomes into wider categories
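The three anonymisation measures listed above (ages instead of dates, first half of the postcode, recoding unusual outcomes) can be sketched in Python as follows. The function names and the `min_count=5` threshold are illustrative assumptions, not values taken from the slides.

```python
from collections import Counter
from datetime import date

def age_in_months(dob: date, on: date) -> int:
    """Completed months between a date of birth and a reference date."""
    months = (on.year - dob.year) * 12 + (on.month - dob.month)
    return months - 1 if on.day < dob.day else months

def first_half_of_postcode(postcode: str) -> str:
    """Keep only the part of a postcode before its final three characters."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]

def recode_rare(values, min_count=5, other="Other"):
    """Replace categories seen fewer than min_count times with a wider label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]
```

Each function trades precision for anonymity: the released file keeps enough detail for analysis while removing the exact date of birth, the full postcode and any rare outcome value.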
Major External Databases
• Health-related datasets
  – Office for National Statistics (ONS) Tracing: Cancer Registry & GRO
  – NSTS (NHS Strategic Tracing Service)
  – Electronic antenatal & birth records
  – PCT data (Exeter DB, My Quest)*
• Non-health datasets
  – National Pupil Database (DCSF, DIUS*, UCAS*)
  – ALSPAC Schools Collection
  – G.I.S Datasets (Geographic Information Systems)
  – DWP*
  – Home Office*
* Linkage currently being investigated
National Pupil Database
• Maintained by the Dept. for Children, Schools & Families (DCSF)
• Covers all state-maintained schools in England
• Census: previously annual, now at 3 time points
• Data at school and pupil level
• Key data include:
  – Exam results
  – Attendance
  – Pupil demographics (including address, ethnicity, Free School Meals, Special Educational Needs)
  – School characteristics (pupil numbers, staff-pupil ratios)
NPD – How we did it
• 3rd party conducted the match: The Fischer Trust, an independent charity
• Provided data on the eligible cohort
• ALSPAC & DCSF provided the following linkage variables:
  – Surname, Forename, Familiar name
  – Date of Birth, Gender
  – Postcode, Previous Postcode & Postcode accuracy flag
  – Current School (from ALSPAC data collection)
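How a match on these linkage variables might work can be sketched as a stepwise deterministic link: try a strict key first, then looser keys for the records still unlinked. This is an illustrative assumption only; the slides do not describe the third party's actual matching method, and the record layout, the `cohort_id`/`ext_id` field names and the choice of passes are hypothetical.

```python
def link_records(cohort, external, passes):
    """Link cohort records to external records using successively looser match keys.

    cohort, external: lists of dicts; passes: list of tuples of field names.
    Returns {cohort_id: ext_id}. A key is accepted only when it identifies
    exactly one unlinked record on each side.
    """
    links = {}
    used_external = set()
    for fields in passes:
        # Index the still-unlinked external records by the current key.
        ext_index = {}
        for rec in external:
            key = tuple(rec.get(f) for f in fields)
            if rec["ext_id"] not in used_external and None not in key:
                ext_index.setdefault(key, []).append(rec["ext_id"])
        # Group the still-unlinked cohort records by the same key.
        cohort_index = {}
        for rec in cohort:
            key = tuple(rec.get(f) for f in fields)
            if rec["cohort_id"] not in links and None not in key:
                cohort_index.setdefault(key, []).append(rec["cohort_id"])
        # Accept only one-to-one agreements; ambiguous keys wait for review.
        for key, cohort_ids in cohort_index.items():
            ext_ids = ext_index.get(key, [])
            if len(cohort_ids) == 1 and len(ext_ids) == 1:
                links[cohort_ids[0]] = ext_ids[0]
                used_external.add(ext_ids[0])
    return links
```

A call such as `link_records(cohort, external, [("surname", "forename", "dob", "gender", "postcode"), ("surname", "dob", "gender", "postcode")])` would first require full agreement and then allow the forename to differ, for example where one file holds a familiar name.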
NPD - Details
• ALSPAC cohort covers three academic years
• We hold data on all young people (YPs) across these three years: approx. 600,000 cases a year
• Figures based on the eligible cohort: 17,671 linked (86%)
• Majority of unlinked cases thought to be in private education (will be in NPD from KS4)
NPD - Advantages
• Covers all English state schools
• Good match rate for eligible cohort
• Regular updates
• Access to ‘confidential’ variables
• PLUG workshops provide good opportunities to discuss data and solutions to problems
NPD - Problems
• Central ID QC issues (a few duplicates)
• Only applies to English state-maintained schools until KS4, then re-link: extra costs, and bias until then
• Data collection methods/standards vary from school to school
• Documentation (lack of)
• Size of raw data, time-consuming to process
• Fixed time point census, doesn’t record all school movements (especially annual census)
G.I.S Data
• Spatial data held at many geographic levels
• Geographies range in scale from 0.1 metres to regional/national data
• Tied together via postcode or grid reference as central ID
• Key data include:
  – NSPD (formerly the All Fields Postcode Directory): geo-linking database
  – Deprivation & socio-economic indices (IMD, Townsend, Acorn)
  – Census data
G.I.S – How we link cases to data
• Master file of postcodes
• Postcodes linked to grid reference
• Grid references of various scales
• PCs/grid refs mapped to:
  – Electoral geographies
  – Census geographies
• Ethics:
  – We don’t generally identify residence at PC or equivalent level
Figure: Ordnance Survey – The National Grid
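The chain described above (postcode to grid reference, grid reference at a chosen scale, then electoral and census geographies) can be sketched as two lookups. This is a minimal sketch with hypothetical lookup tables; in practice the tables would come from a postcode directory such as the NSPD, and the coordinates and area codes used in any example are made up.

```python
def coarsen(easting: int, northing: int, resolution_m: int):
    """Snap a grid reference in metres to the south-west corner of its resolution_m square."""
    return (easting // resolution_m * resolution_m,
            northing // resolution_m * resolution_m)

def link_postcode(postcode, grid_lookup, area_lookup, resolution_m=1000):
    """Return a coarsened grid reference plus area codes for a postcode, or None if unknown.

    grid_lookup: {postcode_without_spaces: (easting, northing)}  (hypothetical table)
    area_lookup: {postcode_without_spaces: {"ward": ..., "census_area": ...}}  (hypothetical table)
    """
    key = postcode.replace(" ", "").upper()
    if key not in grid_lookup:
        return None
    easting, northing = grid_lookup[key]
    result = {"grid_ref": coarsen(easting, northing, resolution_m)}
    result.update(area_lookup.get(key, {}))
    return result
```

Choosing a coarser `resolution_m` is one way to keep the released location broader than postcode level, in line with the ethics point on this slide.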
G.I.S - Details
• 50,000 ALSPAC address points, associated with a date range which can then be linked to ALSPAC data collection
• Linkage examples:
  – Indices of multiple deprivation
  – Travel from home to school patterns
  – Cancer rates and residential distance from power lines
Figure: The geographic relation between household income and polluting factories – FoE 1999
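Because each address point carries a date range, linking to a data collection means finding the address that was current on the collection date. A minimal Python sketch, assuming a hypothetical history layout of `(start, end, address)` tuples with `end=None` for the current address:

```python
from datetime import date

def address_on(history, when):
    """Return the address whose [start, end] range contains `when`, or None for a gap.

    history: list of (start, end, address); end may be None for the current address.
    """
    for start, end, address in history:
        if start <= when and (end is None or when <= end):
            return address
    return None
```

A `None` result marks a date that falls in a gap in the address record, so the case cannot be placed geographically at that time point.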
G.I.S advantages
• Many data sets in public domain (or available through Athens)
• Many geographies are broad enough to not identify cohort members
• National picture (some exclude Scotland)
G.I.S Problems
• Shifting geographies across time points
• Royal Mail changes postcodes
• Postcode not precise enough in some cases
• Postcode boundaries are not coterminous with other geographic boundaries
Accuracy issues with analysis at postcode level
Figure: address level vs postcode level
Data Availability & Linkage Problems
• Cohort Data
• GIS Data
• GIS Ethics
Linkage problems with the cohort data
• Missing data
  – Especially problematic for the cases who didn’t enrol in the original recruitment
  – Partners
  – 69 cases with no known birth outcome
  – Gaps in the address data
• However…
  – ONS matched 99.7% of mothers, so we have their old & new NHS numbers and cleaned data (original recruitment cases only)
Linkage problems we encounter
• Many of the early records are paper-based or in varied formats
• Quality Control – ONS data returned to us with 37 incorrect ALSPAC IDs
• Unknown methods – No documentation from ONS or Fischer regarding the quality of the match
• Lack of uniqueness in the ID (either duplicates or multiple IDs per case)
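The ID problems listed above (incorrect IDs, duplicates, multiple IDs per case) can be screened for automatically when a linked file is returned. A minimal Python sketch, assuming the returned file can be reduced to hypothetical `(cohort_id, external_id)` pairs and that a set of valid cohort IDs is available:

```python
from collections import defaultdict

def id_qc(links, known_ids):
    """Report ID problems in a list of (cohort_id, external_id) pairs."""
    by_cohort = defaultdict(set)
    by_external = defaultdict(set)
    for cohort_id, external_id in links:
        by_cohort[cohort_id].add(external_id)
        by_external[external_id].add(cohort_id)
    return {
        # Cohort IDs in the returned file that were never issued.
        "unknown_cohort_ids": sorted(c for c in by_cohort if c not in known_ids),
        # One case linked to more than one external ID.
        "multiple_ids_per_case": sorted(c for c, e in by_cohort.items() if len(e) > 1),
        # One external ID shared by more than one case.
        "duplicate_external_ids": sorted(e for e, c in by_external.items() if len(c) > 1),
    }
```

The three lists give a starting point for manual review; they flag suspect records but do not say which of the conflicting links is correct.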
GIS Data Availability
• Collected as administrative resource
• Not yet cleaned, documented and presented to usual ALSPAC standards
• Initiatives under way to validate and fill gaps in record
• Schools GIS data in the main not processed
• Aim to build into standard ALSPAC resource
GIS Ethics
• Postcode level or greater accuracy treated as a personal identifier
• Research proposals to use these data need ALSPAC Law & Ethics Approval
• Broader geographical data can be released in normal manner
• A two-stage process is used to collect and process precise data
GIS Ethics
Step 1 – Postcodes (or full address) provided to researcher with unique collection ID with no other data attached
Step 2 – Researcher attaches their data and returns file to ALSPAC
Step 3 – ID converted to the appropriate collaborator ID, postcode data removed
Step 4 – Requested ALSPAC data added to the file and data sent to the researcher
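The steps above can be sketched in Python to show why the researcher never holds postcodes and cohort data together. All names here (`step1_release`, `step3_4_merge`, the dictionary layouts, the random hex collection IDs) are illustrative assumptions, not ALSPAC's actual implementation.

```python
import secrets

def step1_release(postcodes_by_alspac_id):
    """Step 1: issue postcodes under fresh collection IDs; the key map stays in-house."""
    key_map, release = {}, []
    for alspac_id, postcode in postcodes_by_alspac_id.items():
        collection_id = secrets.token_hex(8)  # random ID with no meaning outside this exchange
        key_map[collection_id] = alspac_id
        release.append({"collection_id": collection_id, "postcode": postcode})
    return key_map, release

def step3_4_merge(returned_rows, key_map, collaborator_ids, requested_data):
    """Steps 3-4: swap to collaborator IDs, drop the postcode, attach the requested data."""
    out = []
    for row in returned_rows:
        alspac_id = key_map[row["collection_id"]]
        # Keep only what the researcher derived; the postcode and collection ID are removed.
        derived = {k: v for k, v in row.items() if k not in ("collection_id", "postcode")}
        merged = {"collaborator_id": collaborator_ids[alspac_id]}
        merged.update(derived)
        merged.update(requested_data.get(alspac_id, {}))
        out.append(merged)
    return out
```

Step 2 happens on the researcher's side: they add their own columns (for example a distance measure) to the released rows and return the file, which is then passed to `step3_4_merge`.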