Brian Shand, Fiona McRonald, Katherine Henson, Cong Chen (Public Health England)

Pseudonymised Matching: Robustly Linking Molecular and Prescription Data to Cancer Registry Data in England Brian Shand, Fiona McRonald, Katherine Henson, Cong Chen (Public Health England)

Overview
• Motivation: matching patients between data feeds is challenging
• The OpenPseudonymiser approach to pseudonymisation with one-way hash functions
• Extending OpenPseudonymiser with encrypted demographics
• Results: linkage of national prescription data and BRCA mutation screening data
• Conclusion

Motivation – information needs
• Cancer registry data is extremely sensitive, and challenging to link:
  • The English cancer registration service (NCRAS) cannot reveal who has cancer to external providers
  • External providers cannot give identifiable data for patients without cancer; NCRAS can, however, hold data on patients with (suspicion of) cancer
• This makes sensitive feeds without a cancer marker difficult to access, e.g. national prescription data, BRCA molecular screening data
  • Screening for mutations in the BRCA1 or BRCA2 genes identifies people with increased risk of developing breast and/or ovarian cancer
  • 50%-65% of women with a BRCA1 mutation develop breast cancer by age 70, and 35%-46% develop ovarian cancer
  • If patients develop cancer later, the mutation data would add value

Key idea
• We want to pseudonymise cancer registry data and another data source in the same way:
  • If the same patient is in both data sources, they will get the same pseudo-id.
• Demographics / sensitive fields can be encrypted, so that only a trusted party (who also knows the linkage demographics) can decrypt them.
• Non-demographic fields are generally not disclosive, and do not need to be encrypted (at least within our secure cancer database).

Useful concepts
• Hashing
  • Irreversible scrambling algorithm
• Secret salt
  • Information making hashing context-specific
• Reversible encryption

Hashes and OpenPseudonymiser
• We start with the OpenPseudonymiser approach, which uses SHA-256 to generate pseudonyms for each patient:
  • SHA-256 is a cryptographically secure one-way hash function
  • given x, it’s straightforward to compute y = sha256(x), but
  • given y, it is computationally infeasible to reconstruct x, other than by trying all possibilities by brute force.
• The pseudonyms are secure if the salt is secret and “long enough” (e.g. 256 bits of random data).
• Replace each patient identifier with a pseudonym, derived from the NHS number (national healthcare identifier)
  • researchers can link their datasets without sharing patient demographics
  • pseudonym = sha256(NHS number + salt)
    E.g. sha256('1234567881' + 'ab00ec62fa2ad275b08471cbfc76cb8580f92283f3663baff0ea7d83aee57e19') = '778aebfe72aefcf391d0096333bf325837981ba60ba8a5921be37789307321d3'
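The pseudonym formula above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenPseudonymiser implementation: it assumes the NHS number and a hex-encoded salt are concatenated as text and UTF-8 encoded before hashing, which the slides do not specify, and the function name `pseudonymise` is hypothetical.

```python
import hashlib
import secrets


def pseudonymise(nhs_number: str, salt: str) -> str:
    """Return sha256(NHS number + salt) as 64 hexadecimal characters.

    Assumption: both values are concatenated as text and UTF-8 encoded.
    """
    return hashlib.sha256((nhs_number + salt).encode("utf-8")).hexdigest()


# A fresh 256-bit salt, hex encoded (64 characters). It must be kept secret.
salt = secrets.token_hex(32)

# The same NHS number and salt always give the same pseudonym...
pseudo_a = pseudonymise("1234567881", salt)
pseudo_b = pseudonymise("1234567881", salt)

# ...while a different salt gives an unrelated pseudonym.
pseudo_other_salt = pseudonymise("1234567881", secrets.token_hex(32))
```

Because the output depends on every bit of the salt, two parties only produce comparable pseudonyms if they share exactly the same salt.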

OpenPseudonymiser
• Research teams use the same salt as a shared secret.
• Patients with the same NHS number will be given the same pseudonym:
  778aebf…d3 <=> 778aebf…d3
• Without knowing the salt, the pseudonyms are non-identifiable.
• Ordinary researchers cannot access the salt: only a trusted linkage function can use it, and the secrecy of this is contractually agreed. (If the salt is known, a brute force attack could be possible.)
• OpenPseudonymiser only protects the key demographics (NHS number); the clinical data is treated as non-identifiable.
• Patients must match exactly by NHS number (or whatever demographics tuple is used for matching purposes, e.g. postcode + date of birth). OpenPseudonymiser does not support complex patient matching (e.g. NHS number + surname + month and year of birth).
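Exact-match linkage on a shared salt can be illustrated as a join on the pseudonym. The sketch below is an assumption-laden toy: the NHS numbers, diagnosis codes, drug names and the salt value are all made up, and `pseudonymise` is a hypothetical helper using text concatenation.

```python
import hashlib


def pseudonymise(nhs_number: str, salt: str) -> str:
    """Hypothetical helper: sha256(NHS number + salt) as hex."""
    return hashlib.sha256((nhs_number + salt).encode("utf-8")).hexdigest()


# Made-up value for illustration; a real salt is a long random secret.
SHARED_SALT = "example-salt-not-secret"

# Each team pseudonymises its own records before sharing them.
registry = {
    pseudonymise(nhs, SHARED_SALT): diagnosis
    for nhs, diagnosis in [("1111111111", "C50"), ("2222222222", "C34")]
}
prescriptions = {
    pseudonymise(nhs, SHARED_SALT): drug
    for nhs, drug in [("2222222222", "drug A"), ("3333333333", "drug B")]
}

# Records link only where the pseudonyms are identical.
linked = {
    pseudo: (registry[pseudo], prescriptions[pseudo])
    for pseudo in registry.keys() & prescriptions.keys()
}
```

Only the patient present in both toy datasets appears in `linked`. Anyone holding the salt could also hash every candidate NHS number and compare, which is why the salt is restricted to the trusted linkage function.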

Extending OpenPseudonymiser
• We have extended OpenPseudonymiser-like pseudonymisation to support fuzzy patient matching and clinical data encryption.
• As in OpenPseudonymiser, pseudonyms identify possible matches, i.e. records in which the registry has a legitimate interest.
• We use the plaintext linkage demographics to generate a secondary encryption key, e.g.:
  • per-record encryption keys are used for additional demographics and clinical data
  • keys combine patient pseudonym, random key, and additional salt
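The key-derivation idea can be sketched as follows. Everything here is an assumption made for illustration: the key layout, the labels, the salt values and the helper names are hypothetical, and the hash-based XOR keystream is only a standard-library stand-in for a real cipher such as AES. It is not the scheme used in production and should not be used to protect real data.

```python
import hashlib
import secrets


def derive_key(*parts: str) -> bytes:
    """Hash the labelled parts into a 32-byte key (hypothetical layout)."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).digest()


def xor_keystream(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with SHA-256(key + offset) blocks.

    Applying it twice with the same key restores the input.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


SALT_ID = "salt-for-pseudonyms"    # made-up values for illustration
SALT_KEY = "salt-for-record-keys"  # the "additional salt"

# Data provider side.
nhs_number = "1234567881"
pseudonym = hashlib.sha256((nhs_number + SALT_ID).encode("utf-8")).hexdigest()
record_key = secrets.token_hex(16)  # random per-record key

# The record key is wrapped under a key derived from plaintext demographics.
demog_key = derive_key("demog", nhs_number, SALT_KEY)
wrapped_record_key = xor_keystream(demog_key, record_key.encode("utf-8"))

# Clinical data key combines pseudonym, random key and additional salt.
clinical_key = derive_key("clinical", pseudonym, record_key, SALT_KEY)
ciphertext = xor_keystream(clinical_key, b"example clinical payload")

# Trusted party side: knowing the NHS number allows unwrapping.
recovered_key = xor_keystream(
    derive_key("demog", nhs_number, SALT_KEY), wrapped_record_key
).decode("utf-8")
plaintext = xor_keystream(
    derive_key("clinical", pseudonym, recovered_key, SALT_KEY), ciphertext
)
```

The design point this illustrates is that the stored record carries only the pseudonym, the wrapped key and the ciphertext, so a party that does not already know the patient's linkage demographics cannot derive `demog_key` and cannot read the payload.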

Extending OpenPseudonymiser 2
• The cancer registry keeps an isolated database of pseudonymised data and keys, to match registry patients against.
• Where the core demographics match, the remaining demographics will be unpacked, and used for fuzzy patient matching.
• If the demographics match score is high enough, the clinical data will be unpacked and released to the ENCORE cancer registration database.
  • No access to identifiable data for patients not suspected to have cancer
• The pseudonymised dataset itself can also be used for baseline comparisons, e.g. to compare how often a particular prescription drug was dispensed to lung cancer patients vs the overall population.
• By including patient age as a derived, non-disclosive field in the pseudonymised data, baseline comparisons can be age standardised.
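The gated release step can be sketched as a score followed by a threshold check. The scoring rule, the fields compared, the threshold value and all names below are hypothetical; the slides do not describe how the registry actually scores demographic matches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Demographics:
    date_of_birth: str
    postcode: str
    surname: str


def match_score(registry: Demographics, candidate: Demographics) -> float:
    """Fraction of compared fields that agree (hypothetical scoring rule)."""
    checks = [
        registry.date_of_birth == candidate.date_of_birth,
        registry.postcode.replace(" ", "").upper()
        == candidate.postcode.replace(" ", "").upper(),
        registry.surname.strip().lower() == candidate.surname.strip().lower(),
    ]
    return sum(checks) / len(checks)


THRESHOLD = 0.6  # made-up cut-off for illustration


def release_clinical_data(
    registry: Demographics,
    candidate: Demographics,
    clinical_data: dict,
    threshold: float = THRESHOLD,
) -> Optional[dict]:
    """Return the clinical data only if the demographics agree well enough."""
    if match_score(registry, candidate) >= threshold:
        return clinical_data
    return None


registry_patient = Demographics("1950-02-03", "CB1 2AB", "Smith")
good_candidate = Demographics("1950-02-03", "cb12ab", "SMITH ")
poor_candidate = Demographics("1971-11-30", "CB1 2AB", "Jones")

released = release_clinical_data(registry_patient, good_candidate, {"drug": "A"})
withheld = release_clinical_data(registry_patient, poor_candidate, {"drug": "B"})
```

In this toy run the first candidate agrees on all three fields after normalisation and is released, while the second agrees only on postcode and is withheld.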

Applications in PHE
• Public Health England has access to pseudonymised national prescription data feeds from NHS Business Services Authority, and BRCA and other genetic mutation screening data. These have been linked to the cancer registry. Decrypted birthdates help validate NHS number matches.
• Four months of prescription data (332 million prescriptions, 29 million people) matched 1.6 million cancer patients: 88% of living cancer patients had a prescription record.
• We now have 47 months of prescription data linked to the cancer registry.
• Non-disclosive fields need not be pseudonymised, so the pseudonymised dataset allows baseline comparisons against the cancer-linked cohort. For BRCA screening data, this identified nearly 1,300 unique variants from 7,000 screening patients, and an overall variant detection rate of about 25%. In prescription data, cancer patients were compared with age-matched controls.
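One common way to make a baseline comparison age-aware is direct age standardisation, sketched below. This is a swapped-in textbook technique, not necessarily the method used for the PHE analyses, and every number here (age bands, weights, counts) is invented purely to show the arithmetic.

```python
def directly_standardised_rate(events, population, standard_weights):
    """Weighted mean of age-specific rates, using standard population weights."""
    total_weight = sum(standard_weights.values())
    weighted = sum(
        standard_weights[band] * events[band] / population[band]
        for band in standard_weights
    )
    return weighted / total_weight


# Invented standard population weights by age band.
weights = {"0-49": 0.60, "50-69": 0.25, "70+": 0.15}

# Invented counts: people dispensed a drug, and people at risk, per age band.
cohort_events = {"0-49": 10, "50-69": 60, "70+": 90}
cohort_population = {"0-49": 1000, "50-69": 2000, "70+": 1500}

baseline_events = {"0-49": 50, "50-69": 200, "70+": 300}
baseline_population = {"0-49": 10000, "50-69": 10000, "70+": 6000}

cohort_rate = directly_standardised_rate(cohort_events, cohort_population, weights)
baseline_rate = directly_standardised_rate(
    baseline_events, baseline_population, weights
)
rate_ratio = cohort_rate / baseline_rate
```

Because both rates are weighted by the same standard age distribution, the ratio is not driven by the linked cohort simply being older than the overall population.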

Conclusion
• Linking data from external sources to the cancer registry creates a powerful resource to better understand patient experience over their lifetimes.
• Pseudonymised matching can help to unlock data sources which include people without cancer.
• We have done this for prescribing and screening data.
• cong.chen@phe.gov.uk
• ncrasenquiries@phe.gov.uk
