Supporting precise data analysis without releasing patient records: the Simulacrum in action

Cong Chen, Paul Clarke, Lora Frayling, Sally Vernon, Brian Shand, Pesh Doubleday, Jem Rashbass

Overview
• Context and goals of this talk
• Background: our motivating problem
• What is synthetic data and how does it help?
• What is the goal of the data exercise?
• Building a synthetic data model in the Simulacrum
• Results and applications
• Conclusion

Talk Aims
• Introduce and motivate concepts
  • Synthetic data
  • The information governance environment
  • Externally guided analysis
• Describe and explain
  • The Simulacrum as synthetic data – what is it and how was it created?
  • Synthetic data-guided queries
  • How this has led to faster, more private answers

Problems with sharing cancer data
• Lots of data is available
  • Sharing it would enable researchers and industry to provide valuable insight into disease epidemiology, survival, clinical practice, resource utilisation and outcomes
• Highly sensitive
  • Sharing data is an exercise in balancing risk against reward
• Complex and intricate
  • Data dictionaries do not provide a perfect view of what to expect, so analysis can be slow to converge

Synthetic data
• Data items which are not created by observations
  • This includes simulations (e.g. Synthea), partially synthetic data (generalised perturbation) and fully synthetic data
• Does not represent individuals
  • Removes re-identification risk, but attribution risks remain

Simulacrum project aims
• Users should have direct access to a public resource
  • Showing data as it looks to internal analysts
  • Be able to identify their cohort and the cohort size, data completeness and quality, and the codes/ranges used
  • Be able to prepare and code algorithms against the synthetic data
• With a prepared analytical plan
  • Engage PHE with the proposed study
  • Share code which runs on the real data
  • Be able to complete analysis without releasing row-level or other sensitive data
• Take a data-driven approach where possible
• Use parameters
  • To adjust for differently sized or shaped datasets
  • To adjust to different privacy constraints/requirements

Linked datasets
• Data represents the course of patient treatment – we are interested in a coherent story and sensible timeline.
• Patients can have multiple tumours, with very many treatment events – we need to capture this.

How did we do it?
• Key idea: sample from empirical conditional distributions.
• Question: how do we keep from running out of data?
  • Use low-dimensional distributions.
• Question: which variables do we condition on?
  • Use independence tests to find strongly associated variables.
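The two answers above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Simulacrum's actual code: the toy rows, the column names (`site`, `sex`, `stage`), the helper names, and the use of Cramér's V as the association measure are all hypothetical choices made for the example. It conditions each variable on a single parent, which keeps the distributions low-dimensional.

```python
import random
from collections import Counter, defaultdict
from math import sqrt

# Toy records, invented for illustration only (not real or Simulacrum data).
ROWS = (
    [{"site": "breast", "sex": "F", "stage": "1"}] * 4
    + [{"site": "breast", "sex": "F", "stage": "2"}] * 4
    + [{"site": "prostate", "sex": "M", "stage": "1"}] * 4
    + [{"site": "prostate", "sex": "M", "stage": "2"}] * 4
)


def cramers_v(rows, a, b):
    """Chi-squared based association between two categorical columns.

    Returns 0.0 for independent columns and 1.0 for a perfect association.
    """
    n = len(rows)
    joint = Counter((r[a], r[b]) for r in rows)
    count_a = Counter(r[a] for r in rows)
    count_b = Counter(r[b] for r in rows)
    chi2 = 0.0
    for x, na in count_a.items():
        for y, nb in count_b.items():
            expected = na * nb / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def choose_parent(rows, target, candidates):
    """Pick the candidate column most strongly associated with the target."""
    return max(candidates, key=lambda c: cramers_v(rows, c, target))


def fit_conditional(rows, target, parent):
    """Empirical distribution of the target for each observed parent value."""
    table = defaultdict(Counter)
    for r in rows:
        table[r[parent]][r[target]] += 1
    return table


def sample_conditional(table, parent_value, rng):
    """Draw one target value from the empirical distribution given the parent.

    Assumes parent_value was observed when the table was fitted.
    """
    counts = table[parent_value]
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    parent = choose_parent(ROWS, "sex", ["site", "stage"])
    table = fit_conditional(ROWS, "sex", parent)
    print(parent, sample_conditional(table, "breast", rng))
```

Conditioning on one well-chosen parent rather than on every available column is what stops the conditional cells from emptying out as the number of variables grows.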

More details
• Question: what do we do for linked tables?
  • Use all previous data (but in read-only mode).
• Question: what about sequences of events?
  • Use information from the previous event (if it exists) and data in upstream tables – so a Markov model.
• Question: what about sampling from small conditional distributions, which risk reflecting real individuals?
  • Cluster these distributions to meet accepted healthcare data standards.
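As a rough sketch of the last two answers, the example below fits a first-order Markov model over treatment events and pools any conditional distribution built from too few records. Everything here is an assumption made for illustration: the event names, the `min_count` threshold, and the simple pooling rule (which stands in for the clustering step and is not the method used in the Simulacrum). It also conditions only on the previous event, not on upstream tables.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical sentinel states


def fit_markov(sequences, min_count=3):
    """Fit next-event distributions conditioned on the previous event.

    Distributions backed by fewer than min_count transitions are merged
    into one shared pool instead of being kept on their own.
    """
    transitions = defaultdict(Counter)
    for seq in sequences:
        states = [START, *seq, END]
        for prev, nxt in zip(states, states[1:]):
            transitions[prev][nxt] += 1

    model, pooled = {}, Counter()
    for prev, counts in transitions.items():
        if sum(counts.values()) < min_count:
            pooled.update(counts)  # too small to sample from directly
        else:
            model[prev] = counts
    return model, pooled


def sample_sequence(model, pooled, rng, max_len=20):
    """Generate one synthetic event sequence from the fitted model."""
    seq, state = [], START
    while len(seq) < max_len:
        counts = model.get(state, pooled)
        if not counts:
            break
        values = list(counts)
        state = rng.choices(values, weights=[counts[v] for v in values], k=1)[0]
        if state == END:
            break
        seq.append(state)
    return seq


# Toy sequences, invented for illustration only.
SEQUENCES = [["surgery", "chemo"]] * 3 + [["chemo", "radio"]]

if __name__ == "__main__":
    model, pooled = fit_markov(SEQUENCES)
    print(sample_sequence(model, pooled, random.Random(0)))
```

In this toy data only one record ever passes through `radio`, so its next-event distribution is never sampled on its own; it falls back to the pool.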

What models look like (without the data)

The Simulacrum as a dataset
• Version 1 – released 2018. 1.5 million tumours (corresponding to English incidences 2013-2015) with tumour/demographic/mortality data and chemotherapy treatment.
• Representative at low dimensions (of variable combinations), not as good for complex detail.
• Non-disclosive for public release.
• Ongoing development.

How does it look?
[Figures: cumulative age distribution (breast); cumulative age distribution (prostate). Blue: synthetic, red: real.]
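A comparison like the one in these plots can also be summarised numerically. The sketch below computes the largest vertical gap between two empirical cumulative distributions (the Kolmogorov-Smirnov statistic). It is one possible check, not the evaluation used for the Simulacrum, and the ages are invented for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_cdf_gap(real, synthetic):
    """Largest absolute difference between the two empirical CDFs."""
    f_real, f_syn = ecdf(real), ecdf(synthetic)
    points = sorted(set(real) | set(synthetic))
    return max(abs(f_real(x) - f_syn(x)) for x in points)


# Toy ages, invented for illustration only.
REAL_AGES = [50, 60, 70, 80]
SYNTHETIC_AGES = [60, 70, 80, 90]

if __name__ == "__main__":
    print(max_cdf_gap(REAL_AGES, SYNTHETIC_AGES))
```

A gap near zero means the two curves almost coincide. What counts as small enough depends on the intended use of the synthetic data.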

Applications
• Synthetic data is used to support a statistical query gateway (currently manual).
• We’ve shared our synthetic data with partners to write queries against. The resulting queries have proved robust, respect the data formats and categories in our data, and run against the real data.
• Publications accepted at conferences and in journals.
• We then try to release only non-disclosive aggregates and model parameters/diagnostics, without the personal data used to build those models.
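One common way to keep released aggregates non-disclosive is small-number suppression. The sketch below illustrates that general technique only: the threshold of 5, the column names and the helper name are assumptions for the example, not the disclosure-control rules actually applied to the gateway's outputs.

```python
from collections import Counter


def safe_counts(rows, keys, threshold=5):
    """Count rows per combination of keys, suppressing small cells.

    Cells with fewer than threshold rows are reported as None rather
    than as an exact count.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Toy rows, invented for illustration only.
TOY_ROWS = [{"site": "breast", "sex": "F"}] * 6 + [{"site": "breast", "sex": "M"}] * 2

if __name__ == "__main__":
    print(safe_counts(TOY_ROWS, ["site", "sex"]))
```

Because the query is written and tested against the synthetic data first, only the suppressed table needs to leave the secure environment.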

Current work
• Better documentation of research and access process for less technical researchers
• Model improvement, application in context of other datasets
• More test-driven quality measures, automatic simulation with specific goals
• Use other synthetic methodology within the data architecture
• Fidelity isn’t objective – need to think about suitability for specific purpose

Conclusions
• Synthetic data is a game changer for supporting research and reducing risks
• This opens understanding of the data and analysis to a wider audience while reducing workload and misunderstandings
• Realistic understanding of aims and expectations helps a synthetic data project improve mutual understanding

Acknowledgements
• Analyses were based on anonymous aggregate patient data from the National Cancer Registration and Analysis Service.
• Thank you to NCRAS and HDI, as well as everyone working on or who has worked on the Simulacrum.
• Pick up the data at https://simulacrum.healthdatainsight.org.uk
• https://github.com/UCL-simulacrum/EDA is an amazing piece of work carried out by UCL students over 3 months with no reference to the real data.
• cong.chen@phe.gov.uk
• ncrasenquiries@phe.gov.uk
