Privacy Protection & the SAIL Databank David Ford ECCONET Data Linkage Workshop Bergen 15 – 17 June 2011. Overview. The SAIL system and how it operates Privacy Protection Issues and Drivers Privacy Protection approach Current developments Examples of research studies Future work.
Privacy Protection & the SAIL Databank • David Ford • ECCONET Data Linkage Workshop • Bergen, 15 – 17 June 2011
Overview • The SAIL system and how it operates • Privacy Protection Issues and Drivers • Privacy Protection approach • Current developments • Examples of research studies • Future work
What are HIRU and SAIL? • HIRU – the Health Information Research Unit • SAIL – Secure Anonymous Information Linkage • Main aim of HIRU is to realise the potential of electronically-held, routinely-collected, person-based data to conduct and support health-related studies • The SAIL databank already holds over 1 billion anonymised and encrypted individual-level records, from a range of sources relevant to health and well-being
Is SAIL a Cohort? • Perhaps! • Total population databank for the 3 million people of Wales • Multi-source data (administrative, clinical, research) • Many nested e-cohorts within SAIL (such as WECC)
Data Linkage is the key! • Data linkage (at a person level) is essential to reap the benefits of routine data • Good quality data linkage needs some form of consistent personal identifiers on which to link • In the UK, multi-source data do not share a common ID number. Names, addresses and dates of birth are, however, normally collected.
In the beginning . . . • There was a real opportunity to create this data resource • We had established how linked, routine data was useful for research • We knew there were numerous technical (computing) challenges to overcome • But the idea required data owners / guardians to feel able to provide their data to SAIL • Constructing the circumstances that enabled data guardians to supply data to SAIL became the single biggest challenge!
The issues facing SAIL • Data guardians across Wales: • Wanted to participate and saw the potential benefits • Were nervous of breaching the Data Protection Act • Did not have clear guidelines that helped • Needed a way of guaranteeing the privacy of their data • Were nervous about the uses to which the data might be put • Wanted access to be controlled (in some way)
The issues facing SAIL • Researchers (across the UK) wanted: • As much data as they wanted, whenever they wanted it • To avoid detailing what they wanted to do • Data delivered to them • Data to arrive quickly • No admin, no approvals, no constraints • Clean, easy and consistent data • Simple, flat data structures
Our response • Set a series of objectives • Undertook pilot work • Consulted very widely • Understood relevant legislation and good practice guidance (Information Governance) • Developed the approaches over time • Continued to consult and have external inspection • Continuous improvement process
The initial IG challenge • Matching up the same people in different datasets (data linkage) is very inaccurate without access to identifiers • (Matching with imperfect identifiers is still a challenge!) • Sophisticated but standardised matching was therefore required • Data owners felt able to part with “anonymised at source” data. However, including identifiers in the supply was seen to be illegal without consent
Setting out • Pilot to prove the concept • One health economy area – Swansea (pop. c. 250k) • Data: General Practices (36), Patient Episode Database Wales (PEDW) and social services data extracts • Purpose: to develop, review and refine technical and procedural methodologies
Setting out • Consultations with regulatory and professional bodies (local and national) • Suitability of system • In the public interest • Protection of patient privacy • Ethics and governance • Usefulness to enhance research and inform policy • Value for money • Exhaustive (and exhausting!) efforts • File of evidence of acceptability
The base level • Response: development of “Split-file anonymisation” technique • Using the “separation principle” • No flow of identifiable information to SAIL • No flow of identifiable confidential information to ANYONE • Clear, written, formal data sharing agreements • Clarity about use cases, conditions and exception clauses
Other design constraints • A pledge to data providers that no data will ever leave the databank • They can ask for it to be deleted • They know who has accessed it • They know what it has been used for
HIRU methodology (illustration) • [Diagram. Components shown: Data Provider (operational system), Health Solutions Wales (anonymisation process: validate, construct ALF), HIRU (Blue C) • Flows shown: demographic data only; clinical / activity data; validated, anonymised data; recombine; encrypt and load; other recombined data]
Available Computing infrastructure • Blue C supercomputer, one of the fastest computers in Europe dedicated to Life Science research • Strategic partnership with IBM (through School of Medicine’s Institute of Life Sciences initiative) • Advanced software toolset (database, data mining, GIS)
Objectives • Secure data transportation • Reliable matching process • Anonymisation and encryption • Disclosure control • Data access controls • Scrutiny of data utilisation proposals • External verification of compliance with IG
Objective 1 • Secure data transportation • Data transported using HTTPS (Hyper-Text Transfer Protocol Secure) • DPOs split datasets at source • Clinical details to HIRU (none to HSW) • Demographics to HSW for matching and anonymisation • (Brown Envelope principle) • Linking key – re-join after anonymisation
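The split at source and the later re-join can be sketched as follows. This is only an illustration of the separation principle described on the slide: the field names, the use of a random UUID as linking key, and the function names `split_at_source` and `recombine` are all assumptions, not details from the SAIL system.

```python
import uuid

# Hypothetical field names; the slides do not list the real ones.
DEMOGRAPHIC_FIELDS = {"nhs_number", "first_name", "surname",
                      "dob", "gender", "postcode"}


def split_at_source(records):
    """Split each record into a demographic part and a clinical part.

    Both parts carry the same random linking key. The key holds no
    personal information; it only allows the parts to be re-joined.
    """
    demographic_file, clinical_file = [], []
    for record in records:
        key = uuid.uuid4().hex
        demo = {k: v for k, v in record.items() if k in DEMOGRAPHIC_FIELDS}
        clin = {k: v for k, v in record.items() if k not in DEMOGRAPHIC_FIELDS}
        demographic_file.append({"linking_key": key, **demo})
        clinical_file.append({"linking_key": key, **clin})
    return demographic_file, clinical_file


def recombine(anonymised_keys, clinical_file):
    """Re-join clinical rows with the ALF returned for each linking key.

    `anonymised_keys` stands for what the trusted third party sends back:
    rows holding only a linking key and an ALF, with no identifiers.
    """
    alf_by_key = {row["linking_key"]: row["alf"] for row in anonymised_keys}
    return [
        {"alf": alf_by_key[row["linking_key"]],
         **{k: v for k, v in row.items() if k != "linking_key"}}
        for row in clinical_file
        if row["linking_key"] in alf_by_key
    ]
```

In this sketch the demographic file would travel to HSW and the clinical file to HIRU, so neither party holds identifiers and clinical content together.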
Objective 2 • 2) Reliable matching process • Partnership with Trusted Third Party – HSW • HSW = NHS Agency with right to hold identifiers for NHS admin purposes • Use the Welsh Demographic Service administrative register as gold standard • MACRAL (Matching Algorithm for Consistent Results in Anonymous Linkage) - SQL-based algorithm – sequential passes • Deterministic and probabilistic record linkage
MACRAL • Exact match on valid NHS number • Exact match on firstname, surname, d.o.b, gender, postcode • Soundexing • Lexicon matching • Assigns match probability on Bayesian model • Informs analysts
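The idea of sequential passes can be sketched as below. MACRAL itself is described as SQL-based; this Python version is an illustrative stand-in covering only the first three passes (NHS number, exact demographics, Soundex on names). The lexicon pass and the Bayesian match probability are omitted, and the field names, pass names and the rule "accept only a single candidate" are assumptions.

```python
def soundex(name):
    """Standard American Soundex code (letter plus three digits)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def norm(value):
    """Lower-case and trim a field; missing values become ''."""
    return (value or "").strip().lower()


# Each pass is (name, function building a comparison key from a record).
PASSES = [
    ("nhs_number", lambda r: (r.get("nhs_number"),)),
    ("exact", lambda r: (norm(r.get("first_name")), norm(r.get("surname")),
                         r.get("dob"), norm(r.get("gender")),
                         norm(r.get("postcode")))),
    ("soundex", lambda r: (soundex(r.get("first_name") or ""),
                           soundex(r.get("surname") or ""),
                           r.get("dob"), norm(r.get("gender")))),
]


def match(record, register):
    """Try each pass in order; return (register id, pass name) or (None, None)."""
    for pass_name, key_of in PASSES:
        key = key_of(record)
        if any(part in (None, "") for part in key):
            continue  # not enough data for this pass
        candidates = [row["id"] for row in register if key_of(row) == key]
        if len(candidates) == 1:
            return candidates[0], pass_name
    return None, None
```

Returning the pass name alongside the match mirrors the "informs analysts" point: a later, looser pass signals a less certain link.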
Validation and optimisation • Firstly – • Validation exercise • Obtained specificity values > 99.8% and sensitivity > 94.6%, with error rates < 0.2% • Then – • Optimised techniques for matching a variety of datasets: primary care (GP), hospital/secondary care (PEDW), and social care (PARIS)
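For reference, sensitivity and specificity in a linkage validation are usually computed from counts of true and false matches, as in the sketch below. The slide does not say how its figures or its "error rate" were derived, so the function name and the example counts are purely illustrative.

```python
def linkage_metrics(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from validation counts.

    sensitivity = share of true matches that the algorithm found
    specificity = share of true non-matches that it correctly rejected
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Made-up counts, chosen only to show the arithmetic.
sens, spec = linkage_metrics(true_pos=946, false_neg=54,
                             true_neg=998, false_pos=2)
```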
Objective 3 • 3) Anonymisation and encryption • Anonymous Linking Field (ALF) • One person – one ALF • Aggregation and categorisation • Further processing at HIRU • Into ALF_E • Recombination
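The two-stage idea (an ALF assigned by the trusted third party, then re-encoded at HIRU into ALF_E) can be sketched as below. The slides do not describe the actual encryption used, so a keyed hash (HMAC-SHA256 from the Python standard library) is swapped in here as a stand-in; the function names and keys are hypothetical.

```python
import hashlib
import hmac


def keyed_pseudonym(value, secret_key):
    """Deterministic keyed hash: same input and key give the same output."""
    return hmac.new(secret_key, value.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def make_alf(register_id, third_party_key):
    """Stage 1 (at the trusted third party): one person, one ALF."""
    return keyed_pseudonym(f"alf:{register_id}", third_party_key)


def make_alf_e(alf, databank_key):
    """Stage 2 (at the databank): re-encode the ALF so that the value
    stored for research differs from the one the third party issued."""
    return keyed_pseudonym(f"alf_e:{alf}", databank_key)
```

Because each stage uses a different key held by a different party, neither party alone can map a stored ALF_E back to a register entry in this sketch.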
Objective 4 • 4) Disclosure controls • Assessment of uniques and low-copy numbers • Data reduced to minimum required for study • Operated at various stages: • When the data view is created • Before dissemination • Numerical Evaluation of Multiple Outputs • Combination of expert review and machine processes
Numerical Evaluation of Multiple Outputs • NEMO • SQL-based algorithm • Counts unique and low-copy number records • Allows the judicious application of suppression and/or aggregation • Manual review • Linkage/Homogeneity attack
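Counting unique and low-copy records can be sketched as below. NEMO is described as SQL-based; this Python version only illustrates the counting and suppression steps, and the threshold of 5, the field names and the function names are assumptions rather than SAIL settings.

```python
from collections import Counter


def low_copy_groups(rows, fields, threshold=5):
    """Count rows per combination of `fields` and return the combinations
    that occur fewer than `threshold` times (count 1 means unique)."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}


def suppress(rows, fields, threshold=5):
    """Drop rows belonging to any low-copy combination.

    Aggregation (e.g. widening age bands) is the alternative the slide
    mentions; it is not shown here.
    """
    risky = low_copy_groups(rows, fields, threshold)
    return [row for row in rows
            if tuple(row[f] for f in fields) not in risky]
```

The flagged combinations would then go to manual review, which also has to consider risks such as linkage and homogeneity attacks that a simple count does not detect.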
Objective 5 • 5) Data access controls • Technical and permission-based control • Policies and Standard Operating Procedures (SOPs) • User agreements – clarity + penalties • Project-based approvals and linked access • Physical restrictions - technology • Time-limited, specific data views per approved project • SAIL Gateway
SAIL Gateway: Critical features • Firewalled network • Windows XP desktops, one per user, running in a virtualised environment (VPN) • All desktops and servers are members of Active Directory, with specific group policies applied • Only remote desktop (RDP) allowed through the firewall to the Windows XP desktops • Localised file storage for the Windows XP desktops, both private and shared between desktops within the Gateway • Ability to host application servers within the environment • Automated one-way transfer of data into the environment • Authorised limited transfer of data out of the environment
Objective 6 • 6) Scrutiny of data utilisation proposals • Collaboration Review System – applies to all uses • Information Governance Review Panel (IGRP) • British Medical Association • Public Health Wales • National Research Ethics Service • Informing Healthcare • Involving People
Objective 7 • 7) External verification of compliance with IG (Audit) • Important to: • Reassure DPOs and other partners • Gain recommendations for improvement • Conduct: • Policies and SOPs • Interviews • System verification
The SAIL system • [Diagram. Layers shown: Data Sources (National Datasets, NHS, Social care, Others); HSW anonymisation service; Masking and encryption; SAIL databank; Views; Access controls and Disclosure control (HIRU); Project View for Data Users • Governance shown: Project Request to IGRP; HIRU & IGRP SOPs and Policies]
Subsequent refinements • Role based access • Technical, Senior Analyst, Approved Analyst, User / statistician, HSW technical • SAIL Gateway • Uploading, tool selection, performance • Results out / approvals • Wiki, help, training materials, code of conduct, messaging • Tighter user agreements (line management sign off) & Clearer sanctions for misuse • Purpose-built virtual IGRP committee technology
Data • Data on 3 million people, ≈ 2 billion records, and growing! • Historical data 5 – 20 years • Maintains address history for full period (exposures) • Most codified using ICD, Read codes, OPCS codes, SNOMED, etc. • Many hundreds of separate data suppliers • Free text a real (IG) challenge • Unknown use of identifiers • Potential for ‘risky’ comments • Hard to analyse in quantity • Now a major work stream
National datasets - examples • PEDW - in-patients & day cases and out-patients • National Community Child Health Database • NHS Direct Wales • Cancer incidence registry for Wales • National screening programmes • Congenital abnormalities • Ambulance service data • National Pupil Database (performance and attendance of children at school) • And much more…
Local datasets - examples • General Practice • Pathology • A&E departments • Social services • Local authority housing data – RALFs • And more…
Research datasets • Data collected as part of research studies where the aim is to use routine data as well • Permissions, consent and regulatory approvals • Do not release SAIL data to researchers to link to study datasets • Treat as dataset from DPO – study dataset anonymised and loaded into SAIL for linkage with SAIL data
Clinical systems • Introduced new clinical systems to send data direct to SAIL (via standard mechanisms) • Working with NHS Wales to introduce new national systems • SAIL now central part of the NHS’s “secondary uses” approaches – new data from new national systems e.g. - radiology, pathology, emergency, etc.
Other advancements • Data collected directly from the people of Wales (and beyond) via internet portals. Currently disease cohort specific, moving to all-Wales • SAIL data now linked to local histopathology sample archive (tissue bank), with potential to link to national cancer (tissue) bank • Flow of imaging data (MRI, ECG, etc.) from local NHS providers • Set up of a public advocates group • Linkage of national (cross-sectional) surveys – consent issues • Genomics data under consideration (special IG issues!) • Increasingly used by NHS to monitor and plan services – change of use • Residential Linking Fields (RALFs) . . .
RALFs • Desire to know more about: • The properties people live in (characteristics, proximity to geographical features) • Who they live with (household relationships, familial relationships etc) • A real problem to do while maintaining anonymity • Our Solution: RALFs • An ALF has a RALF, all RALFs have 1+ ALFs (usually)
Residential Anonymous Linking Fields - RALFs • [Diagram. Components shown: HIRU, HSW, HIRU GIS, OS Data, WDS, Encrypt (twice), SAIL] • a. Create environment metrics • b. KEY and addresses with environment metrics • c. Match incoming address data and attach RALFs • d. RALFs and environment metrics • e. Combination of RALFs with ALFs
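Once each ALF carries a RALF, household questions can be answered without any address. The sketch below shows only that final step; it assumes one current RALF per ALF (the databank keeps address history, which is ignored here), and the function names and sample values are hypothetical.

```python
from collections import defaultdict


def households(alf_to_ralf):
    """Group anonymised people (ALFs) by anonymised residence (RALF)."""
    members = defaultdict(set)
    for alf, ralf in alf_to_ralf.items():
        members[ralf].add(alf)
    return dict(members)


def cohabitants(alf, alf_to_ralf):
    """Return the other ALFs sharing this ALF's RALF."""
    ralf = alf_to_ralf[alf]
    return {other for other, r in alf_to_ralf.items()
            if r == ralf and other != alf}


# Made-up identifiers for illustration only.
links = {"alf_1": "ralf_A", "alf_2": "ralf_A", "alf_3": "ralf_B"}
household_sizes = {ralf: len(alfs) for ralf, alfs in households(links).items()}
```

Environment metrics attached to a RALF (step d) could be joined to each person in the same way, by looking up the RALF held against their ALF.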
Summary • Privacy is not just about the individual – it sometimes relates to the organisation • Preserving privacy reduces research utility • Finding the balance between privacy protection and research utility is the key • There is no perfect balance
Thanks • Data providers - NHS organisations, local authorities and government agencies, and more • Health Solutions Wales • NHS Wales Informatics Service • National Institute for Social Care and Health Research • Welsh Government • Information Governance Review Panel • Researchers of Wales and beyond • And to you for listening!