Big Data in UK Biobank: Opportunities and Challenges Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh Governments, British Heart Foundation and Diabetes UK. Rory Collins UK Biobank Principal Investigator
Big Data in UK Biobank: Opportunities and Challenges. Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh Governments, British Heart Foundation and Diabetes UK. Rory Collins, UK Biobank Principal Investigator, BHF Professor of Medicine & Epidemiology, Nuffield Department of Population Health, University of Oxford, UK
UK Biobank Prospective Cohort • 500,000 UK men and women aged 40-69 years when recruited and assessed during 2006-2010 • Extensive baseline questions and measurements, with stored biological samples (and opportunities to add enhanced assessments in large subsets) • Repeat assessments over time in subsets of the participants to allow for sources of variation • General consent for follow-up through all health records and for all types of health research • Sufficiently large numbers of people developing different conditions to assess causes reliably
Need for prospective studies to be LARGE: CHD versus SBP for 5K vs 50K vs 500K people in the Prospective Studies Collaboration (PSC) [Figure: three panels (5000 people, 50,000 people, 500,000 people); horizontal axis: Usual SBP (mmHg), 120 to 180; vertical axis: doubling scale, 1 to 256; separate lines by age at risk, from 40-49 to 80-89 years]
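The slide makes its point graphically. As a rough numerical companion, the sketch below shows how the statistical uncertainty of an estimate shrinks across the three cohort sizes. It assumes, as a simplification not stated on the slide, that the standard error is proportional to 1/sqrt(n), which holds when the number of disease events grows in proportion to the number of people. The function name `relative_standard_error` is illustrative only.

```python
import math


def relative_standard_error(n_reference: int, n_cohort: int) -> float:
    """Standard error in a cohort of n_cohort people, relative to a cohort
    of n_reference people, ASSUMING standard error scales as 1/sqrt(n)."""
    return math.sqrt(n_reference / n_cohort)


# The three cohort sizes shown in the slide's panels.
cohort_sizes = [5_000, 50_000, 500_000]

for n in cohort_sizes:
    ratio = relative_standard_error(cohort_sizes[0], n)
    print(f"{n:>7,} people: standard error is {ratio:.2f}x that of 5,000 people")
```

Under this assumption a hundredfold increase in size narrows confidence intervals about tenfold, which is consistent with the age-specific lines becoming much smoother in the largest panel.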
Locations of UK Biobank assessment centres around the UK (with people recruited from urban and rural areas)
UK Biobank: 500,000 participants aged 40-69 recruited in 2007-10 Generalisability (not representativeness): Heterogeneity of study population allows associations with disease to be studied reliably
Production line baseline assessment visit (improved throughput; efficient staffing)
Baseline assessment: Questionnaire content

Self-completion: topics                         Median time (minutes)
Socio-demographics                                1.7
Ethnicity                                         0.1
Work-employment                                   1.4
Physical activity                                 4.4
Smoking (non-smokers)                             0.5
Smoking (past/current smokers)                    1.5
Diet (food frequency)*                            4.5
Alcohol                                           1.1
Sleep                                             1.2
Sun exposure                                      1.3
Environmental exposures                           1.0
Early life factors                                0.8
Family history of common diseases                 1.6
Reproductive history & screening (women)          2.4
Reproductive history & screening (men)            0.8
Sexual history                                    0.4
General health                                    2.1
Past medical history & medications                1.6
Noise exposure                                    1.0
Psychological status                              4.5
Cognitive function tests                         10.0
Hearing speech-in-noise test                      8.0
Total time                                       52.5

Interview: topics                               Median time (minutes)
Medical history/medication                        3.1
Occupation                                        0.4
Other                                             0.6
Total time                                        4.1

*Subset of 200,000 participants: repeated daily diet diaries conducted via the internet
Touchscreen and interview questions (plus extra enhancement questions) available at www.ukbiobank.ac.uk
Baseline assessment: Physical measurements (with enhanced measures in large subsets) • All 500,000 participants: Blood pressure & heart rate; Height (standing/seated); Waist/hip circumference; Weight/impedance; Spirometry; Heel ultrasound • Subset of 175,000 participants: Hearing test; Vascular reactivity • Subset of 120,000 participants: Visual acuity, refractive index & intraocular pressure • Subset of 85,000 participants: Retinal images & optical coherence tomograms; Fitness test & ECG limb leads
UK Biobank different types of biological sample: allowing a wide range of different assays
Further enhancements of the phenotyping of UK Biobank participants currently being conducted • Web-based assessments of diet completed
Web-based dietary assessment: 24-hr recall • Design considerations: • Easy and quick: takes only 10-15 minutes • Automated data collection and coding • Repeatable (capturing seasonal variation) • Detailed enough to estimate nutrient intake • Over 200,000 participants completed the questionnaire at least once, and about 90,000 did so more than once
Future web-based assessments for exposures • Cognitive function • Repeat assessment of baseline measures • Broaden cognitive phenotyping with new measures • Complements enhanced cognitive function assessment that is planned for the imaging assessment visit • Occupational history • Information about all previous occupations (not just latest) • Greater detail on type of work and duration • Physical activity questionnaire (RPAQ) • Complement data from activity monitor
Further enhancements of the phenotyping of UK Biobank participants currently being conducted • Web-based assessments of diet completed; and next to be cognition/mental health (2014) • Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
UK Biobank wrist-worn accelerometer • ~45% of participants agree to wear one • Willing participants sent device by mail • It is to be worn continuously for 7 days • Returned by mail and data downloaded • Device cleaned and sent to next participant • 100K participants from mid-2013 to mid-2015 (50,000 complete data-sets already obtained)
Further enhancements of the phenotyping of UK Biobank participants currently being conducted • Web-based assessments of diet completed; and next to be cognition/mental health (2014) • Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15) • Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)
Genotyping of all UK Biobank participants • 820K bespoke UK Biobank Affymetrix genotyping chip: • 250,000 SNPs in a whole-genome array • 200,000 markers for known risk factor or disease associations, copy number variation, loss of function, and insertions/deletions • 150,000 exome markers for high proportion of non-synonymous coding variants with allele frequency over 0.02% • Estimate (“impute”) additional genotypes by combining measured genotypes with reference sequence data • Researchers can study associations of genotype data with biochemical risk factors and detailed phenotyping from baseline assessment, along with health outcomes
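To give a sense of what the 0.02% allele-frequency threshold means at this cohort size, here is a minimal sketch of the expected number of carriers among 500,000 participants. It assumes Hardy-Weinberg proportions, which the slide does not mention, and the function name `expected_carriers` is illustrative only.

```python
def expected_carriers(n_people: int, allele_freq: float) -> float:
    """Expected number of people carrying at least one copy of an allele,
    ASSUMING Hardy-Weinberg proportions: P(carrier) = 1 - (1 - p)^2."""
    return n_people * (1.0 - (1.0 - allele_freq) ** 2)


n_participants = 500_000   # cohort size from the slides
threshold_freq = 0.0002    # 0.02% allele frequency, the exome-marker threshold

carriers = expected_carriers(n_participants, threshold_freq)
print(f"Expected carriers at the threshold frequency: about {carriers:.0f}")
```

Under these assumptions a variant right at the threshold would be carried by roughly 200 participants, so even the rarest variants on the chip can be expected in a few hundred people in the full cohort.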
Further enhancements of the phenotyping of UK Biobank participants currently being conducted • Web-based assessments of diet completed; and next to be cognition/mental health (2014) • Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15) • Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15) • Standard panel of assays (e.g. lipids; clotting) on samples from all participants (2014-15)
Rationale for assaying many standard markers in baseline samples from all 500,000 participants • Cost-effective way of increasing the usability of the resource for researchers, by providing data for: • Cross-sectional analyses with prevalent disease • Identification of subsets based on assay values • Conducting these assays in all of the participants at the same time should facilitate good quality control • Lower cost for conducting all of these assays at one time rather than in multiple retrievals and assays • Facilitates management of depletable samples
Consideration of a proposal to conduct assays of biomarkers of infectious disease in all participants • Request from the international research community to facilitate studies of the associations of infectious agents with disease (in particular, different types of cancer) • Plan would be to assay a panel of infectious agents (e.g. HPV, Hepatitis B & C, HBV, EBV, H. pylori) in the baseline sample collected from all 500,000 participants • As with the biochemical and genetic assays that are being conducted, assays of a wide range of infectious agents would increase the efficient use of the resource • Detailed proposal for funding is now being developed
Further enhancements of the phenotyping of UK Biobank participants currently being conducted • Web-based assessments of diet completed; and next to be cognition/mental health (2014) • Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15) • Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15) • Standard panel of assays (e.g. lipids; clotting) on samples from all participants (2014-15) • Information from multiple imaging modalities (e.g. brain/heart/body MRI; bone/joint DEXA)
Imaging of 100,000 UK Biobank participants • MRI of brain, heart and abdomen • DEXA of bones, joints and body • Ultrasound of carotid arteries • Shortened baseline assessment plus more detailed cognitive function tests and ECG to detect rhythm disturbances Pilot phase: 4-6,000 people in 1 centre (2014-15) Main phase: 95,000 people in 3 centres (2015-19) Opportunities for repeat imaging in sub-sets (e.g. as part of MRC’s focus on dementia)
Body Mass Index (BMI) vs Heart Disease and Stroke (PSC: 1M people followed for 12 years; Lancet 2009) • At BMI >25: 5 units higher BMI associated with ~40% higher IHD & stroke mortality • At BMI <25: positive association continues for IHD, but not for stroke [Figure: annual deaths per 1000 (floated so mean = PSC rates at age 65-69; doubling scale, 10 to 160) against baseline BMI (kg/m², 15 to 50), with separate lines for heart disease (18 237 deaths) and stroke (6122 deaths)] Adjusted for age, sex, smoking & study; first 5 years of follow-up excluded
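The "~40% higher mortality per 5 units" figure can be rescaled to other BMI differences. The sketch below assumes a log-linear (constant proportional) association over the range above BMI 25, which is a modelling assumption for illustration rather than something the slide spells out. The function name `mortality_ratio` and its default value are hypothetical.

```python
def mortality_ratio(bmi_difference: float, ratio_per_5_units: float = 1.4) -> float:
    """Mortality ratio for a given BMI difference, ASSUMING a log-linear
    association, so the ratio per 5 units compounds multiplicatively."""
    return ratio_per_5_units ** (bmi_difference / 5.0)


# Illustrative BMI differences, all taken to lie within the range above BMI 25.
for difference in (1, 5, 10):
    ratio = mortality_ratio(difference)
    print(f"{difference:>2} units higher BMI: mortality ratio about {ratio:.2f}")
```

On this reading, 1 unit higher BMI corresponds to roughly 7% higher mortality and 10 units to nearly double. It should not be applied below BMI 25, where the slide notes that the association differs between IHD and stroke.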
Similar age, gender, BMI & % body fat, but different amounts of INTERNAL FAT: 5.86 litres of internal fat vs 1.65 litres of internal fat
Atrial fibrillation (AF): prevalence and mortality during the period between 1993 and 2007 • Prevalence: increasing • Mortality: little change (Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012)
Consideration of prolonged cardiac monitoring • Cardiac arrhythmias (especially AF) • can indicate significant underlying cardiac disease • can directly cause significant morbidity and mortality • important risk factors for cardio-embolic events (esp. stroke) • Detection requires prolonged monitoring • many are intermittent (e.g. paroxysmal AF) • substantial under-detection with standard 12 lead ECG • AF increases with age (<50 years: <1%; >80 years: 10%+) • No large-scale population-based prospective studies with prolonged monitoring, so the full extent/impact of AF on health outcomes is likely to have been underestimated
Example of device for prolonged arrhythmia detection: iRhythm Zio Patch • Has been used in 18,000 people • Non-invasive stick-on patch • Comfortable (median wear 12 days) • Can be applied in clinic or at home • Beat-to-beat ECG recording • Validated against reference Holter • Potentially recyclable device chip which stores data for downloading Planning to pilot feasibility and acceptability during imaging pilot
UK Biobank: Centralised follow-up of health • Death and cancer registries • In-patient and out-patient hospital episodes (including psychiatric) and related procedure registries • Primary care records of health conditions, prescriptions, diagnostic tests and other investigations • Other health-related: disease registries; dispensing records; imaging; screening; dental records • Direct to participants: self-reported medical conditions; treatments actually being taken; degree of functional impairment; cognitive and psychological scores
Health outcome data-linkage challenges • Regulation, bureaucracy, and permissions (despite explicit consent from participants) • Data transfer, matching and coding queries • Understanding different data structures • Mapping between coding systems • Mapping between different countries • Presenting outcome data to researchers • Original outcome codes • Post-adjudication outcomes
Progress with UK-wide linkage to outcome data (both before and after baseline assessment)
Meaning of coded data from health records • What do the coded data actually tell us? • Characteristics of coded data • How accurate? • How detailed? • How complete? • Do we need to go beyond the coded data?
UK Biobank: Expected numbers of participants developing diseases during long-term follow-up
General strategy for outcome adjudication • Avoid false positive cases (but tolerate some false negatives) • Geographical generalisability • Cost-effectiveness • Future-proofed • Scalability • Staged approach: • Ascertain • Confirm • Classify
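The slide does not give its reasoning for preferring missed cases over false positives. One standard epidemiological argument, offered here as an assumption, is that when outcome misclassification is unrelated to exposure, missed cases (imperfect sensitivity) leave a risk ratio unchanged while false positives (imperfect specificity) pull it towards 1. The sketch below illustrates this with hypothetical risks of 2% in exposed and 1% in unexposed people; all names and numbers are illustrative.

```python
def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Apparent disease risk after misclassification: true cases that are
    detected plus non-cases wrongly labelled as cases."""
    return sensitivity * true_risk + (1.0 - specificity) * (1.0 - true_risk)


def observed_risk_ratio(risk_exposed: float, risk_unexposed: float,
                        sensitivity: float, specificity: float) -> float:
    """Risk ratio seen when the same (non-differential) sensitivity and
    specificity apply to exposed and unexposed people."""
    return (observed_risk(risk_exposed, sensitivity, specificity)
            / observed_risk(risk_unexposed, sensitivity, specificity))


# Hypothetical true risks: 2% in exposed, 1% in unexposed (true risk ratio 2.0).
missed_cases_only = observed_risk_ratio(0.02, 0.01, sensitivity=0.80, specificity=1.00)
false_positives_only = observed_risk_ratio(0.02, 0.01, sensitivity=1.00, specificity=0.98)

print(f"20% of cases missed, no false positives: ratio {missed_cases_only:.2f}")
print(f"No cases missed, 2% false positives:     ratio {false_positives_only:.2f}")
```

In this illustration missing a fifth of cases leaves the ratio at 2.0, whereas a 2% false-positive rate reduces it to about 1.33. Missed cases still cost statistical power, which is one reason a very large cohort can afford to tolerate them.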
Expert Working Groups developing protocols for ascertainment, confirmation and classification • Cancer • Diabetes • Cardiac outcomes • Stroke • Mental health outcomes • Ocular outcomes • Neurodegenerative outcomes • Respiratory outcomes • Musculoskeletal outcomes Pilots progressing well; preparing for scaling up of algorithms and then for web adjudication Pilots commencing Pilots being developed
UK Biobank: Principles of Access • UK Biobank is available to all bona fide researchers for all types of health-related research that is in public interest • No preferential or exclusive access (and, in particular, access does not involve “collaboration” with UK Biobank) • Researchers have to pay for access to the Resource for their proposed research on a cost-recovery basis only • Access to the biological samples that are limited and depletable will be carefully controlled and coordinated • Researchers are required to publish their findings and return the data so that other researchers can use them
“Showcase”: e-catalogue of data items currently in the UK Biobank Resource (www.ukbiobank.ac.uk)
Showcase supports search strategies for data items in the UK Biobank Resource
Preliminary applications subdivided by type of researcher, location and type of research
What makes UK Biobank special? • PROSPECTIVE: It can assess the full effects of a particular exposure (such as smoking) on all types of health outcome (such as cancer, vascular disease, lung disease, dementia) • DETAILED: The wide range of questions, measures and samples at baseline allows good assessment of exposures, and outcome adjudication allows good disease classification • BIG: Inclusion of large number of participants allows reliable assessment of the causes of a wide range of diseases, and of the combined impact of many different exposures Unique combination of BREADTH and DEPTH