
Data quality/usability and population-based biobanks

Data quality/usability and population-based biobanks. Paul Burton, Dept of Health Sciences, Dept of Genetics, University of Leicester. Structure of talk. Why does data quality/usability matter? UK Biobank as an illustration


Presentation Transcript


  1. Data quality/usability and population-based biobanks • Paul Burton • Dept of Health Sciences, Dept of Genetics, University of Leicester

  2. Structure of talk • Why does data quality/usability matter? • UK Biobank as an illustration • Statistical power of nested case-control studies • Expected event rates in UK Biobank • Biobank harmonisation • Conclusions

  3. Why does data quality/usability matter?

  4. Epidemiological analysis at its simplest • Odds ratio (OR) = (120*240)/(200*100) = 1.44 [1.04 – 2.0] • May also adjust for a confounder • e.g. high saturated fat intake [y/n] • What is the impact of error in an outcome or an explanatory variable or in a confounder?
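The odds ratio and interval on this slide can be reproduced from the four cell counts. The sketch below is a minimal illustration, assuming the bracketed interval is a 95% Woolf (log-scale) confidence interval, which the slide does not state. The cell labels are inferred from the formula, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    Assumption: z=1.96 gives a 95% interval; the slide does not name the method.
    """
    or_hat = (exp_cases * unexp_controls) / (exp_controls * unexp_cases)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log = math.sqrt(1 / exp_cases + 1 / unexp_cases
                       + 1 / exp_controls + 1 / unexp_controls)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Cell counts read off the slide's formula (120*240)/(200*100).
or_hat, lower, upper = odds_ratio_ci(120, 100, 200, 240)
print(f"OR = {or_hat:.2f} [{lower:.2f} - {upper:.2f}]")  # OR = 1.44 [1.04 - 1.99]
```

Under this assumption the upper limit comes out at 1.99, which matches the slide's 2.0 to one decimal place.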

  5. Systematic error • Some disease-free smokers deny smoking • Odds ratio (OR) = (120*250)/(190*100) = 1.58

  6. Random error • At random, 10% of subjects state their exposure incorrectly • Odds ratio (OR) = (118*236)/(204*102) = 1.34
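The counts on the systematic-error and random-error slides follow from the original 2x2 table by simple bookkeeping. The sketch below is an illustration under stated assumptions: the figure of 10 smokers who deny smoking is inferred from the change in the control counts (200 to 190), the 10% random error is applied as an expected value to every group, and the function names are hypothetical.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Cross-product odds ratio from a 2x2 exposure-by-disease table."""
    return (exp_cases * unexp_controls) / (exp_controls * unexp_cases)

def misreport(exposed, unexposed, error_rate):
    """Expected counts when a fraction error_rate of each group misstates exposure."""
    seen_exposed = exposed * (1 - error_rate) + unexposed * error_rate
    seen_unexposed = unexposed * (1 - error_rate) + exposed * error_rate
    return seen_exposed, seen_unexposed

# No error: the table from the "analysis at its simplest" slide.
true_or = odds_ratio(120, 100, 200, 240)

# Systematic error: 10 disease-free smokers (inferred) report as non-smokers.
systematic_or = odds_ratio(120, 100, 200 - 10, 240 + 10)

# Random error: 10% of cases and of controls state their exposure incorrectly.
case_exp, case_unexp = misreport(120, 100, 0.10)   # 118, 102
ctrl_exp, ctrl_unexp = misreport(200, 240, 0.10)   # 204, 236
random_or = odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp)

print(round(true_or, 2), round(systematic_or, 2), round(random_or, 2))  # 1.44 1.58 1.34
```

The systematic error pushes the estimate away from the error-free value of 1.44, while the random error shrinks it towards 1.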

  7. The impact of errors • Systematic errors in outcome or explanatory variables → systematic bias in either direction • True OR = 2 → estimated OR = e.g. 1.5 or 2.7 • Random errors in binary outcomes or any explanatory variables → shrinkage bias • True OR = 2 → estimated OR = e.g. 1.5 • Random errors in confounding variables → systematic bias in either direction • True OR = 2 → estimated OR = e.g. 1.5 or 2.7

  8. Errors in biobanks • Random errors • Loss of power is primary problem • Biobank sample sizes very large, so why is there a problem?

  9. Errors in biobanks • Random errors • But: why are biobank sample sizes so large? • NB: biobanks are very large, but the nested case-control studies within them are not • Need to detect small relative risks (e.g. OR = 1.3) • Power generally limited (see later) • Effects of small errors can be catastrophic • Apparent causal effects easily created or destroyed

  10. Errors in biobanks • Systematic errors • Small real effects a major issue again • Must understand data collection protocols, and must attempt to optimise those protocols • UK Biobank • P3G Observatory

  11. What is UK Biobank?

  12. Basic design features • A prospective cohort study • 500,000 adults across UK • Middle aged (40-69 years) • A population-based biobank • Not disease or exposure based • Recruitment via electronic GP lists • “Broad spectrum” not “fully representative” • Individuals not families • MRC, Wellcome Trust, DH, Scottish Executive • £61M

  13. Basic design features • Longitudinal health tracking • Nested case-control studies • Long time-horizon • Owned by the Nation • Central Administration – Manchester • PI: Prof Rory Collins - Oxford • 6 collaborating groups (RCCs) of university scientists

  14. Statistical power and sample size

  15. Focus on power of nested case-control analyses • Likely to be very common analyses • Power limiting

  16. Issues that are often ignored in standard power calculations • Multiple testing/low prior probability of association* • Interactions* • Unobserved frailty • Misclassification* • Genotype • Environmental determinant • Case-control status • Subgroup analyses* • Population substructure

  17. Power calculations • Work with least powerful setting • Binary disease, binary genotype, binary environmental exposure • Logistic regression analysis; interactions = departure from a multiplicative model • Complexity (arbitrary but reasonable)

  18. Summarise power using “Minimum Detectable Odds Ratios” (MDORs) calculated by ‘iterative simulation’ • Estimate minimum ORs detectable with 80% power at stated level of statistical significance under specified scenario

  19. Genetic main effects

  20. Whole genome scan • Genetic main effect, p < 10^-7

  21. Gene:environment interaction • 20,000 cases

  22. Summary – rule of thumb • 80% power for genotype frequency = 0.1 (allele frequency ≈ 0.05 under dominant model) • Genetic main effect OR 1.5, p = 10^-4: 5,000 cases • Genetic main effect OR 1.3, p = 10^-4: 10,000 cases • Genetic main effect OR 1.2, p = 10^-4: 20,000 cases • Genetic main effect OR 1.4, p = 10^-7: 10,000 cases • Genetic main effect OR 1.3, p = 10^-7: 20,000 cases • G:E interaction OR 2.0 (environmental exposure prevalence = 0.2), p = 10^-4: 20,000 cases
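The link between the genotype frequency of 0.1 and the allele frequency of about 0.05 can be checked directly. The sketch below assumes Hardy-Weinberg equilibrium (not stated on the slide), so that under a dominant model the at-risk genotype frequency is 1 - (1 - q)^2 for risk-allele frequency q. The function name is hypothetical.

```python
import math

def allele_freq_from_carrier_freq(carrier_freq):
    """Risk-allele frequency q solving 1 - (1 - q)**2 == carrier_freq.

    Assumption: Hardy-Weinberg equilibrium and a dominant model, so every
    carrier of at least one risk allele has the at-risk genotype.
    """
    return 1 - math.sqrt(1 - carrier_freq)

q = allele_freq_from_carrier_freq(0.1)
print(round(q, 4))  # 0.0513
```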

  23. Effect of realistic data errors

  24. Expected event rates in UK Biobank

  25. Taking account of • Age range at recruitment 40-69 years • Recruitment over 5 years • All cause mortality • Disease incidence (“healthy cohort effect”) • Migration overseas • Comprehensive withdrawal (max 1/500 p.a.)

  26. No need to contact subjects

  27. Smaller sample sizes

  28. Interim conclusions • Having taken account of realistic bioclinical complexity, UK Biobank is just large enough to be of great value as a stand-alone research infrastructure • Data quality, in particular errors in outcome or explanatory variables or in confounders, is crucial • Its value will be greatly augmented if it proves possible to set up a coherent and scientifically harmonised international network of biobanks and large cohort studies

  29. Harmonising biobanks internationally

  30. Why harmonise? • Basic aim is to enable and promote data pooling, in a manner that recognises and takes appropriate account of systematic differences between studies.

  31. Why harmonise? • Investigate less common (but not rare) conditions • UKBB: stomach cancer, 2,500 cases in 29 years • 6 UKBB equivalents: 10,000 cases in 20 years • Investigate smaller ORs • Genetic main effect (GME) 1.5 → 1.2 requires 5,000 → 20,000 cases • 4 UKBB equivalents • Analysis based on subsets – homogeneous classes of phenotype, or e.g. by sex

  32. Why harmonise? • Earlier analyses • UKBB: Alzheimer's disease, 10,000 cases in 18 yrs • 5 UKBB equivalents → 9 years • Events at younger ages • Broad range of environmental exposures • Aim for 4-6 UKBB equivalents • 2M – 3M recruits

  33. Harmonisation initiatives • Public Population Program in Genomics (P3G) • Canada + Europe • Tom Hudson, Bartha Knoppers, Leena Peltonen, Isabel Fortier ….. • Population Biobanks • FP6 Co-ordination Action (PHOEBE – Promoting Harmonisation Of Epidemiological Biobanks in Europe) • Camilla Stoltenberg, Paul Burton, Leena Peltonen, George Davey Smith …..

  34. Harmonisation in the P3G Observatory (from Isabel Fortier) • Description • Comparison • Harmonisation • Data quality crucial at every stage

  35. Final conclusions • Power of individual biobanks is limited • Minimisation of measurement error is crucial • Harmonisation is crucial if we are to optimise the value of biobanks internationally • Harmonisation depends on a full understanding of all aspects of data quality

  36. Extra slides

  37. Rarer genotypes • Genetic main effects

  38. Gene:environment interaction • 10,000 cases

  39. Hattersley AT, McCarthy MI. A question of standards: what makes a good genetic association study? Lancet 2005; in press.

  40. Summarise power using MDORs calculated by ‘iterative simulation’ • Want minimum ORs detectable with 80% power at stated level of statistical significance • 1. Guess starting values for ORs • 2. Simulate population under specified scenario • 3. Sample required number of cases and controls • 4. Analyse resultant case-control study in standard way • 5. Repeat steps 2-4 1,000 times • 6. Use empirical statistical power results from the 1,000 analyses to update ORs to new values expected to generate a power of 80% • Repeat steps 2-6 until all ORs have 80% power
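The iterative simulation listed on this slide can be sketched in a much reduced form. This is an illustration under several assumptions, not the procedure used for the talk: it covers a single binary exposure with no misclassification or interaction; it draws exposure directly within cases and controls instead of simulating a full population (merging steps 2 and 3); it analyses each study with a Wald test on the log odds ratio rather than logistic regression; it uses bisection as the update rule in step 6, which the slide does not specify; and it runs 200 replicates per update rather than 1,000 to keep the run time short. All function names, defaults and the search bracket are hypothetical.

```python
import math
import random

def simulated_power(odds_ratio, n_cases, n_controls, exposure_prev,
                    z_crit, n_reps, rng):
    """Share of simulated case-control studies whose Wald test rejects OR = 1."""
    odds_controls = exposure_prev / (1 - exposure_prev)
    odds_cases = odds_ratio * odds_controls          # exposure odds among cases
    prev_cases = odds_cases / (1 + odds_cases)
    rejections = 0
    for _ in range(n_reps):
        a = sum(rng.random() < prev_cases for _ in range(n_cases))        # exposed cases
        c = sum(rng.random() < exposure_prev for _ in range(n_controls))  # exposed controls
        b, d = n_cases - a, n_controls - c                                # unexposed
        if min(a, b, c, d) == 0:
            continue  # log OR undefined; counted as a non-rejection
        z = math.log((a * d) / (b * c)) / math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        rejections += abs(z) > z_crit
    return rejections / n_reps

def minimum_detectable_or(n_cases, n_controls, exposure_prev, z_crit=1.96,
                          target_power=0.8, n_reps=200, n_updates=10, seed=1):
    """Bisection search (an assumed update rule) for the OR giving target power."""
    rng = random.Random(seed)
    low, high = 1.0, 5.0  # assumed bracket containing the MDOR
    for _ in range(n_updates):
        guess = math.sqrt(low * high)  # midpoint on the log-OR scale
        power = simulated_power(guess, n_cases, n_controls, exposure_prev,
                                z_crit, n_reps, rng)
        if power < target_power:
            low = guess    # too little power: the MDOR is larger
        else:
            high = guess   # enough power: the MDOR is no larger than this
    return math.sqrt(low * high)

# Hypothetical small scenario: 500 cases, 500 controls, exposure prevalence 0.1,
# two-sided 5% significance (far less strict than the thresholds in the talk).
mdor = minimum_detectable_or(500, 500, 0.1)
print(round(mdor, 2))
```

Because the power estimates are themselves noisy, the result varies a little with the seed and the number of replicates; more replicates per update tighten it at the cost of run time.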

  41. Taking account of • Age range at recruitment 40-69 years • Recruitment over 5 years • All cause mortality • Disease incidence (“healthy cohort effect”) • Migration overseas • Comprehensive withdrawal (max 1/500 p.a.) • Partial withdrawal (c.f. 1958 Birth Cohort)

  42. Necessary to contact subjects
