Data quality/usability and population-based biobanks Paul Burton Dept of Health Sciences Dept of Genetics University of Leicester
Structure of talk • Why does data quality/usability matter? • UK Biobank as an illustration • Statistical power of nested case-control studies • Expected event rates in UK Biobank • Biobank harmonisation • Conclusions
Epidemiological analysis at its simplest • Odds ratio (OR) = (120*240)/(200*100) = 1.44 [1.04 – 2.0] • May also adjust for a confounder • e.g. high saturated fat intake [y/n] • What is the impact of error in an outcome or an explanatory variable or in a confounder?
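The odds ratio and interval on this slide can be reproduced with a short calculation. This is a sketch under two assumptions: the 2x2 table itself was an image and is not in the transcript, so the cell layout (120 exposed cases, 200 exposed controls, 100 unexposed cases, 240 unexposed controls) is inferred from the formula, and the interval is computed with the standard Woolf (log-scale Wald) method, which the slide does not name. The function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-scale Wald) confidence interval.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    The cell layout is an assumption inferred from the slide's formula.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


or_, lo, hi = odds_ratio_ci(120, 200, 100, 240)
print(f"OR = {or_:.2f} [{lo:.2f}, {hi:.2f}]")  # OR = 1.44 [1.04, 1.99]
```

The upper limit of 1.99 agrees with the slide's rounded 2.0.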
Systematic error • Some disease free smokers deny smoking • Odds ratio (OR) = (120*250)/(190*100) = 1.58
Random error • At random, 10% of subjects state their exposure incorrectly • Odds ratio (OR) = (118*236)/(204*102) = 1.34
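Both error examples can be checked by applying misreporting rates to the expected cell counts. This is a sketch, not the author's calculation: the cell layout (cases 120 exposed / 100 unexposed, controls 200 exposed / 240 unexposed) is inferred from the odds-ratio formulas, the 5% denial rate among disease-free smokers is back-calculated from the slide's counts (10 of 200), and the helper names are illustrative.

```python
def odds_ratio(exp_cases, exp_controls, unexp_cases, unexp_controls):
    """Cross-product odds ratio for a 2x2 table."""
    return (exp_cases * unexp_controls) / (exp_controls * unexp_cases)


def misclassify(exposed, unexposed, p_exp_to_unexp, p_unexp_to_exp):
    """Expected counts after a fraction of each group reports the wrong exposure."""
    new_exposed = exposed * (1 - p_exp_to_unexp) + unexposed * p_unexp_to_exp
    new_unexposed = unexposed * (1 - p_unexp_to_exp) + exposed * p_exp_to_unexp
    return new_exposed, new_unexposed


# Assumed true table: (exposed, unexposed) counts.
cases = (120, 100)
controls = (200, 240)

# Systematic error: only disease-free smokers misreport (assumed 5%, i.e. 10 of 200).
sys_controls = misclassify(*controls, 0.05, 0.0)
systematic_or = odds_ratio(cases[0], sys_controls[0], cases[1], sys_controls[1])

# Random error: 10% of all subjects, cases and controls alike, misreport.
rand_cases = misclassify(*cases, 0.10, 0.10)
rand_controls = misclassify(*controls, 0.10, 0.10)
random_or = odds_ratio(rand_cases[0], rand_controls[0], rand_cases[1], rand_controls[1])

print(f"systematic: {systematic_or:.2f}, random: {random_or:.2f}")
# systematic: 1.58, random: 1.34
```

Against the error-free OR of 1.44, the differential error inflates the estimate while the non-differential error shrinks it towards 1.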
The impact of errors • Systematic errors in outcome or explanatory variables → systematic bias in either direction • True OR = 2 → estimated OR = e.g. 1.5 or 2.7 • Random errors in binary outcomes or any explanatory variables → shrinkage bias • True OR = 2 → estimated OR = e.g. 1.5 • Random errors in confounding variables → systematic bias in either direction • True OR = 2 → estimated OR = e.g. 1.5 or 2.7
Errors in biobanks • Random errors • Loss of power is primary problem • Biobank sample sizes very large, so why is there a problem?
Errors in biobanks • Random errors • But: why are biobank sample sizes so large? • NB Biobanks are very large; the nested case-control studies within them are not • Need to detect small relative risks (e.g. OR = 1.3) • Power generally limited (see later) • Small error effects can be catastrophic • Apparent causal effects easily created or destroyed
Errors in biobanks • Systematic errors • Small real effects a major issue again • Must understand data collection protocols, and must attempt to optimise those protocols • UK Biobank • P3G Observatory
Basic design features • A prospective cohort study • 500,000 adults across UK • Middle aged (40-69 years) • A population-based biobank • Not disease or exposure based • Recruitment via electronic GP lists • “Broad spectrum” not “fully representative” • Individuals not families • MRC, Wellcome Trust, DH, Scottish Executive • £61M
Basic design features • Longitudinal health tracking • Nested case-control studies • Long time-horizon • Owned by the Nation • Central Administration – Manchester • PI: Prof Rory Collins - Oxford • 6 collaborating groups (RCCs) of university scientists
Focus on power of nested case-control analyses • Likely to be very common analyses • Power limiting
Issues that are often ignored in standard power calculations • Multiple testing/low prior probability of association* • Interactions* • Unobserved frailty • Misclassification* • Genotype • Environmental determinant • Case-control status • Subgroup analyses* • Population substructure
Power calculations • Work with least powerful setting • Binary disease, binary genotype, binary environmental exposure • Logistic regression analysis; interactions = departure from a multiplicative model • Complexity (arbitrary but reasonable)
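Written out, the setting described above corresponds to the following logistic model. This is a sketch: the symbols D (disease), G (genotype), E (environmental exposure) and the β coefficients are notation introduced here, not taken from the slides.

```latex
\operatorname{logit} P(D = 1 \mid G, E)
  = \beta_0 + \beta_G\, G + \beta_E\, E + \beta_{GE}\, G E,
\qquad G, E \in \{0, 1\}

\mathrm{OR}_{G=1,\,E=1}
  = e^{\beta_G} \cdot e^{\beta_E} \cdot e^{\beta_{GE}}
```

Under a purely multiplicative model β_GE = 0 and the joint odds ratio is the product of the two main-effect odds ratios; the interaction odds ratio exp(β_GE) measures the departure from that product.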
Summarise power using “Minimum Detectable Odds Ratios” (MDORs) calculated by ‘iterative simulation’ • Estimate minimum ORs detectable with 80% power at stated level of statistical significance under specified scenario
Whole genome scan • Genetic main effect, p < 10⁻⁷
Gene:environment interaction • 20,000 cases
Summary – rule of thumb • 80% power for genotype frequency = 0.1 (allele frequency 0.05 under dominant model) • Genetic main effect 1.5, p = 10⁻⁴ → 5,000 cases • Genetic main effect 1.3, p = 10⁻⁴ → 10,000 cases • Genetic main effect 1.2, p = 10⁻⁴ → 20,000 cases • Genetic main effect 1.4, p = 10⁻⁷ → 10,000 cases • Genetic main effect 1.3, p = 10⁻⁷ → 20,000 cases • G:E interaction 2.0 with environmental exposure prevalence = 0.2, p = 10⁻⁴ → 20,000 cases
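The pairing of genotype frequency 0.1 with allele frequency 0.05 can be checked directly. The sketch below assumes Hardy-Weinberg proportions, which the slide does not state, and counts anyone carrying at least one copy of the allele as exposed (the dominant model); the function name is illustrative.

```python
def carrier_frequency(allele_freq):
    """Probability of carrying at least one copy of the allele (dominant model),
    assuming Hardy-Weinberg proportions."""
    return 1 - (1 - allele_freq) ** 2


print(round(carrier_frequency(0.05), 4))  # 0.0975, i.e. roughly 0.1
```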
Taking account of • Age range at recruitment 40-69 years • Recruitment over 5 years • All cause mortality • Disease incidence (“healthy cohort effect”) • Migration overseas • Comprehensive withdrawal (max 1/500 p.a.)
Interim conclusions • Having taken account of realistic bioclinical complexity, UK Biobank is just large enough to be of great value as a stand-alone research infrastructure • Data quality, in particular errors in outcome or explanatory variables or in confounders, is crucial • Its value will be greatly augmented if it proves possible to set up a coherent and scientifically harmonised international network of biobanks and large cohort studies
Why harmonise? • Basic aim is to enable and promote data pooling, in a manner that recognises and takes appropriate account of systematic differences between studies.
Why harmonise? • Investigate less common (but not rare) conditions • UKBB: stomach cancer, 2,500 cases in 29 years • 6 UKBB equivalents: 10,000 cases in 20 years • Investigate smaller ORs • Genetic main effect 1.5 → 1.2 requires 5,000 → 20,000 cases • 4 UKBB equivalents • Analysis based on subsets – homogeneous classes of phenotype, or e.g. by sex
Why harmonise? • Earlier analyses • UKBB: Alzheimer's disease, 10,000 cases in 18 years • 5 UKBB equivalents → 9 years • Events at younger ages • Broad range of environmental exposures • Aim for 4-6 UKBB equivalents • 2M – 3M recruits
Harmonisation initiatives • Public Population Project in Genomics (P3G) • Canada + Europe • Tom Hudson, Bartha Knoppers, Leena Peltonen, Isabel Fortier ….. • Population Biobanks • FP6 Co-ordination Action (PHOEBE – Promoting Harmonisation Of Epidemiological Biobanks in Europe) • Camilla Stoltenberg, Paul Burton, Leena Peltonen, George Davey Smith …..
Harmonisation in the P3G Observatory (from Isabel Fortier) • Description • Comparison • Harmonisation • Data quality crucial at every stage
Final conclusions • Power of individual biobanks is limited • Minimisation of measurement error is crucial • Harmonisation is crucial if we are to optimise the value of biobanks internationally • Harmonisation depends on a full understanding of all aspects of data quality
Rarer genotypes • Genetic main effects
Gene:environment interaction • 10,000 cases
Hattersley AT, McCarthy MI. A question of standards: what makes a good genetic association study? Lancet 2005; in press.
Summarise power using MDORs calculated by ‘iterative simulation’ • Want minimum ORs detectable with 80% power at stated level of statistical significance • 1. Guess starting values for ORs • 2. Simulate population under specified scenario • 3. Sample required number of cases and controls • 4. Analyse resultant case-control study in standard way • 5. Repeat 2,3,4 1,000 times • 6. Use empirical statistical power results from the 1,000 analyses to update ORs to new values expected to generate a power of 80% • Repeat 2-6 till all ORs have 80% power
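The iterative simulation steps above can be sketched in a few lines. This is a simplified illustration, not the code behind the slides. It handles a single binary genotype with no interaction, misclassification or frailty. It draws genotypes directly for cases and controls rather than simulating a full population first. It analyses each replicate with a Wald test on the log odds ratio instead of a fitted logistic regression. The update rule (rescaling log OR by the ratio of target to observed standardised effect) is an assumption chosen for simplicity. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_power(odds_ratio, n_cases, n_controls, geno_freq, alpha, n_reps, rng):
    """Empirical power of a two-sided Wald test on log(OR) over n_reps replicates."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # geno_freq is treated as the genotype frequency among controls;
    # the frequency among cases follows from the odds ratio.
    p_case = odds_ratio * geno_freq / (1 - geno_freq + odds_ratio * geno_freq)
    hits = 0
    for _ in range(n_reps):
        a = sum(rng.random() < p_case for _ in range(n_cases))          # exposed cases
        b = sum(rng.random() < geno_freq for _ in range(n_controls))    # exposed controls
        c, d = n_cases - a, n_controls - b
        # Add 0.5 to every cell so that an empty cell cannot break the logs.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        log_or = math.log(a * d / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(log_or / se) > z_crit:
            hits += 1
    return hits / n_reps


def minimum_detectable_or(n_cases, n_controls, geno_freq=0.1, alpha=1e-4,
                          target_power=0.8, n_reps=200, start_or=1.5,
                          max_iter=20, tol=0.03, seed=1):
    """Iterate: simulate, measure power, update the OR towards the target power."""
    rng = random.Random(seed)
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z_target = z_crit + nd.inv_cdf(target_power)
    odds_ratio = start_or                       # step 1: starting guess
    for _ in range(max_iter):
        power = simulate_power(odds_ratio, n_cases, n_controls,
                               geno_freq, alpha, n_reps, rng)   # steps 2 to 5
        if abs(power - target_power) <= tol:
            break
        # Step 6 (assumed rule): infer the standardised effect implied by the
        # observed power, then rescale log(OR) to hit the target.
        clipped = min(max(power, 0.5 / n_reps), 1 - 0.5 / n_reps)
        z_now = max(z_crit + nd.inv_cdf(clipped), 0.1)
        odds_ratio = math.exp(math.log(odds_ratio) * z_target / z_now)
    return odds_ratio
```

For example, `minimum_detectable_or(n_cases=1000, n_controls=1000, alpha=0.05)` returns the smallest odds ratio this simplified scenario can detect with roughly 80% power. With only 200 replicates the result carries noticeable simulation noise; the slides use 1,000 replicates per iteration.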
Taking account of • Age range at recruitment 40-69 years • Recruitment over 5 years • All cause mortality • Disease incidence (“healthy cohort effect”) • Migration overseas • Comprehensive withdrawal (max 1/500 p.a.) • Partial withdrawal (c.f. 1958 Birth Cohort)