Large-scale association studies: brute force and ignorance Thomas Lumley BIOINF 744/STATS 771
My experience: the CHARGE consortium • I’m in CHARGE
Genome-wide Association Studies • If a protein is important in a particular disease or trait then small changes in function or expression of that protein should have a small effect on the disease or trait. • A small effect. Very small. No, smaller than that. • But not confounded with environmental or lifestyle factors: like a tiny randomised experiment • And not relying on prior biological knowledge • we can find surprises. • Because SNP associations are very weak, need 10^3 to 10^5 people in the study
Does it work? • Not for risk prediction or treatment choice • exception: some adverse drug reactions • Yes, for discovering new mechanisms or potential drug targets • mystery 9p21 variant in CHD • autophagy in Crohn’s Disease • sodium transporter affecting uric acid levels • new ion channels important in heart rhythm
Measurement technologies • SNP chips • cDNA attached to glass/silicon • multiple probes per SNP • planned layout (Affymetrix) or random (Illumina) • DNA binds to cDNA, fluorescent tags for readout • Off-the-shelf • 10^5 to 10^7 SNPs, genome-wide coverage • Custom chips • 384 to 250000 SNPs for a particular purpose
Scale • SNP chips are cost-effective only for large sample sizes and numbers of SNPs • new ‘exome chip’ has all known coding variants segregating in the population • 1.5 million chips sold • a few hundred dollars per chip in large volumes = dozens of SNPs per cent.
Quality control [Figure: genotype cluster plot with clusters labelled homozygote, heterozygote, homozygote]
Quality control [Figure: genotype cluster plots marked with question marks]
Quality control • Easy by hand, but there are 500000 of them • 1 minute each = 24 hrs/day for a year. • Need to be brutal and automated • 10% of SNPs discarded is not unusual • Batch effects • Missingness (per SNP and per sample) • Hardy-Weinberg equilibrium • low minor allele frequency (eg Np < 100) • Big differences from expected allele frequency
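A minimal sketch of two of the automated per-SNP filters listed above (missingness and low minor allele count), in Python. The function name `snp_passes_qc`, the 5% missingness cut-off and the genotype coding (0/1/2 copies of one allele, `None` for a failed call) are assumptions for illustration; only the Np < 100 rule comes from the slide.

```python
def snp_passes_qc(genotypes, max_missing=0.05, min_np=100):
    """Return True if one SNP survives the missingness and Np filters.

    genotypes: one entry per person, 0/1/2 allele copies or None if uncalled.
    max_missing: hypothetical cut-off for the fraction of failed calls.
    min_np: minimum N * p, where N is the number of called people and
            p is the minor allele frequency (the slide's "Np < 100" rule).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False
    n = len(called)
    freq = sum(called) / (2 * n)      # frequency of the counted allele
    maf = min(freq, 1 - freq)         # minor allele frequency
    return n * maf >= min_np


# Example: 1000 people, 300 heterozygotes, so p = 0.15 and Np = 150.
print(snp_passes_qc([0] * 700 + [1] * 300))
```

In a real pipeline the same function would be applied to every SNP, alongside per-sample missingness, batch and Hardy-Weinberg checks.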
HWE • Not looking at population structure here • Bad SNPs tend to drop out either heterozygotes or rare-allele homozygotes • Calling error leads to massive HWE violations • p-value < 10^-5 is a common standard
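One way to automate the HWE check is a one-degree-of-freedom chi-square test on the three genotype counts. This is a sketch under assumptions: the function name `hwe_chisq_pvalue` is hypothetical, and the chi-square approximation is used here for brevity (an exact test may behave better when the rare homozygote count is small).

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed counts of the three genotypes (not all zero).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chisq / 2))


# A SNP whose heterozygotes have all dropped out fails the 10^-5 standard.
print(hwe_chisq_pvalue(500, 0, 500) < 1e-5)
```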
Batch effects • Experimental design is important • mix cases and controls in same batches • especially important with new technologies • “After online publication of our report ‘Genetic Signatures of Exceptional Longevity in Humans’ (1) we discovered that technical errors in the Illumina 610 array and an inadequate quality control protocol introduced false positive single nucleotide polymorphisms (SNPs) in our findings.” -- Sebastiani et al., Science retraction.
Analyses • Must be simple and fast • Usually additive genetic model • Adjust for • sampling factors such as recruitment site • precision variables (eg heart rate in QT interval) • age and sex (because epidemiologists can’t stop themselves) • population structure summaries • not for any post-conception exposures that don’t affect genes.
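To make the additive genetic model concrete, here is a minimal single-SNP sketch: a quantitative trait is regressed on the number of copies of one allele (0, 1 or 2). The function name `additive_snp_test` is hypothetical, covariate adjustment is left out to keep the example short, and the p-value uses a normal approximation, so treat this as an illustration rather than the analysis used in any particular study.

```python
import math


def additive_snp_test(dosage, trait):
    """Least-squares slope of trait on allele count, with SE and p-value.

    dosage: allele counts (0/1/2, or imputed values in [0, 2]) per person.
    trait:  quantitative phenotype per person, same order.
    Assumes the SNP varies in the sample and the fit is not perfect.
    """
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, trait))
    beta = sxy / sxx                              # per-allele effect
    rss = sum((y - mean_y - beta * (x - mean_x)) ** 2
              for x, y in zip(dosage, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided, normal approx.
    return beta, se, p_value


beta, se, p = additive_snp_test([0, 0, 1, 1, 2, 2],
                                [1.0, 1.2, 2.0, 2.2, 3.0, 3.2])
print(beta, se, p)
```

The returned estimate and standard error are the per-SNP summaries that studies typically share for meta-analysis.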
Results [Figure: plot with one dot for each SNP, ordered by position within chromosome. Vertical axis: number of zeroes in p-value (need 7-8).]
Meta-analysis • Most genome-wide studies involve multiple samples • Usually share results, not individual data • combine by precision-weighted meta-analysis • no loss of efficiency for single-parameter analyses
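Precision-weighted meta-analysis of a single SNP can be sketched in a few lines: each study contributes its effect estimate weighted by the inverse of its squared standard error. The function name `precision_weighted_meta` and the example numbers are made up for illustration.

```python
import math


def precision_weighted_meta(estimates, std_errors):
    """Combine per-study estimates for one SNP by inverse-variance weighting.

    estimates:  effect estimate (eg per-allele slope) from each study.
    std_errors: the matching standard errors.
    Returns the pooled estimate and its standard error.
    """
    weights = [1 / se ** 2 for se in std_errors]   # precision of each study
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1 / total)
    return pooled, pooled_se


# Two hypothetical studies; the more precise one dominates the pooled result.
print(precision_weighted_meta([0.1, 0.2], [0.1, 0.2]))
```

Only the estimate and standard error per SNP need to be exchanged, which is why studies can share results rather than individual data.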
Computation: not a big deal • In R, roughly 12 cpu-hrs for quantitative traits, 36 cpu-hrs for binary, time-to-event • Parallelises very well • we split by chromosome • Limited by disk bandwidth • eg, six parallel R sessions on a cheap eight-core server • eg, 500 parallel R sessions on a high-quality supercomputer
Population structure • Full Bayesian modelling is too slow at this scale • Use first few principal components of the genotype correlation matrix • population structure is a concern because it leads to systematic variation in allele frequencies along the whole genome • systematic variation in allele frequencies along the whole genome shows up in principal components
Principal components • Genotype matrix G has 10^6 columns, 10^4 rows • don’t want to form G^T G, with 10^12 entries • work with G G^T, with 10^8 entries • first few eigenvectors are population structure components (or common inversions) • ‘EIGENSTRAT’ was first program to do this • Reduce effort further by using just 10^5 or 10^4 random SNPs (some loss in quality)
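The trick of working with the small people-by-people matrix can be sketched as follows. This toy version centres each SNP, forms the n x n matrix from G G^T, and finds only the leading eigenvector by power iteration. The function name `top_sample_pc`, the use of power iteration, and the omission of per-SNP standardisation are simplifying assumptions, not a description of what EIGENSTRAT does internally.

```python
import math
import random


def top_sample_pc(G, n_iter=200, seed=0):
    """Leading eigenvector of X X^T, where X is G with SNP columns centred.

    G: list of rows, one per person, each a list of 0/1/2 allele counts.
    Returns one score per person (unit length, sign arbitrary).
    Assumes at least one SNP varies across people.
    """
    n, m = len(G), len(G[0])
    col_means = [sum(row[j] for row in G) / n for j in range(m)]
    X = [[row[j] - col_means[j] for j in range(m)] for row in G]
    # n x n matrix of inner products between people (the small matrix)
    K = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(K[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data: people 0-2 and people 3-5 come from groups with very different
# allele frequencies, so the first component should split them by sign.
G = [[2, 2, 2, 0, 0, 1],
     [2, 2, 1, 0, 0, 0],
     [2, 1, 2, 0, 1, 0],
     [0, 0, 0, 2, 2, 2],
     [0, 0, 1, 2, 2, 1],
     [0, 1, 0, 2, 1, 2]]
print(top_sample_pc(G))
```

At real scale one would use an optimised linear algebra routine for the eigen-decomposition; the point here is only that the n x n matrix is the one worth forming.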
Principal components • Does it work? • if not, ancestry-informative loci would be over-represented in association findings • largely not the case • slight suggestion in very largest studies that ABO blood group and lactase persistence loci are cropping up too often.
Imputation • Meta-analysis often involves studies using different SNP chips • Can only combine results for the same SNPs • usually a minority • Imputation allows everyone to use the same SNPs • Based on linkage disequilibrium • with 500,000 SNPs, we are very far from linkage equilibrium
Imputation • Haplotyping • estimate possible haplotypes and their probabilities for each person in your sample • In reference panel with all the SNPs (eg HapMap) • look up which allele is on each haplotype • Compute posterior mean genotype
Imputation • Imputation does not use phenotype data • slightly underestimates association • but only for SNPs that explain a large fraction of variation in phenotype • which basically don’t exist. • Just plug imputed genotype into regression as if it was measured. • some people filter out SNPs where imputation is low-quality: compare to 2p(1-p)
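A short sketch of the two quantities mentioned in the imputation slides: the posterior mean genotype that gets plugged into the regression, and a quality measure that compares the variance of the imputed genotypes with 2p(1-p). Reading "compare to 2p(1-p)" as a variance ratio is an assumption on my part, and the function names are hypothetical.

```python
def posterior_mean_genotype(probs):
    """Expected allele count given P(0 copies), P(1 copy), P(2 copies)."""
    p0, p1, p2 = probs
    return p1 + 2 * p2


def imputation_quality(dosages):
    """Variance of imputed dosages divided by 2p(1-p).

    Values near 1 suggest the imputed genotypes vary about as much as
    measured genotypes would; values near 0 suggest little information.
    """
    n = len(dosages)
    mean = sum(dosages) / n
    variance = sum((d - mean) ** 2 for d in dosages) / n
    p = mean / 2                        # estimated allele frequency
    expected = 2 * p * (1 - p)
    return variance / expected if expected > 0 else 0.0


print(posterior_mean_genotype((0.1, 0.3, 0.6)))
print(imputation_quality([0, 1, 1, 2]))
```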
Imputation • For meta-analysis, need to impute to the same set of SNPs before analysis • most people use 2.5 million HapMap Phase II SNPs • starting to use 38 million 1000 Genomes SNPs • for additive genetic model, doesn’t matter whether SNPs are measured or imputed. • slightly more work needed for non-additive genetic models or SNP:SNP interaction models
Resequencing • 2.5 million SNPs is one per 1000 bases • Every base varies somewhere in the human population • Association studies by sequencing are just becoming possible • US$1000 genome probably coming next year
Resequencing • Basic idea is similar to GWAS, but • most variants will be rare • some variants will have stronger associations • the true functional variant will be measured. • For sufficiently-common SNPs, use the same analysis as in GWAS • For rare variants (SNPs and indels), use a burden test
Burden tests • Might expect most mutations to reduce function • people with more copies of rare variants should have lower function for that gene (or non-gene locus) • Use number of variants for each person as predictor in a regression model • rarer variants may have larger effects: give them more weight • we know or guess that some bases are more likely to matter: give them more weight
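The burden predictor described above can be sketched as a weighted count of rare variants per person within one gene. The function name `burden_scores`, the 1% frequency cut-off for "rare", and the weight 1/sqrt(p(1-p)) are illustrative assumptions; any weighting that favours rarer or more plausibly functional variants fits the same pattern.

```python
import math


def burden_scores(genotypes, allele_freqs, max_freq=0.01):
    """Weighted rare-variant count for each person in one gene or region.

    genotypes:    one row per person, 0/1/2 copies of each variant.
    allele_freqs: population frequency of each variant, same column order.
    max_freq:     hypothetical cut-off; variants above it are ignored.
    """
    scores = []
    for person in genotypes:
        score = 0.0
        for count, freq in zip(person, allele_freqs):
            if 0 < freq <= max_freq:
                # rarer variants get more weight
                score += count / math.sqrt(freq * (1 - freq))
        scores.append(score)
    return scores


# Person 0 carries one rare variant and two copies of a common one (ignored).
print(burden_scores([[1, 0, 2], [0, 0, 0]], [0.01, 0.005, 0.2]))
```

The resulting score then takes the place of the single-SNP genotype as the predictor in the regression model.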
Omnidirectional burden tests • ‘Loss of function’ is tricky • ion channel function is to open and close: which direction is loss of function? • Leiden variant in Factor V removes ability to be turned off: loss or gain of function? • Would be nice to find important genes even if variants act both ways • Hard: huge increase in dimension of problem • Simple meta-analyses are no longer efficient.
Omnidirectional burden tests • Typically based on correlation • do people with more similar genotypes have more similar phenotypes? • Power is very low if there are many unimportant variants
Third generation sequencing • Pacific Biosciences: tethered polymerase copies single DNA molecule, with spotlight small enough to see just one base fluoresce • Oxford Nanopore: drag a single DNA strand through a tiny hole and measure its shadow • Ion Torrent: tethered polymerases copy one base at a time, read-out uses H+ ion released by adding the base.