Genome-wide association studies (GWAS)
Thomas Hoffmann
Outline
• GWAS Overview
• Design
• Microarray
• Sequencing
• Which to use and other considerations
• QC
• Analysis
• Population stratification adjustment
• Imputation
• Replication & Meta-analysis
• Limitations, missing heritability, and beyond
Candidate Gene or GWAS
Genetic association studies (guilt by association)
Hirschhorn & Daly, Nat Rev Genet 2005
GWAS Microarray
Assay ~0.7-5M SNPs (keeps increasing)
Affymetrix, http://www.affymetrix.com
Genotype calls
[Figure: examples labeled "Bad calls!" and "Good calls!"]
One- and two-stage GWA designs
[Figure: One-Stage Design and Two-Stage Design panels; axes: SNPs, Samples; labels: Stage 1, Stage 2, n_samples, n_markers]
[Figure: One-Stage Design compared with Two-Stage Design under joint analysis and under replication-based analysis; axes: SNPs, Samples; regions: Stage 1, Stage 2]
Multistage Designs
• Joint analysis has more power than replication-based analysis
• The p-value threshold in Stage 1 must be liberal
• The gain is lower cost, not more power
• CaTS power calculator: http://www.sph.umich.edu/csg/abecasis/CaTS/index.html
Genome-wide Sequence Studies
• Trade-off between number of samples, depth, and genomic coverage.
Goncalo Abecasis
Near-term sequencing design choices
• For example, between:
  • Sequencing a few subjects with extreme phenotypes:
    • e.g., 200 cases, 200 controls, 4x coverage. Then follow up in a larger population.
  • 10M SNP chip based on 1000 Genomes.
    • 5K cases, 5K controls.
• Which design will work best…?
Design choices
GWAS Microarray
• Only assays SNPs designed into the array (0.7-5 million)
• Much cheaper (so many more subjects)
GWAS Sequencing
• "De novo" discovery (particularly good for rare variants)
• More expensive, although costs are falling (so many fewer subjects)
• Needs much more extensive IT support
• Lots of interesting interpretation problems (field rapidly evolving)
Design choices
Exome Microarray
• Only assays SNPs designed into the array (~300K + custom): in exons only, and that could affect protein-coding function
• Cheapest (so many more subjects)
Exome Sequencing
• "De novo" discovery (particularly good for rare variants); percentage of exons only
• More expensive than microarrays, less expensive than GWAS sequencing
• Needs more extensive IT support
• Lots of interesting interpretation problems
Size of study
Visscher, AJHG 2012
QC Steps
• Remove SNPs with low call rate (e.g., <97%)
  • Call rate: proportion of a SNP's genotypes actually called by the software
  • If it's low, the clusters aren't well defined (artifacts)
• Remove those with low minor allele frequency?
  • Rarer variants are more likely to be artifacts, and are underpowered
  • Exome arrays: rare variants are the whole point!
• Remove SNPs / individuals that have too much missing data
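The call-rate and minor-allele-frequency filters above can be sketched in a few lines. This is a minimal illustration, not a production QC pipeline: genotypes are assumed to be coded as the number of A alleles (0, 1, 2) with `None` for a no-call, the 97% call-rate cutoff comes from the slide, and the 1% MAF cutoff, the function names, and the toy data are assumptions made for the example.

```python
# Genotypes: number of A alleles (0, 1, 2); None marks a no-call.
# The MAF cutoff and all names below are illustrative assumptions.

def snp_call_rate(genotypes):
    """Proportion of samples with a called genotype at one SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Minor allele frequency among the called genotypes at one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq_a = sum(called) / (2 * len(called))
    return min(freq_a, 1 - freq_a)

def passes_qc(genotypes, min_call_rate=0.97, min_maf=0.01):
    """Keep a SNP only if it clears both the call-rate and the MAF cutoff."""
    return (snp_call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)

# Toy data: one clean SNP, one with missing calls, one monomorphic SNP.
snps = {
    "snp_good": [0, 1, 2, 1, 0, 1, 1, 2, 0, 1],
    "snp_low_call": [0, None, None, 1, 0, 1, 1, 2, 0, 1],
    "snp_monomorphic": [0] * 10,
}
kept = [name for name, genotypes in snps.items() if passes_qc(genotypes)]
print(kept)
```

For an exome array, where rare variants are the point, the MAF cutoff would presumably be lowered or set to zero.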
QC Steps (2)
• Remove SNPs that fail Hardy-Weinberg
  • Suppose a SNP has alleles A and B, and allele A has frequency p.
  • If mating is random, then
    • AA has frequency p*p
    • AB has frequency 2*p*(1-p)
    • BB has frequency (1-p)*(1-p)
  • Test for this (e.g., chi-squared test)
  • In practice, do this within homogeneous populations (more later)
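As a rough sketch of the chi-squared test mentioned above: estimate p from the observed genotype counts, compute the expected counts under random mating, and compare. The function name and the example counts are assumptions for illustration. The test has 1 degree of freedom (three genotype classes, minus one, minus one estimated allele frequency), and for 1 degree of freedom the upper tail can be written with `math.erfc`.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    expected = [n * p * p, n * 2 * p * (1 - p), n * (1 - p) * (1 - p)]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-squared distribution with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts that match Hardy-Weinberg proportions for p = 0.6.
print(hwe_chi_square(360, 480, 160))
# Same allele frequency, but far too few heterozygotes.
print(hwe_chi_square(500, 200, 300))
```

With small expected counts (rare variants), an exact test is usually preferred over the chi-squared approximation.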
QC Steps
• Check genotype-inferred gender against reported gender
• Filter Mendelian inheritance errors (family-based, or potentially cryptic relatives, if large enough sample)
• Check for relatedness...
Check for relatedness, e.g., HapMap
Pemberton et al., AJHG 2010
GWAS analysis
• Most common approach: look at each SNP one at a time
• Additive coding of the SNP is most common, e.g., # of A alleles
• The SNP is just a covariate in a regression framework
  • Dichotomous phenotype: logistic regression
  • Continuous phenotype: linear regression
  • {BMI} = B1{SNP} + B2{Age} + ...
• Further investigate / report top SNPs only (by p-value)
• Adjust for population stratification...
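The single-SNP linear regression above can be sketched with the standard library alone. This is a minimal illustration on simulated data, not the author's analysis code: the helper names, the simulated effect sizes, and the sample size are all assumptions, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples. A dichotomous phenotype would use logistic regression instead.

```python
import math
import random

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(k)]
           for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]

def linear_regression(covariates, y):
    """Ordinary least squares with an intercept; returns (coefficients, standard errors)."""
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in covariates] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return beta, se

def snp_association(snp, age, phenotype):
    """Additive-model test for one SNP, adjusted for age."""
    beta, se = linear_regression([snp, age], phenotype)
    z = beta[1] / se[1]
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return beta[1], se[1], p

# Simulated data (assumed values): SNP coded as the number of A alleles.
rng = random.Random(0)
n = 500
snp = [rng.choice([0, 1, 2]) for _ in range(n)]
age = [rng.uniform(30, 70) for _ in range(n)]
bmi = [25 + 0.5 * s + 0.05 * a + rng.gauss(0, 2) for s, a in zip(snp, age)]

effect, stderr, p_value = snp_association(snp, age, bmi)
print(f"SNP effect = {effect:.3f}, SE = {stderr:.3f}, p = {p_value:.2g}")
```

In a real GWAS the same model is fitted once per SNP, and further covariates (for example principal components) are added as extra columns.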
What is population stratification? Balding, Nature Reviews Genetics 2010
Adjusting for PCs • Li et al., Science 2008
Adjusting for PCs • Razib, Current Biology 2008
Adjusting for PCs • Wang, BMC Proc 2009
Aside: “random” mating? Sebro, Gen Epi, 2010
Multiple comparison correction
• If you conduct 20 tests at α=0.05, expect one to be significant by chance (http://xkcd.com/882/). If you conduct 1 million tests...
• Correct for multiple comparisons
  • e.g., Bonferroni: 1 million tests gives α=5x10^-8
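The arithmetic behind these numbers is short enough to write out. The function names are assumptions for illustration, and the expected count assumes every tested SNP is truly null.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

def expected_false_positives(alpha, n_tests):
    """Expected number of null tests that pass an uncorrected threshold alpha."""
    return alpha * n_tests

print(expected_false_positives(0.05, 20))         # 20 tests, uncorrected
print(expected_false_positives(0.05, 1_000_000))  # 1 million tests, uncorrected
print(bonferroni_threshold(0.05, 1_000_000))      # genome-wide threshold
```

Bonferroni treats the tests as independent, so it is conservative when nearby SNPs are correlated through LD.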
QQ-plots and PC adjustment Wang, BMC Proc 2009
Example: GWAS of Prostate Cancer
[Figure; x-axis: chromosome] http://cgems.cancer.gov
Multiple prostate cancer loci on 8q24
Witte, Nat Genet 2007
Imputation of SNP Genotypes
• Combine data from different platforms (e.g., Affy & Illumina) for replication / meta-analysis.
• Estimate unmeasured or missing genotypes.
• Based on measured SNPs and external info (e.g., haplotype structure of HapMap).
• Increase GWAS power (impute and analyze all), e.g., in sick sinus syndrome the most significant SNP was imputed from 1000 Genomes (Holm et al., Nature Genetics, 2011)
• HapMap as reference, now 1000 Genomes Project?
Imputation Example Li et al., Ann Rev Genom Human Genet, 2009
Imputation Application
TCF7L2 gene region & T2D from the WTCCC data
[Figure: observed genotypes in black, imputed genotypes in red; x-axis: chromosomal position]
Marchini, Nature Genetics 2007, http://www.stats.ox.ac.uk/~marchini/#software
Replication
To replicate:
• Association test for the replication sample significant at the 0.05 alpha level
• Same mode of inheritance
• Same direction
• Sufficient sample size for replication
A non-replication is not necessarily a false positive:
• LD structures, different populations (e.g., flip-flop)
• Covariates, phenotype definition, underpowered
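Two of the criteria above (significance at the 0.05 level and the same direction of effect) are mechanical enough to sketch as a check. The function name and the example numbers are assumptions for illustration; mode of inheritance and adequate sample size still have to be judged separately.

```python
def replicates(discovery_effect, replication_effect, replication_p, alpha=0.05):
    """True if the replication effect has the same sign and is significant at alpha."""
    same_direction = discovery_effect * replication_effect > 0
    return same_direction and replication_p < alpha

print(replicates(0.20, 0.15, 0.01))   # same direction, significant
print(replicates(0.20, -0.18, 0.01))  # opposite direction (possible flip-flop)
print(replicates(0.20, 0.15, 0.20))   # same direction, not significant
```

A failed check here only flags the SNP for a closer look, since LD differences or low power can explain a non-replication.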
Prostate Cancer Replications Witte, Nat Rev Genet 2009 Modest ORs
SNPs Missed in Replication?
24,223 smallest P-value!
Witte, Nat Rev Genet, 2009
Meta-analysis
• Combine multiple studies to increase power
• Either combine p-values (Fisher's test) or z-scores (better)
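A small sketch of the two options may help show why combining z-scores is usually preferred: Fisher's method uses only the p-values, while signed z-scores keep the direction of effect and can be weighted by study size. The sample-size weighting used here is one common choice, not necessarily the one the slide has in mind, and the function names and example numbers are assumptions.

```python
import math
from statistics import NormalDist

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k df under the null."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the test statistic
    # Closed-form upper tail of a chi-squared distribution with even df = 2k.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def weighted_z_meta(effects, p_values, sample_sizes):
    """Combine signed z-scores, weighting each study by sqrt(sample size)."""
    norm = NormalDist()
    z_scores = [math.copysign(abs(norm.inv_cdf(p / 2)), effect)
                for effect, p in zip(effects, p_values)]
    weights = [math.sqrt(n) for n in sample_sizes]
    z = (sum(w * zi for w, zi in zip(weights, z_scores))
         / math.sqrt(sum(w * w for w in weights)))
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Two studies that agree in direction, then two that disagree.
print(fisher_combined_p([0.01, 0.01]))
print(weighted_z_meta([0.2, 0.3], [0.01, 0.01], [1000, 1000]))
print(weighted_z_meta([0.2, -0.3], [0.01, 0.01], [1000, 1000]))
```

Fisher's method returns the same combined p-value whether or not the two studies agree in direction, whereas the z-score combination cancels out when the effects point opposite ways.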
(Meta-analysis) Example: GWAS of Prostate Cancer
[Figure; x-axis: chromosome] http://cgems.cancer.gov
Multiple prostate cancer loci on 8q24
Witte, Nat Genet 2007