Genome-wide association studies (GWAS)

Genome-wide association studies (GWAS) Thomas Hoffmann Department of Epidemiology and Biostatistics, and Institute for Human Genetics

Outline GWAS Overview Study design Plink overview QC Analysis If we have time

Main idea Correlation between genotype (SNP) and trait of interest (e.g., LDL/HDL, blood pressure) Test ~1 Million SNPs to see if any are correlated with the phenotype. Agnostic “shotgun” approach across the genome Hirschhorn & Daly, Nat Rev Genet 2005

Size matters • Large # SNPs tested, multiple comparison correction • Very small effect sizes • TagSNP, rather than actual SNP Visscher, AJHG 2012,

Outline GWAS Overview Study design gPlink overview QC Analysis If we have time

Study design with Quanto • Disease: case-control (unmatched/matched), continuous • Hypothesis: Gene only (also does GxE) • Outcome model: Baseline Risk • Genetic Effect: many GWAS ORs have actually been very small • Parameters: Power/Sample size • Type I error: 5e-8 (=5x10-8, bonferroni correction for 1 million tests) What parameters seem reasonable? How would you explain this in a grant? http://biostats.usc.edu/software

Outline GWAS Overview Study design gPlink overview QC Analysis If we have time

Plink overview Command-line driven code [You have seen this, right?] Start > All Programs > Accessories > Command Prompt Rock solid Graphical interface [You haven't seen this?] Pretty, but a little unstable, especially if you kill something May not allow you to delete some things until restart system Great documentation http://pngu.mgh.harvard.edu/~purcell/plink/reference.shtml Apply an operation to data, which produces output. Manually looking at output much more difficult because of quantity of it for GWAS; use other programs (Haploview [today], stata, R, e.g.) to display that output better

gPLINK setup Click “File > Open project”, and then click “Browse” in the dialogue that pops up, and navigate to where you have your plink file located, then click OK. Ignore any “stream error” errors you get, they don't seem to cause any problems.

gPLINK Setup (2) Click “File > Configure”, then click the two “Browse” buttons, navigating to the haploview and plink executable files, respectively.

QC Steps (1) Remove SNPs with low call rate (e.g., <95%), Individuals with too much missing data Proportion of SNPs actually called by software If it's low, the clusters aren't well defined, artifacts PLINK produce summary: plink --bfile ceu --missing --out ceu-miss PLINK on the fly / to produce a new dataset plink --bfile ceu --geno 0.05 --mind 0.05 <rest of command, e.g., association>

WARNING: It will look like it is not doing anything!!! In the background a plink.exe process has been launched. You need to wait for the green circle.

This creates a new dataset based on the filter, but this can also be done on the fly with some of the association analysis options as well...

QC Steps (2) Remove those with low minor allele frequency? Rarer variants more likely artifacts / underpowered Sometimes fit is unstable Other approaches to deal with rare variants than single SNP at a time (combine them) Produce summary: plink --bfile ceu --freq --out ceu-freq Filter: plink --bfile ceu --maf 0.05 --make-bed --out ceu-0.05

QC Steps (3) SNPs that fail Hardy-Weinberg Suppose a SNP with alleles A and B has allele frequency of p. If random matting, then AA has frequency p*p AB has frequency 2*p*(1-p) BB has frequency (1-p)*(1-p) Test for this (e.g., chi-squared test), if deviating too strongly, likely a bad SNP In practice do for homogeneous populations (more later), controls only

> hwe = read.table("hwe.hwe", header=TRUE, stringsAsFactors=FALSE, fill=TRUE) > head(hwe) CHR SNP TEST GENO O.HET. E.HET. P_HWD 1 1 rs3131968 ALL 0/15/45 0.2500 0.2188 0.5795 2 1 rs3131968 AFF 0/0/0 -1.0000 NA NA 3 1 rs3131968 UNAFF 0/0/0 -1.0000 NA NA 4 1 rs12562034 ALL 0/11/49 0.1833 0.1665 1.0000 5 1 rs12562034 AFF 0/0/0 -1.0000 NA NA 6 1 rs12562034 UNAFF 0/0/0 -1.0000 NA NA >

QC Steps (4) • Check genotype gender from X heterozygosity plink --bfile ceu --check-sex --out ceu-check-sex • Microarrays often have another gender check ceu-check-sex.sexcheck FID IID PEDSEX SNPSEX STATUS F 1341 NA06985 2 2 OK -0.06403 1341 NA06993 1 1 OK 1 1340 NA06994 1 1 OK 1 1340 NA07000 2 2 OK -0.03363 1340 NA07022 1 1 OK 1 1341 NA07034 1 1 OK 1 1341 NA07055 2 2 OK -0.06594 1340 NA07056 2 2 OK -0.02845 1345 NA07345 2 2 OK 0.00151 F: The actual X chromosome inbreeding (homozygosity) estimate. Which ones are female? Has this data been already cleaned?

(5) Check for relatedness, e.g., HapMap Pemberton et al., AJHG 2010 • Take overall average of all SNPs of how many alleles are shared. • E.g., parent-offspring never shares zero alleles. • HapMap was supposed to be unrelateds (this was a bit of a surprise!) • “MZ twins” sometime a duplicated sample...

PLINK Identity By State (IBS) Calculations plink --bfile ceu --genome --out ceu-genome FID1 Family ID for first individual IID1 Individual ID for first individual FID2 Family ID for second individual IID2 Individual ID for second individual RT Relationship type given PED file EZ Expected IBD sharing given PED file Z0 P(IBD=0) Z1 P(IBD=1) Z2 P(IBD=2) PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD ) PHE Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs) DST IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs ) PPC IBS binomial test RATIO Of HETHET : IBS 0 SNPs (expected value is 2)

QC Notes Exact thresholds differ based on dataset, how many subjects you have GWAS of 1,000 subjects vs. GWAS of 100,000 subjects Ad-hoc look at QQ-plots (later) to asses some QC issues Look at cumulative distributions may be slightly easier than histograms (what we saw earlier) for cutoffs in the tails

QC Summary Filter SNPs and individuals MAF (very sample size dependent) Low call rates Test HWE Relatedness – remove first degree relatives at least, or account for Check genotypic gender (e.g., mislabeled samples) Anything extra you could do with family data? Mother AA, Father BB, Child AA

GWAS analysis • Most common approach: look at each SNP one-at-a-time • Additive coding of SNP most common, e.g., # of A alleles • Just a covariate in a regression framework • Dichotomous phenotype: logistic regression • Continuous phenotype: linear regression • {BMI}=B1{SNP}+ B2{Age}+... • Can use residuals/propensity score to speed up analysis

What is population stratification? • If allele frequency is different in two populations (races/ethnicities), and a disease prevalence is different in the two... • If not corrected for, what is your test actually telling you? • Generally inflates all test statistics if not accounted for Balding, Nature Reviews Genetics 2010

Adjusting for PC's Maximize variance using all SNPs between subjects – pulls out clusters of individuals of different populations/subpopulations. • Li et al., Science 2008

Adjusting for PC's • Razib, Current Biology 2008

PCs Eigenstrat most common (linux only) Plink can still run something similar, two steps: plink --bfile ceu --genome --out ceu-genome plink --bfile ceu --read-genome ceu-genome.genome --cluster --mds-plot 10 --out ceu-mds

Running the analysisContinuous phenotype: linear regression{BMI}=B1{SNP}+ B2{Age}+...

plink --bfile ceu --linear --covar ceu.cov --covar-name PC1,PC2,PC3,PC4 --pheno ceu.phe --pheno-name phe --out ceu-analysis --hide-covar --hide-covar: supresses lots of output (every covariate for every regression), unfortunately, cannot do in gPLINK

plink --bfile ceu --linear --covar ceu.cov --covar-name PC1,PC2,PC3,PC4 --pheno ceu.phe --pheno-name phe --out ceu-analysis --hide-covar --adjust --adjust: Get the genomic inflation factor lambda, and inflate the test statistics slightly for it (otherwise can inflate later with other packages; especially if you want to run in parallel) Do you really want to adjust? More later...

Format of covariate files Phenotype and covariate files not structurally different from each other: FID IID LDL HDL BMI ID1 ID1 120 160 200 … FID and IID match things in the .fam file of the plink formatted files (which generally calling software produces)

View analysis results

Manhattan plot

Multiple comparison correction • If you conduct 20 tests at =0.05, one true by chance http://xkcd.com/882/. If you conduct 1 million tests... • Correct for multiple comparisons • e.g., Bonferroni, 1 million, =5x10-8

QQ-plots and PC adjustment Wang, BMC Proc 2009 Lambda quantifies how much it is deviating from what we would expect. Want it to be roughly 1, but I don't see too many objections for 1.04, 1.06, e.g., for ~10,000 subjects.

Genomic inflation factor Why might the genomic inflation factor be elevated?

Genome-wide association studies (GWAS)