Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 206 2-4271

Peter Kraft, pkraft@hsph.harvard.edu, Bldg 2 Rm 206, 2-4271
EPI293: Design and analysis of gene association studies
Winter Term 2008
Lecture 7: Genome-wide association scans

Timeline (axis ticks: 1900, 1920, 1940, 1960, 1980, 1990, 2000, 2005, 2006, 2007)
• Rediscovery of Mendel’s laws (1900)
• Principles of linkage analysis discovered
• Association between blood groups and malignant disease published
• Association between blood groups and malignant disease fails to replicate
• RFLPs available for linkage analysis
• Human Genome Project launched
• Microsatellite maps for genome-wide linkage analysis developed
• Risch and Merikangas paper
• Human Genome Project working draft completed; beginnings of SNP map
• First genome-wide association study
• HapMap launched
• Genome-wide SNP panels developed
• HapMap Phase I completed (draft Phase II available)

Linkage Analysis
[Figure: pedigree labeled with marker genotypes (alleles A, B, C)]

Genotype counts
           gg   Gg   GG
Case        1    3    4
Control     4    3    1
[Figure: 16 individuals, each labeled with genotype GG, Gg or gg]
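The genotype counts on this slide can be turned into an allelic comparison. The sketch below is only an illustration: it collapses the 2×3 genotype table into a 2×2 allele table and reports the cross-product odds ratio for G versus g. The choice of an allelic odds ratio (rather than a genotype-based or trend test) is an assumption made here, not something stated on the slide.

```python
# Genotype counts from the slide's toy example (columns: gg, Gg, GG).
cases = {"gg": 1, "Gg": 3, "GG": 4}
controls = {"gg": 4, "Gg": 3, "GG": 1}


def allele_counts(genotypes):
    """Return (number of G alleles, number of g alleles) for one group."""
    big_g = 2 * genotypes["GG"] + genotypes["Gg"]
    small_g = 2 * genotypes["gg"] + genotypes["Gg"]
    return big_g, small_g


case_G, case_g = allele_counts(cases)
control_G, control_g = allele_counts(controls)

# Allelic odds ratio for G versus g: cross-product ratio of the 2x2 allele table.
allelic_or = (case_G * control_g) / (case_g * control_G)

print("cases    G, g:", case_G, case_g)        # 11, 5
print("controls G, g:", control_G, control_g)  # 5, 11
print("allelic OR:", allelic_or)               # 4.84
```

With only 8 cases and 8 controls the estimate is far too imprecise to interpret; the point is just the mechanics of going from genotype counts to an association measure.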

Linkage vs. Association
• Linkage studies
  • Pro: can scan genome with fewer markers
  • Cons: can only detect alleles with large effect; limited resolution (identify broad region, not individual genes); requires data on multiple family members
• Association studies
  • Pros: can detect subtle effects; very fine resolution
  • Cons: requires 0.5 to 1 million markers to cover whole genome; requires large sample size

Risch and Merikangas (1996) Science 273:1516-7

Schlötterer C. Nat Rev Genet 2004;5:63-9.

Published Genome-Wide Association Scans [old slide]
• Ozaki K. Myocardial infarction. Nat Genet 2002;32:650–4.
• Klein RJ. Age-related macular degeneration. Science 2005;308:385–9.
• Maraganore DM. Parkinson disease. Am J Hum Genet 2005;77:685–93.
• Shiffman D. Myocardial infarction. Am J Hum Genet 2005;77:596–605.
• Cheung VG. Gene expression. Nature 2005;437:1365-9.
• Stranger BE. Gene expression. PLoS Genet 2005;1:695-704.
• Mah S. Schizophrenia. Mol Psychiatry 2006;11:471-8.
• Herbert A. Obesity. Science 2006;312:279-83.

Reviews
• Hirschhorn J. Nat Rev Genet 2005;6:95-108.
• Wang WY. Nat Rev Genet 2005;6:109-18.
• Thomas DC. Am J Hum Genet 2005;77:337-45.
• Thomas DC. Cancer Epidemiol Biomarkers Prev 2006;15:595-8.
• Evans DM. Trends in Genetics 2006 (epub)

Klein RJ. Science 2005;308:385–9
• 96 cases, 50 controls
• 103,611 SNPs
• rs380390: recessive OR 7.4 (2.9-19); PAR 70%
• Issues: genotyping errors; functionality; replication (Science 2005;308:421–4; Science 2005;308:419–21)

Maraganore DM. Am J Hum Genet 2005;77:685-93

            Tier 1           Tier 2
Subjects    443 sib pairs    332 matched unrelated case-control pairs
SNPs        198,000          3,148

No SNPs pass the Bonferroni-corrected significance threshold (2.5 × 10⁻⁷).
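The Bonferroni threshold quoted on this slide is consistent with dividing a family-wise level by the number of Tier 1 SNPs. A minimal check, assuming a family-wise α of 0.05 (the slide does not state the level explicitly):

```python
# Bonferroni-corrected per-SNP threshold for the Tier 1 scan.
n_snps = 198_000          # Tier 1 SNP count from the slide
family_wise_alpha = 0.05  # assumed; not stated on the slide

per_snp_threshold = family_wise_alpha / n_snps
print(f"per-SNP threshold: {per_snp_threshold:.1e}")
```

The result is approximately 2.5 × 10⁻⁷, matching the value on the slide.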

Known Prostate Cancer Genes, November 2006 Known Breast Cancer Genes, November 2006

Known Prostate Cancer Genes, Fall 2007 Known Breast Cancer Genes, Fall 2007

Kraft and Cox 2008 in: Rao and Gu, eds.

Outline • Power issues • Tagging efficiency of genome-wide panels • Multi-stage design and analysis • Design issues • Analytic issues • Imputation • CGEMS examples

[Figure panels: r² known; r² unknown]

Barrett JC. Nat Genet 2006;38:659-62. Pe’er I. Nat Genet 2006;38:663-7.

International HapMap Consortium. Nature. 2007 Oct 18;449(7164):851-61

Distribution of max r² with tag panel as a function of MAF. Tags chosen from a “pseudo Phase II HapMap” and evaluated against ENCODE SNPs.

Power adjusting for tagging efficiency
The fundamental theorem of the HapMap: a study that genotypes N cases and N controls at a marker that has a correlation of r² with a disease susceptibility locus has the same power as a study that genotypes N* = r²N cases and N* controls at the disease susceptibility locus itself.
Pritchard JK. Am J Hum Genet 2001;69:1-14. Jorgenson. Am J Hum Genet 2006;78:884-8. Terwilliger JD. Eur J Hum Genet 2006;14:426-37.
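The relation N* = r²N can be illustrated with a small calculation. The sketch below is a rough illustration, not the slide's own method: it assumes a 1-df test whose noncentrality parameter grows linearly with sample size, uses a normal approximation for power, and plugs in a made-up per-subject noncentrality (`ncp_per_subject`) and significance level purely for demonstration.

```python
from statistics import NormalDist


def power_1df(ncp, alpha):
    """Two-sided power of a 1-df test with noncentrality parameter ncp
    (normal approximation)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = ncp ** 0.5
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


def effective_sample_size(n, r2):
    """N* = r^2 * N: size of a direct study with the same power."""
    return r2 * n


def required_sample_size(n_direct, r2):
    """Subjects needed at the marker to match a direct study of n_direct."""
    return n_direct / r2


n, r2, alpha = 2000, 0.8, 1e-7
ncp_per_subject = 0.01  # hypothetical value, for illustration only

n_star = effective_sample_size(n, r2)
power_direct = power_1df(ncp_per_subject * n, alpha)
power_indirect = power_1df(ncp_per_subject * n_star, alpha)

print("N* =", n_star)
print("subjects needed at marker to match direct N:", required_sample_size(n, r2))
print("power (direct, indirect):", round(power_direct, 3), round(power_indirect, 3))
```

Under these assumptions, genotyping a tag with r² = 0.8 behaves like a direct study with 20% fewer subjects, and recovering the lost power requires inflating the sample size by 1/r².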

[Figure: power vs. sample size (cases) for direct association, indirect association averaged over r², and indirect association with r² fixed at 80%; panels for OR = 1.3, 1.5, 1.8 and MAF = .01, .05, .10]

Outline • Power issues • Tagging efficiency of genome-wide panels • Multi-stage design and analysis • Design issues • Analytic issues • Imputation • CGEMS examples

[Figure: multi-stage design schematic; axes: subjects, SNPs; stage test statistics T1, T2, T3]

Power of multi-stage designs

Replication analysis
• $\text{Power} = \Pr(T_1 > k_1, \ldots, T_S > k_S) = \Pr(T_1 > k_1) \cdots \Pr(T_S > k_S)$
• $k_s = \text{Quantile}(1 - m_{s+1}/m_s)$

Joint analysis
• $T_s^* = \sum_{i=1}^{s} T_i$
• $\text{Power} = \Pr(T_1^* > k_1^*, \ldots, T_S^* > k_S^*)$
• $k_s^*$ chosen s.t. the expected number of markers (under the null) taken to the $(s+1)$st stage is $m_{s+1}$

$m_{S+1}$ is the number of expected false leads (under the null) at the end of the $S$th stage (e.g. $m_{S+1} = .05$ is strong control of FWER at $\alpha = .05$).

Skol. Nat Genet 2006;38:209-13; Wang. Genet Epidemiol 2006;30:356-68; Kraft (in prep)
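To make the contrast between replication and joint analysis concrete, here is a rough two-stage sketch. Everything specific in it is an assumption: stage statistics are treated as independent one-sided normal statistics with unit variance, the joint statistic is an equally weighted sum rescaled to unit variance, and the mean shifts in `mu` are hypothetical. The joint-analysis threshold is found numerically so that the expected number of null markers surviving both stages matches the replication design.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def stage_thresholds(markers):
    """k_s = Quantile(1 - m_{s+1}/m_s); markers = [m_1, ..., m_S, m_{S+1}]."""
    return [Z.inv_cdf(1 - nxt / cur) for cur, nxt in zip(markers, markers[1:])]


def replication_power(means, thresholds):
    """Product of per-stage powers, treating stages as independent."""
    power = 1.0
    for mu_s, k_s in zip(means, thresholds):
        power *= 1 - Z.cdf(k_s - mu_s)
    return power


def joint_tail(k1, k2, w1, w2, mu1=0.0, mu2=0.0, steps=2000):
    """Pr(T1 > k1 and w1*T1 + w2*T2 > k2), T1 ~ N(mu1, 1), T2 ~ N(mu2, 1),
    by trapezoidal integration over T1."""
    lo, hi = k1, max(k1, mu1) + 10.0
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        f = Z.pdf(t - mu1) * (1 - Z.cdf((k2 - w1 * t) / w2 - mu2))
        total += f * (0.5 if i in (0, steps) else 1.0)
    return total * h


def joint_threshold(k1, target, w1, w2):
    """Bisection for k2* so the null probability of passing both stages is target."""
    lo, hi = -10.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if joint_tail(k1, mid, w1, w2) > target:
            lo = mid  # threshold too lenient
        else:
            hi = mid
    return (lo + hi) / 2


markers = [500_000, 1_500, 5]  # m_1, m_2, m_3
mu = [3.162, 3.162]            # hypothetical mean shifts under the alternative
w = 0.5 ** 0.5                 # equal weights (assumed equal stage sizes)

k1, k2 = stage_thresholds(markers)
rep_power = replication_power(mu, [k1, k2])
k2_star = joint_threshold(k1, markers[2] / markers[0], w, w)
joint_power = joint_tail(k1, k2_star, w, w, mu[0], mu[1])

print("k1, k2, k2*:", round(k1, 3), round(k2, 3), round(k2_star, 3))
print("replication power:", round(rep_power, 3))
print("joint power:", round(joint_power, 3))
```

In this toy setting the joint analysis has higher power than the replication analysis at the same expected number of false leads, which is the qualitative point made by the references on the slide.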

Multistage Design and Analysis
• It is (or should be) well known that “replication analysis” is statistically inefficient [cf. Thomas DC et al. (1985) AJE; Skol (2006) Nat Genet]
• Usually you can find a multistage design that has almost the same power as a single-stage design but is much cheaper
• Multi-stage design is NOT a way of finessing the multiple testing issue. If genotypes were free, you would genotype everybody for every SNP and test all SNPs at a very small alpha level.
• Multi-stage design IS a way of saving big $s, ₤s, €s, etc.

Amount of savings and cheapest design depend on prices—which are very fluid!

Calculating power for “replication analysis”

Number of markers    Number of subjects    Effective α level           Power
$M_1$                $N_1$                 $\alpha_1 = M_2/M_1$        $P_1 = 1 - \beta_{q,?,r,N_1,\alpha_1}$
$M_2$                $N_2$                 $\alpha_2 = M_3/M_2$        $P_2 = 1 - \beta_{q,?,r,N_2,\alpha_2}$
…
$M_k$                $N_k$                 $\alpha_k = M_{k+1}/M_k$    $P_k = 1 - \beta_{q,?,r,N_k,\alpha_k}$
Overall                                                                $\prod_{i=1}^{k} P_i$

$M_{k+1}$ is the “number of significant tests expected under the null”. E.g. $M_{k+1} = .05$ is the Bonferroni-corrected threshold for $M_1$ tests.
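The bookkeeping in this table is simple enough to script. The sketch below takes the per-stage powers as given (it does not compute them) and derives the effective α levels and the overall power; the example numbers are the two-stage design from the next slide, with an expected 5 false leads after the final stage. The function names are illustrative.

```python
from math import prod


def effective_alphas(markers):
    """alpha_s = M_{s+1} / M_s for markers = [M_1, ..., M_k, M_{k+1}]."""
    return [nxt / cur for cur, nxt in zip(markers, markers[1:])]


def overall_power(stage_powers):
    """Overall power of a replication analysis: product of stage powers."""
    return prod(stage_powers)


markers = [500_000, 1_500, 5]  # M_1, M_2, M_3
stage_powers = [0.883, 0.999]  # P_1, P_2 as reported on the slide

alphas = effective_alphas(markers)
overall = overall_power(stage_powers)

print("effective alpha levels:", [round(a, 4) for a in alphas])
print("overall power:", round(overall, 3))
```

The output reproduces the slide's effective levels (about .003 at each stage) and overall power (.882).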

Calculating power for “replication analysis”

Number of markers    Number of subjects          Effective α level     Power
500,000              2,400 (1:1 case:control)    $\alpha_1$ = .003     .883
1,500                6,000                       $\alpha_2$ = .003     .999
Overall                                                                .882

q = 10%; dominant OR = 1.4; $M_3$ = 5
Cost: ca. USD 700 × 2,400 + USD 60 × 6,000 = USD 2.04 million

Calculating power for “replication analysis”

Number of markers    Number of subjects          Effective α level     Power
500,000              2,400 (1:1 case:control)    $\alpha_1$ = .04      .999
20,000               3,000                       $\alpha_2$ = .075     .998
1,500                3,000                       $\alpha_3$ = .003     .950
Overall                                                                .946

q = 10%; dominant OR = 1.4; $M_4$ = 5
Cost: ca. USD 700 × 2,400 + USD 200 × 3,000 + USD 60 × 3,000 = USD 2.46 million
Two-stage study with equivalent power costs > USD 2.8 million
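The cost arithmetic for the two example designs can be checked with a few lines. This sketch assumes the USD figures on the slides are per-subject genotyping prices at each stage (USD 700 for the 500,000-SNP panel, USD 200 for 20,000 SNPs, USD 60 for 1,500 SNPs); the overall powers are copied from the slides, not recomputed.

```python
def study_cost(stages):
    """Total genotyping cost; stages = [(price per subject in USD, subjects), ...]."""
    return sum(price * subjects for price, subjects in stages)


# (per-subject price, number of subjects) for each stage, from the slides.
two_stage = [(700, 2_400), (60, 6_000)]
three_stage = [(700, 2_400), (200, 3_000), (60, 3_000)]

cost_two = study_cost(two_stage)
cost_three = study_cost(three_stage)

# Overall powers as reported on the slides.
power_two, power_three = 0.882, 0.946

print(f"two-stage:   USD {cost_two:,}  power {power_two}")
print(f"three-stage: USD {cost_three:,}  power {power_three}")
print(f"extra cost for the three-stage design: USD {cost_three - cost_two:,}")
```

The three-stage design buys its extra power for about USD 0.42 million, compared with the more than USD 2.8 million quoted for a two-stage design of equivalent power.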

Three different per-SNP pricing scenarios considered. Prices are relative to per-SNP costs for the whole-genome platform.

Pricing scheme A; cost relative to single stage study using 7,000 subjects

Relative costs for studies with 65% power

Power for single stage studies, accounting for tagging efficiency

Power for three stage studies, accounting for tagging efficiency
[Figure: power vs. relative cost; panels: Illumina 550, Affy 500, Affy 1,000]

(Simulated) tagging properties of three panels: Illumina 550, Affy 500, Affy 1,000

How to select SNPs for 2nd Stage?
• Rank by increasing p-value
  • But recall, prob. of being a false positive depends not only on p-value, but also on power and prior
  • Hence Bayesian alternatives [WTCCC, Wakefield 2007 Am J Hum Genet]
  • Quasi-Bayesian FPRP [Wacholder et al 2004; Samani 2007 NEJM]
  • Prior-weighted analyses [Roeder 2007 Genet Epidemiol, Lewinger 2007 Genet Epidemiol]
  • Pragmatist: meh, no big difference in practice
• What about multiple SNPs in high LD?
  • Cull so as to interrogate as many regions as possible (“broad” follow up), or retain to try and distinguish causal variant (“deep” follow up)?
• Can I improve coverage by genotyping more SNPs around “hits”?
  • Again: “deep” coverage
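The point that the chance of a false positive depends on power and prior as well as the p-value can be illustrated with the false positive report probability (FPRP) of Wacholder et al. (2004), FPRP = α(1 − π) / (α(1 − π) + (1 − β)π), where π is the prior probability of a true association and 1 − β the power at level α. The numbers below are hypothetical and chosen only to show two SNPs reaching the same significance level with different power.

```python
def fprp(alpha, power, prior):
    """False positive report probability for a finding significant at alpha."""
    false_pos = alpha * (1 - prior)  # null SNPs that reach significance
    true_pos = power * prior         # associated SNPs that reach significance
    return false_pos / (false_pos + true_pos)


alpha = 1e-4  # hypothetical significance level reached by both SNPs
prior = 1e-4  # hypothetical prior probability of true association

well_powered = fprp(alpha, power=0.8, prior=prior)
poorly_powered = fprp(alpha, power=0.2, prior=prior)

print("FPRP, power 0.8:", round(well_powered, 3))
print("FPRP, power 0.2:", round(poorly_powered, 3))
```

Two SNPs with the same p-value get different FPRPs, so ranking by FPRP can in principle differ from ranking by p-value, although as the slide notes the practical difference is often small.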

“Broad” / “deep” defined
[Figure: schematic contrasting “broad” follow-up and “deep” follow-up]

Thought Experiment
• Two kinds of GWAS products
  • Tagging: captures HapMap II at r² > 80%
  • Random: has density of Affy 500k
• Choose additional SNPs in 2nd stage so that you tag region spanning “hit” in HapMap II at > 95%
• Does this increase your power over simply genotyping the top hit?

[Figure: # markers per region. Tagging panel: 1.46× markers; random panel: 3.22× markers]

OR = 1.3, MAF = .10; two-stage designs; 7,000 cases/controls
[Figure: maximum power for budget vs. cost, “broad” vs. “deep” follow-up, with the power of a one-stage design for reference; panels: tagging panel, random panel]

[Figure: “deep” follow-up vs. “broad” follow-up. Am J Hum Genet 2007]

Fine mapping (“deep” follow-up) gives only a very small gain in power. Is it worth the opportunity cost? Genotyping a lot of extra markers to “fine map” null loci means you will miss the chance to replicate the true signals that happened to be lower on your list.

Power calculations http://www.sph.umich.edu/csg/abecasis/CaTS/
