Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 207 2-4271

EPI293: Design and analysis of gene association studies • Winter Term 2008 • Lecture 2: Patterns of LD and “tag SNP” selection • Peter Kraft, pkraft@hsph.harvard.edu, Bldg 2 Rm 207, 2-4271

Before HapMap: “looking under lamppost” Study 1: Pop’n A, small N, no assoc’n Study 2: Pop’n A, large N, no assoc’n Study 3: Pop’n B, large N, assoc’n After HapMap Study 2 revisited: Pop’n A, large N, assoc’n

Outline • Measures of linkage disequilibrium • Reasons for LD and empirical patterns of LD • “Tagging” SNPs • The HapMap project • Resources and tools for SNP selection

Basic idea: linkage disequilibrium A G A G a g a g a g A g A G A G A G A G a g A g a g a g A G A G Alleles at two (or more) loci are correlated on chromosomes drawn at random from the population

Measures of linkage disequilibrium • Basic data: table of haplotype frequencies A G A G a g a g a g A g A G A G A G A G a g A g a g a g A G A G

Linkage disequilibrium and marginal allele freqs. • pA & pG are (minor) allele frequencies • qA = 1 − pA; qG = 1 − pG • δ = xz − yw is a measure of departure from independence, where x and z are the frequencies of the A G and a g haplotypes and y and w those of the A g and a G haplotypes • No association between A and G ⇔ δ = 0 • Max(δ) = min(pA qG, pG qA)

|D’| and r² are most common • D prime … • …ranges from 0 [no LD] to 1 [complete LD]… • …is less sensitive to marginal allele frequencies… • …is directly related to recombination fraction • R squared… • …also ranges from 0 to 1… • …is correlation between alleles on the same chromosome… • …is very sensitive to marginal allele frequencies… • …is directly related to study power • If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r²N cases and controls that directly measured G • r²N is the “effective sample size”
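A small worked example of the “effective sample size” rule on this slide, as a hedged Python sketch. The function names and the example N = 2000 are illustrative assumptions, not part of the lecture; the 1/r² inflation factor is simply the rule r²N solved for N.

```python
def effective_sample_size(n_subjects: float, r2: float) -> float:
    """Size of a study typing the causal variant G directly that has the
    same power as a study of n_subjects typing only the marker M."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie between 0 and 1")
    return r2 * n_subjects


def inflation_factor(r2: float) -> float:
    """Factor by which N must grow to offset typing M instead of G."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return 1.0 / r2


# Illustrative values only: N = 2000 cases and controls.
for r2 in (1.0, 0.8, 0.5):
    print(r2, effective_sample_size(2000, r2), inflation_factor(r2))
```

This is one way to read the r² > 0.8 threshold used later in the lecture: a tag at r² = 0.8 costs a 25% increase in sample size relative to typing the causal variant itself.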

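Example (16 chromosomes): A G, A G, a g, a g, a g, A g, A G, A G, A G, A G, a g, A g, a g, a g, A G, A G • Haplotype counts: A G = 8, A g = 2, a G = 0, a g = 6 • r² = (8×6 − 2×0)² / (10×6×8×8) = 0.6 • D’ = (8×6 − 2×0) / (8×6) = 1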
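The arithmetic on this slide can be checked with a short Python sketch. It assumes the standard definitions D’ = δ / Max(δ) and r² = δ² / (pA qA pG qG), and uses the usual bound min(pA pG, qA qG) when δ is negative (the slide gives only the bound for positive δ). The function name and the assignment of x, y, w, z to table cells are assumptions made for illustration.

```python
def ld_measures(n_AG: int, n_Ag: int, n_aG: int, n_ag: int):
    """Return (delta, |D'|, r^2) from the four haplotype counts."""
    n = n_AG + n_Ag + n_aG + n_ag
    # Haplotype frequencies (cell labels are an assumption).
    x, y, w, z = n_AG / n, n_Ag / n, n_aG / n, n_ag / n
    p_A, p_G = x + y, x + w          # marginal allele frequencies
    q_A, q_G = 1.0 - p_A, 1.0 - p_G
    delta = x * z - y * w            # departure from independence
    if delta >= 0:
        d_max = min(p_A * q_G, q_A * p_G)
    else:
        d_max = min(p_A * p_G, q_A * q_G)
    d_prime = abs(delta) / d_max if d_max > 0 else 0.0
    denom = p_A * q_A * p_G * q_G
    r2 = delta ** 2 / denom if denom > 0 else 0.0
    return delta, d_prime, r2


# Counts from the 16 chromosomes on the slide: AG=8, Ag=2, aG=0, ag=6.
delta, d_prime, r2 = ld_measures(8, 2, 0, 6)
print(delta, d_prime, r2)  # agrees with the slide: D' = 1, r^2 = 0.6
```

Because the a G haplotype is absent, |D’| is 1 even though r² is only 0.6, which shows why the two measures answer different questions.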

Computational detail • Haplotypes are rarely directly observed • Have to infer from genotype data • Genotypes can be consistent with more than one haplotype pair: a double heterozygote (Aa, Gg) may carry A G / a g or A g / a G • Most popular algorithm: Expectation-Maximization¹ • The haplotype table is related to, but not exactly equal to, the 3x3 table of genotypes • Correlation from the genotype table makes no assumptions about HWE (Weir, Genetic Data Analysis) • ¹ Thomas pp. 243-245
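A minimal sketch of the EM idea for two biallelic SNPs, in Python. It assumes Hardy-Weinberg equilibrium (random pairing of haplotypes), so that only the double heterozygotes are ambiguous; the function name, the layout of the 3x3 genotype table and the example counts are hypothetical, and this is not the specific implementation described in Thomas.

```python
def em_haplotype_freqs(geno, n_iter=100, tol=1e-10):
    """geno[i][j] = number of subjects carrying i copies of allele A
    and j copies of allele G (a 3x3 table). Returns haplotype freqs."""
    n_hap = 2 * sum(sum(row) for row in geno)
    # Haplotype counts that are unambiguous given the genotype.
    base = {
        "AG": 2 * geno[2][2] + geno[2][1] + geno[1][2],
        "Ag": 2 * geno[2][0] + geno[2][1] + geno[1][0],
        "aG": 2 * geno[0][2] + geno[1][2] + geno[0][1],
        "ag": 2 * geno[0][0] + geno[1][0] + geno[0][1],
    }
    n_dh = geno[1][1]  # double heterozygotes: AG/ag or Ag/aG
    freqs = {h: 0.25 for h in base}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AG/ag.
        cis = freqs["AG"] * freqs["ag"]
        trans = freqs["Ag"] * freqs["aG"]
        w_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected counts.
        counts = {
            "AG": base["AG"] + n_dh * w_cis,
            "ag": base["ag"] + n_dh * w_cis,
            "Ag": base["Ag"] + n_dh * (1.0 - w_cis),
            "aG": base["aG"] + n_dh * (1.0 - w_cis),
        }
        new = {h: c / n_hap for h, c in counts.items()}
        change = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if change < tol:
            break
    return freqs


# Hypothetical genotype counts (rows: copies of A; columns: copies of G).
example = [[20, 5, 1],
           [6, 30, 4],
           [2, 7, 25]]
print(em_haplotype_freqs(example))
```

The estimated frequencies can then be used to compute δ, D’ and r² as if the haplotypes had been observed.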

Why does LD exist? • “Recombination coldspots” • Demographics (e.g. bottlenecks) • Population stratification or admixture • Confounds gene-disease association • Does not decay with distance (among other reasons… selective pressure … etc.)

A Decay of LD in Pictures

Decay of LD: T = 0 (1 - )T 1 generation 5 generations 10 20 40 80

200 kbp from chr2, positions 51,783,239 to 51,983,238 Data from the ENCODE project http://www.hapmap.org/downloads/encode1.html.en

Implications • Admixture can lead to false positives • Two unlinked loci can stay in LD • Recent admixture, continual gene flow problematic • Isolated populations have advantages for fine-mapping • LD extends long distances, so fewer markers need be typed • But resolution may be poor • Knowledge of local LD structure is essential for candidate gene studies!

Basic “tagging” design • Measure haplotypes/LD pattern in a subsample (often an external database) • Choose a subset of SNPs (“tagSNPs”) that contain the majority of the information • Genotype the “tagSNPs” in the main study, analyze appropriately

ATM: over 750 known SNPs, at least 50 are common in Europeans

ATM

“block” = region of limited haplotype diversity and/or high LD

But there are unappealing aspects of the “haplotype block” idea • Definition and “block finding” algorithms are ad hoc • Different definitions and algorithms lead to different block structures • Block structure changes with sample size, marker density • “Hard boundaries” are… • …unappealing for tagSNP selection (what about “between blocks”?)… • …an inaccurate description of LD patterns (some haplotypes overlap boundaries) • Plus, haplotypes present analytic challenges • [Wall & Pritchard (2003a) Nat Rev Genet 4:587; (2003b) AJHG 73:502] [Nothnagel and Rohde (2005) AJHG 77:988]

CYP19

Keep it simple [Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) on example haplotypes; three groups of SNPs are marked “high r²”] • We want SNPs that predict unobserved variants • Why not choose SNPs based on pairwise correlations? • Q: What if we don’t know enough about common genetic variation to say we’ve captured it? • A: HapMap and resequencing projects

HapMap: application in the design and interpretation of association studies Mark J. Daly, PhD on behalf of The International HapMap Consortium [OK it may look like I’m totally stealing these slides—but they are free on the web at http://www.hapmap.org/tutorials.html.en]

Goals of this segment • Briefly summarize HapMap design and current status • Discuss the application of HapMap to all aspects of association study design, analysis and interpretation

HapMap Project High-density SNP genotyping across the genome provides information about • SNP validation, frequency, assay conditions • correlation structure of alleles in the genome A freely-available public resource to increase the power and efficiency of genetic association studies to medical traits All data is freely available on the web for application in study design and analyses as researchers see fit

HapMap Samples • 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI) • 90 individuals (30 trios) of European descent from Utah (CEU) • 45 Han Chinese individuals from Beijing (CHB) • 45 Japanese individuals from Tokyo (JPT)

HapMap progress • PHASE I – completed, described in Nature paper • * 1,000,000 SNPs successfully typed in all 270 HapMap samples • * ENCODE variation reference resource available • PHASE II – data generation complete, data released early November 2005 • * >3,500,000 SNPs typed in total !!! • Reference: Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, et al. (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-61.

ENCODE-HAPMAP variation project • Ten “typical” 500kb regions • 48 samples sequenced • All discovered SNPs (and those in dbSNP) typed in all 270 HapMap samples • Current data set – 1 SNP every 279 bp • A much more complete variation resource by which the genome-wide map can be evaluated

Completeness of dbSNP Vast majority of common SNPs are contained in or highly correlated with a SNP in dbSNP

Recombination hotspots are widespread and account for LD structure [Figure: 7q21]

Coverage of Phase II HapMap (estimated from ENCODE data)
Panel      % r² > 0.8   max r²
YRI        81           0.90
CEU        94           0.97
CHB+JPT    94           0.97
Vast majority of common variation (MAF > .05) captured by Phase II HapMap
From Table 6 – “A Haplotype Map of the Human Genome”, Nature

Applying the HapMap • Study design - tagging • Study coverage evaluation • Study analysis - improving association testing • Study interpretation • Comparison of multiple studies • Connection to genes/genomic features • Integration with expression and other functional data • Other uses of HapMap data • Admixture, LOH, selection

Tagging from HapMap • Since HapMap describes the majority of common variation in the genome, choosing non-redundant sets of SNPs from HapMap offers considerable efficiency without power loss in association studies

Pairwise tagging [Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) on example haplotypes; three groups of SNPs are marked “high r²”] • Tags: SNP 1, SNP 3, SNP 6 (3 in total) • Test for association: SNP 1, SNP 3, SNP 6 • After Carlson et al. (2004) AJHG 74:106
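The pairwise tagging idea can be sketched as a greedy binning in Python. This is an illustration in the spirit of the approach cited on the slide, not necessarily the exact procedure of Carlson et al. or of Haploview; the haplotypes below are hypothetical (they only reuse the six allele pairs from the figure), and the function names and the tie-breaking rule (lowest SNP number wins) are assumptions.

```python
def r_squared(col_a, col_b):
    """r^2 between two biallelic SNPs given as per-haplotype alleles."""
    n = len(col_a)
    a = [1 if x == col_a[0] else 0 for x in col_a]
    b = [1 if x == col_b[0] else 0 for x in col_b]
    n_a, n_b = sum(a), sum(b)
    n_ab = sum(i * j for i, j in zip(a, b))
    denom = n_a * (n - n_a) * n_b * (n - n_b)
    if denom == 0:  # monomorphic SNP: correlation undefined
        return 0.0
    return (n * n_ab - n_a * n_b) ** 2 / denom


def greedy_pairwise_tags(haplotypes, threshold=0.8):
    """Return {tag SNP: SNPs it captures}, SNPs numbered from 1."""
    n_snps = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_snps)]
    r2 = [[r_squared(cols[i], cols[j]) for j in range(n_snps)]
          for i in range(n_snps)]
    untagged = set(range(n_snps))
    bins = {}
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs.
        best = max(sorted(untagged),
                   key=lambda i: sum(r2[i][j] >= threshold for j in untagged))
        captured = {j for j in untagged if r2[best][j] >= threshold}
        captured.add(best)
        bins[best + 1] = sorted(j + 1 for j in captured)
        untagged -= captured
    return bins


# Hypothetical haplotypes over SNPs 1..6 (not the ones in the figure).
haps = ["AGGTGA"] * 4 + ["AGCCCC"] * 2 + ["TAGCGC"] * 2 + ["TACCCC"] * 2
print(greedy_pairwise_tags(haps))
```

With these made-up haplotypes the sketch needs three tags, one per high-r² group, and only the tagged SNPs are then tested for association in the main study. Which SNP in a group serves as the tag is arbitrary, so SNP 4 and SNP 6 are interchangeable here.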

Pairwise Tagging Efficiency Tag SNPs were picked to capture common SNPs in release 16c.1 for every 7,000 SNP bin using Haploview. Tagging Phase I HapMap offers 2-5x gains in efficiency

Use of haplotypes can improve genotyping efficiency [Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) on example haplotypes] • Multimarker tagging: Tags: SNP 1, SNP 3 (2 in total) • Test for association: SNP 1 (captures SNPs 1+2), SNP 3 (captures SNPs 3+5), “AG” haplotype (captures SNPs 4+6) • Pairwise tagging: Tags: SNP 1, SNP 3, SNP 6 (3 in total) • Test for association: SNP 1, SNP 3, SNP 6 • Tags in a multi-marker test should be conditional on significance of LD in order to avoid overfitting
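A hedged Python sketch of why a two-SNP haplotype can stand in for a third tag. The haplotypes are hypothetical, constructed so that allele A at SNP 6 occurs exactly on chromosomes carrying A at SNP 1 and G at SNP 3; they are not the haplotypes in the figure, and the function name is an assumption.

```python
def r_squared(a, b):
    """r^2 between two 0/1 indicator vectors of equal length."""
    n = len(a)
    n_a, n_b = sum(a), sum(b)
    n_ab = sum(i * j for i, j in zip(a, b))
    denom = n_a * (n - n_a) * n_b * (n - n_b)
    if denom == 0:
        return 0.0
    return (n * n_ab - n_a * n_b) ** 2 / denom


# Hypothetical haplotypes over SNPs 1..6 (string position = SNP - 1).
haps = ["AGGTGA"] * 4 + ["AGCCCC"] * 2 + ["TAGCGC"] * 2 + ["TACCCC"] * 2

snp1_A = [1 if h[0] == "A" else 0 for h in haps]
snp3_G = [1 if h[2] == "G" else 0 for h in haps]
snp6_A = [1 if h[5] == "A" else 0 for h in haps]
# Indicator of the "AG" haplotype formed by the two tags.
hap_AG = [i * j for i, j in zip(snp1_A, snp3_G)]

print("SNP 1 vs SNP 6:", r_squared(snp1_A, snp6_A))
print("SNP 3 vs SNP 6:", r_squared(snp3_G, snp6_A))
print("AG haplotype vs SNP 6:", r_squared(hap_AG, snp6_A))
```

Neither tag alone reaches r² > 0.8 with SNP 6 in this example, but the haplotype test does, so SNP 6 need not be genotyped. In real data such a haplotype predictor is estimated from a finite reference panel, which is where the overfitting caveat on the slide applies.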

Efficiency and power • ~300,000 tag SNPs needed to cover common variation in whole genome in CEU [Figure: relative power (%) against average marker density (per kb), for tag SNPs and random SNPs] • P.I.W. de Bakker et al. (2005) Nat Genet Advance Online Publication 23 Oct 2005

Will tag SNPs picked from HapMap apply to other population samples? Two issues: • What if LD structure strongly differs between my samples and the HapMap samples? Are CEU or YRI panels good surrogates for Latinos from Los Angeles? Are CEU samples even good surrogates for whites from France? • Is HapMap sample size sufficient? Small sample ⇒ correlation overestimated; are tagging algorithms “overfitting” the sample? (PK slide)
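The sample-size concern can be illustrated with a small simulation in Python. This is a sketch under strong simplifying assumptions: two independent loci (true r² = 0), allele frequency 0.5 at both, and haplotypes observed directly. The function names, seed and replicate count are arbitrary choices, not from the lecture; 120 is the number of independent chromosomes in a 30-trio panel.

```python
import random


def sample_r2(n_haplotypes, rng):
    """r^2 in one random sample of haplotypes at two independent loci."""
    a = [1 if rng.random() < 0.5 else 0 for _ in range(n_haplotypes)]
    b = [1 if rng.random() < 0.5 else 0 for _ in range(n_haplotypes)]
    n = n_haplotypes
    n_a, n_b = sum(a), sum(b)
    n_ab = sum(i * j for i, j in zip(a, b))
    denom = n_a * (n - n_a) * n_b * (n - n_b)
    if denom == 0:  # a locus is monomorphic in this sample
        return None
    return (n * n_ab - n_a * n_b) ** 2 / denom


def mean_sample_r2(n_haplotypes, n_reps=500, seed=2008):
    """Average sample r^2 over replicates; the true value is 0."""
    rng = random.Random(seed)
    values = [sample_r2(n_haplotypes, rng) for _ in range(n_reps)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("nan")


for n in (20, 120, 1000):
    print(n, mean_sample_r2(n))
```

The average sample r² is positive even though the loci are independent, and it shrinks as the number of chromosomes grows. A tagging algorithm that picks the best of many candidate tags in a small panel compounds this upward bias.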

Will tag SNPs picked from HapMap apply to other population samples? [Figure: three panels, each labeled CEU: Utah residents with European ancestry (CEPH); whites from Los Angeles, CA; Botnia, Finland] • Population differences add very little inefficiency • Paul de Bakker, Pac Symp Biocomput 2006

De Bakker et al (2006) Nat Genet

Need and Goldstein (2006) Nat Genet

Impact of training set sample size [Figure panels: tags chosen as pairwise tags; tags chosen as multimarker tags (up to 6 markers)] • Zeggini et al. Nature Genetics 37, 1320-1322 (2005)
