Peter Kraft pkraft@hsph.harvard.edu Bldg 2 Rm 207 2-4271

EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 2: Patterns of LD and “tag SNP” selection. Peter Kraft pkraft@hsph.harvard.edu Bldg 2 Rm 207 2-4271. Before HapMap: “looking under lamppost”. Study 1: Pop’n A, small N, no assoc’n.


Presentation Transcript


  1. EPI293 Design and analysis of gene association studies. Winter Term 2008. Lecture 2: Patterns of LD and “tag SNP” selection. Peter Kraft, pkraft@hsph.harvard.edu, Bldg 2 Rm 207, 2-4271

  2. Before HapMap: “looking under lamppost” • Study 1: Pop’n A, small N, no assoc’n • Study 2: Pop’n A, large N, no assoc’n • Study 3: Pop’n B, large N, assoc’n • After HapMap, Study 2 revisited: Pop’n A, large N, assoc’n

  3. Outline • Measures of linkage disequilibrium • Reasons for LD and empirical patterns of LD • “Tagging” SNPs • The HapMap project • Resources and tools for SNP selection

  4. Outline • Measures of linkage disequilibrium • Reasons for LD and empirical patterns of LD • “Tagging” SNPs • The HapMap project • Resources and tools for SNP selection

  5. Basic idea: linkage disequilibrium A G A G a g a g a g A g A G A G A G A G a g A g a g a g A G A G Alleles at two (or more) loci are correlated on chromosomes drawn at random from the population

  6. Measures of linkage disequilibrium • Basic data: table of haplotype frequencies A G A G a g a g a g A g A G A G A G A G a g A g a g a g A G A G

  7. Linkage disequilibrium and marginal allele freqs. • pA & pG are (minor) allele frequencies • qA = 1 − pA; qG = 1 − pG • δ = xz − yw is a measure of departure from independence • No association between A and G ⇔ δ = 0 • Max(δ) = min(pA qG, pG qA)

  8. |D’| and r² are most common • D prime… • …ranges from 0 [no LD] to 1 [complete LD]… • …is less sensitive to marginal allele frequencies… • …is directly related to recombination fraction • R squared… • …also ranges from 0 to 1… • …is correlation between alleles on the same chromosome… • …is very sensitive to marginal allele frequencies… • …is directly related to study power • If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r²N cases and controls that directly measured G • r²N is the “effective sample size”
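
The effective sample size rule on this slide can be tried out numerically. The sketch below is illustrative only: the function names are hypothetical, and it simply applies the r²N relationship stated above (and its inverse) without modelling power directly.

```python
def effective_sample_size(n_typed, r2):
    """Sample size at the causal variant with roughly the same power
    as n_typed subjects genotyped at a marker in LD (r2) with it."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie between 0 and 1")
    return r2 * n_typed


def required_sample_size(n_direct, r2):
    """Subjects needed at the marker to match a study of n_direct
    subjects that typed the causal variant directly."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be greater than 0 and at most 1")
    return n_direct / r2


if __name__ == "__main__":
    # A marker with r2 = 0.6: 1000 subjects behave like about 600.
    print(effective_sample_size(1000, 0.6))
    # To match 600 directly typed subjects, about 1000 are needed.
    print(required_sample_size(600, 0.6))
```

The inverse form is often the more useful one at the design stage: a tag with r² = 0.8 inflates the required sample size by a factor of 1/0.8 = 1.25.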

  9. A G A G a g a g a g A g A G A G A G A G a g A g a g a g A G A G (haplotype counts: AG 8, Ag 2, aG 0, ag 6) • r² = (8×6 − 0)² / (10×6×8×8) = 0.6 • D’ = (8×6 − 0) / (8×6) = 1
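
The worked example on this slide can be checked with a few lines of Python. This is a minimal sketch, not code from the lecture: the function name is hypothetical, the mapping of x, y, w, z to the AG, Ag, aG, ag haplotypes is an assumption (the original haplotype table did not survive), and the bound used for negative δ is the standard one rather than something stated on the slides.

```python
def ld_measures(n_AG, n_Ag, n_aG, n_ag):
    """Return (delta, |D'|, r2) from counts of the four two-locus haplotypes."""
    n = n_AG + n_Ag + n_aG + n_ag
    # Assumed cell labels: x = AG, y = Ag, w = aG, z = ag (as frequencies).
    x, y, w, z = n_AG / n, n_Ag / n, n_aG / n, n_ag / n
    p_A, p_G = x + y, x + w
    q_A, q_G = 1 - p_A, 1 - p_G
    delta = x * z - y * w
    if delta >= 0:
        d_max = min(p_A * q_G, p_G * q_A)   # bound given on the slide
    else:
        d_max = min(p_A * p_G, q_A * q_G)   # standard bound for negative delta
    d_prime = abs(delta) / d_max if d_max > 0 else 0.0
    denom = p_A * q_A * p_G * q_G
    r2 = delta ** 2 / denom if denom > 0 else 0.0
    return delta, d_prime, r2


if __name__ == "__main__":
    # The 16 chromosomes listed on the slide: 8 AG, 2 Ag, 0 aG, 6 ag.
    delta, d_prime, r2 = ld_measures(8, 2, 0, 6)
    print(delta, d_prime, r2)
```

With these counts D’ is 1 because one of the four haplotypes (aG) is absent, while r² is only 0.6 because the allele frequencies at the two loci differ.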

  10. Computational detail • Haplotypes are rarely directly observed • Have to infer from genotype data • Genotypes may be consistent with more than one haplotype pair: the double heterozygote Aa/Gg could be AG/ag or Ag/aG • Most popular algorithm: Expectation-Maximization¹ • LD is related to, but not exactly equal to, the correlation from the 3×3 table of genotypes • Correlation from this table makes no assumptions about HWE (Weir, Genetic Data Analysis) ¹ Thomas pp. 243-245
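
To make the inference step concrete, here is a small sketch of an EM iteration for two biallelic SNPs. It is an illustration under stated assumptions, not the algorithm as presented in the cited text: the function names and the layout of the 3×3 genotype table are hypothetical, and the E-step assumes random pairing of haplotypes (Hardy-Weinberg equilibrium). Only the double heterozygote is phase-ambiguous, so only its counts are split at each iteration.

```python
def unambiguous_haplotypes(i, j):
    """Haplotype counts (AG, Ag, aG, ag) for a genotype with i copies of A
    and j copies of G, valid when at least one locus is homozygous."""
    if i == 2:
        return (j, 2 - j, 0, 0)
    if i == 0:
        return (0, 0, j, 2 - j)
    if j == 2:
        return (1, 0, 1, 0)
    if j == 0:
        return (0, 1, 0, 1)
    raise ValueError("double heterozygote is phase-ambiguous")


def em_haplotype_freqs(geno, max_iter=500, tol=1e-10):
    """Estimate frequencies of (AG, Ag, aG, ag).

    geno[i][j] = number of individuals with i copies of A and j copies of G.
    """
    n_chrom = 2 * sum(sum(row) for row in geno)
    if n_chrom == 0:
        raise ValueError("empty genotype table")
    # Haplotype counts that are known without ambiguity.
    base = [0.0] * 4
    for i in range(3):
        for j in range(3):
            if (i, j) != (1, 1):
                for k, c in enumerate(unambiguous_haplotypes(i, j)):
                    base[k] += c * geno[i][j]
    n_dh = geno[1][1]            # double heterozygotes
    freqs = [0.25] * 4           # starting values
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is AG/ag.
        cis, trans = freqs[0] * freqs[3], freqs[1] * freqs[2]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        counts = [base[0] + n_dh * p_cis, base[1] + n_dh * (1 - p_cis),
                  base[2] + n_dh * (1 - p_cis), base[3] + n_dh * p_cis]
        new = [c / n_chrom for c in counts]
        converged = max(abs(a - b) for a, b in zip(new, freqs)) < tol
        freqs = new
        if converged:
            break
    return freqs


if __name__ == "__main__":
    # 3 aa/gg individuals, 2 double heterozygotes, 3 AA/GG individuals.
    print(em_haplotype_freqs([[3, 0, 0], [0, 2, 0], [0, 0, 3]]))
```

In the usage example the only unambiguous haplotypes are AG and ag, so the EM assigns the double heterozygotes almost entirely to the AG/ag phase.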

  11. Outline • Measures of linkage disequilibrium • Reasons for LD and empirical patterns of LD • “Tagging” SNPs • The HapMap project • Resources and tools for SNP selection

  12. Why does LD exist? • “Recombination coldspots” • Demographics (e.g. bottlenecks) • Population stratification or admixture • Confounds gene-disease association • Does not decay with distance (among other reasons… selective pressure … etc.)

  13. A Decay of LD in Pictures

  14. Decay of LD: δ_T = δ_0 (1 − θ)^T [Figure panels: LD after 1, 5, 10, 20, 40 and 80 generations]
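
The geometric decay of LD can be tabulated for the generation counts shown in the figure panels. The sketch below assumes the usual form of the relationship, in which LD after T generations equals the initial LD multiplied by (1 − θ)^T, with θ the recombination fraction between the two loci; the function names and the example values of θ are hypothetical.

```python
import math


def ld_after(delta0, theta, generations):
    """LD remaining after a number of generations of random mating."""
    return delta0 * (1.0 - theta) ** generations


def generations_to_fraction(theta, fraction):
    """Generations needed for LD to fall to a given fraction of its start."""
    if not 0.0 < theta < 1.0 or not 0.0 < fraction <= 1.0:
        raise ValueError("theta in (0, 1) and fraction in (0, 1] required")
    return math.log(fraction) / math.log(1.0 - theta)


if __name__ == "__main__":
    for theta in (0.5, 0.01, 0.001):      # unlinked, ~1 cM, ~0.1 cM
        row = [round(ld_after(1.0, theta, t), 3) for t in (1, 5, 10, 20, 40, 80)]
        print(theta, row)
```

For unlinked loci (θ = 0.5) LD halves every generation, whereas for tightly linked markers it persists for many generations, which is why LD patterns reflect both recombination and population history.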

  15. 200 kbp from chr2, positions 51,783,239 to 51,983,238 Data from the ENCODE project http://www.hapmap.org/downloads/encode1.html.en

  16. Implications • Admixture can lead to false positives • Two unlinked loci can stay in LD • Recent admixture, continual gene flow problematic • Isolated populations have advantages for fine-mapping • LD extends long distances, so fewer markers need be typed • But resolution may be poor Knowledge of local LD structure is essential for candidate gene studies !

  17. Outline • Measures of linkage disequilibrium • Reasons for LD and empirical patterns of LD • “Tagging” SNPs • The HapMap project • Resources and tools for SNP selection

  18. Basic “tagging” design • Measure haplotypes/LD pattern in a subsample (often external database) • Choose subset of SNPs (“tagSNPs”) that contain majority of information • Genotype “tagSNPs” in main study, analyze appropriately

  19. ATM: over 750 known SNPs – at least 50 are common in Europeans

  20. ATM

  21. “block” = region of limited haplotype diversity and/or high LD

  22. But there are unappealing aspects of the “haplotype block” idea • Definition and “block finding” algorithms are ad hoc • Different defns, algs lead to different block structures • Block structure changes with sample size, marker density • “Hard boundaries” are… • …unappealing for tagSNP selection (what about “between blocks”)… • …inaccurate description of LD patterns (some haps overlap boundaries) • Plus, haplotypes present analytic challenges • [Wall & Pritchard (2003a) Nat Rev Genet 4:587; (2003b) AJHG 73:502] [Nothnagel and Rohde (2005) AJHG 77:988]

  23. CYP19

  24. CYP19

  25. Keep it simple [Figure: six SNPs (1 A/T, 2 G/A, 3 G/C, 4 T/C, 5 G/C, 6 A/C) on example haplotypes, with three groups of SNPs in high r²] • We want SNPs that predict unobserved variants • Why not choose SNPs based on pairwise correlations? • Q: What if we don’t know enough about common genetic variation to say we’ve captured it? • A: HapMap and resequencing projects

  26. Outline • Measures of linkage disequilibrium • Reasons for LD and empirical patterns of LD • “Tagging” SNPs • The HapMap project • Resources and tools for SNP selection

  27. HapMap: application in the design and interpretation of association studies Mark J. Daly, PhD on behalf of The International HapMap Consortium [OK it may look like I’m totally stealing these slides—but they are free on the web at http://www.hapmap.org/tutorials.html.en]

  28. Goals of this segment • Briefly summarize HapMap design and current status • Discuss the application of HapMap to all aspects of association study design, analysis and interpretation

  29. HapMap Project High-density SNP genotyping across the genome provides information about • SNP validation, frequency, assay conditions • correlation structure of alleles in the genome A freely-available public resource to increase the power and efficiency of genetic association studies to medical traits All data is freely available on the web for application in study design and analyses as researchers see fit

  30. HapMap Samples • 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI) • 90 individuals (30 trios) of European descent from Utah (CEU) • 45 Han Chinese individuals from Beijing (CHB) • 45 Japanese individuals from Tokyo (JPT)

  31. HapMap progress • PHASE I – completed, described in Nature paper • * 1,000,000 SNPs successfully typed in all 270 HapMap samples • * ENCODE variation reference resource available • PHASE II – data generation complete, data released early November 2005 • * >3,500,000 SNPs typed in total! Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, et al. (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-61.

  32. ENCODE-HAPMAP variation project • Ten “typical” 500kb regions • 48 samples sequenced • All discovered SNPs (and those in dbSNP) typed in all 270 HapMap samples • Current data set – 1 SNP every 279 bp • A much more complete variation resource by which the genome-wide map can be evaluated

  33. Completeness of dbSNP Vast majority of common SNPs are contained in or highly correlated with a SNP in dbSNP

  34. Recombination hotspots are widespread and account for LD structure [Figure: 7q21]

  35. Coverage of Phase II HapMap (estimated from ENCODE data)
      Panel      % r² > 0.8   max r²
      YRI        81           0.90
      CEU        94           0.97
      CHB+JPT    94           0.97
      Vast majority of common variation (MAF > .05) captured by Phase II HapMap. From Table 6 – “A Haplotype Map of the Human Genome”, Nature

  36. Applying the HapMap • Study design - tagging • Study coverage evaluation • Study analysis - improving association testing • Study interpretation • Comparison of multiple studies • Connection to genes/genomic features • Integration with expression and other functional data • Other uses of HapMap data • Admixture, LOH, selection

  37. Tagging from HapMap • Since HapMap describes the majority of common variation in the genome, choosing non-redundant sets of SNPs from HapMap offers considerable efficiency without power loss in association studies

  38. Pairwise tagging [Figure: six SNPs (1 A/T, 2 G/A, 3 G/C, 4 T/C, 5 G/C, 6 A/C) on example haplotypes, with three groups of SNPs in high r²] • Tags: SNP 1, SNP 3, SNP 6 (3 in total) • Test for association: SNP 1, SNP 3, SNP 6 • After Carlson et al. (2004) AJHG 74:106
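
Pairwise tagging of this kind can be sketched as a greedy selection over a matrix of pairwise r² values. The code below is a simplified illustration in the spirit of the approach cited on this slide, not a reimplementation of any published tool: the function name, the 0.8 threshold default and the r² values in the example are all hypothetical, chosen so that SNPs 1 and 2, 3 and 5, and 4 and 6 form three high-r² groups.

```python
def greedy_pairwise_tags(names, r2, threshold=0.8):
    """Pick tag SNPs so every SNP has r2 >= threshold with at least one tag.

    names: list of SNP labels; r2: symmetric matrix (list of lists).
    """
    n = len(names)
    # Each SNP captures itself plus every SNP it is highly correlated with.
    captures = [{j for j in range(n) if i == j or r2[i][j] >= threshold}
                for i in range(n)]
    untagged = set(range(n))
    tags = []
    while untagged:
        # Choose the SNP capturing the most still-untagged SNPs
        # (ties go to the SNP listed first).
        best = max(range(n), key=lambda i: len(captures[i] & untagged))
        tags.append(names[best])
        untagged -= captures[best]
    return tags


if __name__ == "__main__":
    names = ["SNP 1", "SNP 2", "SNP 3", "SNP 4", "SNP 5", "SNP 6"]
    groups = [{0, 1}, {2, 4}, {3, 5}]     # hypothetical high-r2 groups
    r2 = [[1.0 if i == j
           else 0.9 if any({i, j} <= g for g in groups)
           else 0.1
           for j in range(6)] for i in range(6)]
    print(greedy_pairwise_tags(names, r2))
```

The sketch returns one tag per group, three in total. Within a group the choice is arbitrary (here the first-listed SNP wins ties), so SNP 4 and SNP 6 are interchangeable as tags.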

  39. Pairwise Tagging Efficiency Tag SNPs were picked to capture common SNPs in release 16c.1 for every 7,000 SNP bin using Haploview. Tagging Phase I HapMap offers 2-5x gains in efficiency

  40. Use of haplotypes can improve genotyping efficiency [Figure: six SNPs (1 A/T, 2 G/A, 3 G/C, 4 T/C, 5 G/C, 6 A/C) on example haplotypes] • Pairwise: Tags: SNP 1, SNP 3, SNP 6 (3 in total); test for association: SNP 1, SNP 3, SNP 6 • Multi-marker: Tags: SNP 1, SNP 3 (2 in total); test for association: SNP 1 captures 1+2, SNP 3 captures 3+5, “AG” haplotype captures SNP 4+6 • Tags in multi-marker test should be conditional on significance of LD in order to avoid overfitting

  41. Efficiency and power • ~300,000 tag SNPs needed to cover common variation in whole genome in CEU [Figure: relative power (%) against average marker density (per kb), tag SNPs vs. random SNPs] P.I.W. de Bakker et al. (2005) Nat Genet Advance Online Publication 23 Oct 2005

  42. Will tag SNPs picked from HapMap apply to other population samples? Two issues: • What if LD structure strongly differs between my samples and the HapMap samples? Are CEU or YRI panels good surrogates for Latinos from Los Angeles? Are CEU samples even good surrogates for whites from France? • Is HapMap sample size sufficient? Small sample ⇒ correlation overestimated; are tagging algorithms “overfitting” the sample? (PK slide)

  43. Will tag SNPs picked from HapMap apply to other population samples? [Figure panels, each labeled CEU: Utah residents with European ancestry (CEPH); whites from Los Angeles, CA; Botnia, Finland] Population differences add very little inefficiency. Paul de Bakker, Pac Symp Biocomput 2006

  44. De Bakker et al (2006) Nat Genet

  45. Need and Goldstein (2006) Nat Genet

  46. Impact of training set sample size • Tags chosen as pairwise tags • Tags chosen as multimarker tags (up to 6 markers) • Zeggini et al. Nature Genetics 37, 1320-1322 (2005)
