
Workshop in Bioinformatics



Presentation Transcript


  1. Workshop in Bioinformatics Eran Halperin

  2. The Human Genome Project “What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.” “But our work previously has shown… that having one genetic code is important, but it's not all that useful.” “I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…” Washington, DC, June 26, 2000

  3. The Vision of Personalized Medicine Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for personalized treatment and diagnosis.

  4. Example: Warfarin An anticoagulant drug, useful in the prevention of thrombosis.

  5. Example: Warfarin Warfarin was originally used as rat poison. The optimal dose varies across the population. Genetic variants (VKORC1 and CYP2C9) affect the variation of the personalized optimal dose.

  6. Association Studies Genetic variants such as Single Nucleotide Polymorphisms (SNPs) and Copy Number Variants (CNVs) are tested for association with the trait.

  7. Associated SNP Where should we look? Usually SNPs are bi-allelic SNP= Single Nucleotide Polymorphism Cases: AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC Controls: AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC

  8. Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 x 10^-8. NHGRI GWA Catalog, www.genome.gov/GWAStudies

  9. Complex disease: Genetic Factors + Environmental Factors. Multiple genes may affect the disease. Therefore, the effect of every single gene may be negligible.

  10. How does it work? • For every pair of SNPs we can construct a contingency table:
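A minimal sketch of a contingency-table test, assuming a single SNP tabulated as allele counts against case/control status and a Pearson chi-square test with one degree of freedom. The slide does not name the test, so that choice is an assumption, and the counts below are hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 table [[a, b], [c, d]] of allele counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of allele and status.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: rows = cases, controls; columns = allele 1, allele 2.
table = [[60, 40], [40, 60]]
chi2, p = chi_square_2x2(table)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

In a genome-wide scan the same function would be applied once per SNP, which is what makes the multiple-testing issue on the later slides matter.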

  11. Results: Manhattan Plots

  12. The curse of dimensionality – corrections of multiple testing • In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs. • If we set the p-value threshold for each test to be 0.05, by chance we will “find” about 5% of the SNPs to be associated with the disease. • This needs to be corrected.

  13. Bonferroni Correction • If the number of tests is n, we set the threshold to be 0.05/n. • A very conservative test. If the tests are independent then it is reasonable to use it. If the tests are correlated this could be bad: • Example: If all SNPs are identical, the n tests are effectively a single test, so dividing the threshold by n reduces the false positive rate far more than needed and we lose a lot of power.
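The arithmetic behind the last two slides can be sketched as follows. The function names and the p-values are hypothetical, and the expected false positive count assumes that none of the tested SNPs is truly associated.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of SNPs passing an uncorrected per-test threshold
    when no SNP is truly associated (assumption of this sketch)."""
    return alpha * n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def significant_snps(p_values, alpha=0.05):
    """Names of SNPs whose p-value is below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [snp for snp, p in p_values.items() if p < threshold]

# With one million tests, 0.05 per test gives roughly 50,000 chance hits,
# while the Bonferroni threshold is roughly 5 x 10^-8.
print(expected_false_positives(0.05, 1_000_000))
print(bonferroni_threshold(0.05, 1_000_000))

# Hypothetical p-values for four SNPs; the threshold here is 0.05 / 4.
p_values = {"snp1": 0.04, "snp2": 1e-9, "snp3": 0.2, "snp4": 0.011}
hits = significant_snps(p_values)
print(hits)
```

For one million tests the corrected threshold is the same order of magnitude as the p < 5 x 10^-8 cutoff quoted for the NHGRI catalog on slide 8.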

  14. Data

  15. International consortium that aims to genotype 270 individuals from four different populations. HUJI 2006

  16. Launched in 2002. • First phase (2005): • ~1 million SNPs for 270 individuals from four populations • Second phase (2007): • ~3.1 million SNPs for 270 individuals from four populations • Third phase (ongoing): • > 1 million SNPs for 1115 individuals across 11 populations

  17. Other Data Sources • Human Genome Diversity Project • 50 populations, 1000 individuals, 650k SNPs • POPRES • 6000 individuals (controls) • ENCODE Project • Resequencing, discovery of new SNPs • 1000 Genomes Project • dbGaP

  18. Haplotypes

  19. Haplotypes • Can 1,000,000 SNPs tell us everything? • No, but they can still tell us a lot about the rest of the genome. • SNPs in physical proximity are correlated. • A sequence of alleles along a chromosome is called a haplotype.
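One common way to quantify the correlation between nearby SNPs is the squared correlation r² between the alleles at two sites. The slide does not name a measure, so using r² here is an assumption, and the haplotypes below are hypothetical strings of 0/1 alleles.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between the alleles at sites i and j,
    computed from phased haplotypes given as strings of '0'/'1'."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ij - p_i * p_j  # deviation from independence
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom

# Hypothetical sample: sites 0 and 1 always carry the same allele,
# while site 2 varies independently of site 0.
haplotypes = ["110", "110", "001", "000", "111", "000"]
r2_linked = r_squared(haplotypes, 0, 1)
r2_unlinked = r_squared(haplotypes, 0, 2)
print(r2_linked, r2_unlinked)
```

When r² between two sites is close to 1, genotyping one of them tells us almost everything about the other, which is why a subset of SNPs can be informative about the rest of the genome.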

  20. Haplotype Data in a Block (Daly et al., 2001) Block 6 from Chromosome 5q31

  21. LD structure

  22. Phasing - haplotype inference • Cost-effective genotyping technology gives genotypes and not haplotypes. Haplotypes (one chromosome from the mother, one from the father): ATCCGA and AGACGC. Genotype: A {T/G} {C/A} C G {A/C}. Possible phases: ATACGA / AGCCGC, AGACGA / ATCCGC, …
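To see why the genotype alone is ambiguous, the sketch below enumerates every haplotype pair consistent with an unphased genotype. The representation (one unordered allele pair per site) and the function name are assumptions made for illustration; the example genotype is the one from the slide.

```python
from itertools import product

def possible_phases(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.
    `genotype` is a list of (allele, allele) tuples, one per site."""
    het_sites = [k for k, (a, b) in enumerate(genotype) if a != b]
    # Keep the first heterozygous site fixed so each unordered pair of
    # haplotypes is produced exactly once.
    free_sites = het_sites[1:]
    phases = []
    for flips in product([False, True], repeat=len(free_sites)):
        flip_at = dict(zip(free_sites, flips))
        hap1, hap2 = [], []
        for k, (a, b) in enumerate(genotype):
            if flip_at.get(k, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        phases.append(("".join(hap1), "".join(hap2)))
    return phases

# Genotype from the slide: heterozygous at the 2nd, 3rd and 6th sites.
genotype = [("A", "A"), ("T", "G"), ("C", "A"),
            ("C", "C"), ("G", "G"), ("A", "C")]
phases = possible_phases(genotype)
for hap1, hap2 in phases:
    print(hap1, hap2)
```

With k heterozygous sites there are 2^(k-1) candidate phases, so the ambiguity grows quickly with the number of SNPs.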

  23. Inferring Haplotypes From Trios Assumption: No recombination. Genotypes: Parent 1 122112, Parent 2 210022, Child 120222. Inferred haplotypes (? = unresolved): Parent 1 10011? / 11111?, Parent 2 11000? / 01001?, Child 10011? / 11000?.
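The trio inference can be sketched site by site. This sketch assumes the genotype coding 0 and 1 for the two homozygous states and 2 for heterozygous, which is consistent with the genotypes on the slide but not stated there. It also assumes no recombination and does not check for Mendelian inconsistencies; the function name is hypothetical.

```python
def phase_trio(parent1, parent2, child):
    """Phase a trio site by site, assuming no recombination.
    Genotypes are strings over '0', '1' (homozygous) and '2' (heterozygous).
    Returns (transmitted, untransmitted) haplotypes for each parent and
    the child's two haplotypes; '?' marks sites that stay unresolved."""
    t1, u1, t2, u2 = [], [], [], []
    for g1, g2, gc in zip(parent1, parent2, child):
        g1, g2, gc = int(g1), int(g2), int(gc)
        if gc != 2:        # homozygous child: both parents transmitted gc
            a1 = a2 = gc
        elif g1 != 2:      # heterozygous child, parent 1 homozygous
            a1, a2 = g1, 1 - g1
        elif g2 != 2:      # heterozygous child, parent 2 homozygous
            a1, a2 = 1 - g2, g2
        else:              # all three heterozygous: cannot be resolved
            a1 = a2 = None
        for g, a, transmitted, untransmitted in ((g1, a1, t1, u1),
                                                 (g2, a2, t2, u2)):
            transmitted.append("?" if a is None else str(a))
            if g != 2:
                untransmitted.append(str(g))
            elif a is None:
                untransmitted.append("?")
            else:
                untransmitted.append(str(1 - a))
    join = "".join
    return {
        "parent1": (join(t1), join(u1)),
        "parent2": (join(t2), join(u2)),
        "child": (join(t1), join(t2)),
    }

result = phase_trio("122112", "210022", "120222")
for name, haps in result.items():
    print(name, haps)
```

Only the last site, where both parents and the child are heterozygous, remains unresolved for this trio.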

  24. Population Substructure • Imagine that all the cases are collected from Africa, and all the controls are from Europe. • Many association signals are going to be found. • The vast majority of them are false. Why? Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.

  25. Natural Selection • Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene → different allele frequencies in LCT

  26. Genetic Drift • Even without selection, the allele frequencies in the population are not fixed across time. • Consider the following case: • We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population. • We assume a constant population size, no mutation, no selection

  27. Genetic Drift: The Wright-Fisher Model Generation 1 Allele frequency 1/9

  28. Genetic Drift: The Wright-Fisher Model Generation 2 Allele frequency 1/9

  29. Genetic Drift: The Wright-Fisher Model Generation 3 Allele frequency 1/9

  30. Genetic Drift: The Wright-Fisher Model Generation 4 Allele frequency 1/3

  31. Genetic Drift: The Wright-Fisher Model

  32. Genetic Drift: The Wright-Fisher Model
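A minimal simulation of drift under the Wright-Fisher model, assuming a fixed number of allele copies per generation, each drawn independently from the previous generation's allele frequency. The choice of 9 copies and a starting frequency of 1/9 follows the frequencies shown on the slides; the function name, seed, and number of generations are arbitrary choices for illustration.

```python
import random

def wright_fisher(n_copies, initial_count, generations, seed=None):
    """Allele frequency trajectory under pure drift: constant population
    size, no mutation, no selection, random mating."""
    rng = random.Random(seed)
    count = initial_count
    frequencies = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Each copy in the next generation picks its parent at random,
        # so it carries the allele with probability p.
        count = sum(rng.random() < p for _ in range(n_copies))
        frequencies.append(count / n_copies)
    return frequencies

trajectory = wright_fisher(n_copies=9, initial_count=1, generations=20, seed=1)
print([round(f, 2) for f in trajectory])
```

Running it with different seeds shows the frequency wandering from generation to generation until the allele is either lost or fixed, after which it no longer changes.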

  33. Ancestral population

  34. Ancestral population migration

  35. Ancestral population different allele frequencies Genetic drift

  36. Population Substructure • Imagine that all the cases are collected from Africa, and all the controls are from Europe. • Many association signals are going to be found • The vast majority of them are false; What can we do about it?

  37. Jakobsson et al., Nature 451: 998-1003 (2008)

  38. Principal Component Analysis • Dimensionality reduction • Based on linear algebra • Intuition: find the ‘most important’ features of the data

  39. Principal Component Analysis Plotting the data on a one-dimensional line for which the ‘spread’ is maximized.

  40. Principal Component Analysis • In our case, we want to look at two dimensions at a time. • The original data has many dimensions – each SNP corresponds to one dimension.
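A small sketch of PCA on a genotype matrix, with one row per individual and one column per SNP. It uses only the standard library and finds the top two components by power iteration with deflation; in practice a linear algebra library would normally be used instead. The genotype matrix is hypothetical: the first three individuals and the last three are drawn to have different allele frequencies.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center_columns(genotypes):
    """Subtract each SNP's mean so every column has mean zero."""
    n = len(genotypes)
    means = [sum(column) / n for column in zip(*genotypes)]
    return [[g - m for g, m in zip(row, means)] for row in genotypes]

def top_component(X, iterations=200, seed=0):
    """Direction of maximal spread of the rows of X (unit vector),
    found by power iteration on X^T X."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in X[0]]
    for _ in range(iterations):
        scores = [dot(row, v) for row in X]            # X v
        v = [dot(scores, col) for col in zip(*X)]      # X^T (X v)
        norm = dot(v, v) ** 0.5
        v = [w / norm for w in v]
    return v

def remove_component(X, v):
    """Project out direction v from every row (deflation)."""
    return [[x - dot(row, v) * w for x, w in zip(row, v)] for row in X]

# Hypothetical genotypes (0, 1 or 2 copies of an allele at five SNPs).
G = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 0],
    [1, 0, 2, 1, 1],
]
X = center_columns(G)
pc1 = top_component(X)
pc2 = top_component(remove_component(X, pc1), seed=1)
coords = [(dot(row, pc1), dot(row, pc2)) for row in X]
for point in coords:
    print(f"{point[0]:+.2f} {point[1]:+.2f}")
```

Plotting the two coordinates per individual gives the kind of two-dimensional picture described above; in this toy data the first coordinate separates the two groups of individuals.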

  41. HapMap Populations MKK LWK YRI GIH ASW MEX JPT CHD CHB CEU TSI

  42. HapMap PCA 1-2

  43. HapMap PCA 1-3

  44. HapMap PCA 1,2,4

  45. Ancestry Inference: • To what extent can population structure be detected from SNP data? • What can we learn from these inferences? Novembre et al., 2008

  46. Ancestry inference in recently admixed populations Puerto Rican Population (GALA study, E. Burchard) [Figure: percent racial admixture (0% to 100%) for individual subjects 1-90, split into European, African, and Native American ancestry]

  47. Recombination Events Copy 1 Copy 2 Probability ri for recombination in position i. child chromosome
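The recombination model on this slide can be sketched as a walk along the chromosome that copies from one parental copy and switches to the other with some probability at each step. Here r[i] is taken to be the probability of a crossover between positions i and i+1, and the starting copy is chosen at random; both conventions are assumptions, since the slide only states a probability ri at position i.

```python
import random

def recombine(copy1, copy2, r, rng):
    """Build a child chromosome from two parental copies.
    r[i] is the crossover probability between positions i and i + 1,
    so len(r) == len(copy1) - 1."""
    copies = (copy1, copy2)
    current = rng.randrange(2)  # which copy we start reading from
    child = [copies[current][0]]
    for i in range(1, len(copy1)):
        if rng.random() < r[i - 1]:
            current = 1 - current  # crossover: switch to the other copy
        child.append(copies[current][i])
    return "".join(child)

rng = random.Random(0)
# Letters mark which parental copy each position came from.
copy1, copy2 = "AAAAAAAA", "BBBBBBBB"
child = recombine(copy1, copy2, [0.2] * 7, rng)
print(child)
```

With small ri the child chromosome is a mosaic of a few long segments from the two copies, which is the pattern that ancestry inference in admixed populations relies on.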

  48. Recently Admixed Populations After generation 1

  49. Recently Admixed Populations After generation 2
