1 / 32

Of Sea Urchins, Birds and Men

Of Sea Urchins, Birds and Men. Introduction. SNPs, HAPLOTYPES. Genome Assembly (Ch. 5) Hidden Markov Models (Ch. 4) Phylogenetic Trees (Ch. 3) Sequence Alignment (Ch. 1). Single Nucleotide Polymorphism (SNP). GATTTAGATC G CGATAGAG GATTTAGATC T CGATAGAG.

perezangela
Download Presentation

Of Sea Urchins, Birds and Men

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Of Sea Urchins, Birds and Men

  2. Introduction SNPs, HAPLOTYPES

  3. Genome Assembly (Ch. 5) Hidden Markov Models (Ch. 4) Phylogenetic Trees (Ch. 3) Sequence Alignment (Ch. 1)

  4. Single Nucleotide Polymorphism (SNP) GATTTAGATCGCGATAGAG GATTTAGATCTCGATAGAG • The most abundant type of polymorphism A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%. The two alleles at the site are G andT

  5. tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgaggaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctcatttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgaggaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca t c g a g a t c t c g a g c t c g a t c t c t c g a g a t c g a t c g a g c g c g a g a t c g a g c g a g a • Human Genome contains ~ 3 G basepairs arranged in 46 chromosomes. • Two individuals are 99.9% the same. I.e. differ in ~ 3 M basepairs. • SNPs occur once every ~600 bp • Average gene in the human genome spans ~27Kb • ~50 SNPs per gene

  6. Haplotype C A G Haplotypes T T G G C T C G A C A A C A G G T T C G T C A A C A G SNP SNP SNP Two individuals

  7. Mutations Infinite Sites Assumption: Each site mutates at most once

  8. Haplotype Pattern C A G T T T G A C A T G C T G T 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 At each SNP site label the two alleles as 0 and 1. The choice which allele is 0 and which one is 1 is arbitrary.

  9. Recombination G T T C G A C A A C A T A C G T A T C T A T T A G T T C G A CT A T T A

  10. Recombination The two alleles are linked, I.e., they are “traveling together” G T T C G A C A A C A T A C G T A T C T A T T A Recombination disrupts the linkage ? G T T C G A CT A T T A

  11. Emergence of Variations Over Time Disease Mutation Common Ancestor present time Linkage Disequilibrium (LD) Variations in Chromosomes Within a Population

  12. Disease-Causing Mutation 2,000 gens. ago 1,000 gens. ago Extent of Linkage Disequilibrium Time = present

  13. A Data Compression Problem • Select SNPs to use in an association study • Would like to associate single nucleotide polymorphisms (SNPs) with disease. • Very large number of candidate SNPs • Chromosome wide studies, whole genome-scans • For cost effectiveness, select only a subset. • Closely spaced SNPs are highly correlated • It is less likely that there has been a recombination between two SNPs if they are close to each other.

  14. Disease Associations

  15. Control Non-responder Disease Responder Allele 0 Allele 1 Marker A: Allele 0 = Allele 1 = Marker A is associated with Phenotype Association studies

  16. Evaluate whether nucleotide polymorphisms associate with phenotype T T C T C T A G G G G A G A A A G G A C A A A A T T G T G G Association studies

  17. T T T C C T G G A G A G G A G G A A A A C A A A G T T T G G Association studies

  18. 1 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 0 0 Association studies

  19. Real Haplotype Data A region of Chr. 2245 Caucasian samples Our block-free algorithm Two different runs of the Gabriel el al Block Detection method + Zhang et al SNP selection algorithm

  20. A Data Compression Problem The Minimum Informative Subset

  21. A Data Compression Problem • Select SNPs to use in an association study • Would like to associate single nucleotide polymorphisms (SNPs) with disease. • Very large number of candidate SNPs • Chromosome wide studies, whole genome-scans • For cost effectiveness, select only a subset. • Closely spaced SNPs are highly correlated • It is less likely that there has been a recombination between two SNPs if they are close to each other.

  22. Control Non-responder Disease Responder Allele 0 Allele 1 Marker A: Allele 0 = Allele 1 = Marker A is associated with Phenotype Association studies

  23. Evaluate whether nucleotide polymorphisms associate with phenotype T T C T C T A G G G G A G A A A G G A C A A A A T T G T G G Association studies

  24. T T T C C T G G A G A G G A G G A A A A C A A A G T T T G G Association studies

  25. Hypothesis – Haplotype Blocks? • The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks • Patil et al., Science, 2001; • Jeffreys et al., Nature Genetics, 2001; • Daly et al., Nature Genetics, 2001

  26. Haplotype Block StructureLD-Blocks, and 4-Gamete Test Blocks 200 kb Sense genes DNA Antisense genes SNPs Haplotype blocks 1 2 3 4

  27. Dynamic programming framework Partitioning a chromosome into blocks • Zhang et al. (PNAS, 2002). • Zhang et al. RECOMB, 2003 • H. I. Avi-Itzhak et al. PSB, 2003 • Sebastiani et al. PNAS 2003 • Patil et al., PNAS 2002. Parametric in block test • Solve a dynamic program • Optimal block partition requires the minimal number of blocks. • Within blocks one can select the SNPs that maximize entropy, diversity or r2 correlation

  28. Data Compression A------A---TG-- G------G---CG-- A------G---TC-- A------G---CC-- G------A---TG-- ACGATCGATCATGAT GGTGATTGCATCGAT ACGATCGGGCTTCCG ACGATCGGCATCCCG GGTGATTATCATGAT Selecting Tagging SNPs in blocks Haplotype Blocks based on LD (Method of Gabriel et al.2002)

  29. 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 Informativeness s1 s2 s3 s4 s5 I({s1,s2}, s4) = 3/4

  30. 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 Informativeness s1 s2 s3 s4 s5 I({s3,s4},{s1,s2,s5}) = 3 S={s3,s4} is a Minimal Informative Subset

  31. 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 s1 s2 s3 s4 s5 e6 Informativeness e5 s5 Graph theory insight Minimum Set Cover= Minimum Informative Subset e4 s4 e3 s3 s2 e2 s1 e1 Edges SNPs

  32. 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 s1 s2 s3 s4 s5 e6 Informativeness e5 s5 Graph theory insight Minimum Set Cover {s3, s4}= Minimum Informative Subset e4 s4 e3 s3 s2 e2 s1 e1 SNPs Edges

More Related