320 likes | 346 Views
Learn about SNP polymorphisms, genome assembly, hidden Markov models, and more in genomics. Understand phylogenetic trees, sequence alignment, and SNP analysis.
E N D
Introduction SNPs, HAPLOTYPES
Genome Assembly (Ch. 5) Hidden Markov Models (Ch. 4) Phylogenetic Trees (Ch. 3) Sequence Alignment (Ch. 1)
Single Nucleotide Polymorphism (SNP) GATTTAGATCGCGATAGAG GATTTAGATCTCGATAGAG • The most abundant type of polymorphism A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%. The two alleles at the site are G andT
tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgaggaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctcatttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgaggaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca t c g a g a t c t c g a g c t c g a t c t c t c g a g a t c g a t c g a g c g c g a g a t c g a g c g a g a • Human Genome contains ~ 3 G basepairs arranged in 46 chromosomes. • Two individuals are 99.9% the same. I.e. differ in ~ 3 M basepairs. • SNPs occur once every ~600 bp • Average gene in the human genome spans ~27Kb • ~50 SNPs per gene
Haplotype C A G Haplotypes T T G G C T C G A C A A C A G G T T C G T C A A C A G SNP SNP SNP Two individuals
Mutations Infinite Sites Assumption: Each site mutates at most once
Haplotype Pattern C A G T T T G A C A T G C T G T 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 At each SNP site label the two alleles as 0 and 1. The choice which allele is 0 and which one is 1 is arbitrary.
Recombination G T T C G A C A A C A T A C G T A T C T A T T A G T T C G A CT A T T A
Recombination The two alleles are linked, I.e., they are “traveling together” G T T C G A C A A C A T A C G T A T C T A T T A Recombination disrupts the linkage ? G T T C G A CT A T T A
Emergence of Variations Over Time Disease Mutation Common Ancestor present time Linkage Disequilibrium (LD) Variations in Chromosomes Within a Population
Disease-Causing Mutation 2,000 gens. ago 1,000 gens. ago Extent of Linkage Disequilibrium Time = present
A Data Compression Problem • Select SNPs to use in an association study • Would like to associate single nucleotide polymorphisms (SNPs) with disease. • Very large number of candidate SNPs • Chromosome wide studies, whole genome-scans • For cost effectiveness, select only a subset. • Closely spaced SNPs are highly correlated • It is less likely that there has been a recombination between two SNPs if they are close to each other.
Control Non-responder Disease Responder Allele 0 Allele 1 Marker A: Allele 0 = Allele 1 = Marker A is associated with Phenotype Association studies
Evaluate whether nucleotide polymorphisms associate with phenotype T T C T C T A G G G G A G A A A G G A C A A A A T T G T G G Association studies
T T T C C T G G A G A G G A G G A A A A C A A A G T T T G G Association studies
1 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 0 0 Association studies
Real Haplotype Data A region of Chr. 2245 Caucasian samples Our block-free algorithm Two different runs of the Gabriel el al Block Detection method + Zhang et al SNP selection algorithm
A Data Compression Problem The Minimum Informative Subset
A Data Compression Problem • Select SNPs to use in an association study • Would like to associate single nucleotide polymorphisms (SNPs) with disease. • Very large number of candidate SNPs • Chromosome wide studies, whole genome-scans • For cost effectiveness, select only a subset. • Closely spaced SNPs are highly correlated • It is less likely that there has been a recombination between two SNPs if they are close to each other.
Control Non-responder Disease Responder Allele 0 Allele 1 Marker A: Allele 0 = Allele 1 = Marker A is associated with Phenotype Association studies
Evaluate whether nucleotide polymorphisms associate with phenotype T T C T C T A G G G G A G A A A G G A C A A A A T T G T G G Association studies
T T T C C T G G A G A G G A G G A A A A C A A A G T T T G G Association studies
Hypothesis – Haplotype Blocks? • The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks • Patil et al., Science, 2001; • Jeffreys et al., Nature Genetics, 2001; • Daly et al., Nature Genetics, 2001
Haplotype Block StructureLD-Blocks, and 4-Gamete Test Blocks 200 kb Sense genes DNA Antisense genes SNPs Haplotype blocks 1 2 3 4
Dynamic programming framework Partitioning a chromosome into blocks • Zhang et al. (PNAS, 2002). • Zhang et al. RECOMB, 2003 • H. I. Avi-Itzhak et al. PSB, 2003 • Sebastiani et al. PNAS 2003 • Patil et al., PNAS 2002. Parametric in block test • Solve a dynamic program • Optimal block partition requires the minimal number of blocks. • Within blocks one can select the SNPs that maximize entropy, diversity or r2 correlation
Data Compression A------A---TG-- G------G---CG-- A------G---TC-- A------G---CC-- G------A---TG-- ACGATCGATCATGAT GGTGATTGCATCGAT ACGATCGGGCTTCCG ACGATCGGCATCCCG GGTGATTATCATGAT Selecting Tagging SNPs in blocks Haplotype Blocks based on LD (Method of Gabriel et al.2002)
1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 Informativeness s1 s2 s3 s4 s5 I({s1,s2}, s4) = 3/4
1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 Informativeness s1 s2 s3 s4 s5 I({s3,s4},{s1,s2,s5}) = 3 S={s3,s4} is a Minimal Informative Subset
0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 s1 s2 s3 s4 s5 e6 Informativeness e5 s5 Graph theory insight Minimum Set Cover= Minimum Informative Subset e4 s4 e3 s3 s2 e2 s1 e1 Edges SNPs
0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 s1 s2 s3 s4 s5 e6 Informativeness e5 s5 Graph theory insight Minimum Set Cover {s3, s4}= Minimum Informative Subset e4 s4 e3 s3 s2 e2 s1 e1 SNPs Edges