SNPs, Haplotypes, Disease Associations. Algorithmic Foundations of Computational Biology II Course 1. Prof. Sorin Istrail. SNPs and the Human Genome: The Minimal Informative Subset. Overview. Introduction: SNPs, Haplotypes A Data Compression Problem:
SNPs, Haplotypes, Disease Associations Algorithmic Foundations of Computational Biology II Course 1 Prof. Sorin Istrail
Overview • Introduction: SNPs, Haplotypes • A Data Compression Problem: The Minimum Informative Subset • A New Measure: Informativeness
A Most Challenging Problem “None of the [advances of the 20th century medicine] depend on a deep knowledge of cellular processes or on any discoveries of molecular biology. Cancer is still treated by gross physical and chemical assaults on the offending tissue. Cardiovascular Disease is treated by surgery whose anatomical bases go back to the 19th century … Of course, intimate knowledge of the living cell and of basic molecular processes may be useful eventually.” Lewontin (1991)
Now “A decade later, molecular biology can claim very few successes for drugs in clinical use that were designed ab initio to control a specific component of a pathway linked to disease: these include the monoclonal antibody Herceptin, and the kinase inhibitor Gleevec.” Reik, Gregory and Urnov (2002)
Introduction SNPs, HAPLOTYPES
Single Nucleotide Polymorphism (SNP)
• The most abundant type of polymorphism
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%. The two alleles at this site are G and T.
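A minimal sketch of how the variable site in the two sequences above could be located. The function name `variable_sites` is an illustrative choice, and with only two sequences the >1% population-frequency criterion cannot be checked; the sketch only reports columns where the bases differ.

```python
def variable_sites(sequences):
    """Return (position, alleles) for every aligned column with more than one base."""
    sites = []
    for pos, column in enumerate(zip(*sequences)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            sites.append((pos, alleles))
    return sites


# The two aligned sequences shown on the slide.
seqs = ["GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG"]
print(variable_sites(seqs))  # 0-based position and the alleles seen there
```

On the slide's example this reports a single site with alleles G and T.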
[Figure: a long stretch of raw human genomic sequence]
• The human genome contains ~3 G base pairs arranged in 46 chromosomes.
• Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.
• SNPs occur once every ~600 bp.
• The average gene in the human genome spans ~27 kb.
• ~50 SNPs per gene.
Haplotype
Two individuals, three SNP sites:
G C T C G A C A A C A G   haplotype: C A G
G T T C G T C A A C A G   haplotype: T T G
Mutations Infinite Sites Assumption: Each site mutates at most once
Haplotype Pattern
C A G T T T G A C A T G C T G T
0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1
At each SNP site, label the two alleles 0 and 1. The choice of which allele is 0 and which is 1 is arbitrary.
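The 0/1 labeling can be sketched as follows. This is an illustration under stated assumptions: the haplotypes in `haps` are hypothetical, the function name `encode_binary` is invented here, and the rule "first allele seen becomes 0" is just one of the arbitrary choices the slide allows.

```python
def encode_binary(haplotypes):
    """Map each column's two alleles to 0/1; the first allele seen becomes 0."""
    columns = []
    for column in zip(*haplotypes):
        alleles = list(dict.fromkeys(column))  # distinct alleles, in order seen
        if len(alleles) > 2:
            raise ValueError("site is not biallelic")
        columns.append([alleles.index(base) for base in column])
    # Transpose back so that each row is one haplotype again.
    return [list(row) for row in zip(*columns)]


# Hypothetical haplotypes over three SNP sites.
haps = ["CAG", "TTG", "CAT", "CTG"]
for hap, bits in zip(haps, encode_binary(haps)):
    print(hap, bits)
```

Swapping the roles of 0 and 1 at any site gives an equally valid encoding.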
Recombination
G T T C G A C A A C A T
A C G T A T C T A T T A
Recombinant: G T T C G A C T A T T A
Recombination
The two alleles are linked, i.e., they are “traveling together”:
G T T C G A C A A C A T
A C G T A T C T A T T A
Recombination disrupts the linkage:
G T T C G A C T A T T A
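The crossover pictured on this slide can be reproduced with a one-line model. This is a simplified sketch, assuming a single crossover at a known position; `recombine` and its `breakpoint` parameter are illustrative names, not part of the lecture.

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: prefix of parent_a joined to the suffix of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("chromosomes must have equal length")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


# The two chromosomes from the slide, crossed over after the sixth base.
child = recombine("GTTCGACAACAT", "ACGTATCTATTA", 6)
print(child)
```

Alleles to the left of the breakpoint stay together, while alleles on opposite sides of it no longer "travel together".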
Emergence of Variations Over Time
[Figure: Variations in Chromosomes Within a Population; labels: Common Ancestor, Disease Mutation, present time, Linkage Disequilibrium (LD)]
Extent of Linkage Disequilibrium
[Figure: region around a Disease-Causing Mutation shown at three times: 2,000 gens. ago, 1,000 gens. ago, and Time = present]
A Data Compression Problem The Minimum Informative Subset
A Data Compression Problem • Select SNPs to use in an association study • Would like to associate single nucleotide polymorphisms (SNPs) with disease. • Very large number of candidate SNPs • Chromosome wide studies, whole genome-scans • For cost effectiveness, select only a subset. • Closely spaced SNPs are highly correlated • It is less likely that there has been a recombination between two SNPs if they are close to each other.
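One way to quantify "highly correlated" is the r² statistic between two SNPs. The slide does not name a specific measure, so using r² here is an assumption; the haplotypes in `haps` are hypothetical and `r_squared` is an illustrative name.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between SNP columns i and j of 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_j = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ij - p_i * p_j                             # disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("one of the SNPs is monomorphic")
    return d * d / denom


# Hypothetical data: SNPs 0 and 1 always agree, SNP 2 is independent of SNP 0.
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(r_squared(haps, 0, 1), r_squared(haps, 0, 2))
```

A pair with r² close to 1 carries nearly the same information, which is what makes typing only one of the two SNPs cost effective.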
Association studies
[Figure: Marker A alleles (Allele 0, Allele 1) shown for two groups: Control / Non-responder and Disease / Responder]
Marker A is associated with Phenotype
Association studies
Evaluate whether nucleotide polymorphisms associate with phenotype (one line per SNP site):
T T C T C T
A G G G G A
G A A A G G
A C A A A A
T T G T G G
Association studies
T T T C C T
G G A G A G
G A G G A A
A A C A A A
G T T T G G
Association studies
1 1 1 0 0 1
0 0 1 0 1 0
1 0 1 1 0 0
0 0 1 0 0 0
0 1 1 1 0 0
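Testing one SNP for association with a phenotype can be sketched with a Pearson chi-square statistic on a 2x2 table of allele counts. The slides do not say which test is used, so the choice of chi-square is an assumption, and the counts below are made up for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows: cases, controls. Columns: allele 0 count, allele 1 count.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: cases carry allele 1 more often than controls.
stat = chi_square_2x2(30, 70, 60, 40)
print(round(stat, 2))
```

A statistic above 3.84 (the 5% critical value for one degree of freedom) would suggest association for this single test; a genome-wide scan would additionally need a multiple-testing correction.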
D-graph of a SNP
For a SNP s we associate a bipartite graph. Nodes: the set of haplotypes. Edges: the set of pairs of haplotypes with different alleles at s.
0 0 0 1 1 1 1 0 0 1
1 1 0 0 0 s1 s2
D-graph of a set of SNPs
For a set of SNPs S we associate a graph, the union of the D-graphs of its SNPs (unlike the D-graph of a single SNP, it need not be bipartite). Nodes: the set of haplotypes. Edges: the set of pairs of haplotypes with different alleles at some SNP s in S.
0 0 0 1 1 1 1 0 0 1
1 1 0 0 0 s1 s2
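The two D-graph definitions translate directly into a set of haplotype pairs. A minimal sketch, assuming haplotypes are given as 0/1 tuples; the data in `haps` is hypothetical and `d_graph` is an illustrative name.

```python
from itertools import combinations


def d_graph(haplotypes, snps):
    """Edges (i, j), i < j: haplotype pairs that differ at some SNP in `snps`."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)
    }


# Hypothetical data: five haplotypes over two SNPs (s1, s2).
haps = [(1, 0), (1, 1), (0, 1), (0, 0), (0, 0)]
print(sorted(d_graph(haps, [0])))     # D-graph of s1 alone
print(sorted(d_graph(haps, [0, 1])))  # D-graph of the set {s1, s2}
```

The last two haplotypes are identical, so no SNP set can put an edge between them.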
SNP Selection: Red SNP is equivalent to Blue SNP
0 0 1 0 1 1 1 0 1 0 1 1 0 0 0
SNP Selection: Red SNPs predict Green SNP
0 0 1 0 1 1 1 0 1 0 1 1 0 0 0
Data Compression: Minimal Informative Subset
0 0 1 0 1 1 1 0 1 0 1 1 0 0 0
Hypothesis – Haplotype Blocks? • The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks • Patil et al., Science, 2001; • Jeffreys et al., Nature Genetics, 2001; • Daly et al., Nature Genetics, 2001
Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks
[Figure: a 200 kb region showing sense genes, DNA, antisense genes, SNPs, and haplotype blocks 1 2 3 4]
Four Gamete Block Test • Hudson and Kaplan 1985
A segment of SNPs is a block if between every pair of SNPs at most 3 out of the 4 gametes (00, 01, 10, 11) are observed.
BLOCK:
0 0 1
0 1 1
1 1 0
1 1 1
VIOLATES THE BLOCK DEFINITION:
0 0 1
0 1 1
1 1 0
1 0 1
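The block definition can be checked mechanically. The sketch below reads the slide's two examples as four haplotypes over three SNPs each, which is an assumption about the original layout; `is_four_gamete_block` is an illustrative name.

```python
from itertools import combinations


def is_four_gamete_block(haplotypes):
    """True if no pair of SNP columns shows all four gametes 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


block = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1)]
not_block = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1)]
print(is_four_gamete_block(block), is_four_gamete_block(not_block))
```

In the second matrix the first two SNPs already show 00, 01, 11 and 10, so the segment fails the test.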
Finding Recombination Hotspots: Many Possible Partitions into Blocks
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
All four gametes are present:
• Find the left-most right endpoint of any constraint and mark the site before it as a recombination site.
• Eliminate any constraints crossing that site.
• Repeat until all constraints are gone.
The final result is a minimum-size set of sites crossing all constraints.
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
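The greedy procedure above can be sketched as classic interval stabbing. Assumptions made for this illustration: each constraint is a pair of SNP indices (left, right) at which all four gametes were observed, and a recombination site is modeled as a breakpoint `b` meaning "between site b-1 and site b". The name `min_recombination_sites` and the example constraints are hypothetical.

```python
def min_recombination_sites(constraints):
    """Greedy minimum set of breakpoints crossing every (left, right) constraint.

    A breakpoint b lies between sites b-1 and b, so it crosses (left, right)
    exactly when left < b <= right.
    """
    breakpoints = []
    # Process constraints by increasing right endpoint (left-most right endpoint first).
    for left, right in sorted(constraints, key=lambda c: c[1]):
        if breakpoints and breakpoints[-1] > left:
            continue  # already crossed by the most recent breakpoint
        breakpoints.append(right)  # cut just before the right endpoint
    return breakpoints


# Hypothetical constraints on SNP indices.
print(min_recombination_sites([(0, 3), (1, 2), (2, 5), (4, 6)]))
```

Placing each cut as far right as the earliest-ending constraint allows is what lets one breakpoint eliminate as many later constraints as possible.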
Data Compression: Selecting Tagging SNPs in blocks
Haplotype blocks based on LD (method of Gabriel et al. 2002)
ACGATCGATCATGAT   A------A---TG--
GGTGATTGCATCGAT   G------G---CG--
ACGATCGGGCTTCCG   A------G---TC--
ACGATCGGCATCCCG   A------G---CC--
GGTGATTATCATGAT   G------A---TG--
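The compression shown on this slide amounts to keeping only the tag positions of each haplotype. A small sketch using the slide's five haplotypes; the 0-based tag positions 0, 7, 11, 12 are read off the dashed patterns, and `project` is an illustrative name.

```python
def project(haplotypes, tag_positions):
    """Keep only the bases at the tag SNP positions of each haplotype."""
    return ["".join(h[p] for p in tag_positions) for h in haplotypes]


haps = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tags = [0, 7, 11, 12]  # positions left visible in the dashed patterns
reduced = project(haps, tags)
print(reduced)
print(len(set(reduced)) == len(set(haps)))  # do the tags still separate all haplotypes?
```

Four of the fifteen positions are enough to tell these five haplotypes apart.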
A New Measure Informativeness
0 0 1 1 1 0 0 0 0 1 Informativeness s h1 h2
1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 Informativeness s1 s2 s3 s4 s5 I(s1,s2) = 2/4 = 1/2
1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 Informativeness s1 s2 s3 s4 s5 I({s1,s2}, s4) = 3/4
1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 Informativeness s1 s2 s3 s4 s5 I({s3,s4},{s1,s2,s5}) = 3 S={s3,s4} is a Minimal Informative Subset
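The slides give informativeness values without spelling out the formula. The sketch below assumes that I(S, t) is the fraction of edges in the D-graph of the target SNP t that are also edges in the D-graph of the SNP set S; this definition, the function names, and the haplotypes in `haps` are assumptions made for illustration.

```python
from itertools import combinations


def d_edges(haplotypes, snps):
    """Haplotype pairs (i, j), i < j, that differ at some SNP in `snps`."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)
    }


def informativeness(haplotypes, source_snps, target):
    """Assumed definition: share of the target's D-graph edges covered by the sources."""
    target_edges = d_edges(haplotypes, [target])
    if not target_edges:
        raise ValueError("target SNP is monomorphic")
    covered = target_edges & d_edges(haplotypes, source_snps)
    return len(covered) / len(target_edges)


# Hypothetical data: four haplotypes over three SNPs.
haps = [(1, 0, 0), (1, 1, 1), (0, 1, 0), (0, 0, 1)]
print(informativeness(haps, [0], 1))     # one SNP predicting another
print(informativeness(haps, [0, 1], 2))  # a pair of SNPs predicting a third
```

Under this reading, a value of 1 means the source SNPs separate every pair of haplotypes that the target SNP separates.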
Informativeness: Graph theory insight
Minimum Set Cover = Minimum Informative Subset
0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 (SNPs s1 s2 s3 s4 s5)
[Figure: bipartite graph with SNPs s1 to s5 on one side and Edges e1 to e6 on the other]
Informativeness: Graph theory insight
Minimum Set Cover {s3, s4} = Minimum Informative Subset
0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 (SNPs s1 s2 s3 s4 s5)
[Figure: bipartite graph with SNPs s1 to s5 on one side and Edges e1 to e6 on the other]
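The set-cover view can be made concrete with a brute-force search: each SNP "covers" the haplotype pairs it separates, and we look for the fewest SNPs covering every pair that the full SNP set separates. This is an exhaustive sketch for tiny inputs only, not the algorithm from the lecture (set cover is NP-hard in general); the haplotypes are hypothetical and chosen so that the answer is {s3, s4}.

```python
from itertools import combinations


def snp_edges(haplotypes, snp):
    """Haplotype pairs (i, j), i < j, separated by a single SNP."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if haplotypes[i][snp] != haplotypes[j][snp]
    }


def minimum_informative_subset(haplotypes):
    """Smallest SNP subset (0-based indices) covering all separable haplotype pairs."""
    n_snps = len(haplotypes[0])
    edge_sets = [snp_edges(haplotypes, s) for s in range(n_snps)]
    universe = set().union(*edge_sets)
    for size in range(n_snps + 1):  # try smaller subsets first
        for subset in combinations(range(n_snps), size):
            covered = set().union(*(edge_sets[s] for s in subset))
            if covered == universe:
                return subset


# Hypothetical data: four haplotypes over SNPs s1..s5.
haps = [(0, 0, 0, 0, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 1), (1, 1, 1, 1, 0)]
print(minimum_informative_subset(haps))  # indices 2 and 3 correspond to s3 and s4
```

Other subsets of the same size may also be valid covers; the search simply returns the first one it finds in index order.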
Real Haplotype Data
A region of Chr. 22, 45 Caucasian samples
• Our block-free algorithm
• Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm