SNP and Haplotype Analysis Algorithms and Applications Eran Halperin International Computer Science Institute Berkeley, California CPM 2006
“Computational Genetics”
The Human Genome Project “What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.” “But our work previously has shown… that having one genetic code is important, but it's not all that useful.” (referring to comparative genomics). “I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…” Washington, DC, June 26, 2000.
Individually Tailored Medicine People react to different drugs in different ways. The vision: a simple DNA test would help to determine which medicine to prescribe.
The HapMap Project: an international consortium that aims to genotype the genomes of 270 individuals from four different populations. • Launched in 2002. • The first phase was completed in October 2005 (Nature, 2005).
Motivation Complex disease = Genetic Factors (50%) + Environmental Factors (50%). Multiple genes may affect the disease. Therefore, the effect of any single gene may be negligible.
Disease Association Studies: The search for genetic factors • Comparing the DNA contents of two populations: • Cases - individuals carrying the disease. • Controls - background population. A significant discrepancy between the two populations is evidence for a causal gene.
Where should we look? SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic (only two letters appear). [Slide annotation: Associated SNP]
Cases:
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Controls:
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
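The search for the associated SNP on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: it treats every sequence as a single chromosome, skips columns that are not bi-allelic, and ranks the remaining columns by a plain 2x2 chi-square statistic. The names `CASES`, `CONTROLS`, `chi_square_2x2` and `scan` are made up for this example.

```python
# Sequences copied from the slide; each one is treated as one chromosome (assumption).
CASES = [
    "AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC",
    "AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC",
    "AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC",
    "AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC",
]
CONTROLS = [
    "AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC",
    "AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC",
    "AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC",
    "AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC",
]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def scan(cases, controls):
    """Return (statistic, position, alleles) for every bi-allelic column, best first."""
    results = []
    for pos in range(len(cases[0])):
        alleles = sorted({s[pos] for s in cases + controls})
        if len(alleles) != 2:
            continue  # monomorphic or not bi-allelic: nothing to test
        ref = alleles[0]
        a = sum(s[pos] == ref for s in cases)
        c = sum(s[pos] == ref for s in controls)
        stat = chi_square_2x2(a, len(cases) - a, c, len(controls) - c)
        results.append((stat, pos, "".join(alleles)))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    for stat, pos, alleles in scan(CASES, CONTROLS)[:3]:
        print(f"position {pos} ({alleles}): chi-square = {stat:.2f}")
```

With only 16 chromosomes the statistic is purely illustrative; a real study would also correct for the number of SNPs tested.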
Genotyping Technology • Extracting the allele information for a SNP from a DNA sample. • Considerable reductions in genotyping costs over the last couple of years. • Current cost allows for the genotyping of 500,000 SNPs for ~$1000, i.e. roughly 0.2 cents per SNP (compared to ~50 cents per SNP 3-4 years ago).
Computational Challenges
Haplotypes • SNPs in physical proximity are correlated. • A sequence of alleles along a chromosome is called a haplotype.
Haplotype Block Structure (Daly et al., 2001) Block 6 from Chromosome 5q31
Haplotypes as Proxies for Rare SNPs Common haplotypes: • 011000111 (23% of population) • 000001111 (55% of population) • 111111111 (14% of population) Tag SNPs: 000, 001, 111
Tag SNP Selection • Input: a set of genotypes • Goal: find a set of t tag SNPs such that using these SNPs only, the error rate for the prediction of all other SNPs is minimized. Formulation by [H., Kimmel, Shamir, 05’] (STAMPA)
Correlations between SNPs. Tag SNPs. [Figure: a long block of aligned DNA sequences for cases and controls, with the tag SNP positions marked.]
Basic Assumption Given two SNPs j and k, the probabilities of the values at any intermediate SNPs do not change if we know the values of additional distal ones. [Figure: SNP j and SNP k flanking the intermediate SNPs.]
STAMPA (Selection of TAg SNPs to Maximize Prediction Accuracy) [Figure: a test genotype, with SNP j and SNP k flanking the intermediate SNPs.]
1. Put aside one test genotype. Use the rest of the data to develop a majority rule for each pair of SNPs j and k that predicts the values of the intermediate SNPs.
2. The prediction error, averaged over all test genotypes, gives a score to the pair j and k.
3. Apply dynamic programming to obtain the best set of tag SNPs.
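The three steps can be illustrated with a heavily simplified sketch. It is not the published STAMPA algorithm: it works on haplotype strings rather than genotypes, counts errors instead of averaging them, and forces the first and last SNP to be tags so that every other SNP lies between two consecutive tags. All of these are assumptions made for brevity, and `pair_error` and `select_tags` are hypothetical names.

```python
from collections import Counter
from functools import lru_cache


def pair_error(rows, j, k):
    """Leave-one-out errors when SNPs strictly between j and k are predicted
    by majority vote among the other rows that agree with the test row at j and k."""
    errors = 0
    for t, test in enumerate(rows):
        train = rows[:t] + rows[t + 1:]
        # Fall back to all training rows if none matches at both tags.
        matches = [r for r in train if r[j] == test[j] and r[k] == test[k]] or train
        for s in range(j + 1, k):
            guess = Counter(r[s] for r in matches).most_common(1)[0][0]
            errors += guess != test[s]
    return errors


def select_tags(rows, t):
    """Dynamic program over tag positions: returns (total_errors, tags)
    for the best t tags that include the first and the last SNP (t >= 2)."""
    n = len(rows[0])

    @lru_cache(maxsize=None)
    def best(i, remaining):
        # i is already a tag; `remaining` tags must still be placed to its right.
        if remaining == 0:
            return (0 if i == n - 1 else float("inf")), (i,)
        options = [(float("inf"), (i,))]
        for k in range(i + 1, n):
            cost, tags = best(k, remaining - 1)
            options.append((pair_error(rows, i, k) + cost, (i,) + tags))
        return min(options)

    return best(0, t - 1)


if __name__ == "__main__":
    # Hypothetical sample built from three haplotype patterns; the copy counts are invented.
    rows = ["011000111"] * 2 + ["000001111"] * 3 + ["111111111"] * 2
    print(select_tags(rows, 3))
```

The score of a set of tags decomposes over consecutive tag pairs, which is what makes the dynamic program possible; that decomposition relies on the basic assumption about distal SNPs.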
Comparison: STAMPA vs. ldSelect on 52 sets of Yoruba genotypes (Gabriel et al., 2002). [Plot: x marks STAMPA; a second marker denotes ldSelect.]
The haplotype ancestral structure of two subtypes of NHL. The trees are automatically generated by HAP (H., Eskin, 04’).
Phasing Haplotypes • Cost-effective genotyping technology gives genotypes and not haplotypes.
Genotype (mother chromosome / father chromosome): A {T,G} {C,A} C G {C,A}
Possible phases: ATACGA / AGCCGC, AGACGA / ATCCGC, ATCCGA / AGACGC, …
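To make the ambiguity concrete, the possible phases of a genotype can be enumerated directly. The sketch below is only an illustration under an assumed encoding: each site is a string holding one letter (homozygous) or two letters (heterozygous), and `possible_phases` is a hypothetical helper name.

```python
from itertools import product


def possible_phases(sites):
    """List all unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, s in enumerate(sites) if len(set(s)) == 2]
    phases = []
    # Fix the orientation of the first heterozygous site so that each
    # unordered pair is generated exactly once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + bits))
        h1 = "".join(s[choice.get(i, 0)] for i, s in enumerate(sites))
        h2 = "".join(s[1 - choice[i]] if i in choice else s[0]
                     for i, s in enumerate(sites))
        phases.append((h1, h2))
    return phases


if __name__ == "__main__":
    # The genotype from the slide: A {T,G} {C,A} C G {A,C}
    for h1, h2 in possible_phases(["A", "TG", "CA", "C", "G", "AC"]):
        print(h1, h2)
```

A genotype with h heterozygous sites has 2^(h-1) possible phases, which is why phasing needs a model of the population rather than enumeration alone.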
Public Genotype Data Growth (timeline axis: 2001 to 2006)
• Daly et al., Nature Genetics: 103 SNPs, 40,000 genotypes
• Gabriel et al., Science: 3000 SNPs, 400,000 genotypes
• TSC Data, Nucleic Acids Research: 35,000 SNPs, 4,500,000 genotypes
• Perlegen Data, Science: 1,570,000 SNPs, 100,000,000 genotypes
• NCBI dbSNP, Genome Research: 3,000,000 SNPs, 286,000,000 genotypes
• HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes
- HAP’s speed allows it to phase whole-genome datasets.
- HAP is very accurate (Marchini et al., 2006).
HAP Phasing Model • A directed phylogenetic tree. • {0,1} alphabet. • Each site mutates at most once. • No recombination. • Goal: find a phase that fits the tree model. Formulation: [Gusfield, 2003]
Tree (each edge is labeled by the site that mutates): 00000 -2-> 01000; 01000 -1-> 11000; 01000 -5-> 01001; 11000 -3-> 11100; 11100 -4-> 11110.
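Example
Tree (each edge is labeled by the site that mutates): 00000 -2-> 01000; 01000 -1-> 11000; 01000 -5-> 01001; 11000 -3-> 11100; 01001 -4-> 01011.
Genotypes: 02022, 22200, 21222, 21200, 02000, 01022
Haplotypes: 00000, 01000, 11100, 01011
Given the tree and the haplotypes, the phase is unique.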
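Both claims in this example can be checked mechanically. The sketch below assumes that 2 marks a heterozygous site and that the root is the all-zero haplotype; it uses the standard pairwise condition for a directed perfect phylogeny (no two sites show all three of the patterns 01, 10 and 11). It is a sanity check, not HAP itself, and the function names are illustrative.

```python
from itertools import combinations, combinations_with_replacement

HAPLOTYPES = ["00000", "01000", "11100", "01011"]
GENOTYPES = ["02022", "22200", "21222", "21200", "02000", "01022"]


def fits_directed_tree(haps):
    """True if the haplotypes fit a tree rooted at 00...0 in which every site
    mutates at most once and there is no recombination."""
    for i, j in combinations(range(len(haps[0])), 2):
        seen = {(h[i], h[j]) for h in haps}
        if {("0", "1"), ("1", "0"), ("1", "1")} <= seen:
            return False
    return True


def genotype_of(h1, h2):
    """Combine two haplotypes into a genotype; 2 marks a heterozygous site."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def phases(genotype, haps):
    """All unordered pairs of known haplotypes that explain the genotype."""
    return [(h1, h2) for h1, h2 in combinations_with_replacement(haps, 2)
            if genotype_of(h1, h2) == genotype]


if __name__ == "__main__":
    print("fits tree model:", fits_directed_tree(HAPLOTYPES))
    for g in GENOTYPES:
        print(g, phases(g, HAPLOTYPES))
```

Each of the six genotypes is explained by exactly one pair of the four haplotypes, which matches the statement on the slide.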
Phasing via Greedy • A simple heuristic: • Find a haplotype that is compatible with as many genotypes as possible. • Assign this haplotype to those genotypes. • Continue with the rest of the genotypes. • Intuition: Haplotypes with missing data.
Haplotypes with missing data. Goal: Find a maximum likelihood phase.
Input      Output
111*11*1   11111111
00*01*1*   00001111
01*000*0   01000010
11*11*11   11111111
*111**00   11110000
1111*11*   11111111
01*00010   01000010
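The greedy heuristic can be sketched for this missing-data setting as follows. This is a brute-force illustration that enumerates all 2^n complete haplotypes, so it only makes sense for short inputs; ties are broken by enumeration order, which is an arbitrary choice and need not reproduce the exact output shown on the slide. The names `compatible` and `greedy_complete` are hypothetical.

```python
from itertools import product


def compatible(hap, row):
    """A complete haplotype is compatible with a row if it agrees on every known site."""
    return all(r == "*" or r == h for h, r in zip(hap, row))


def greedy_complete(rows):
    """Repeatedly pick the complete haplotype compatible with the most
    unresolved rows and assign it to all of them."""
    n = len(rows[0])
    candidates = ["".join(bits) for bits in product("01", repeat=n)]
    result = [None] * len(rows)
    remaining = set(range(len(rows)))
    while remaining:
        best = max(candidates,
                   key=lambda h: sum(compatible(h, rows[i]) for i in remaining))
        covered = {i for i in remaining if compatible(best, rows[i])}
        for i in covered:
            result[i] = best
        remaining -= covered
    return result


if __name__ == "__main__":
    rows = ["111*11*1", "00*01*1*", "01*000*0", "11*11*11",
            "*111**00", "1111*11*", "01*00010"]
    for row, hap in zip(rows, greedy_complete(rows)):
        print(row, "->", hap)
```

On the slide's input the heuristic merges rows 1, 4 and 6 into 11111111 and rows 3 and 7 into one haplotype, giving four distinct haplotypes, the same number as in the slide's output.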
Greedy Analysis (H., Karp, 2005) • Maximum likelihood == minimum entropy solution. • Entropy(Greedy) < Entropy(OPT) + 3. • Can be viewed as a variant of set cover.
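The link between likelihood and entropy can be seen numerically. The sketch below assumes that the likelihood of a solution is evaluated under the empirical haplotype frequencies of that same solution; under that assumption the log-likelihood equals minus N times the entropy, so maximizing one minimizes the other. The function names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(haplotypes):
    """Shannon entropy (in bits) of the empirical haplotype distribution."""
    n = len(haplotypes)
    return -sum(c / n * log2(c / n) for c in Counter(haplotypes).values())


def log_likelihood(haplotypes):
    """Log2-likelihood of the sample under its own empirical frequencies."""
    n = len(haplotypes)
    return sum(c * log2(c / n) for c in Counter(haplotypes).values())


if __name__ == "__main__":
    # The completed haplotypes from the "Haplotypes with missing data" slide.
    solution = ["11111111", "00001111", "01000010", "11111111",
                "11110000", "11111111", "01000010"]
    print(f"entropy        = {entropy(solution):.4f} bits")
    print(f"log-likelihood = {log_likelihood(solution):.4f}")
```

A solution that reuses few haplotypes many times has low entropy, which is why the greedy rule of covering as many rows as possible with one haplotype is a natural heuristic.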
Mother, Father, Child Trios • Advantages: • Better phasing results (Marchini et al., 06’). • Robustness to population stratification (Spielman et al., 93’). • Disadvantage: • 50% more expensive (and thus reduces power).
Inferring Haplotypes From Trios. Assumption: No recombination.
Genotypes: Parent 1: 122112, Parent 2: 210022, Child: 120222
Haplotype pairs as the inference proceeds (? = not yet resolved):
Parent 1: 1??11? / 1??11? -> 10?11? / 11?11? -> 10011? / 11111?
Parent 2: ?100?? / ?100?? -> 1100?? / 0100?? -> 11000? / 01001?
Child: 1?0??? / 1?0??? -> 100??? / 110??? -> 10011? / 11000?
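The inference on this slide can be reproduced site by site. The sketch below assumes the encoding used in the example (0 and 1 are homozygous, 2 is heterozygous) and no recombination, so each parent passes one whole haplotype to the child. For every site it enumerates the allele assignments that are consistent with all three genotypes and reports an allele only when all of them agree. `resolve_trio` is a hypothetical name.

```python
from itertools import product


def genotype(a, b):
    """Genotype code of two alleles: 0 or 1 if homozygous, 2 if heterozygous."""
    return str(a) if a == b else "2"


def resolve_trio(p1, p2, child):
    """Return (p1 transmitted, p1 untransmitted, p2 transmitted, p2 untransmitted),
    with '?' at sites that the trio genotypes do not resolve."""
    columns = []
    for g1, g2, gc in zip(p1, p2, child):
        options = [
            (t1, u1, t2, u2)
            for t1, u1, t2, u2 in product((0, 1), repeat=4)
            if genotype(t1, u1) == g1
            and genotype(t2, u2) == g2
            and genotype(t1, t2) == gc  # the child carries both transmitted alleles
        ]
        if not options:
            raise ValueError("genotypes are not consistent with Mendelian inheritance")
        columns.append([
            str(options[0][k]) if len({o[k] for o in options}) == 1 else "?"
            for k in range(4)
        ])
    return tuple("".join(col[k] for col in columns) for k in range(4))


if __name__ == "__main__":
    print(resolve_trio("122112", "210022", "120222"))
```

Only sites where all three individuals are heterozygous remain unresolved, which is the last site in this example.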
Genotyping Trios via DNA pools [Beckman, Abel, Braun, H.] [Figure: M (mother), F (father), C (child).]
Configuration                              1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Mother transmitted allele                  A  A  A  A  A  A  A  A  G  G  G  G  G  G  G  G
Mother untransmitted allele                A  A  A  A  G  G  G  G  A  A  A  A  G  G  G  G
Father transmitted allele                  A  A  G  G  A  A  G  G  A  A  G  G  A  A  G  G
Father untransmitted allele                A  G  A  G  A  G  A  G  A  G  A  G  A  G  A  G
Father and Child pool, allele frequency    0  1  2  3  0  1  2  3  1  2  3  4  1  2  3  4
Mother and Child pool, allele frequency    0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4
• Every configuration has a different pair of values.
• Except for configurations 7 and 10 (het-het-het).
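The table can be regenerated, and the claim about configurations 7 and 10 verified, with a short enumeration. The sketch assumes that each pool value is the number of G alleles among the four chromosomes in the pool (two from the parent, two from the child), which is consistent with the numbers in the table. `pool_signatures` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product


def pool_signatures():
    """Map each (father+child, mother+child) pair of G counts to the
    configurations (numbered 1 to 16 as in the table) that produce it."""
    table = defaultdict(list)
    # Order of the tuple: mother transmitted, mother untransmitted,
    # father transmitted, father untransmitted (last one varies fastest).
    for idx, (mt, mu, ft, fu) in enumerate(product("AG", repeat=4), start=1):
        child = (mt, ft)
        father_child = [ft, fu, *child].count("G")
        mother_child = [mt, mu, *child].count("G")
        table[(father_child, mother_child)].append(idx)
    return table


if __name__ == "__main__":
    for signature, configs in sorted(pool_signatures().items()):
        print(signature, configs)
```

Two pooled measurements therefore identify the trio configuration in every case except when mother, father and child are all heterozygous.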
Genotyping Unrelated Individuals • Edge size corresponds to pool size (accuracy). • Vertex degree corresponds to the amount of DNA used.
An algebraic view
For every m, what is the largest n, so that m equations uniquely determine the n {0,1,2} variables? For every m, what is the largest n for which $\exists A \in \{0,1\}^{m \times n}$ s.t. $\forall x, x' \in \{0,1,2\}^n$, $Ax = Ax' \Rightarrow x = x'$?
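For very small m the question can be answered by exhaustive search, which helps to make the definition concrete. The sketch below is a brute-force illustration only: it tries every 0/1 matrix and every vector in {0,1,2}^n, so it is feasible only for tiny m and n. The search stops at the first n for which no matrix works, which is justified because deleting a column of a valid matrix gives a valid matrix for n - 1. The names `is_detecting`, `largest_n` and the cap `n_max` are assumptions of this example.

```python
from itertools import product


def is_detecting(A, n):
    """True if x -> Ax is injective on {0,1,2}^n."""
    seen = set()
    for x in product((0, 1, 2), repeat=n):
        y = tuple(sum(a * v for a, v in zip(row, x)) for row in A)
        if y in seen:
            return False
        seen.add(y)
    return True


def largest_n(m, n_max=4):
    """Largest n <= n_max for which some m x n 0/1 matrix determines x uniquely."""
    best = 0
    for n in range(1, n_max + 1):
        found = any(
            is_detecting([bits[r * n:(r + 1) * n] for r in range(m)], n)
            for bits in product((0, 1), repeat=m * n)
        )
        if not found:
            break
        best = n
    return best


if __name__ == "__main__":
    for m in (1, 2):
        print(f"m = {m}: largest n = {largest_n(m)}")
```

For m = 1 and m = 2 pooling gives no saving over measuring each variable separately; the advantage promised by the asymptotic bounds only appears for larger m.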
Lower Bound • A random matrix A. • For every $x \in \{-2,-1,0,1,2\}^n$ (the difference of two candidate solutions), $A_i x = 0$ with prob. $O(k^{-0.5})$, where k is the number of non-zero elements of x. • Since the rows are independent, the probability that $Ax = 0$ is $O(k^{-m/2})$. • Using the union bound, $n = \Omega(m \log m)$.
Upper Bound • Counting argument: • Each of the m entries of $Ax$ is an integer between 0 and 2n, so there are at most $(2n+1)^m$ different values that $Ax$ can take. • There are $3^n$ values for x. • Hence $3^n \le (2n+1)^m$, and so $n = O(m \log m)$.
Further Challenges • Population stratification • In case/control studies and in family based studies. • Admixed populations. • Other pooling schemes • Practical considerations: error rates, missing data, scalability, etc. • Inferring evolutionary processes (e.g. selection, recombination rate, haplotype ancestry, etc.).
Summary • Exciting times in genetics: changes in medicine may be felt in our lifetime. • An opportunity for Computer Scientists to have a huge impact. • Interdisciplinary work is needed. It involves computer science, statistics, genetics, biology, and medicine.
Acknowledgement • UCSD: Eleazar Eskin • Tel-Aviv U.: Ron Shamir, Gad Kimmel, Noga Alon • HIIT: Matti Kaariainen • Sequenom Inc.: Andreas Braun, Ken Abel • Perlegen Sciences: David Hinds, David Cox • UC Berkeley: Richard Karp, Chris Skibola • MPI: Rene Beier • CHORI: Kenny Beckman
Thank you for your attention!!!