Optimization Problems for Polymorphisms of Single Nucleotides
Polymorphisms
A polymorphism is a feature
- common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
E.g. think of eye color. Or blood type, for a feature not visible from outside.
At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP).

atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
- SNPs are the predominant form of human variation
- On average, one every 1,000 bases
- Used for drug design, disease studies, forensics, evolutionary studies...
- Multimillion-dollar SNP Consortium project
- 1st step: build maps of several thousand SNPs
- Goal: associate SNPs (or groups of SNPs) with genetic diseases
HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPE: chromosome content at SNP sites

Haplotypes of the 14 chromosomes (2 per individual):
ct cg | ag at | at at | ct ag | ag cg | ag ag | ag cg
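The step from aligned chromosomes to haplotypes can be sketched in a few lines of Python. This is only an illustration: the names `SEQS`, `snp_sites` and `haplotypes` are made up here, and the sketch assumes the sequences are already aligned and of equal length, so that a SNP site is simply a column in which not all sequences agree.

```python
# The 14 aligned chromosome sequences from the slides (2 per individual).
SEQS = [
    "atcggcttagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacggac",
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacgtac",
    "atcggattagttagggcacaggacgtac",
    "atcggattagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacgtac",
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacggac",
    "atcggcttagttagggcacaggacggac",
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacggac",
    "atcggcttagttagggcacaggacggac",
]


def snp_sites(seqs):
    """Return the 0-based columns where the aligned sequences are not all equal."""
    return [j for j in range(len(seqs[0])) if len({s[j] for s in seqs}) > 1]


def haplotypes(seqs):
    """Project every sequence onto the SNP sites only."""
    sites = snp_sites(seqs)
    return ["".join(s[j] for j in sites) for s in seqs]


if __name__ == "__main__":
    print(snp_sites(SEQS))
    print(" ".join(haplotypes(SEQS)))
```

On this data only two columns vary, so every chromosome reduces to a two-letter haplotype.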
GENOTYPE: "union" of 2 haplotypes

Individual  Haplotypes  Genotype (site 1, site 2)
1           ct, cg      homozygous c, heterozygous
2           ag, at      homozygous a, heterozygous
3           at, at      homozygous a, homozygous t
4           ct, ag      heterozygous, heterozygous
5           ag, cg      heterozygous, homozygous g
6           ag, ag      homozygous a, homozygous g
7           ag, cg      heterozygous, homozygous g
CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact). Call them 1 and 0. Also, call * the fact that a site is heterozygous.
HAPLOTYPE: string over 1, 0
GENOTYPE: string over 1, 0, *

Haplotype  Genotype  Haplotype
01         0*        00
10         1*        11
11         11        11
01         **        10
10         *0        00
10         10        10
10         *0        00
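The "union" of two haplotypes can be written as a one-line rule: keep the symbol where the two haplotypes agree, and write * where they differ. The following is a minimal sketch, assuming haplotypes are given as equal-length strings over the characters '0' and '1'; the names `genotype` and `PAIRS` are illustrative only.

```python
def genotype(h1, h2):
    """Combine two haplotypes into a genotype over '0', '1', '*'.

    A site keeps its value when both haplotypes agree (homozygous)
    and becomes '*' when they differ (heterozygous).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


# The seven haplotype pairs of the running example, recoded over 0/1.
PAIRS = [("01", "00"), ("10", "11"), ("11", "11"), ("01", "10"),
         ("10", "00"), ("10", "10"), ("10", "00")]

if __name__ == "__main__":
    for h1, h2 in PAIRS:
        print(h1, genotype(h1, h2), h2)
```

The map is not invertible: a genotype with two or more * sites is explained by more than one pair of haplotypes, which is the ambiguity the population problem has to resolve.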
THE HAPLOTYPING PROBLEM
Single individual: given genomic data of one individual, determine 2 haplotypes (one per chromosome).
Population: given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome per individual).
For the individual problem, the input is erroneous haplotype data, from sequencing.
For the population problem, the input is ambiguous genotype data, from screening.
The objective is led by Occam's razor: find the minimum explanation of the observed data under a given hypothesis (a.k.a. the parsimony principle).
Theory and Results
Single individual
- Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)
- Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02)
- NP-hardness for general gapped haplotyping (LBILS 01)
Population
- APX-hardness (Gusfield 00)
- Reduction to graph-theoretic model and I.P. approach (Gusfield 01)
- New formulations and disease detection (L, Ravi, Rizzi 02)
- Exact algorithms for min-size solution (L, Serafini 2011)
- Heuristics (Tininini, L, Bertolazzi 2010)
Shotgun Assembly of a Chromosome [Weber and Myers, 1997]
Chromosome: ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
Fragmentation: ACTGA GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA
Sequencing: TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA
Assembly: ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
MAIN ERROR SOURCES
- Sequencing errors:
  ACTGCCTGGCCAATGGAACGGACAAG
  CTGGCCAAT CATTGGAAC AATGGAACGGA
- Contaminants
Given errors, the data may be inconsistent with exactly 2 haplotypes. Hence, the assembler is unable to build 2 chromosomes.
PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes.
The data: a SNP matrix
[Figure: five sequenced fragments (ACTGAAAGCGAACTAGAGACAGCATG, ACTGATAGCGTAGAGTCA, ACTGTCGACTAGACATG, ACTGACGATCCATCGTCAGC, ACTGAAAATCGATCAGCATG), each reduced to its 1/0 values at the SNP sites it covers]
SNPs 1,..,n (columns), Fragments 1,..,m (rows)

     1 2 3 4 5 6 7 8 9
  1  - - - 0 1 1 0 0 -
  2  - 0 - 0 1 - - - 1
  3  1 1 0 1 1 - - - -
  4  0 0 1 - - - - 0 -
  5  - - - - - - - 1 0
  6  - - - - 0 0 0 1 -

Fragment conflict: two fragments that disagree at a SNP they both cover can't be on the same haplotype.
Fragment Conflict Graph GF(M): one node per fragment (1..6), one edge per fragment conflict.
We have 2 haplotypes iff GF is BIPARTITE.
PROBLEM (Fragment Removal): remove fragments to make GF bipartite.

Example: after removing fragment 6, the remaining rows split into two groups, each giving one haplotype.

     1 2 3 4 5 6 7 8 9
  1  - - - 0 1 1 0 0 -
  2  - 0 - 0 1 - - - 1
  4  0 0 1 - - - - 0 -
     0 0 1 0 1 1 0 0 1   (haplotype from fragments 1, 2, 4)

  3  1 1 0 1 1 - - - -
  5  - - - - - - - 1 0
     1 1 0 1 1 - - 1 0   (haplotype from fragments 3, 5)
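The conflict graph and the fragment removal problem can be checked on the 6 x 9 example with a short Python sketch. Everything named here (`M`, `conflict_graph`, `two_color`, `min_fragment_removal`, `consensus`) is illustrative, and the removal search is plain brute force over subsets, so it is exponential and only meant for toy instances, not a stand-in for any algorithm from the slides.

```python
from itertools import combinations

# The example SNP matrix: fragment id -> row, with '-' for uncovered SNPs.
M = {
    1: "---01100-",
    2: "-0-01---1",
    3: "11011----",
    4: "001----0-",
    5: "-------10",
    6: "----0001-",
}


def conflict(r, s):
    """Two rows conflict if they differ at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))


def conflict_graph(m):
    """Edge set of GF(M), as pairs (i, j) with i < j."""
    return {(i, j) for i, j in combinations(sorted(m), 2) if conflict(m[i], m[j])}


def two_color(nodes, edges):
    """Return a node -> 0/1 colouring if the graph is bipartite, else None."""
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    color = {}
    for start in nodes:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in color:
                    color[w] = 1 - color[v]
                    stack.append(w)
                elif color[w] == color[v]:
                    return None  # odd cycle found
    return color


def min_fragment_removal(m):
    """Brute force: smallest set of rows whose removal makes GF bipartite."""
    rows = sorted(m)
    for size in range(len(rows) + 1):
        for removed in combinations(rows, size):
            kept = {i: m[i] for i in rows if i not in removed}
            if two_color(sorted(kept), conflict_graph(kept)) is not None:
                return set(removed)


def consensus(m, rows):
    """Merge mutually compatible rows into one haplotype string."""
    out = []
    for col in zip(*(m[i] for i in rows)):
        vals = {c for c in col if c != "-"}
        out.append(vals.pop() if vals else "-")
    return "".join(out)


if __name__ == "__main__":
    print(sorted(conflict_graph(M)))
    print(min_fragment_removal(M))
    print(consensus(M, [1, 2, 4]), consensus(M, [3, 5]))
```

On this matrix fragments 1, 3 and 6 conflict pairwise, so GF is not bipartite and at least one fragment must go. One removal is enough; fragment 6, as in the example, is one optimal choice, and the brute-force search may return another one of the same size.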
Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph:
- NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]
- O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]
- not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]
Are there cases of M for which GF(M) is easier? YES: the gapless M.

---0110010101001---   gapless
---01100---01001---   1 gap
---011--10----01---   2 gaps
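The notion of a gap in the three example rows can be made precise with a small helper. This is a sketch under the assumption that a row is a string over '0', '1', '-' and that a gap is a maximal run of '-' strictly between the first and last covered SNP; `count_gaps` is a hypothetical name.

```python
import re


def count_gaps(row):
    """Count maximal runs of '-' between the first and last covered SNP."""
    core = row.strip("-")  # leading and trailing holes are not gaps
    return len(re.findall(r"-+", core))


if __name__ == "__main__":
    for row in ("---0110010101001---",
                "---01100---01001---",
                "---011--10----01---"):
        print(row, count_gaps(row))
```

A matrix is gapless when `count_gaps` is 0 for every row.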
Why gaps?
- Sequencing errors (don't call with low confidence): ---0011?11--- ===> ---0011-11---
- Celera's mate pairs: two reads taken from the two ends of the same fragment, with the stretch in between left unsequenced
  attcgttgtagtggtagcctaaatgtcggtagaccttga
THEOREM: For a gapless M, the Min Fragment Removal problem is polynomial.
NOTE: M does not need to be gapless. It is enough that its columns can be sorted so that it becomes gapless (Consecutive Ones Property, Booth and Lueker, 1976).
An O(nm + n^3) D.P. algorithm

     1 2 3 4 5 6 7 8 9
  1  - 0 0 1 1 0 0 - -
  2  - - 1 0 1 1 0 - -
  3  - - - 1 1 0 - - -
  4  - - - - 0 0 1 0 -
  5  - - - - - 1 0 1 0

LFT(i), RGT(i): leftmost and rightmost SNP covered by row i. Sort the rows according to LFT.

D(i; h, k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

D(i; h, k) = D(i-1; h, k)       if i, k compatible and RGT(i) <= RGT(k), or i, h compatible and RGT(i) <= RGT(h)
D(i; h, k) = 1 + D(i-1; h, k)   otherwise

OPT is min over h, k of D(n; h, k) and can be found in time O(nm + n^3).
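The ingredients the recurrence relies on (LFT, RGT, the sort by LFT, and the compatibility test) can be sketched in Python on the 5 x 9 example. This is not the D.P. itself, only its building blocks; the helper names `lft`, `rgt`, `is_gapless` and `compatible` are assumptions made for this illustration, and columns are numbered from 1 as on the slide.

```python
# The gapless example matrix, one string per fragment ('-' = not covered).
ROWS = [
    "-001100--",
    "--10110--",
    "---110---",
    "----0010-",
    "-----1010",
]


def lft(row):
    """1-based index of the leftmost covered SNP."""
    return len(row) - len(row.lstrip("-")) + 1


def rgt(row):
    """1-based index of the rightmost covered SNP."""
    return len(row.rstrip("-"))


def is_gapless(row):
    """True if every SNP between LFT and RGT is covered."""
    return "-" not in row[lft(row) - 1:rgt(row)]


def compatible(r, s):
    """True if the two rows agree on every SNP that both cover."""
    return all(a == b for a, b in zip(r, s) if a != "-" and b != "-")


if __name__ == "__main__":
    ordered = sorted(ROWS, key=lft)  # the sort step of the algorithm
    for row in ordered:
        print(row, lft(row), rgt(row), is_gapless(row))
```

Because the rows are gapless and sorted by LFT, whether a new row overlaps the rows already placed in a haplotype depends only on how far right they reach, which is why the D.P. state tracks RGT(h) and RGT(k).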
WITH GAPS...
Th: NP-hard if 2 gaps per fragment.
Proof (simple): use the fact that for every G there is M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (in each row, at most 3 non-gap entries, hence at most 2 gaps).
Th: NP-hard even if 1 gap per fragment.
Proof: technical; reduction from MAX2SAT.
But gaps must be long for the problem to be difficult. We have an O(2^(2L) mn + 2^(3L) n^3) D.P. for MFR on a matrix with total gap length L.
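The fact that every graph G is the conflict graph of some matrix M can be illustrated with one standard construction, assumed here because it matches the "at most 3 entries per row" remark: use one column per edge, put 0 and 1 in the rows of its two endpoints, and leave every other row uncovered. The function names below are hypothetical, and nodes are numbered from 0.

```python
from itertools import combinations


def matrix_from_graph(n_nodes, edges):
    """Build M with one row per node and one column per edge.

    The two endpoints of an edge get '0' and '1' in its column, so they
    conflict; non-adjacent nodes never share a covered column.
    """
    rows = [["-"] * len(edges) for _ in range(n_nodes)]
    for col, (u, v) in enumerate(edges):
        rows[u][col] = "0"
        rows[v][col] = "1"
    return ["".join(r) for r in rows]


def conflict_graph(rows):
    """Edge set of GF(M) for a list of row strings, as pairs (i, j), i < j."""
    def conflict(r, s):
        return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))
    return {(i, j) for i, j in combinations(range(len(rows)), 2)
            if conflict(rows[i], rows[j])}


if __name__ == "__main__":
    # K4 is a small 3-regular graph.
    edges = list(combinations(range(4), 2))
    m = matrix_from_graph(4, edges)
    print(m)
    print(conflict_graph(m) == set(edges))
```

In a 3-regular graph every node lies on exactly 3 edges, so every row has 3 covered entries and therefore at most 2 gaps.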
What about MFR with gaps? Why not ILP...
[Figure: small example graphs on numbered nodes, annotated with the fractional values 1/2, 1/3, 1/4 and 5/12]