Outline
• Shotgun assembly
• Haploid separation and errors
• Formalizing the problem
• Gaps and gapless
• Polynomial and NP-complete cases

Main result in a nutshell: gapless problems are polynomial; with gaps they are NP-complete.
Shotgun Assembly [Weber and Myers, 1997]

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

fragmentation:
ACTGA GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA

sequencing:
TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA

assembly:
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
2 chromosomes instead of one consensus:

ACTGAAAGCGATCGATCGACTAGAGACAGATAG
ACTGATAGCGATCCATCGATTAGAGTCAGATAG

Fragments: ATTAGAGTCA GAAAGC TAGAGTCAGA CTAGAGAC ATCGATAG ATCCAT

• How do we find both chromosomes from the isolated fragments?
• We use SNPs (Single Nucleotide Polymorphisms)
Haplotype 1: actgaAagcgatcGatcgaCtagagA_______
Haplotype 2: actgaTagcgatcCatcgaGtagagT_______

Genotype: this individual is (AGCA, TCGT)

SNP frequency: about 1 every 1,000 bp

SNPs are useful for:
- disease association
- forensics
- assembly of genomes
- evolutionary studies
ERROR SOURCES

Sequencing errors:
ACTGCCTGGCCAATGGAACGGACAAG
Fragments: CTGGCCAAT CATTGGAAC AATGGAACGGA

Paralogous regions:
ACAAACCCTTTGGGACT … CTAGTAAACCCTATGGGGA
Fragments: AAACCCTT TAAACCCT CTATGGGA CCTATGG CTTTGGGACT ACCCTATGGG
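The data: a SNP matrix

Fragments as aligned (spaces separate the sequenced stretches):
ACTGAAAGCGA ACTAGAGACAGCATG
ACTGATAGC GTAGAGTCA
ACTG TCGACTAGA CATG
ACTGA CGATCCATCG TCAGC
ACTGAAA ATCGATC AGCATG

The same fragments with the stretches joined:
ACTGAAAGCGAACTAGAGACAGCATG
ACTGATAGCGTAGAGTCA
ACTGTCGACTAGACATG
ACTGACGATCCATCGTCAGC
ACTGAAAATCGATCAGCATG

[Figure: the SNP positions of the fragments recoded as X/O entries: XXO OOX X XX X O]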
SNPs 1,..,n (columns), Fragments 1,..,m (rows)

     1 2 3 4 5 6 7 8 9
1    - - - O X X O O -
2    - O - O X - - - X
3    X X O X X - - - -
4    O O X - - - - O -
5    - - - - - - - X O
6    - - - - O O O X -

Fragment conflict: two fragments that disagree on a SNP can't be on the same haplotype.

Fragment Conflict Graph GF(M): one node per fragment (here 1, ..., 6), one edge per fragment conflict.

We have 2 haplotypes iff GF(M) is BIPARTITE.
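The construction on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the matrix is the 6 x 9 example above written as strings, and the function names (`fragments_conflict`, `fragment_conflict_graph`, `is_bipartite`) are made up for this sketch, not taken from the talk.

```python
from collections import deque
from itertools import combinations

# SNP matrix from the slide: one string per fragment, one character per SNP.
M = [
    "---OXXOO-",
    "-O-OX---X",
    "XXOXX----",
    "OOX----O-",
    "-------XO",
    "----OOOX-",
]

def fragments_conflict(a, b):
    """True if the two rows disagree at a SNP where both are called."""
    return any(x != "-" and y != "-" and x != y for x, y in zip(a, b))

def fragment_conflict_graph(matrix):
    """Edges (1-based fragment numbers) between conflicting fragments."""
    return [(i + 1, j + 1)
            for i, j in combinations(range(len(matrix)), 2)
            if fragments_conflict(matrix[i], matrix[j])]

def is_bipartite(nodes, edges):
    """2-colour the graph by BFS; fail on an edge whose ends share a colour."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    colour = {}
    for start in nodes:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False  # odd cycle
    return True

edges = fragment_conflict_graph(M)
print(edges)
print(is_bipartite(range(1, len(M) + 1), edges))
```

For this matrix the graph contains the triangle 1-3-6, so it is not bipartite and the fragments cannot all be split into two haplotypes.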
PROBLEM (Fragment Removal): remove fragments to make GF(M) bipartite

Example: with fragment 6 removed, the remaining fragments split into two groups, one per haplotype.

     1 2 3 4 5 6 7 8 9
1    - - - O X X O O -
2    - O - O X - - - X
4    O O X - - - - O -
     O O X O X X O O X    (haplotype from fragments 1, 2, 4)

3    X X O X X - - - -
5    - - - - - - - X O
     X X O X X - - X O    (haplotype from fragments 3, 5)
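A brute-force sketch of fragment removal on the slide's matrix follows. It is exponential and meant only to make the problem statement concrete; it is not the algorithm of the talk. The helper names (`two_colouring`, `min_fragment_removal`, `haplotypes_after_removing`) are hypothetical.

```python
from itertools import combinations

M = [
    "---OXXOO-",
    "-O-OX---X",
    "XXOXX----",
    "OOX----O-",
    "-------XO",
    "----OOOX-",
]

def conflict(a, b):
    """True if the two rows disagree at a SNP where both are called."""
    return any(x != "-" and y != "-" and x != y for x, y in zip(a, b))

def two_colouring(matrix, rows):
    """Return {row: 0 or 1} if the conflict graph on `rows` is bipartite, else None."""
    colour = {}
    for start in rows:
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in rows:
                if v == u or not conflict(matrix[u], matrix[v]):
                    continue
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    stack.append(v)
                elif colour[v] == colour[u]:
                    return None  # odd cycle found
    return colour

def min_fragment_removal(matrix):
    """First smallest set of fragments (1-based) whose removal leaves a bipartite graph."""
    n = len(matrix)
    for k in range(n + 1):
        for removed in combinations(range(n), k):
            kept = [r for r in range(n) if r not in removed]
            if two_colouring(matrix, kept) is not None:
                return [r + 1 for r in removed]

def haplotypes_after_removing(matrix, removed):
    """Merge each colour class column by column ('-' where no fragment is called)."""
    kept = [r for r in range(len(matrix)) if r + 1 not in removed]
    colour = two_colouring(matrix, kept)
    if colour is None:
        return None
    result = []
    for side in (0, 1):
        rows = [matrix[r] for r in kept if colour[r] == side]
        merged = ""
        for column in zip(*rows):
            called = [c for c in column if c != "-"]
            merged += called[0] if called else "-"
        result.append(merged)
    return tuple(result)

print(min_fragment_removal(M))
print(haplotypes_after_removing(M, {6}))
```

On this matrix one removal is enough. The exhaustive search reports fragment 3 because it tries candidates in index order; removing fragment 6, as on the slide, is an equally small solution and yields the two haplotypes shown there.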
Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph:
• NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]
• $O(|V| (\log\log|V| / \log|V|)^2)$-approximable [Halldórsson, 1999]
• not $O(|V|^{\epsilon})$-approximable for some $\epsilon$ [Lund and Yannakakis, 1993]

Note: for every graph G there is a matrix M s.t. G = GF(M)

Are there cases of M for which GF(M) is easier? YES: the gapless M

---OXXOO---OXOOX---    gap
---OXXOOXOXOXOOX---    gapless
---OXX--XO----OX---    2 gaps
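The gap notion in the three example rows can be made precise with a small helper. This is a sketch assuming that a gap is a maximal run of "-" lying strictly between two called SNPs, so leading and trailing dashes do not count; `count_gaps` and `is_gapless` are illustrative names.

```python
import re

def count_gaps(fragment):
    """Number of maximal runs of '-' strictly between two called SNPs."""
    core = fragment.strip("-")          # leading/trailing dashes are not gaps
    return len(re.findall(r"-+", core))

def is_gapless(matrix):
    """A matrix is gapless if none of its rows has a gap."""
    return all(count_gaps(row) == 0 for row in matrix)

for row in ("---OXXOO---OXOOX---", "---OXXOOXOXOXOOX---", "---OXX--XO----OX---"):
    print(row, count_gaps(row))
```

The three rows of the slide give 1, 0 and 2 gaps respectively.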
Why gaps?
• Sequencing errors (don't call with low confidence): ---OOXX?XX--- ===> ---OOXX-XX---
• Celera's mate pairs: attcgttgtagtggtagcctaaatgtcggtagaccttga attcgttgtagtggtagcctaaatgtcggtagaccttga

Why not gaps?
• EST (Expressed Sequence Tag) mapping
The gapless case: a reduction to Min Cost Flow

OXXO---
-XXOX--
-OXXX--
---OXX-
---XXO-
----XOX
----XXO

[Figure: network with source s and sink t drawn over the matrix; visible arc labels: 1,0; 2,0; 1,1; 1,1]

Put a capacity of 1 on each node: find a max circulation
Same construction, different costs: finds the 2 longest compatible haplotypes

M does not need to be gapless. It is enough that it can be sorted to become gapless (Consecutive Ones Property, Booth and Lueker, 1976)

NP-complete if 2 gaps per fragment (Max Induced Bipartite Subgraph on 3-regular graphs)

THEOREM: NP-complete even with 1 gap per fragment (reduction from MAX2SAT)
Fragment removal is good for getting rid of contaminants. However, we may instead want to keep all fragments and correct errors in another way.

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes. All fragments then get assigned to one of the two haplotypes.

Min SNP removal problem: remove the fewest columns from M so that the fragment conflict graph becomes bipartite.
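To make the min SNP removal problem concrete, here is an exhaustive-search sketch on the 6 x 9 fragment matrix used earlier in the talk. It tries column subsets in increasing size, so it is exponential and purely illustrative; `drop_columns`, `graph_is_bipartite` and `min_snp_removal` are hypothetical names.

```python
from itertools import combinations

M = [
    "---OXXOO-",
    "-O-OX---X",
    "XXOXX----",
    "OOX----O-",
    "-------XO",
    "----OOOX-",
]

def drop_columns(matrix, removed):
    """Copy of the matrix without the (0-based) columns in `removed`."""
    return ["".join(c for j, c in enumerate(row) if j not in removed)
            for row in matrix]

def conflict(a, b):
    """True if the two rows disagree at a SNP where both are called."""
    return any(x != "-" and y != "-" and x != y for x, y in zip(a, b))

def graph_is_bipartite(matrix):
    """2-colour the fragment conflict graph of `matrix` by depth-first search."""
    n = len(matrix)
    colour = {}
    for start in range(n):
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if v == u or not conflict(matrix[u], matrix[v]):
                    continue
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    stack.append(v)
                elif colour[v] == colour[u]:
                    return False
    return True

def min_snp_removal(matrix):
    """First smallest set of SNP columns (1-based) whose removal makes GF bipartite."""
    n = len(matrix[0])
    for k in range(n + 1):
        for removed in combinations(range(n), k):
            if graph_is_bipartite(drop_columns(matrix, set(removed))):
                return [j + 1 for j in removed]

print(min_snp_removal(M))
```

On this matrix removing SNP 5 alone is enough: all six fragments are kept and split into the groups {1, 2, 4} and {3, 5, 6}.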
SNP conflicts

1 2 3 4 5 6 7 8 9
- - - O X X O O -
- O X O X - - - X
X X O X X - - - -
O O X - - - O O -
- - - - - - X X O
- - - - O O O X -

[Animation: pairs of columns are examined in turn; some are marked OK, others CONFLICT!]

SNP conflict graph GS(M)
• 1 node for each SNP (column)
• edge between conflicting SNPs

[Figure: GS(M) drawn on nodes 1, ..., 9]
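The slides do not spell out when two SNPs conflict, so the sketch below rests on an assumption: two columns conflict when one fragment called at both has equal symbols there and another fragment called at both has different symbols. This matches the "Equal / Different" wording of the later lemma slide. The matrix is the one on this slide as transcribed, and the function names are illustrative.

```python
from itertools import combinations

# Matrix of the "SNP conflicts" slide, one string per fragment.
M = [
    "---OXXOO-",
    "-OXOX---X",
    "XXOXX----",
    "OOX---OO-",
    "------XXO",
    "----OOOX-",
]

def snps_conflict(matrix, i, j):
    """Assumed definition: among rows called at both columns i and j,
    some row has equal symbols and some other row has different symbols."""
    relations = set()
    for row in matrix:
        if row[i] != "-" and row[j] != "-":
            relations.add(row[i] == row[j])
    return len(relations) == 2

def snp_conflict_graph(matrix):
    """Edges (1-based SNP numbers) of GS(M) under the assumed definition."""
    n = len(matrix[0])
    return [(i + 1, j + 1)
            for i, j in combinations(range(n), 2)
            if snps_conflict(matrix, i, j)]

print(snp_conflict_graph(M))
```

Under that definition the conflicting pairs here are (2,5), (3,5), (4,5), (5,7), (6,7) and (7,8).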
THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

THEOREM 2 For a gapless M, GS(M) is a perfect graph

COROLLARY For a gapless M, the min SNP removal problem is polynomial
PROOF of THEOREM 1 (sketch): by minimal counterexample

Take an odd cycle in GF. There is a generic structure of a cycle made of horizontal and vertical lines:

--OOXXOO---------
----OOXOOXOXXO---
--------XXOXOXXX-
----XXOOXOXXO----
-------XOOOX-----
------XXXXXO-----
--XXOXXOXOO------

Remove extra symbols. Still a gapless odd cycle:

--OOX------------
----OOXOOXOXXO---
----------OXOX---
----------X------
----------O------
----------X------
--XXOXXOXOO------

Consider the "vertical lines". There cannot be only one vertical line in an odd cycle. We merge the rightmost and the next one to reduce their number by 1. The highlighted entry must be X (row 2):

--OOX------------
----OOXOOXXXXO---
----------OXOX---
----------X------
----------O------
----------X------
--XXOXXOXOO------

Merge the rightmost lines:

--OOX------------
----OOXOOXX------
----------O------
----------X------
----------O------
----------X------
--XXOXXOXOO------

Repeat. Again the highlighted entry must be X:

--OOX------------
----OOXOOXX------
----X-----O------
----O-----X------
----X-----O------
----O-----X------
--XXXXXOXOO------

We would end up with only 1 vertical line!!

--OOX------------
----O------------
----X------------
----O------------
----X------------
----O------------
--XXX------------
Note: Theorem 1 is not true if there are gaps

M:
     1 2 3
1    O - O
2    - O X
3    X X -

GF(M): nodes 1, 2, 3 (fragments)
GS(M): nodes 1, 2, 3 (SNPs)
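The counterexample can be checked mechanically. The sketch assumes the same SNP-conflict definition as before (one fragment called at both columns with equal symbols, another with different symbols), which the slides do not state explicitly; all function names are illustrative.

```python
from itertools import combinations

M = ["O-O", "-OX", "XX-"]  # the 3 x 3 matrix with gaps from the slide

def fragment_conflicts(matrix):
    """Edges of GF(M): rows that disagree at a SNP where both are called."""
    return [(a + 1, b + 1)
            for a, b in combinations(range(len(matrix)), 2)
            if any(x != "-" and y != "-" and x != y
                   for x, y in zip(matrix[a], matrix[b]))]

def snp_conflicts(matrix):
    """Edges of GS(M) under the assumed definition of SNP conflict."""
    edges = []
    for i, j in combinations(range(len(matrix[0])), 2):
        relations = {row[i] == row[j] for row in matrix
                     if row[i] != "-" and row[j] != "-"}
        if len(relations) == 2:
            edges.append((i + 1, j + 1))
    return edges

def is_bipartite(n, edges):
    """2-colour nodes 1..n by depth-first search over the edge list."""
    colour = {}
    for start in range(1, n + 1):
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for a, b in edges:
                if u not in (a, b):
                    continue
                v = b if a == u else a
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    stack.append(v)
                elif colour[v] == colour[u]:
                    return False
    return True

gf = fragment_conflicts(M)
gs = snp_conflicts(M)
print(gf, is_bipartite(len(M), gf))
print(gs)
```

Here GF(M) is a triangle, hence not bipartite, while GS(M) has no edges at all: no pair of SNPs is covered by two fragments, so GS(M) is an independent set and the equivalence of Theorem 1 fails.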
THEOREM 2 For a gapless M, GS(M) is a perfect graph

PROOF: GS(M) is the complement of a comparability graph A. Comparability graphs are perfect, and so are their complements.

Comparability graphs: undirected graphs that can be oriented to become a partial order
LEMMA: If i < j < k and (i,k) is a SNP conflict, then either (i,j) or (j,k) is also a SNP conflict

[Figure: rows spanning columns i, j, k with an undetermined entry "?" in column j. Equal: conflicts with i. Different: conflicts with k.]

I.e. if (i,j) is not a conflict and (j,k) is not a conflict, then (i,k) is not a conflict either.

So the graph A with an edge (u,v) whenever u < v and u is not in conflict with v is a comparability graph, and GS(M) is the complement of A.

NOTE: independent set on perfect graphs is in P [Lovász, Schrijver, Grötschel, 1984]
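The lemma can be spot-checked on the gapless matrix from the Min Cost Flow slide. As in the earlier sketches, the SNP-conflict definition used here (one fragment called at both columns with equal symbols, another with different symbols) is an assumption, and `lemma_holds` is an illustrative name. A check on one matrix is of course not a proof.

```python
from itertools import combinations

def snps_conflict(matrix, i, j):
    """Assumed definition: rows called at both i and j show both 'equal' and 'different'."""
    relations = {row[i] == row[j] for row in matrix
                 if row[i] != "-" and row[j] != "-"}
    return len(relations) == 2

def lemma_holds(matrix):
    """For all i < j < k: a conflict (i,k) implies a conflict (i,j) or (j,k)."""
    n = len(matrix[0])
    for i, j, k in combinations(range(n), 3):
        if snps_conflict(matrix, i, k) and not (
                snps_conflict(matrix, i, j) or snps_conflict(matrix, j, k)):
            return False
    return True

# Gapless matrix from the Min Cost Flow slide.
GAPLESS = ["OXXO---", "-XXOX--", "-OXXX--", "---OXX-",
           "---XXO-", "----XOX", "----XXO"]

# A small matrix with gaps, made up for this sketch.
WITH_GAPS = ["O-O", "X-O"]

print(lemma_holds(GAPLESS))
print(lemma_holds(WITH_GAPS))
```

The gapless matrix passes. In the made-up gapped matrix, columns 1 and 3 conflict while column 2 is called by no fragment, so neither (1,2) nor (2,3) can be a conflict and the property fails.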
THEOREM: Min SNP removal is NP-complete if there can be gaps (even only 2 per fragment). Reduction from MAXCUT.

QUESTION: what happens to the above problems when the max length of a gap is bounded?

QUESTION: what about mixed formulations?
ACKNOWLEDGEMENTS
• Alberto Caprara, for suggesting the use of comparability graphs for the proof
• Jinghui Zhang and Andy Clark, for inspiring discussions on SNP discovery and validation
[Figure: diagram linking Genome Sequence, Genetic Variation, Polymorphism, Protein and Gene Structure, Non Protein Gene Products and Functional Genomics to application areas. Labels include: SNP Discovery, Allele Frequency Assessment, Association Studies, Assay Development, Clinical Trials, Diagnostic and Prognostic Kits, Genome Organization and Annotation, Alternative Splicing and Regulatory Elements, 3 Dimensional Structure, Protein Modeling, Phylogeny Studies, Ribozymes, RNA Editing, sno RNA, Methylation*, Pseudouridine**, Anti-Sense, Discovery of New Pharma Targets, Rational Drug Design, Reduce Adverse Reactions, Enhance Therapeutic Effect, Gene Expression Networks, Gene Regulation Pathways, Transduction Pathways, Disease Susceptibility, New Drug Design. Footnotes: * Site-specific Ribose Methylation of preribosomal RNA; ** Site-specific synthesis of Pseudouridine in Ribosomal RNA.]
[Figure: "The Circle of Life" diagram. Labels include: BRCA2, BARD1, RAD51, RNA polymerase II, BRCA1, BAP1, Importin-a; Cyberpharmaceutical Computing; Genomics, Proteomics, Structural Genomics, Functional Genomics, Comparative Genomics; Pharma, Academic, Biotech, Agriculture Needs; Animal Models; DNA Expression Profiles; SNPs and Pharmacogenomics; Drug and Vaccine Design.]