Outline

Outline. Shotgun assembly. Haploid separation and errors. Formalizing the problem. Gaps and gapless. Polynomial and NP-complete cases. Main result in a nutshell: gapless problems are polynomial; with gaps they are NP-complete. Shotgun Assembly [Weber and Myers, 1997].

eileen

Presentation Transcript


  1. Outline
     • Shotgun assembly
     • Haploid separation and errors
     • Formalizing the problem
     • Gaps and gapless
     • Polynomial and NP-complete cases
     Main result in a nutshell: gapless problems are polynomial; with gaps they are NP-complete.

  2. Shotgun Assembly [Weber and Myers, 1997]
     Genome:
       ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
     Fragmentation:
       ACTGA GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA
     Sequencing:
       TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA
     Assembly:
       ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

  3. 2 Chromosomes instead of one consensus:
       ACTGAAAGCGATCGATCGACTAGAGACAGATAG
       ACTGATAGCGATCCATCGATTAGAGTCAGATAG
     Fragments: ATTAGAGTCA GAAAGC TAGAGTCAGA CTAGAGAC ATCGATAG ATCCAT
     • How do we find both chromosomes from the isolated fragments?

  4. 2 Chromosomes instead of one consensus:
       ACTGAAAGCGATCGATCGACTAGAGACAGATAG
       ACTGATAGCGATCCATCGATTAGAGTCAGATAG
     Fragments: ATTAGAGTCA GAAAGC TAGAGTCAGA CTAGAGAC ATCGATAG ATCCAT
     • We use SNPs (Single Nucleotide Polymorphisms)

  5. Haplotype 1: actgaAagcgatcGatcgaCtagagA_______
     Haplotype 2: actgaTagcgatcCatcgaGtagagT_______
     Genotype: this individual is (AGCA, TCGT)
     SNP frequency: about 1 every 1,000 bp
     SNPs are useful for:
     - disease association
     - forensics
     - assembly of genomes
     - evolutionary studies
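
A minimal sketch of how the genotype on this slide can be read off the two haplotypes: compare the aligned sequences position by position and keep the sites where they differ. The function names `snp_sites` and `genotype` are hypothetical, and the trailing underscores of the slide's sequences are left out for brevity.

```python
def snp_sites(hap1: str, hap2: str) -> list[int]:
    """Return the 0-based positions where two aligned haplotypes differ."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must be aligned to the same length")
    pairs = zip(hap1.upper(), hap2.upper())
    return [i for i, (a, b) in enumerate(pairs) if a != b]


def genotype(hap1: str, hap2: str) -> tuple[str, str]:
    """Return the alleles of each haplotype at the SNP sites only."""
    sites = snp_sites(hap1, hap2)
    first = "".join(hap1[i].upper() for i in sites)
    second = "".join(hap2[i].upper() for i in sites)
    return first, second


# Sequences from the slide (upper-case letters mark the SNPs there).
H1 = "actgaAagcgatcGatcgaCtagagA"
H2 = "actgaTagcgatcCatcgaGtagagT"

print(snp_sites(H1, H2))  # [5, 13, 19, 25]
print(genotype(H1, H2))   # ('AGCA', 'TCGT')
```

In practice the two haplotypes are not known in advance; the rest of the talk is about recovering them from fragments.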

  6. ERROR SOURCES
     Sequencing errors:
       ACTGCCTGGCCAATGGAACGGACAAG
       Fragments: CTGGCCAAT CATTGGAAC AATGGAACGGA
     Paralogous regions:
       ACAAACCCTTTGGGACT … CTAGTAAACCCTATGGGGA
       Fragments: AAACCCTT TAAACCCT CTATGGGA CCTATGG CTTTGGGACT ACCCTATGGG

  7. The data: a SNP matrix
     Aligned fragments:
       ACTGAAAGCGA ACTAGAGACAGCATG
       ACTGATAGC GTAGAGTCA
       ACTG TCGACTAGA CATG
       ACTGA CGATCCATCG TCAGC
       ACTGAAA ATCGATC AGCATG
     Fragments with the spacing removed:
       ACTGAAAGCGAACTAGAGACAGCATG
       ACTGATAGCGTAGAGTCA
       ACTGTCGACTAGACATG
       ACTGACGATCCATCGTCAGC
       ACTGAAAATCGATCAGCATG
     Resulting SNP matrix entries (layout lost in extraction): XXO OOX X XX X O

  8. SNPs 1,..,n (columns), Fragments 1,..,m (rows)
          1 2 3 4 5 6 7 8 9
       1  - - - O X X O O -
       2  - O - O X - - - X
       3  X X O X X - - - -
       4  O O X - - - - O -
       5  - - - - - - - X O
       6  - - - - O O O X -

  9. SNPs 1,..,n (columns), Fragments 1,..,m (rows)
          1 2 3 4 5 6 7 8 9
       1  - - - O X X O O -
       2  - O - O X - - - X
       3  X X O X X - - - -
       4  O O X - - - - O -
       5  - - - - - - - X O
       6  - - - - O O O X -
     Fragment conflict: can’t be on same haplotype

  10. SNPs 1,..,n (columns), Fragments 1,..,m (rows)
          1 2 3 4 5 6 7 8 9
       1  - - - O X X O O -
       2  - O - O X - - - X
       3  X X O X X - - - -
       4  O O X - - - - O -
       5  - - - - - - - X O
       6  - - - - O O O X -
     Fragment conflict: can’t be on same haplotype
     Fragment Conflict Graph GF(M) [Figure: graph on fragment nodes 1 to 6]
     We have 2 haplotypes iff GF is BIPARTITE

  11. SNPs 1,..,n (columns), Fragments 1,..,m (rows)
          1 2 3 4 5 6 7 8 9
       1  - - - O X X O O -
       2  - O - O X - - - X
       3  X X O X X - - - -
       4  O O X - - - - O -
       5  - - - - - - - X O
       6  - - - - O O O X -
     PROBLEM (Fragment Removal): make GF Bipartite
     [Figure: graph GF(M) on fragment nodes 1 to 6]

  12. SNPs 1,..,n (columns), Fragments 1,..,m (rows)
          1 2 3 4 5 6 7 8 9
       1  - - - O X X O O -
       2  - O - O X - - - X
       3  X X O X X - - - -
       4  O O X - - - - O -
       5  - - - - - - - X O
       6  - - - - O O O X -
     PROBLEM (Fragment Removal): make GF Bipartite
     With fragment 6 removed, the remaining fragments split into two groups:
          1 2 3 4 5 6 7 8 9
       1  - - - O X X O O -
       2  - O - O X - - - X
       4  O O X - - - - O -
          O O X O X X O O X   (haplotype of fragments 1, 2, 4)
       3  X X O X X - - - -
       5  - - - - - - - X O
          X X O X X - - X O   (haplotype of fragments 3, 5)
     [Figure: graph GF(M) on fragment nodes 1 to 6]
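
The construction on these slides can be sketched in a few lines: build the fragment conflict graph, try to 2-color it, and merge each color class into a haplotype. This is an illustrative sketch, not the authors' code; the helper names (`conflict`, `conflict_graph`, `two_color`, `haplotypes`) are hypothetical, and rows are written as strings over `O`, `X`, `-`.

```python
from collections import deque

# SNP matrix from the slides: rows are fragments 1..6, columns are SNPs 1..9.
M = [
    "---OXXOO-",
    "-O-OX---X",
    "XXOXX----",
    "OOX----O-",
    "-------XO",
    "----OOOX-",
]


def conflict(f: str, g: str) -> bool:
    """Two fragments conflict if they disagree at a SNP both of them cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def conflict_graph(rows):
    """Adjacency sets of the fragment conflict graph GF(M), 0-based nodes."""
    adj = {i: set() for i in range(len(rows))}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if conflict(rows[i], rows[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def two_color(adj):
    """BFS 2-coloring; returns {node: 0 or 1}, or None if an odd cycle exists."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None
    return color


def haplotypes(rows):
    """Merge each color class into one haplotype, or None if GF is not bipartite."""
    color = two_color(conflict_graph(rows))
    if color is None:
        return None
    result = []
    for side in (0, 1):
        merged = []
        for col in range(len(rows[0])):
            # Fragments on one side never disagree, so at most one symbol remains.
            symbols = {rows[i][col] for i in color if color[i] == side} - {"-"}
            merged.append(symbols.pop() if symbols else "-")
        result.append("".join(merged))
    return result


print(haplotypes(M))      # None: fragments 1, 3, 6 form a triangle
print(haplotypes(M[:5]))  # ['OOXOXXOO X'.replace(' ', ''), 'XXOXX--XO']
```

Dropping fragment 6 (the last row) reproduces the two haplotypes shown on the slide.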

  13. Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph:
     • NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]
     • O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]
     • not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]
     Note: for every G there is M s.t. G = GF(M)
     Are there cases of M for which GF(M) is easier? YES: the gapless M
       ---OXXOO---OXOOX---   1 gap
       ---OXXOOXOXOXOOX---   gapless
       ---OXX--XO----OX---   2 gaps
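
To make the notion of a gap concrete, here is a small sketch that counts the gaps of a fragment, assuming a gap means a maximal run of `-` strictly between the first and last called SNP. The name `count_gaps` is hypothetical.

```python
import re


def count_gaps(fragment: str) -> int:
    """Count maximal runs of '-' between the first and last called SNP."""
    core = fragment.strip("-")  # leading and trailing '-' are not gaps
    return len(re.findall(r"-+", core))


for row in ("---OXXOO---OXOOX---", "---OXXOOXOXOXOOX---", "---OXX--XO----OX---"):
    print(row, count_gaps(row))  # 1, 0 and 2 gaps respectively
```

A matrix is gapless when every row has a count of zero.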

  14. Why gaps?
     • Sequencing errors (don’t call with low confidence): ---OOXX?XX--- ===> ---OOXX-XX---
     • Celera’s mate pairs:
         attcgttgtagtggtagcctaaatgtcggtagaccttga
         attcgttgtagtggtagcctaaatgtcggtagaccttga
     Why not gaps?
     • EST (Expressed Sequence Tag) mapping

  15. The gapless case: a reduction to Min Cost Flow
       OXXO---
       -XXOX--
       -OXXX--
       ---OXX-
       ---XXO-
       ----XOX
       ----XXO

  18. The gapless case: a reduction to Min Cost Flow
       OXXO---
       -XXOX--
       -OXXX--
       ---OXX-
       ---XXO-
       ----XOX
       ----XXO
     [Figure: flow network from source s to sink t over the fragments; arc labels 1,0; 2,0; 1,1; 1,1]
     Put capacity of 1 on each node: find max circulation

  19. • Same construction, different costs, finds the 2 longest compatible haplotypes.
     • M does not need to be gapless. It is enough if it can be sorted to become gapless (Consecutive Ones Property, Booth and Lueker, 1976).
     • NP-complete if 2 gaps per fragment (Max Bipartite Induced Subgraph on 3-regular graphs).
     THEOREM: NP-complete even with 1 gap per fragment (reduction from MAX2SAT).

  20. Fragment removal is good for getting rid of contaminants. However, we may want to keep all fragments and correct errors in some other way. A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes. All fragments then get assigned to one of the two haplotypes. We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.
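
As an illustration of the problem statement only, the following brute-force sketch tries column subsets of increasing size until the fragment conflict graph becomes bipartite. It is exponential in the number of SNPs and is not the polynomial method of the talk; the matrix is the one from slides 8 to 12 and all function names are hypothetical.

```python
from itertools import combinations

# SNP matrix from slides 8-12: rows are fragments, columns are SNPs 1..9.
M = [
    "---OXXOO-",
    "-O-OX---X",
    "XXOXX----",
    "OOX----O-",
    "-------XO",
    "----OOOX-",
]


def conflict(f: str, g: str) -> bool:
    """Two fragments conflict if they disagree at a SNP both of them cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def is_bipartite(rows) -> bool:
    """2-color the fragment conflict graph with an explicit DFS stack."""
    color = {}
    for start in range(len(rows)):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(rows)):
                if v == u or not conflict(rows[u], rows[v]):
                    continue
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def drop_columns(rows, cols):
    """Return the matrix with the given 0-based columns removed."""
    return ["".join(ch for j, ch in enumerate(r) if j not in cols) for r in rows]


def min_snp_removal(rows):
    """Smallest set of SNPs (1-based) whose removal makes GF bipartite."""
    n_cols = len(rows[0])
    for k in range(n_cols + 1):
        for cols in combinations(range(n_cols), k):
            if is_bipartite(drop_columns(rows, set(cols))):
                return [c + 1 for c in cols]


print(min_snp_removal(M))  # [5]
```

On this matrix, removing SNP 5 alone is enough, whereas the fragment removal formulation has to discard a whole fragment.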

  21. SNP conflicts
       - - - O X X O O -
       - O X O X - - - X
       X X O X X - - - -
       O O X - - - O O -
       - - - - - - X X O
       - - - - O O O X -

  22. SNP conflicts
       - - - O X X O O -
       - O X O X - - - X
       X X O X X - - - -
       O O X - - - O O -
       - - - - - - X X O
       - - - - O O O X -
     OK

  25. SNP conflicts
       - - - O X X O O -
       - O X O X - - - X
       X X O X X - - - -
       O O X - - - O O -
       - - - - - - X X O
       - - - - O O O X -
     CONFLICT!

  27. SNP conflicts
       - - - O X X O O -
       - O X O X - - - X
       X X O X X - - - -
       O O X - - - O O -
       - - - - - - X X O
       - - - - O O O X -
     SNP conflict graph GS(M):
     • 1 node for each SNP (column)
     • edge between conflicting SNPs

  28. SNP conflicts
       1 2 3 4 5 6 7 8 9
       - - - O X X O O -
       - O X O X - - - X
       X X O X X - - - -
       O O X - - - O O -
       - - - - - - X X O
       - - - - O O O X -
     [Figure: SNP conflict graph GS(M) on SNP nodes 1 to 9]
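
The edges of GS(M) can be computed directly from the matrix. The sketch below assumes the usual definition, which the slides do not spell out: two SNPs conflict when, among the fragments covering both columns, at least one has equal symbols in the two columns and at least one has different symbols. The name `snp_conflicts` is hypothetical.

```python
from itertools import combinations

# SNP matrix from this slide: rows are fragments, columns are SNPs 1..9.
M = [
    "---OXXOO-",
    "-OXOX---X",
    "XXOXX----",
    "OOX---OO-",
    "------XXO",
    "----OOOX-",
]


def snp_conflicts(rows):
    """Return the edges of GS(M) as sorted pairs of 1-based SNP indices."""
    edges = []
    for i, j in combinations(range(len(rows[0])), 2):
        # For every fragment covering both SNPs, record whether it has
        # equal (True) or different (False) symbols in the two columns.
        relations = {r[i] == r[j] for r in rows if r[i] != "-" and r[j] != "-"}
        if len(relations) == 2:  # both "equal" and "different" occur
            edges.append((i + 1, j + 1))
    return edges


print(snp_conflicts(M))
# [(2, 5), (3, 5), (4, 5), (5, 7), (6, 7), (7, 8)]
```

Under this assumed definition, SNPs 5 and 7 together touch every edge, so removing those two columns leaves GS(M) without conflicts.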

  29. THEOREM 1: For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     THEOREM 2: For a gapless M, GS(M) is a perfect graph
     COROLLARY: For a gapless M, the min SNP removal problem is polynomial

  30. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOXXOO---------
       ----OOXOOXOXXO---
       --------XXOXOXXX-
       ----XXOOXOXXO----
       -------XOOOX-----
       ------XXXXXO-----
       --XXOXXOXOO------
     Take an odd cycle in GF

  31. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOXXOO---------
       ----OOXOOXOXXO---
       --------XXOXOXXX-
       ----XXOOXOXXO----
       -------XOOOX-----
       ------XXXXXO-----
       --XXOXXOXOO------
     There is a generic structure of horizontal-vertical cycle

  32. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOXXOO---------
       ----OOXOOXOXXO---
       --------XXOXOXXX-
       ----XXOOXOXXO----
       -------XOOOX-----
       ------XXXXXO-----
       --XXOXXOXOO------
     Remove extra symbols. Still a gapless odd cycle

  33. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXOXXO---
       ----------OXOX---
       ----------X------
       ----------O------
       ----------X------
       --XXOXXOXOO------
     Remove extra symbols. Still a gapless odd cycle

  34. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXOXXO---
       ----------OXOX---
       ----------X------
       ----------O------
       ----------X------
       --XXOXXOXOO------
     “Vertical lines”: there cannot be only one vertical line in an odd cycle.
     We merge the rightmost line and the next one to reduce their number by 1

  35. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXOXXO---
       ----------OXOX---
       ----------X------
       ----------O------
       ----------X------
       --XXOXXOXOO------
     Must be X

  36. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXXXXO---
       ----------OXOX---
       ----------X------
       ----------O------
       ----------X------
       --XXOXXOXOO------
     Must be X
     Merge the rightmost lines

  37. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXX------
       ----------O------
       ----------X------
       ----------O------
       ----------X------
       --XXOXXOXOO------
     Merge the rightmost lines
     Repeat

  38. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXX------
       ----------O------
       ----------X------
       ----------O------
       ----------X------
       --XXOXXOXOO------
     Must be X

  39. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----OOXOOXX------
       ----X-----O------
       ----O-----X------
       ----X-----O------
       ----O-----X------
       --XXXXXOXOO------
     Must be X

  40. THEOREM 1 For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
     PROOF (sketch): by minimal counterexample
       --OOX------------
       ----O------------
       ----X------------
       ----O------------
       ----X------------
       ----O------------
       --XXX------------
     We would end up with only 1 vertical line!!

  41. Note: Theorem not true if there are gaps
     M:
          1 2 3
       1  O - O
       2  - O X
       3  X X -
     [Figure: GF(M) on fragment nodes 1, 2, 3 and GS(M) on SNP nodes 1, 2, 3]
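
The counterexample can be checked mechanically. The sketch below computes the edges of both graphs for the 3×3 matrix on this slide, assuming the usual definitions: fragments conflict when they disagree at a SNP both cover, and SNPs conflict when one covering fragment has equal symbols in the two columns and another has different symbols. Function names are hypothetical.

```python
from itertools import combinations

# The 3x3 matrix with gaps from this slide.
M = ["O-O", "-OX", "XX-"]


def fragment_edges(rows):
    """Edges of GF(M): pairs of fragments that disagree at a shared SNP."""
    edges = []
    for u, v in combinations(range(len(rows)), 2):
        pairs = zip(rows[u], rows[v])
        if any(a != "-" and b != "-" and a != b for a, b in pairs):
            edges.append((u + 1, v + 1))
    return edges


def snp_edges(rows):
    """Edges of GS(M) under the assumed SNP conflict definition."""
    edges = []
    for i, j in combinations(range(len(rows[0])), 2):
        relations = {r[i] == r[j] for r in rows if r[i] != "-" and r[j] != "-"}
        if len(relations) == 2:
            edges.append((i + 1, j + 1))
    return edges


print(fragment_edges(M))  # [(1, 2), (1, 3), (2, 3)]: a triangle, not bipartite
print(snp_edges(M))       # []: GS(M) is an independent set
```

GS(M) has no edges while GF(M) is an odd cycle, so the equivalence of Theorem 1 breaks once gaps are allowed.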

  42. THEOREM 2 For a gapless M, GS(M) is a perfect graph
     PROOF: GS(M) is the complement of a comparability graph A. Comparability graphs are perfect, and the complement of a perfect graph is perfect.
     Comparability graphs: undirected graphs whose edges can be oriented to form a partial order

  43. LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,j) or (j,k) is also a SNP conflict
     [Figure: example columns i, j, k, with the entries of column j marked "?". Annotations: "Equal: conflicts with i", "Different: conflicts with k"]
     I.e. if (i,j) is not a conflict and (j,k) is not a conflict, then (i,k) is not a conflict either.
     So the graph A, with an edge (u,v) whenever u < v and u does not conflict with v, is a comparability graph, and GS is the complement of A.
     NOTE: independent set on perfect graphs is in P (Lovász, Schrijver, Grötschel, 1984)

  44. THEOREM: The min SNP removal problem is NP-complete if there can be gaps (even only 2 per fragment). Reduction from MAXCUT.
     QUESTION: what happens to the above problems when the max length of a gap is bounded?
     QUESTION: what about mixed formulations?

  45. ACKNOWLEDGEMENTS
     • Alberto Caprara, for suggesting the use of comparability graphs for the proof
     • Jinghui Zhang and Andy Clark, for inspiring discussions on SNP discovery and validation

  46. SNP Discovery Allele Frequency Assessment Association Studies Assay Development Genome Organization and Annotation Clinical Trials Diagnostic and Alternative Splicing and Regulatory Elements 3 Dimensional Structure Prognostic Kits Protein Modeling Phylogeny Studies Ribozymes RNA Editing Discovery of New Pharma Targets sno RNA Rational Drug Design Methylation * Reduce Adverse Reactions Pseudouridine** Enhance Therapeutic Effect Anti-Sense Gene Expression Networks Gene Regulation Pathways *Site-specific Ribose Methylation of Transduction Pathways preribosomal RNA ** Site-specific synthesis of Pseudouridine in Ribosomal RNA Protein and Gene Polymorphism Structure Genetic Variation Genome Non Protein Gene Disease Susceptibility Sequence Products New Drug Design Functional Genomics

  47. [Figure: "The Circle of Life" diagram. Labels: BRCA2, BARD1, RAD51, RNA polymerase II, BRCA1, BAP1, Importin-a; Cyberpharmaceutical Computing; Genomics; Proteomics; Structural Genomics; Pharma, Academic, Biotech, Agriculture Needs; Functional Genomics; Animal Models; DNA Expression Profiles; SNPs and Pharmacogenomics; Comparative Genomics; Drug and Vaccine Design]
