950 likes | 1.11k Views
Maximum Likelihood Genome Assembly. Paul Medvedev Michael Brudno. Bioinformatics Algorithms. Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman Khalid Mahmood. Department of Computer Science and Engineering. BUET. *Some of the slides are taken from other sources.
E N D
Maximum Likelihood Genome Assembly Paul Medvedev Michael Brudno Bioinformatics Algorithms Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman Khalid Mahmood Department of Computer Science and Engineering BUET *Some of the slides are taken from other sources
Computational Genomics • Our genome encodes an enormous amount of information about our beings • our looks • our size • how our bodies work • …. • our health • our behaviors • … who we are! gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccg…….
Contributions of the paper • Two-fold, first one being : • First exact polynomial time algorithm for the shortestdouble-stranded genome, given its k-molecule spectrum • A problem that was solved for strings, but remained open for molecules
Contributions of the paper • Second one : • Oppose the idea of shortest genome • Because It overcollapses • Instead propose a new objective : • A maximum likelihood framework for assembling the genome that is most likely the source of the reads.
Contributions of the paper • Maximum likelihood framework • Assumes perfect reads • Uniform distribution • Advantage of high coverage (NGS) • Estimate copy counts of repeats • Combine with matepair data • Read => Contigs
Outline • Whole Genome Shotgun Assembly • Review of Related Work • The Medvedev-Brudno Method • BidirectedOverlap Graph • Adjustments to the Standard Min-cost Biflow Problem • Maximizing the Global Read-Count Likelihood • Efficiently Solving a Min-cost Biflow • Flow to Contigs • Conflict node resolution • Results • Discussion
Outline • Whole Genome Shotgun Assembly • Review of Related Work • The Medvedev-Brudno Method • BidirectedOverlap Graph • Adjustments to the Standard Min-cost Biflow Problem • Maximizing the Global Read-Count Likelihood • Efficiently Solving a Min-cost Biflow • Flow to Contigs • Conflict node resolution • Results • Discussion
Whole Genome Shotgun Sequencing DNA SEQUENCER Sanger vs. NGS reads ASSEMBLER C++ • Problems in Assembly • Sequencing Errors • Unknown Orientation • Incomplete Coverage • Repeats contigs FINISHING sequence
Whole Genome Shotgun Sequencing • Break genome into shotgun-sized fragments and sequence • Match the overlapping regions of contiguous sequences • Demonstrated by Celera Genomics to be feasible for whole genome assembly • Sequenced human genome at 1/10’th the cost of the public Human Genome Project
Whole Genome Assembly Next Generation Sequencing (NGS) ?? • Improved speed and cost-effectiveness relative to the other methods… • … but much shorter read length (25-200 bp) • Only proven on re sequencing projects, i.e. a reference genome is already available • Posses significant challenges to the problem of de novo genome assembly – determination of a completely unknown genome.
Assemblers • Previous (Sanger) Assemblers • NGS Assemblers • SSAKE (Jeck et al., 2007) • VCAKE (Warren et al. 2007) • SHARCGS (Dohm et al. 2007) • Shorty (Chen and Skiena 2007) • ALLPATHS (Butler et al. 2008) • Edena (Hernandez et al. 2008) • Euler-(U)SR (Chaisson and Pevzner 2008, 2009) • Velvet (Zerbino and Birney, 2008)
Outline • Whole Genome Shotgun Assembly • Review of Related Work • The Medvedev-Brudno Method • BidirectedOverlap Graph • Adjustments to the Standard Min-cost Biflow Problem • Maximizing the Global Read-Count Likelihood • Efficiently Solving a Min-cost Biflow • Flow to Contigs • Conflict node resolution • Results • Discussion
Theoretical view • Input: set of strings over {A,C,G,T} called reads • Output: A common superstring of the reads. • {TACAT,CATAC, ACGTAC} TACATACGTAC • Initially: Shortest Common Superstring (SCS) • NP-hard [Gallant et al 1980] • Over-collapsing of repeats • Can be found using a TSP solver • de Bruijn graphs [Pevzner, Tang, Waterman 01] • string graphs [Myers 05] • Both formulations are NP-hard.
String graph (Myers) • Represent reads as vertices, and read overlaps as edges • Remove redundant edges • Establish edge constraints • Unique? (flow is exactly one) • Required? (min. flow is 1) • Optional? (min. flow is 0) • Find shortest walk
EULER assembler (Pevzner, Tang and Waterman) • Represent reads as edges and overlaps as vertices in a de Bruijn graph • Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once • Repeats dealt with by using multiple edges for a single repeat read
Overlap Graph • Nodes are reads • Edges are overlaps • Weights are lengths of prefix • TSP Tour is SCS • Example: • {TACAT,CATAC, ACGTAC} TACATACGTAC ACGTAC 5 3 3 CATAC 5 2 2 TACAT
Maximum Likehood Genome Assembly {Medvedev, Brundo} Why Shortest CS? • DNA is full of repeats: identical and nearly identical copies that appear multiple times • Alu repeat is 300beses long, present 1,000,000 times in the human genome • SCS approach “over-collapses” the repeats: they are only present once in the answer • Solution: Model repeats explicitly through either de Brujin graph or String graps • Maybe this will also become tractable?
De Bruijn Graphs {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Nodes are (k-1)-mers • Edges are k-mers • The set of k-mers is called a k-spectrum • Finding shortest string with given k-spectrum equivalent to Chinese Postman TT TC AT CA GC AG Pevzner 1989
De Bruijn Graphs with Walks {AGC, ATTCA, CATT, GCAG, ATG} • Nodes are (k-1)-mers • Edges are k-mers • Reads are walks • Finding superwalk (one that includes all walks) • Not a polynomial time problem • De BruijnSuperwalk is NP-hard TT TC AT CA GC AG Pevzner et al 2001
Chinese Postman Tours • Solving Chinese Postman: An Eulerian tour is a solution • Euleriazation: make a graph Eulerian • Can be done with min cost flow: • Unbalanced nodes are sources/sinks • Duplicate all edges used in flow {AGC, ATTCA, CATT, GCAG, ATG} TT TC AT CA GC AG Pevzner 1989
{GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ GTTGGCAAT GG AC AT TG TT GC CA GT CC CC CA GG AC AT GT GC TG TT AA DNA is not a String {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’ AA • The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989]
Modeling Double Strandedness Kececioglu 91, Kececioglu-Meyers 95
5’ 3’ GTTGGCAAT Modeling Double-Strandedness • How can two DNA molecules overlap? A A C +CTT +AAC -AAG -GTT ATTGCCAAC C T T 5’ 3’ A A C +TCG +AAC -CGA -GTT T C G T G G +TGG +AAC -CCA -GTT A A C Kececioglu 1992
+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Walks in bidirected graphs • A walk has to “match” directions at each node. • Suppose the node +AA/TT-. • Edge orientations correspond to strands • A path can use a node in both orientations
+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Rules for Matching Directions • When we walk through it, we can • Come in using in arrow, then leave using out arrow • This is forward, so read the “+” strand. i.e. AA here • Come in using out arrow, then leave using in arrow • This is backward, so Read the “-” strand, i.e TT here.
+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected Graphs • So what this walk corresponds to ? • GGCAAT • ATTGCC
The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum {GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ +AC +AA GTTGGCAAT +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC +AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected de Bruijn Graphs {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’
Motivation: Overlap Graphs • Several downsides of the de Bruijn approach • Division into k-mers arbitrary • Very sensitive to sequencing errors • Not memory efficient (one node per k-mer) • Goal • One node per read (or better) • No division into k-mers • Flexibility in the presence of sequencing errors Myers 2005
ACGTAC CATAC TACAT How To Build A Overlap Graph (1) TACATACGTAC {ACGTAC, CATAC, TACAT} • Nodes are reads • Edges are overlaps • Weights are lengths of non-overlapping prefix • Transitively inferable overlaps 3 5 3 2 2
Bidirected Overlap Graph • In this work, authors have used Bidirected overlap graph. • In a bidirected overlap graph, each vertex is a double-stranded read • Edges represent read overlaps
Bidirected Overlap Graph • Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges) • Suppose we have two reads r1 and r2 • Each read can be oriented to the left or to the right • The three possible overlaps are: • i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way) • ii) r1 points left and r2 points right • iii) r1 points right and r2 points left
Bidirected Overlap Graph • The overlap graph is constructed by placing an edge between two reads if they overlap by a minimum number of characters omin • Question: How is omin determined? • Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps
Observation • A bidirected graph contains an Eulerian circuit if and only if it is connected and balanced.
Chinese postman Problem on Bidirected Graphs • Let G be a weighted bidirected graph. There exists a circuit of weight i if and only if there exists an Eulerian extension of weight i. • G has a circuit if and only if it is strongly connected. • The minimum weight Eulerian extension of G has at most 2|E||V| edges.
Chinese postman Problem on Bidirected Graphs • The running time of Algorithm 1 is O(|E|2log(|V|)log(E)). • Gabow’s algorithm runs in O(|E|2log(|V|)log(max(u(e))) • u is the flow upper bound function • f(e) <= 2 |E| |V| for every edge e, • So, we can safely let u(e) = 2 |E| |V|
Chinese postman Problem on Bidirected Graphs • Hence the theorem is proved : • Given a set of k-molecules S, we can find the shortest (k-1)-circular DNA molecule whose k-molecule spectrum is S in time O(|S|2log2(|S|)). • This is a polynomial time algorithm, explicitly handling the double strandedness • The first main result of this paper.
Outline • Whole Genome Shotgun Assembly • Review of Related Work • The Medvedev-Brudno Method • BidirectedOverlap Graph • Adjustments to the Standard Min-cost Biflow Problem • Maximizing the Global Read-Count Likelihood • Efficiently Solving a Min-cost Biflow • Flow to Contigs • Conflict node resolution • Results • Discussion
Sequence assembly using NGS Sequence assembly using NGS • Several methods available now (e.g. SSAKE, VCAKE, SHARCGS, etc.) • All of these assume that the length of the assembled genome must be minimized • Results in over-collapsing of repeats • Given ubiquity of repeats in eukaryotic genomes, authors considered this a poor assumption
Goal of an Assembler • What should the goal of an assembler be ?? • Shortest string ?? • Problem of over-collapse
Maximum Likelihood Genome Assembly • Change goal of sequence assembly • Maximize the likelihood that the resultant genome was the source of the given reads • Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats • Maximizing the likelihood of observed read frequencies can be cast as mininum cost bidirected flow (biflow) problem • Allows solution to be obtained with an off-the-shelf network flow solver • Authors claim 99.99% accuracy
Maximum Likelihood Genome Assembly • Second important aspect is the use of matepair information for joining contigs • Other systems look for all paths between mated reads • The proposed Method looks only for short paths between some pairs of reads • Question: How to decide the upper bound for these “short paths”? And how to decide which pairs of reads to examine?
Outline • Whole Genome Shotgun Assembly • Review of Related Work • The Medvedev-Brudno Method • BidirectedOverlap Graph • Adjustments to the Standard Min-cost Biflow Problem • Maximizing the Global Read-Count Likelihood • Efficiently Solving a Min-cost Biflow • Flow to Contigs • Conflict node resolution • Results • Discussion
Adjustments to the Standard Min-cost Biflow Problem Standard Min-cost Biflow Problem • Set upper and lower flow bounds on each edge • Flow function f : E→ N must obey the constraint for each edge e • For each vertex, the incoming flow is balanced with the outgoing flow • Objective: Find the flow that minimizes
Adjustments to the Standard Min-cost Biflow Problem Medvedev-Brudno Min-cost Biflow Problem • Upper and lower flow bounds on vertices as well • Accomplished by splitting every vertex v into two: • v+ and v-
Adjustments to the Standard Min-cost BiflowProblem • v- serves as the “incoming” vertex, and inherits v’ incoming edges • v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges • Finally add one edge between v- and v+ and assign it the upper and lower flow bounds for v
Adjustments to the Standard Min-cost BiflowProblem • Second variation: represent the cost ce as a convex function • A function is convex if every point on or above it forms a convex set • A convex set refers to an area where, for every pair of points within that area, every point on the straight line segment connecting those points also lies within that area