Paul Medvedev Michael Brudno

Maximum Likelihood Genome Assembly Paul Medvedev Michael Brudno Bioinformatics Algorithms Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman Khalid Mahmood Department of Computer Science and Engineering BUET *Some of the slides are taken from other sources

Computational Genomics • Our genome encodes an enormous amount of information about our beings • our looks • our size • how our bodies work • …. • our health • our behaviors • … who we are! gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccg…….

Contributions of the paper • Two-fold, first one being : • First exact polynomial time algorithm for the shortestdouble-stranded genome, given its k-molecule spectrum • A problem that was solved for strings, but remained open for molecules

Contributions of the paper • Second one : • Oppose the idea of shortest genome • Because It overcollapses • Instead propose a new objective : • A maximum likelihood framework for assembling the genome that is most likely the source of the reads.

Contributions of the paper • Maximum likelihood framework • Assumes perfect reads • Uniform distribution • Advantage of high coverage (NGS) • Estimate copy counts of repeats • Combine with matepair data • Read => Contigs

Outline • Whole Genome Shotgun Assembly • Review of Related Work • The Medvedev-Brudno Method • BidirectedOverlap Graph • Adjustments to the Standard Min-cost Biflow Problem • Maximizing the Global Read-Count Likelihood • Efficiently Solving a Min-cost Biflow • Flow to Contigs • Conflict node resolution • Results • Discussion

Whole Genome Shotgun Sequencing DNA SEQUENCER Sanger vs. NGS reads ASSEMBLER C++ • Problems in Assembly • Sequencing Errors • Unknown Orientation • Incomplete Coverage • Repeats contigs FINISHING sequence

Whole Genome Shotgun Sequencing • Break genome into shotgun-sized fragments and sequence • Match the overlapping regions of contiguous sequences • Demonstrated by Celera Genomics to be feasible for whole genome assembly • Sequenced human genome at 1/10’th the cost of the public Human Genome Project

Whole Genome Assembly Next Generation Sequencing (NGS) ?? • Improved speed and cost-effectiveness relative to the other methods… • … but much shorter read length (25-200 bp) • Only proven on re sequencing projects, i.e. a reference genome is already available • Posses significant challenges to the problem of de novo genome assembly – determination of a completely unknown genome.

Assemblers • Previous (Sanger) Assemblers • NGS Assemblers • SSAKE (Jeck et al., 2007) • VCAKE (Warren et al. 2007) • SHARCGS (Dohm et al. 2007) • Shorty (Chen and Skiena 2007) • ALLPATHS (Butler et al. 2008) • Edena (Hernandez et al. 2008) • Euler-(U)SR (Chaisson and Pevzner 2008, 2009) • Velvet (Zerbino and Birney, 2008)

Theoretical view • Input: set of strings over {A,C,G,T} called reads • Output: A common superstring of the reads. • {TACAT,CATAC, ACGTAC} TACATACGTAC • Initially: Shortest Common Superstring (SCS) • NP-hard [Gallant et al 1980] • Over-collapsing of repeats • Can be found using a TSP solver • de Bruijn graphs [Pevzner, Tang, Waterman 01] • string graphs [Myers 05] • Both formulations are NP-hard.

String graph (Myers) • Represent reads as vertices, and read overlaps as edges • Remove redundant edges • Establish edge constraints • Unique? (flow is exactly one) • Required? (min. flow is 1) • Optional? (min. flow is 0) • Find shortest walk

EULER assembler (Pevzner, Tang and Waterman) • Represent reads as edges and overlaps as vertices in a de Bruijn graph • Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once • Repeats dealt with by using multiple edges for a single repeat read

Overlap Graph • Nodes are reads • Edges are overlaps • Weights are lengths of prefix • TSP Tour is SCS • Example: • {TACAT,CATAC, ACGTAC} TACATACGTAC ACGTAC 5 3 3 CATAC 5 2 2 TACAT

Maximum Likehood Genome Assembly {Medvedev, Brundo} Why Shortest CS? • DNA is full of repeats: identical and nearly identical copies that appear multiple times • Alu repeat is 300beses long, present 1,000,000 times in the human genome • SCS approach “over-collapses” the repeats: they are only present once in the answer • Solution: Model repeats explicitly through either de Brujin graph or String graps • Maybe this will also become tractable?

De Bruijn Graphs {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Nodes are (k-1)-mers • Edges are k-mers • The set of k-mers is called a k-spectrum • Finding shortest string with given k-spectrum equivalent to Chinese Postman TT TC AT CA GC AG Pevzner 1989

De Bruijn Graphs with Walks {AGC, ATTCA, CATT, GCAG, ATG} • Nodes are (k-1)-mers • Edges are k-mers • Reads are walks • Finding superwalk (one that includes all walks) • Not a polynomial time problem • De BruijnSuperwalk is NP-hard TT TC AT CA GC AG Pevzner et al 2001

Chinese Postman Tours • Solving Chinese Postman: An Eulerian tour is a solution • Euleriazation: make a graph Eulerian • Can be done with min cost flow: • Unbalanced nodes are sources/sinks • Duplicate all edges used in flow {AGC, ATTCA, CATT, GCAG, ATG} TT TC AT CA GC AG Pevzner 1989

{GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ GTTGGCAAT GG AC AT TG TT GC CA GT CC CC CA GG AC AT GT GC TG TT AA DNA is not a String {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’ AA • The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989]

Complexity of CPT

Modeling Double Strandedness Kececioglu 91, Kececioglu-Meyers 95

5’ 3’ GTTGGCAAT Modeling Double-Strandedness • How can two DNA molecules overlap? A A C +CTT +AAC -AAG -GTT ATTGCCAAC C T T 5’ 3’ A A C +TCG +AAC -CGA -GTT T C G T G G +TGG +AAC -CCA -GTT A A C Kececioglu 1992

+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Walks in bidirected graphs • A walk has to “match” directions at each node. • Suppose the node +AA/TT-. • Edge orientations correspond to strands • A path can use a node in both orientations

+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Rules for Matching Directions • When we walk through it, we can • Come in using in arrow, then leave using out arrow • This is forward, so read the “+” strand. i.e. AA here • Come in using out arrow, then leave using in arrow • This is backward, so Read the “-” strand, i.e TT here.

+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected Graphs • So what this walk corresponds to ? • GGCAAT • ATTGCC

The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum {GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ +AC +AA GTTGGCAAT +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC +AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected de Bruijn Graphs {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’

Representing Bidirected graphs

Motivation: Overlap Graphs • Several downsides of the de Bruijn approach • Division into k-mers arbitrary • Very sensitive to sequencing errors • Not memory efficient (one node per k-mer) • Goal • One node per read (or better) • No division into k-mers • Flexibility in the presence of sequencing errors Myers 2005

ACGTAC CATAC TACAT How To Build A Overlap Graph (1) TACATACGTAC {ACGTAC, CATAC, TACAT} • Nodes are reads • Edges are overlaps • Weights are lengths of non-overlapping prefix • Transitively inferable overlaps 3 5 3 2 2

Bidirected Overlap Graph • In this work, authors have used Bidirected overlap graph. • In a bidirected overlap graph, each vertex is a double-stranded read • Edges represent read overlaps

Bidirected Overlap Graph • Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges) • Suppose we have two reads r1 and r2 • Each read can be oriented to the left or to the right • The three possible overlaps are: • i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way) • ii) r1 points left and r2 points right • iii) r1 points right and r2 points left

Bidirected Overlap Graph • The overlap graph is constructed by placing an edge between two reads if they overlap by a minimum number of characters omin • Question: How is omin determined? • Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps

Observation • A bidirected graph contains an Eulerian circuit if and only if it is connected and balanced.

Chinese postman Problem on Bidirected Graphs

Chinese postman Problem on Bidirected Graphs • Let G be a weighted bidirected graph. There exists a circuit of weight i if and only if there exists an Eulerian extension of weight i. • G has a circuit if and only if it is strongly connected. • The minimum weight Eulerian extension of G has at most 2|E||V| edges.

Chinese postman Problem on Bidirected Graphs • The running time of Algorithm 1 is O(|E|2log(|V|)log(E)). • Gabow’s algorithm runs in O(|E|2log(|V|)log(max(u(e))) • u is the flow upper bound function • f(e) <= 2 |E| |V| for every edge e, • So, we can safely let u(e) = 2 |E| |V|

Chinese postman Problem on Bidirected Graphs • Hence the theorem is proved : • Given a set of k-molecules S, we can find the shortest (k-1)-circular DNA molecule whose k-molecule spectrum is S in time O(|S|2log2(|S|)). • This is a polynomial time algorithm, explicitly handling the double strandedness • The first main result of this paper.

Sequence assembly using NGS Sequence assembly using NGS • Several methods available now (e.g. SSAKE, VCAKE, SHARCGS, etc.) • All of these assume that the length of the assembled genome must be minimized • Results in over-collapsing of repeats • Given ubiquity of repeats in eukaryotic genomes, authors considered this a poor assumption

Goal of an Assembler • What should the goal of an assembler be ?? • Shortest string ?? • Problem of over-collapse

Maximum Likelihood Genome Assembly • Change goal of sequence assembly • Maximize the likelihood that the resultant genome was the source of the given reads • Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats • Maximizing the likelihood of observed read frequencies can be cast as mininum cost bidirected flow (biflow) problem • Allows solution to be obtained with an off-the-shelf network flow solver • Authors claim 99.99% accuracy

Maximum Likelihood Genome Assembly • Second important aspect is the use of matepair information for joining contigs • Other systems look for all paths between mated reads • The proposed Method looks only for short paths between some pairs of reads • Question: How to decide the upper bound for these “short paths”? And how to decide which pairs of reads to examine?

Adjustments to the Standard Min-cost Biflow Problem Standard Min-cost Biflow Problem • Set upper and lower flow bounds on each edge • Flow function f : E→ N must obey the constraint for each edge e • For each vertex, the incoming flow is balanced with the outgoing flow • Objective: Find the flow that minimizes

Adjustments to the Standard Min-cost Biflow Problem Medvedev-Brudno Min-cost Biflow Problem • Upper and lower flow bounds on vertices as well • Accomplished by splitting every vertex v into two: • v+ and v-

Adjustments to the Standard Min-cost BiflowProblem • v- serves as the “incoming” vertex, and inherits v’ incoming edges • v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges • Finally add one edge between v- and v+ and assign it the upper and lower flow bounds for v

Adjustments to the Standard Min-cost BiflowProblem • Second variation: represent the cost ce as a convex function • A function is convex if every point on or above it forms a convex set • A convex set refers to an area where, for every pair of points within that area, every point on the straight line segment connecting those points also lies within that area

Convex Function

Paul Medvedev Michael Brudno