550 likes | 570 Views
Learn about sequence assembly methods, including de Bruijn graphs, overlays, and practical issues like non-uniqueness and incomplete coverage. Explore approaches like maximum overlap and solutions for short reads in bioinformatics.
E N D
Lecture 5. Sequence Assembly The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics
Lecture outline • The sequence assembly problem • Several general approaches • Related graph problems • Hamiltonian path • Eulerian path • Sequence assembly by using de Bruijn graphs CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Part 1 The Sequence Assembly Problem
Reference sequence • In the last lecture, we studied the problem of short read alignment • Assumptions behind the short read alignment problem: • There is a reference genome • The reference is similar to, but not exactly the same as, the DNA sequence from which the short reads were generated • Sometimes, no good references are available CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Lack of a reference sequence • Some situations in which a good reference sequence is not available: • Sequencing a new type of bacteria • Sequencing a genomic region previously poorly annotated • In these situations, we need to reconstruct the sequence by assembling the short reads. • This process is called sequence assembly, or “de novo assembly”. CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Sequence assembly • Example revisited: Suppose we have got the following short reads from multiple copies of an unknown sequence s: • ACA, ATA, ATA, ATT, TAG, TAT, TTC • How to get back sequence s from these short reads? • One possible formulation: Shortest superstring • Find the shortest string s’ such that every observed read is a substring of s’ • There is no guarantee that s’ must be equal to the actual sequence s • Very difficult problem (NP-hard) – No known polynomial time algorithms exist CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Maximum overlap • One heuristic method to solving shortest superstring: Greedy merge of the two strings with the maximum overlap • For example, ATA may be followed by TAG, since the last two characters of ATA are the same as the first two characters of TAG • Similar to playing a jigsaw puzzle Image credit: slideteam.net CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Maximum overlap • Detailed steps: • First remove any input short read that is a sub-sequence of another. For example, • If both ACAT and ACA are in the input, remove ACA. • If there are multiple copies of ACA in the input, remove all but one of them. • For the remaining list of short reads, find the ordered pair (x, y) with the maximum overlap between x’s suffix and y’s prefix. If y contains l characters and the overlap involves k characters, remove y and replace x with the merged sequence xy[k+1..l]. • Tie-breaking rule for our discussion: If there are multiple of them, merge the first pair according to lexicographic order. • (1, 2) < (1, 3) < (2, 1) < (2, 3) < (3, 1) < (3, 2) • Repeat steps 1 and 2 until there is only one sequence left. CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Maximum overlap example • Input short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTA • (Removing reads that are substrings of others)ACA, ATA, ATT, TAG, TAT, TTA • ACA, ATA, ATT, TAG, TAT, TTA • ACA, ATAG, ATT, TAT, TTA • ACA, ATAG, ATT, TAT, TTA • ACA, ATAG, ATTA, TAT • ACA, ATAG, ATTA, TAT • ACA, ATAG, ATTAT • ACA, ATAG, ATTAT • ACA, ATTATAG • ACA, ATTATAG • ACATTATAG • Results: • Reconstructed string different from actual one (TATACATTAG) • Also one character shorter than the actual one (9 vs. 10) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
How good is maximum overlaps? • How long is the reconstructed sequence? • It has been proved that if the shortest superstring has length m, then the reconstructed string has length 4m. (Note that m is different from n, the length of the actual sequence s.) • Is there an example with a reconstructed string with length 2m? • It is possible to have m < n, i.e., the shortest superstring is not the actual sequence s • Is there an example with a reconstructed string with length << n? CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Practical issues • There are many practical issues in sequence assembly: • Non-uniqueness due to short read length • Non-uniqueness due to heterogeneity • Incomplete coverage • Sequencing errors • Ambiguity due to repeats ... CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Practical difficulties and solutions • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC • Problems: • Non-uniqueness due to short read length: TAT may also be followed by ATT (instead of ATA as in TATACATTAG) • Solution: • Use longer reads and/or paired reads, and assemble reads only when they have substantial overlaps • Limited by current technology: <250nt per NGS read CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Paired end sequencing • Sequence both ends of a fragment • The two reads are called a mate pair • Insert sizes form a distribution due to random fragmentation and manual size selection • One read is likely within a certain distance range from the other • If the location of one read is ambiguous, may use the location information of the other to help • More difficult in practice due to imprecise insert size • The idea of paired end sequencing is also useful in short read alignment Fragment length Read length Insert size Sequencing Read1 Read2 TAT ATA ATT ...CAT ...ATT ...GGG (Suppose s=TATACATTAGGG here) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Practical difficulties and solutions • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC • Problems: • Non-uniqueness due to heterogeneity: It is possible that the DNA sample contains two versions of the sequence, one with TATACATTAG and one with TATACATTCG • Solution: • Use an assembly method that can consider multiple possible ways of assembling the reads CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Practical difficulties and solutions • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC • Problems: • Incomplete coverage: We do not know what is after ACA, since no reads start with CA • Solution: • Produce more reads • The ratio between the total length of useful reads and the length of s is called the average read depth. For sequence alignment, it is now common to perform 30x-60x coverage. For sequence assembly, usually we need >100x. • Sometimes we do not know the actual length of s and need to estimate it • Costs more starting materials, reagents, money and computation CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Practical difficulties and solutions • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC • Problems: • Sequencing errors: ATT seems to be followed by TTC, but actually s does not contain TTC as a subsequence (recall that s=TATACATTAG), due to a sequencing error (TTC should actually be TTA) • Solutions: • Use longer reads/paired reads so that if a read contains an error, it has low probability of being assembled • Use some statistics to estimate error probability CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Practical difficulties and solutions • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC • Problems: • Ambiguity due to repeats: Should ATA be put before or after TAT? • This problem is due to the occurrence of two TA’s in s=TATACATTAG • Solution: • Use longer reads/paired reads • Still cannot handle the following cases: • Each unit of a repeat is very long • There are many copies of a repeating unit CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Contigs and scaffolds • Key terms: • Contig: A partially assembled sequence from some reads • Scaffold: An arrangement of the contigs with specified order and orientations • See this video for an illustration: http://www.youtube.com/watch?v=vg7Y5EeZsjk CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Quality of an assembly • Usually the final output does not contain a single sequence, but just some contigs/scaffolds • Some descriptive statistics of assembly outputs: • Length of longest contig • Average length of contigs • Total length of contigs • N50: Length of the contig such that shorter contigs amount to 50% or more of the total length of all contigs • If the lengths are (in an arbitrary unit) 10, 8, 6, 5, 3, 3, 2, 1, 1, 1, then the N50 value would be 6, since (10+8+6) = 24, which is larger than 50% of the sum (10+8+6+5+3+3+2+1+1+1) = 40 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
General approaches • Some other proposed approaches to sequence assembly: • Greedy extension: Extend the current contig until there are no more or multiple extensions • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC • ATT ATTC • Overlap/layout/consensus: Perform all-against-all read alignments to find overlaps, deduce approximate layout, and refine layout by multiple sequence alignment • The all-against-all part requires tremendous computational power • de Bruijn graph (More details later) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Part 2 Related Graph problems
Graph formulations • We can use a graph to provide an abstraction of the short read assembly scenario • Formulation 1: • Each node is a short read • There is an edge from node X to node Y if a suffix of X substantially overlaps with a prefix of node Y • E.g., overlapping at least |X|/2 characters • Goal : Find a path that visits every node exactly once • Because you want every short read to appear exactly once in the resulting sequence • This is a Hamiltonian path problem CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Hamiltonian path formulation • Short reads (new example):ACA, ATA, ATT, CAT, TAC, TAG , TAT, TTA • Suppose a node is connected to another one if the length-2 suffix of the former equals the length-2 prefix of the latter: • Unfortunately, even the decision version (whether such a path exists) is very difficult (NP-complete) ACA ATA TTA ATT TAT CAT TAG TAC CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Graph formulations • Formulation 2: • Each edge is a short read • The two nodes that connect an edge are derived from the prefix and suffix of the corresponding read • Different reads can share nodes • Goal: Find a path that visits every edge exactly once • The Eulerian path problem CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Eulerian path formulation • Short reads:ACA, ATA, ATT, CAT, TAC, TAG , TAT, TTA • Suppose each read is decomposed into two nodes, one containing the length-2 prefix of it and the other the length-2 suffix of it. There is an edge pointing from the former to the latter. • Interestingly, it is much easier to find Eulerian paths AC AT ATT ATA ACA TT AG TTA TAC TAT CAT TAG TA CA CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
The Eulerian path problem AC AT • Existence of an Eulerian path: • The in-degree of a node is the number of edges going into it • The out-degree of a node is the number of edges going out of it • If a connected directed graph has an Eulerian path, the followings are true (why?): • At most one node has (out-degree – in-degree) = 1 • At most one node has (out-degree – in-degree) = -1 • All other nodes have (out-degree – in-degree) = 0 • Surprisingly, if a connected graph satisfies these conditions, it must have an Eulerian path, i.e., the three form a set of both necessary and sufficient conditions ATT ATA ACA TT AG TTA TAC TAT CAT TAG TA CA CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
The Eulerian path problem AC AT • Finding an Eulerian path (Hierholzer’s algorithm): • Start with the node with an extra out-degree (or if not exist, any node) • Follow any unused edges to visit other nodes, until getting stuck • Either back to the starting node and it has no more unused edge, or the node with an extra in-degree (why are there no other possibilities?) • If any visited node has unused edges, repeat the above with this node as the starting node. The path must end at the same node. Join this new path with the old one. ATT ATA ACA TT AG TTA TAC TAT CAT TAG TA CA CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Example AC AT AC AT 2 TT AG TT AG 3 1 TA CA TA CA 4 AC AT AC AT 2 2 5 ii 7 iv TT AG TT AG 3 3 4 i 1 1 6 iii Final answer: TATTACATAG TA CA TA CA 8 4 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Using stacks to find the paths AC AT TT AG Current path stack Completed path stack 2 TA CA AG TA 3 TT 1 AT 4 TA CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Using stacks to find the paths AC AT TT AG Current path stack Completed path stack 2 TA CA TA 3 TT 1 AT 4 TA AG CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Using stacks to find the paths Current path stack Completed path stack ii iv TA AT CA AC TA TT i iii AC AT 2 AT TA AG TT AG 3 1 TA CA 4 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Using stacks to find the paths Current path stack Completed path stack TA ii iv AT TT TA AC CA AT i iii AC AT 2 TA AG TT AG Final answer: TATTACATAG 3 1 TA CA 4 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Quick quiz • What is the graph problem when each node corresponds to a read? • Hamiltonian path problem • What is the graph problem when each edge corresponds to a read? • Eulerian path problem • Which of these two problems is easier in terms of computational complexity? • Eulerian path problem (linear) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Comparing the two formulations Actual DNA sequence (which should be unknown): TATACATTAG • The Hierholzer’s algorithm runs in linear time • Does it mean we have proved P=NP? • A general Hamiltonian path problem is NP hard, but due to the equivalence to the Eulerian path formulation, Hamiltonian path problems *for short reads* can be efficiently solved Hamiltonian path formulation: Eulerian path formulation: AC AT ATT ATA ACA ACA ATA TT AG TTA ATT TTA TAC TAT CAT TAG TA CA TAT CAT TAG TAC CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Why the big difference? • Why is the Eulerian path problem much easier than the general Hamiltonian path problem? • Strict necessary and sufficient conditions for an Eulerian path to exist • The conditions are simple: They can be checked very efficiently • For the Eulerian path problem, the solution of a sub-problem (involving only a subset of the edges) can always contribute towards the solution of the original problem • Another example of reusing results CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Back to sequence assembly • While the Eulerian path formulation is elegant, we need to deal with the many issues when applying it to perform sequence assembly • Non-uniqueness • Errors in data • Repeats • Heterogeneity • ... • We now study some practical methods for solving these issues CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Part 3 de Bruijn Graphs
Problems of the graph formulations • So far we have assumed that each node (in the Hamiltonian path formulation) or each edge (in the Eulerian path formulation) is a short read. • It is problematic if “short” reads are not really that short (but 100-250nt) when determining which reads should be connected in the graph: • If the required overlap is too large, a single error/mutation would make two consecutive reads not connected • TATACATTA, ATAGATTAG • If the required overlap is too small, there will be too many connections and it is easy to have non-unique solutions • There can be a tremendous number of nodes/edges, making the graph not fitting into memory CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
de Bruijn graph • In the original definition proposed by Dutch mathematician Nicolaas de Bruijn in 1946: • A de Bruijn graph is a graph that contains every k-mer of a certain alphabet as a node, and there is a directed edge from a node to another one if the length-(k-1) suffix of the former is the same as the length-(k-1) prefix of the latter. • Example: k=3, alphabet={0, 1} Image source: Wikipedia CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
de Bruijn graph for sequence assembly • To use de Bruijn graph for sequence assembly: • We consider only k-mers that are subsequences of some observed reads • In practice, only a very small fraction of the 4k possible k-mers appear in the reads • Two nodes are connected only if there are reads that contain both at consecutive positions • So, there will not be an edge from ATA to TAG if ATAG is not observed in any read • The number of reads that support each edge is also recorded CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
de Bruijn graph for sequence assembly • Example: • Sequencing reads: a:ACGGC, b:CGGCG, c:CGTGA, d:GACGT, e:GCGTG, f:GGCGT, g:GTGAC and h:TGACG • To construct a de Bruijn graph when each node corresponds to a 3-mer: 1 ACG CGT 2 1 CGG GTG 2 2 GGC TGA 2 2 2 2 GCG GAC CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Patterns of various issues • The various issues we discussed before will form special patterns in a de Bruijn graph: • Non-uniqueness: frayed rope • s=TACCGGACCGC • Observed reads (not necessarily covering all k-mers of s): TACC, ACCG, CCGG, GACC, CCGC • Incomplete coverage: possibly disconnected graph • s=TATACATTAG • Observed reads: TATA, ATAC, ACAT, CATT, ATTA TAC CGG ACC CCG GAC CGC TAT ATA TAC ACA CAT ATT TTA Image credit: cuttingedgedjs.com CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Patterns of various issues • Tandem repeats: cycle • s=TACCGACCGC • Observed reads: ACCG, CCGA, CGAC, GACC, CCGC • Errors at read ends: spur • s=TATACATTAG • Observed reads: TATA, TATT, ATAC, TACA, ACAT CGC ACC CCG CGA GAC TAT ATA TAC ACA CAT ATT Image credit: Wikimedia CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Patterns of various issues • Heterogeneity/errors at read centers: bubble • s=TATACATTAG; s’=TATAGATTAG • Observed reads: TATA, ATAC, TACA, ACAT, CATT, ATAG, TAGA, AGAT, GATT TAC ACA CAT TAT ATA ATT TAG AGA GAT CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Resolving problems • Some methods for resolving the problems (high-level ideas only): • Handling potential errors: • Pre-filtering k-mers supported by few reads • Bimodal distribution of k-mer frequencies: one peak corresponds to legitimate k-mers, the other (much lower) peak due to errors • May also use base quality scores in filtering • Remove paths supported by few reads • Combine near-identical paths CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Resolving problems • Some methods for resolving the problems (high-level ideas only): • Non-uniqueness, heterogeneity: • Duplicate the shared part in frayed ropes and bubbles into separate paths, if supported by read counts • Repeats: • Use read counts to deduce number of copies • Usually not very accurate due to random fluctuations in read counts CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Strand issue • In actual sequencing, often we get sequencing reads from both strands • Example: • +5’CATACATTAG 3’-3’GTATGTAATC 5’ • Suppose each read is 6nt long, we can get the following reads: • From the +ve strand: CATACA, ATACAT, TACATT, ACATTA, CATTAG • From the -ve strand: CTAATG, TAATGT, AATGTA, ATGTAT, TGTATG • The corresponding de Bruijn graph is also more complicated CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Scaffolding • Main types of useful information: • Paired-end reads • Long reads (butmore noisy) • Reference alignment Image credit: Green, Nature Reviews Genetics 2(8):573-583, (2001) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Epilogue Case Study, Summary and Further Readings
Case study: “Synthetic cell” • Topic of this lecture so far: • We have multiple copies of an unknown (biological) DNA sequence • We cut them down into small fragments • We sequence each of them to get short reads (text strings) • We assemble the short reads to get back the original (text) sequence • Is it possible to do the opposite? • We have a long text string s • We cut it down into small strings • We biochemically synthesize each of them • We assemble them to get a DNA molecule with sequence s CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019