Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Fragment assembly of DNA • Biological background • Models • Algorithms • Heuristics ® Pei-Jie Wu
Biological background • Problem as puzzle • We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position of opposite strands form a complementary pair. • Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.
Biological background • Target: The long sequence to reconstruct. • Fragment vs. Subsequence • Shotgun method: Based on fragment overlap • Fragment assembly: A collection of fragments to put together
Biological background--The ideal case • Case: p.106 • Align the input set, ignoring spaces at the extremities • Overlaps: the end part of a fragment is similar to the beginning of another • Consensus sequence based on majority vote
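The majority-vote consensus can be sketched as follows. This is a minimal illustration, not the procedure from the referenced text: it assumes the layout is already given as equal-offset rows padded with spaces at the extremities, and the function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(layout):
    """Majority vote per column; padding spaces at the extremities are ignored."""
    width = max(len(row) for row in layout)
    result = []
    for i in range(width):
        # Collect the characters of every row that actually covers column i.
        column = [row[i] for row in layout if i < len(row) and row[i] != " "]
        if column:
            result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)

# Four error-free fragments laid out by their overlaps (illustrative data).
layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))
```

With error-free fragments every column is unanimous, so the vote only matters once base call errors appear.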
Biological background--Complications • The main factors that add to the complexity of the problem are: • Errors • Unknown orientation • Repeated regions • Lack of coverage
Biological background--Complications Errors • Dealing with errors usually means using algorithms that require more time and space. • The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments. • Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters. • Figures 4.2, 4.3, 4.4
Biological background--Complications Errors • Two other types of errors: chimeras and contamination • Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target • Figure 4.5 • Solution: Chimeras must be recognized as such and removed from the fragment set in a preprocessing stage. • Contamination comes from host or vector DNA • Solution: Most vectors are well known, so we can screen the data before starting assembly.
Biological background--Complications Unknown orientation • We generally do not know to which strand a particular fragment belongs. • We treat the input fragments as being all approximate substrings of the consensus sought, either as given or in reverse complement. • Figure 4.6 • Complexity: 2^n possible combinations of orientations for n fragments
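A short sketch of what unknown orientation means in practice. Each fragment may be used as given or as its reverse complement, so n fragments admit 2^n orientation choices. The helper names `reverse_complement` and `orientations` are illustrative assumptions.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(fragment):
    """Read the opposite strand: reverse the fragment and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(fragment))

def orientations(fragments):
    """Enumerate every way of taking each fragment as given or reverse complemented."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)

print(reverse_complement("ACCGT"))
print(len(list(orientations(["AC", "GA", "TT"]))))  # 2**3 combinations
```

Exhaustive enumeration is only feasible for very small n, which is why practical assemblers decide orientation while computing overlaps.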
Biological background--Complications Repeated regions • Repeats are sequences that appear two or more times in the target molecule. • Short repeats • Longer repeats • If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors • Figure 4.7
Biological background--Complications Repeated regions • Problems: • If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy. • Repeats can be positioned in such a way as to render assembly inherently ambiguous. (Figures 4.8 and 4.9) • Direct repeats: repeated copies in the same strand. • Inverted repeats: repeated regions in opposite strands (Figure 4.10)
Biological background--Complications Lack of coverage • Coverage at position i of the target: the number of fragments that cover this position. • Contigs: The contiguously covered regions • Figure 4.11 • Solutions: • Sampling more fragments • Directed sequencing or walking
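Coverage and contigs can be illustrated with a small sketch. It assumes, for illustration only, that the position of each fragment on the target is known and given as a half-open interval `(start, end)`. In a real project these positions are exactly what assembly has to infer.

```python
def coverage(target_length, placements):
    """Number of fragments covering each position of the target."""
    cov = [0] * target_length
    for start, end in placements:          # half-open interval [start, end)
        for i in range(start, end):
            cov[i] += 1
    return cov

def contigs(cov):
    """Maximal runs of positions with nonzero coverage, as half-open intervals."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

cov = coverage(10, [(0, 4), (2, 6), (8, 10)])
print(cov)
print(contigs(cov))
```

Positions 6 and 7 have zero coverage here, so the target splits into two contigs separated by a gap.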
Biological background--Alternative methods for DNA sequencing • Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project. • Problems: • It is expensive to build special primers • Sequential rather than parallel • Sequencing by hybridization (SBH): assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.
Models • Shortest common superstring (SCS) • RECONSTRUCTION • MULTICONTIG • All three assume that the fragment collection is free of contamination and chimeras.
Models--Shortest common superstring • Seeking the shortest superstring of a collection of given strings • PROBLEM: Shortest common superstring (SCS) • INPUT: a collection F of strings. • OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
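To make the SCS definition concrete, here is a brute-force sketch that tries every ordering of the fragments. It is exponential and only meant for tiny inputs. The function names are hypothetical, and fragments contained in other fragments are dropped first, since any superstring of the rest already contains them.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def shortest_superstring(fragments):
    """Exhaustive search over fragment orders (tiny inputs only)."""
    unique = set(fragments)
    # Keep only fragments that are not substrings of another fragment.
    kept = sorted(f for f in unique if not any(f != g and f in g for g in unique))
    if not kept:
        return ""
    best = None
    for order in permutations(kept):
        s = order[0]
        for prev, f in zip(order, order[1:]):
            s += f[overlap(prev, f):]      # glue f on, sharing the overlap
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_superstring(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))
```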
Models--Shortest common superstring • Example 4.1 • Example 4.2 • Figure 4.12 • Figure 4.13 • A shortest superstring may contain only one copy of a repeat, which will absorb all fragments totally contained in any of the copies
Models--Reconstruction • Takes into account both errors and unknown orientation • Dynamic programming sequence comparison algorithm • Use distance rather than similarity • Expression: p.116
Models--Reconstruction • PROBLEM: RECONSTRUCTION • INPUT: a collection F of strings and an error tolerance ε between 0 and 1. • OUTPUT: (p.117) • Find a string S as short as possible such that, for every f in F, either f or its reverse complement is an approximate substring of S at error level ε • Does not model repeats, lack of coverage, and size of target
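The "approximate substring" test can be sketched with a dynamic programming comparison in which the fragment may start and end anywhere in S at no cost. Assumptions for illustration: unit-cost edit distance, and a fragment f accepted when its best distance (as given or reverse complemented) is at most ε·|f|. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(f):
    return "".join(COMPLEMENT[b] for b in reversed(f))

def substring_distance(f, s):
    """Minimum unit-cost edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)              # f may start anywhere in s for free
    for i in range(1, len(f) + 1):
        curr = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,   # match or substitution
                          prev[j] + 1,          # base of f left unmatched
                          curr[j - 1] + 1)      # base of s skipped inside the match
        prev = curr
    return min(prev)                       # f may end anywhere in s for free

def is_approximate_substring(f, s, epsilon):
    best = min(substring_distance(f, s),
               substring_distance(reverse_complement(f), s))
    return best <= epsilon * len(f)

print(substring_distance("ACGGT", "TTACCGTGC"))
print(is_approximate_substring("ACGGT", "TTACCGTGC", 0.1))
```

In the second call the fragment differs from the target by one substitution as given, but its reverse complement ACCGT occurs exactly, so it is accepted even at a low tolerance.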
Models--Multicontig • Involve internal linkage of the fragments in the layout • Nonlink: there is a fragment that properly contains the overlap on both sides • Weakest link: the smallest size of any link • t-contig: the weakest link of a layout is at least as large as t • Example 4.4 • Definition: p.119
Algorithms • Greedy algorithm • Acyclic subgraphs (no errors and known orientation)
Algorithms--Representing overlaps • The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph • The set V of nodes of this structure is just F itself. • A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b • There may be many edges from a to b • No self-loops
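The edge definition above translates almost directly into code. The sketch below lists every edge as a triple `(a, b, t)`. It assumes the fragments are distinct strings, so a fragment can serve as its own node label, and the name `overlap_multigraph` is illustrative.

```python
def overlap_multigraph(fragments):
    """All edges (a, b, t): the suffix of a with t characters is a prefix of b."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue                   # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

for edge in overlap_multigraph(["ACCGT", "CGTGC"]):
    print(edge)
```

Because the empty suffix always matches the empty prefix, every ordered pair of fragments is joined by at least a zero-weight edge, and several edges between the same pair are possible.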
Algorithms--Paths originating superstrings • An edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e match the first t bases of its head g • Figure 4.15 • Example in p.121 • Equation 4.3 • Hamiltonian paths: A path that goes through every vertex • Equation 4.4 • Minimizing |S(P)| is equivalent to maximizing w(P)
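A small sketch of how a path yields a superstring, and why shorter superstrings correspond to heavier paths. For a path through all fragments, each edge of weight t saves t characters, so the length of S(P) equals the total fragment length minus w(P). The representation of a path as a fragment list plus a weight list is an assumption made for illustration.

```python
def path_superstring(fragments, weights):
    """Build S(P) for the path fragments[0] -> fragments[1] -> ... with given edge weights."""
    s = fragments[0]
    for prev, nxt, t in zip(fragments, fragments[1:], weights):
        # The edge is valid only if the last t bases of prev equal the first t of nxt.
        assert prev[len(prev) - t:] == nxt[:t], "not an edge of the overlap multigraph"
        s += nxt[t:]
    return s

path = ["TTAC", "TACCGT", "CGTGC"]
weights = [3, 3]
s = path_superstring(path, weights)
print(s, len(s), sum(len(f) for f in path) - sum(weights))
```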
Algorithms--Shortest superstrings as paths • A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b. • THEOREM 4.1 • COROLLARY 4.1 • LEMMA 4.1 • THEOREM 4.2
Algorithms--The greedy algorithm • Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph. • OM(F) → OG(F): the overlap graph OG(F) keeps only the heaviest edge between each ordered pair of fragments • A “greedy” attempt at computing the heaviest path. The basic idea is to continuously add the heaviest available edge
Algorithms--The greedy algorithm • Three conditions we have to test before accepting an edge in our Hamiltonian path: • Edges are processed in nonincreasing order by weight • The procedure ends when we have exactly n-1 edges, or • when the accepted edges induce a connected subgraph. • Figure 4.16 • Example 4.5 • Figure 4.17
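The greedy procedure can be sketched as below. This is one possible reading, not a transcription of Figure 4.16. It assumes a substring-free collection of distinct fragments, takes the heaviest overlap per ordered pair, and accepts an edge only if its tail has no outgoing edge yet, its head has no incoming edge yet, and it closes no cycle. The cycle test uses a union-find structure, which is an implementation choice of this sketch.

```python
def overlap(a, b):
    """Heaviest edge from a to b: the longest suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), -1, -1):
        if a[len(a) - t:] == b[:t]:
            return t

def greedy_superstring(fragments):
    n = len(fragments)
    edges = sorted(((overlap(a, b), i, j)
                    for i, a in enumerate(fragments)
                    for j, b in enumerate(fragments) if i != j),
                   key=lambda e: -e[0])    # nonincreasing order by weight
    parent = list(range(n))

    def find(x):                           # union-find root with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    successor, has_predecessor, accepted = {}, set(), 0
    for weight, i, j in edges:
        if accepted == n - 1:
            break
        if i in successor or j in has_predecessor or find(i) == find(j):
            continue                       # would break the path or close a cycle
        successor[i] = (j, weight)
        has_predecessor.add(j)
        parent[find(i)] = find(j)
        accepted += 1

    # Walk the path from the only fragment without a predecessor.
    i = next(k for k in range(n) if k not in has_predecessor)
    s = fragments[i]
    while i in successor:
        i, weight = successor[i]
        s += fragments[i][weight:]
    return s

print(greedy_superstring(["CGTGC", "TTAC", "TACCGT"]))
```

The greedy choice is a heuristic: it always returns some Hamiltonian path, but not necessarily the heaviest one.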
Algorithms--Acyclic subgraphs • Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA. • “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly. • Figure 4.18
Algorithms--Acyclic subgraphs • The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph. • Cycles in an overlap graph are necessarily due to repeats in S. The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph. • THEOREM 4.5 • Algorithm: Topological sorting • Example 4.6 • Figure 4.19, 4.20 and 4.21
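Topological sorting of an acyclic overlap graph can be sketched with Kahn's algorithm. The sketch assumes the caller has already chosen which overlap edges to keep (for example, only sufficiently long overlaps), and it reports a cycle as an error, which in this setting hints at a repeat. The function name is hypothetical.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm; edges is a list of (u, v) pairs. Raises ValueError on a cycle."""
    indegree = {u: 0 for u in nodes}
    successors = {u: [] for u in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(u for u in nodes if indegree[u] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("overlap graph has a cycle")
    return order

nodes = ["CGTGC", "TTAC", "TACCGT"]
edges = [("TTAC", "TACCGT"), ("TACCGT", "CGTGC")]
print(topological_order(nodes, edges))
```

When consecutive fragments in the resulting order are joined by edges, the order is a Hamiltonian path and can be turned into a superstring by gluing the fragments along their overlaps.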
Heuristics • None of the formalisms proposed for fragment assembly are entirely adequate • Fragment assembly can be viewed as a multiple alignment problem with some additional features: • Each fragment can participate with either the direct or the reverse-complemented sequence. • The sequences themselves are usually much shorter than the alignment itself.
Heuristics • Three criteria according to the second feature: • Scoring • Entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal • The lower the entropy, the better • Coverage • A fragment covers a column i if it participates in this column either with a character or with an internal space. • Linkage • The way individual fragments are linked in the layout is another determinant of layout quality. • Figure 4.22
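The entropy criterion can be illustrated on a single layout column. This sketch assumes the usual Shannon entropy with logarithms in base 2, computed over the relative frequencies of the characters in the column. The function name `column_entropy` is hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (base 2) of the character frequencies in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(column_entropy("AAAA"))   # one frequency stands out completely
print(column_entropy("AAAC"))   # mostly agreeing column
print(column_entropy("ACGT"))   # all frequencies equal
```

A unanimous column scores zero, and the score grows as the characters in the column disagree more evenly, which matches the "lower is better" rule.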
Heuristics--Assembly in practice • Practical implementations often divide the whole problem into three phases: • Finding overlaps • Building a layout • Computing the consensus
Heuristics--Assembly in practice Finding overlaps • The first step in any assembly problem is fragment overlap detection. • Determine reverse complement • Consider fragments entirely contained in other fragments • Recall Section 3.2.3 • Figure 4.23
Heuristics--Assembly in practice Ordering fragments • Finding a good ordering of fragments in a contig • No algorithm is simple and general enough • There are four issues to keep in mind when building paths: • Every path has a corresponding complement path • It is not necessary to include contained fragments • Cycles usually indicate the presence of repeats • Unbalanced coverage may be related to repeats as well (see Figure 4.13)
Heuristics--Assembly in practice Alignment and consensus • Building a layout from a path in an overlap graph • Two techniques related to alignment construction: • The first one helps in building a good layout from a path in the presence of errors. • Example 4.7 • Implement: Figure 4.24 • The second one focuses on locally improving an already constructed layout • Example 4.8 in Figure 4.25 • Implement: sum-of-pairs scoring scheme
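A sum-of-pairs score for a layout can be sketched as follows. The scoring values (match 1, mismatch -1, gap -2) are hypothetical and chosen only for illustration. Internal spaces are written as "-", and padding spaces at the extremities are ignored because a fragment does not cover those columns.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters in a column (illustrative values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(layout):
    """Add up the pair scores of every column over the rows that cover it."""
    width = max(len(row) for row in layout)
    total = 0
    for i in range(width):
        column = [row[i] for row in layout if i < len(row) and row[i] != " "]
        total += sum(pair_score(x, y) for x, y in combinations(column, 2))
    return total

layout = [
    "ACG-T",
    "ACGGT",
    " CGGT",
]
print(sum_of_pairs(layout))
```

A local improvement step can compare this score before and after shifting a gap or realigning one fragment, and keep the change only if the score increases.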