390 likes | 613 Views
DNA Fragment Assembly. CIS 667 Spring 2004 February 18. Objectives. The problem: DNA Fragment Assembly The ideal case The complications Models The Shortest Common Superstring Reconstruction Multicontig A greedy algorithm Heuristics. The Problem.
E N D
DNA Fragment Assembly CIS 667 Spring 2004 February 18
Objectives • The problem: DNA Fragment Assembly • The ideal case • The complications • Models • The Shortest Common Superstring • Reconstruction • Multicontig • A greedy algorithm • Heuristics
The Problem • Assumption: We know the length of the target sequence approximately • The problem: Given a set of fragments from DNA, we want deduce the whole sequence of the DNA. • We determine only one of the strands of the original molecule
The ideal case Input: • The set of fragments: ACCGT CGTGC TTAC TACCGT • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T A C C G T _ _ T T A C C G T G C (consensus by majority of votes)
Complications • real problem instance is very large • errors • substitutions • insertions • deletions • chimeras • unknown orientation of the fragments • repeated regions • causes ambiguity in sequencing • lack of coverage • causes gaps
Errors: Substitution Input: • The set of fragments: ACCGT CGTGC TTAC TGCCGT substitution • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T G C C G T _ _ T T A C C G T G C (consensus by majority of votes)
Errors: Insertion Input: • The set of fragments: ACCGT CAGTGC insertion TTAC TACCGT • Total length 10bp Output: _ _ A C C * G T _ _ _ _ _ _ C A G T G C T T A C _ * _ _ _ _ _ T A C C * G T _ _ T T A C C * G T G C (consensus by majority of votes)
Errors: Chimeras Input: • The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T A C C G T _ _ T T A C C G T G C (consensus) T T A _ _ _ T G C A chimera arises when two regular fragments from distinct parts of the target molecule join end-to end Remedy: recognize them before use!
Repeated Regions • Unknown orientation with no errors • Unknown orientation with errors • Repeated regions causes ambiguity P X Q X R X S P X R X Q X S
Direct repeat • Direct repeat • More complex are inverted repeat • repeated regions in opposite strands P X Q Y R X S Y P X S Y R X Q Y
Lack of coverage • causes formation of gaps • compute the mean coverage • add up all the fragments and divide by the target length • insufficient coverage is covered by sampling more fragments • How many fragments do I need? • Assume • all fragments have the same length • let t be the safe overlap of at least t bases • n is the number of fragments • T is the target length Apparent contigs: p = n e –n(l-t)/T
Shortest Common Superstring Input: A collection F of strings Output: A shortest possible string S | fF, S is a superstring of f. Example: F={ATG, TGC, GCC} S= ATGCC Question: Is it the shortest? Observe: u=ATG and v=GCC overlap in G and TGC is a substring
Shortest Common Superstring Is it a good problem? Advantages: • The problem finds the PERFECT superstring • Good for most ideal cases Disadvantages: • the problem does not deal with errors • good only in some ideal cases • in presence of no errors and known orientation, it fails in presence of repeat • repeated identical copies get absorbed in the search of the SHORTEST superstring and produces an assembly of uneven coverage • It does not consider lack of coverage and size of the target • NP-hard
Reconstruction Objective:We want to consider errors and unknown orientation Substring Edit Distance ds(a,b) = min dss(b)(a,s) • one unit is charged for insertion, deletion, substitution • no charges for deletion in the extremity of 2nd sequence Example u=CGATGT v=AACTAATGTGC _ _ C G A * T G T _ _ A A C T A A T G T G C ds(u,v) = 2 • A string f is an approximate substring of S at error level (between 0, 1) whends(f,S) |f|
Reconstruction Input: A collection F of strings, an error tolerance with 01 Output: A shortest possible string S | fF, we have min(ds(f, S), ds(f ,S)) |f| where f is the reverse complement Advantage: • takes into account errors and unknown orientation Disadvantages: • Is an NP-hard problem • It does not model repeats • It does not consider lack of coverage and size of the target
Multicontig Objective: We want to consider internal linkage No special assumptions except: • for known orientation, fragment and reverse complement are not both present in the collection. • We want to have good linkage (overlap between fragments) • An overlap is a link if it is not (properly) contained in a bigger fragment • The smallest size of a link in a layout is called a weakest link • A layout is a t-contig if its weakest link is at least size t • We partition F into the minimum number of collections which admit a t-contig
Multicontig Idea: Let's partition F in the minimum number of t-contigs! Example: F={GTAG, TAATG, TGTAA} for t=3 F1={TAATG, TGTAA} and F2={GTAG} for t=2 we have two solutions • F1={TAATG, TGTAA} and F2={GTAG} • F1={TAATG,TGTAA} and F2={GTAG} for t=1 we have the desired solution (the minimum) F1={TAATG, TGTAA, GTAG} For errors, we use the consensus of the multi-alignment and insist that the edit distance of the fragments be small
Multicontig Input: A collection F of strings, and an integer t0 and an error tolerance with 01 Output: A partition of F in the minimum number of subcollections Ci. 1i k | every Ci admits a t-contig with an -consensus Advantage: • takes into account errors and unknown orientation • take into account internal linkage of the fragments • the answer is formed by several contigs Disadvantages: • Is an NP-hard problem even in the simplest case of no errors and known orientation • It contains as a special case finding a Hamiltonian path in a restricted class of graphs • It has no provision to use information on the approximate size of the target
TACGG Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 1 2 CTAAAG GCCC 1 1 1 GGACG Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG
TACGG Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 1 2 CTAAAG GCCC 1 1 1 GGACAG Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG
Theoretical results Theorem: the total length of A (set of fragments) is ||A|| = w(P) + |S(P)| where • ||A||=aA |a| • w(P) is the weight of the path P • |S(P)| is the length of the superstring derived from P. • to convince yourself, read the proof from the book Other theoretical results: Looking at the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.
The Greedy Methodology NP-hard problems cannot be solved in reasonable time, but we can look for approximate solutions in reasonable time To apply a greedy methodology: • the problem must show optimal substructure • A problem exhibits optimal substructure if the optimal solution to a problem contains within it optimal solutions to other problems • the optimal solution is reached by taking the best "local" choice
TACGA 1 2 ACCC CTAAAG 1 1 1 GACA Overlap Graph • The Greedy Algorithm • input: weighted di-graph OG(F) with n vertices • output: Hamiltonian path in OG(F) • //Initialize • for i 1 to n do • in[i]=0 //how many selected edges enter i • out[i]=0 //how many selected edges exit i • MakeSet(i) • //Process • Sort the edges by weight, heaviest first • for each edge (f,g) in this order do • //test for acceptance • ifin[g] = 0 andout[f] = 0 and FindSet(f) ≠FindSet(g) • select (f,g) • in[g]1 • out[f] 1 • Union(FindSet(f), Findset(g)) • if there is only one component break • return selected edges An overlap graph has only edges with maximum weight
A graph where "greedy" fails • F={GCAAAG, AGTA,TACGA} We order the edges by weight (AGAT, GCAAAG) = 3 (GCAAAG, AGTA) =2 (AGTA, TACGA) = 2 The algorithm will choose first (AGAT, GCAAAG) = 3 and then is forced to select an edge with weight 0 to complete the path. Instead the solution should be (GCAAAG, AGTA) =2 (AGTA, TACGA) = 2 TACGA 2 GCAAAG 3 2 AGTA
Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. Issues: Scoring Coverage Linkage
Heuristics • Scoring • Uniformity is good, variability is bad. • Compute the entropy of a column • the entropy is the measure of the chaos in a column. There are 5 possible characters, A, T, C, G, space E=-cpclog pc E=0 if pc=1 for a character; E=log5 if each pc=1/5 • To measure the uniformity we want a low entropy per column • Coverage • minimun, maximum or medium coverage • if the coverage reaches 0 for a column I, we do not have a connected layout • if we have more columns with zero coverage, any permutation of the intervening regions (the contig) is acceptable • Coverage gives confidence to the consensus • Linkage • High coverage with no links is not good. Overlap is required.
More Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. • Assembly in practice consists of: • Finding overlaps • Building Layout • Computing the consensus Advantages: • We treat each problem separately. Disadvantages: • It becomes difficult to understand the relationship between the input and the final output
Heuristics • Finding overlaps • use a dynamic programming approach with a score system such as • 1 for matches • -1 for mismatches • -2 for spaces • Do not charge for space after the first sequence and before the second one.
Heuristics Ordering Fragments • there is no algorithm simple and general enough Considerations: • Use the set DF=F F • If f=uv g=wx then g =wx f = vu • if f is approximately the same as the beginning of g we can expect that whatever is the criterion used to assess the similarity between f and g, the same criterion will apply to their reverse complement
Finding overlaps • Finding a good ordering of overlapping means finding a direct path in the overlap graph • Both strands are constructed simultaneously • Contained fragment are not essential in the path • A disconnected graph indicates lack of coverage • The presence of cycles indicates repeats • Unusual high coverage indicate possible repeats • The presence of reverse complement cycles indicates inverted repeats
Alignment and Consensus • Use the minimal sum of the distances Suppose we have f g h CATAGTC TAACTAT AGACTATCC Two semiglobal aligments for f and g are: C A TAG T C_ _ _C ATA GT C_ _ _ _ _ TAA _ C TA T _ _TA_ A CT A T C ATA GT C_ _ _ _ _TA _ A CT A T _ _ _A G A CT A T C C CATA GA C T A T C C • ds(f, S) = 1 ds(g, S) = 1 ds(h, S) = 0 if we use the second aligment and ds(f, S) +ds(g, S) +ds(h, S) = 2 • ds(f, S) = 1 ds(g, S) = 2 ds(h, S) = 0 if we use the first aligment and A is chosen for column 6, ds(f, S) +ds(g, S) +ds(h, S) = 3
A Linked List of Bases • Sometimes we know what is best only later. Is there a structure that helps us? • Use a Linked List of Bases • matches bases are unified in one node • unmatched bases are left separate • Technique: Traverse this graph in topological order. G T C A T A C T A T A
Conclusions • The models fail to address all the issues involved in the problem • The effective real problem is NP-hard • Approximation gives us some help but fails in some cases • Heuristics helps and the problem is broken in 3 smaller problems: • finding overlap • building layout and • computing the consensus • Are we sure there is nothing else to do? • We will look next week at the smaller problem of comparing only two sequences instead of many. Will we find something better?