DNA Fragment Assembly

DNA Fragment Assembly CIS 667 Spring 2004 February 18

Objectives • The problem: DNA Fragment Assembly • The ideal case • The complications • Models • The Shortest Common Superstring • Reconstruction • Multicontig • A greedy algorithm • Heuristics

The Problem • Assumption: We know the length of the target sequence approximately • The problem: Given a set of fragments from DNA, we want deduce the whole sequence of the DNA. • We determine only one of the strands of the original molecule

The ideal case Input: • The set of fragments: ACCGT CGTGC TTAC TACCGT • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T A C C G T _ _ T T A C C G T G C (consensus by majority of votes)

Complications • real problem instance is very large • errors • substitutions • insertions • deletions • chimeras • unknown orientation of the fragments • repeated regions • causes ambiguity in sequencing • lack of coverage • causes gaps

Errors: Substitution Input: • The set of fragments: ACCGT CGTGC TTAC TGCCGT  substitution • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T G C C G T _ _ T T A C C G T G C (consensus by majority of votes)

Errors: Insertion Input: • The set of fragments: ACCGT CAGTGC  insertion TTAC TACCGT • Total length 10bp Output: _ _ A C C * G T _ _ _ _ _ _ C A G T G C T T A C _ * _ _ _ _ _ T A C C * G T _ _ T T A C C * G T G C (consensus by majority of votes)

Errors: Chimeras Input: • The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC • Total length 10bp Output: _ _ A C C G T _ _ _ _ _ _ C G T G C T T A C _ _ _ _ _ _ T A C C G T _ _ T T A C C G T G C (consensus) T T A _ _ _ T G C A chimera arises when two regular fragments from distinct parts of the target molecule join end-to end Remedy: recognize them before use!

Repeated Regions • Unknown orientation with no errors • Unknown orientation with errors • Repeated regions causes ambiguity P X Q X R X S P X R X Q X S

Direct repeat • Direct repeat • More complex are inverted repeat • repeated regions in opposite strands P X Q Y R X S Y P X S Y R X Q Y

Lack of coverage • causes formation of gaps • compute the mean coverage • add up all the fragments and divide by the target length • insufficient coverage is covered by sampling more fragments • How many fragments do I need? • Assume • all fragments have the same length • let t be the safe overlap of at least t bases • n is the number of fragments • T is the target length Apparent contigs: p = n e –n(l-t)/T

Shortest Common Superstring Input: A collection F of strings Output: A shortest possible string S |  fF, S is a superstring of f. Example: F={ATG, TGC, GCC} S= ATGCC Question: Is it the shortest? Observe: u=ATG and v=GCC overlap in G and TGC is a substring

Shortest Common Superstring Is it a good problem? Advantages: • The problem finds the PERFECT superstring • Good for most ideal cases Disadvantages: • the problem does not deal with errors • good only in some ideal cases • in presence of no errors and known orientation, it fails in presence of repeat • repeated identical copies get absorbed in the search of the SHORTEST superstring and produces an assembly of uneven coverage • It does not consider lack of coverage and size of the target • NP-hard

Reconstruction Objective:We want to consider errors and unknown orientation Substring Edit Distance ds(a,b) = min dss(b)(a,s) • one unit is charged for insertion, deletion, substitution • no charges for deletion in the extremity of 2nd sequence Example u=CGATGT v=AACTAATGTGC _ _ C G A * T G T _ _ A A C T A A T G T G C ds(u,v) = 2 • A string f is an approximate substring of S at error level  (between 0, 1) whends(f,S)  |f|

Reconstruction Input: A collection F of strings, an error tolerance  with 01 Output: A shortest possible string S |  fF, we have min(ds(f, S), ds(f ,S))  |f| where f is the reverse complement Advantage: • takes into account errors and unknown orientation Disadvantages: • Is an NP-hard problem • It does not model repeats • It does not consider lack of coverage and size of the target

Multicontig Objective: We want to consider internal linkage No special assumptions except: • for known orientation, fragment and reverse complement are not both present in the collection. • We want to have good linkage (overlap between fragments) • An overlap is a link if it is not (properly) contained in a bigger fragment • The smallest size of a link in a layout is called a weakest link • A layout is a t-contig if its weakest link is at least size t • We partition F into the minimum number of collections which admit a t-contig

Multicontig Idea: Let's partition F in the minimum number of t-contigs! Example: F={GTAG, TAATG, TGTAA} for t=3 F1={TAATG, TGTAA} and F2={GTAG} for t=2 we have two solutions • F1={TAATG, TGTAA} and F2={GTAG} • F1={TAATG,TGTAA} and F2={GTAG} for t=1 we have the desired solution (the minimum) F1={TAATG, TGTAA, GTAG} For errors, we use the consensus of the multi-alignment and insist that the edit distance of the fragments be small

Multicontig Input: A collection F of strings, and an integer t0 and an error tolerance  with 01 Output: A partition of F in the minimum number of subcollections Ci. 1i  k | every Ci admits a t-contig with an -consensus Advantage: • takes into account errors and unknown orientation • take into account internal linkage of the fragments • the answer is formed by several contigs Disadvantages: • Is an NP-hard problem even in the simplest case of no errors and known orientation • It contains as a special case finding a Hamiltonian path in a restricted class of graphs • It has no provision to use information on the approximate size of the target

TACGG Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 1 2 CTAAAG GCCC 1 1 1 GGACG Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t

Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

TACGG Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 1 2 CTAAAG GCCC 1 1 1 GGACAG Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t

Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices? 1 Overlap Multigraph t-Overlap: suffix(a,t) = prefix(b,t) or (at)b = a(tb) or |a|-ta = b|b|-t TACGG 1 2 CTAAAG GCCC 1 1 1 GGACAG

Theoretical results Theorem: the total length of A (set of fragments) is ||A|| = w(P) + |S(P)| where • ||A||=aA |a| • w(P) is the weight of the path P • |S(P)| is the length of the superstring derived from P. • to convince yourself, read the proof from the book Other theoretical results: Looking at the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.

The Greedy Methodology NP-hard problems cannot be solved in reasonable time, but we can look for approximate solutions in reasonable time To apply a greedy methodology: • the problem must show optimal substructure • A problem exhibits optimal substructure if the optimal solution to a problem contains within it optimal solutions to other problems • the optimal solution is reached by taking the best "local" choice

TACGA 1 2 ACCC CTAAAG 1 1 1 GACA Overlap Graph • The Greedy Algorithm • input: weighted di-graph OG(F) with n vertices • output: Hamiltonian path in OG(F) • //Initialize • for i 1 to n do • in[i]=0 //how many selected edges enter i • out[i]=0 //how many selected edges exit i • MakeSet(i) • //Process • Sort the edges by weight, heaviest first • for each edge (f,g) in this order do • //test for acceptance • ifin[g] = 0 andout[f] = 0 and FindSet(f) ≠FindSet(g) • select (f,g) • in[g]1 • out[f] 1 • Union(FindSet(f), Findset(g)) • if there is only one component break • return selected edges An overlap graph has only edges with maximum weight

A graph where "greedy" fails • F={GCAAAG, AGTA,TACGA} We order the edges by weight (AGAT, GCAAAG) = 3 (GCAAAG, AGTA) =2 (AGTA, TACGA) = 2 The algorithm will choose first (AGAT, GCAAAG) = 3 and then is forced to select an edge with weight 0 to complete the path. Instead the solution should be (GCAAAG, AGTA) =2 (AGTA, TACGA) = 2 TACGA 2 GCAAAG 3 2 AGTA

Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. Issues: Scoring Coverage Linkage

Heuristics • Scoring • Uniformity is good, variability is bad. • Compute the entropy of a column • the entropy is the measure of the chaos in a column. There are 5 possible characters, A, T, C, G, space E=-cpclog pc E=0 if pc=1 for a character; E=log5 if each pc=1/5 • To measure the uniformity we want a low entropy per column • Coverage • minimun, maximum or medium coverage • if the coverage reaches 0 for a column I, we do not have a connected layout • if we have more columns with zero coverage, any permutation of the intervening regions (the contig) is acceptable • Coverage gives confidence to the consensus • Linkage • High coverage with no links is not good. Overlap is required.

More Observations Local optimal decisions do not always work. Can we do any better? Use some heuristics. • Assembly in practice consists of: • Finding overlaps • Building Layout • Computing the consensus Advantages:  • We treat each problem separately. Disadvantages:  • It becomes difficult to understand the relationship between the input and the final output

Heuristics • Finding overlaps • use a dynamic programming approach with a score system such as • 1 for matches • -1 for mismatches • -2 for spaces • Do not charge for space after the first sequence and before the second one.

Heuristics Ordering Fragments • there is no algorithm simple and general enough Considerations: • Use the set DF=F F • If f=uv g=wx then g =wx f = vu • if f is approximately the same as the beginning of g we can expect that whatever is the criterion used to assess the similarity between f and g, the same criterion will apply to their reverse complement

Finding overlaps • Finding a good ordering of overlapping means finding a direct path in the overlap graph • Both strands are constructed simultaneously • Contained fragment are not essential in the path • A disconnected graph indicates lack of coverage • The presence of cycles indicates repeats • Unusual high coverage indicate possible repeats • The presence of reverse complement cycles indicates inverted repeats

Alignment and Consensus • Use the minimal sum of the distances Suppose we have f  g  h CATAGTC TAACTAT AGACTATCC Two semiglobal aligments for f and g are: C A TAG T C_ _ _C ATA GT C_ _ _ _ _ TAA _ C TA T _ _TA_ A CT A T C ATA GT C_ _ _ _ _TA _ A CT A T _ _ _A G A CT A T C C CATA GA C T A T C C • ds(f, S) = 1 ds(g, S) = 1 ds(h, S) = 0 if we use the second aligment and ds(f, S) +ds(g, S) +ds(h, S) = 2 • ds(f, S) = 1 ds(g, S) = 2 ds(h, S) = 0 if we use the first aligment and A is chosen for column 6, ds(f, S) +ds(g, S) +ds(h, S) = 3

A Linked List of Bases • Sometimes we know what is best only later. Is there a structure that helps us? • Use a Linked List of Bases • matches bases are unified in one node • unmatched bases are left separate • Technique: Traverse this graph in topological order. G  T C  A  T  A C  T  A  T A

Conclusions • The models fail to address all the issues involved in the problem • The effective real problem is NP-hard • Approximation gives us some help but fails in some cases • Heuristics helps and the problem is broken in 3 smaller problems: • finding overlap • building layout and • computing the consensus • Are we sure there is nothing else to do? • We will look next week at the smaller problem of comparing only two sequences instead of many. Will we find something better?

DNA Fragment Assembly

DNA Fragment Assembly

Presentation Transcript

Graph Theory Aiding DNA Fragment Assembly

DNA Sequencing and Assembly

Robust Self-Assembly of DNA

DNA Computing by Self Assembly

DNA Nanotechnology The science of assembly

DNA guided assembly line

DNA aSsEmBlY tEcHnIqUes

Fragment Assembly

Fragment Assembly

A new Approach to Fragment Assembly in DNA Sequenceing

DNA Transcription and Protein Assembly

Fragment Assembly

DNA Self-Assembly

DNA Sequence Analysis and Fragment Assembly System (FAS)

DNA Sequence Analysis and Fragment Assembly System (FAS)

Biological Motivation for Fragment Assembly

De Novo Repeat Classification and Fragment Assembly

DNA Fragment Assembly

Biological Motivation for Fragment Assembly

A Fast Hybrid Short Read Fragment Assembly Algorithm

DNA mediated Self-assembly of Nanoarchitectures

CSCI2950-C Lecture 3 DNA Sequencing and Fragment Assembly