Fragment Assembly

Introduction • Fragments are typically 200-700 bp long • The “target” string is about 30k – 100k bp long • Problem: given a set of fragments, reconstruct the target

Introduction • Multiple-alignment of the fragments ignoring spaces at the end • The alignment is called “layout” • The output is called the “consensus sequence” • An optimization problem

Complications • Base-call errors: • Substitution errors [p 107] • Insertion errors (possibly from the host sequence) [p 108, fig 4.3] • Deletion errors [fig 4.4] • Majority voting over the aligned columns (or some other form of optimization) resolves them

Complications • Chimeras: • Two non-contiguous fragments get joined as a single fragment [p 109, fig 4.5] • Need to be weeded out as a preprocessing step • Similar to chimeras, contaminant fragments (possibly from the host) need to be filtered out as well

Complications • Unknown orientation: • Fragments may come from either strand • If a fragment comes from the opposite strand, its reverse complement must occur in the target string • Consequence: try both the forward and the reverse-complement reading of each fragment (2^n combinations in the worst case, for n fragments) • [p 109, fig 4.6]
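The two readings of a fragment can be sketched in Python as follows. This is a minimal illustration: the names `reverse_complement` and `orientations` are hypothetical, and the input is assumed to use only the uppercase letters A, C, G, T.

```python
# Illustrative sketch; function names are hypothetical.
# Assumes fragments contain only the uppercase letters A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment: str) -> str:
    """Complement every base, then reverse the result."""
    return fragment.translate(_COMPLEMENT)[::-1]


def orientations(fragment: str):
    """Both candidate readings of a fragment whose strand is unknown."""
    return fragment, reverse_complement(fragment)


print(orientations("AACGT"))  # ('AACGT', 'ACGTT')
```

With n fragments, choosing one of the two readings for each gives the 2^n combinations mentioned on the slide.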

Complications • Repeats: • Regions (super-string of some fragments) may repeat in a target • Consequent problem: where do the fragments really come from, on approximate alignment? [p 110, fig 4.7] • Problem 2: where should the inter-repeat fragments go? [p111, fig 4.8, fig 4.9] • Inverted repeats: repeat of the reverse complement [fig 4.10]

Complications • Insufficient coverage: • Chance of coverage increases with redundancy (a heuristic: sample fragments totaling 8 times the target length) • The chance of covering a remaining gap is low once it has stayed uncovered after many fragments are aligned: further random sampling is not a good solution here

Complications • Insufficient coverage: • What you get with insufficient coverage is multiple “contigs,” not one contig • A “t-contig” is a contig in which linked fragments overlap by at least t bases • Expected number of contigs: [p 112, formula 4.1] • Lower t means fewer contigs (more aligned segments), but weaker consensus
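To get a feel for the trade-off between t and the number of contigs, the sketch below evaluates a Lander-Waterman style estimate, n * exp(-n(l - t)/T), for n fragments of length l sampled from a target of length T. This is an assumption made for illustration: the exact form and notation of formula 4.1 are not reproduced on the slide and may differ. The function name and the sample numbers are hypothetical.

```python
import math


def expected_contigs(n, frag_len, target_len, t):
    """Lander-Waterman style estimate of the number of apparent contigs.

    Assumed model (not taken from the slide): n fragments of equal length
    frag_len, sampled uniformly from a target of length target_len, with
    two fragments linked only if they overlap by at least t bases.
    """
    return n * math.exp(-n * (frag_len - t) / target_len)


# Hypothetical numbers: 100 fragments of 500 bp over a 10,000 bp target.
for t in (0, 50, 100):
    print(t, expected_contigs(100, 500, 10_000, t))
```

Under this model the estimate grows with t, which matches the slide's remark that a lower t gives fewer contigs.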

Reconstruction • Shortest common superstrings are not the best solution • Fig 4.12 vs Fig 4.13 (p115/116)

Reconstruction • Superstring to be reconstructed out of fragments • An alignment problem with no end penalty • d_s is edit distance score without end-penalty: minimized over edit distances d • Fig 4.14 (p117) for best aligned subsequence-matching • Note, char matched is charged 0, mismatch 1, gap 2, in “distance” rather than “similarity” • We will use d for d_s

Reconstruction • f is an approximate substring of S at error level e if d(f, S) =< e|f| • e=0 means no error allowed; e>0 allows insert/delete/substitution errors • Either f or its reverse complement f- should match

Reconstruction: Problem • Input: Set F of substrings, error level e • Output: Shortest possible string S s.t. for all f Min(d(f, S), d(f-, S)) =< e|f|
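The acceptance condition above can be sketched in Python. This is a minimal illustration, assuming the costs from the earlier slide (match 0, mismatch 1, gap 2) and reading d(f, S) as the best edit distance between f and any substring of S, so gaps at the ends of S are free. All function names are hypothetical.

```python
# Illustrative sketch; names are hypothetical.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    return fragment.translate(_COMPLEMENT)[::-1]


def substring_distance(f, s, mismatch=1, gap=2):
    """min over substrings u of s of the edit distance d(f, u)."""
    # Row 0 is all zeros: the match may start anywhere in s at no cost.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        cur = [prev[0] + gap] + [0] * len(s)  # f[:i] against nothing: i gaps
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            cur[j] = min(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The match may also end anywhere in s, so take the best cell of the last row.
    return min(prev)


def is_approximate_substring(f, s, e):
    """True if min(d(f, S), d(f-, S)) <= e * |f|."""
    best = min(substring_distance(f, s),
               substring_distance(reverse_complement(f), s))
    return best <= e * len(f)


print(is_approximate_substring("CGTT", "GGAACGCC", 0.0))   # True: f- = AACG occurs in S
print(is_approximate_substring("TTTT", "GGAACGCC", 0.25))  # False
```

The sketch only checks a given S against one fragment; finding the shortest S that passes the check for every fragment is the hard part of the problem.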

Reconstruction: Multicontig • How much overlap do we require between strings? • Ideally, each column in the layout L should have same character, for all columns 1 through |L| • Fig 4.4 (p 118): t-contig for t=3, 2, 1 • Balance between t and number of t-contigs

Reconstruction: Multicontig • S is e-consensus sequence (multicontig) for 0=<e=<1: edit distance d(f, S) =< e|f| • Multicontig problem: • Input: set F, integer t>=0, 0=<e=<1 • Output: Minimum partition over F, each partition Ci is a t-contig with e-consensus

Reconstruction: Overlap Multi-graph • Nodes are the fragments • Directed arcs are labeled with the length t of an overlap between nodes: t-suffix = t-prefix • Arcs between all pairs of nodes, but no self-loop • Fig 4.15 (p 121): example • Length of a created superstring = total length of all fragments involved minus total wt (overlaps) along the path • Max weight Hamiltonian path is what we are looking for in this graph => max overlapped (shortest) superstring
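The relation between path weight and superstring length can be checked with a short Python sketch. It assumes exact overlaps and keeps only the largest overlap per ordered pair, so it builds a simple graph rather than the full multigraph. Function names and the three toy fragments are hypothetical.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(fragments):
    """Arc weights for every ordered pair of distinct fragments (no self-loops)."""
    return {(i, j): overlap(a, b)
            for i, a in enumerate(fragments)
            for j, b in enumerate(fragments) if i != j}


def superstring(fragments, path):
    """Glue the fragments along the path, merging each overlap once."""
    s = fragments[path[0]]
    for u, v in zip(path, path[1:]):
        s += fragments[v][overlap(fragments[u], fragments[v]):]
    return s


frags = ["ACGT", "GTAA", "AACC"]          # toy fragments
w = overlap_graph(frags)
path = [0, 1, 2]
path_wt = sum(w[(u, v)] for u, v in zip(path, path[1:]))
s = superstring(frags, path)
print(s, len(s), sum(map(len, frags)) - path_wt)  # ACGTAACC 8 8
```

Because the total fragment length is fixed, maximizing the path weight is the same as minimizing the superstring length.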

Reconstruction • Substrings of fragments within the set of fragments are noise: remove them • Draw OMG of the substring-free set of fragments • Shortest common superstring always corresponds to a Hamiltonian path in this graph

Reconstruction: OMG • Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Ham. Path P s.t. S(P) is contained in S • Fragments are strictly ordered over S: order of left pts = order of rt points (otherwise one fragment would be a substring of another) • Path follows the same order of fragments (as in S) in OMG • S may contain extra garbage material, so S(P) is within S

Reconstruction: OMG • If S is a shortest common superstring, then S(P) within S forces S = S(P) • In other words, a max-weight Ham. Path in the OMG for the substring-free collection F’ yields a shortest common superstring of the fragment set F

Reconstruction: OMG • Think of an algorithm for weeding out substrings from F • Also, weed out multi-edges by keeping the largest wt edge between any pair of nodes • If the wt on an edge is below a threshold t, then the wt should be treated as 0
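One simple way to do this preprocessing is sketched below in Python. It is a quadratic-time illustration rather than an efficient method (suffix-tree based approaches would scale better), it ignores reverse complements, and the function names are hypothetical. Keeping one weight per ordered pair, as in the dictionary here, already corresponds to keeping the largest-weight edge between a pair of nodes.

```python
def remove_substrings(fragments):
    """Drop duplicates and every fragment contained in another fragment."""
    # Longest first: a fragment can only be contained in one at least as long.
    unique = sorted(set(fragments), key=lambda f: (-len(f), f))
    kept = []
    for f in unique:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept


def threshold_weights(weights, t):
    """Treat every overlap shorter than t as weight 0."""
    return {arc: (w if w >= t else 0) for arc, w in weights.items()}


print(remove_substrings(["ACGT", "CG", "GTAA", "ACGT"]))  # ['ACGT', 'GTAA']
print(threshold_weights({(0, 1): 3, (1, 0): 1}, t=2))     # {(0, 1): 3, (1, 0): 0}
```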

Reconstruction: OMG • Greedy Algorithm to draw Ham. Path (p 125) • Collects edges largest to smallest, (1) preventing cycle (union-find), (2) indegree of each node should be =<1 (first node has 0) (3) outdegree of each node should be =<1 (last node has 0) [Does not return Ham. Path. Can you modify to return Ham. Path?] • Alg is NOT optimal, example (p 126): returns 3, optimal wt is 4
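The greedy procedure can be sketched in Python as follows, including one possible answer to the exercise: the chosen arcs are followed from the node with no incoming arc to return the path itself. The sketch assumes a complete graph (weight 0 where there is no overlap). The toy weights are made up for illustration and are not the example from p 126; all names are hypothetical.

```python
from itertools import permutations


def greedy_hamiltonian_path(n, weight):
    """weight maps every ordered pair (u, v), u != v, to an overlap weight."""
    parent = list(range(n))               # union-find forest for cycle checks

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    succ = {}                             # chosen arcs, u -> v
    has_in = [False] * n
    # Consider arcs from largest to smallest weight.
    for (u, v), _ in sorted(weight.items(), key=lambda kv: -kv[1]):
        if u in succ or has_in[v]:        # outdegree(u) <= 1, indegree(v) <= 1
            continue
        ru, rv = find(u), find(v)
        if ru == rv:                      # arc would close a cycle
            continue
        parent[ru] = rv
        succ[u] = v
        has_in[v] = True
    node = has_in.index(False)            # the only node with indegree 0
    path = [node]
    while node in succ:
        node = succ[node]
        path.append(node)
    return path


def path_weight(path, weight):
    return sum(weight[(u, v)] for u, v in zip(path, path[1:]))


def best_path_weight(n, weight):
    """Brute force over all orders; only feasible for tiny n."""
    return max(path_weight(p, weight) for p in permutations(range(n)))


# Toy instance (hypothetical weights): greedy grabs the arc of weight 3 first.
weight = {(u, v): 0 for u in range(3) for v in range(3) if u != v}
weight.update({(0, 2): 3, (0, 1): 2, (1, 2): 2})
path = greedy_hamiltonian_path(3, weight)
print(path, path_weight(path, weight), best_path_weight(3, weight))  # [1, 0, 2] 3 4
```

Taking the heaviest arc 0 -> 2 blocks both arcs of weight 2, so greedy ends with total weight 3 while the path 0 -> 1 -> 2 reaches 4.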

Reconstruction: OMG • Subinterval: a fragment whose interval on the target is embedded within that of another fragment in the set • A subinterval-free and repeat-free graph connected at level t has a Ham. Path that generates the target string

Reconstruction: OMG • If a repeat exists in the original string, then the graph will have a cycle • False positive: substrings from two different portions have a t-overlap • If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p129): proof by contradiction, otherwise the subinterval-free fragments can be totally ordered

Reconstruction: OMG • If there are no repeats in a subinterval-free graph, then there exists a unique Ham. Path • If a cycle exists, it need not come from a repeat

Reconstruction: OMG • Example 4.6 (p 130): greedy alg finds wrong string, but the Ham. Path finds the correct one • Greedy does not care about linkage (optimizes on total overlap – finds shortest common superstring) • Ham path chooses any t-overlap connections – cares for linkage only

Parameters in aligning for fragment assembly • Score on a column: traditionally {0,-1,-2} in sum-of-pairs • Entropy: Sum[over alphabet characters and space c] of -p_c log p_c, where p_c is the probability (relative frequency) of c in the column • All same character: p_c = 1, entropy = 0 • For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5 • Entropy measures uniformity alone, a better metric
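The two boundary cases of the column entropy can be reproduced with a short Python sketch. It assumes natural logarithms (the slide does not fix a base) and takes p_c to be the relative frequency of c in the column; the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one layout column: sum over c of -p_c * log(p_c)."""
    total = len(column)
    counts = Counter(column)
    # -p * log(p) is written as p * log(1 / p) with p = count / total.
    return sum((c / total) * math.log(total / c) for c in counts.values())


print(column_entropy("aaaaa"))                # 0.0
print(column_entropy("atcg-"), math.log(5))   # both approximately log 5
```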

Parameters in aligning for fragment assembly • Coverage: how many fragments cover each column? (Average, min, max) • This is different from the concept of t-overlap • If a column (of the target) is covered by 0 fragments, then the layout is disconnected • Conflicts with the requirement of a subinterval-free collection if we expect coverage>1 for all columns
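Per-column coverage can be computed from the fragment placements in a layout, as in the Python sketch below. It assumes each fragment is described by a hypothetical half-open interval (start, end) on the target; the function name and the sample placements are made up for illustration.

```python
def column_coverage(placements, target_len):
    """Number of fragments covering each column of the target.

    placements: list of (start, end) half-open intervals, one per fragment.
    """
    cov = [0] * target_len
    for start, end in placements:
        for col in range(max(start, 0), min(end, target_len)):
            cov[col] += 1
    return cov


cov = column_coverage([(0, 4), (2, 6), (8, 10)], target_len=10)
print(cov)                                     # [1, 1, 2, 2, 1, 1, 0, 0, 1, 1]
print(min(cov), max(cov), sum(cov) / len(cov)) # 0 2 1.0
```

A minimum of 0, as in columns 6 and 7 here, means the layout splits into more than one contig.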

Parameters in aligning for fragment assembly • Coverage is not enough, we need good linkage, Example: p 133 • Ham. Path algorithm is doing that

Steps in assembly : • Step 1: Overlap finding • Approximate – delete, insert, replace allowed • by semi-global DP algorithm • with appropriate end-gap penalty, • pairwise between all fragments and their reverse complements
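A semi-global DP for one ordered pair of fragments can be sketched in Python as below. It is an illustration under stated assumptions: similarity scoring with match +1, mismatch -1, gap -2 (chosen here, not given on the slide), no penalty for the unaligned prefix of the first fragment or the unaligned suffix of the second, and a hypothetical function name. In a full overlap step it would be run on every ordered pair drawn from the fragments and their reverse complements.

```python
def best_overlap(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of a with a prefix of b.

    Returns (score, j), where j is the length of the prefix of b that
    takes part in the overlap; (0, 0) means no useful overlap.
    """
    # Row 0: no character of a used yet, so b[:j] can only face gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)   # cur[0] = 0: skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # All of a is consumed; the rest of b is free, so pick the best column.
    best_j = max(range(len(b) + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


print(best_overlap("AAACGT", "CGTTTT"))  # (3, 3): the suffix CGT meets the prefix CGT
print(best_overlap("AAAA", "CCCC"))      # (0, 0): no overlap
```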

Steps in assembly : • Step 2: Construct the OMG over (F union F-bar) for the fragment set F • (-- after eliminating substrings?) • Construct Hamiltonian path in this graph • Cycles and unbalanced coverage may mean repeats

Steps in assembly : • Step 3: fine tuning the multiple alignment to get a consensus target • Manual or algorithmic • Examples in p 137-138
