Fragment Assembly

cheche

  1. Fragment Assembly

  2. Introduction • Fragments are typically 200-700 bp long • The “target” string is about 30k-100k bp long • Problem: given a set of fragments, reconstruct the target

  3. Introduction • Multiple alignment of the fragments, ignoring spaces at the ends • The alignment is called the “layout” • The output is called the “consensus sequence” • An optimization problem

  4. Complications • Base-call errors: • Substitution errors [p 107] • Insertion errors (possibly from the host sequence) [p 108, fig 4.3] • Deletion errors [fig 4.4] • Majority voting over each column (or some other form of optimization) resolves them
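
A minimal Python sketch of column-wise majority voting over a layout. The layout representation is an assumption made for illustration (equal-length rows, a space outside a fragment, `-` for a gap inside one); the slides do not prescribe a data structure.

```python
from collections import Counter


def majority_consensus(layout, gap="-", pad=" "):
    """Column-wise majority vote over an aligned layout.

    layout: equal-length rows; `pad` marks positions outside a fragment
    (these cast no vote), `gap` marks an alignment gap inside a fragment.
    """
    consensus = []
    for column in zip(*layout):
        votes = Counter(c for c in column if c != pad)
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != gap:  # a winning gap means "no base at this column"
            consensus.append(winner)
    return "".join(consensus)


layout = [
    "ACGTAC    ",
    "  GTTCGA  ",  # carries a substitution error in its third base
    "   TACGATT",
]
print(majority_consensus(layout))
```

The single substitution error is outvoted because two other fragments cover that column; a column covered by only one fragment has no such protection.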

  5. Complications • Chimeras: • Two non-contiguous fragments get joined into a single fragment [p 109, fig 4.5] • They need to be weeded out in a preprocessing step • Like chimeras, contaminant fragments (possibly from the host) need to be filtered out as well

  6. Complications • Unknown orientation: • Fragments may come from either strand • If a fragment comes from the opposite strand, its reverse complement must be in the target string • Consequence: try both the forward and the reverse-complement form of each fragment (2^n combinations in the worst case, for n fragments) • [p 109, fig 4.6]
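
A short Python sketch of the two orientations of a fragment. The function names are illustrative, and the alphabet is assumed to be plain A/C/G/T (no ambiguity codes).

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(_COMPLEMENT)[::-1]


def orientations(fragment):
    """Both candidate readings of a fragment whose strand is unknown."""
    return (fragment, reverse_complement(fragment))


print(orientations("AACG"))
```

Each fragment contributes two candidates, so choosing one orientation per fragment independently gives the 2^n combinations mentioned above.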

  7. Complications • Repeats: • Regions (superstrings of some fragments) may repeat in the target • Problem 1: under approximate alignment, which copy does a fragment really come from? [p 110, fig 4.7] • Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9] • Inverted repeats: a repeat of the reverse complement [fig 4.10]

  8. Complications • Insufficient coverage: • The chance of covering the whole target increases with redundancy (a heuristic: the fragments should total about 8 times the target length) • The chance of covering a remaining gap drops once many fragments have been aligned and it is still uncovered, so random sampling is not a good solution here
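
A rough Python estimate of why about 8-fold redundancy is considered enough. This uses the standard Poisson approximation for uniformly sampled fragments, which is an assumption brought in here and not something stated in the slides; it also ignores edge effects at the ends of the target.

```python
import math


def coverage(num_fragments, fragment_len, target_len):
    """Average number of fragments covering a base of the target."""
    return num_fragments * fragment_len / target_len


def expected_uncovered_fraction(c):
    """Poisson approximation: probability that a given base is covered by no fragment."""
    return math.exp(-c)


c = coverage(num_fragments=800, fragment_len=500, target_len=50_000)
print(c, expected_uncovered_fraction(c))
```

Under this approximation the uncovered fraction shrinks exponentially with coverage, but it never reaches zero, which is consistent with the remark that random sampling alone does not close the last gaps.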

  9. Complications • Insufficient coverage: • With insufficient coverage you get multiple “contigs,” not one contig • A “t-contig” is a contig in which we expect overlaps of length at least t between adjacent fragments • Expected number of contigs: [p 112, formula 4.1] • Lower t means fewer contigs (more aligned segments), but a weaker consensus

  10. Reconstruction • Shortest common superstrings are not the best solution • Fig 4.12 vs Fig 4.13 (p115/116)

  11. Reconstruction • The superstring is to be reconstructed from the fragments • An alignment problem with no end penalty • d_s is the edit distance without end penalty: the minimum of the edit distance d between a fragment and any substring of the superstring • Fig 4.14 (p 117): best-aligned substring match • Note: a matched character costs 0, a mismatch 1, a gap 2; this is a “distance,” not a “similarity” • We will write d for d_s

  12. Reconstruction • f is an approximate substring of S at error level e if d(f, S) <= e|f| • e = 0 means no errors are allowed; e > 0 allows insertion/deletion/substitution errors • Both f and its reverse complement f- should be tried

  13. Reconstruction: Problem • Input: set F of strings (the fragments), error level e • Output: shortest possible string S such that, for all f in F, min(d(f, S), d(f-, S)) <= e|f|
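
The acceptance test min(d(f, S), d(f-, S)) <= e|f| can be sketched in Python as follows. The costs (match 0, mismatch 1, gap 2) follow the slides; the function names and the row-by-row dynamic program are illustrative choices, not the book's code.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(f):
    return f.translate(_COMPLEMENT)[::-1]


def substring_distance(f, s, mismatch=1, gap=2):
    """Edit distance between f and its best-matching substring of s.

    End gaps in s are free: row 0 is all zeros (skip any prefix of s)
    and the answer is the minimum of the last row (skip any suffix of s).
    """
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        curr = [prev[0] + gap]  # f[:i] aligned against nothing in s
        for j in range(1, len(s) + 1):
            sub = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return min(prev)


def is_approximate_substring(f, s, e):
    """True if f or its reverse complement fits in s at error level e."""
    best = min(substring_distance(f, s),
               substring_distance(reverse_complement(f), s))
    return best <= e * len(f)


S = "ACGATTACA"
print(substring_distance("GACT", S), is_approximate_substring("TAATC", S, 0.0))
```

With e = 0 only exact occurrences (in either orientation) pass; raising e tolerates proportionally more errors in longer fragments.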

  14. Reconstruction: Multicontig • How much overlap do we require between fragments? • Ideally, every column of the layout L, from 1 through |L|, should contain a single character • Fig 4.4 (p 118): t-contig for t = 3, 2, 1 • Trade-off between t and the number of t-contigs

  15. Reconstruction: Multicontig • S is an e-consensus sequence for a multicontig, with 0 <= e <= 1, if the edit distance satisfies d(f, S) <= e|f| for every fragment f in it • Multicontig problem: • Input: set F, integer t >= 0, error level 0 <= e <= 1 • Output: a partition of F into the minimum number of subsets C_i, such that each C_i is a t-contig with an e-consensus

  16. Reconstruction: Overlap Multi-graph • Nodes are the fragments • A directed arc is labeled with the length t of an overlap between its nodes: the t-suffix of the source equals the t-prefix of the destination • Arcs between all pairs of nodes, but no self-loops • Fig 4.15 (p 121): example • Length of the superstring spelled by a path = total length of all fragments involved - total wt (overlap) along the path • A max-weight Hamiltonian path is what we are looking for in this graph: it gives the superstring with maximum overlap

  17. Reconstruction • Fragments that are substrings of other fragments in the set are noise: remove them • Draw the OMG of the substring-free set of fragments • A shortest common superstring always corresponds to a Hamiltonian path in this graph

  18. Reconstruction: OMG • Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Ham. path P such that S(P) is a substring of S • Fragments are strictly ordered over S: the order of their left endpoints equals the order of their right endpoints (otherwise one fragment would be a substring of another) • The path follows the same order of fragments in the OMG as in S • S may contain extra garbage material, so S(P) lies within S

  19. Reconstruction: OMG • If S is a shortest common superstring, then S = S(P): S(P) lies within S and is itself a common superstring, so it cannot be shorter than S • In other words, a max-weight Ham. path in the OMG of the substring-free collection F’ spells a shortest common superstring of the fragment set F

  20. Reconstruction: OMG • Think of an algorithm for weeding out substrings from F • Also weed out multi-edges by keeping only the largest-wt edge between any pair of nodes • If the wt on an edge is below a threshold t, treat the wt as 0
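
One possible answer to the exercise, as a Python sketch: a straightforward quadratic filter, not an optimized algorithm. It ignores reverse complements, and the thresholding helper assumes the graph is stored as a dict from (source, destination) to weight; both are illustrative assumptions.

```python
def remove_substrings(fragments):
    """Drop duplicates and every fragment contained in another fragment."""
    # Longest first, so a fragment is only ever compared against
    # fragments at least as long as itself; ties broken alphabetically.
    candidates = sorted(set(fragments), key=lambda f: (-len(f), f))
    kept = []
    for f in candidates:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept


def apply_threshold(weights, t):
    """Treat every overlap shorter than t as weight 0."""
    return {arc: (w if w >= t else 0) for arc, w in weights.items()}


print(remove_substrings(["ACGT", "CG", "GTTA", "ACGT", "TTA"]))
```

Sorting by decreasing length means a fragment only needs to be checked against those already kept, since a string cannot be a proper substring of a shorter one.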

  21. Reconstruction: OMG • Greedy algorithm to draw a Ham. path (p 125) • Collects edges from largest to smallest wt, such that (1) no cycle is formed (union-find), (2) the indegree of each node is <= 1 (the first node has 0), (3) the outdegree of each node is <= 1 (the last node has 0) [It does not return the Ham. path itself. Can you modify it to return the path?] • The algorithm is NOT optimal; example (p 126): it returns wt 3, the optimal wt is 4
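
A Python sketch of the greedy edge selection with union-find, plus one way to turn the chosen edges into ordered paths. The graph representation (a dict from (u, v) to weight) and the function names are assumptions for illustration; this is not the book's pseudocode.

```python
def greedy_edges(nodes, weights):
    """Pick edges from heaviest to lightest, keeping in/outdegree <= 1 and no cycles."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    has_out, has_in, chosen = set(), set(), []
    for (u, v), w in sorted(weights.items(), key=lambda item: -item[1]):
        if u in has_out or v in has_in:
            continue  # would give a node two successors or two predecessors
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # u and v already lie on the same path: edge closes a cycle
        parent[root_u] = root_v
        has_out.add(u)
        has_in.add(v)
        chosen.append((u, v, w))
    return chosen


def edges_to_paths(nodes, chosen):
    """Follow successor links from every node that has no predecessor."""
    successor = {u: v for u, v, _ in chosen}
    has_pred = set(successor.values())
    paths = []
    for start in nodes:
        if start in has_pred:
            continue
        path = [start]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths


nodes = ["a", "b", "c"]
weights = {("a", "b"): 3, ("b", "c"): 2, ("a", "c"): 1,
           ("c", "a"): 0, ("b", "a"): 0, ("c", "b"): 0}
print(edges_to_paths(nodes, greedy_edges(nodes, weights)))
```

When arcs exist between all pairs of nodes (zero-weight arcs included), the selection ends with n - 1 edges and `edges_to_paths` returns a single Hamiltonian path; if zero-weight arcs are left out, it returns one path per contig.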

  22. Reconstruction: OMG • Subinterval: a fragment that can be embedded within another fragment of the set • A subinterval-free and repeat-free graph that is connected at level t has a Ham. path that generates the target string

  23. Reconstruction: OMG • If a repeat exists in the original string, the graph will have a cycle • False positive: fragments from two different portions of the target have a t-overlap • If a cycle exists in the graph, there must be a “false positive” (Thm 4.4, p 129): proof by contradiction, since otherwise the subinterval-free fragments could be totally ordered

  24. Reconstruction: OMG • If there are no repeats in a subinterval-free graph, then there exists a unique Ham. path • If a cycle exists, it does not necessarily come from a repeat

  25. Reconstruction: OMG • Example 4.6 (p 130): the greedy algorithm finds the wrong string, but the Ham. path approach finds the correct one • Greedy does not care about linkage (it optimizes total overlap, aiming at the shortest common superstring) • The Ham. path approach accepts any t-overlap connection: it cares about linkage only

  26. Parameters in aligning for fragment assembly • Score on a column: traditionally {0, -1, -2} in sum-of-pairs • Entropy: sum over characters c (alphabet plus space) of -p_c log p_c, where p_c is the probability of c in the column • All the same character: p_c = 1, entropy = 0 • For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5 • Entropy measures uniformity alone, a better metric
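
The column entropy can be sketched in Python as follows, estimating p_c by the relative frequency of c in the column. The natural logarithm is an arbitrary choice here; the slides do not fix the base.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one layout column; 0 when all characters agree."""
    n = len(column)
    counts = Counter(column)
    return sum(-(k / n) * math.log(k / n) for k in counts.values())


print(column_entropy("aaaaa"), column_entropy("atcg-"), column_entropy("aaaat"))
```

A column with one dissenting character scores strictly between the two extremes, so summing the entropy over columns gives a graded measure of how uniform the layout is.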

  27. Parameters in aligning for fragment assembly • Coverage: how many fragments cover each column? (average, min, max) • This is different from the concept of t-overlap • If a column (of the target) has coverage 0, the layout is disconnected • This is in tension with the requirement of a subinterval-free collection if we expect coverage > 1 for all columns
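
A Python sketch of per-column coverage, assuming each fragment's placement in the layout is already known as a (start, length) pair with 0-based starts. That input format is a hypothetical convenience, not something the slides define.

```python
def column_coverage(placements, target_len):
    """Number of fragments covering each column of the target."""
    # Difference array: +1 where a fragment starts, -1 just past its end.
    diff = [0] * (target_len + 1)
    for start, length in placements:
        diff[start] += 1
        diff[min(start + length, target_len)] -= 1
    coverage, running = [], 0
    for d in diff[:target_len]:
        running += d
        coverage.append(running)
    return coverage


cov = column_coverage([(0, 4), (2, 4), (8, 2)], target_len=10)
print(cov, min(cov), max(cov), sum(cov) / len(cov))
```

In this example the minimum is 0, so the layout splits into two contigs even though the average coverage is 1.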

  28. Parameters in aligning for fragment assembly • Coverage is not enough; we also need good linkage (example: p 133) • The Ham. path algorithm takes care of that

  29. Steps in assembly: • Step 1: Overlap finding • Approximate: deletions, insertions and replacements are allowed • By a semi-global DP algorithm • With an appropriate end-gap penalty • Pairwise, between all fragments and their reverse complements
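
A Python sketch of a semi-global DP for approximate overlap finding: it scores the best alignment of some suffix of `a` with some prefix of `b`. The similarity scores (match +1, mismatch -1, gap -2) are illustrative assumptions; a positive match score is used so that the empty overlap is not trivially optimal.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning a suffix of a against a prefix of b."""
    # Row 0: nothing of a is used yet, so a prefix of b can only face gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [0]  # skipping a prefix of a costs nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # a must be consumed to its end; the unaligned suffix of b is free.
    return max(prev)


print(overlap_score("ACGTTGCA", "TGCAGGT"))
```

In a full overlap phase this score would be computed for every ordered pair of fragments, in both orientations, and pairs scoring below a chosen threshold would be discarded.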

  30. Steps in assembly: • Step 2: Construct the overlap graph over (F union F-bar) for the fragment set F • (after eliminating substrings?) • Construct a Hamiltonian path in this graph • Cycles and unbalanced coverage may indicate repeats

  31. Steps in assembly: • Step 3: Fine-tune the multiple alignment to get a consensus target • Manual or algorithmic • Examples on p 137-138
