This work in progress focuses on reconstructing ancestral DNA by filling the gaps in the DNA sequence through the use of unrooted phylogeny, multiple alignment, and an affine gap cost function.
Reconstructing Ancestral DNA (at least the gaps) Using unrooted phylogeny, multiple alignment, and affine gap cost function. Work in progress.
Overview • Introduction • Examples • Gap Graph construction • Theory • Algorithm • Results • Next steps
Example N3???
Example: N3 = nnn or n-n? • (a) Two long indels (N3 = nnn). • (b) Three short indels (N3 = n-n). • Which is more parsimonious depends on the gap cost function: • Cost of an indel of length k is g(k) = a + b*k
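The trade-off between the two explanations can be made concrete with a small sketch. The indel lengths below (two indels of length 3 versus three of length 1) are assumptions for illustration only; the slides do not state them. The helper names `gap_cost` and `scenario_cost` are likewise hypothetical.

```python
def gap_cost(k, a, b):
    """Affine cost g(k) = a + b*k of a single indel of length k."""
    return a + b * k


def scenario_cost(indel_lengths, a, b):
    """Total cost of an explanation given as a list of indel lengths."""
    return sum(gap_cost(k, a, b) for k in indel_lengths)


# Assumed lengths, not taken from the slides.
two_long = [3, 3]         # explanation (a)
three_short = [1, 1, 1]   # explanation (b)

for a, b in [(5, 3), (10, 1)]:
    cost_a = scenario_cost(two_long, a, b)
    cost_b = scenario_cost(three_short, a, b)
    winner = "(a)" if cost_a < cost_b else "(b)"
    print(f"a={a}, b={b}: (a) costs {cost_a}, (b) costs {cost_b}, cheaper: {winner}")
```

Under these assumed lengths, (a) costs 2a + 6b and (b) costs 3a + 3b, so (a) is cheaper exactly when a > 3b: a high gap-opening cost favours few long indels, a high extension cost favours many short ones.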
Harder Example N8, N9, N10, N11, N12, N13??? Problem: find optimal explanation for gaps in terms of indels.
Gap Representation • Find gap intervals • Create minimal tree covering in each gap interval: minimal number of subtrees with gaps in all leaves
Gap Representation • Find gap intervals • Create minimal tree covering in each gap interval: minimal number of subtrees with gaps in all leaves • Vertex: • subtree with gaps in all leaves • section of alignment
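The two steps above can be sketched as follows. This is a simplified illustration, not the program described in the talk. It assumes that a gap interval is a maximal run of columns with the same set of gapped sequences, and it uses a rooted tree written as nested tuples of leaf indices, whereas the talk works with an unrooted phylogeny. The function names are hypothetical.

```python
def gap_intervals(alignment):
    """Split columns into maximal runs sharing one gap pattern.

    Returns (start, end_exclusive, gapped_rows) triples, where gapped_rows
    is the set of row indices showing '-' throughout the interval.
    """
    n_cols = len(alignment[0])
    patterns = [
        frozenset(i for i, row in enumerate(alignment) if row[c] == "-")
        for c in range(n_cols)
    ]
    intervals, start = [], 0
    for c in range(1, n_cols + 1):
        if c == n_cols or patterns[c] != patterns[start]:
            intervals.append((start, c, patterns[start]))
            start = c
    return intervals


def leaves(tree):
    """Leaf set of a tree given as an int (leaf) or a tuple of subtrees."""
    if isinstance(tree, int):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def minimal_cover(tree, gapped):
    """Maximal subtrees whose leaves are all gapped (rooted simplification)."""
    leaf_set = leaves(tree)
    if leaf_set <= gapped:
        return [leaf_set]          # whole subtree is gapped: one vertex
    if isinstance(tree, int):
        return []                  # ungapped leaf: nothing to cover
    cover = []
    for child in tree:
        cover.extend(minimal_cover(child, gapped))
    return cover


alignment = ["AC--GT", "AC--GT", "A---GT", "ACGTGT"]
tree = ((0, 1), (2, 3))
for start, end, gapped in gap_intervals(alignment):
    cover = [sorted(s) for s in minimal_cover(tree, gapped)]
    print(f"columns [{start}, {end}): gapped rows {sorted(gapped)}, cover {cover}")
```

In this toy input, columns 2 to 3 have gaps in rows 0, 1 and 2, which are covered by two subtrees: the clade (0, 1) and the single leaf 2. Each such subtree, paired with its section of the alignment, would be one vertex.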
Gap Graph Construction 3. Create connections between neighbors v and w if one is contained in the other.
What is a vertex? Either one indel created all gaps in the subtree, or the vertex (subtree) is decomposed into several indels. Algorithm goal: confirm or decompose vertices using the gap cost function.
Flashback: ~ Jotun’s Algorithm This example can be solved optimally: using a=5, b=3, all vertices are confirmed, i.e., all gaps are created ‘as high as possible’ in the tree.
Horrific Counter Example At first sight: confirm all vertices, giving 6 indels. BUT: a solution with 5 indels can be found! Depending on the gap cost function, this may be cheaper. Thus the first solution may not be optimal. Problem: the indel (2) is invisible. (0,1) (1,2,3,4) (0,1,2,3)
New Type of Connection Needed! 3. Create connections between neighbors v and w if they share leaves (previously: only if one is contained in the other). - The indel (2) lies in the intersection of the cousins. (0,1) (1,2,3,4) (0,1,2,3)
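A minimal sketch of the revised connection rule follows. It assumes that the tuples on the slide are leaf sets and that "neighbors" means vertices in adjacent sections of the alignment. The left-to-right arrangement of the three vertices below is hypothetical, since the figure layout is not recoverable from the text, and the function names are illustrative.

```python
def connection_type(v_leaves, w_leaves):
    """Classify the connection between two neighboring vertices.

    Returns None if they share no leaves, "containment" if one leaf set
    contains the other, and "cousin" if they overlap only partially.
    """
    v, w = frozenset(v_leaves), frozenset(w_leaves)
    if not v & w:
        return None
    if v <= w or w <= v:
        return "containment"
    return "cousin"


def build_connections(sections):
    """Connect vertices of adjacent sections whenever they share leaves.

    sections: list, in alignment order, of lists of leaf sets (vertices).
    """
    edges = []
    for i in range(len(sections) - 1):
        for v in sections[i]:
            for w in sections[i + 1]:
                kind = connection_type(v, w)
                if kind is not None:
                    edges.append((i, tuple(sorted(v)), tuple(sorted(w)), kind))
    return edges


# Hypothetical arrangement of the three vertices named on the slide.
sections = [[{0, 1}], [{0, 1, 2, 3}], [{1, 2, 3, 4}]]
for edge in build_connections(sections):
    print(edge)
```

With the old containment-only rule, the pair (0,1,2,3) and (1,2,3,4) would stay unconnected, so an indel on leaf 2 spanning both sections could not be represented. The cousin connection makes their intersection {1, 2, 3} visible.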
Now The(st)ory Begins By construction of the gap graph, we can prove two theorems: Theorem 1 Each optimal indel either corresponds directly to a vertex, or it crosses a cousin connection. Only possible optimal indels: (0,1) (3) (0,1,2,3) (1,2,3,4) (1) (4) (2) (1,2,3,4) (0,1) (0,1,2,3)
Now Theory Begins By construction of the gap graph, we can prove two theorems: Theorem 2 If a vertex v is decomposed in the optimal solution, all decomposing indels extend beyond v’s section of the alignment, and they do not all extend in the same direction. Thus we have to decompose none or both of (0,1,2,3) and (1,2,3,4): otherwise (2) doesn’t extend beyond the region of (0,1,2,3) (1,2,3,4) (0,1,2,3)
Now Theory Begins From the theorems we can prove some lemmas: 1: Leaf vertices can be confirmed. 2: Orphans / end vertices can be confirmed. 3: Patriarchs can be confirmed and trimmed.
Solving Earlier Example 1: Leaf vertices can be confirmed. 2: Orphans / end vertices can be confirmed. 3: Patriarchs can be confirmed and trimmed. 4: Mono-chain vertices can be decided locally.
End of Pre-Processing • In longer examples there will be undecided vertices (purple) after pre-processing. • Find possible decompositions for each vertex and check all combinations in each chain
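Checking all combinations in a chain can be sketched as an exhaustive search over one choice per undecided vertex. The cost function below is a toy stand-in, not the real affine gap cost over the gap graph: it charges a made-up base cost per choice plus a penalty when adjacent vertices choose differently, loosely mimicking the "decompose none or both" coupling of Theorem 2. All names and numbers are hypothetical.

```python
from itertools import product
from math import prod


def n_combinations(options):
    """Number of combinations in a chain: product of per-vertex option counts."""
    return prod(len(opts) for opts in options)


def best_combination(options, chain_cost):
    """Try every combination of per-vertex choices and return the cheapest."""
    best = min(product(*options), key=chain_cost)
    return best, chain_cost(best)


def toy_cost(choice):
    """Toy chain cost (an assumption, not the cost used in the talk)."""
    base = {"confirm": 8, "decompose": 7}
    cost = sum(base[c] for c in choice)
    # Penalise neighbors that disagree, to couple the decisions in the chain.
    cost += sum(5 for x, y in zip(choice, choice[1:]) if x != y)
    return cost


# A chain of three undecided vertices, each with two options.
options = [["confirm", "decompose"]] * 3
print(n_combinations(options))
print(best_combination(options, toy_cost))
```

The search is exponential in the number of undecided vertices per chain, which is why reducing that number during pre-processing matters.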
9 sequences, 60% gaps, pre-processing time < 4 s • Alignment length 3936, divided into 3922 gap intervals. • 1497 vertices undecided before trimming. • 1112 vertices undecided after trimming. • Created 8912 vertices, 871 connections. • Confirmed: 5469 leaf vertices, 2285 patriarchs, 210 end vertices, 217 locally confirmed non-cousin chain vertices, 37 locally confirmed cousin chain vertices, and 487 mono-chain decomposed vertices. • 207 vertices undecided after all pre-processing. • Chains with undecided vertices: 89; max undecided in same chain (C31): 7. • Estimated number of combinations: 2788; max in same chain: 1152.
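As a quick consistency check on the reported counts, the decided vertices can be subtracted from the total. The numbers are copied from the output above; the variable names are illustrative.

```python
created = 8912

# Vertices decided during pre-processing, as reported in the program output.
decided = {
    "leaf vertices": 5469,
    "patriarchs": 2285,
    "end vertices": 210,
    "locally confirmed non-cousin chain vertices": 217,
    "locally confirmed cousin chain vertices": 37,
    "mono-chain decomposed vertices": 487,
}

total_decided = sum(decided.values())
undecided = created - total_decided
print(f"decided: {total_decided}, undecided: {undecided}")
```

The six categories add up to 8705 decided vertices, leaving 8912 - 8705 = 207, which matches the reported number of undecided vertices.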
Is Pre-Processing Important? 9 sequences, 60% gaps. • No pre-processing: created 10082 vertices, 7121 connections; 1497 vertices undecided; chains with undecided vertices: 950; max undecided in same chain (C40): 10; estimated number of combinations: 71950, max in same chain: 34560. • With pre-processing: created 8912 vertices, 871 connections; 207 vertices undecided; chains with undecided vertices: 89; max undecided in same chain (C31): 7; estimated number of combinations: 2788, max in same chain: 1152.
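The effect of pre-processing can be summarised as reduction factors. The figures are copied from the two runs above; the dictionary keys and the `reduction` helper are illustrative names.

```python
def reduction(before, after):
    """Factor by which a quantity shrinks from `before` to `after`."""
    return before / after


no_preprocessing = {
    "connections": 7121,
    "undecided vertices": 1497,
    "chains with undecided": 950,
    "estimated combinations": 71950,
}
with_preprocessing = {
    "connections": 871,
    "undecided vertices": 207,
    "chains with undecided": 89,
    "estimated combinations": 2788,
}

for key, before in no_preprocessing.items():
    after = with_preprocessing[key]
    print(f"{key}: {before} -> {after} ({reduction(before, after):.1f}x fewer)")
```

On this data set, pre-processing cuts the undecided vertices by a factor of about 7 and the estimated number of combinations by a factor of about 26.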
Next Steps • Make poster for Recomb (suggestions?) • Finish program • Run it on real data • Ideas for applications? (Score ranks alignment – use to find alignment..) • Demo