Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen stinus@diku.dk
Overview: What is a phylogeny? The Generalized Tree Alignment problem Sequence Graphs and their algorithms The Deferred Path Heuristic
Phylogeny: Describes evolutionary model • Common ancestor • Mutations happen all the time • Insertions, deletions, substitutions, translocations, inversions, duplications … Most mutations happen in DNA replication • Corrected by cell mechanisms Mutations accumulate → new species diverge Only mutations in sex cells are inherited (obviously)
Phylogeny: Phylogenetic inference: Given n sequences build a phylogenetic tree Most methods base T on a multiple alignment Likewise: Multiple alignments often based on guide trees Can we solve both problems at the same time?
Phylogeny: Describes the evolutionary relationship between species Notice root
Phylogeny: ... or within a single taxon (here, human enterovirus 71)
The Problem: Given n sequences s1,…,sn … Multiple Alignment: Make an ordering A of the sequences by inserting gaps such that homologous bases are put in the same column Phylogenetic Inference: Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors sn+1,…,sn+k in internal nodes describing their evolutionary connection
Generalized Tree Alignment: Combines the two. The problem we want to solve is: Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA) Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences Placing the root is not trivial and is best left to biologists.
The given problem is proven to be MAX SNP-hard (Wang and Jiang, 1994) → No polynomial-time approximation scheme exists unless P = NP. Exact solutions to NP-hard problems are intractable → In practice we settle for a heuristic The given algorithm runs in time O(n2.ln) • n: The number of sequences • l: Their maximum length.
Sequence graphs (Hein, 1989): Recall pairwise alignment. Traceback "spells" possible optimal alignments:
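As a reminder of how traceback spells out optimal alignments, here is a minimal sketch of pairwise alignment by dynamic programming. The unit costs (mismatch 1, gap 1) and the function names are assumptions chosen for illustration; the slides do not state a scoring scheme.

```python
def alignment_table(p, q, mismatch=1, gap=1):
    """D[i][j] = minimum cost of aligning p[:i] with q[:j] (assumed unit costs)."""
    D = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        D[i][0] = i * gap
    for j in range(1, len(q) + 1):
        D[0][j] = j * gap
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,  # align p[i-1] with q[j-1]
                          D[i - 1][j] + gap,      # p[i-1] against a gap
                          D[i][j - 1] + gap)      # q[j-1] against a gap
    return D


def optimal_alignments(p, q, mismatch=1, gap=1):
    """Return (cost, list of all optimal alignments) by exhaustive traceback."""
    D = alignment_table(p, q, mismatch, gap)

    def back(i, j):
        if i == 0 and j == 0:
            return [("", "")]
        out = []
        if i > 0 and j > 0:
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + sub:
                out += [(a + p[i - 1], b + q[j - 1]) for a, b in back(i - 1, j - 1)]
        if i > 0 and D[i][j] == D[i - 1][j] + gap:
            out += [(a + p[i - 1], b + "-") for a, b in back(i - 1, j)]
        if j > 0 and D[i][j] == D[i][j - 1] + gap:
            out += [(a + "-", b + q[j - 1]) for a, b in back(i, j - 1)]
        return out

    return D[len(p)][len(q)], back(len(p), len(q))
```

Every traceback path that stays on cost-optimal cells yields one alignment, so a pair of short sequences can already have several equally good alignments. The exhaustive traceback is exponential in the worst case and is only meant for small examples.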
Sequence graphs: Make graph with alignment columns as edge labels → represents all optimal alignments We will get back to that shortly … Right now, we want to represent sequences Let us introduce sequence graphs. For instance, s = ACTGTA is represented by:
Sequence graphs: More formally: • Directed, acyclic graph. • Edge labels l from alphabet Σ. Here, Σ={A,C,G,T,-} • Source s: The unique node with no incoming edges • Sink t: The unique node with no outgoing edges. • Each path from s to t spells a sequence.
Sequence graphs: Represents a set of sequences given by all paths from s to t:
Sequence graphs: Any single sequence can be represented by a linear sequence graph Any set of k sequences can be represented by making k paths from s to t A given sequence s’ can be represented by more than one path We can now represent sequences – but can we align them?
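The representation described above can be sketched as a plain list of labelled edges. The encoding (integer nodes, source fixed at 0, sink fixed at 1) and all function names are assumptions made for this illustration, and the sketch assumes non-empty sequences.

```python
from collections import defaultdict

SOURCE, SINK = 0, 1  # assumed node ids for the unique source and sink


def graph_from_sequences(seqs):
    """Build one source-to-sink path per sequence; edges are (u, v, label)."""
    edges, next_node = [], 2
    for seq in seqs:
        prev = SOURCE
        for k, ch in enumerate(seq):
            if k == len(seq) - 1:
                node = SINK  # last letter of the sequence leads into the sink
            else:
                node, next_node = next_node, next_node + 1
            edges.append((prev, node, ch))
            prev = node
    return edges


def linear_graph(seq):
    """A single sequence is the special case of exactly one path."""
    return graph_from_sequences([seq])


def spelled_sequences(edges, source=SOURCE, sink=SINK):
    """Set of all sequences spelled by paths from source to sink."""
    out = defaultdict(list)
    for u, v, label in edges:
        out[u].append((v, label))

    def walk(node):
        if node == sink:
            return {""}
        return {label + rest for v, label in out[node] for rest in walk(v)}

    return walk(source)
```

For instance, `spelled_sequences(linear_graph("ACTGTA"))` gives back the single sequence ACTGTA. The k-path construction shares no internal nodes, so it is the least compact encoding; the point of sequence graphs is that paths may share edges.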
Aligning sequence graphs: Dynamic programming algorithm inspired by basic Pairwise Alignment: • Given two sequences p and q • Move one letter in p and move through q finding the optimal "partial alignments" Sequence Graphs: • Given two sequence graphs G1 and G2 • We can have many outgoing edges to choose from
Aligning sequence graphs: Fill in a |V1|*|V2| score matrix For each pair of nodes i from G1 and j from G2: Should we: • Align the two characters we got by following e1 into i and e2 into j? • Stay in G1 and only move in G2? • Stay in G2 and only move in G1? • Or have we already found a better path into i and j?
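The choices listed above can be sketched as a dynamic program over pairs of nodes, visited in topological order. This is an illustration under assumptions, not the presented implementation: unit costs (mismatch 1, gap 1) are assumed, graphs are assumed to be lists of (u, v, label) edges, gap labels already present in a graph get no special treatment, and only the optimal cost is returned (no traceback).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def topo_order(edges):
    """Nodes of a DAG ordered so that every edge goes from earlier to later."""
    ts = TopologicalSorter()
    for u, v, _ in edges:
        ts.add(v, u)  # u must come before v
    return list(ts.static_order())


def align_graphs(edges1, src1, sink1, edges2, src2, sink2, mismatch=1, gap=1):
    """Minimum alignment cost over all pairs of source-to-sink paths."""
    into1, into2 = {}, {}
    for u, v, a in edges1:
        into1.setdefault(v, []).append((u, a))
    for u, v, b in edges2:
        into2.setdefault(v, []).append((u, b))

    order2 = topo_order(edges2)
    D = {}
    for i in topo_order(edges1):
        for j in order2:
            if i == src1 and j == src2:
                D[i, j] = 0
                continue
            best = float("inf")
            for u, a in into1.get(i, []):
                best = min(best, D[u, j] + gap)  # move in G1 only
                for w, b in into2.get(j, []):
                    # align the characters on edge u->i and edge w->j
                    best = min(best, D[u, w] + (0 if a == b else mismatch))
            for w, b in into2.get(j, []):
                best = min(best, D[i, w] + gap)  # move in G2 only
            D[i, j] = best
    return D[sink1, sink2]
```

With two linear graphs this reduces to ordinary pairwise alignment; the only change for general graphs is that each cell considers every incoming edge of i and of j rather than a single predecessor.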
Optimal Alignment Graphs: Now we need a way to remember the optimal alignments Recall graphs from before: • Directed, acyclic graphs • Nodes s and t defined as before • Edge labels of the form [la,lb] where la,lb∊Σ Backtrack through the matrix and consider each possible combination of edges.
Optimal Alignment Graphs: An example of an optimal alignment graph (OAG): This one represents the alignments: We denote such a graph A* We have to convert the OAGs back to sequence graphs (SGs)
Optimal Alignment Graphs: This is done easily by considering the edge labels: If la = lb: Make a single edge in the SG with label la If la ≠ lb: Make two edges in the SG: One with label la and one with label lb The graph from before turns into the SG:
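The conversion rule is small enough to sketch directly. The edge encoding (u, v, (la, lb)) and the function name are assumptions for this illustration.

```python
def oag_to_sg(oag_edges):
    """Turn OAG edges (u, v, (la, lb)) into SG edges (u, v, label).

    One edge is produced when la == lb, two edges otherwise.
    """
    sg = set()
    for u, v, (la, lb) in oag_edges:
        sg.add((u, v, la))
        sg.add((u, v, lb))  # adds nothing new when la == lb
    return sorted(sg)
```

Using a set means an edge that arises from several alignment columns appears only once in the resulting sequence graph.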
Summing up Sequence Graphs: Final graph represents all sequences giving an optimal alignment between G1 and G2 We can: • Represent a set of sequences by a sequence graph • Align two such graphs producing a new SG We can now get on with the main algorithm
The basic idea: • Start by comparing all sequences • Find a closest pair. • Represent all sequences giving the optimal solution • Defer the choice of a single sequence • Repeat, but this time include the set of sequences • In the end: Choose a single sequence and backtrack This shows a need for: • A compact representation of many sequences • An algorithm for aligning sets of sequences
The Deferred Path Heuristic: Similar to Kruskal’s algorithm for finding minimum spanning trees (MSTs): From sequences s1,…,sn, initialize n SGs G1,…,Gn. Until only two SGs remain: • Align all pairs and choose a closest pair Gi and Gj • Create A*(Gi,Gj) and convert A* into a SG Gk. • Replace Gi and Gj with Gk Note that we remember all candidate sequences
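The merge loop above can be sketched with the graph operations passed in as functions. Everything here is a simplified illustration: `distance` and `merge` are hypothetical placeholders for aligning two sequence graphs and for building A* and converting it to an SG. The stand-ins in the demo (sets of equal-length strings, Hamming distance, set union) are assumptions that only exercise the loop and are not the real graph operations.

```python
from itertools import combinations


def merge_until_two(graphs, distance, merge):
    """Greedily merge the closest pair until two clusters remain.

    Returns a list of (graph, tree) pairs, where tree records the merge
    history as nested tuples of input indices.
    """
    clusters = [(g, i) for i, g in enumerate(graphs)]
    while len(clusters) > 2:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: distance(clusters[ab[0]][0], clusters[ab[1]][0]),
        )
        (ga, ta), (gb, tb) = clusters[a], clusters[b]
        merged = (merge(ga, gb), (ta, tb))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


# Demo stand-ins (assumptions): a "graph" is a frozenset of equal-length strings.
def hamming(x, y):
    return sum(c != d for c, d in zip(x, y))


def set_distance(g1, g2):
    return min(hamming(x, y) for x in g1 for y in g2)


def set_union(g1, g2):
    return g1 | g2


seqs = ["AAAA", "AAAT", "GGGG", "GGGC"]
result = merge_until_two([frozenset([s]) for s in seqs], set_distance, set_union)
trees = [tree for _, tree in result]
```

Each round recomputes all pairwise distances, which keeps the sketch short; caching distances between untouched clusters would avoid repeated alignments.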
The Deferred Path Heuristic: When only two SGs Gi and Gj remain: • Align them and connect them in T • Choose some optimal alignment • This gives si and sj at the roots of the two subtrees. • Backtrack through the subtrees • At each step: Align sk to the underlying SGs. • Choose some optimal alignment
The Deferred Path Heuristic: We defer our choice of actual sequences until the last moment, thereby enlarging our solution space: