
Generalized Tree Alignment: The Deferred Path Heuristic

Stinus Lindgreen, stinus@diku.dk. Overview: What is a phylogeny? The Generalized Tree Alignment problem. Sequence graphs and their algorithms. The Deferred Path Heuristic.


Presentation Transcript


  1. Generalized Tree Alignment: The Deferred Path Heuristic. Stinus Lindgreen, stinus@diku.dk

  2. Overview: • What is a phylogeny? • The Generalized Tree Alignment problem • Sequence graphs and their algorithms • The Deferred Path Heuristic

  3. Phylogeny: Describes an evolutionary model. • Common ancestor • Mutations happen all the time • Insertions, deletions, substitutions, translocations, inversions, duplications, … Most mutations happen during DNA replication. • Corrected by cell mechanisms. Mutations accumulate → new species diverge. Only mutations in sex cells are inherited (obviously).

  4. Phylogeny: Phylogenetic inference: given n sequences, build a phylogenetic tree T. Most methods base T on a multiple alignment. Likewise, multiple alignments are often based on guide trees. Can we solve both problems at the same time?

  5. Phylogeny: Describes the evolutionary relationship between species. Notice the root.

  6. Phylogeny: ... or within a single taxon (here, human enterovirus 71).

  7. The Problem: Given n sequences s_1,…,s_n … Multiple Alignment: Arrange the sequences into an alignment A by inserting gaps such that homologous bases are put in the same column. Phylogenetic Inference: Build a (binary) tree T with s_1,…,s_n in the leaves and possible ancestors s_{n+1},…,s_{n+k} in the internal nodes, describing their evolutionary connection.

  8. Generalized Tree Alignment: Combines the two. The problem we want to solve is: Given: A set of n sequences s_1,…,s_n from n different species (could be DNA, RNA or protein; for simplicity we focus on DNA). Problem: Generate an unrooted phylogenetic tree T with sequences s_1,…,s_n in the leaves and a multiple alignment A of these sequences. Placing the root is not trivial and is best left to biologists.

  9. The given problem is proven to be MAX SNP-hard (Wang and Jiang, 1994) → No polynomial-time approximation scheme exists unless P = NP. Exact solutions to NP-hard problems are intractable → The best we can hope for is a heuristic. The given algorithm runs in time O(n2.ln) • n: The number of sequences • l: Their maximum length.

  10. Sequence graphs (Hein, 1989): Recall pairwise alignment. The traceback "spells" the possible optimal alignments:
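
To make the recalled idea concrete, here is a minimal sketch of pairwise alignment in which the traceback follows every optimal predecessor, so it spells all optimal alignments rather than just one. The unit mismatch and gap costs and the function name `align_all` are assumptions chosen for illustration; the slides do not specify a cost scheme.

```python
def align_all(p, q, mismatch=1, gap=1):
    """Return (optimal cost, list of all optimal alignments) for p and q.

    Assumed cost scheme: match 0, mismatch 1, gap 1 (not from the slides).
    """
    n, m = len(p), len(q)
    # D[i][j] = minimal cost of aligning p[:i] with q[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap

    def sub(i, j):
        # cost of aligning p[i-1] against q[j-1]
        return 0 if p[i - 1] == q[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub(i, j),
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)

    def traceback(i, j):
        # Follow every predecessor that achieves the optimum at (i, j).
        if i == 0 and j == 0:
            return [("", "")]
        out = []
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub(i, j):
            out += [(a + p[i - 1], b + q[j - 1]) for a, b in traceback(i - 1, j - 1)]
        if i > 0 and D[i][j] == D[i - 1][j] + gap:
            out += [(a + p[i - 1], b + "-") for a, b in traceback(i - 1, j)]
        if j > 0 and D[i][j] == D[i][j - 1] + gap:
            out += [(a + "-", b + q[j - 1]) for a, b in traceback(i, j - 1)]
        return out

    return D[n][m], traceback(n, m)


cost, alignments = align_all("AG", "GA")
for top, bottom in alignments:
    print(top, bottom)
```

The number of optimal alignments can grow quickly with sequence length, which is one motivation for storing them compactly in a graph instead of listing them.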

  11. Sequence graphs: Make a graph with alignment columns as edge labels → represents all optimal alignments. We will get back to that shortly … Right now, we want to represent sequences. Let us introduce sequence graphs. For instance, s = ACTGTA is represented by:

  12. Sequence graphs: More formally: • Directed, acyclic graph. • Edge labels l from alphabet Σ. Here, Σ = {A,C,G,T,-}. • Source s: The unique node with no incoming edges. • Sink t: The unique node with no outgoing edges. • Each path from s to t spells a sequence.
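
The definition above can be sketched as a small data structure. This is an illustrative sketch only: the class name `SequenceGraph`, the helper `linear_graph`, and the choice that a "-" edge contributes no character to the spelled sequence are assumptions, not details taken from the slides.

```python
from collections import defaultdict


class SequenceGraph:
    """Directed acyclic graph whose source-to-sink paths each spell a sequence."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.edges = defaultdict(list)  # node -> list of (label, next_node)

    def add_edge(self, u, label, v):
        self.edges[u].append((label, v))

    def spelled(self):
        """Return the set of sequences spelled by all source-to-sink paths."""
        results = set()
        stack = [(self.source, "")]
        while stack:
            node, prefix = stack.pop()
            if node == self.sink:
                results.add(prefix)
                continue
            for label, nxt in self.edges[node]:
                # Assumption: a gap edge "-" spells nothing.
                stack.append((nxt, prefix + ("" if label == "-" else label)))
        return results


def linear_graph(seq):
    """Build the linear sequence graph for one sequence (nodes 0..len(seq))."""
    g = SequenceGraph(0, len(seq))
    for i, ch in enumerate(seq):
        g.add_edge(i, ch, i + 1)
    return g


g = linear_graph("ACTGTA")
g.add_edge(1, "G", 2)  # alternative label for the second position
print(sorted(g.spelled()))
```

Enumerating paths is exponential in the worst case; it is used here only to show what set of sequences a graph stands for.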

  13. Sequence graphs: A sequence graph represents the set of sequences given by all paths from s to t:

  14. Sequence graphs: Any single sequence can be represented by a linear sequence graph. Any set of k sequences can be represented by making k paths from s to t. A given sequence s' can be represented by more than one path. We can now represent sequences, but can we align them?
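
As a small illustration of the "k paths" construction, the sketch below builds one path per input sequence, sharing only the source and sink. The edge-list representation, the node names "s" and "t", and the function names are hypothetical choices for this example; the sketch also assumes every sequence is non-empty.

```python
def graph_from_sequences(seqs):
    """Return an edge list (u, label, v) with one s-to-t path per sequence."""
    edges = []
    for k, seq in enumerate(seqs):
        if not seq:
            raise ValueError("this sketch assumes non-empty sequences")
        prev = "s"
        for i, ch in enumerate(seq):
            # Inner nodes are private to path k; the last edge enters the sink.
            nxt = "t" if i == len(seq) - 1 else (k, i)
            edges.append((prev, ch, nxt))
            prev = nxt
    return edges


def spelled_paths(edges, source="s", sink="t"):
    """List the sequence spelled by every source-to-sink path (with repeats)."""
    out = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))
    results = []
    stack = [(source, "")]
    while stack:
        node, prefix = stack.pop()
        if node == sink:
            results.append(prefix)
            continue
        for label, nxt in out.get(node, []):
            stack.append((nxt, prefix + label))
    return results


edges = graph_from_sequences(["ACG", "AG", "ACG"])
print(sorted(spelled_paths(edges)))
```

Because "ACG" was supplied twice, it is spelled by two distinct paths, which matches the remark that a sequence can be represented by more than one path.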

  15. Aligning sequence graphs: Dynamic programming algorithm inspired by basic pairwise alignment. Pairwise Alignment: • Given two sequences p and q • Move one letter in p and move through q, finding the optimal "partial alignments". Sequence Graphs: • Given two sequence graphs G_1 and G_2 • We can have many outgoing edges to choose from.

  16. Aligning sequence graphs: Fill in a |V_1| × |V_2| score matrix. For each pair of nodes i from G_1 and j from G_2: Should we: • Align the two characters we got by following e_1 into i and e_2 into j? • Stay in G_1 and only move in G_2? • Stay in G_2 and only move in G_1? • Or have we already found a better path into i and j?
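
The matrix fill described above can be sketched as follows. The sketch makes several assumptions that are not in the slides: nodes are integers numbered in topological order (every edge goes from a smaller to a larger number, source 0, sink the last node), costs are unit mismatch and gap costs, and aligning two gap labels costs nothing. The names `cost` and `align_graphs` are hypothetical.

```python
def cost(a, b, mismatch=1, gap=1):
    """Assumed column cost: equal labels 0, one-sided gap 1, mismatch 1."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return gap
    return mismatch


def align_graphs(n1, edges1, n2, edges2):
    """Fill the n1 x n2 score matrix for two sequence graphs.

    edges are (u, label, v) with u < v; D[i][j] is the cheapest alignment of
    some path source->i in the first graph with some path source->j in the second.
    """
    inf = float("inf")
    incoming1 = [[] for _ in range(n1)]
    for u, a, v in edges1:
        incoming1[v].append((u, a))
    incoming2 = [[] for _ in range(n2)]
    for u, b, v in edges2:
        incoming2[v].append((u, b))

    D = [[inf] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            if i == 0 and j == 0:
                continue
            best = inf
            for u, a in incoming1[i]:
                # move only in the first graph: label a against a gap
                best = min(best, D[u][j] + cost(a, "-"))
                for v, b in incoming2[j]:
                    # move in both graphs: align the two edge labels
                    best = min(best, D[u][v] + cost(a, b))
            for v, b in incoming2[j]:
                # move only in the second graph: gap against label b
                best = min(best, D[i][v] + cost("-", b))
            D[i][j] = best
    return D


# First graph spells {AC, AG}; second graph is the linear graph for AG.
g1 = [(0, "A", 1), (1, "C", 2), (1, "G", 2)]
g2 = [(0, "A", 1), (1, "G", 2)]
D = align_graphs(3, g1, 3, g2)
print(D[2][2])  # score at (sink, sink)
```

With linear graphs on both sides this reduces to ordinary pairwise alignment; the only change for general graphs is that each cell takes the minimum over all incoming edge combinations.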

  17. Optimal Alignment Graphs: Now we need a way to remember the optimal alignments. Recall the graphs from before: • Directed, acyclic graphs • Nodes s and t defined as before • Edge labels of the form [l_a, l_b] where l_a, l_b ∈ Σ. Backtrack through the matrix and consider each possible combination of edges.

  18. Optimal Alignment Graphs: An example of an optimal alignment graph (OAG): This one represents the alignments: We denote such a graph A*. We have to convert the OAGs back to sequence graphs (SGs).

  19. Optimal Alignment Graphs: This is done easily by considering the edge labels: If l_a = l_b: Make a single edge in the SG with label l_a. If l_a ≠ l_b: Make two edges in the SG: one with label l_a and one with label l_b. The graph from before turns into the SG:
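
The conversion rule is short enough to write out directly. In this sketch an OAG is assumed to be a list of edges `(u, (la, lb), v)` and the resulting SG a list of edges `(u, label, v)`; this representation and the function name `oag_to_sg` are illustrative assumptions.

```python
def oag_to_sg(oag_edges):
    """Convert optimal-alignment-graph edges into sequence-graph edges."""
    sg_edges = []
    for u, (la, lb), v in oag_edges:
        if la == lb:
            # Both sequences agree in this column: one edge suffices.
            sg_edges.append((u, la, v))
        else:
            # The column differs: keep both labels as parallel edges.
            sg_edges.append((u, la, v))
            sg_edges.append((u, lb, v))
    return sg_edges


oag = [(0, ("A", "A"), 1), (1, ("C", "G"), 2), (2, ("T", "-"), 3)]
print(oag_to_sg(oag))
```

Nodes are carried over unchanged, so the resulting SG has the same shape as the OAG with some edges doubled.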

  20. Summing up Sequence Graphs: The final graph represents all sequences giving an optimal alignment between G_1 and G_2. We can: • Represent a set of sequences by a sequence graph • Align two such graphs, producing a new SG. We can now get on with the main algorithm.

  21. The basic idea: • Start by comparing all sequences • Find a closest pair • Represent all sequences giving the optimal solution • Defer the choice of a single sequence • Repeat, but this time include the set of sequences • In the end: Choose a single sequence and backtrack. This shows a need for: • A compact representation of many sequences • An algorithm for aligning sets of sequences

  22. The Deferred Path Heuristic: Similar to Kruskal's algorithm for finding MSTs: From sequences s_1,…,s_n, initialize n SGs G_1,…,G_n. Until only two SGs remain: • Align all pairs and choose a closest pair G_i and G_j • Create A*(G_i, G_j) and convert A* into an SG G_k • Replace G_i and G_j with G_k. Note that we remember all candidate sequences.
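
The greedy loop above can be sketched independently of how graphs are aligned. In the sketch below, `distance` and `merge` are caller-supplied stand-ins for "align two SGs" and "build A* and convert it to an SG"; the demo uses plain sets of equal-length strings, minimum Hamming distance, and set union, which are simplifications chosen only to keep the example short and are not the method from the slides. The function name `deferred_path_tree` is hypothetical, and at least two inputs are assumed.

```python
from itertools import combinations


def deferred_path_tree(items, distance, merge):
    """Greedily join the closest pair until two clusters remain.

    Returns the unrooted tree as a pair of nested tuples of input indices.
    """
    clusters = {i: item for i, item in enumerate(items)}
    subtree = {i: i for i in clusters}
    next_id = len(items)
    while len(clusters) > 2:
        # Compare all pairs and pick a closest one.
        i, j = min(combinations(clusters, 2),
                   key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]))
        # Replace the pair by the merged representation of its candidates.
        clusters[next_id] = merge(clusters.pop(i), clusters.pop(j))
        subtree[next_id] = (subtree.pop(i), subtree.pop(j))
        next_id += 1
    a, b = clusters  # the two remaining cluster ids
    return (subtree[a], subtree[b])


def hamming(x, y):
    return sum(c != d for c, d in zip(x, y))


def set_distance(A, B):
    # Stand-in for SG alignment: best distance over all candidate pairs.
    return min(hamming(x, y) for x in A for y in B)


leaves = [{"AAAA"}, {"AAAT"}, {"GGGG"}, {"GGGC"}]
tree = deferred_path_tree(leaves, set_distance, lambda A, B: A | B)
print(tree)  # ((0, 1), (2, 3))
```

Keeping a set of candidates per cluster, instead of committing to one sequence at each join, is the part of the loop that corresponds to deferring the choice.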

  23. The Deferred Path Heuristic: When only two SGs G_i and G_j remain: • Align them and connect them in T • Choose some optimal alignment • This gives s_i and s_j in the roots of the two subtrees • Backtrack through the subtrees • At each step: Align s_k to the underlying SGs • Choose some optimal alignment

  24. The Deferred Path Heuristic: We defer our choice of actual sequences until the last moment, thereby enlarging our solution space:
