Generalized Tree Alignment: The Deferred Path Heuristic

Stinus Lindgreen, stinus@diku.dk

Overview: What is a phylogeny? The Generalized Tree Alignment problem Sequence Graphs and their algorithms The Deferred Path Heuristic

Phylogeny: Describes an evolutionary model • Common ancestor • Mutations happen all the time • Insertions, deletions, substitutions, translocations, inversions, duplications … Most mutations happen during DNA replication • Corrected by cell mechanisms. Mutations accumulate → new species diverge. Only mutations in sex cells are inherited.

Phylogeny: Phylogenetic inference: Given n sequences, build a phylogenetic tree T. Most methods base T on a multiple alignment. Likewise, multiple alignments are often based on guide trees. Can we solve both problems at the same time?

Phylogeny: Describes the evolutionary relationship between species Notice root

Phylogeny: ... or within a single taxon (here, human enterovirus 71)

The Problem: Given n sequences s1,…,sn … Multiple Alignment: Make an ordering A of the sequences by inserting gaps such that homologous bases are put in the same column. Phylogenetic Inference: Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors sn+1,…,sn+k in the internal nodes, describing their evolutionary connection.

Generalized Tree Alignment: Combines the two. The problem we want to solve is: Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein; for simplicity we focus on DNA). Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences. Placing the root is not trivial and is best left to biologists.

The given problem is proven to be MAX SNP-hard (Wang and Jiang, 1994) → No polynomial-time approximation scheme exists unless P = NP. Exact solutions to NP-hard problems are intractable → The best we can hope for is a heuristic. The given algorithm runs in time O(n^2 · l^n) • n: The number of sequences • l: Their maximum length.

Sequence graphs (Hein, 1989): Recall pairwise alignment. Traceback "spells" possible optimal alignments:
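As a reminder of how traceback spells every optimal alignment, here is a minimal Python sketch. The slides do not state a scoring scheme, so the unit mismatch and gap costs are an assumption, and the function name `optimal_alignments` is hypothetical.

```python
def optimal_alignments(p, q, mismatch=1, gap=1):
    """Return (cost, alignments) for all optimal global alignments of p and q.

    Scoring is an assumption: match 0, mismatch 1, gap 1 (minimised).
    """
    rows, cols = len(p) + 1, len(q) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            d[i][j] = min(d[i - 1][j - 1] + sub,  # align two characters
                          d[i - 1][j] + gap,      # gap in q
                          d[i][j - 1] + gap)      # gap in p

    def walk(i, j):
        # Follow every predecessor cell that explains the value of d[i][j].
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            if d[i][j] == d[i - 1][j - 1] + sub:
                for a, b in walk(i - 1, j - 1):
                    yield a + p[i - 1], b + q[j - 1]
        if i > 0 and d[i][j] == d[i - 1][j] + gap:
            for a, b in walk(i - 1, j):
                yield a + p[i - 1], b + "-"
        if j > 0 and d[i][j] == d[i][j - 1] + gap:
            for a, b in walk(i, j - 1):
                yield a + "-", b + q[j - 1]

    return d[len(p)][len(q)], sorted(walk(len(p), len(q)))
```

Under these assumed costs, aligning AG with GA gives three co-optimal alignments, which is exactly the kind of ambiguity a graph of alignment columns is meant to capture.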

Sequence graphs: Make a graph with alignment columns as edge labels → represents all optimal alignments. We will get back to that shortly … Right now, we want to represent sequences. Let us introduce sequence graphs. For instance, s = ACTGTA is represented by:

Sequence graphs: More formally: • Directed, acyclic graph. • Edge labels l from alphabet Σ. Here, Σ={A,C,G,T,-}. • Source s: The unique node with no incoming edges. • Sink t: The unique node with no outgoing edges. • Each path from s to t spells a sequence.

Sequence graphs: Represents a set of sequences given by all paths from s to t:

Sequence graphs: Any single sequence can be represented by a linear sequence graph. Any set of k sequences can be represented by making k paths from s to t. A given sequence s’ can be represented by more than one path. We can now represent sequences, but can we align them?
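A sequence graph can be sketched in Python as a node count plus a list of labelled edges. This is an illustration only: the names `linear_graph` and `spelled_sequences` are hypothetical, nodes are assumed to be numbered in topological order with the source as node 0 and the sink as the last node, and a "-" label is assumed to contribute no character to the spelled sequence.

```python
from collections import defaultdict


def linear_graph(seq):
    """Linear sequence graph for one sequence: (num_nodes, edges)."""
    return len(seq) + 1, [(i, ch, i + 1) for i, ch in enumerate(seq)]


def spelled_sequences(graph):
    """Set of sequences spelled by all source-to-sink paths.

    Assumes node 0 is the source, node num_nodes - 1 is the sink,
    and every edge (u, label, v) satisfies u < v.
    """
    num_nodes, edges = graph
    out = defaultdict(list)
    for u, label, v in edges:
        out[u].append((label, v))

    result = set()

    def walk(node, prefix):
        if node == num_nodes - 1:
            result.add(prefix)
            return
        for label, v in out[node]:
            # A gap label adds nothing to the spelled sequence (assumption).
            walk(v, prefix + ("" if label == "-" else label))

    walk(0, "")
    return result


# One graph with two paths between nodes 1 and 2 represents two sequences.
branching = (3, [(0, "A", 1), (1, "C", 2), (1, "G", 2)])
```

Here `spelled_sequences(linear_graph("ACTGTA"))` contains only ACTGTA, while `spelled_sequences(branching)` contains both AC and AG.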

Aligning sequence graphs: Dynamic programming algorithm inspired by basic pairwise alignment. Pairwise Alignment: • Given two sequences p and q • Move one letter in p and move through q finding the optimal "partial alignments". Sequence Graphs: • Given two sequence graphs G1 and G2 • We can have many outgoing edges to choose from

Aligning sequence graphs: Fill in a |V1|*|V2| score matrix For each pair of nodes i from G1 and j from G2: Should we: • Align the two characters we got by following e1 into i and e2 into j? • Stay in G1 and only move in G2? • Stay in G2 and only move in G1? • Or have we already found a better path into i and j?
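The node-pair recurrence can be sketched as follows. This is a minimal illustration rather than the presented implementation: the cost scheme (match 0, mismatch 1, gap 1, gap against gap 0), the `(num_nodes, edges)` graph encoding with topologically numbered nodes, and the names `column_cost` and `align_graphs` are all assumptions.

```python
import math


def linear_graph(seq):
    """Linear sequence graph for one sequence: (num_nodes, edges)."""
    return len(seq) + 1, [(i, ch, i + 1) for i, ch in enumerate(seq)]


def column_cost(a, b, mismatch=1, gap=1):
    """Assumed cost of one alignment column [a, b]."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return gap
    return mismatch


def align_graphs(g1, g2):
    """Minimum alignment cost between any path of g1 and any path of g2.

    Nodes are assumed numbered topologically (every edge has u < v),
    with node 0 the source and the last node the sink.
    """
    n1, edges1 = g1
    n2, edges2 = g2
    into1 = [[] for _ in range(n1)]
    for u, label, v in edges1:
        into1[v].append((u, label))
    into2 = [[] for _ in range(n2)]
    for u, label, v in edges2:
        into2[v].append((u, label))

    d = [[math.inf] * n2 for _ in range(n1)]
    d[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            best = d[i][j]
            for u1, a in into1[i]:
                # Move in G1 only, staying at node j in G2.
                best = min(best, d[u1][j] + column_cost(a, "-"))
                for u2, b in into2[j]:
                    # Align the labels of e1 into i and e2 into j.
                    best = min(best, d[u1][u2] + column_cost(a, b))
            for u2, b in into2[j]:
                # Move in G2 only, staying at node i in G1.
                best = min(best, d[i][u2] + column_cost("-", b))
            d[i][j] = best
    return d[n1 - 1][n2 - 1]
```

For two linear graphs this reduces to ordinary pairwise alignment; with a branching graph the minimum is taken over every path, so a graph spelling both AC and AG aligns to AG at cost 0 under these assumed costs.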

Optimal Alignment Graphs: Now we need a way to remember the optimal alignments. Recall graphs from before: • Directed, acyclic graphs • Nodes s and t defined as before • Edge labels of the form [la,lb] where la,lb∊Σ. Backtrack through the matrix and consider each possible combination of edges.

Optimal Alignment Graphs: An example of an OAG: This one represents the alignments: We denote such a graph A*. We have to convert the OAGs back to SGs.

Optimal Alignment Graphs: This is done easily by considering the edge labels: If la = lb: Make a single edge in the SG with label la. If la ≠ lb: Make two edges in the SG: One with label la and one with label lb. The graph from before turns into the SG:
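The label rule above is small enough to sketch directly. The edge encoding `(u, (la, lb), v)` and the name `oag_to_sg` are hypothetical, and dropping duplicate edges is an added assumption that the slides do not mention.

```python
def oag_to_sg(oag_edges):
    """Convert OAG edges (u, (la, lb), v) into SG edges (u, label, v)."""
    sg_edges = []
    seen = set()
    for u, (la, lb), v in oag_edges:
        # Equal labels collapse to one edge; different labels give two.
        labels = (la,) if la == lb else (la, lb)
        for label in labels:
            edge = (u, label, v)
            if edge not in seen:  # assumption: keep each SG edge once
                seen.add(edge)
                sg_edges.append(edge)
    return sg_edges
```

For example, the OAG edges `[(0, ("A", "A"), 1), (1, ("C", "-"), 2)]` become the three SG edges `(0, "A", 1)`, `(1, "C", 2)` and `(1, "-", 2)`.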

Summing up Sequence Graphs: The final graph represents all sequences giving an optimal alignment between G1 and G2. We can: • Represent a set of sequences by a sequence graph • Align two such graphs producing a new SG. We can now get on with the main algorithm.

The basic idea: • Start by comparing all sequences • Find a closest pair. • Represent all sequences giving the optimal solution • Defer the choice of a single sequence • Repeat, but this time include the set of sequences • In the end: Choose a single sequence and backtrack This shows a need for: • A compact representation of many sequences • An algorithm for aligning sets of sequences

The Deferred Path Heuristic: Similar to Kruskal’s algorithm for finding MSTs: From sequences s1,…,sn, initialize n SGs G1,…,Gn. Until only two SGs remain: • Align all pairs and choose a closest pair Gi and Gj • Create A*(Gi,Gj) and convert A* into a SG Gk. • Replace Gi and Gj with Gk. Note that we remember all candidate sequences.
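The merge loop can be sketched in Python with a deliberately crude stand-in for sequence graphs. In this sketch each "graph" is just a set of candidate sequences, the distance is the smallest unit-cost edit distance between members, and merging is a set union. All three are assumptions made so the example runs on its own; they are not the sequence-graph alignment and A* conversion described on the slide. The names `edit_cost`, `set_distance` and `merge_until_two` are hypothetical.

```python
from itertools import combinations


def edit_cost(p, q):
    """Unit-cost edit distance (assumed scoring)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def set_distance(a, b):
    """Stand-in for aligning two SGs: best cost over all member pairs."""
    return min(edit_cost(x, y) for x in a for y in b)


def merge_until_two(sequences, distance=set_distance, merge=lambda a, b: a | b):
    """Repeatedly join a closest pair until only two clusters remain.

    Each cluster is (subtree, candidates); subtrees are nested tuples
    of leaf indices.
    """
    clusters = [(i, frozenset({s})) for i, s in enumerate(sequences)]
    while len(clusters) > 2:
        # Compare all pairs and pick a closest one (first on ties).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]][1], clusters[ij[1]][1]))
        (ti, gi), (tj, gj) = clusters[i], clusters[j]
        merged = ((ti, tj), merge(gi, gj))  # keep every candidate sequence
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Passing `distance` and `merge` as parameters leaves room to swap in a real sequence-graph aligner; the loop structure itself is the part that mirrors the slide.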

The Deferred Path Heuristic: When only two SGs Gi and Gj remain: • Align them and connect them in T • Choose some optimal alignment • This gives si and sj at the roots of the two subtrees. • Backtrack through the subtrees • At each step: Align sk to the underlying SGs. • Choose some optimal alignment

The Deferred Path Heuristic: We defer our choice of actual sequences until the last moment, thereby enlarging our solution space:
