Evolutionary HMMs: a Bayesian approach to multiple alignment Presented by: Ryan Cunningham
The goal • Input: Multiple sequences and a tree representing their evolutionary relationship • Output: A multiple sequence alignment which maximizes the probability of the evolutionary relationships between the sequences, given the tree (and other parameters)
Evolutionary Models • Pairwise model for an ancestor X aligned to a descendant Y, conditioned on time t and other model parameters θ: P(Y, alignment | X, t, θ)
Reversibility • A pairwise model is reversible if running time backwards leaves the joint probability unchanged: π(x) P(y | x, t) = π(y) P(x | y, t), where π is the equilibrium distribution
Additivity • A pairwise model is additive if two consecutive branches compose into a single branch of their summed length (the Chapman–Kolmogorov condition): P(z | x, t + t′) = Σ_y P(y | x, t) P(z | y, t′)
The goal • If these constraints are satisfied, then the goal can be expressed as a sum over the unobserved internal sequences of a product over the branches of the tree: P(sequences | tree, θ) = Σ_internal π(root) Π_branches P(child | parent, t_branch, θ)
Model • Links Model • Pair HMM • Multiple HMM • Composition of branch alignments • Eliminating internal nodes
Links model • Models indels in a sequence • Each residue can either spawn a child or die • One “immortal link” can spawn residues from the left end of the sequence
Links model • What is the time evolution of p_n(t), the probability of a link surviving and spawning n descendants? • It obeys a linear birth–death equation (λ = insertion rate, μ = deletion rate per link): dp_n(t)/dt = −n(λ + μ) p_n(t) + λ(n−1) p_{n−1}(t) + μ(n+1) p_{n+1}(t) • The −n(λ + μ) p_n term is the probability flux of deleting or inserting from a sequence of length n; λ(n−1) p_{n−1} of inserting into a sequence of length n−1; μ(n+1) p_{n+1} of deleting from a sequence of length n+1
Links model • What is the time evolution of the probability of a link dying before time t and spawning n descendants?
Links model • What is the time evolution of the probability of the immortal link spawning n descendants at time t?
Links model • The solution to these differential equations is geometric in the number of descendants: a link survives to time t with n ≥ 1 descendants with probability α(t)(1 − β(t))β(t)^{n−1}, and dies leaving n descendants with probability (1 − α(t))(1 − γ(t)) for n = 0, or (1 − α(t))γ(t)(1 − β(t))β(t)^{n−1} for n ≥ 1
Links model • Where… • α(t) = e^{−μt} — the probability the ancestral residue survives • β(t) = λ(1 − e^{(λ−μ)t}) / (μ − λe^{(λ−μ)t}) — the probability of insertions from descendants • γ(t) = 1 − μ(1 − e^{(λ−μ)t}) / [(1 − e^{−μt})(μ − λe^{(λ−μ)t})] — the probability of insertions if the ancestral residue is dead
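As a sketch (assuming the standard TKF91 closed forms for α, β and γ, and λ < μ so that the equilibrium length distribution exists), these quantities can be computed and sanity-checked numerically; the parameter values below are arbitrary:

```python
import math

def tkf91_probs(lam, mu, t):
    """Closed-form TKF91 link probabilities at time t (assumes lam < mu)."""
    e = math.exp((lam - mu) * t)
    alpha = math.exp(-mu * t)                # ancestral residue survives
    beta = lam * (1.0 - e) / (mu - lam * e)  # further insertions from descendants
    gamma = 1.0 - mu * (1.0 - e) / ((1.0 - alpha) * (mu - lam * e))  # insertions given death
    return alpha, beta, gamma

alpha, beta, gamma = tkf91_probs(0.03, 0.04, 1.0)
```

A useful check is that the link's possible fates (survive with n ≥ 1 descendants, or die with n ≥ 0 descendants) sum to probability one.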
Pair HMM • Just like a standard HMM, but emits two sequences instead of one • Used to model aligned sequences
Pair HMM • Can be used to realize the links model outlined previously • The path π through the pair HMM corresponds to an alignment of the two sequences
Pair HMM • This model is insufficient for our task for an obvious reason: it aligns only two sequences at a time, while we need a joint alignment of all the sequences on the tree
Multiple HMMs • Instead of emitting 2 sequences, emit N sequences • 2^N − 1 states! • Can develop such a model for any tree • Dynamic programming table for the entire tree is too large for practical use
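A quick way to see the state count: the non-silent states of the multiple HMM correspond to the non-empty subsets of sequences a column can emit into, giving 2^N − 1 for N sequences (the pair HMM's 3 states are the N = 2 case). A sketch, with hypothetical sequence names:

```python
from itertools import combinations

def emission_states(seq_names):
    """All non-empty subsets of sequences a state may emit into: 2^N - 1 of them."""
    states = []
    for r in range(1, len(seq_names) + 1):
        states.extend(combinations(seq_names, r))
    return states

states = emission_states(["W", "X", "Y", "Z"])
```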
Composition of branch alignments • Given a complete set of pairwise alignments for an evolutionary tree, how do we derive the multiple alignment (and vice versa)? • Rule for two sequences X and Y: residues Xi and Yj are aligned iff • They are in the same column • That column contains no gaps in the intermediate sequences (those on the tree path between X and Y)
Composition of branch alignments • In other words, a residue that is deleted and then re-inserted along the path is not treated as a homology (no deletion → insertion chains) • Produces cliques of ungapped columns • Ignore columns which are all gaps in a clique
Composition of branch alignments • Example (guide-tree figure omitted; nodes numbered 1–4):
1: GATTACA
2: G-TTATA
3: GATAT-A
4: CATTA-T
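The composition rule can be sketched as code. The path ["2", "1", "3"] below assumes, for illustration only, that node 1 is the intermediate node between leaves 2 and 3:

```python
def aligned_pairs(msa, path):
    """Project a multiple alignment onto the two endpoints of a tree path.

    msa: dict name -> gapped row (equal lengths); path: node names from one
    endpoint to the other, interior entries being the intermediate sequences.
    Returns the residue-index pairs (i, j) that the composition rule aligns.
    """
    x, y = path[0], path[-1]
    intermediates = path[1:-1]
    pairs = []
    i = j = 0  # residue counters for x and y
    for col in range(len(msa[x])):
        cx, cy = msa[x][col], msa[y][col]
        # aligned iff same column and no gap in any intermediate sequence
        if cx != '-' and cy != '-' and all(msa[m][col] != '-' for m in intermediates):
            pairs.append((i, j))
        if cx != '-':
            i += 1
        if cy != '-':
            j += 1
    return pairs

msa = {"1": "GATTACA", "2": "G-TTATA", "3": "GATAT-A", "4": "CATTA-T"}
pairs = aligned_pairs(msa, ["2", "1", "3"])
```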
Eliminating internal nodes • Sequences at internal nodes are not actually known • Ideally, they should be summed out of the final likelihood function (i.e. the probability of all possible sequences should be considered)
Eliminating internal nodes • This is not feasible for all possible indel histories, since all of the branches become dependent • Summing out substitution histories is possible: • Given a multiple alignment • Post-order traversal of the tree • Compute the conditional likelihood for each clique • Like the Sankoff algorithm discussed last semester • Call the summed-out internal characters "Felsenstein wildcards"
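The clique-likelihood computation is Felsenstein's pruning recursion. A minimal sketch on a hypothetical two-state alphabet with a made-up symmetric substitution matrix and uniform root prior, not the paper's actual model:

```python
def cond_likelihood(node, tree, leaves, P):
    """Return [L(state=0), L(state=1)]: likelihood of the observed leaves
    below `node`, conditional on the state at `node` (post-order recursion)."""
    if node in leaves:
        return [1.0 if s == leaves[node] else 0.0 for s in (0, 1)]
    like = [1.0, 1.0]
    for child in tree[node]:
        cl = cond_likelihood(child, tree, leaves, P)
        for s in (0, 1):
            # sum over the child's state, weighted by the branch transition
            like[s] *= sum(P[s][t] * cl[t] for t in (0, 1))
    return like

P = [[0.9, 0.1], [0.1, 0.9]]               # hypothetical per-branch substitution probs
tree = {"root": ["x", "y"], "x": ["a", "b"]}
leaves = {"a": 0, "b": 0, "y": 1}
L = cond_likelihood("root", tree, leaves, P)
site_lik = 0.5 * L[0] + 0.5 * L[1]          # uniform root prior
```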
Algorithm • Goal: sample multiple alignments from their posterior distribution • The sample space of complete indel histories is too large to sample from directly (the likelihood does not decompose conveniently) • Instead, sample from a subset of alignments at each step using Gibbs sampling
Algorithm • Need to come up with a set of transformations on an alignment • These transformations need to be ergodic • Allow for transformation of any alignment into any other alignment • These transformations need to satisfy detailed balance • Gives convergence to the desired stationary distribution
Move 1: parent sampling • Goal: align two sibling nodes Y and Z and infer their parent X • Construct the multiple HMM, fixing everything else • Sample an alignment of (Y,Z) using the forward algorithm • This imposes alignments of the branches (X,Y) and (X,Z)
Move 2: branch sampling • Goal: align two adjacent nodes X and Y • Construct the pair HMM for X and Y, fixing everything else • Resample the alignment using the forward algorithm
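Branch sampling can be sketched as a forward pass over a pair HMM followed by stochastic traceback. The three-state HMM below uses made-up transition and emission parameters for illustration, not the TKF91-derived ones:

```python
import random

trans = {  # trans[s][t] = P(next state t | current state s); hypothetical values
    "M": {"M": 0.8, "X": 0.1, "Y": 0.1},
    "X": {"M": 0.6, "X": 0.3, "Y": 0.1},
    "Y": {"M": 0.6, "X": 0.1, "Y": 0.3},
}

def emit(state, a, b):
    if state == "M":
        return 0.20 if a == b else 0.05  # joint match/mismatch probability
    return 0.25                           # single-sequence background

def forward(x, y):
    """F[s][i][j] = P(emit x[:i], y[:j], ending in state s)."""
    n, m = len(x), len(y)
    F = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    start = {"M": 1 / 3, "X": 1 / 3, "Y": 1 / 3}  # uniform start, for simplicity
    for i in range(n + 1):
        for j in range(m + 1):
            for s in "MXY":
                di, dj = (1, 1) if s == "M" else (1, 0) if s == "X" else (0, 1)
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                e = emit(s, x[pi] if di else None, y[pj] if dj else None)
                if pi == 0 and pj == 0:
                    F[s][i][j] = start[s] * e
                else:
                    F[s][i][j] = e * sum(F[r][pi][pj] * trans[r][s] for r in "MXY")
    return F

def sample_alignment(x, y, rng):
    """Draw a state path (an alignment) with probability proportional to
    its forward contribution: stochastic traceback."""
    F = forward(x, y)
    states, path = list("MXY"), []
    i, j = len(x), len(y)
    s = rng.choices(states, weights=[F[r][i][j] for r in states])[0]
    while i > 0 or j > 0:
        path.append(s)
        di, dj = (1, 1) if s == "M" else (1, 0) if s == "X" else (0, 1)
        pi, pj = i - di, j - dj
        if pi == 0 and pj == 0:
            break
        s = rng.choices(states, weights=[F[r][pi][pj] * trans[r][s] for r in states])[0]
        i, j = pi, pj
    return path[::-1]

rng = random.Random(0)
path = sample_alignment("GAT", "GT", rng)
```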
Move 3: node sampling • Goal: resample the sequence at an internal node X • Given parent node W and children nodes Y and Z, fix everything but these branch alignments • Construct the multiple HMM and sample X • Note that we’re sampling the sequence, not an alignment
Sufficiency of moves • Can transform any alignment to any other alignment => ergodic • Samples from the conditional distributions, so this is Gibbs sampling => detailed balance • Together these imply the moves will produce an unbiased sample of the distribution
Algorithm • 1. Construct a multiple alignment by parent sampling up the guide tree • 2. Visit each node and each branch once in a random order and resample them • 3. GOTO 2
Refinement 1: Greedy approach • Periodically save current alignment, then take a greedy approach • Just use Viterbi algorithm instead of forward algorithm for node and branch sampling • Store this and compare likelihood to other alignments at the end of the run
Refinement 2: Simulated annealing • Inject a variable “temperature” parameter T that decreases over time • Raise all probabilities to the 1/T power • Increases the “randomness” of the solutions early in the run to explore more of the search space
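The tempering operation itself is one line: raise each probability to the 1/T power and renormalize. A sketch (the helper name and values are hypothetical):

```python
def temper(probs, T):
    """Flatten (T > 1) or sharpen (T < 1) a distribution via the 1/T power."""
    w = [p ** (1.0 / T) for p in probs]
    z = sum(w)
    return [v / z for v in w]

p = [0.7, 0.2, 0.1]
hot = temper(p, 10.0)   # high T: near-uniform, explores the space
cold = temper(p, 0.1)   # low T: near-deterministic, exploits the best move
```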
Refinement 3: Ordered over-relaxation • Gibbs sampling is a random walk, so it explores the space slowly, like Brownian motion • It would be better to avoid previously explored regions • Ordered over-relaxation accomplishes this without biasing the sampled distribution
Refinement 3: Ordered over-relaxation • Impose a weak order on states • Emit some number N samples • Sort the N samples and the original sample by the specified weak ordering • The original sample ends up in position k, choose the (N-k)th sample for the next emission
Refinement 3: Ordered over-relaxation • For this algorithm, the weak ordering is a sort on “the centroid of all match states resolved transverse to the main diagonal of the dynamic programming matrix”
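One ordered over-relaxation update (in the style of Neal's scheme) can be sketched generically; `draw` and `key` stand in for the full-conditional sampler and the match-centroid weak ordering described above, and ties under the ordering are ignored for simplicity:

```python
def overrelaxed_step(current, draw, key, n=15):
    """One ordered over-relaxation update (a sketch).

    draw(): sample from the full conditional; key: the weak ordering.
    The current value lands at rank k among itself plus n fresh draws;
    return the value at the mirrored rank n - k.
    """
    pool = sorted([current] + [draw() for _ in range(n)], key=key)
    k = pool.index(current)   # assumes no ties with the fresh draws
    return pool[n - k]

# Deterministic demo: current value 5 ranks last among {1,2,3,4,5},
# so its mirror is the smallest draw.
vals = iter([1, 2, 3, 4])
nxt = overrelaxed_step(5, lambda: next(vals), key=lambda v: v, n=4)
```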
Implementation • Coded in C++, released under the GNU license • Useful logging and parameter options • Very large package with helper scripts • Incorporated into the DART package
Simulated data • Yule process to create a random tree • Probabilistic process which creates an evolutionary tree • Number of links is distributed by a power law • tkfemit to emit an alignment given a tree
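A Yule tree can be sampled by letting each extant lineage split at a constant rate λ until the desired number of leaves exists. A sketch (the function name and return format are my own; leaf branches end at the time of the last split):

```python
import random

def yule_tree(n_leaves, lam, rng):
    """Sample a Yule tree: each extant lineage splits at rate lam.

    Returns (parent, length): a child -> parent map and per-node branch lengths.
    Node 0 is the root; nodes are numbered in order of creation.
    """
    t = 0.0
    active = [0]                 # extant lineages
    parent, length, birth, nxt = {}, {}, {0: 0.0}, 1
    while len(active) < n_leaves:
        t += rng.expovariate(lam * len(active))   # time to the next split
        node = active.pop(rng.randrange(len(active)))  # uniformly chosen lineage
        for _ in range(2):                        # it splits into two children
            parent[nxt], birth[nxt] = node, t
            active.append(nxt)
            nxt += 1
        length[node] = t - birth[node]
    for leaf in active:                           # leaves extend to the last split
        length[leaf] = t - birth[leaf]
    return parent, length

parent, length = yule_tree(5, 1.0, random.Random(1))
```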
Experiments on simulated data • Only looking at leaf sequences now • The tkfdistance program creates a distance (time) matrix from the observed sequences • Uses "bracketed minimization" • Weighbor estimates a tree from a distance matrix • Uses neighbor joining, but weighted for longer distances according to a probabilistic model
Experiments on simulated data • Using the aligned leaf sequences, reconstruct the indel history using wildcards, comparing three strategies • (A) Use maximum likelihood up the tree once • (B) Start from A and do greedy refinement (repeatedly sample up the tree until no improvements are made) • (C) Start from B and do 100 sampling moves from the Gibbs sampler, followed by greedy refinement