Evolutionary HMMs: a Bayesian approach to multiple alignment Presented by: Ryan Cunningham
The goal • Input: Multiple sequences and a tree representing their evolutionary relationship • Output: A multiple sequence alignment which maximizes the probability of the evolutionary relationships between the sequences, given the tree (and other parameters)
Evolutionary Models • Pairwise model for an ancestor X aligned to a descendant Y, conditioned on time t and other model parameters θ: P(Y, alignment | X, t, θ)
Reversibility • A pairwise model is reversible if running time backwards leaves the joint probability unchanged: π(x) P(y | x, t) = π(y) P(x | y, t), where π is the equilibrium distribution
Additivity • A pairwise model is additive if two consecutive branches compose into a single branch of their summed length (the Chapman–Kolmogorov condition): P(z | x, t + t′) = Σ_y P(y | x, t) P(z | y, t′)
The goal • If these constraints are satisfied, then the goal can be expressed as a sum over the unobserved internal sequences of a product over the branches of the tree: P(sequences | tree, θ) = Σ_internal π(root) Π_branches P(child | parent, t_branch, θ)
Model • Links Model • Pair HMM • Multiple HMM • Composition of branch alignments • Eliminating internal nodes
Links model • Models indels in a sequence • Each residue can either spawn a child or die • One “immortal link” can spawn residues from the left end of the sequence
Links model • What is the time evolution of p_n(t), the probability of a link surviving and spawning n descendants? • It obeys a linear birth–death equation (λ = insertion rate, μ = deletion rate per link): dp_n(t)/dt = −n(λ + μ) p_n(t) + λ(n−1) p_{n−1}(t) + μ(n+1) p_{n+1}(t) • The −n(λ + μ) p_n term is the probability flux of deleting or inserting from a sequence of length n; λ(n−1) p_{n−1} of inserting into a sequence of length n−1; μ(n+1) p_{n+1} of deleting from a sequence of length n+1
Links model • What is the time evolution of the probability of a link dying before time t and spawning n descendants?
Links model • What is the time evolution of the probability of the immortal link spawning n descendants at time t?
Links model • The solution to these differential equations is geometric in the number of descendants: a link survives to time t with n ≥ 1 descendants with probability α(t)(1 − β(t))β(t)^{n−1}, and dies leaving n descendants with probability (1 − α(t))(1 − γ(t)) for n = 0, or (1 − α(t))γ(t)(1 − β(t))β(t)^{n−1} for n ≥ 1
Links model • Where… • α(t) = e^{−μt} — the probability the ancestral residue survives • β(t) = λ(1 − e^{(λ−μ)t}) / (μ − λe^{(λ−μ)t}) — the probability of insertions from descendants • γ(t) = 1 − μ(1 − e^{(λ−μ)t}) / [(1 − e^{−μt})(μ − λe^{(λ−μ)t})] — the probability of insertions if the ancestral residue is dead
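As a sketch (assuming the standard TKF91 closed forms for α, β and γ, and λ < μ so that the equilibrium length distribution exists), these quantities can be computed and sanity-checked numerically; the parameter values below are arbitrary:

```python
import math

def tkf91_probs(lam, mu, t):
    """Closed-form TKF91 link probabilities at time t (assumes lam < mu)."""
    e = math.exp((lam - mu) * t)
    alpha = math.exp(-mu * t)                # ancestral residue survives
    beta = lam * (1.0 - e) / (mu - lam * e)  # further insertions from descendants
    gamma = 1.0 - mu * (1.0 - e) / ((1.0 - alpha) * (mu - lam * e))  # insertions given death
    return alpha, beta, gamma

alpha, beta, gamma = tkf91_probs(0.03, 0.04, 1.0)
```

A useful check is that the link's possible fates (survive with n ≥ 1 descendants, or die with n ≥ 0 descendants) sum to probability one.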
Pair HMM • Just like a standard HMM, but emits two sequences instead of one • Used to model aligned sequences
Pair HMM • Can be used to realize the links model outlined previously • The path π through the pair HMM corresponds to an alignment of the two sequences
Pair HMM • This model is insufficient for our task for an obvious reason: it aligns only two sequences at a time, while we need a joint alignment of all the sequences on the tree
Multiple HMMs • Instead of emitting 2 sequences, emit N sequences • 2^N − 1 states! • Can develop such a model for any tree • Dynamic programming table for the entire tree is too large for practical use
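A quick way to see the state count: the non-silent states of the multiple HMM correspond to the non-empty subsets of sequences a column can emit into, giving 2^N − 1 for N sequences (the pair HMM's 3 states are the N = 2 case). A sketch, with hypothetical sequence names:

```python
from itertools import combinations

def emission_states(seq_names):
    """All non-empty subsets of sequences a state may emit into: 2^N - 1 of them."""
    states = []
    for r in range(1, len(seq_names) + 1):
        states.extend(combinations(seq_names, r))
    return states

states = emission_states(["W", "X", "Y", "Z"])
```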
Composition of branch alignments • Given a complete set of pairwise alignments for an evolutionary tree, how do we derive the multiple alignment (and vice versa)? • Rule for two sequences X and Y: residues Xi and Yj are aligned iff • They are in the same column • That column contains no gaps in the intermediate sequences (those on the tree path between X and Y)
Composition of branch alignments • In other words, a residue that is deleted and then re-inserted along the path is not treated as a homology (no deletion → insertion chains) • Produces cliques of ungapped columns • Ignore columns which are all gaps in a clique
Composition of branch alignments • Example (guide-tree figure omitted; nodes numbered 1–4):
1: GATTACA
2: G-TTATA
3: GATAT-A
4: CATTA-T
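The composition rule can be sketched as code. The path ["2", "1", "3"] below assumes, for illustration only, that node 1 is the intermediate node between leaves 2 and 3:

```python
def aligned_pairs(msa, path):
    """Project a multiple alignment onto the two endpoints of a tree path.

    msa: dict name -> gapped row (equal lengths); path: node names from one
    endpoint to the other, interior entries being the intermediate sequences.
    Returns the residue-index pairs (i, j) that the composition rule aligns.
    """
    x, y = path[0], path[-1]
    intermediates = path[1:-1]
    pairs = []
    i = j = 0  # residue counters for x and y
    for col in range(len(msa[x])):
        cx, cy = msa[x][col], msa[y][col]
        # aligned iff same column and no gap in any intermediate sequence
        if cx != '-' and cy != '-' and all(msa[m][col] != '-' for m in intermediates):
            pairs.append((i, j))
        if cx != '-':
            i += 1
        if cy != '-':
            j += 1
    return pairs

msa = {"1": "GATTACA", "2": "G-TTATA", "3": "GATAT-A", "4": "CATTA-T"}
pairs = aligned_pairs(msa, ["2", "1", "3"])
```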
Eliminating internal nodes • Sequences at internal nodes are not actually known • Ideally, they should be summed out of the final likelihood function (i.e. the probability of all possible sequences should be considered)
Eliminating internal nodes • This is not feasible for all possible indel histories, since all of the branches become dependent • Summing out substitution histories is possible: • Given a multiple alignment • Post-order traversal of the tree • Compute the conditional likelihood for each clique • Like the Sankoff algorithm discussed last semester • Call the summed-out internal characters "Felsenstein wildcards"
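The clique-likelihood computation is Felsenstein's pruning recursion. A minimal sketch on a hypothetical two-state alphabet with a made-up symmetric substitution matrix and uniform root prior, not the paper's actual model:

```python
def cond_likelihood(node, tree, leaves, P):
    """Return [L(state=0), L(state=1)]: likelihood of the observed leaves
    below `node`, conditional on the state at `node` (post-order recursion)."""
    if node in leaves:
        return [1.0 if s == leaves[node] else 0.0 for s in (0, 1)]
    like = [1.0, 1.0]
    for child in tree[node]:
        cl = cond_likelihood(child, tree, leaves, P)
        for s in (0, 1):
            # sum over the child's state, weighted by the branch transition
            like[s] *= sum(P[s][t] * cl[t] for t in (0, 1))
    return like

P = [[0.9, 0.1], [0.1, 0.9]]               # hypothetical per-branch substitution probs
tree = {"root": ["x", "y"], "x": ["a", "b"]}
leaves = {"a": 0, "b": 0, "y": 1}
L = cond_likelihood("root", tree, leaves, P)
site_lik = 0.5 * L[0] + 0.5 * L[1]          # uniform root prior
```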
Algorithm • Goal: sample multiple alignments from their posterior distribution • The sample space of complete indel histories is too large to sample from directly (the likelihood does not decompose conveniently) • Instead, sample from a subset of alignments at each step using Gibbs sampling
Algorithm • Need to come up with a set of transformations on an alignment • These transformations need to be ergodic • Allow for transformation of any alignment into any other alignment • These transformations need to satisfy detailed balance • Gives convergence to the desired stationary distribution
Move 1: parent sampling • Goal: align two sibling nodes Y and Z and infer their parent X • Construct the multiple HMM, fixing everything else • Sample an alignment of (Y,Z) using the forward algorithm • This imposes alignments of the branches (X,Y) and (X,Z)
Move 2: branch sampling • Goal: align two adjacent nodes X and Y • Construct the pair HMM for X and Y, fixing everything else • Resample the alignment using the forward algorithm
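Branch sampling can be sketched as a forward pass over a pair HMM followed by stochastic traceback. The three-state HMM below uses made-up transition and emission parameters for illustration, not the TKF91-derived ones:

```python
import random

trans = {  # trans[s][t] = P(next state t | current state s); hypothetical values
    "M": {"M": 0.8, "X": 0.1, "Y": 0.1},
    "X": {"M": 0.6, "X": 0.3, "Y": 0.1},
    "Y": {"M": 0.6, "X": 0.1, "Y": 0.3},
}

def emit(state, a, b):
    if state == "M":
        return 0.20 if a == b else 0.05  # joint match/mismatch probability
    return 0.25                           # single-sequence background

def forward(x, y):
    """F[s][i][j] = P(emit x[:i], y[:j], ending in state s)."""
    n, m = len(x), len(y)
    F = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    start = {"M": 1 / 3, "X": 1 / 3, "Y": 1 / 3}  # uniform start, for simplicity
    for i in range(n + 1):
        for j in range(m + 1):
            for s in "MXY":
                di, dj = (1, 1) if s == "M" else (1, 0) if s == "X" else (0, 1)
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                e = emit(s, x[pi] if di else None, y[pj] if dj else None)
                if pi == 0 and pj == 0:
                    F[s][i][j] = start[s] * e
                else:
                    F[s][i][j] = e * sum(F[r][pi][pj] * trans[r][s] for r in "MXY")
    return F

def sample_alignment(x, y, rng):
    """Draw a state path (an alignment) with probability proportional to
    its forward contribution: stochastic traceback."""
    F = forward(x, y)
    states, path = list("MXY"), []
    i, j = len(x), len(y)
    s = rng.choices(states, weights=[F[r][i][j] for r in states])[0]
    while i > 0 or j > 0:
        path.append(s)
        di, dj = (1, 1) if s == "M" else (1, 0) if s == "X" else (0, 1)
        pi, pj = i - di, j - dj
        if pi == 0 and pj == 0:
            break
        s = rng.choices(states, weights=[F[r][pi][pj] * trans[r][s] for r in states])[0]
        i, j = pi, pj
    return path[::-1]

rng = random.Random(0)
path = sample_alignment("GAT", "GT", rng)
```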
Move 3: node sampling • Goal: resample the sequence at an internal node X • Given parent node W and children nodes Y and Z, fix everything but these branch alignments • Construct the multiple HMM and sample X • Note that we’re sampling the sequence, not an alignment
Sufficiency of moves • Can transform any alignment to any other alignment => ergodic • Samples from the conditional distributions, so this is Gibbs sampling => detailed balance • Together these imply the moves will produce an unbiased sample of the distribution
Algorithm • 1. Construct a multiple alignment by parent sampling up the guide tree • 2. Visit each node and each branch once in a random order and resample them • 3. GOTO 2
Refinement 1: Greedy approach • Periodically save current alignment, then take a greedy approach • Just use Viterbi algorithm instead of forward algorithm for node and branch sampling • Store this and compare likelihood to other alignments at the end of the run
Refinement 2: Simulated annealing • Inject a variable “temperature” parameter T that decreases over time • Raise all probabilities to the 1/T power • Increases the “randomness” of the solutions early in the run to explore more of the search space
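The tempering operation itself is one line: raise each probability to the 1/T power and renormalize. A sketch (the helper name and values are hypothetical):

```python
def temper(probs, T):
    """Flatten (T > 1) or sharpen (T < 1) a distribution via the 1/T power."""
    w = [p ** (1.0 / T) for p in probs]
    z = sum(w)
    return [v / z for v in w]

p = [0.7, 0.2, 0.1]
hot = temper(p, 10.0)   # high T: near-uniform, explores the space
cold = temper(p, 0.1)   # low T: near-deterministic, exploits the best move
```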
Refinement 3: Ordered over-relaxation • Gibbs sampling is a random walk, so it explores the space slowly, like Brownian motion • It would be better to avoid previously explored regions • Ordered over-relaxation accomplishes this without biasing the sampled distribution
Refinement 3: Ordered over-relaxation • Impose a weak order on states • Emit some number N samples • Sort the N samples and the original sample by the specified weak ordering • The original sample ends up in position k, choose the (N-k)th sample for the next emission
Refinement 3: Ordered over-relaxation • For this algorithm, the weak ordering is a sort on “the centroid of all match states resolved transverse to the main diagonal of the dynamic programming matrix”
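One ordered over-relaxation update (in the style of Neal's scheme) can be sketched generically; `draw` and `key` stand in for the full-conditional sampler and the match-centroid weak ordering described above, and ties under the ordering are ignored for simplicity:

```python
def overrelaxed_step(current, draw, key, n=15):
    """One ordered over-relaxation update (a sketch).

    draw(): sample from the full conditional; key: the weak ordering.
    The current value lands at rank k among itself plus n fresh draws;
    return the value at the mirrored rank n - k.
    """
    pool = sorted([current] + [draw() for _ in range(n)], key=key)
    k = pool.index(current)   # assumes no ties with the fresh draws
    return pool[n - k]

# Deterministic demo: current value 5 ranks last among {1,2,3,4,5},
# so its mirror is the smallest draw.
vals = iter([1, 2, 3, 4])
nxt = overrelaxed_step(5, lambda: next(vals), key=lambda v: v, n=4)
```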
Implementation • Coded in C++, released under the GNU license • Useful logging and parameter options • Very large package with helper scripts • Incorporated into the DART package
Simulated data • Yule process to create a random tree • Probabilistic process which creates an evolutionary tree • Number of links is distributed by a power law • tkfemit to emit an alignment given a tree
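A Yule tree can be sampled by letting each extant lineage split at a constant rate λ until the desired number of leaves exists. A sketch (the function name and return format are my own; leaf branches end at the time of the last split):

```python
import random

def yule_tree(n_leaves, lam, rng):
    """Sample a Yule tree: each extant lineage splits at rate lam.

    Returns (parent, length): a child -> parent map and per-node branch lengths.
    Node 0 is the root; nodes are numbered in order of creation.
    """
    t = 0.0
    active = [0]                 # extant lineages
    parent, length, birth, nxt = {}, {}, {0: 0.0}, 1
    while len(active) < n_leaves:
        t += rng.expovariate(lam * len(active))   # time to the next split
        node = active.pop(rng.randrange(len(active)))  # uniformly chosen lineage
        for _ in range(2):                        # it splits into two children
            parent[nxt], birth[nxt] = node, t
            active.append(nxt)
            nxt += 1
        length[node] = t - birth[node]
    for leaf in active:                           # leaves extend to the last split
        length[leaf] = t - birth[leaf]
    return parent, length

parent, length = yule_tree(5, 1.0, random.Random(1))
```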
Experiments on simulated data • Only looking at leaf sequences now • The tkfdistance program creates a distance (time) matrix from the observed sequences • Uses "bracketed minimization" • Weighbor estimates a tree from a distance matrix • Uses neighbor joining, but weighted for longer distances according to a probabilistic model
Experiments on simulated data • Using the aligned leaf sequences, reconstruct the indel history using wildcards, comparing three strategies • (A) Use maximum likelihood up the tree once • (B) Start from A and do greedy refinement (repeatedly sample up the tree until no improvements are made) • (C) Start from B and do 100 sampling moves from the Gibbs sampler, followed by greedy refinement