Probabilistic methods for phylogenetic trees (Part 2). Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 7th, 2014.
RECAP
• Probabilistic methods for phylogenetic tree construction
• P(data|tree)
• Maximum likelihood
• Felsenstein algorithm for computing the likelihood of a sequence given a tree
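Probabilistic models of evolution
• The probability of a character switching from a to b along a branch of length t, P(b|a,t), is captured by the matrix S(t)
• For example, for DNA this is (rows give the starting base a, columns the ending base b, ordered A, T, G, C):
$$
S(t) = \begin{pmatrix}
P(A|A,t) & P(T|A,t) & P(G|A,t) & P(C|A,t) \\
P(A|T,t) & P(T|T,t) & P(G|T,t) & P(C|T,t) \\
P(A|G,t) & P(T|G,t) & P(G|G,t) & P(C|G,t) \\
P(A|C,t) & P(T|C,t) & P(G|C,t) & P(C|C,t)
\end{pmatrix}
$$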
Defining the conditional probability distributions
• If we consider t to be evolutionary time, these conditional probabilities can be obtained from what is called a continuous-time Markov process
• Such processes are defined by a K-by-K rate matrix R
  • Each entry of R, R(a,b), gives the rate of substitution from a to b
  • The time spent in any state (character) is exponentially distributed
• If we have R, S(t) can be obtained from R using the theory of continuous-time Markov processes
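As a minimal sketch of how S(t) might be obtained from R: for a continuous-time Markov process whose rate matrix has rows summing to zero, the standard result is S(t) = exp(Rt), the matrix exponential. The code below approximates it in plain Python with a truncated Taylor series plus scaling and squaring. The function names, the number of terms, and the example rate matrix (a uniform off-diagonal rate of 0.1) are assumptions for illustration, not part of the lecture.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(R, t, terms=30, squarings=10):
    """Approximate S(t) = exp(R * t).

    Scales R * t down by 2**squarings, sums a truncated Taylor series,
    then squares the result back up.
    """
    n = len(R)
    scale = t / (2 ** squarings)
    A = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = [row[:] for row in identity]     # running sum of the series
    term = [row[:] for row in identity]  # current term A^k / k!
    for k in range(1, terms):
        term = mat_mul(term, A)
        term = [[x / k for x in row] for row in term]
        S = [[S[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        S = mat_mul(S, S)
    return S


# Hypothetical 4-state rate matrix: off-diagonal rate 0.1, rows sum to zero.
rate = 0.1
R = [[-3 * rate if i == j else rate for j in range(4)] for i in range(4)]
S = transition_matrix(R, t=2.0)
```

Each row of `S` is a probability distribution over the ending character, so the rows should sum to 1, and `transition_matrix(R, 0.0)` should return the identity matrix.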
Rate matrices
• A rate matrix R
  • Is a K-by-K matrix, where K is the size of our alphabet
  • E.g. for DNA, K=4
• Different rate matrices make different assumptions about substitutions
  • Jukes Cantor: all substitutions have the same rate
  • Kimura: transitions (A<->G, C<->T) and transversions (A<->C, A<->T, G<->C, G<->T) have different rates
  • Hasegawa, Kishino, Yano (HKY): transitions and transversions have different rates, and base frequencies may be unequal
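The first two models above can be sketched as small constructors. This is an illustration only: the function names, the dictionary layout, and the parameter names `alpha` (transition rate) and `beta` (transversion rate) are assumptions, and the diagonal is set so that each row sums to zero, as is conventional for rate matrices.

```python
BASES = "ATGC"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for A<->G or C<->T (both purines or both pyrimidines)."""
    return a != b and ((a in PURINES) == (b in PURINES))


def kimura_rate_matrix(alpha, beta):
    """Rate matrix with transition rate alpha and transversion rate beta."""
    R = {}
    for a in BASES:
        R[a] = {}
        for b in BASES:
            if a != b:
                R[a][b] = alpha if is_transition(a, b) else beta
        # The diagonal entry makes the row sum to zero.
        R[a][a] = -sum(R[a].values())
    return R


def jukes_cantor_rate_matrix(alpha):
    """Jukes-Cantor is the special case with one shared rate."""
    return kimura_rate_matrix(alpha, alpha)


R = kimura_rate_matrix(alpha=0.2, beta=0.05)
```

With these values, `R["A"]["G"]` is a transition rate (0.2) and `R["A"]["C"]` is a transversion rate (0.05).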
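Jukes Cantor rate matrix
• Simplest possible rate matrix for DNA sequence evolution
• Assumes all bases change at the same rate, α (rows and columns ordered A, T, G, C):
$$
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
$$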
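Conditional probabilities from Jukes Cantor
• The conditional probability matrix, with entries P(a|b,t), has a similar form to the rate matrix (rows and columns ordered A, T, G, C):
$$
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix},
\qquad
r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right),
\quad
s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)
$$
• Here α is the common substitution rate; for example, P(G|C,t) = s_t
• Equilibrium distribution: ¼ for all bases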
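The Jukes-Cantor conditional probabilities have a standard closed form: the probability of staying at the same base is ¼(1 + 3e^(-4αt)), and the probability of ending at any one specific different base is ¼(1 - e^(-4αt)), where α is the shared substitution rate. A small sketch (the function name `jc_prob` and the value `alpha=0.1` are assumptions for illustration) shows both probabilities approaching the equilibrium value ¼ as t grows:

```python
import math


def jc_prob(a, b, t, alpha):
    """P(a|b,t) under Jukes-Cantor with shared substitution rate alpha."""
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)


# Print P(G|G,t) and P(G|C,t) for increasing branch lengths.
for t in (0.0, 1.0, 10.0, 100.0):
    print(t, jc_prob("G", "G", t, alpha=0.1), jc_prob("G", "C", t, alpha=0.1))
```

At t = 0 no change is possible, so P(G|G,0) is 1 and P(G|C,0) is 0.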
Searching phylogenetic tree space with maximum likelihood
• As in the maximum parsimony case, we need to
  • Score a tree
  • Search over the space of possible trees
• Score a given tree
  • Branch lengths are parameters
  • Estimate the branch lengths to maximize the likelihood of the data given the tree
• Search over trees
  • Start with an initial tree
    • A greedy approach of adding a branch that maximizes the likelihood
    • Neighbor joining
  • Revise using nearest neighbor interchange or subtree pruning and regrafting until convergence
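The "estimate the branch lengths" step can be illustrated on the smallest possible tree: two aligned sequences joined by a single branch of length t. The sketch below assumes the Jukes-Cantor model with rate α = 1 and a uniform equilibrium distribution, and uses a golden-section search over t. The function names, search bounds, and example sequences are all made up for illustration; real programs optimize many branch lengths jointly on larger trees.

```python
import math


def jc_log_likelihood(x, y, t, alpha=1.0):
    """Log-likelihood of aligned sequences x, y separated by branch length t."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 * (1.0 + 3.0 * decay)
    p_diff = 0.25 * (1.0 - decay)
    total = 0.0
    for a, b in zip(x, y):
        # 1/4 is the equilibrium probability of the starting base.
        total += math.log(0.25 * (p_same if a == b else p_diff))
    return total


def ml_branch_length(x, y, lo=1e-6, hi=5.0, iters=100):
    """Golden-section search for the t that maximizes the log-likelihood."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    for _ in range(iters):
        if jc_log_likelihood(x, y, c) > jc_log_likelihood(x, y, d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
    return (lo + hi) / 2.0


# Two hypothetical sequences that differ at 2 of 20 sites.
x = "ACGTACGTACGTACGTACGT"
y = "ACGTACGAACGTACCTACGT"
t_hat = ml_branch_length(x, y)
```

For this two-sequence case the maximum can also be written in closed form, t = -(1/4α) ln(1 - 4p/3) with p the fraction of differing sites, which gives a way to sanity-check the numerical search.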
Some advantages of probabilistic approaches
• Probabilistic models can be naturally extended to more realistic models
  • Model site-specific parameters
  • Model gaps
• A probabilistic framework can be used to evaluate different models of varying complexity (more parameters)
  • Different evolutionary models
• Easily combined with other probabilistic models
  • Hidden Markov models
Modeling site-specific parameters
• Recall we had assumed that the probabilities at each site are the same
• This could be relaxed by introducing an additional parameter per site u, r_u
Probabilistic interpretation of parsimony
• Recall P(a|b,t) is the key quantity of interest
• Replace P(a|b,t) by P(a|b) and use -log P(a|b) as the score
• Applying the weighted parsimony algorithm to this score to get the minimal cost tree will give an approximation to the likelihood
  • The one associated with the most likely assignment of the ancestral states
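A minimal sketch of this idea, assuming a Sankoff-style weighted parsimony recursion on a single alignment column. The time-free probabilities P(a|b) used here (0.9 for no change, the remainder split evenly over the three other bases), the nested-tuple tree encoding, and the function names are all hypothetical choices for illustration.

```python
import math

BASES = "ACGT"


def neg_log_costs(p_same=0.9):
    """Cost S(a,b) = -log P(a|b) for a hypothetical time-free P(a|b)."""
    p_diff = (1.0 - p_same) / 3.0
    return {(a, b): -math.log(p_same if a == b else p_diff)
            for a in BASES for b in BASES}


def sankoff(tree, costs):
    """Return {base: minimal cost of the subtree given that base at its root}.

    A tree is either a leaf character (str) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        return {a: (0.0 if a == tree else math.inf) for a in BASES}
    child_tables = [sankoff(child, costs) for child in tree]
    table = {}
    for a in BASES:
        # For each child, pick the child state b that is cheapest given a.
        table[a] = sum(min(costs[(b, a)] + child[b] for b in BASES)
                       for child in child_tables)
    return table


costs = neg_log_costs()
tree = (("A", "A"), ("G", "A"))  # one alignment column on a 4-leaf tree
root_table = sankoff(tree, costs)
best_cost = min(root_table.values())
```

Because the costs are negative log probabilities, `math.exp(-best_cost)` is the probability of the single most likely assignment of ancestral states, whereas the full likelihood sums over all assignments. In this sketch the root prior is ignored.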
Bootstrap: Assessing reliability of phylogenetic trees
• Bootstrap: a computational strategy used to assess confidence in an estimated quantity
  • E.g. branch length
  • Tree branching topology
• Generate a set of trees, {T1,…,TN}, from N random samples of the data
  • Sample columns/sites with replacement
  • Reconstruct a tree from the sampled columns
• One can estimate the confidence of any tree feature based on the proportion of trees in {T1,…,TN} in which the feature is seen
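The procedure above can be sketched as follows. The column resampling is the bootstrap step itself; the "tree reconstruction" is replaced by a deliberately simple stand-in (the pair of taxa with the fewest mismatches), since any real tree builder could be plugged in at that point. The function names, taxon names, sequences, and replicate count are assumptions for illustration.

```python
import random


def resample_columns(alignment, rng):
    """Sample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with fewest mismatches."""
    names = sorted(alignment)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist = sum(x != y for x, y in zip(alignment[a], alignment[b]))
            if best is None or dist < best[0]:
                best = (dist, (a, b))
    return best[1]


def bootstrap_support(alignment, feature, n_replicates=200, seed=0):
    """Fraction of replicates in which the feature (a taxon pair) appears."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == feature
               for _ in range(n_replicates))
    return hits / n_replicates


# Hypothetical alignment of three taxa.
alignment = {
    "human": "ACGTACGTACGTACGTACGT",
    "chimp": "ACGTACGTACGAACGTACGT",
    "mouse": "ACCTACGAACGAATGTACTT",
}
support = bootstrap_support(alignment, ("chimp", "human"))
```

Here `support` estimates how often the grouping of the two most similar taxa is recovered when sites are resampled. Fixing `seed` makes the estimate reproducible.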
Example of bootstrap (figure from Ziheng Yang and Bruce Rannala, Nature Reviews Genetics 2012)
Some common phylogenetic tree construction algorithms
• PhyML
  • Maximum likelihood, nearest neighbor interchange, subtree pruning and regrafting
• RAxML (Randomized Axelerated Maximum Likelihood)
  • Exists in both sequential and parallel versions
  • Also does subtree pruning and regrafting
• PHYLIP (from Felsenstein)
  • Package for distance-based, parsimony, and ML methods
• BEAST (Bayesian)
  • MCMC-based sampling
• MrBayes (Bayesian)
  • MCMC-based sampling
• For more, see http://evolution.genetics.washington.edu/phylip/software.html
Comments about phylogenetic tree construction
• Which method to pick?
  • Neighbor joining: fast, constructs the right tree if the distances are additive
  • Parsimony: does not make any assumptions about distances
  • Probabilistic:
    • More principled, provides a systematic framework to estimate model parameters
    • Enables us to quantify uncertainty in the model and evaluate different models of evolution
    • If ML distances are additive, NJ can construct the right tree
    • If branch lengths are ignored, weighted parsimony with -log P(a|b) costs approximates maximum likelihood
• Search space may be large, but
  • can find the optimal tree efficiently in some cases
  • heuristic methods can be applied
• Difficult to evaluate inferred phylogenies: ground truth not usually known
  • can look at agreement across different sources of evidence
  • can look at repeatability across subsamples of the data (bootstrap)
  • can look at indirect predictions, e.g. conservation of sites in proteins
  • methods could be assessed using simulations from a probabilistic model of phylogenies
• Phylogenies for bacteria and viruses are not so straightforward because of lateral transfer of genetic material