220 likes | 350 Views
Probabilistic Approaches to Phylogenies. BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 2 nd , 2014. Readings. Chapter 8 Sections 8.1, 8.2, 8.3, 8.4. Key concepts. Scoring a tree based on likelihood of observed data
E N D
Probabilistic Approaches to Phylogenies BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 2nd, 2014
Readings • Chapter 8 • Sections 8.1, 8.2, 8.3, 8.4
Key concepts • Scoring a tree based on likelihood of observed data • How to compute the likelihood of sequences using a given tree and conditional probabilities • Felsenstein’s algorithm • General understanding of where the conditional probabilities are obtained from • General understanding of search strategies • (similar to parsimony)
Probabilistic methods for phylogenies • data: a set of n sequences • tree: a phylogenetic tree for the n sequences • Two approaches • Maximum likelihood framework • P(data|tree) • Bayesian framework • P(tree|data) • Both need a probabilistic model to compute the probability of a set of sequences, given a tree and branch lengths
Maximum likelihood methods for phylogenetic trees • “best” tree is the one that maximizes the likelihood of data (sequence) given model (Tree topology and branch length) • Phylogenetic tree construction requires • Scoring a tree • Requires computing the likelihood of sequence given the tree • That is computing the probability of the sequences given a tree • Searching the space of possible trees • Given a probabilistic model of sequence changes, we can compute the tree with the greatest likelihood (maximum likelihood)
Notation for computing the probability of sequences given a tree • xj:sequence at node j • xji: character in the ith position for the jth sequence • tj: length of the jth branch • T :tree topology • P(x|y,t): probability of switching from y toxfrom ancestor to child along a branch of length t • We will come back to defining such probabilistic models later
Computing the probability of sequences on a tree • If we know P(x|y,t) we can compute the probability of the sequences • This relies on a key assumption: • sequence at a child node iis independent of everything else given i’s parent • E.g. for a node i whose parent is given by α(i) • P(xi|xα(i),xj,xk..)=P(xi|xα(i)) • We will also make additional simplifying assumptions • We have an ungapped alignment • Characters at different sites evolve independently
Example of computing the probability of sequences given a tree t4 x5 Assume we are given the following tree for three sequences x1, x2 and x3 at the leaf nodes x4 t3 t2 t1 x2 x3 x1 The probability of these sequences given this tree is First, assume that the sites evolve independently, and there are a total on N sites Hereafter, let’s just focus on one site, u Also for clarity, we will use t for all branch lengths
Example continued t4 x5 x4 t3 t2 • The expression • Assume conditional independence, the above is • Written more compactly as • α(i) denotes the parent of i x2 x3 t1 x1
Example continued • Or more generally for n sequences at the leaves as Between internal nodes Between extant and internal nodes
But.. the ancestral sequences cannot be observed • So our probability calculation needs to sum over all ancestral states • Let us consider a simple example of two sequences t2 t1 xu2 xu1
Summing of ancestral state for a pair of sequences a t2 t1 xu2 xu1 qais the probability of observing character a at the root node 3
Generalizing this to n sequences • Requires us to sum over all of the internal nodes • α(i) gives the parent of I • aiis a variable storing the character at theithinternal node • Felsenstein’s algorithm gives an efficient way to compute this quantity Between internal nodes Between extant and internal nodes
Felsenstein’s algorithm • Input: Given a set of n sequences at the leaf nodes, conditional probability distribution of character switch, and a tree topology • Output: The likelihood of the sequences • Also based on Dynamic programming • Relies on computations performed in subtrees for computations at the root of these subtrees • Very similar ideas as in the Weighted Parsimony algorithm
Notation for Felsenstein’s algorithm • P(Lk|a): probability of the leaves below node k, given that the residue at k is a • i and j will denote the children of k • a, b, c characters at any node • We’ll drop the subscript u and work with only one site
Felsentein’s algorithm • Initialize: k=2n-1 • Recursion: • If k is a leaf node, • Else, compute P(Li|a) and P(Lj|a) for all a at daughters i and j • Termination • Likelihood at a site
An observation and a simplication • Note that • Further more, we will assume that the conditional probabilities are independent of branch length • Finally, assume qa=0.25 for all a in {A,T,G,C}
What is probability for the following set of residues b 5 4 1 2 3 a A T G Assume the above conditional probability matrix P(b|a) for all branches
The probabilities computed for each node Probability of sequence given tree is 0.25(0.0058+0.0022+0.0154 + 0.0058)=0.0073
Felsentein’s algorithm comments • Very similar to the weighted parsimony case • Main differences are at • Leaf nodes • Minimization versus summation for internal nodes • Can it be used to infer ancestral states as well? • Instead of summing, we would maximize • As in the parsimony case, we would need to keep track of the maximizing assignment
Errata • Slide 16 and slide 17 should have • Replace P(Li|a) by P(Li|b) • Slide 20 had typos.