Probabilistic Approaches to Phylogenies

Probabilistic Approaches to Phylogenies BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 2nd, 2014

Readings • Chapter 8 • Sections 8.1, 8.2, 8.3, 8.4

Key concepts • Scoring a tree based on likelihood of observed data • How to compute the likelihood of sequences using a given tree and conditional probabilities • Felsenstein’s algorithm • General understanding of where the conditional probabilities are obtained from • General understanding of search strategies • (similar to parsimony)

Probabilistic methods for phylogenies • data: a set of n sequences • tree: a phylogenetic tree for the n sequences • Two approaches • Maximum likelihood framework • P(data|tree) • Bayesian framework • P(tree|data) • Both need a probabilistic model to compute the probability of a set of sequences, given a tree and branch lengths

Maximum likelihood methods for phylogenetic trees • “best” tree is the one that maximizes the likelihood of data (sequence) given model (Tree topology and branch length) • Phylogenetic tree construction requires • Scoring a tree • Requires computing the likelihood of sequence given the tree • That is computing the probability of the sequences given a tree • Searching the space of possible trees • Given a probabilistic model of sequence changes, we can compute the tree with the greatest likelihood (maximum likelihood)

Notation for computing the probability of sequences given a tree • $x^j$: sequence at node j • $x^j_i$: character in the ith position of the jth sequence • $t_j$: length of the jth branch • T: tree topology • P(x|y,t): probability that ancestor character y changes to x in the child along a branch of length t • We will come back to defining such probabilistic models later

Computing the probability of sequences on a tree • If we know P(x|y,t) we can compute the probability of the sequences • This relies on a key assumption: • the sequence at a child node i is independent of everything else given i’s parent • E.g. for a node i whose parent is given by α(i) • $P(x^i \mid x^{\alpha(i)}, x^j, x^k, \ldots) = P(x^i \mid x^{\alpha(i)})$ • We will also make additional simplifying assumptions • We have an ungapped alignment • Characters at different sites evolve independently

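Example of computing the probability of sequences given a tree • Assume we are given the following tree for three sequences $x^1$, $x^2$ and $x^3$ at the leaf nodes • [Figure: tree with root $x^5$, internal node $x^4$, leaves $x^1$, $x^2$, $x^3$, and branch lengths $t_1$, $t_2$, $t_3$, $t_4$] • The probability of the sequences on this tree is $P(x^1, \ldots, x^5 \mid T, t)$, treating the ancestral sequences $x^4$ and $x^5$ as known for now • First, assume that the sites evolve independently, and there are a total of N sites: $P(x^1, \ldots, x^5 \mid T, t) = \prod_{u=1}^{N} P(x^1_u, \ldots, x^5_u \mid T, t)$ • Hereafter, let’s just focus on one site, u • Also for clarity, we will use t for all branch lengths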

Example continued t4 x5 x4 t3 t2 • The expression • Assume conditional independence, the above is • Written more compactly as • α(i) denotes the parent of i x2 x3 t1 x1

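Example continued • Or more generally for n sequences at the leaves as $P(x^1_u, \ldots, x^{2n-1}_u \mid T, t) = q_{x^{2n-1}_u} \prod_{i=n+1}^{2n-2} P(x^i_u \mid x^{\alpha(i)}_u, t) \prod_{i=1}^{n} P(x^i_u \mid x^{\alpha(i)}_u, t)$ • The first product covers branches between internal nodes • The second product covers branches between extant (leaf) and internal nodes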
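The factorization over parent-child pairs can be sketched in Python for the case where every node's character is known. This is a minimal illustration, not the lecture's code: the `parent` map encodes an assumed topology for the five-node example (nodes 1 and 2 under node 4, nodes 3 and 4 under root 5), and `cond_prob` is a hypothetical substitution model that ignores branch length.

```python
# Hypothetical substitution model (an assumption for illustration):
# probability 0.7 of no change and 0.1 for each specific change,
# independent of branch length.
def cond_prob(child, parent_state):
    return 0.7 if child == parent_state else 0.1

def joint_probability(states, parent, root, q):
    """q[root state] times the product over non-root nodes of P(x_i | x_alpha(i))."""
    prob = q[states[root]]  # prior probability of the root character
    for node, par in parent.items():
        prob *= cond_prob(states[node], states[par])
    return prob

# Assumed topology: alpha(1)=4, alpha(2)=4, alpha(3)=5, alpha(4)=5; root is 5.
parent = {1: 4, 2: 4, 3: 5, 4: 5}
q = {c: 0.25 for c in "ACGT"}  # uniform root distribution (assumption)
states = {1: "A", 2: "A", 3: "G", 4: "A", 5: "G"}

p_joint = joint_probability(states, parent, root=5, q=q)
print(p_joint)
```

Each non-root node contributes exactly one factor, so the cost is linear in the number of nodes once all states are fixed.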

But the ancestral sequences cannot be observed • So our probability calculation needs to sum over all ancestral states • Let us consider a simple example of two sequences • [Figure: a root with two leaves $x^1_u$ and $x^2_u$ on branches of length $t_1$ and $t_2$]

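Summing over the ancestral state for a pair of sequences • [Figure: a root with character a and two leaves $x^1_u$ and $x^2_u$ on branches of length $t_1$ and $t_2$] • $P(x^1_u, x^2_u \mid T, t_1, t_2) = \sum_a q_a\, P(x^1_u \mid a, t_1)\, P(x^2_u \mid a, t_2)$ • $q_a$ is the probability of observing character a at the root (node 3)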
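A small sketch of this sum over the unobserved root character, assuming a uniform root distribution and a hypothetical substitution model that ignores branch length (neither is specified at this point in the lecture):

```python
ALPHABET = "ACGT"

# Hypothetical model (assumption): 0.7 for no change, 0.1 per specific change.
def cond_prob(child, parent_state):
    return 0.7 if child == parent_state else 0.1

def pair_likelihood(x1, x2, q=0.25):
    """Sum over root characters a of q_a * P(x1 | a) * P(x2 | a)."""
    return sum(q * cond_prob(x1, a) * cond_prob(x2, a) for a in ALPHABET)

same = pair_likelihood("A", "A")  # both leaves show the same character
diff = pair_likelihood("A", "G")  # leaves show different characters
total = sum(pair_likelihood(x, y) for x in ALPHABET for y in ALPHABET)
print(same, diff, total)
```

Summing `pair_likelihood` over all 16 leaf pairs gives 1, which is a quick sanity check that the result is a proper probability distribution over the observed data.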

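Generalizing this to n sequences • Requires us to sum over all of the internal nodes • $P(x^1_u, \ldots, x^n_u \mid T, t) = \sum_{a^{n+1}, \ldots, a^{2n-1}} q_{a^{2n-1}} \prod_{i=n+1}^{2n-2} P(a^i \mid a^{\alpha(i)}, t) \prod_{i=1}^{n} P(x^i_u \mid a^{\alpha(i)}, t)$ • The first product covers branches between internal nodes • The second product covers branches between extant (leaf) and internal nodes • α(i) gives the parent of i • $a^i$ is a variable storing the character at internal node i • Felsenstein’s algorithm gives an efficient way to compute this quantity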
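To see why an efficient algorithm is needed, the sum can be written out by brute force, enumerating every assignment of characters to the internal nodes. The sketch below assumes a uniform root distribution, a hypothetical branch-length-independent substitution model, and an assumed five-node topology; with n leaves it visits 4^(n-1) assignments, which grows exponentially.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical model (assumption): 0.7 for no change, 0.1 per specific change.
def cond_prob(child, parent_state):
    return 0.7 if child == parent_state else 0.1

def brute_force_likelihood(leaves, parent, internal, q=0.25):
    """Sum the joint probability over all characters at the internal nodes."""
    total = 0.0
    for assignment in product(ALPHABET, repeat=len(internal)):
        states = dict(leaves)
        states.update(zip(internal, assignment))
        prob = q  # uniform root prior, so the root character does not matter here
        for node, par in parent.items():
            prob *= cond_prob(states[node], states[par])
        total += prob
    return total

# Assumed topology: leaves 1 and 2 under node 4; leaf 3 and node 4 under root 5.
parent = {1: 4, 2: 4, 3: 5, 4: 5}
leaves = {1: "A", 2: "T", 3: "G"}
likelihood = brute_force_likelihood(leaves, parent, internal=[4, 5])
print(round(likelihood, 4))
```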

Felsenstein’s algorithm • Input: a set of n sequences at the leaf nodes, a conditional probability distribution for character changes, and a tree topology • Output: the likelihood of the sequences • Also based on dynamic programming • Computations at the root of a subtree reuse the results already computed within that subtree • Very similar ideas as in the weighted parsimony algorithm

Notation for Felsenstein’s algorithm • $P(L_k \mid a)$: probability of the leaves below node k, given that the residue at k is a • i and j will denote the children of k • a, b, c: characters at any node • We’ll drop the subscript u and work with only one site

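Felsenstein’s algorithm • Initialize: k=2n-1 • Recursion: compute $P(L_k \mid a)$ for all a • If k is a leaf node, $P(L_k \mid a) = 1$ if $a = x^k$ and 0 otherwise • Else, compute $P(L_i \mid a)$ and $P(L_j \mid a)$ for all a at daughters i and j, then set $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t)\, P(L_i \mid b)\, P(c \mid a, t)\, P(L_j \mid c)$ • Termination • Likelihood at a site: $P(x^1, \ldots, x^n \mid T, t) = \sum_a P(L_{2n-1} \mid a)\, q_a$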
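The recursion can be sketched as a post-order traversal. This is a minimal illustration under stated assumptions rather than a reference implementation: `cond_prob` is a hypothetical substitution model that ignores branch length, the root distribution is uniform, and the `children` map encodes an assumed topology. The double sum over the characters at the two daughters is evaluated as a product of two single sums, which is algebraically the same quantity.

```python
ALPHABET = "ACGT"

# Hypothetical model (assumption): 0.7 for no change, 0.1 per specific change.
def cond_prob(child, parent_state):
    return 0.7 if child == parent_state else 0.1

def leaf_likelihoods(k, children, leaves):
    """Return {a: P(L_k | a)} for node k, recursing into its two daughters."""
    if k in leaves:
        # Leaf: probability 1 for the observed character, 0 otherwise.
        return {a: 1.0 if a == leaves[k] else 0.0 for a in ALPHABET}
    i, j = children[k]
    p_i = leaf_likelihoods(i, children, leaves)
    p_j = leaf_likelihoods(j, children, leaves)
    result = {}
    for a in ALPHABET:
        sum_i = sum(cond_prob(b, a) * p_i[b] for b in ALPHABET)
        sum_j = sum(cond_prob(c, a) * p_j[c] for c in ALPHABET)
        result[a] = sum_i * sum_j
    return result

def site_likelihood(root, children, leaves, q=0.25):
    """Termination step: sum over root characters of q_a * P(L_root | a)."""
    p_root = leaf_likelihoods(root, children, leaves)
    return sum(q * p_root[a] for a in ALPHABET)

# Assumed topology: root 5 has daughters 4 and 3; node 4 has daughters 1 and 2.
children = {5: (4, 3), 4: (1, 2)}
leaves = {1: "A", 2: "T", 3: "G"}
p_root = leaf_likelihoods(5, children, leaves)
likelihood = site_likelihood(5, children, leaves)
print({a: round(p, 4) for a, p in p_root.items()}, round(likelihood, 4))
```

Each internal node does a constant amount of work per character pair, so the cost per site is linear in the number of nodes instead of exponential in the number of internal nodes.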

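An observation and a simplification • Note that $\sum_{b,c} P(b \mid a, t)\, P(L_i \mid b)\, P(c \mid a, t)\, P(L_j \mid c) = \left[\sum_b P(b \mid a, t)\, P(L_i \mid b)\right] \left[\sum_c P(c \mid a, t)\, P(L_j \mid c)\right]$ • Furthermore, we will assume that the conditional probabilities are independent of branch length, so P(b|a,t) is written P(b|a) • Finally, assume $q_a = 0.25$ for all a in {A,T,G,C}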

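What is the probability of the following set of residues? • [Figure: tree with root 5, internal node 4, and leaves 1, 2, 3 carrying residues A, T, G] • [Table: conditional probability matrix P(b|a), with axes a and b] • Assume the above conditional probability matrix P(b|a) for all branches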

In-class exercise

The probabilities computed for each node • [Figure: tree annotated with the probabilities computed at each node] • Probability of the sequences given the tree is 0.25 × (0.0058 + 0.0022 + 0.0154 + 0.0058) = 0.0073
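The intermediate arithmetic can be written out as follows. The matrix itself is not reproduced in this transcript, so the derivation assumes P(b|a) = 0.7 when b = a and 0.1 otherwise, with leaves 1 (A) and 2 (T) under node 4, and leaf 3 (G) and node 4 under root 5. Both are assumptions, chosen because they reproduce the four root values quoted above.

```latex
\begin{align*}
P(L_4 \mid a) &= P(\mathrm{A} \mid a)\, P(\mathrm{T} \mid a) =
  \begin{cases}
    0.7 \times 0.1 = 0.07 & a \in \{\mathrm{A}, \mathrm{T}\} \\
    0.1 \times 0.1 = 0.01 & a \in \{\mathrm{C}, \mathrm{G}\}
  \end{cases} \\[4pt]
\sum_b P(b \mid a)\, P(L_4 \mid b) &=
  \begin{cases}
    0.7(0.07) + 0.1(0.07 + 0.01 + 0.01) = 0.058 & a \in \{\mathrm{A}, \mathrm{T}\} \\
    0.7(0.01) + 0.1(0.07 + 0.07 + 0.01) = 0.022 & a \in \{\mathrm{C}, \mathrm{G}\}
  \end{cases} \\[4pt]
P(L_5 \mid a) &= P(\mathrm{G} \mid a) \sum_b P(b \mid a)\, P(L_4 \mid b) \\
P(L_5 \mid \mathrm{A}) &= 0.1 \times 0.058 = 0.0058 \\
P(L_5 \mid \mathrm{C}) &= 0.1 \times 0.022 = 0.0022 \\
P(L_5 \mid \mathrm{G}) &= 0.7 \times 0.022 = 0.0154 \\
P(L_5 \mid \mathrm{T}) &= 0.1 \times 0.058 = 0.0058 \\[4pt]
\sum_a q_a\, P(L_5 \mid a) &= 0.25 \times 0.0292 = 0.0073
\end{align*}
```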

Felsenstein’s algorithm comments • Very similar to the weighted parsimony case • Main differences are at • Leaf nodes • Minimization versus summation for internal nodes • Can it be used to infer ancestral states as well? • Instead of summing, we would maximize • As in the parsimony case, we would need to keep track of the maximizing assignment
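One way to sketch the maximizing variant is to replace each sum with a max and record the maximizing child character for a traceback. The code below is an illustration under assumptions, not the lecture's method: the substitution model, the uniform root distribution, and the four-leaf tree (nodes 5 and 6 under root 7) are all hypothetical.

```python
ALPHABET = "ACGT"

# Hypothetical model (assumption): 0.7 for no change, 0.1 per specific change.
def cond_prob(child, parent_state):
    return 0.7 if child == parent_state else 0.1

def max_scores(k, children, leaves, back):
    """Return {a: best probability of everything below k, given character a at k}."""
    if k in leaves:
        return {a: 1.0 if a == leaves[k] else 0.0 for a in ALPHABET}
    score = {a: 1.0 for a in ALPHABET}
    for child in children[k]:
        s_child = max_scores(child, children, leaves, back)
        for a in ALPHABET:
            # Maximize over the child's character instead of summing over it.
            best = max(ALPHABET, key=lambda b: cond_prob(b, a) * s_child[b])
            back[(child, a)] = best  # remember the maximizing assignment
            score[a] *= cond_prob(best, a) * s_child[best]
    return score

def best_ancestral_states(root, children, leaves, q=0.25):
    back = {}
    score = max_scores(root, children, leaves, back)
    root_state = max(ALPHABET, key=lambda a: q * score[a])
    states = {root: root_state}
    stack = [root]
    while stack:  # trace back from the root to recover each node's character
        k = stack.pop()
        for child in children.get(k, ()):
            states[child] = back[(child, states[k])]
            stack.append(child)
    return states, q * score[root_state]

# Hypothetical four-leaf tree: root 7 with daughters 5 (leaves 1, 2) and 6 (leaves 3, 4).
children = {7: (5, 6), 5: (1, 2), 6: (3, 4)}
leaves = {1: "A", 2: "A", 3: "A", 4: "G"}
states, best_prob = best_ancestral_states(7, children, leaves)
print(states, best_prob)
```

Ties between equally probable assignments are broken arbitrarily by `max`, so on symmetric inputs more than one reconstruction may be optimal.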

Errata • On slides 16 and 17, replace P(Li|a) by P(Li|b) • Slide 20 had typos.
