Inferring phylogenetic trees: Distance and maximum likelihood methods. GENOME 373: Genomic Informatics Prof. William Stafford Noble. Outline. Distance methods Fitch-Margoliash Neighbor joining UPGMA Maximum likelihood. One-minute responses. Is the parsimony model biologically accurate?
Inferring phylogenetic trees: Distance and maximum likelihood methods GENOME 373: Genomic Informatics Prof. William Stafford Noble
Outline • Distance methods • Fitch-Margoliash • Neighbor joining • UPGMA • Maximum likelihood
One-minute responses
• Is the parsimony model biologically accurate?
  • No. Parsimony ignores back-mutation, parallel mutation, etc.
• The following tree can have a score of 2 or 3, correct?
  • Correct. However, the idea of parsimony is to select the tree with the smallest number of mutations along the tree.
• Is it biologically acceptable to make the assumptions of the JC model?
  • No. The assumptions are made for statistical reasons – essentially, we often don’t know the proper values for the more parameter-rich models.
• What other considerations can be taken to get a better tree?
  • The most important ones are site-by-site variation in mutation rate, and dependencies between adjacent sites.
• Is there any way to check whether the tree obtained is significant?
  • You can check whether individual branches are significant using something called “bootstrap analysis.”
• Still unclear how to use these trees in a biological way.
  • Primarily, these trees are used to understand evolutionary history.
• Will we be using any of the phylogeny software in this class?
  • No.
One-minute responses
• What’s a real event that is your “oracle” that tells you the true evolutionary history of substitutions for Jukes-Cantor?
  • There is no oracle, and luckily, you don’t need one in order for Jukes-Cantor to work.
• It was difficult to understand how you were computing parsimony scores at first.
Distance methods • Fitch-Margoliash • Neighbor-joining • UPGMA • Pipeline: multiple sequence alignment → pairwise distance matrix → phylogenetic tree
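Star topology • Sum of all branches is S* = a + b + c + d + e. • Summing all pairwise distances in the matrix (each pair counted once) counts each branch four times (e.g., branch a appears in dAB, dAC, dAD and dAE). • Hence, the sum of all pairwise distances in the matrix is 4S*. [Figure: star tree with leaves A to E on branches a to e]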
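The counting argument on the star-topology slide can be checked numerically. This is a minimal sketch: the branch lengths are hypothetical values chosen only for illustration, and `star` is a name introduced here, not something from the lecture.

```python
from itertools import combinations

# Hypothetical branch lengths a..e for the five leaves of a star tree.
star = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0}

# In a star tree, the distance between two leaves is the sum of their branches.
pairwise_total = sum(star[x] + star[y] for x, y in combinations(star, 2))

# S* is the sum of all branches.
s_star = sum(star.values())

print(pairwise_total, 4 * s_star)
```

Each leaf takes part in four of the ten pairs, which is why the two printed numbers agree.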
Adding one branch • Sum of branches is S = a + b + c + d + e + f = (dAC + dAD + dAE + dBC + dBD + dBE)/6 + dAB/2 + (dCD + dCE + dDE)/3 [Figure: the star tree next to the tree in which leaves A and B are joined by a new internal branch f]
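To see why the distance formula for S recovers the branch sum, it helps to build the distances from known branch lengths and compare. The sketch below assumes A and B are the joined pair (as the dAB/2 term suggests). The branch lengths and the helper names `path_length` and `d` are hypothetical.

```python
from itertools import combinations

# Hypothetical leaf branch lengths a..e and internal branch f.
branch = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4, "E": 0.5}
f = 0.6
joined = {"A", "B"}  # the pair sitting on the far side of branch f

def path_length(x, y):
    """Tree distance between leaves x and y."""
    crosses_f = (x in joined) != (y in joined)
    return branch[x] + branch[y] + (f if crosses_f else 0.0)

dist = {frozenset(p): path_length(*p) for p in combinations("ABCDE", 2)}

def d(x, y):
    return dist[frozenset((x, y))]

# The formula from the slide, using only pairwise distances.
cross = sum(d(x, y) for x in "AB" for y in "CDE")
rest = d("C", "D") + d("C", "E") + d("D", "E")
s_formula = cross / 6 + d("A", "B") / 2 + rest / 3

# The direct sum of branches a + b + c + d + e + f.
s_direct = sum(branch.values()) + f

print(s_formula, s_direct)
```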
Neighbor joining • Add one branch to the star topology and compute the difference between S* and S. • Repeat for each pair of leaves in the tree. • Choose the pair that yields the largest difference (the closest neighbors). • Join that pair. • Repeat until all pairs are joined.
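The pair-selection step can be sketched as follows. The slides give the formula for S only for five leaves. The generalization to N leaves used here (factors 1/(2(N-2)), 1/2 and 1/(N-2)) is an assumption that reduces to the slide's formula when N = 5. The distance matrix is a hypothetical additive example, and the function names are illustrative. Only one selection step is shown, not the full algorithm.

```python
from itertools import combinations

def star_sum(dist, leaves):
    """S*: total pairwise distance divided by (N - 1)."""
    n = len(leaves)
    return sum(dist[frozenset(p)] for p in combinations(leaves, 2)) / (n - 1)

def joined_sum(dist, leaves, i, j):
    """S for the tree in which leaves i and j are joined."""
    n = len(leaves)
    others = [k for k in leaves if k not in (i, j)]
    cross = sum(dist[frozenset((i, k))] + dist[frozenset((j, k))] for k in others)
    rest = sum(dist[frozenset(p)] for p in combinations(others, 2))
    return cross / (2 * (n - 2)) + dist[frozenset((i, j))] / 2 + rest / (n - 2)

def best_pair(dist, leaves):
    """Pair whose joining gives the largest difference S* - S."""
    s_star = star_sum(dist, leaves)
    return max(combinations(leaves, 2),
               key=lambda p: s_star - joined_sum(dist, leaves, *p))

# Hypothetical distances generated from a tree in which (A, B) and (D, E)
# are the true neighbor pairs.
raw = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 5, ("A", "E"): 9,
       ("B", "C"): 8, ("B", "D"): 6, ("B", "E"): 10,
       ("C", "D"): 6, ("C", "E"): 10, ("D", "E"): 6}
dist = {frozenset(k): v for k, v in raw.items()}
leaves = "ABCDE"

pair = best_pair(dist, leaves)
print(pair)
```

Note that A and D are as close to each other as A and B in this matrix, yet joining them gives a smaller difference. The criterion rewards true neighbors rather than merely the smallest distance.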
UPGMA • Unweighted pair group method with arithmetic mean. • A form of agglomerative hierarchical clustering (average linkage). • Basic idea: iteratively connect the two most closely related sequences.
UPGMA • Find the smallest off-diagonal element in the matrix.
UPGMA • Compute the average between the two rows and columns.
UPGMA • Each merger creates a subtree. [Figure: subtree joining Smik and Sbay]
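The three UPGMA steps above (find the smallest off-diagonal entry, average the two rows and columns, record the merger as a subtree) can be sketched as below. The names and distances are hypothetical, and branch lengths are omitted for brevity. One detail is an assumption beyond the slide text: the average is weighted by cluster size, which is the usual UPGMA convention, so that every original leaf counts equally.

```python
def upgma(names, matrix):
    """Return the UPGMA topology as nested tuples of leaf names."""
    # Active clusters: id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Off-diagonal distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Step 1: smallest off-diagonal element.
        i, j = min(dist, key=dist.get)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Step 2: average the two rows/columns, weighted by cluster size.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        # Step 3: the merger becomes a new subtree.
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical distance matrix for four sequences.
names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]

tree = upgma(names, matrix)
print(tree)
```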
Maximum likelihood
for each possible tree
    for each column of the alignment
        compute the likelihood of the column, given the tree
return the tree with the highest likelihood
• Similar to parsimony, but capable of using a model of evolution.
• Computationally expensive.
• DNAML is the Phylip program for maximum likelihood. FastDNAML is a fast clone (http://geta.life.uiuc.edu/~gary/programs/fastDNAml.html).
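One way to get a feel for the cost of the outer loop is to count the candidate trees. The sketch below uses the standard count of unrooted bifurcating topologies on n labeled leaves, (2n - 5)!!. The slides do not state this figure, so treat it as supplementary, and the function name is illustrative.

```python
def num_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating topologies, (2n - 5)!!, for n >= 3."""
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Even at ten sequences there are over two million topologies, so exhaustive enumeration is only practical for very small alignments.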
Computing the likelihood • What is the probability of observing this column, given this tree and an assumed model of evolution? [Figure: alignment of ACGCGTTGGG, ACGCGTTGGG, ACGCAATGAA and ACACAGGGAA, plus a tree with leaves T, T, A, G, giving Pr(column|tree,model)]
Computing the likelihood • Solution: Enumerate all possible assignments to the internal nodes. Compute the probability of each assigned tree, and sum. [Figure: copies of the tree with leaves T, T, A, G, each with a different assignment of bases to the internal nodes]
Computing the likelihood • What is the probability of observing this column, given this assigned tree and an assumed model of evolution? [Figure: the same alignment, plus a tree with leaves T, T, A, G and bases assigned to its internal nodes, giving Pr(column|tree,model)]
Computing the likelihood • The probability of observing a substitution from A to T on a branch of length m is given by the evolutionary model. • The probability of the ancestral observation being A is just πA, where πA, πC, πG, πT are the model's base frequencies. [Figure: assigned tree with a branch of length m from A to T highlighted]
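As a concrete instance of "given by the evolutionary model", here is a sketch that assumes the Jukes-Cantor model mentioned in the one-minute responses. The slide itself does not fix a model. The branch length m is taken to be the expected number of substitutions per site, all four base frequencies are 1/4, and `jc_prob` is an illustrative name.

```python
import math

BASES = "ACGT"
PI = {base: 0.25 for base in BASES}  # Jukes-Cantor base frequencies

def jc_prob(x, y, m):
    """Probability that base x becomes base y along a branch of length m
    under the Jukes-Cantor model (an assumed choice of model)."""
    decay = math.exp(-4.0 * m / 3.0)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

# Probability of A -> T on a branch of hypothetical length 0.3.
print(jc_prob("A", "T", 0.3))
# The four outcomes from A are mutually exclusive and exhaustive.
print(sum(jc_prob("A", y, 0.3) for y in BASES))
```

On a branch of length zero no change is possible, and on a very long branch every base approaches probability 1/4.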
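Computing the likelihood • The desired probability is the product of the root probability and the probabilities of the branches. • L(tree) = L0 L1 L2 L3 L4 L5 L6 [Figure: assigned tree with L0 at the root A (from πA, πC, πG, πT), L1 and L2 on the branches to internal nodes T and A, and L3 to L6 on the branches to leaves T, T, A, G]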
Computing the likelihood • The probability of the column given the tree is the sum of the probabilities of the individual assigned trees. • L(tree) = L(tree1) + L(tree2) + L(tree3) + … [Figure: tree1, tree2, tree3, each with leaves T, T, A, G and a different assignment of bases to the internal nodes]
Maximum likelihood revisited
for each possible tree
    for each column of the alignment
        for each assignment of internal nodes
            for each branch
                compute the probability of that branch
            assigned tree probability ← multiply branch probabilities
        column probability ← sum assigned tree probabilities
    tree probability ← multiply column probabilities
return the tree with the highest probability
• Multiply probabilities of independent events.
• Add probabilities of mutually exclusive events.
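The inner loops of this pseudocode can be sketched for one fixed tree. Everything specific here is an assumption made for illustration: the Jukes-Cantor model, the branch lengths, and the node names `root`, `u`, `v`, `leaf1` to `leaf4`. The topology (a root with two internal children, each carrying two leaves) follows the seven factors L0 to L6 on the earlier slide. The outer loop over all possible trees is left out.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(x, y, m):
    """Jukes-Cantor probability of x -> y on a branch of length m (assumed model)."""
    decay = math.exp(-4.0 * m / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

# One fixed tree as (parent, child, branch length); lengths are hypothetical.
BRANCHES = [("root", "u", 0.1), ("root", "v", 0.1),
            ("u", "leaf1", 0.2), ("u", "leaf2", 0.2),
            ("v", "leaf3", 0.2), ("v", "leaf4", 0.2)]
INTERNAL = ["root", "u", "v"]
LEAVES = ["leaf1", "leaf2", "leaf3", "leaf4"]

def column_likelihood(column):
    """Sum over all assignments of bases to the internal nodes."""
    observed = dict(zip(LEAVES, column))
    total = 0.0
    for states in product(BASES, repeat=len(INTERNAL)):
        assigned = {**observed, **dict(zip(INTERNAL, states))}
        prob = 0.25  # L0: probability of the base at the root
        for parent, child, m in BRANCHES:
            prob *= jc_prob(assigned[parent], assigned[child], m)  # independent events
        total += prob  # mutually exclusive assignments
    return total

def tree_likelihood(alignment):
    """Multiply column likelihoods, treating columns as independent."""
    likelihood = 1.0
    for column in zip(*alignment):
        likelihood *= column_likelihood(column)
    return likelihood

alignment = ["ACGCGTTGGG", "ACGCGTTGGG", "ACGCAATGAA", "ACACAGGGAA"]
print(column_likelihood("TTAG"))
print(tree_likelihood(alignment))
```

With three internal nodes there are 4^3 = 64 assignments per column. Practical programs avoid this enumeration and work with log-likelihoods to prevent underflow, but the brute-force version mirrors the pseudocode most directly.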
Overview • Parsimony • Distance methods • Computing distances • Finding the tree • Fitch-Margoliash • Neighbor-joining • UPGMA • Maximum likelihood