Distances Correction for multiple changes
Distance Calculations • A 5-page excerpt from the book “Molecular Systematics” is on the course web page, as a PDF file • DNA distances are more easily analyzed • Only a 4-letter alphabet • More directly affected by mutation
Underlying mechanisms • Jukes-Cantor model • Assumes all substitutions are equally probable • Uncorrected distance D = 1 - proportion of sites unchanged • Corrected distance = $-\frac{3}{4}\ln\left(1-\frac{4}{3}D\right)$ • Felsenstein,
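A minimal Python sketch of the Jukes-Cantor correction, assuming two pre-aligned DNA sequences of equal length. The function names `p_distance` and `jukes_cantor` are illustrative, and skipping sites with gaps or ambiguity codes is an assumption of this sketch, not something the slides specify.

```python
import math


def p_distance(seq_a, seq_b):
    """Uncorrected distance D: proportion of compared sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: only sites where both sequences have A, C, G or T are compared.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def jukes_cantor(D):
    """Jukes-Cantor corrected distance: -(3/4) * ln(1 - (4/3) * D)."""
    if D >= 0.75:
        # The logarithm is undefined at or beyond saturation (D = 3/4).
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * D / 3.0)


if __name__ == "__main__":
    D = p_distance("ACGTACGTAC", "ACGTACGTAA")
    print(D, jukes_cantor(D))
```

With one difference in ten sites, D = 0.1 and the corrected distance is about 0.107. The corrected value is always at least as large as D because it accounts for multiple changes at the same site.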
Tree Searching Using Optimality Criteria Maximum Parsimony and Maximum Likelihood Methods
Searching for the Best Tree • A score for each tree can be evaluated using an objective function that uses the multiple alignment as a fixed parameter and varies the tree topology and branch lengths • Goal is to find the tree with the optimum score, which is defined as the “best” tree
Computational Problem • for small ntaxa, you can evaluate the score for all trees and then pick the tree that gives the best score • as ntaxa increases, full evaluation rapidly becomes impossible: the number of trees and the complexity of the calculation for each tree both increase, so heuristics must be used
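To make the growth concrete: the number of distinct unrooted bifurcating trees for n taxa is (2n-5)!! = 3 × 5 × ... × (2n-5). The short Python sketch below computes it; the function name `num_unrooted_trees` is illustrative.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa labelled taxa: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n))
```

Four taxa give 3 trees and five give 15, but ten taxa already give 2,027,025, which is why exhaustive evaluation is only practical for small data sets.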
Distance Based • Can try to minimize the total tree length (Minimum Evolution = ME) by varying the internal branch lengths • This calculation has to be performed for each tree topology; it is not an algorithm for constructing the tree
Maximum Parsimony • character based, not distance based • on any given tree, the states of each homologous character can be reconstructed with some minimum number of changes • if you sum the number of changes over all characters, you get the tree length
you want to find the tree with the lowest score. This is called a Maximum Parsimony tree because it is based on the idea that the explanation that requires the fewest changes is the best • there is no analytical approach for this process, so you need an algorithm that will a) evaluate tree length as fast as possible and b) search the tree-space with a high likelihood of finding the shortest tree
Three Options • Exhaustive - simply evaluates the length of every tree, therefore guaranteed to find the shortest tree(s) • Branch and Bound - searches tree space, but stops constructing a family of trees once its length exceeds a pre-existing minimum; still guaranteed to find the shortest tree
Heuristic - constructs an approximately shortest tree, then does a series of rearrangements, evaluating length in each case, selecting the shortest tree from among the rearrangements, and iterating until a shorter tree is not found • usually works well, but on certain data sets the search stops at a tree that is not the overall shortest
Informative Sites • sites at which at least two character states each appear at least twice • reason - a single appearance of any character state is most parsimoniously explained as one change on the terminal branch leading to that taxon, which costs the same on every tree
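The definition above translates directly into a small check on one alignment column. This is a sketch; the function name `is_parsimony_informative` is illustrative, and treating every symbol in the column (including any gap character) as an ordinary state is a simplifying assumption.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two character states each appear at least twice."""
    counts = Counter(column)
    states_seen_twice = sum(1 for n in counts.values() if n >= 2)
    return states_seen_twice >= 2


if __name__ == "__main__":
    for column in ("GAAG", "GAAA", "AAAA", "ACGT"):
        print(column, is_parsimony_informative(column))
```

Only the first column, `GAAG`, is informative: in the others, every state that differs from the majority appears once, so it adds the same number of changes to every tree.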
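Example • consider a four taxon set of data, three possible trees, one character • Taxon 1 = G • Taxon 2 = A • Taxon 3 = A • Taxon 4 = G • the tree that groups taxa 1 and 4 (and therefore 2 and 3) requires one change; each of the other two trees requires two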
Homoplasy • when you are considering more than one character, they may not all be consistent with the same tree • the principle of maximum parsimony says that you pick the tree with the lowest number of homoplasies (multiple independent origins of a character state)
More Complex Trees • there is an algorithm for finding the lowest score attributable to any distribution of character states on any bifurcating tree • trace back from terminal taxa to each node, define the nodal state as the intersection set of the two descendants, unless the intersection is null, in which case, define as the union
each time a union is required, that adds one to the score, because one descendant of the union must have changed • Repeat the process going from scored nodes to unscored nodes • for each tree, perform the same analysis for all characters and sum the scores; that number is the tree score • the tree with the lowest score is most parsimonious
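The intersection/union procedure described above (commonly known as the Fitch algorithm) can be sketched in Python as follows. The tree representation (nested 2-tuples with taxon names as leaves) and the function names `fitch` and `tree_length` are assumptions made for illustration. An unrooted tree is rooted on an arbitrary branch here, which does not change its parsimony length.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    tree   -- a taxon name (leaf) or a 2-tuple of subtrees (internal node)
    states -- dict mapping taxon name to its character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # Non-empty intersection: no extra change needed at this node.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the per-character change counts over all sites of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


if __name__ == "__main__":
    # One character: Taxon 1 = G, Taxon 2 = A, Taxon 3 = A, Taxon 4 = G.
    alignment = {"t1": "G", "t2": "A", "t3": "A", "t4": "G"}
    trees = [(("t1", "t2"), ("t3", "t4")),
             (("t1", "t3"), ("t2", "t4")),
             (("t1", "t4"), ("t2", "t3"))]
    for tree in trees:
        print(tree, tree_length(tree, alignment))
```

For the four-taxon, one-character example, the tree that groups taxa 1 and 4 scores 1 and the other two trees score 2, so the first of these is the most parsimonious.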
Heuristic Search • For exhaustive search or branch and bound, the search algorithm covers all possible trees (branch and bound does so implicitly, by discarding families of trees that cannot be shortest) • For heuristic search, you need to define a non-exhaustive search algorithm • Most commonly used is tree bisection-reconnection (TBR)
Can bisect any tree at any of the branches, creating two sub-trees, then reconnect by joining any pair of branches, one from each sub-tree • If none of the trees generated by a cycle of TBR is shorter than the parent tree, then the parent tree is accepted as the shortest tree • If one of the TBR-generated trees is shorter, then it is taken as the next candidate shortest tree, and is in turn subjected to a round of TBR analysis
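The accept-or-stop loop described above is a hill-climbing search. The Python sketch below shows only that loop, with the rearrangement step left abstract: `neighbors` is a hypothetical stand-in for a function that returns all TBR rearrangements of a tree, and `score` for the tree-length calculation. TBR itself is not implemented here, and the toy example uses positions in a list in place of trees.

```python
def hill_climb(start, neighbors, score):
    """Repeatedly move to the best-scoring neighbor until none is shorter.

    start     -- initial candidate (e.g. an approximately shortest tree)
    neighbors -- function returning the rearrangements of a candidate
                 (hypothetical stand-in for a round of TBR)
    score     -- function returning the length of a candidate (lower is better)
    """
    current, current_score = start, score(start)
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current, current_score
        best = min(candidates, key=score)
        best_score = score(best)
        if best_score >= current_score:
            # No rearrangement is shorter: accept the current candidate.
            return current, current_score
        current, current_score = best, best_score


if __name__ == "__main__":
    # Toy search space: positions in a list, with made-up "tree lengths".
    lengths = [5, 3, 4, 2, 1, 6]

    def toy_neighbors(i):
        return [j for j in (i - 1, i + 1) if 0 <= j < len(lengths)]

    def toy_score(i):
        return lengths[i]

    print(hill_climb(0, toy_neighbors, toy_score))
    print(hill_climb(5, toy_neighbors, toy_score))
```

Starting from position 0, the toy search stops at position 1 (length 3) because neither neighbor is shorter, even though position 4 has length 1. Starting from position 5, it reaches position 4. This is the sense in which a heuristic search can return a tree that is not the overall shortest, and why searches are often repeated from several starting trees.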