Fast Algorithms for Minimum Evolution • Richard Desper, NCBI • Olivier Gascuel, LIRMM
Overview • Statement of phylogeny reconstruction problem and various approaches to solving it. • Tree length formula as a function of average distances. • Greedy algorithms for tree building and tree swapping. • Simulation results. • A few extras regarding consistency and branch lengths.
Phylogeny Reconstruction • General problem: reconstruct the evolutionary history for a set L of extant species. • Input: multiple sequence alignment for L or matrix of estimates of pairwise evolutionary distances. • Output: weighted phylogeny representing history of L and common ancestors.
Methods • Likelihood methods: model-based likelihood maximization. • Parsimony methods: minimize total number of mutations in tree. • Distance methods: fit tree structure to inferred evolutionary distances. Leading methods include Felsenstein-Fitch-Margoliash weighted least-squares and Neighbor-Joining and its variants.
Felsenstein-Fitch-Margoliash Least-squares Method • FITCH searches the space of topologies by iteratively adding leaves and by tree swapping. • Edge weights and topology are chosen to minimize the weighted sum of squares $\sum_{i<j} (D_{ij} - D^T_{ij})^2 / s_{ij}$, where $D$ is the input metric and $D^T$ is the induced tree metric. • If $s_{ij} = 1$ for all $i$ and $j$, this is called the ordinary least-squares (OLS) method.
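The least-squares criterion can be sketched in a few lines. This is a minimal illustration, not the FITCH implementation: it assumes the weight $s_{ij}$ divides the squared error, and the function name `weighted_least_squares` and the matrix layout (nested lists) are hypothetical choices made for this example.

```python
from itertools import combinations


def weighted_least_squares(D, DT, s=None):
    """Sum over pairs i < j of (D[i][j] - DT[i][j])**2 / s[i][j].

    D  : input distance matrix (list of lists).
    DT : tree-induced distance matrix of the same shape.
    s  : per-pair weights (assumed to divide the squared error);
         when omitted, every weight is 1, i.e. ordinary least squares.
    """
    n = len(D)
    total = 0.0
    for i, j in combinations(range(n), 2):
        w = 1.0 if s is None else s[i][j]
        total += (D[i][j] - DT[i][j]) ** 2 / w
    return total
```

Calling it without `s` gives the OLS criterion; passing a matrix of weights gives the weighted variant.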
Minimum Evolution • Developed by Rzhetsky and Nei (1992) as a modification of the OLS method. • For each topology T: define the function $\ell$ assigning OLS lengths to the edges of T, and define the size of the tree as $\ell(T) = \sum_{e \in T} \ell(e)$. • Choose the topology T minimizing $\ell(T)$.
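Recursive Definition of $D_{A|B}$ • If $A = \{a\}$ and $B = \{b\}$, then $D_{A|B} = D_{ab}$. • For $B = B_1 \cup B_2$, $D_{A|B} = \frac{|B_1|}{|B|} D_{A|B_1} + \frac{|B_2|}{|B|} D_{A|B_2}$. • All average distances between pairs of non-intersecting subtrees of a given topology can be calculated in $O(n^2)$ time.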
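The recursion can be illustrated directly. The sketch below assumes that a composite subtree is split into two parts and its average is the size-weighted mean of the parts' averages; subtrees are encoded as nested pairs of leaf indices, which is a hypothetical representation chosen for brevity. It follows the definition literally and does not attempt the $O(n^2)$ bookkeeping over all subtree pairs.

```python
def leaves(t):
    """Leaf indices of a subtree given as an int (leaf) or a pair of subtrees."""
    return [t] if isinstance(t, int) else leaves(t[0]) + leaves(t[1])


def avg_dist(D, A, B):
    """Average distance D_{A|B} between disjoint subtrees A and B.

    D must be symmetric. A composite subtree is split into its two
    children and the result is weighted by the children's sizes.
    """
    if isinstance(A, int) and isinstance(B, int):
        return D[A][B]                      # base case: two single leaves
    if isinstance(B, int):
        return avg_dist(D, B, A)            # recurse on the composite side
    B1, B2 = B
    n1, n2 = len(leaves(B1)), len(leaves(B2))
    return (n1 * avg_dist(D, A, B1) + n2 * avg_dist(D, A, B2)) / (n1 + n2)
```

For example, `avg_dist(D, 0, (1, (2, 3)))` is the plain mean of the distances from leaf 0 to leaves 1, 2 and 3.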
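External OLS Edge Length Function [Figure: leaf i attached by edge e to a vertex whose two other subtrees are A and B.] • If e is the edge connecting the leaf i to the subtrees A and B, then $\ell(e) = \frac{1}{2}\left(D_{A|i} + D_{B|i} - D_{A|B}\right)$.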
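Internal OLS Edge Length Function [Figure: internal edge e separating subtrees A and B from subtrees C and D.] • The length of the edge e is (Vach, 1988) $\ell(e) = \frac{1}{2}\left[\lambda (D_{A|C} + D_{B|D}) + (1 - \lambda)(D_{A|D} + D_{B|C}) - (D_{A|B} + D_{C|D})\right]$, where $\lambda = \frac{|A||D| + |B||C|}{(|A| + |B|)(|C| + |D|)}$.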
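A short sketch of the two OLS edge-length functions follows. It assumes the forms $\ell(e) = \frac{1}{2}(D_{A|i} + D_{B|i} - D_{A|B})$ for an external edge and $\ell(e) = \frac{1}{2}[\lambda(D_{A|C} + D_{B|D}) + (1-\lambda)(D_{A|D} + D_{B|C}) - (D_{A|B} + D_{C|D})]$ with $\lambda = (|A||D| + |B||C|) / ((|A|+|B|)(|C|+|D|))$ for an internal edge. The function names and the dictionary layout are hypothetical.

```python
def external_length(d_Ai, d_Bi, d_AB):
    """OLS length of the edge joining leaf i to subtrees A and B."""
    return 0.5 * (d_Ai + d_Bi - d_AB)


def internal_length(sizes, d):
    """OLS length of the internal edge separating A, B from C, D.

    sizes : dict mapping 'A', 'B', 'C', 'D' to subtree sizes.
    d     : dict mapping 'AB', 'AC', 'AD', 'BC', 'BD', 'CD' to the
            average distance between the two named subtrees.
    """
    a, b, c, dd = (sizes[x] for x in "ABCD")
    lam = (a * dd + b * c) / ((a + b) * (c + dd))
    return 0.5 * (lam * (d["AC"] + d["BD"])
                  + (1 - lam) * (d["AD"] + d["BC"])
                  - (d["AB"] + d["CD"]))
```

On a four-leaf tree with additive distances (pendant edges 1, 2, 3, 4 and internal edge 5), both functions recover the true edge lengths.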
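Tree Length Formula [Figure: tree T with internal edge e separating subtrees A and B from subtrees C and D.] • Lemma: with T as in the figure, let $x$ denote the root of subtree $X$, and $e_X$ the edge to $X$, for $X \in \{A, B, C, D\}$. Then $\sum_{X} \ell(e_X) = D_{A|B} + D_{C|D} - \sum_{X} D^T_{x|X}$, where $D^T_{x|X}$ is the average tree distance from $x$ to the leaves of $X$.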
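Tree Length Formula • With T as in the prior slide, using the lemma and the branch length formula for $\ell(e)$: $\ell(T) = \frac{1}{2}\left[\lambda (D_{A|C} + D_{B|D}) + (1 - \lambda)(D_{A|D} + D_{B|C}) + D_{A|B} + D_{C|D}\right] + \sum_{X \in \{A,B,C,D\}} \left(\ell(X) - D^T_{x|X}\right)$, where $\ell(X)$ is the total length of subtree $X$ and $D^T_{x|X}$ is the average tree distance from its root $x$ to its leaves.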
General approach • To search the space of topologies, we’ll keep in memory two data structures: • Sizes of each subtree of given topology • Matrix of average distances DX|Y for X,Y disjoint subtrees in given topology • As we move from one topology to another, we’ll update the matrix, but only as much as needed, in an efficient manner.
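Tree Swapping by NNI [Figure: tree T, with edge e separating subtrees A and B from C and D, and tree T', obtained by exchanging B and C across e.] • Nearest-neighbor interchange (NNI) swapping is a basic step in topology building and searching.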
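Tree Length after NNI • Given the tree swap $T \to T'$ in the prior slide and the edge length function $\ell$: $\ell(T) - \ell(T') = \frac{1}{2}\left[(\lambda - 1)(D_{A|C} + D_{B|D}) - (\lambda' - 1)(D_{A|B} + D_{C|D}) - (\lambda - \lambda')(D_{A|D} + D_{B|C})\right]$ (1) where $\lambda = \frac{|A||D| + |B||C|}{(|A| + |B|)(|C| + |D|)}$ and $\lambda' = \frac{|A||D| + |B||C|}{(|A| + |C|)(|B| + |D|)}$ are constants depending on the topologies.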
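The change in tree length produced by one NNI can be evaluated from six subtree averages and four subtree sizes. The sketch below assumes the form of Equation (1) written in the docstring, for the swap that exchanges B and C; the function name and dictionary layout are hypothetical.

```python
def nni_length_change(sizes, d):
    """Return l(T) - l(T') for the NNI that exchanges subtrees B and C.

    Assumed form of Equation (1):
        l(T) - l(T') = 1/2 * [ (lam - 1)  * (D_AC + D_BD)
                             - (lam2 - 1) * (D_AB + D_CD)
                             - (lam - lam2) * (D_AD + D_BC) ]
    with lam computed on T (A, B | C, D) and lam2 on T' (A, C | B, D).
    A positive result means the swap shortens the tree.
    """
    a, b, c, dd = (sizes[x] for x in "ABCD")
    lam = (a * dd + b * c) / ((a + b) * (c + dd))
    lam2 = (a * dd + b * c) / ((a + c) * (b + dd))
    return 0.5 * ((lam - 1) * (d["AC"] + d["BD"])
                  - (lam2 - 1) * (d["AB"] + d["CD"])
                  - (lam - lam2) * (d["AD"] + d["BC"]))
```

With additive distances generated by the topology T itself, the result is negative: the swap would lengthen the tree, so it is rejected.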
OLS: FASTNNI • Step 1: Pre-compute average distances between non-intersecting sub-trees. ($O(n^2)$ computations) • Step 2: Loop over all internal edges and select the best swap using Equation (1). ($O(n)$) • Step 3: If no swap improves the length of the tree, stop and return the tree; otherwise perform the best swap, update the matrix of average distances, and repeat Step 2. ($O(n)$ per swap; there is only one new split.) • Thus, if we require p swaps, the total complexity of FASTNNI is $O(n^2 + pn)$.
Balanced Minimum Evolution • Gascuel (2000) observed that the OLS/ME method was weaker than NJ at recovering the correct topology. • To simplify tree length computation, Pauplin (2000) proposed a “balanced” version of Minimum Evolution, which weights each sub-tree equally when calculating averages: if A and B are sub-trees of T, with $B = B_1 \cup B_2$, then $D_{A|B} = \frac{1}{2}\left(D_{A|B_1} + D_{A|B_2}\right)$.
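The balanced average differs from the ordinary one only in how the two halves of a subtree are combined. The sketch below assumes the rule $D_{A|B} = \frac{1}{2}(D_{A|B_1} + D_{A|B_2})$ and encodes subtrees as nested pairs of leaf indices, a hypothetical representation used only for this example.

```python
def balanced_avg(D, A, B):
    """Balanced average distance between disjoint subtrees A and B.

    D must be symmetric. Each child of a composite subtree gets
    weight 1/2, regardless of how many leaves it contains.
    """
    if isinstance(A, int) and isinstance(B, int):
        return D[A][B]                      # base case: two single leaves
    if isinstance(B, int):
        return balanced_avg(D, B, A)        # recurse on the composite side
    B1, B2 = B
    return 0.5 * (balanced_avg(D, A, B1) + balanced_avg(D, A, B2))
```

For the subtree `(1, (2, 3))`, leaf 1 receives weight 1/2 while leaves 2 and 3 receive 1/4 each, whereas the ordinary average would give each leaf weight 1/3.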
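BNNI • Step 1: Calculate balanced averages of all pairs of sub-trees. ($O(n^2)$) • Step 2: Calculate the improvement for each swap using Equation (2). (Under the balanced scheme, Equation (1) holds with $\lambda = \lambda' = \frac{1}{2}$.) • Step 3: If no tree swap improves the length of the tree, stop and return the tree; otherwise perform the best swap, update the matrix of average distances, and repeat Step 2. ($O(n \operatorname{diam}(T))$ per swap) • The average complexity, when performing p swaps, is $O(n^2 + pn \operatorname{diam}(T))$.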
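Updating Subtree Averages [Figure: tree T with edge e separating subtrees A and B from C and D; X and Y are subtrees with roots x and y.] • If we perform the B-C tree swap, then we must recalculate the balanced averages $D_{X|Y}$ affected by the swap. • Q: How many recalculations? (Hint: you can count $(x, y)$ pairs.) A: $O(n \operatorname{diam}(T))$. • Typical values for $\operatorname{diam}(T)$: $O(\log n)$ under the Yule-Harding distribution; $O(\sqrt{n})$ under the uniform distribution.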
Building trees from scratch We have NNI algorithms for OLS and balanced branch lengths. But what if we have no initial topology for NNIs?
OLS: Greedy Minimum Evolution (GME) • Start with a three-taxon tree $T_3$. • For k = 4 to n: (a) calculate $D_{k|A}$ for each subtree A in $T_{k-1}$; (b) express the cost of inserting k along edge e as f(e), using Equation (3) on the next slide; (c) choose e minimizing f and insert k along e to form $T_k$; (d) update the matrix of average distances between every pair of 2-distant subtrees. • GME runs in $O(n^2)$ time.
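Greedy Minimum Evolution [Figure: trees T and T', each on subtrees A, B, C and the new taxon k, differing in the edge along which k is inserted.] • We use a variant of Equation (1), where D = {k}: T separates {A, B} from {C, k}, and T' separates {A, C} from {B, k}. Let $L = \ell(T)$. Then $\ell(T') = L + \frac{1}{2}\left[(\lambda - \lambda')(D_{k|A} + D_{B|C}) + (\lambda' - 1)(D_{A|B} + D_{k|C}) + (1 - \lambda)(D_{A|C} + D_{k|B})\right]$ (3) where $\lambda = \frac{|A| + |B||C|}{(|A| + |B|)(|C| + 1)}$ and $\lambda' = \frac{|A| + |B||C|}{(|A| + |C|)(|B| + 1)}$.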
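The cost of moving the insertion point of k from one edge to a neighboring edge can be sketched as follows. This assumes Equation (3) is Equation (1) specialized to D = {k}, with k attached next to C in T and next to B in T'; the function name, argument order, and dictionary keys are hypothetical.

```python
def insertion_cost_shift(nA, nB, nC, d):
    """Return l(T') - l(T) when taxon k moves from the edge to C (tree T)
    to the edge to B (tree T').

    nA, nB, nC : sizes of subtrees A, B, C (k counts as size 1).
    d          : dict of average distances keyed 'kA', 'kB', 'kC',
                 'AB', 'AC', 'BC'.
    Assumed form: Equation (1) with D = {k}.
    """
    lam = (nA + nB * nC) / ((nA + nB) * (nC + 1))
    lam2 = (nA + nB * nC) / ((nA + nC) * (nB + 1))
    return 0.5 * ((lam - lam2) * (d["kA"] + d["BC"])
                  + (lam2 - 1) * (d["AB"] + d["kC"])
                  + (1 - lam) * (d["AC"] + d["kB"]))
```

Because only differences between neighboring edges are needed, a greedy insertion step can fix the cost of one edge at zero and propagate these shifts across the tree to rank every candidate edge.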
Balanced Minimum Evolution (BME) • Same as GME, with these modifications: • Calculate balanced average distances instead of ordinary average distances. • Use $\lambda = \frac{1}{2}$ to find weights for insertion points. • Must keep average distances for all pairs of sub-trees. • BME runs in $O(n^2 \operatorname{diam}(T))$ time.
Simulations • Created 24- and 96-taxon trees, 2000 of each size, using a Yule-Harding process (which yields molecular-clock trees). • Edge lengths were multiplied by $(1.0 + \mu X)$, where X is exponentially distributed. • Generated trees with three rates of evolution. • SeqGen was used to generate sequences for each tree and rate (12,000 in all). • DNADIST was used to calculate distance matrices.
Results: topological distances • BNNI improved all input trees.
Results: topological distances • This improvement is large with fast rates and high numbers of taxa.
Results: topological distances • NNI trees are close to the best possible for BME.
Results: topological distances • The quality of the NNI tree is (mostly) independent of the starting point.
Results: topological distances • FASTNNI trees are comparable to NJ trees as n grows to 96.
Computational Times (MM:SS) • Computations were done on a Sun Enterprise E4500/E5500 running Solaris 8 on ten 400-MHz processors with 7 GB of memory.
Average number of NNIs We see that the average number of NNIs is considerably lower than the number of taxa.
BME = WLS • Why does the balanced approach work so well? • Pauplin’s formula for the length of a tree is $\ell(T) = \sum_{i<j} 2^{1 - p_T(i,j)} D_{ij}$, where $p_T(i,j)$ is the length (number of edges) of the $(i,j)$ path in T. • BME is a weighted least-squares approach with $s_{ij} \propto 2^{p_T(i,j)}$. • Distantly related taxa see their importance decrease exponentially.
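Pauplin's formula can be checked numerically on a small tree. The sketch below assumes the form $\ell(T) = \sum_{i<j} 2^{1 - p_T(i,j)} D_{ij}$ with $p_T(i,j)$ counted in edges; the adjacency-dictionary encoding of the topology and both function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def path_edge_counts(adj, src):
    """Breadth-first search: number of edges from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def balanced_tree_length(adj, leaf_ids, D):
    """Tree length as the sum over leaf pairs of 2**(1 - p_ij) * D[i][j]."""
    total = 0.0
    for i, j in combinations(leaf_ids, 2):
        p = path_edge_counts(adj, i)[j]
        total += 2.0 ** (1 - p) * D[i][j]
    return total


# Four-leaf tree: leaves 0, 1 hang off node "u"; leaves 2, 3 off node "v".
adj = {0: ["u"], 1: ["u"], 2: ["v"], 3: ["v"],
       "u": [0, 1, "v"], "v": [2, 3, "u"]}
# Additive distances for pendant edges 1, 2, 3, 4 and internal edge 5.
D = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
print(balanced_tree_length(adj, [0, 1, 2, 3], D))
```

When the distances are additive on the given topology, the formula returns the sum of the edge lengths, here 1 + 2 + 3 + 4 + 5.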
Bonus features • BME is a consistent method: as observed distances converge to true distances, the true topology becomes the minimum evolution tree. • The BNNI tree has no negative branch lengths: a negative value of the branch length function implies that an NNI leads to a shorter tree.
Consistency of Balanced ME • Theorem: Suppose S is a weighted tree, and $\mathcal{T}$ is a tree topology incompatible with S. Let T be the tree of topology $\mathcal{T}$ with weights determined by the balanced scheme. Then $\ell(T) > \ell(S)$. • Lemma: it suffices to prove the case when S is a split metric.
Balanced ME consistency • Basic idea: let $\ell$ be the tree length function on the space of topologies. We find a sequence of topologies $T = T_0, T_1, \ldots, T_k = S$ such that: • each $T_{i+1}$ can be reached from $T_i$ via one of two simple topological transformations; • $\ell(T_i) > \ell(T_{i+1})$ for all i. • Proof structure modeled after the OLS/ME proof (Rzhetsky and Nei, 1993).
Type I transformation [Figure: an NNI on subtrees A, B, C, D, shown before and after.] • Color the leaves black or white according to the split metric S. • A Type I transformation uses an NNI to form a larger monochromatic cluster. • This transformation reduces the size of the tree under $\ell$.
Type II transformation [Figure: subtrees A1, A2, B1, B2, C, shown before and after.] • A Type II transformation uses two NNIs to form two monochromatic subtrees. • This transformation also reduces the size of the tree under $\ell$.
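Positive Branch Lengths after BNNI [Figure: internal edge e separating subtrees A and B from subtrees C and D.] • Recall that, under the balanced scheme, the length of an internal edge is described by $\ell(e) = \frac{1}{2}\left[\frac{1}{2}(D_{A|C} + D_{B|D} + D_{A|D} + D_{B|C}) - (D_{A|B} + D_{C|D})\right]$. • We do not perform the B-C switch because $\ell(T) \le \ell(T')$, i.e. $D_{A|B} + D_{C|D} \le D_{A|C} + D_{B|D}$. Thus $D_{A|C} + D_{B|D} - D_{A|B} - D_{C|D} \ge 0$. • Similarly, for the other NNI across e, $D_{A|D} + D_{B|C} - D_{A|B} - D_{C|D} \ge 0$. • Adding the two inequalities gives $\ell(e) \ge 0$.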
Conclusions • BME + BNNI runs in $O((n^2 + pn)\operatorname{diam}(T))$ time and outputs trees comparable to, or better than, those of FITCH, Weighbor, BioNJ, or NJ. • FastME is faster than NJ or its variants. • BNNI consistently improved output trees in all settings, even when WLS/FITCH trees were input. • BNNI outputs trees without negative branch lengths. • FastME software is available at http://www.ncbi.nlm.nih.gov/CBBResearch/Desper/FastME.html or http://www.lirmm.fr/~w3ifa/MAAS/.