
http://creativecommons.org/licenses/by-sa/2.0/

CIS786, Lecture 4 Usman Roshan

Iterated local search: escape local optima by perturbation [Figure: local search reaches a local optimum; a perturbation moves the solution away from it; local search is then applied to the output of the perturbation.]

ILS for MP • We saw that ratchet improves upon iterative improvement • We saw that TNT’s sophisticated and faster implementation outperforms ratchet and PAUP* implementations • But can we do even better?

Disk Covering Methods (DCMs) • DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree. • DCMs to date • DCM1: for improving statistical performance of distance-based methods. • DCM2: for improving heuristic search for MP and ML • DCM3: latest, fastest, and best (in accuracy and optimality) DCM

DCM2 technique for speeding up MP searches • 1. Decompose sequences into overlapping subproblems • 2. Compute subtrees using a base method • 3. Merge subtrees using the Strict Consensus Merge (SCM) • 4. Refine to make the tree binary
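The four steps can be read as a generic booster that wraps any base method. The sketch below is only an illustration of that control flow: the function name `dcm_boost` and the toy stand-ins for decomposition, base method, merge, and refinement are hypothetical, and none of them implements a real DCM component.

```python
from typing import Callable, List, Sequence


def dcm_boost(taxa: Sequence[str],
              decompose: Callable,
              base_method: Callable,
              merge: Callable,
              refine: Callable):
    """Run the four DCM steps in order with caller-supplied components."""
    subsets = decompose(taxa)                      # 1. overlapping subproblems
    subtrees = [base_method(s) for s in subsets]   # 2. base method on each subset
    supertree = merge(subtrees)                    # 3. merge subtrees (e.g. SCM)
    return refine(supertree)                       # 4. refine the merged tree


# Toy stand-ins (hypothetical): a "tree" is just the set of taxa it contains.
def toy_decompose(taxa: Sequence[str]) -> List[List[str]]:
    mid = len(taxa) // 2
    return [list(taxa[: mid + 1]), list(taxa[mid - 1:])]  # two overlapping halves


def toy_base(subset):
    return frozenset(subset)


def toy_merge(trees):
    return frozenset().union(*trees)


result = dcm_boost(list("abcdef"), toy_decompose, toy_base, toy_merge, sorted)
print(result)
```

Passing the components as callables mirrors the "booster" idea: the same wrapper can be reused with a different base method or decomposition.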

DCM1 and DCM2 decompositions • DCM1 decomposition: NJ gets better accuracy on small-diameter subproblems • DCM2 decomposition: getting a smaller number of smaller subproblems speeds up the solution

Supertree Methods

Strict Consensus Merger [Figure: worked example on trees with leaves labeled 1 to 7.]
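One ingredient of a strict consensus merge is the strict consensus of two trees restricted to the taxa they share. The sketch below covers only that step, under assumptions not stated on the slide: each unrooted tree is given as its set of nontrivial bipartitions, each bipartition is stored as the side that does not contain a fixed reference taxon `ref`, and `ref` is one of the shared taxa. Reattaching the unshared taxa, which a full merger also needs, is not shown.

```python
def restrict(splits, keep, ref):
    """Restrict bipartitions to the taxa in `keep`, dropping trivial ones."""
    assert ref in keep
    out = set()
    for side in splits:
        s = frozenset(side) & keep
        # A bipartition is informative only with at least 2 taxa on each side.
        if 2 <= len(s) <= len(keep) - 2:
            out.add(s)
    return out


def strict_consensus_on_overlap(splits1, leaves1, splits2, leaves2, ref):
    """Bipartitions present in both trees once restricted to shared taxa."""
    common = frozenset(leaves1) & frozenset(leaves2)
    return restrict(splits1, common, ref) & restrict(splits2, common, ref)


# Tree 1: ((a,b),c,(d,e))   Tree 2: ((a,b),(c,f),d)   reference taxon: a
t1 = {frozenset("cde"), frozenset("de")}
t2 = {frozenset("cdf"), frozenset("cf")}
shared = strict_consensus_on_overlap(t1, "abcde", t2, "abcdf", ref="a")
print(shared)
```

On the shared taxa {a, b, c, d} both trees agree on the single split ab|cd, so that is the only bipartition kept.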

Tree Refinement [Figure: worked example on trees with leaves labeled a to h.]

The big question: why DCMs? • Can DCMs improve upon existing methods such as neighbor-joining, PAUP*, or TNT?

Improving sequence length requirements of NJ • Can DCM1 improve upon NJ? • We examine this question under simulation

DCM1(NJ)

Computing tree for one threshold

Recall simulation studies

Experimental results • True tree selection (phase II of DCM1) • Uniformly random trees • Birth-death random trees • Sequence length requirements on birth-death random trees

Comparing tree selection techniques

Error rates on uniform random trees

Error as a function of evolutionary rate [Figure: NJ compared with DCM1-NJ+MP.]

Sequence length requirements as a function of evolutionary rates (100 taxa, 90% accuracy)

Sequence length requirements as a function of evolutionary rates (400 taxa, 90% accuracy)

Sequence length requirements as a function of #taxa [Figure: DCM1-NJ+MP compared with NJ.]


Conclusion • DCM1-NJ+MP improves upon NJ in large and divergent settings • Why did it work? • Smaller datasets with low evolutionary diameters AND a reliable supertree method → accurate subtrees (on subsets) → accurate supertree • But can we improve upon MP heuristics, particularly on large datasets?

Previously we saw a comparison of DCM components for solving MP • DCM2 better than DCM1 decomposition • SCM better than MRP (in DCM context) • Constrained refinement better than Inferred Ancestral States technique • Higher thresholds take longer but can produce better trees

Comparison of DCM components for solving MP • DCM2 better than DCM1 decomposition • SCM better than MRP (in DCM context) • Constrained refinement better than Inferred Ancestral States technique • Higher thresholds take longer but can produce better trees • Can DCM2 improve over TNT? (TNT is the state of the art in solving MP, with very fast routines for TBR.)

I. Comparison of DCMs (1,322 sequences) Base method is the TNT-ratchet.


I. Comparison of DCMs (4,583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.

DCM2 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator • Red: subset 1 • Pink: subset 2 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • Separator is very large • Subsets are very large • Scattered subsets
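The p-distance used for the layout is the fraction of aligned sites at which two sequences differ, i.e. the Hamming distance divided by the alignment length. A minimal sketch, assuming the sequences are already aligned to equal length; the taxon names and sequences are made-up toy data:

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric matrix of pairwise p-distances, keyed by taxon name."""
    names = sorted(seqs)
    d = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d[n][m] = d[m][n] = p_distance(seqs[n], seqs[m])
    return d


toy = {"t1": "ACGT", "t2": "ACGA", "t3": "TCGA"}  # hypothetical data
d = distance_matrix(toy)
print(d["t1"]["t2"], d["t1"]["t3"], d["t2"]["t3"])
```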

Doesn’t look anything like this

DCM3 decomposition • DCM2 • Input: distance matrix d, threshold q, sequences S • Algorithm: • 1a. Compute a threshold graph G using q and d • 1b. Perform a minimum weight triangulation of G • DCM3 • Input: guide-tree T on S, sequences S • Algorithm: • Compute a short quartet graph G using T. The graph G is provably triangulated. • Find a separator X in G which minimizes max_i |A_i ∪ X|, where A_1, ..., A_r are the connected components of G – X • Output subproblems as A_i ∪ X • DCM3 advantage: it is faster and produces smaller subproblems than DCM2
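Two pieces of this slide are easy to make concrete: the threshold graph of step 1a, and the quantity a separator is scored by. The sketch below assumes that the threshold graph joins two taxa whenever their distance is at most q, and that the objective is the size of the largest subproblem, i.e. the largest connected component of G – X taken together with X. It evaluates a given separator; it does not search for the best one, and it does not triangulate G. All function names are hypothetical.

```python
def threshold_graph(d, q):
    """Adjacency sets: i and j are adjacent when d[i][j] <= q."""
    n = len(d)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= q:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def components_without(adj, separator):
    """Connected components of the graph after deleting `separator`."""
    remaining = set(adj) - set(separator)
    comps = []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in remaining:
                    remaining.remove(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def largest_subproblem(adj, separator):
    """Size of the largest component-plus-separator subproblem."""
    sep = set(separator)
    sizes = [len(c | sep) for c in components_without(adj, sep)]
    return max(sizes, default=len(sep))


# Five taxa on a line at positions 0..4; q = 1 yields the path 0-1-2-3-4.
d = [[abs(i - j) for j in range(5)] for i in range(5)]
g = threshold_graph(d, q=1)
print(largest_subproblem(g, {2}), largest_subproblem(g, {1}))
```

In the toy example the middle taxon is the better separator: it leaves two subproblems of three taxa each, whereas separating at taxon 1 leaves a subproblem of four.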

DCM3 decomposition - example

Approx centroid-edge DCM3 decomposition – example • Locate the centroid edge e (O(n) time) • Set the closest leaves around e to be the separator (O(n) time) • Remaining leaves in subtrees around e form the subsets (unioned with the separator)
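The first bullet, locating the centroid edge, can be sketched in a single linear-time pass. The code assumes the centroid edge is the edge whose removal splits the leaves most evenly and that the guide tree is given as an adjacency dictionary; choosing the separator leaves around that edge is not shown. The node names in the example are hypothetical.

```python
def centroid_edge(adj):
    """Return (edge, leaves_on_one_side, leaves_on_other_side)."""
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    total = len(leaves)
    root = next(iter(adj))
    parent = {root: None}
    order, stack = [], [root]
    while stack:                       # iterative DFS; parents precede children
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)
    below = {v: int(v in leaves) for v in adj}
    for u in reversed(order):          # children are summed before their parents
        if parent[u] is not None:
            below[parent[u]] += below[u]
    # The edge (parent[v], v) separates below[v] leaves from the rest.
    best = min((v for v in adj if parent[v] is not None),
               key=lambda v: abs(total - 2 * below[v]))
    return (parent[best], best), below[best], total - below[best]


# Caterpillar tree: leaves a,b on u1; c on u2; d on u3; e,f on u4.
tree = {
    "u1": ["a", "b", "u2"], "u2": ["u1", "c", "u3"],
    "u3": ["u2", "d", "u4"], "u4": ["u3", "e", "f"],
    "a": ["u1"], "b": ["u1"], "c": ["u2"],
    "d": ["u3"], "e": ["u4"], "f": ["u4"],
}
edge, side1, side2 = centroid_edge(tree)
print(edge, side1, side2)
```

Each node and edge is visited a constant number of times, which is consistent with the O(n) bound quoted for this step.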

Time to compute DCM3 decompositions • An optimal DCM3 decomposition takes O(n^3) time to compute, the same as for DCM2 • The centroid edge DCM3 decomposition can be computed in O(n^2) time • An approximate centroid edge decomposition can be computed in O(n) time (from here on we assume we are using the approximate centroid edge decomposition)


DCM3 decomposition on 500 rbcL genes (Zilla dataset) • Blue: separator (and subset) • Red: subset 2 • Pink: subset 3 • Yellow: subset 4 • Visualization produced by the graphviz program, which draws the graph according to specified distances • Nodes: species in the dataset • Distances: p-distances (Hamming) between the DNA sequences • Separator is small • Subsets are small • Compact subsets

Comparison of DCMs [Figure: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0.00 to 0.30), against time in hours (x-axis, 0 to 24), for TNT, DCM2, DCM3, and Rec-DCM3.] • Dataset: 4,583 actinobacteria ssu rRNA sequences from RDP. Base method is the TNT-ratchet. • DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. • DCM3 followed by TNT-ratchet doesn’t improve over TNT • Recursive-DCM3 followed by TNT-ratchet doesn’t improve over TNT
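The y-axis quantity is a relative gap, which can be computed as below. The sketch assumes "optimal" means the best MP score known for the dataset; the scores in the example are invented for illustration and are not results from the lecture.

```python
def percent_above_optimal(score: float, best: float) -> float:
    """Gap between an MP score and the best known score, in percent."""
    return 100.0 * (score - best) / best


def average_percent_above(scores, best: float) -> float:
    """Mean gap over several runs of the same method."""
    return sum(percent_above_optimal(s, best) for s in scores) / len(scores)


# Hypothetical scores from three runs against a best known score of 10000.
gap = percent_above_optimal(10050, 10000)
mean_gap = average_percent_above([10050, 10000, 10100], 10000)
print(gap, mean_gap)
```

Because MP scores on large datasets are large numbers, gaps well under one percent, as on this axis, can still correspond to many extra parsimony steps.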

Local optima are a problem [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]

Local optima are a problem [Figure: average MP score above optimal, shown as a percentage of the optimal, against time in hours.]

Iterated local search: escape local optima by perturbation [Figure: local search reaches a local optimum; a perturbation moves the solution away from it; local search is then applied to the output of the perturbation.]
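The loop in the figure can be written as a short generic routine. This is a sketch on a toy one-dimensional landscape, not a tree search: the landscape, the neighborhood, the perturbation, and the rule of keeping a candidate only when it is no worse are all assumptions made for the example.

```python
import random

LANDSCAPE = [5, 3, 4, 2, 6, 1, 7, 0, 8]  # toy costs; index 7 is the global optimum


def cost(i):
    return LANDSCAPE[i]


def hill_descent(i):
    """Local search: move to the better neighbor until none improves."""
    while True:
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]
        j = min(nbrs, key=cost)
        if cost(j) >= cost(i):
            return i
        i = j


def perturb(i, rng):
    """Random jump of up to three positions, clamped to the landscape."""
    return max(0, min(len(LANDSCAPE) - 1, i + rng.randint(-3, 3)))


def iterated_local_search(start, cost, local_search, perturb, iterations, rng):
    best = local_search(start)
    for _ in range(iterations):
        candidate = local_search(perturb(best, rng))
        if cost(candidate) <= cost(best):   # keep the candidate if no worse
            best = candidate
    return best


plain = hill_descent(0)
boosted = iterated_local_search(0, cost, hill_descent, perturb, 50, random.Random(0))
print(plain, cost(plain), boosted, cost(boosted))
```

Plain local search from position 0 stops at the first local optimum it meets; with this acceptance rule the iterated version can never end at a worse cost than that starting local optimum.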

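Iterated local search: Recursive-Iterative-DCM3 [Figure: the same loop with Recursive-DCM3 as the perturbation: local search reaches a local optimum, Recursive-DCM3 is applied, and local search restarts from the output of Recursive-DCM3.]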

Comparison of DCMs for solving MP [Figure: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0.00 to 0.30), against time in hours (x-axis, 0 to 24), for TNT, DCM2, DCM3, Rec-DCM3, and Rec-I-DCM3.] • Rec-I-DCM3(TNT-ratchet) improves upon unboosted TNT-ratchet


I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration.
