
Barking Up the Wrong Treelength

Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow. IEEE/ACM TCBB 2009.


Presentation Transcript


  1. Barking Up the Wrong Treelength • Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow • IEEE/ACM TCBB 2009

  2. Minimizing Treelength • Generalized • Input: set S of sequences and a function f(s, s') for the edit distance between sequences s and s' • Output: A tree T, leaf-labelled by set S, with additional sequences labelling the internal nodes of T, so as to minimize treelength (total edit distance on the edges of the tree) • Fixed Tree variant
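
As a minimal sketch of the quantity being minimized, the snippet below computes the treelength of a tree whose internal nodes are already labelled, taking f to be unit-cost edit distance. The function names, the toy sequences, and the tree are illustrative assumptions, not taken from the slides or from POY. The optimization problem itself is harder: it has to choose the internal labels (and, in the generalized version, the tree).

```python
from typing import Callable, Dict, Iterable, Tuple


def unit_edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: every mismatch and every indel costs 1."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or mismatch
        prev = cur
    return prev[-1]


def treelength(edges: Iterable[Tuple[str, str]],
               labels: Dict[str, str],
               f: Callable[[str, str], int] = unit_edit_distance) -> int:
    """Sum of f over the edges of a tree whose nodes are all labelled."""
    return sum(f(labels[u], labels[v]) for u, v in edges)


# Hypothetical 4-leaf tree: leaves A-D, internal nodes X and Y.
labels = {"A": "ACGT", "B": "ACT", "C": "AGGT", "D": "AGT",
          "X": "ACGT", "Y": "AGGT"}
edges = [("X", "A"), ("X", "B"), ("X", "Y"), ("Y", "C"), ("Y", "D")]
print(treelength(edges, labels))
```

Passing a different `f` changes the treelength definition without touching the tree, which is the sense in which the problem is parameterized by the edit distance.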

  3. POY • POY (from the American Museum of Natural History, Ward Wheeler and colleagues) is the main software for minimizing treelength. • Minimizing treelength is also known as “Direct Optimization” • POY has passionate adherents who believe in treelength • POY has also been heavily criticized

  4. POY • Input: set S of sequences (unaligned), gap-open cost, gap-extend cost, and transition/transversion ratio • Default settings for gap-open and gap-extend in POY are “simple” (gap-open cost is 0) • POY can also be used to score a fixed input tree under the desired treelength definition.

  5. Ogden and Rosenberg 2007 • The Ogden and Rosenberg study compared POY 3.0 to MP(ClustalW) • Model conditions – mostly 16 taxa (some 64-taxon trees), K2P substitution model, short gaps (expected length 4) • Optimization Problem – Multiple edit distances, all based on simple gap penalties (gap-open cost is 0) • Performance metrics • Tree errors • Alignment errors • No mention of treelength • Result: MP(ClustalW) much more accurate than POY

  6. O&R concluded that Treelength is BAD! • The O&R simulation study showed that POY alignments were worse than ClustalW alignments more than 99% of the time, and that POY trees were less accurate than ClustalW trees on average. • “Therefore, traditional multiple sequence alignment approaches appear to vastly outperform direct optimization-like approaches in terms of alignment accuracy, at least for the data sets and parameter settings that have been examined thus far.” • Ogden and Rosenberg 2007

  7. Treelength is BAD! • “Although our data represents a fairly simple case, for data sets similar to these the traditional two-step approach will almost always give a more accurate alignment and will most likely recover equally or more accurate phylogenetic relationships than direct optimization as implemented in POY.” • Ogden and Rosenberg 2007

  8. Our question • Does minimizing treelength work poorly in general, or • is it minimizing treelength under simple gap penalties that works poorly?

  9. Gap penalties • Simple: a gap of length k costs k·C • Affine: a gap of length k costs C_open + k·C_extend • Other types of penalties are possible
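
The two penalty shapes can be sketched as follows. The function names and the parameter values in the example are illustrative assumptions. The point of the comparison is that a simple penalty cannot tell one long gap from several short ones, while an affine penalty charges extra for each separate gap event.

```python
def simple_gap_cost(k: int, c: int) -> int:
    """Simple penalty: a gap of length k costs k * c."""
    return k * c


def affine_gap_cost(k: int, c_open: int, c_extend: int) -> int:
    """Affine penalty: a gap of length k >= 1 costs c_open + k * c_extend."""
    return c_open + k * c_extend if k > 0 else 0


# One gap of length 4 versus four separate gaps of length 1
# (illustrative values: c = 1, c_open = 4, c_extend = 1).
one_long_simple = simple_gap_cost(4, 1)
four_short_simple = 4 * simple_gap_cost(1, 1)
one_long_affine = affine_gap_cost(4, 4, 1)
four_short_affine = 4 * affine_gap_cost(1, 4, 1)

print(one_long_simple, four_short_simple)  # same cost either way
print(one_long_affine, four_short_affine)  # the single long gap is cheaper
```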

  10. “Treelength not so bad!” (paraphrasing Liu et al. 2009) • Liu et al. 2009 show: • Treelength can be a good criterion, if based upon an affine gap penalty • We developed POY*: a version of POY which uses: • a particular affine gap penalty, • and a particular starting tree

  11. Our Study 2008 • Our study compares POY 4.0 to multiple methods • Model conditions – 25 and 100 taxa, GTR+Gamma for the substitution model, short and long gaps • Optimization Problem – Multiple edit distances, based upon both simple and affine gap penalties • Results • Tree error • Alignment error • Treelength

  12. Gap cost functions we studied • Simple1 – all mismatches and indels cost 1 • Simple2 – indels cost 2, transversions cost 2 and transitions cost 1 • Affine – gap of length k costs 4 + k, transversions cost 2, and transitions cost 1
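
To make the three cost functions concrete, here is a hedged sketch that scores one pairwise alignment under each of them. The costs come from the slide above; the function and dictionary names, the uppercase-ACGT-only assumption, and the toy alignment are illustrative and are not POY's implementation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Costs as listed on the slide; the dictionary layout is an assumption.
SCHEMES = {
    "simple1": {"gap_open": 0, "gap_extend": 1, "transition": 1, "transversion": 1},
    "simple2": {"gap_open": 0, "gap_extend": 2, "transition": 1, "transversion": 2},
    "affine":  {"gap_open": 4, "gap_extend": 1, "transition": 1, "transversion": 2},
}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)


def alignment_cost(row_a: str, row_b: str, scheme: str) -> int:
    """Cost of a gapped pairwise alignment ('-' marks a gap) under a scheme."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    p = SCHEMES[scheme]
    cost = 0
    prev_gap_row = None  # row that carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("all-gap column in a pairwise alignment")
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            if gap_row != prev_gap_row:
                cost += p["gap_open"]  # a new gap event starts here
            cost += p["gap_extend"]
            prev_gap_row = gap_row
        else:
            prev_gap_row = None
            if x != y:
                cost += p["transition"] if is_transition(x, y) else p["transversion"]
    return cost


# Toy alignment: one transition (C/T), one gap of length 2, one transversion (A/C).
for name in SCHEMES:
    print(name, alignment_cost("ACGTTA", "AT--TC", name))
```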

  13. Simulation Study Overview • Model trees • Birth-death • Deviation from ultrametricity • Sequence evolution • Estimation of trees and alignments • Statistics

  14. Simulation Study Overview • Model trees • Sequence evolution • GTR model of evolution from Tree of Life project • Gamma-distributed rates across sites • Gap model • Estimation of trees and alignments • Statistics

  15. Simulation Study Overview • Model trees • Sequence evolution • Estimation of trees and alignments • POY • POY* – POY with a particular starting tree (Probtree), using a particular affine gap penalty • Several two-phase methods (best alignments followed by MP and ML) • PS (POY-score) on various trees • Statistics

  16. Simulation Study Overview • Model trees • Sequence evolution • Estimation of trees and alignments • Statistics • Alignment error • Tree error • Treelength under each gap cost function
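
The slides do not say how tree error is measured. As an assumption, the sketch below uses one common choice, the missing branch rate: the fraction of non-trivial bipartitions of the true tree that are absent from the estimated tree. Trees are given as nested tuples of leaf names; the function names and the example trees are hypothetical.

```python
from typing import FrozenSet, Set


def bipartitions(tree, taxa: FrozenSet[str]) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    ref = min(taxa)  # each split is stored as the side that excludes this taxon
    splits: Set[FrozenSet[str]] = set()

    def leaves_below(node) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Keep only splits with at least two taxa on each side.
        if 1 < len(below) < len(taxa) - 1:
            splits.add(below if ref not in below else taxa - below)
        return below

    leaves_below(tree)
    return splits


def missing_branch_rate(true_tree, estimated_tree, taxa: FrozenSet[str]) -> float:
    """Fraction of true-tree bipartitions that the estimated tree lacks."""
    true_splits = bipartitions(true_tree, taxa)
    est_splits = bipartitions(estimated_tree, taxa)
    if not true_splits:
        return 0.0
    return len(true_splits - est_splits) / len(true_splits)


taxa = frozenset("ABCDE")
true_tree = (("A", "B"), ("C", ("D", "E")))
estimated_tree = (("A", "C"), ("B", ("D", "E")))
print(missing_branch_rate(true_tree, estimated_tree, taxa))
```

In this toy case the two trees share the split separating D and E from the rest but disagree on the other internal branch.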

  17. Simulation Study Model Conditions • 4 model conditions • 80 replicate datasets apiece • Different numbers of taxa allow us to explore taxonomic sampling effects

  18. Results – Alignment Errors • Simple vs. affine penalties • Note: the story changes for affine penalties, especially under the long gap event distribution

  19. Alignment Error: ClustalW vs. POY* • POY* is better than ClustalW over 50% of the time in (b), and 90% of the time in (a) • Compare with Ogden and Rosenberg, who find ClustalW better than POY 99.9% of the time

  20. Results – Alignment Errors • PS is POY used to estimate alignments on various trees • Note: PS produces worse alignments than ClustalW if simple gap cost functions are used, even if applied to the true tree

  21. Tree error • POY and POY* both use the same gap penalty (affine) • Results shown on 100-taxon short gap simulated datasets (results for other models similar)

  22. Tree error • POY and POY* both use the same gap penalty (affine) • Results shown on 100-taxon short gap simulated datasets (results for other models similar)

  23. Tree error • POY and POY* both use the same gap penalty (affine) • Results shown on 100-taxon short gap simulated datasets (results for other models similar)

  24. How well does POY solve its optimization problem? • We examine the treelength found by POY for various model conditions • We let treelength be defined by simple1, simple2, or affine • We compare treelengths found by POY to treelengths achievable in each model condition (as produced by scoring the true tree and other trees)

  25. Results – Simple Treelength Criteria

  26. Results – Affine Treelength Criterion

  27. Results - Treelengths • POY search finds short trees for simple gap penalties, but not for affine • Can we propose a better POY search for affine penalties? • POY*

  28. How well does POY solve its optimization problem? • Simple gap penalties: excellent performance • Affine gap penalties: poor performance • But POY* optimizes both well. The difference is just the starting tree.

  29. Is it a good idea to optimize treelength? • Simple gap penalties: NO! Worse trees and worse alignments. • Affine gap penalties: Let’s see.

  30. POY vs. POY* using affine gap

  31. Insights • Simple gap penalties were a main cause behind Ogden and Rosenberg's findings • Unable to obtain accurate POY alignments and trees under a simple treelength criterion • Using affine penalties, POY*: • Obtains alignments that are more accurate than ClustalW on 90% of long gap datasets, 75% of medium, 55% of short • Has tree accuracy that is comparable to the best two-phase method (ML on good alignments) • But has poorer alignments than the best alignment methods (e.g., Probtree)

  32. Conclusions • Distinguish between the optimization problem and the heuristic methods used to solve it • The treelength optimization criterion chosen has a significant impact on tree and alignment error • Alignments and trees from simple criteria aren't competitive with two-phase methods, and improving treelengths under simple criteria doesn't yield better trees • The story for affine criteria is still open • Can we find shorter trees than two-phase trees? • How accurate are such shorter trees?

  33. Questions?
