How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences. Tanya Y. Berger-Wolf Laboratory for High-Performance Algorithm Engineering and Computational Biology Dept. of Computer Science University of New Mexico www.compbio.unm.edu.
Phylogeny Reconstruction [Figure: phylogenetic tree over Orangutan, Gorilla, Chimpanzee, Human]
Phylogeny Reconstruction • Get an estimate of evolutionary distance between species • Treat the species as a set of points with pairwise distance measure • Find a tree that optimizes {parsimony, likelihood, function of your choice} on that set of points
Overview of My Research • Computational Phylogeny • Comparison of methods that combine trees (greed is bad) • Topological accuracy of maximum parsimony • Is optimal necessary? • How to know when “good enough”? • Online consensus and other statistics • Heterogeneous data in phylogeny • Controlled animal breeding strategies
Computational Pitfalls • Resulting optimization problems are hard • Existing heuristics expensive on large datasets • Same score – many topologies • True tree is unknown ⇓ When to stop and what to return?
Consensus Methods [Figure: two trees on leaves A–E combined into a single consensus tree] "Consensus is what many people say in chorus but do not believe as individuals." Abba Eban (1915 - 2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990
Consensus Methods: Strict, McMorris et al. (83) [Figure: three input trees on leaves A–E with clade sets {AB, CD, ABCD, ABCDE}, {AB, ABC, DE, ABCDE}, {BCD, ABCD, ABCDE}, and their strict consensus tree] Strict: contains clades common to all trees
Consensus Methods: Majority, Margush & McMorris (81), McMorris et al. (83), Barthelemy & McMorris (86) [Figure: three input trees on leaves A–E with clade sets {AB, CD, ABCD, ABCDE}, {AB, ABC, DE, ABCDE}, {BCD, ABCD, ABCDE}, and their majority consensus tree] Majority: contains clades common to a majority of the trees
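The strict and majority rules can be tried out on the three example trees from the slides. The following is a minimal sketch, assuming each tree is represented simply as the set of its clades and each clade as a string of leaf labels; the names `TREES`, `clade_counts`, `strict_consensus`, and `majority_consensus` are illustrative, not from the talk.

```python
from collections import Counter

# Clade sets of the three example trees on leaves A-E (as listed on the slides).
TREES = [
    {"AB", "CD", "ABCD", "ABCDE"},
    {"AB", "ABC", "DE", "ABCDE"},
    {"BCD", "ABCD", "ABCDE"},
]


def clade_counts(trees):
    """Count how many input trees contain each clade."""
    counts = Counter()
    for clades in trees:
        counts.update(clades)
    return counts


def strict_consensus(trees):
    """Clades present in every input tree."""
    return set.intersection(*trees)


def majority_consensus(trees):
    """Clades present in more than half of the input trees."""
    counts = clade_counts(trees)
    return {clade for clade, k in counts.items() if 2 * k > len(trees)}


print(sorted(strict_consensus(TREES)))    # ['ABCDE']
print(sorted(majority_consensus(TREES)))  # ['AB', 'ABCD', 'ABCDE']
```

On this input the strict consensus keeps only the root clade, while the majority consensus also keeps AB and ABCD, each of which appears in two of the three trees.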
Stopping Maximum Parsimony (joint work with T. Williams, B.M.E. Moret, U. Roshan, T. Warnow) If we return the majority consensus of the top-scoring trees, how early can we stop without changing the outcome? What stopping criteria should we use?
Experiment Design • Biological dataset (ATTCGGAAGCGATAGCTGAATCGATCGATCGTATTACGTTAGCTAGTATGCAGCGGAG) • Run parsimony ratchet (PAUP*): 500 iterations, 5 repetitions; save the tree at each iteration • Compare the majority consensus of the best and second-best trees so far with the majority consensus of the optimal trees (PAUP*) • Output consensus tree • Optimal: best-scoring trees in all repetitions
Online Consensus Input: T1, T2, …, Tk with n leaves, one at a time Output: majority consensus tree Mi of T1, …, Ti Solution: maintain a set of clades C with counters. When Ti arrives, need to consider only the clades in Ti and Mi-1, a total of 2n
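The counter-based update can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: trees are given as sets of clades, clades as strings of leaf labels, and the class name `OnlineMajority` is hypothetical. The update is valid because a clade that is in the new majority but not in the arriving tree must already have been in the previous majority.

```python
from collections import Counter


class OnlineMajority:
    """Maintain the majority consensus clade set as trees arrive one at a time."""

    def __init__(self):
        self.counts = Counter()   # clade -> number of trees seen that contain it
        self.seen = 0             # number of trees seen so far
        self.majority = set()     # clades of the current majority consensus

    def add_tree(self, clades):
        clades = set(clades)
        self.seen += 1
        self.counts.update(clades)
        # Only clades of the new tree or of the previous majority can qualify.
        candidates = clades | self.majority
        self.majority = {
            c for c in candidates if 2 * self.counts[c] > self.seen
        }
        return self.majority


stream = OnlineMajority()
for tree in (
    {"AB", "CD", "ABCD", "ABCDE"},
    {"AB", "ABC", "DE", "ABCDE"},
    {"BCD", "ABCD", "ABCDE"},
):
    print(sorted(stream.add_tree(tree)))
```

Note that ABCD drops out of the majority after the second tree and re-enters after the third; it is reconsidered only because it belongs to the arriving tree, which is why the counters are kept for all clades seen so far.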
Conclusions and Future • Evidence that can stop parsimony search early • Need simulations and more data to verify • Collect other (than consensus) statistics • Other stopping criteria • Different representation of final sets of trees • Other methods
Wait! There is more! Part II: Heterogeneous Data (joint work with Tandy Warnow)
Heterogeneous Data Molecular data: DNA and genomes
Heterogeneous Data Paleontological, morphological, geographical, historical data
Data As Constraints Constraints, not distance! • Positive: these species are together (phylogenetic trees, presence of a morphological character) • Negative: these species are not together (above + geography, fossils) • Temporal: these events happened in this order (fossils, history) • Frequency: this event happens more often than another (adaptation mechanisms)
Consensus Methods: Greedy [Figure: three input trees on leaves A–E with clade sets {AB, CD, ABCD, ABCDE}, {AB, ABC, DE, ABCDE}, {BCD, ABCD, ABCDE}, and their greedy consensus tree] Greedy: resolves majority by adding compatible clades
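A greedy consensus on the same three example trees can be sketched as below. Assumptions are labeled in the code: clades are strings of leaf labels, two clades are compatible when they are disjoint or nested, and ties between equally frequent clades are broken alphabetically, which is an arbitrary choice made only for this illustration.

```python
from collections import Counter


def compatible(a, b):
    """Two clades can share a tree if they are disjoint or one contains the other."""
    a, b = set(a), set(b)
    return a.isdisjoint(b) or a <= b or b <= a


def greedy_consensus(trees):
    """Add clades in order of decreasing frequency, skipping incompatible ones."""
    counts = Counter()
    for clades in trees:
        counts.update(clades)
    # Assumption: alphabetical tie-breaking among clades of equal frequency.
    order = sorted(counts, key=lambda c: (-counts[c], c))
    chosen = []
    for clade in order:
        if all(compatible(clade, kept) for kept in chosen):
            chosen.append(clade)
    return chosen


trees = [
    {"AB", "CD", "ABCD", "ABCDE"},
    {"AB", "ABC", "DE", "ABCDE"},
    {"BCD", "ABCD", "ABCDE"},
]
print(greedy_consensus(trees))  # ['ABCDE', 'AB', 'ABCD', 'ABC']
```

The majority clades (ABCDE, AB, ABCD) are always taken first; which single-occurrence clade then resolves the tree depends entirely on the tie-breaking order, so a different order would pick CD instead of ABC.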
Consensus Methods: AMT, Phillips & Warnow (95) [Figure: three input trees on leaves A–E with clade sets {AB, CD, ABCD, ABCDE}, {AB, ABC, DE, ABCDE}, {BCD, ABCD, ABCDE}, and candidate collections of compatible clades] Asymmetric Median Tree: maximum (weighted) collection of compatible clades
Consensus of Positive Constraints Formalize each constraint, go through the existing consensus methods, and check whether each method satisfies the constraint or can be extended to satisfy it. Partially from Steel et al. 2000
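For a handful of clades, a maximum weighted collection of compatible clades can be found by exhaustive search. The sketch below assumes that each clade's weight is the number of input trees containing it and that clades are strings of leaf labels; it is a brute-force illustration for tiny inputs only, not the method of Phillips & Warnow, and the function name `max_weight_compatible` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def compatible(a, b):
    """Two clades can share a tree if they are disjoint or one contains the other."""
    a, b = set(a), set(b)
    return a.isdisjoint(b) or a <= b or b <= a


def max_weight_compatible(trees):
    """Try every subset of clades; keep the heaviest pairwise-compatible one."""
    counts = Counter()
    for clades in trees:
        counts.update(clades)   # assumed weight: number of trees with the clade
    clades = sorted(counts)
    best, best_weight = (), 0
    for size in range(1, len(clades) + 1):
        for subset in combinations(clades, size):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                weight = sum(counts[c] for c in subset)
                if weight > best_weight:
                    best, best_weight = subset, weight
    return set(best), best_weight


trees = [
    {"AB", "CD", "ABCD", "ABCDE"},
    {"AB", "ABC", "DE", "ABCDE"},
    {"BCD", "ABCD", "ABCDE"},
]
best, weight = max_weight_compatible(trees)
print(sorted(best), weight)
```

The search is exponential in the number of distinct clades, so it only serves to make the definition concrete; on this input several collections tie for the maximum weight, and the one returned depends on enumeration order.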
Consensus of Negative Constraints • a and b are separated by C • C is closer to a than b – same as positive
Conclusions and Future (Part II) • Existing methods are insufficient • (Consensus with respect to temporal, frequency constraints) • Developing new methods that preserve 4 types of constraints • Network phylogeny • Error measure and evaluation of quality
Even Bigger Future • Phylogeny • Getting good reconstructions fast • Heterogeneous data • Network phylogeny • Population biology • Discrete methods for small populations (esp. conservation) • Epidemiology • Flu SIR model, combining data • Vaccination strategies
Thank you Work is supported by the National Science Foundation postdoctoral fellowship grant EIA 02-03584
Controlled Breeding (joint work with Cris Moore and Jared Saia) Given an initial population of animals, design a mating strategy that achieves a breeding goal (within the shortest time)
Controlled Breeding: Background • Conservation Biology and Agriculture • Breeding strategies: designed and evaluated empirically or using stochastic time-step modeling • Empirical evaluation – too slow! • Stochastic modeling – mathematically and biologically inappropriate. • Classic algorithm design problem
Breeding All Possible Animals Given k binary strings of length n, design an algorithm that • Produces all possible strings • With the smallest expected # matings Greedy: mate the two animals with the highest probability of producing a new string Upper bound: 2.32 · 2^n
Breeding a Target Animal Given k strings of length n, design an algorithm that • Produces a target string • With the smallest expected # matings Alg 1: breed for one trait at a time, O(n lg n) Alg 2: breed the animals closest to the target, O(n^2)
Algorithm: One Trait at a Time
AddOneTrait(11…100…0, 00…010…0)
  x = 11…100…0
  y = 00…010…0
  While (y has < i+1 ones) do
    Mate x and y twice
    y = string with 1 in bit (i+1)
  Return y
The Algorithm (e1, e2, …, en)
  x = e1
  For i = 2..n do
    x = AddOneTrait(x, ei)
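The one-trait-at-a-time strategy can be simulated with a short sketch. The slides do not specify the inheritance model, so this example assumes uniform crossover (each bit of a child is copied from either parent with equal probability) and assumes that, of the two offspring, the one with the most ones among those carrying the new trait replaces y. Bits are 0-indexed here, and `mate`, `add_one_trait`, and `breed_all_ones` are illustrative names.

```python
import random


def mate(x, y, rng):
    """Assumed model: each bit of the child comes from either parent, 50/50."""
    return tuple(rng.choice(pair) for pair in zip(x, y))


def add_one_trait(x, y, i, rng):
    """x has ones in bits 0..i-1, y has a one in bit i (0-indexed).

    Returns a string with ones in bits 0..i and the number of matings used.
    """
    matings = 0
    while sum(y) < i + 1:
        children = [mate(x, y, rng) for _ in range(2)]  # mate x and y twice
        matings += 2
        carriers = [c for c in children if c[i] == 1]   # kept the new trait
        if carriers:
            y = max(carriers, key=sum)
    return y, matings


def breed_all_ones(n, rng):
    """Start from unit strings e1..en and add one trait at a time."""
    units = [tuple(int(j == i) for j in range(n)) for i in range(n)]
    x, total = units[0], 0
    for i in range(1, n):
        x, used = add_one_trait(x, units[i], i, rng)
        total += used
    return x, total


animal, matings = breed_all_ones(8, random.Random(0))
print(animal, matings)
```

Under this assumed model a child that carries the new trait never loses a one that y already had, so y improves monotonically and each call finishes after a small number of rounds in expectation.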
More Realistic Breeding • Gender • Variable probability of outcome • Deaths • Minimize number of generations • Goal: maximum diversity • On-line: maintain the distribution