Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance

Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance. Tandy Warnow Department of Computer Sciences University of Texas at Austin. The real title:. Phylogeny Estimation: Why it is “Hard” but not how to design methods with good performance -

Presentation Transcript


  1. Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas at Austin

  2. The real title: Phylogeny Estimation: Why it is “Hard” but not how to design methods with good performance - talk to me separately about this, no time in this lecture!

  3. This talk • Intro to phylogenetic estimation (using some terms to be defined later: polynomial time and NP-hard) • Computational problems and what it means to solve them exactly • Computational problems, and what it means to “solve them” heuristically

  4. Phylogeny (evolutionary tree) [Figure: tree with leaves Orangutan, Human, Gorilla, and Chimpanzee] From the Tree of Life Website, University of Arizona

  5. Evolutionary History • Helps us • predict gene function • develop drugs and vaccines • understand disease spread • understand human origins • Phylogenetics: estimating evolutionary histories [Figure: Tree of Life, from the AToL website]

  6. DNA Sequence Evolution [Figure: a tree of sequences over time. -3 mil yrs: AAGACTT. -2 mil yrs: AAGGCCT, TGGACTT. -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT. Today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT]

  7. Phylogeny Problem • U = AGGGCAT • V = TAGCCCA • W = TAGACTT • X = TGCACAA • Y = TGCGCTT [Figure: tree with leaves X, U, Y, V, W]

  8. How can we infer evolution?

  9. Two types of phylogenetic reconstruction methods • Heuristics for hard optimization problems (Maximum Parsimony and Maximum Likelihood) • Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc. [Figure: cost plotted over phylogenetic trees, showing a local optimum and the global optimum]

  10. Maximum Parsimony • Input: Set S of n aligned sequences of length k • Output: • A phylogenetic tree T leaf-labeled by sequences in S • additional sequences of length k labeling the internal nodes of T such that the total number of changes is minimized

  11. Maximum parsimony (example) • Input: Four sequences • ACT • ACA • GTT • GTA • Question: which of the three possible trees has the best MP score?

  12. Maximum Parsimony [Figure: the three possible trees on the leaves ACT, ACA, GTT, and GTA]

  13. Maximum Parsimony [Figure: the three trees with labeled internal nodes and per-edge change counts; MP score = 7, MP score = 5, and MP score = 4 (the optimal MP tree)]

  14. Maximum Parsimony: computational complexity • Optimal labeling can be computed in linear time O(nk) • But how do we find the best tree? [Figure: the optimal MP tree, with internal nodes labeled ACA and GTA and edge costs 2, 1, 1; MP score = 4]
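
The slide does not name the algorithm behind the O(nk) bound. As a minimal sketch, the code below uses Fitch's small-parsimony algorithm, one standard way to score a fixed tree, and assumes a constant-size alphabet. The names `fitch_site` and `parsimony_score` and the nested-tuple tree encoding are illustrative assumptions, not part of the talk. The function returns the minimum number of changes over all internal labelings of the given tree.

```python
def fitch_site(tree, site):
    """Fitch's bottom-up pass for one site.

    `tree` is either a leaf (its sequence, a str) or a pair (left, right).
    Returns (set of candidate states at this node, changes in the subtree).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra change at this node.
        return common, left_cost + right_cost
    # The children disagree: one change is needed on one of the two edges.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, k):
    """Minimum total number of changes over all k sites."""
    return sum(fitch_site(tree, site)[1] for site in range(k))


# The optimal tree from the example, rooted on its internal edge.
optimal_tree = (("ACT", "ACA"), ("GTT", "GTA"))
print(parsimony_score(optimal_tree, k=3))  # 4, matching "MP score = 4"
```

Each site is handled by one pass over the tree, so the work is proportional to the number of nodes times the number of sites.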

  15. Exhaustive Search For every tree on the set of sequences, do: • Score the tree (compute optimal sequences for each internal node, and record the score) • Keep track of the tree with the best score How expensive is this?

  17. Don’t try “exhaustive search” • Number of (unrooted) binary trees on n leaves is (2n-5)!! = (2n-5)x(2n-7)x(2n-9)x…x3 • For 1000 taxa that is 1995!!, roughly 2 x 10^2860 trees • Even if each tree on 1000 taxa could be analyzed in 0.001 seconds, finding the best tree would take more than 10^2846 millennia
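
To get a feel for how fast (2n-5)!! grows, here is a small sketch that evaluates the formula with Python's exact integers. The function name `num_unrooted_binary_trees` is an illustrative assumption. The 0.001 seconds per tree comes from the slide, and a year is taken as 365.25 days.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 3 x 5 x ... x (2n-5), the number of unrooted binary
    trees on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))

# Exhaustive search on 1000 taxa at 0.001 seconds per tree.
trees = num_unrooted_binary_trees(1000)
seconds_per_millennium = 1000 * 31_557_600  # 1000 years of 365.25 days
millennia = trees // (1000 * seconds_per_millennium)
print("decimal digits in the tree count:", len(str(trees)))
print("decimal digits in the millennia needed:", len(str(millennia)))
```

Integer arithmetic is used throughout because the tree count for 1000 taxa is far too large for a floating-point number.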

  18. Maximum Parsimony: computational complexity • Optimal labeling can be computed in linear time O(nk) • Finding the optimal MP tree is NP-hard [Figure: the optimal MP tree, with internal nodes labeled ACA and GTA and edge costs 2, 1, 1; MP score = 4]

  19. NP-hard(ness) • What does this mean? • What are the consequences for a problem being NP-hard? • What kind of methods are used to “solve” NP-hard problems? • How should you interpret the output of a software program, when the problem is NP-hard?

  20. “Real” problem: your brother’s birthday party • Your brother is turning 10 and you need to arrange his birthday party • He wants all his friends to come • But some of them hate each other Your objective: have as few parties as you can, but invite everyone to at least one party (while not having people who hate each other at the same party)

  21. Your brother’s party • Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben • Sally and Alice hate each other, also Henry and Sally, Henry and Tommy, Alice and Jimmy, Ben and Sally, and Ben and Henry.

  22. Graph representation of your brother’s friends Graph has vertices and edges • Vertices = your brother’s friends • Edges between vertices indicate they hate each other

  24. Coloring vertices to assign friends to parties • Given graph G with vertices and edges • Assign colors to the vertices so that no edge connects vertices of the same color, using a minimum number of colors • Vertices = your brother’s friends • Edges between vertices indicate they hate each other • Colors = parties

  25. Assigning friends to parties: graph coloring! • Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben • We can’t do this with two parties. Why? • What about three?

  26. Your brother’s parties Solution: three parties! • Sally, Tommy, and Jimmy • Henry and Alice • Ben
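
The party problem is small enough to check by brute force. The sketch below encodes the friends and the "hate" pairs from the slides as a graph and tries 1, 2, 3, ... parties until a legal assignment exists. The names `FRIENDS`, `HATES`, `is_legal`, and `min_parties` are illustrative assumptions.

```python
from itertools import product

FRIENDS = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
HATES = [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]


def is_legal(party_of, hates):
    """True if no two people who hate each other share a party."""
    return all(party_of[a] != party_of[b] for a, b in hates)


def min_parties(friends, hates):
    """Try k = 1, 2, ... parties and return the first k that works,
    together with one legal assignment (exhaustive search)."""
    for k in range(1, len(friends) + 1):
        for assignment in product(range(k), repeat=len(friends)):
            party_of = dict(zip(friends, assignment))
            if is_legal(party_of, hates):
                return k, party_of
    raise ValueError("no friends given")


k, party_of = min_parties(FRIENDS, HATES)
print(k)  # 3

# The solution from the slide is also legal.
slide_solution = {"Sally": 0, "Tommy": 0, "Jimmy": 0,
                  "Henry": 1, "Alice": 1, "Ben": 2}
print(is_legal(slide_solution, HATES))  # True
```

Two parties cannot work because Sally, Henry, and Ben all hate one another, so each of the three needs a different party.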

  27. What is the minimum number of colors that a graph needs? Remember: no edge between vertices of the same color!

  28. A graph that needs 3 colors

  29. 2-colored graph

  30. A computational problem • 2-colorability: • Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

  31. Can we 2-color this graph?

  32. Can we 2-color these graphs?

  33. Solving 2-colorability • 2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. • Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
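
The greedy procedure can be sketched as a breadth-first traversal. This is a minimal illustration, assuming vertices are numbered 0 to n-1 and edges are given as pairs. The function name `two_color` is hypothetical. The traversal restarts in every connected component, a detail the slide leaves implicit.

```python
from collections import deque


def two_color(num_vertices, edges):
    """Return a list of colors ("red"/"blue") if the graph is
    2-colorable, or None if it is not. Runs in O(n + m) time."""
    adjacent = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)

    color = [None] * num_vertices
    for start in range(num_vertices):
        if color[start] is not None:
            continue  # already colored as part of an earlier component
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacent[u]:
                if color[v] is None:
                    # Give each uncolored neighbor the opposite color.
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # two adjacent vertices share a color
    return color


square = [(0, 1), (1, 2), (2, 3), (3, 0)]
triangle = [(0, 1), (1, 2), (2, 0)]
print(two_color(4, square))    # ['red', 'blue', 'red', 'blue']
print(two_color(3, triangle))  # None
```

Every vertex enters the queue once and every edge is examined twice, which gives the O(n+m) bound.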

  36. What about this? • 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

  37. A 3-colored graph

  38. Can you 3-color these graphs?

  39. How about this graph?

  40. Testing 3-colorability • 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. The “greedy algorithm” will work correctly in some, but not all, cases.

  41. Exhaustive search for 3-colorability • Look at all possible vertex colorings • See if any is “legal” (no edge between vertices of the same color) Problem: there are 3^n vertex colorings of a graph on n vertices Question to students: how many vertex colorings are there for a graph with 10 vertices? 20 vertices? 100 vertices?
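
A minimal sketch of the exhaustive search follows, assuming vertices numbered 0 to n-1. The function name `three_colorable` is illustrative. It enumerates all 3^n colorings, so it is only practical for very small graphs: 3^10 = 59,049, 3^20 = 3,486,784,401, and 3^100 is about 5 x 10^47.

```python
from itertools import product

COLORS = ("red", "blue", "green")


def three_colorable(num_vertices, edges):
    """Check every one of the 3^n colorings and report whether
    at least one is legal (no edge joins two same-colored vertices)."""
    for coloring in product(COLORS, repeat=num_vertices):
        if all(coloring[u] != coloring[v] for u, v in edges):
            return True
    return False


triangle = [(0, 1), (1, 2), (2, 0)]
# Four vertices that are all adjacent to one another.
four_clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(three_colorable(3, triangle))     # True
print(three_colorable(4, four_clique))  # False

for n in (10, 20, 100):
    print(n, 3 ** n)
```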

  43. What about this? • 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. • This problem is NP-hard. • What does this mean?

  44. Some decision problems can be solved in polynomial time: • Can graph G be 2-colored? • Does graph G have a 3-clique (three vertices that are all adjacent)? • Some problems do not seem to be solvable in polynomial time: • Can graph G be 3-colored? • What is the size of the largest clique in the graph G?

  45. P vs. NP, continued • The “big” question in theoretical computer science is: • Is it possible to solve an NP-hard problem in polynomial time? • If the answer is “yes”, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

  46. Minimum coloring • Since 3-colorability is NP-hard, finding the minimum number of colors for a graph is NP-hard. • That means the problem will be very hard on some graphs, even if others can be easy. • So if your brother has a lot of friends, arranging the minimum number of parties could take you a very, very, very long time. • So forget solving this problem exactly!

  47. Solving NP-hard optimization problems (like min coloring) Options: • Solve the problem exactly (but use lots of time on some inputs) • Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

  48. Phylogeny estimation is NP-hard, so • Most methods that are used for maximum parsimony (or maximum likelihood) are heuristics that are not guaranteed to solve the problems exactly. • Even the best methods can take a very long time (months or more) on some inputs, without being guaranteed to solve their problems well. • You do not know how poor the solution is.

  49. Hill-climbing for phylogeny estimation • Start with some tree and score it • Repeat: change the tree slightly, and see if the new tree has a better score • Until no neighbor of your best tree has a better score (i.e., stop at a local optimum) • Return the best tree you found
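
The loop can be sketched generically. The version below is a minimal illustration, not a phylogenetics implementation. It assumes a `neighbors` function and a `score` function (lower is better, as with MP scores) and always moves to the best neighbor. In place of tree space it uses a toy one-dimensional cost landscape, `COSTS`, an invented example that shows how the search can stop at a local optimum.

```python
def hill_climb(start, neighbors, score):
    """Best-improvement hill-climbing; lower scores are better.

    Returns (solution, score) once no neighbor improves on the
    current solution, i.e. at a local optimum.
    """
    best, best_score = start, score(start)
    while True:
        candidates = [(score(n), n) for n in neighbors(best)]
        if not candidates:
            return best, best_score
        cand_score, cand = min(candidates, key=lambda pair: pair[0])
        if cand_score >= best_score:
            return best, best_score  # no neighbor is better: stop
        best, best_score = cand, cand_score


# Toy landscape: positions 0..6; the global optimum is position 5 (cost 1).
COSTS = [5, 3, 4, 6, 2, 1, 7]


def toy_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(COSTS)]


def toy_score(i):
    return COSTS[i]


print(hill_climb(0, toy_neighbors, toy_score))  # (1, 3): a local optimum
print(hill_climb(3, toy_neighbors, toy_score))  # (5, 1): the global optimum
```

The answer depends on the starting point, which is why the output of a heuristic search should not be read as the guaranteed best tree.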

  50. Exploring “tree space” • Tree space move: • Short range move: • Nearest neighbor interchange (NNI): swap two subtrees on the two sides of an internal edge. • Long range move: • Bisection and reconnection: cut the tree into two subtrees along an edge, and then rejoin the two subtrees to become a different tree.
