
Class 9: Phylogenetic Trees


Donna

Presentation Transcript


  1. Class 9: Phylogenetic Trees

  2. The Tree of Life D’après Ernst Haeckel, 1891

  3. Evolution • Many theories of evolution • Basic idea: • speciation events lead to creation of different species • Speciation caused by physical separation into groups where different genetic variants become dominant • Any two species share a (possibly distant) common ancestor

  4. Phylogenies • A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species • Leaves - current-day species • Nodes - hypothetical most recent common ancestors • Edge lengths - “time” from one speciation to the next (Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant)

  5. Phylogenetic Tree • Topology: bifurcating • Leaves - 1…N • Internal nodes - N+1…2N-2 (Figure labels: branch, internal node, leaf)

  6. Example: Primate evolution (Figure: primate tree with divergence times marked at 20-25 mya, 35-37 mya, and 40-45 mya)

  7. How to construct a Phylogeny? • Until mid 1950’s phylogenies were constructed by experts based on their opinion (subjective criteria) • Since then, focus on objective criteria for constructing phylogenetic trees • Thousands of articles in the last decades • Important for many aspects of biology • Classification (systematics) • Understanding biological mechanisms

  8. Morphological vs. Molecular • Classical phylogenetic analysis: morphological features • number of legs, lengths of legs, etc. • Modern biological methods allow the use of molecular features • Gene sequences • Protein sequences • Analysis based on homologous sequences (e.g., globins) in different species

  9. Dangers in Molecular Phylogenies • We have to remember that gene/protein sequences can be homologous for different reasons: • Orthologs -- sequences diverged after a speciation event • Paralogs -- sequences diverged after a duplication event • Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

  10. Dangers of Paralogs (Figure labels: Gene Duplication; Speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B)

  11. Dangers of Paralogs • If we only consider 1A, 2B, and 3A... (Figure labels: Gene Duplication; Speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B)

  12. Types of Trees • A natural model to consider is that of rooted trees (Figure label: Common Ancestor)

  13. Types of Trees • Depending on the model, data from current-day species does not distinguish between different placements of the root (Figure: two rooted trees shown side by side)

  14. Types of Trees • An unrooted tree represents the same phylogeny without the root node

  15. Positioning Roots in Unrooted Trees • We can estimate the position of the root by introducing an outgroup: • a set of species that are definitely distant from all the species of interest (Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant with outgroup Falcon; the proposed root is marked)

  16. Types of Data • Distance-based • Input is a matrix of distances between species • Can be the fraction of residues on which they disagree, or the negative of the alignment score between them, or … • Character-based • Examine each character (e.g., residue) separately
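
As a minimal sketch of the first distance option (fraction of disagreeing residues), the snippet below computes pairwise distances for the five toy sequences that appear later in the parsimony example. The function names `p_distance` and `distance_matrix` are illustrative, and the sequences are assumed to be already aligned and gap-free.

```python
def p_distance(s, t):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def distance_matrix(seqs):
    """Map every ordered pair of names to its p-distance."""
    return {(x, y): p_distance(seqs[x], seqs[y]) for x in seqs for y in seqs}


# Toy sequences from the "A Simple Example" slides.
seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
d = distance_matrix(seqs)
```

Here `d[("A", "C")]` is 1/6, because the two sequences differ only at the second position.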

  17. Simple Distance-Based Method Input: distance matrix between species Outline: • Cluster species together • Initially clusters are singletons • At each iteration combine two “closest” clusters to get a new one

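  18. UPGMA Clustering • Let $C_i$ and $C_j$ be clusters; define the distance between them to be the average pairwise distance, $d(C_i,C_j) = \frac{1}{|C_i|\,|C_j|}\sum_{p \in C_i}\sum_{q \in C_j} d(p,q)$ • When combining two clusters, $C_i$ and $C_j$, to form a new cluster $C_k$, then for every other cluster $C_l$: $d(C_k,C_l) = \frac{|C_i|\,d(C_i,C_l) + |C_j|\,d(C_j,C_l)}{|C_i| + |C_j|}$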
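
The clustering loop can be sketched as follows. This is an illustrative implementation, not the lecture's own code: it assumes the standard UPGMA definitions (average pairwise distance between clusters, size-weighted update after a merge, node height equal to half the merged distance), and the input format (a dict keyed by `frozenset` pairs) is a choice made here for brevity.

```python
from itertools import combinations


def upgma(names, dist):
    """names: leaf labels; dist: dict mapping frozenset({x, y}) -> distance.
    Returns a nested tuple (left, right, height); leaves are plain labels."""
    clusters = {n: (n, 1) for n in names}  # cluster id -> (subtree, size)
    d = dict(dist)                         # working copy of the distances
    next_id = 0
    while len(clusters) > 1:
        # Pick the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        ta, na = clusters.pop(a)
        tb, nb = clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        new = ("node", next_id)
        next_id += 1
        # Size-weighted average keeps d equal to the mean pairwise leaf distance.
        for c in clusters:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[new] = ((ta, tb, height), na + nb)
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = upgma(names, dist)
```

On this clock-like toy matrix the merges happen at heights 1, 2, and 3, joining A with B first, then C, then D.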

  19. Molecular Clock • UPGMA implicitly assumes that all distances measure time in the same way (Figure: two trees over leaves 1-4)

  20. Additivity • A weaker requirement is additivity • In the “real” tree, the distance between two species is the sum of the branch lengths along the path connecting them (Figure labels: leaves i, j, k; branch lengths a, b, c)

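  21. Consequences of Additivity • Suppose input distances are additive • For any three leaves i, j, k joined at an internal node m by branches of length a, b, c: $d(i,j) = a + b$, $d(i,k) = a + c$, $d(j,k) = b + c$ • Thus $d(k,m) = c = \tfrac{1}{2}\big(d(i,k) + d(j,k) - d(i,j)\big)$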
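
A small numeric check of this consequence, assuming leaves i, j, k hang off a common internal node m by branches of length a, b, c (the labelling suggested by the figure residue). The function name `three_leaf_branches` is made up for illustration.

```python
def three_leaf_branches(d_ij, d_ik, d_jk):
    """Recover branch lengths (a, b, c) of leaves i, j, k from additive distances.

    Assumes d_ij = a + b, d_ik = a + c, d_jk = b + c.
    """
    a = (d_ij + d_ik - d_jk) / 2
    b = (d_ij + d_jk - d_ik) / 2
    c = (d_ik + d_jk - d_ij) / 2
    return a, b, c


# Branches a=1, b=2, c=3 give distances 3, 4, 5.
branches = three_leaf_branches(3, 4, 5)
```

The third value, c, is the distance from k to the internal node m, which is the quantity neighbor joining needs when it replaces a pair of leaves by their parent.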

  22. Neighbor Joining • Can we use this fact to construct trees? • Let $D(i,j) = d(i,j) - (r_i + r_j)$, where $r_i = \frac{1}{|L|-2}\sum_{k \in L} d(i,k)$ • Theorem: if $D(i,j)$ is minimal (among all pairs of leaves), then i and j are neighbors in the tree

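  23. Neighbor Joining • Set L to contain all leaves • Iteration: • Choose i, j such that $D(i,j)$ is minimal • Create new node k, and set $d(k,m) = \tfrac{1}{2}\big(d(i,m) + d(j,m) - d(i,j)\big)$ for every other m in L • Remove i, j from L, and add k • Terminate: when |L| = 2, connect the two remaining nodes (Figure labels: i, j, k, m)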
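
The iteration above can be sketched as below. This follows the common textbook formulation, which is an assumption here since the slide's formulas did not survive: $D(i,j) = d(i,j) - r_i - r_j$ with $r_i$ the sum of distances from i divided by $|L| - 2$, and branch lengths $d(i,k) = \tfrac{1}{2}(d(i,j) + r_i - r_j)$. Internal node names `n0`, `n1`, … are invented and must not clash with leaf names.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """dist: dict keyed by frozenset({x, y}). Returns a list of edges (u, v, length)."""
    L = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(L) > 2:
        n = len(L)
        # r[i]: corrected average distance from i to the other active nodes.
        r = {i: sum(d[frozenset((i, m))] for m in L if m != i) / (n - 2) for i in L}
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = f"n{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        dij = d[frozenset((i, j))]
        len_i = (dij + r[i] - r[j]) / 2
        edges.append((i, k, len_i))
        edges.append((j, k, dij - len_i))
        L = [m for m in L if m not in (i, j)]
        for m in L:
            d[frozenset((k, m))] = (
                d[frozenset((i, m))] + d[frozenset((j, m))] - dij
            ) / 2
        L.append(k)
    u, v = L
    edges.append((u, v, d[frozenset((u, v))]))
    return edges


# Additive but not clock-like: A and B hang off one internal node with
# branches 1 and 3, C and D off another with branches 2 and 2, and the
# two internal nodes are 1 apart.
pairs = {("A", "B"): 4, ("A", "C"): 4, ("A", "D"): 4,
         ("B", "C"): 6, ("B", "D"): 6, ("C", "D"): 4}
dist = {frozenset(p): v for p, v in pairs.items()}
edges = neighbor_joining(["A", "B", "C", "D"], dist)
```

Because the toy distances are additive, the recovered branch lengths match the tree that generated them, even though B sits on a much longer branch than A.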

  24. Distance Based Methods • If we make strong assumptions on distances, we can reconstruct trees • In real life, distances are not additive • Sometimes they are close to additive

  25. Character Based Methods • We start with a multiple alignment • Assumptions: • All sequences are homologous • Each position in alignment is homologous • Positions evolve independently • No gaps • We seek to explain the evolution of each position in the alignment

  26. Parsimony • Character-based method • A way to score trees (but not to build trees!) Assumptions: • Independence of characters (no interactions) • Best tree is one where minimal changes take place

  27. A Simple Example • What is the parsimony score of the tree with leaves Aardvark, Bison, Chimp, Dog, Elephant, given A: CAGGTA B: CAGACA C: CGGGTA D: TGCACT E: TGCGTA

  28. A Simple Example A: CAGGTA B: CAGACA C: CGGGTA D: TGCACT E: TGCGTA • Each column is scored separately. • Let’s look at the first column (C, C, C, T, T): • Minimal tree has one evolutionary change, between T and C (Figure: tree labeled with the first-column characters at leaves and ancestors)

  29. Evaluating Parsimony Scores • How do we compute the Parsimony score for a given tree? • Traditional Parsimony • Each base change has a cost of 1 • Weighted Parsimony • Each change is weighted by the score c(a,b)

  30. Traditional Parsimony • Solved independently for each position • Linear time solution (Figure: small tree with leaves a, g, a; internal nodes carry the candidate sets {a,g} and {a})
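
One linear-time way to score a single position is the set-based bottom-up pass usually attributed to Fitch; the slide does not name the method, so treat that attribution as an assumption. The sketch represents a tree as nested 2-tuples of leaf names and uses site 4 of the four-species alignment from the Maximum Parsimony slides.

```python
def fitch(tree, states):
    """tree: nested 2-tuples of leaf names; states: leaf name -> character.
    Returns (candidate state set at this node, number of changes below it)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Site 4 of the four-species alignment: 1 - G, 2 - A, 3 - A, 4 - G.
site4 = {1: "G", 2: "A", 3: "A", 4: "G"}
topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
scores = [fitch(t, site4)[1] for t in topologies]
```

The three topologies score 2, 2, and 1 at this site, matching the per-site counts given on the later slides. Rooting each unrooted tree on its internal edge does not change the count.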

  31. Evaluating Weighted Parsimony • Dynamic programming on the tree • $S(i,a)$ = cost of the subtree rooted at i if i is labeled by a • Initialization: for each leaf i, set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$ • Iteration: if k is a node with children i and j, then $S(k,a) = \min_b \big(S(i,b) + c(a,b)\big) + \min_b \big(S(j,b) + c(a,b)\big)$ • Termination: the cost of the tree is $\min_a S(r,a)$, where r is the root
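
The recurrence translates almost line for line into a recursive function. In this sketch the tree encoding (nested 2-tuples), the function names, and the two example cost functions are illustrative choices; only the recurrence itself comes from the slide.

```python
import math


def subtree_costs(tree, leaf_state, alphabet, cost):
    """Return {a: S(node, a)} for the node at the top of `tree`."""
    if not isinstance(tree, tuple):
        # Leaf: cost 0 for its observed label, infinity otherwise.
        return {a: (0 if a == leaf_state[tree] else math.inf) for a in alphabet}
    left, right = tree
    s_left = subtree_costs(left, leaf_state, alphabet, cost)
    s_right = subtree_costs(right, leaf_state, alphabet, cost)
    return {
        a: min(s_left[b] + cost(a, b) for b in alphabet)
        + min(s_right[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony(tree, leaf_state, alphabet, cost):
    """Cost of the tree: the minimum of S(r, a) over labels a at the root r."""
    return min(subtree_costs(tree, leaf_state, alphabet, cost).values())


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Example weighting (an assumption): transitions cost 1, transversions 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2


site2 = {1: "G", 2: "C", 3: "T", 4: "A"}
tree = ((1, 2), (3, 4))
unit_score = weighted_parsimony(tree, site2, "ACGT", unit_cost)
weighted_score = weighted_parsimony(tree, site2, "ACGT", ts_tv_cost)
```

With unit costs the function reproduces the traditional parsimony score (3 for this site). With the transition/transversion weighting the same site costs 5: one transition and two transversions.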

  32. Cost of Evaluating Parsimony • Score is evaluated on each position independently. Scores are then summed over all positions. • If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk) for traditional parsimony; the weighted recurrence takes O(nmk²), since each of the k labels at a node minimizes over k labels of each child • By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node

  33. Maximum Parsimony • How many possible unrooted trees?
      Site        1  2  3  4  5  6  7  8  9  10
      Species 1   A  G  G  G  T  A  A  C  T  G
      Species 2   A  C  G  A  T  T  A  T  T  A
      Species 3   A  T  A  A  T  T  G  T  C  T
      Species 4   A  A  T  G  T  T  G  T  C  G

  34. Maximum Parsimony • How many possible unrooted trees? (alignment as on slide 33)
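
For the question on this slide, the number of unrooted bifurcating trees on n labeled leaves is (2n-5)!! = 1 · 3 · 5 · … · (2n-5), a standard counting result that the slide itself does not state. A minimal sketch (the function name is illustrative):

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Adding the m-th leaf can attach to any of the 2m-5 existing branches.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


four_species = count_unrooted_trees(4)
ten_species = count_unrooted_trees(10)
```

Four species give 3 trees, which is why the following slides compare exactly three topologies. Ten species already give 2,027,025, which motivates the search strategies discussed later.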

  35. Maximum Parsimony • How many substitutions? (Figure label: MP)

  36. Maximum Parsimony (alignment as on slide 33) • Substitutions at site 1 on each of the three trees: 0, 0, 0

  37. Maximum Parsimony (alignment as on slide 33) • Substitutions at sites 1-2 on each of the three trees: 0 3, 0 3, 0 3

  38. Maximum Parsimony, site 2 • 1 - G, 2 - C, 3 - T, 4 - A • Each of the three trees needs 3 substitutions (Figure: the three trees with leaf and ancestral states)

  39. Maximum Parsimony (alignment as on slide 33) • Substitutions at sites 1-3 on each of the three trees: 0 3 2, 0 3 2, 0 3 2

  40. Maximum Parsimony (alignment as on slide 33) • Substitutions at sites 1-4 on each of the three trees: 0 3 2 2, 0 3 2 2, 0 3 2 1

  41. Maximum Parsimony, site 4 • 1 - G, 2 - A, 3 - A, 4 - G • The three trees need 2, 2, and 1 substitutions (Figure: the three trees with leaf and ancestral states)

  42. Maximum Parsimony • Substitutions per site for the three trees:
      Site     1  2  3  4  5  6  7  8  9  10  Total
      Tree 1   0  3  2  2  0  1  1  1  1  3   14
      Tree 2   0  3  2  2  0  1  2  1  2  3   16
      Tree 3   0  3  2  1  0  1  2  1  2  3   15

  43. Maximum Parsimony (alignment as on slide 33) • Most parsimonious tree, substitutions per site: 0 3 2 2 0 1 1 1 1 3, total 14

  44. Searching for Trees

  45. Searching for the Optimal Tree • Exhaustive Search • Very intensive • Branch and Bound • A compromise • Heuristic • Fast • Usually starts with NJ

  46. Phylogenetic Tree Assumptions • Topology: bifurcating • Leaves - 1…N • Internal nodes - N+1…2N-2 • Lengths t = {ti} for each branch • Phylogenetic tree = (Topology, Lengths) = (T,t) (Figure labels: branch, internal node, leaf)

  47. Probabilistic Methods • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. • Background probabilities: q(a) • Mutation probabilities: P(a|b,t) • Models for evolutionary mutations • Jukes Cantor • Kimura 2-parameter model • Such models are used to derive the probabilities

  48. Jukes Cantor model • A model for mutation rates • Mutation occurs at a constant rate • Each nucleotide is equally likely to mutate into any other nucleotide with rate α • Rate matrix: $R_{ab} = \alpha$ for $a \neq b$, $R_{aa} = -3\alpha$

  49. Kimura 2-parameter model • Allows a different rate for transitions and transversions.
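
A sketch of the corresponding rate matrix, assuming the usual parameterization with one rate for transitions (A↔G, C↔T) and another for transversions. The parameter names `alpha` and `beta` and the dict-based matrix layout are choices made for this example. Setting the two rates equal recovers the Jukes Cantor matrix.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ((a in PURINES) == (b in PURINES))


def kimura_rate_matrix(alpha, beta):
    """Rate matrix as a dict R[a, b]; alpha = transition rate, beta = transversion rate."""
    R = {}
    for a in NUCLEOTIDES:
        for b in NUCLEOTIDES:
            if a != b:
                R[a, b] = alpha if is_transition(a, b) else beta
        # Diagonal entry makes each row sum to zero.
        R[a, a] = -sum(R[a, b] for b in NUCLEOTIDES if b != a)
    return R


R = kimura_rate_matrix(alpha=2, beta=1)
```

Each nucleotide has one transition partner and two transversion partners, so every diagonal entry equals -(alpha + 2·beta).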

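  50. Mutation Probabilities • The rate matrix R is used to derive the mutation probability matrix S: for a short time ε, $S(\epsilon) \approx I + R\epsilon$ • S is obtained by integration. For Jukes Cantor: $P(a|a,t) = \tfrac{1}{4}\big(1 + 3e^{-4\alpha t}\big)$ and $P(a|b,t) = \tfrac{1}{4}\big(1 - e^{-4\alpha t}\big)$ for $a \neq b$ • q can be obtained by setting t to infinity, giving $q(a) = \tfrac{1}{4}$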
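
A short numeric sketch of the Jukes Cantor case, assuming the standard closed form: probability $\tfrac{1}{4}(1 + 3e^{-4\alpha t})$ of seeing the same nucleotide after time t and $\tfrac{1}{4}(1 - e^{-4\alpha t})$ for each different one. The function name and argument order are illustrative.

```python
import math


def jukes_cantor_prob(a, b, t, alpha):
    """P(a | b, t) under Jukes Cantor with substitution rate alpha."""
    decay = math.exp(-4 * alpha * t)
    if a == b:
        return 0.25 * (1 + 3 * decay)
    return 0.25 * (1 - decay)


# At t = 0 nothing has changed; for large t every nucleotide approaches 1/4.
p_same_start = jukes_cantor_prob("A", "A", 0.0, alpha=0.1)
p_diff_late = jukes_cantor_prob("A", "C", 1000.0, alpha=0.1)
row_sum = sum(jukes_cantor_prob(a, "G", 2.5, alpha=0.1) for a in "ACGT")
```

The long-time limit illustrates the last bullet: letting t grow makes every entry approach the background probability 1/4.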
