Class 9: Phylogenetic Trees


The Tree of Life, after Ernst Haeckel, 1891

Evolution • Many theories of evolution • Basic idea: • speciation events lead to creation of different species • Speciation caused by physical separation into groups where different genetic variants become dominant • Any two species share a (possibly distant) common ancestor

Phylogenies • A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current day species • Leaves - current day species • Nodes - hypothetical most recent common ancestors • Edge lengths - “time” from one speciation to the next [Figure: example tree over Aardvark, Bison, Chimp, Dog, Elephant]

Phylogenetic Tree • Topology: bifurcating • Leaves - 1…N • Internal nodes N+1…2N-2 [Figure: tree with a branch, an internal node, and a leaf labeled]

Example: Primate evolution [Figure: primate tree with divergence times of 20-25, 35-37, and 40-45 mya]

How to construct a Phylogeny? • Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria) • Since then, the focus has been on objective criteria for constructing phylogenetic trees • Thousands of articles in the last decades • Important for many aspects of biology • Classification (systematics) • Understanding biological mechanisms

Morphological vs. Molecular • Classical phylogenetic analysis: morphological features • number of legs, lengths of legs, etc. • Modern biological methods allow the use of molecular features • Gene sequences • Protein sequences • Analysis based on homologous sequences (e.g., globins) in different species

Dangers in Molecular Phylogenies • We have to remember that gene/protein sequence can be homologous for different reasons: • Orthologs -- sequences diverged after a speciation event • Paralogs -- sequences diverged after a duplication event • Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

Dangers of Paralogs [Figure: gene tree with one gene duplication followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]

Dangers of Paralogs • If we only consider 1A, 2B, and 3A... [Figure: gene tree with one gene duplication followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]

Types of Trees • A natural model to consider is that of rooted trees [Figure: rooted tree with the root labeled as the common ancestor]

Types of Trees • Depending on the model, data from current day species does not distinguish between different placements of the root [Figure: two alternative root placements on the same tree]

Types of Trees • An unrooted tree represents the same phylogeny without the root node

Positioning Roots in Unrooted Trees • We can estimate the position of the root by introducing an outgroup: • a set of species that are definitely distant from all the species of interest [Figure: unrooted tree over Aardvark, Bison, Chimp, Dog, Elephant with Falcon as the outgroup; the proposed root lies on the branch leading to Falcon]

Types of Data • Distance-based • Input is a matrix of distances between species • Can be fraction of residues they disagree on, or -alignment score between them, or … • Character-based • Examine each character (e.g., residue) separately
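
A minimal sketch of the first distance option above, the fraction of residues on which two aligned sequences disagree. The function names `p_distance` and `distance_matrix` and the three toy sequences are illustrative assumptions, not part of the slides.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(x != y for x, y in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_matrix(seqs):
    """Nested dict d[x][y] of pairwise distances (hypothetical helper)."""
    return {x: {y: p_distance(sx, sy) for y, sy in seqs.items() if y != x}
            for x, sx in seqs.items()}


# Toy input: three short, already aligned sequences.
seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA"}
d = distance_matrix(seqs)
print(d["A"]["B"], d["A"]["C"], d["B"]["C"])
```

The nested-dict layout is only one convenient choice; any symmetric matrix indexed by species would do.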

Simple Distance-Based Method Input: distance matrix between species Outline: • Cluster species together • Initially clusters are singletons • At each iteration combine two “closest” clusters to get a new one

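UPGMA Clustering • Let Ci and Cj be clusters, define distance between them to be the average pairwise distance, $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$ • When combining two clusters, Ci and Cj, to form a new cluster Ck, then for every other cluster Cl, $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$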
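
The clustering loop can be sketched as follows, assuming the usual UPGMA definitions: the distance between clusters is the average pairwise distance, and a merged cluster's distance to any other cluster is the size-weighted average of its two parts. The input format (a nested dict of leaf distances) and the convention of placing each new node at half the merged distance are assumptions of this sketch.

```python
from itertools import combinations


def upgma(dist):
    """dist: nested dict of symmetric leaf distances, dist[x][y].

    Returns (tree, height): the tree as nested tuples of leaf names and
    the height of its root.
    """
    d = {a: dict(row) for a, row in dist.items()}   # working copy
    info = {a: (a, 1, 0.0) for a in dist}           # key -> (subtree, size, height)
    while len(info) > 1:
        # Pick the two closest clusters.
        i, j = min(combinations(info, 2), key=lambda p: d[p[0]][p[1]])
        (tree_i, n_i, _), (tree_j, n_j, _) = info.pop(i), info.pop(j)
        k = (i, j)                                  # key of the merged cluster
        d[k] = {}
        for l in info:
            # Size-weighted average of the distances from the two parts.
            d[k][l] = d[l][k] = (n_i * d[i][l] + n_j * d[j][l]) / (n_i + n_j)
        info[k] = ((tree_i, tree_j), n_i + n_j, d[i][j] / 2)
    (tree, _, height), = info.values()
    return tree, height


# Toy distances that satisfy a molecular clock (illustrative values).
dist = {
    "A": {"B": 2, "C": 6, "D": 10},
    "B": {"A": 2, "C": 6, "D": 10},
    "C": {"A": 6, "B": 6, "D": 10},
    "D": {"A": 10, "B": 10, "C": 10},
}
tree, height = upgma(dist)
print(tree, height)
```

Ties between equally close pairs are broken by iteration order here; a real implementation may want a deterministic or randomized rule.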

Molecular Clock • UPGMA implicitly assumes that all distances measure time in the same way [Figure: two example trees over leaves 1, 2, 3, 4]

Additivity • A weaker requirement is additivity • In a “real” tree, the distance between two species is the sum of the distances between the intermediate nodes on the path connecting them [Figure: leaves i, j, k attached by branches of lengths a, b, c]

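Consequences of Additivity • Suppose input distances are additive • For any three leaves i, j, k that meet at an internal node m through branches of lengths a, b, c: $d(i,j) = a + b$, $d(i,k) = a + c$, $d(j,k) = b + c$ • Thus $d(k,m) = c = \frac{1}{2}\left(d(i,k) + d(j,k) - d(i,j)\right)$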

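Neighbor Joining • Can we use this fact to construct trees? • Let $D(i,j) = d(i,j) - (r_i + r_j)$, where $r_i = \frac{1}{|L|-2} \sum_{k \in L} d(i,k)$ and L is the set of leaves • Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree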

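Neighbor Joining • Set L to contain all leaves • Iteration: • Choose i,j such that D(i,j) is minimal • Create new node k, and set, for every other node m in L, $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$ • Remove i,j from L, and add k • Terminate: when |L| = 2, connect the two remaining nodes [Figure: neighbors i and j joined at new node k, with another node m]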
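
A sketch of the iteration above, assuming the standard neighbor-joining quantities: $r_i$ is the sum of distances from i to the other nodes in L divided by |L| - 2, the pair minimizing $d(i,j) - r_i - r_j$ is joined, and the new node k gets $d(k,m) = \frac{1}{2}(d(i,m) + d(j,m) - d(i,j))$. The edge-length rule $d(i,k) = \frac{1}{2}(d(i,j) + r_i - r_j)$, the node names `node0`, `node1`, … and the edge-list output are assumptions of this sketch.

```python
from itertools import combinations


def neighbor_joining(dist):
    """dist: nested dict of symmetric leaf distances (at least 3 leaves).

    Returns a list of edges (u, v, length) of the reconstructed tree.
    """
    d = {a: dict(row) for a, row in dist.items()}   # working copy
    L = list(d)
    edges = []
    next_id = 0
    while len(L) > 2:
        # Average-like distance of each node to the rest of L.
        r = {i: sum(d[i][m] for m in L if m != i) / (len(L) - 2) for i in L}
        # Choose the pair minimizing the corrected distance D(i, j).
        i, j = min(combinations(L, 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"                        # hypothetical node name
        next_id += 1
        d[k] = {}
        for m in L:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d[i][j] - d_ik))
        L = [m for m in L if m not in (i, j)] + [k]
    a, b = L
    edges.append((a, b, d[a][b]))
    return edges


# Additive toy distances from a tree with leaf branches A=1, B=2, C=3, D=4
# and an internal branch of length 5 separating (A, B) from (C, D).
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
dist = {x: {} for x in "ABCD"}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = value

for edge in neighbor_joining(dist):
    print(edge)
```

On additive input such as this one, the recovered edge lengths should match the generating tree; on real data they are only estimates.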

Distance Based Methods • If we make strong assumptions on distances, we can reconstruct trees • In real life, distances are not additive • Sometimes they are close to additive

Character Based Methods • We start with a multiple alignment • Assumptions: • All sequences are homologous • Each position in alignment is homologous • Positions evolve independently • No gaps • We seek to explain the evolution of each position in the alignment

Parsimony • Character-based method • A way to score trees (but not to build trees!) Assumptions: • Independence of characters (no interactions) • Best tree is one where minimal changes take place

A Simple Example • What is the parsimony score of the tree below? • A: CAGGTA • B: CAGACA • C: CGGGTA • D: TGCACT • E: TGCGTA [Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant]

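A Simple Example • A: CAGGTA • B: CAGACA • C: CGGGTA • D: TGCACT • E: TGCGTA • Each column is scored separately. • Let’s look at the first column: C, C, C, T, T • Minimal tree has one evolutionary change, between T and C [Figure: tree with the first-column states at its nodes and the single change marked]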

Evaluating Parsimony Scores • How do we compute the Parsimony score for a given tree? • Traditional Parsimony • Each base change has a cost of 1 • Weighted Parsimony • Each change is weighted by the score c(a,b)

Traditional Parsimony • Solved independently for each position • Linear time solution [Figure: tree with leaves labeled a, g, a; the internal nodes carry the candidate sets {a,g} and {a}]
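
One way to realize the linear-time, set-based computation is the Fitch procedure: each internal node takes the intersection of its children's candidate sets if it is non-empty, otherwise their union at a cost of one change. The sketch below applies it to the five example sequences. The rooted topology `(((A, B), C), (D, E))` is an assumption made for illustration; the slides do not state the tree.

```python
def fitch_site(tree, column):
    """Score one alignment column on a rooted binary tree.

    tree: nested 2-tuples of leaf names; column: dict leaf -> character.
    Returns (candidate_states, number_of_changes).
    """
    if isinstance(tree, str):                 # leaf
        return {column[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, column)
    right_set, right_cost = fitch_site(right, column)
    common = left_set & right_set
    if common:                                # children agree: no new change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum of per-column scores over an alignment (dict name -> sequence)."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(length)
    )


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA",
        "D": "TGCACT", "E": "TGCGTA"}
tree = ((("A", "B"), "C"), ("D", "E"))        # assumed topology
first_column = {name: s[0] for name, s in seqs.items()}
print(fitch_site(tree, first_column)[1])      # changes in the first column
print(parsimony_score(tree, seqs))            # total over all six columns
```

A different topology would generally give a different score, which is exactly what makes parsimony usable as a criterion for comparing trees.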

Evaluating Weighted Parsimony • Dynamic programming on the tree • $S(i,a)$ = cost of tree rooted at i if i is labeled by a • Initialization: • For each leaf i set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$ • Iteration: • if k is a node with children i and j, then $S(k,a) = \min_b \left(S(i,b) + c(a,b)\right) + \min_b \left(S(j,b) + c(a,b)\right)$ • Termination: • cost of tree is $\min_a S(r,a)$ where r is the root
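
The recursion can be sketched directly for a single alignment column. The tree encoding (nested 2-tuples of leaf names), the function names, and the example cost function (transitions cost 1, transversions cost 2) are assumptions chosen for illustration.

```python
import math


def sankoff(tree, labels, alphabet, cost):
    """Return {a: S(node, a)} for the root of `tree`.

    tree: nested 2-tuples of leaf names; labels: dict leaf -> character;
    cost(a, b): cost of changing a into b along one branch.
    """
    if isinstance(tree, str):                 # leaf: 0 for its label, else inf
        return {a: 0.0 if labels[tree] == a else math.inf for a in alphabet}
    left, right = tree
    s_i = sankoff(left, labels, alphabet, cost)
    s_j = sankoff(right, labels, alphabet, cost)
    return {
        a: min(s_i[b] + cost(a, b) for b in alphabet)
           + min(s_j[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony(tree, labels, alphabet, cost):
    """Cost of the tree: the minimum of S(r, a) over root labels a."""
    return min(sankoff(tree, labels, alphabet, cost).values())


def unit_cost(a, b):
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Illustrative weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2


tree = ((("A", "B"), "C"), ("D", "E"))        # assumed topology
column = {"A": "G", "B": "G", "C": "G", "D": "C", "E": "C"}
print(weighted_parsimony(tree, column, "ACGT", unit_cost))
print(weighted_parsimony(tree, column, "ACGT", ts_tv_cost))
```

With the unit cost the result coincides with the traditional parsimony score of the column, so the weighted version is a strict generalization.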

Cost of Evaluating Parsimony • The score is evaluated on each position independently. Scores are then summed over all positions. • If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; the weighted dynamic program minimizes over k values of b for each of the k values of a at every node, giving O(nmk²) • By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node

Maximum Parsimony
Site:         1  2  3  4  5  6  7  8  9 10
Species 1 -   A  G  G  G  T  A  A  C  T  G
Species 2 -   A  C  G  A  T  T  A  T  T  A
Species 3 -   A  T  A  A  T  T  G  T  C  T
Species 4 -   A  A  T  G  T  T  G  T  C  G
How many possible unrooted trees?
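
For reference, the number of distinct unrooted bifurcating trees on n labeled leaves is (2n-5)!! = 3 · 5 · … · (2n-5). A small sketch of the count follows; the function name is an assumption.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):     # 3, 5, ..., 2n - 5
        count *= factor
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_trees(n))
```

For the four species here the count is 3, while for ten species it already exceeds two million, which is why exhaustive search quickly becomes impractical.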


Maximum Parsimony • How many substitutions? [Figure labeled “MP”]

Maximum Parsimony • Site 1 (A, A, A, A): each of the three trees needs 0 substitutions

Maximum Parsimony • Sites 1-2: substitutions so far on the three trees are (0, 3), (0, 3), (0, 3)

Maximum Parsimony • Site 2 (1 - G, 2 - C, 3 - T, 4 - A): each of the three trees needs 3 substitutions [Figure: the three trees with states at the nodes]

Maximum Parsimony • Sites 1-3: substitutions so far on the three trees are (0, 3, 2), (0, 3, 2), (0, 3, 2)

Maximum Parsimony • Sites 1-4: substitutions so far on the three trees are (0, 3, 2, 2), (0, 3, 2, 2), (0, 3, 2, 1)

Maximum Parsimony • Site 4 (1 - G, 2 - A, 3 - A, 4 - G): the three trees need 2, 2, and 1 substitutions [Figure: the three trees with states at the nodes]

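Maximum Parsimony
Substitutions per site on each of the three trees:
Site:      1  2  3  4  5  6  7  8  9 10   Total
Tree 1:    0  3  2  2  0  1  1  1  1  3   14
Tree 2:    0  3  2  2  0  1  2  1  2  3   16
Tree 3:    0  3  2  1  0  1  2  1  2  3   15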

Maximum Parsimony
Site:            1  2  3  4  5  6  7  8  9 10
1 -              A  G  G  G  T  A  A  C  T  G
2 -              A  C  G  A  T  T  A  T  T  A
3 -              A  T  A  A  T  T  G  T  C  T
4 -              A  A  T  G  T  T  G  T  C  G
Substitutions:   0  3  2  2  0  1  1  1  1  3   (total 14)

Searching for Trees

Searching for the Optimal Tree • Exhaustive Search • Very intensive • Branch and Bound • A compromise • Heuristic • Fast • Usually starts with NJ

Phylogenetic Tree Assumptions • Topology: bifurcating • Leaves - 1…N • Internal nodes N+1…2N-2 • Lengths t = {ti} for each branch • Phylogenetic tree = (Topology, Lengths) = (T,t) [Figure: tree with a branch, an internal node, and a leaf labeled]

Probabilistic Methods • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. • Background probabilities: q(a) • Mutation probabilities: P(a|b,t) • Models for evolutionary mutations • Jukes Cantor • Kimura 2-parameter model • Such models are used to derive the probabilities

Jukes Cantor model • A model for mutation rates • Mutation occurs at a constant rate • Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.

Kimura 2-parameter model • Allows a different rate for transitions and transversions.

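Mutation Probabilities • The rate matrix R is used to derive the mutation probability matrix S: for a small time step ε, $S(\varepsilon) \approx I + R\varepsilon$, so that $S'(t) = S(t)R$ • S is obtained by integration. For Jukes Cantor with substitution rate α: $P(a|a,t) = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$ and $P(a|b,t) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$ for $a \neq b$ • q can be obtained by setting t to infinity: $q(a) = \frac{1}{4}$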
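
A sketch of the Jukes Cantor mutation probabilities, using the standard closed form for a model in which every nucleotide changes into each other nucleotide at rate α: the probability of observing the same nucleotide after time t is $\frac{1}{4}(1 + 3e^{-4\alpha t})$, and the probability of each specific different nucleotide is $\frac{1}{4}(1 - e^{-4\alpha t})$. The function name and the nested-dict layout are assumptions.

```python
import math


def jukes_cantor(t, alpha, alphabet="ACGT"):
    """Mutation probabilities after time t under Jukes Cantor with rate alpha.

    Returns a nested dict P[b][a]: probability of a given ancestor b.
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)
    different = 0.25 * (1.0 - decay)
    return {b: {a: same if a == b else different for a in alphabet}
            for b in alphabet}


for t in (0.0, 0.5, 100.0):
    P = jukes_cantor(t, alpha=1.0)
    print(t, P["A"]["A"], P["A"]["G"])
```

At t = 0 the matrix is the identity, and for large t every entry approaches 1/4, which is the background distribution q.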
