Phylogenetic trees Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Oct 1st, 2013
Key concepts in this section • What are phylogenies or phylogenetic trees? • Terminology such as extant, ancestral, branch point, branch length, orthologs, paralogs, taxon • Why build phylogenetic trees? • How to build phylogenetic trees? • Distance-based methods • Parsimony methods • Minimize the number of changes • Probabilistic methods • Find the tree that best explains the data using probabilistic models
What are phylogenetic trees? • A tree that describes evolutionary relationships among entities • Species, genes, strains • Leaves represent extant entities • Internal nodes represent ancestral species • Such a tree is inferred from observations in existing organisms.
Tree of life aims to represent the phylogeny of all species on Earth From http://tellapallet.com/tree_of_life.htm
Phylogenetic tree of 29 mammals Lindblad-Toh et al., 2011, Nature
Why phylogenetic trees? • Understand how organisms are related • Do humans share a more recent common ancestor with chimpanzees or with gorillas? • Ask how closely organisms are related • Humans and chimpanzees shared a common ancestor about 5 million years ago (mya) • Provide insight into the evolutionary history of species • How specific functions have evolved • Language evolution • Identify signatures of conservation of sequence • Conjecture the fate of specific regions of the genome • Will the human Y disappear? • Inform multiple sequence alignments
Orthologs and paralogs • Orthologs: • Two sequences in two species that have a common ancestor • Diverged due to a speciation event • Used to create a “species tree” • Paralogs: • Two sequences in the same species that arose from a gene duplication event • Captured in a “gene tree”.
Phylogenetic tree basics • Leaves represent the things (genes, species, individuals/strains) being compared • The term taxon (plural: taxa) is used to refer to these when they represent species and broader classifications of organisms • For example, if the taxa are species, the tree is a species tree • Internal nodes are hypothetical ancestral units • Phylogenetic trees can be rooted or unrooted • The root represents the common ancestor • In a rooted tree, the path from the root to a node represents an evolutionary path • Gives directionality to evolutionary time • An unrooted tree specifies relationships among taxa, but not descent from an ancestor
Tree basics • [Figure: an unrooted tree and a rooted tree, annotated with leaf nodes (extant), internal nodes (ancestral), branches, and branch lengths] • Each tree topology represents a different evolutionary history • For a species tree, internal nodes represent speciation events
Internal nodes represent ancestral species Tree of Life project (http://tolweb.org/tree/)
Rooting a tree • An unrooted tree can be converted to a rooted tree using an outgroup species • Outgroup: a species known to be more distantly related to each of the other species than those species are to one another • Find the branch at which the outgroup attaches to the rest of the tree • Placing the root on that branch gives the rooted tree
Tree counting • A rooted tree with n leaf nodes has • n-1 internal nodes • 2n-2 edges/branches • An unrooted tree with n leaf nodes has • n-2 internal nodes • 2n-3 edges/branches • A root can be added to any of these branches to give 2n-3 rooted trees for any unrooted tree • For three taxa there is one unrooted tree and three rooted trees
Tree counting • [Figure: the unrooted tree on taxa 1, 2, 3, the possible positions for the root, and the resulting rooted trees]
Tree counting • Instead of adding a root we could add a branch for the (n+1)th taxon • [Figure: the unrooted trees obtained by attaching taxon 4 to each branch of the unrooted tree on taxa 1, 2, 3]
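Tree counting • With four leaves, an unrooted tree has five branches • Attaching a fifth taxon to each of these branches gives five trees with five leaves • There are three unrooted trees with four leaves, so we have 3*5 = 15 unrooted trees with five leaves • In general, for n leaves we can have • (1)(3)(5)...(2n-5) unrooted trees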
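The counting argument above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of the lecture. It assumes binary trees with labeled leaves and uses the fact stated earlier that each unrooted tree has 2n-3 branches on which a root can be placed.

```python
def odd_product(m: int) -> int:
    """Return 1 * 3 * 5 * ... * m for odd m (1 if m < 1)."""
    result = 1
    for k in range(1, m + 1, 2):
        result *= k
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled leaves: (1)(3)(5)...(2n-5)."""
    if n < 3:
        raise ValueError("need at least three leaves")
    return odd_product(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Each unrooted tree has 2n-3 branches, and each branch can hold the root."""
    return num_unrooted_trees(n) * (2 * n - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The rapid growth of these counts is why exhaustive search over tree topologies is impractical beyond a handful of taxa.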
Constructing phylogenetic trees • Three types of methods • Distance based methods • Parsimony methods • Probabilistic approaches • Most methods start with pairwise distance methods • We have already seen one method!
Methods for phylogenetic tree reconstruction • Distance-based methods • UPGMA • Neighbor joining • Assume additivity and sometimes a “molecular clock” • Additivity means we can add up the branch lengths of the tree connecting two nodes and get their distances. • Alignment-based methods • Parsimony • Probabilistic
Defining distance between sequences • Fractional alignment difference for two sequences i and j • pij = mij/Lij • Gives an estimate of changes per site • mij: number of mismatches • Lij: number of aligned positions compared • Assumes that each site has changed at most once • Therefore underestimates the distance between divergent sequences • Assumes all sequences change at the same rate • Jukes-Cantor distance • The simplest evolutionary distance
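As a rough illustration of the two distances named on this slide, here is a Python sketch. The function names and the choice to skip gapped columns are assumptions made for the example, not something the slide specifies. The Jukes-Cantor correction used is the standard one, d = -(3/4) ln(1 - (4/3) p).

```python
import math


def p_distance(seq_i: str, seq_j: str) -> float:
    """Fraction of mismatching positions between two aligned sequences.

    Assumption: columns containing a gap ('-') in either sequence are skipped.
    """
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_i, seq_j) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed mismatch fraction p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the raw mismatch fraction, which reflects the point that pij underestimates the true number of changes per site.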
UPGMA relies on the molecular clock assumption • Sequences diverge at the same rate at different points in the phylogeny • Distance from any leaf to root is the same. • If this is true the data is said to be ultrametric
The molecular clock assumption & ultrametric data • Ultrametric data: for any triplet of sequences i, j, k, the three pairwise distances are either all equal, or two are equal and the remaining one is smaller • [Figure: rooted tree with leaves A, E, D, B, C]
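The triplet condition above translates directly into a check on a distance matrix. This is a minimal Python sketch; the function name, the tolerance parameter, and the example matrices are illustrative assumptions.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Check the triplet condition on a symmetric distance matrix.

    For every triplet, the two largest of the three pairwise distances
    must be equal (within tol). This covers both cases from the slide:
    all three equal, or two equal and the third smaller.
    """
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


if __name__ == "__main__":
    clock_like = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
    not_clock_like = [[0, 2, 5], [2, 0, 6], [5, 6, 0]]
    print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```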
Problems with the molecular clock assumption • [Figure: the tree on leaves 1, 2, 3, 4 constructed by UPGMA, shown next to the actual tree]
Neighbor joining • The ultrametric property is too strong • Most sequences diverge at different rates • A more relaxed requirement is that of additivity • The distance between a pair of species/nodes is equal to the sum of the branch lengths on the path connecting them • Uses a similar idea to UPGMA to construct trees • That is, it considers pairs of nodes and joins them • Produces unrooted trees
How to select nodes for merging? • [Figure: four-leaf tree on A, B, C, D with branch lengths 0.1, 0.1, 0.1, 0.4, 0.4] • Given all pairwise distances for n sequences • dij denotes the distance between nodes i and j • Should we select the node pair with the smallest dij? Should we merge A and B?
Need to correct for long branches • L: current set of leaves • ri: average distance from node i to the other nodes in L
Defining the distance to a new node • [Figure: nodes i and j joined at a new node k, with another node m] • Given dij, dim, djm, how do we calculate the distance dkm from the new node k to m?
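One standard way to write the neighbor-joining quantities is given below. It assumes the formulation in Durbin et al., Biological Sequence Analysis, which may differ in notation or normalization from the original slides.

```latex
\begin{align*}
r_i    &= \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}
        && \text{(corrected average distance from } i \text{ to the other leaves)} \\
D_{ij} &= d_{ij} - (r_i + r_j)
        && \text{(criterion minimized when choosing the pair to join)} \\
d_{km} &= \tfrac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right)
        && \text{(distance from the new node } k \text{ to any other node } m\text{)} \\
d_{ik} &= \tfrac{1}{2}\left(d_{ij} + r_i - r_j\right), \qquad d_{jk} = d_{ij} - d_{ik}
        && \text{(branch lengths from } i \text{ and } j \text{ to } k\text{)}
\end{align*}
```

Under additivity, the expression for d_km follows from adding the paths i to m and j to m, which counts the path from k to m twice, and subtracting the path i to j.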
Algorithm for NJ • Initialization • Let T be the set of leaf nodes • L = T • Estimate ri for all i in L • Estimate Dij • Iteration • Pick a pair i, j from L such that Dij is smallest • Define a new node k • Estimate dik, djk; add an edge between k and i, and between k and j • Add k to T; remove i and j from L and add k to L • Estimate Dmn for all nodes m, n in L • Terminate • If L has two nodes, add the edge between these two.
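The steps above can be sketched in Python as follows. This is an illustrative implementation and not the lecture's own code. It assumes the standard definitions ri = (1/(|L|-2)) * sum of dik over L, Dij = dij - (ri + rj), dik = (dij + ri - rj)/2, and dkm = (dim + djm - dij)/2. The function name, the "N0", "N1", ... labels for internal nodes, and the example matrix are hypothetical.

```python
def neighbor_joining(names, dist):
    """Return the edges (node_a, node_b, length) of an unrooted NJ tree.

    names: list of leaf labels; dist: symmetric matrix of pairwise distances.
    """
    # Distances stored as a dict of dicts so new internal nodes can be added.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    active = list(names)  # the set L from the slide
    edges = []            # the growing tree T, stored as an edge list
    next_id = 0

    while len(active) > 2:
        n = len(active)
        # r_i: corrected average distance from i to the other active nodes.
        r = {i: sum(d[i][m] for m in active if m != i) / (n - 2)
             for i in active}

        # Pick the pair (i, j) with the smallest D_ij = d_ij - (r_i + r_j).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = active[x], active[y]
                score = d[i][j] - (r[i] + r[j])
                if best is None or score < best[0]:
                    best = (score, i, j)
        _, i, j = best

        # Create the new internal node k and its two branches.
        k = f"N{next_id}"
        next_id += 1
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        d_jk = d[i][j] - d_ik
        edges.append((k, i, d_ik))
        edges.append((k, j, d_jk))

        # Distances from k to every remaining node m.
        d[k] = {k: 0.0}
        for m in active:
            if m in (i, j):
                continue
            d_km = 0.5 * (d[i][m] + d[j][m] - d[i][j])
            d[k][m] = d_km
            d[m][k] = d_km

        # Replace i and j by k in the active set.
        active = [m for m in active if m not in (i, j)] + [k]

    # Termination: join the last two nodes.
    i, j = active
    edges.append((i, j, d[i][j]))
    return edges


if __name__ == "__main__":
    # Hypothetical additive distances: A and B are neighbors, as are C and D,
    # but the smallest raw distance is between A and C.
    names = ["A", "B", "C", "D"]
    dist = [
        [0.0, 0.5, 0.3, 0.6],
        [0.5, 0.0, 0.6, 0.9],
        [0.3, 0.6, 0.0, 0.5],
        [0.6, 0.9, 0.5, 0.0],
    ]
    for edge in neighbor_joining(names, dist):
        print(edge)
```

In the example matrix the pair with the smallest raw distance (A, C) is not a neighbor pair in the generating tree, while the corrected criterion Dij ranks the true neighbor pairs lowest. Ties are broken here by the order of the leaves, which is an arbitrary design choice.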
An example with neighbor joining • Consider 5 sequences: A, B, C, D, E • Distance matrix [table of pairwise distances among A, B, C, D, E; entries not preserved] • What is the tree inferred by the neighbor joining algorithm?
Can we check for additivity? • Check for additivity (the four-point condition): for any four leaves i, j, k, l, take the distances dij, dik, dil, djk, djl, dkl • Form the three sums of two distances: dij + dkl, dik + djl, dil + djk • [Figure: the three ways of pairing up the leaves i, j, k, l] • These should be such that two of them are equal, and larger than the third.
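The check described above can be written as a short Python sketch. The function name, tolerance, and example matrices are illustrative assumptions. The test used here is that the two largest of the three sums agree, which also accepts the degenerate case where all three sums are equal.

```python
from itertools import combinations


def satisfies_four_point(dist, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    n = len(dist)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ))
        # The two largest sums must be equal (within tol).
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


if __name__ == "__main__":
    additive = [
        [0.0, 0.5, 0.3, 0.6],
        [0.5, 0.0, 0.6, 0.9],
        [0.3, 0.6, 0.0, 0.5],
        [0.6, 0.9, 0.5, 0.0],
    ]
    not_additive = [
        [0.0, 0.5, 0.3, 0.6],
        [0.5, 0.0, 0.6, 1.5],
        [0.3, 0.6, 0.0, 0.5],
        [0.6, 1.5, 0.5, 0.0],
    ]
    print(satisfies_four_point(additive), satisfies_four_point(not_additive))
```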