Introduction to Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 9th, 2012
Phylogenetic inference: task definition • Given • data characterizing a set of species/genes • Do • infer a phylogenetic tree that accurately characterizes the evolutionary lineages among the species/genes
What is a tree? • undirected case: a connected graph without cycles • directed case: underlying undirected graph is a tree (sometimes requires indegree(v) ≤ 1 for all v) • with that requirement, each node has at most one parent (predecessor)
Phylogenetic tree basics • leaves represent things (genes, species, individuals/strains) being compared • the term taxon (taxa plural) is used to refer to these when they represent species and broader classifications of organisms • internal nodes are hypothetical ancestral units • in a rooted tree, path from root to a node represents an evolutionary path • the root represents the common ancestor • an unrooted tree specifies relationships among things, but not the evolutionary paths from a common ancestor
Motivation • Why construct phylogenetic trees? • to understand lineage of various species • to understand how various functions evolved • to inform multiple alignments • to identify what is most conserved/important in some class of sequences • to identify what is under accelerated evolution
Hox genes • Specify body patterning (anterior-posterior patterning). • Exhibit co-linearity. • Homologous genes acting in an apparently homologous way across the animal kingdom. Ferrier & Minguillion, 2003
Example species tree: 29 Mammals • numbers mean # of substitutions per 100 bps • Image from Lindblad-Toh et al., 2011
Genetic Analysis of Lice Supports Direct Contact between Modern and Archaic Humans (D. Reed et al., PLoS Biology 2(11), November 2004) • inferred phylogeny of lice species closely parallels accepted phylogeny of their hosts • can phylogeny of lice tell us something about evolution of hosts?
Genetic Analysis of Lice Supports Direct Contact between Modern and Archaic Humans (D. Reed et al., PLoS Biology 2(11), November 2004) • a more detailed phylogenetic analysis of human lice species shows two quite separate clades (subtrees) • lice lineages seem to have diverged when the lineage of H. sapiens diverged from an extinct human lineage
Genetic Analysis of Lice Supports Direct Contact between Modern and Archaic Humans (D. Reed et al., PLoS Biology 2(11), November 2004) • this phylogeny supports a theory of human evolution in which • H. erectus and the ancestors of H. sapiens had little or no contact for a long period of time • there was contact between H. erectus and H. sapiens as late as 30,000 years ago
Data for building trees • trees can be constructed from various types of data • morphological features (e.g. # legs), fossils • DNA/protein sequences
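Rooted vs. unrooted trees [figure: an unrooted tree with nodes labeled 1 to 8 and a rooted tree with nodes labeled 1 to 9, the rooted tree drawn along a time axis]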
Number of possible trees • given $n$ sequences, there are $\prod_{i=3}^{n} (2i-5)$ possible unrooted trees • and $(2n-3) \prod_{i=3}^{n} (2i-5)$ possible rooted trees
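To get a feel for how quickly the search space grows, here is a minimal sketch that evaluates the standard counts for binary (bifurcating) trees on $n$ labeled leaves: $\prod_{i=3}^{n}(2i-5)$ unrooted trees and $(2n-3)$ times as many rooted trees, since a root can be placed on any of the $2n-3$ edges. The function names are hypothetical, chosen only for this illustration.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled leaves: prod_{i=3..n} (2i - 5)."""
    if n < 2:
        raise ValueError("need at least two leaves")
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count


def num_rooted_trees(n: int) -> int:
    """Number of rooted binary trees: one root position per edge of each unrooted tree."""
    return (2 * n - 3) * num_unrooted_trees(n)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
# 3 1 3
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

Already at 10 sequences there are millions of candidate trees, which is why exhaustive enumeration is rarely an option.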
Phylogenetic tree approaches • three general types of methods • distance: find tree that accounts for estimated evolutionary distances • parsimony: find the tree that requires minimum number of changes to explain the data • maximum likelihood: find the tree that maximizes the likelihood of the data
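Representing distances in rooted and unrooted trees • rooted tree: distances represented by summed height of edges to reach common ancestor • unrooted tree: distances represented by summed length of edges to reach common ancestor [figure: a rooted tree over leaves A, B, C, D, E with a height axis from 1 to 4, and an unrooted tree over the same leaves with edge lengths; examples shown: dist(A,C) = 8, dist(A,D) = 5]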
Distance-based approaches • given: an $n \times n$ matrix $M$ where $M_{ij}$ is the distance between taxa $i$ and $j$ • do: build an edge-weighted tree such that the distances between leaves $i$ and $j$ correspond to $M_{ij}$ [figure: rooted tree over leaves A, B, C, D, E with a height axis from 1 to 4]
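Where do we get distances? • commonly obtained from sequence alignments, e.g. from the fraction of mismatched positions in the alignment of sequence $i$ with sequence $j$ • to account for evolutionary time between sequences, the raw mismatch fraction is typically corrected for multiple substitutions at the same site (e.g. the Jukes-Cantor correction)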
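A minimal sketch of one common way to turn a pairwise alignment into a distance. The slide does not say which formulas the course uses, so both choices here are assumptions: the mismatch fraction $f_{ij}$ over ungapped columns, and the Jukes-Cantor correction $d_{ij} = -\frac{3}{4}\ln(1 - \frac{4}{3} f_{ij})$ for DNA. Function names are hypothetical.

```python
import math


def mismatch_fraction(seq_i: str, seq_j: str) -> float:
    """Fraction of aligned, ungapped columns in which the two sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns where either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(seq_i, seq_j) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(f: float) -> float:
    """Jukes-Cantor corrected distance for DNA (assumed model): -3/4 ln(1 - 4f/3)."""
    if f >= 0.75:
        raise ValueError("correction is undefined for mismatch fractions >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


f = mismatch_fraction("ACGTACGT", "ACGTACGA")  # one mismatch in eight columns
print(f, jukes_cantor_distance(f))
```

The corrected distance is always at least as large as the raw fraction, because some observed matches hide multiple substitutions.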
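Distance metrics • properties of a distance metric • non-negativity: $d_{ij} \ge 0$ • identity: $d_{ii} = 0$ • symmetry: $d_{ij} = d_{ji}$ • triangle inequality: $d_{ij} \le d_{ik} + d_{kj}$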
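As a quick sanity check before tree building, the standard metric properties (non-negativity, zero self-distance, symmetry, triangle inequality) can be tested directly on a distance matrix. This is a small sketch, assuming the matrix is a square list of lists; `is_metric` is a hypothetical helper name.

```python
from itertools import permutations


def is_metric(d, tol=1e-9):
    """Check non-negativity, zero diagonal, symmetry, and the triangle inequality."""
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:
            return False
        for j in range(n):
            if d[i][j] < -tol or abs(d[i][j] - d[j][i]) > tol:
                return False
    # Triangle inequality for every ordered triple of distinct indices.
    return all(
        d[i][j] <= d[i][k] + d[k][j] + tol
        for i, j, k in permutations(range(n), 3)
    )


print(is_metric([[0, 3, 5], [3, 0, 4], [5, 4, 0]]))  # True
print(is_metric([[0, 1, 5], [1, 0, 1], [5, 1, 0]]))  # False: 5 > 1 + 1
```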
The UPGMA method (Unweighted Pair Group Method using Arithmetic Averages) • given ultrametric data, UPGMA will reconstruct the tree $T$ that is consistent with the data • basic idea: • iteratively pick two taxa/clusters and merge them • create new node in tree for merged cluster • distance $d_{ij}$ between clusters $C_i$ and $C_j$ of taxa is defined as $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i, q \in C_j} d_{pq}$ (avg. distance between pairs of taxa from each cluster)
UPGMA algorithm
assign each taxon to its own cluster
define one leaf for each taxon; place it at height 0
while more than two clusters:
    determine two clusters $i$, $j$ with smallest $d_{ij}$
    define a new cluster $C_k = C_i \cup C_j$
    define a node $k$ with children $i$ and $j$; place it at height $d_{ij}/2$
    replace clusters $i$ and $j$ with $k$
    compute distance between $k$ and other clusters
join last two clusters, $i$ and $j$, by root at height $d_{ij}/2$
UPGMA • given a new cluster $C_k$ formed by merging $C_i$ and $C_j$ • we can calculate the distance between $C_k$ and any other cluster $C_l$ as follows: $d_{kl} = \frac{|C_i| \, d_{il} + |C_j| \, d_{jl}}{|C_i| + |C_j|}$
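The UPGMA steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the course's implementation: it assumes a symmetric matrix given as a list of lists, represents each internal node as a tuple `(left, right, height)`, places a merged node at half the distance between its two clusters, and updates distances with the size-weighted average of the two merged clusters. The final merge in the loop plays the role of the root. The name `upgma` and the tuple layout are choices made for this example, and the input matrix is made-up ultrametric data.

```python
def upgma(names, d):
    """Build a UPGMA tree; internal nodes are (left, right, height) tuples."""
    # Each cluster id maps to (subtree, number of taxa in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): float(d[i][j]) for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (i, j), d_ij = min(dist.items(), key=lambda kv: kv[1])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        k, next_id = next_id, next_id + 1
        # Size-weighted average distance from the merged cluster to every other one.
        new_dist = {}
        for l in clusters:
            d_il = dist[(min(i, l), max(i, l))]
            d_jl = dist[(min(j, l), max(j, l))]
            new_dist[(l, k)] = (n_i * d_il + n_j * d_jl) / (n_i + n_j)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        # The new node sits at half the distance between the merged clusters.
        clusters[k] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
    (tree, _), = clusters.values()
    return tree


# Made-up ultrametric distances for four taxa.
names = ["A", "B", "C", "D"]
d = [[0, 2, 4, 6],
     [2, 0, 4, 6],
     [4, 4, 0, 6],
     [6, 6, 6, 0]]
print(upgma(names, d))
# ('D', ('C', ('A', 'B', 1.0), 2.0), 3.0)
```

Because the input is ultrametric, every leaf ends up at the same distance (3.0) from the root.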
UPGMA example [figure: two panels, "initial state" and "after one merge", each showing leaves A, B, C, D, E against a height axis from 1 to 4]
UPGMA example (cont.) [figure: three panels, "after two merges", "after three merges", and "final state", each showing leaves A, B, C, D, E against a height axis from 1 to 4]
UPGMA relies on the molecular clock assumption • Sequences diverge at the same rate at different points in the phylogeny • Distance from any leaf to root is the same.
The molecular clock assumption & ultrametric data • the molecular clock assumption: sequences are diverging at the same rate at every point in the phylogeny • this assumption is not generally true: selection pressures vary across time periods, organisms, genes within an organism, regions within a gene • if it does hold, then the data is said to be ultrametric
The molecular clock assumption & ultrametric data • ultrametric data: for any triplet of sequences $i$, $j$, $k$, the distances $d_{ij}$, $d_{ik}$, $d_{jk}$ are either all equal, or two are equal and the remaining one is smaller [figure: rooted tree over leaves A, B, C, D, E with a height axis from 1 to 4]
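The triplet condition is easy to test directly: for every three taxa, the two largest of the three pairwise distances must be equal. A minimal sketch, assuming a symmetric list-of-lists matrix and a hypothetical helper name `is_ultrametric`; both matrices are made-up illustrations.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """True if, for every triplet, the two largest pairwise distances are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        smallest, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


clock_like = [[0, 2, 4, 6],
              [2, 0, 4, 6],
              [4, 4, 0, 6],
              [6, 6, 6, 0]]
not_clock_like = [[0, 3, 6],
                  [3, 0, 5],
                  [6, 5, 0]]
print(is_ultrametric(clock_like))      # True
print(is_ultrametric(not_clock_like))  # False
```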
Neighbor joining • unlike UPGMA • doesn’t make molecular clock assumption • produces unrooted trees • does assume additivity: distance between pair of leaves is sum of lengths of edges connecting them • like UPGMA, constructs a tree by iteratively joining subtrees • two key differences • how pair of subtrees to be merged is selected on each iteration • how distances are updated after each merge
Picking pairs of nodes to join in NJ • at each step, we pick a pair of nodes to join; should we pick a pair with minimal $d_{ij}$? • suppose the real tree looks like this and we’re picking the first pair of nodes to join [figure: unrooted tree over leaves A, B, C, D with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4] • wrong decision to join A and B: need to consider distance of pair to other leaves
Picking pairs of nodes to join in NJ • to avoid this, pick pair to join based on $D_{ij} = d_{ij} - (r_i + r_j)$ [Saitou & Nei ’87; Studier & Keppler ’88] where $r_i = \frac{1}{|L|-2} \sum_{m \in L} d_{im}$ and $L$ is the set of leaves
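A small worked sketch of why the corrected criterion matters. It uses the standard neighbor-joining selection rule $D_{ij} = d_{ij} - (r_i + r_j)$ with $r_i = \frac{1}{|L|-2}\sum_{m \in L} d_{im}$. The tree layout is an assumption made for this example: A and D are neighbors, B and C are neighbors, the edges to A and B have length 0.1, the edges to C and D have length 0.4, and the internal edge has length 0.1. The helper names are hypothetical.

```python
from itertools import combinations

# Path lengths in the assumed tree (A, D neighbors; B, C neighbors).
d = {
    "A": {"B": 0.3, "C": 0.6, "D": 0.5},
    "B": {"A": 0.3, "C": 0.5, "D": 0.6},
    "C": {"A": 0.6, "B": 0.5, "D": 0.9},
    "D": {"A": 0.5, "B": 0.6, "C": 0.9},
}


def closest_pair(d):
    """Naive choice: the pair with the smallest raw distance."""
    return min(combinations(sorted(d), 2), key=lambda p: d[p[0]][p[1]])


def nj_pick_pair(d):
    """Neighbor-joining choice: minimize D_ij = d_ij - (r_i + r_j)."""
    leaves = sorted(d)
    r = {i: sum(d[i][m] for m in leaves if m != i) / (len(leaves) - 2) for i in leaves}
    return min(combinations(leaves, 2), key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])


print(closest_pair(d))   # ('A', 'B'): closest, but not neighbors in the tree
print(nj_pick_pair(d))   # a true neighbor pair, either A with D or B with C
```

Here $r_A = r_B = 0.7$ and $r_C = r_D = 1.0$, so $D_{AD} = D_{BC} = -1.2$ while $D_{AB} = -1.1$: subtracting the average distance to the other leaves compensates for the long edges to C and D.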
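Updating distances in neighbor joining • given a new internal node $k$, the distance to another node $m$ is given by: $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ [figure: nodes $i$ and $j$ joined at new node $k$, with $k$ connected toward another node $m$]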
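Updating distances in neighbor joining • can calculate the distance from a leaf to its parent node in the same way: $d_{ik} = \frac{1}{2}(d_{ij} + d_{im} - d_{jm})$ [figure: nodes $i$ and $j$ joined at new node $k$, with $k$ connected toward another node $m$]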
Updating distances in neighbor joining • we can generalize this so that we take into account the distance to all other leaves: $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ where $r_i = \frac{1}{|L|-2} \sum_{m \in L} d_{im}$ and $L$ is the set of leaves
Neighbor joining algorithm
define the tree $T$ = set of leaf nodes
$L = T$
while more than two subtrees in $T$:
    pick the pair $i$, $j$ in $L$ with minimal $D_{ij} = d_{ij} - (r_i + r_j)$
    add to $T$ a new node $k$ joining $i$ and $j$
    determine new distances $d_{ik}$, $d_{jk}$, and $d_{km}$ for all other $m$ in $L$
    remove $i$ and $j$ from $L$ and insert $k$ (treat it like a leaf)
join two remaining subtrees, $i$ and $j$, with edge of length $d_{ij}$
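The loop above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the course's reference code: distances are a dict of dicts keyed by leaf name, internal nodes get hypothetical names `node0`, `node1`, ..., the tree is returned as a list of `(node, node, length)` edges, and the usual neighbor-joining formulas are used for selection ($D_{ij} = d_{ij} - r_i - r_j$), branch lengths ($d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$), and updates ($d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$). The five-taxon matrix is made-up additive data, and at least two leaves are required.

```python
from itertools import combinations


def neighbor_joining(d):
    """Return the unrooted NJ tree as a list of (node_a, node_b, edge_length)."""
    d = {i: dict(row) for i, row in d.items()}  # work on a copy
    active = list(d)                            # the current set L
    edges, next_id = [], 0
    while len(active) > 2:
        n = len(active)
        r = {i: sum(d[i][m] for m in active if m != i) / (n - 2) for i in active}
        # Pick the pair minimizing D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(active, 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k, next_id = f"node{next_id}", next_id + 1
        # Branch lengths from i and j to the new internal node k.
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        edges += [(i, k, d_ik), (j, k, d[i][j] - d_ik)]
        # Distances from k to every remaining node m.
        d[k] = {}
        for m in active:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, d[i][j]))  # join the last two subtrees
    return edges


# Made-up additive distances for five taxa.
rows = {"a": [0, 5, 9, 9, 8], "b": [5, 0, 10, 10, 9], "c": [9, 10, 0, 8, 7],
        "d": [9, 10, 8, 0, 3], "e": [8, 9, 7, 3, 0]}
dist = {x: {y: v for y, v in zip(rows, row) if y != x} for x, row in rows.items()}
for edge in neighbor_joining(dist):
    print(edge)
```

For n leaves the result has 2n - 3 edges, and because this input is additive, summing edge lengths along the path between any two leaves gives back their input distance.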
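Testing for additivity • for every set of four leaves, $i$, $j$, $k$, and $l$, two of the distances $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$, and $d_{il} + d_{jk}$ must be equal and not less than the third [figure: the three possible pairings of four leaves labeled 1, 2, 3, 4 in an unrooted tree]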
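The four-point condition translates directly into code: for every quartet, form the three pairwise sums and check that the two largest are equal. A minimal sketch, assuming a symmetric list-of-lists matrix; `is_additive` is a hypothetical helper name and both matrices are made-up illustrations.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Four-point condition: in every quartet the two largest pair sums are equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


additive = [[0, 5, 9, 9, 8],
            [5, 0, 10, 10, 9],
            [9, 10, 0, 8, 7],
            [9, 10, 8, 0, 3],
            [8, 9, 7, 3, 0]]
not_additive = [[0, 1, 1, 1],
                [1, 0, 1, 1],
                [1, 1, 0, 5],
                [1, 1, 5, 0]]
print(is_additive(additive))      # True
print(is_additive(not_additive))  # False
```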
Rooting trees • finding a root in an unrooted tree is sometimes accomplished by using an outgroup • outgroup: a species known to be more distantly related to remaining species than they are to each other • edge joining the outgroup to the rest of the tree is best candidate for root position [figure: unrooted tree with nodes labeled 1 to 8, with the outgroup and the candidate root position marked]
Rooting trees • chimpanzee lice used as outgroup in human lice study
Comments on distance-based methods • if the given distance data is ultrametric (and these distances represent real distances), then UPGMA will identify the correct tree • if the data is additive (and these distances represent real distances), then neighbor joining will identify the correct tree • otherwise, the methods may not recover the correct tree, but they may still be reasonable heuristics • neighbor joining is commonly used