
Bioinformatics ICES 2006

This presentation covers phylogenetic inference: inferring evolutionary relationships and building phylogenetic trees, which can also be used to estimate divergence times between organisms. It explains the two families of tree-building methods, distance-based (UPGMA, neighbour joining) and character-based (maximum parsimony), and discusses tree evaluation by bootstrapping.

hintz


Presentation Transcript


  1. Bioinformatics ICES 2006: Molecular Evolution (revised 29/12/06)

  2. Phylogeny is the inference of evolutionary relationships • All forms of life share a common origin. • Goal: to deduce the correct trees for all species of life • Goal: to estimate the time of divergence between organisms since they last shared a common ancestor

  3. Terminology • Phylogenetic trees are used to assess the relationships of homologous proteins (or nucleotide sequences) in a family • Figure labels: clade, bifurcating node, branch, OTU (operational taxonomic unit) or external node, internal node, phylogram

  4. Terminology

  5. Terminology Species tree versus gene tree • In a species tree an internal node represents a speciation event • In a gene tree an internal node represents the divergence of an ancestral gene into two new genes with distinct sequences • A species tree and a gene tree need not agree (species tree ≠ gene tree), for example because of: • horizontal gene transfer • gene duplications

  6. Species tree versus gene tree Gray et al.

  7. Phylogenetic inference • Selection of sequences for analysis • Multiple sequence alignment • Tree building • Tree evaluation

  8. Phylogenetic inference 1. Selection of sequences for analysis DNA: • Higher phylogenetic signal • Synonymous and nonsynonymous substitutions can be compared (to detect negative and positive selection) Protein: • Phylogenetic signal less pronounced than in DNA • Better suited to constructing a tree for evolutionarily distant species or genes RNA: • rRNA is often used for constructing species trees

  9. Phylogenetic inference 2. Multiple sequence alignment • This is a critical step in the analysis, because placing amino acids or nucleotides in the same column of the alignment implies that they share a common ancestor • If you misalign a group of sequences you will still be able to produce a tree. However, it is not likely to be biologically meaningful. Garbage in, garbage out! • Inspect the alignment to be sure that all sequences are homologous • Sometimes ClustalW does not align distantly related sequences well. Try different gap opening and gap extension penalties to improve the alignment • Only use those columns of the multiple alignment for which you have data for all organisms or sequences. Delete the columns for which this is not the case. • Delete columns with gaps
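
The last two bullets (keep only columns with data for every sequence, delete columns with gaps) can be sketched in a few lines of Python. This is a minimal illustration, not a tool used in the slides: the function name `drop_gapped_columns` and the choice of `-` and `.` as gap characters are assumptions.

```python
def drop_gapped_columns(alignment, gap_chars="-."):
    """Return the alignment with every column that contains a gap removed.

    alignment: list of equal-length aligned sequences (strings).
    gap_chars: characters treated as gaps (an assumption; adapt to your format).
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) walks the alignment column by column.
    kept = [col for col in zip(*alignment)
            if not any(ch in gap_chars for ch in col)]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the kept columns back into row-wise sequences.
    return ["".join(chars) for chars in zip(*kept)]


aln = ["ACG-TA",
       "AC-TTA",
       "ACGTTA"]
print(drop_gapped_columns(aln))
```

Columns 3 and 4 each contain a gap in one sequence, so only the four fully populated columns survive.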

  10. Phylogenetic inference 3. Tree building

  11. Distance based methods • calculate the distances between molecular sequences using some distance metric • a clustering method (UPGMA, neighbour joining) is then used to infer the tree from the pairwise distance matrix • treat the sequences from a horizontal perspective, by calculating a single distance between entire sequences Advantages: • Fast • Allow the use of evolutionary models Disadvantage: • the information in each pair of sequences is reduced to one number

  12. Character based methods • treat the sequences from a vertical perspective • for each column of the alignment, they search for the simplest explanation of how the characters evolved • For instance, maximum parsimony (MP) searches for the tree with the fewest amino acid (or nucleotide) character changes that account for the observed differences between the protein (or gene) sequences.

  13. Phylogenetic inference 4. Tree evaluation: bootstrapping • a sampling technique for estimating the statistical error in situations where the underlying sampling distribution is unknown • used to evaluate the reliability of the inferred tree, or rather the reliability of specific branches How to proceed: • columns of the original sequence alignment are chosen at random ('sampling with replacement') • a new alignment is constructed with the same size as the original one • a tree is constructed from this new alignment This process is repeated hundreds of times
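
The resampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `bootstrap_alignment` and `bootstrap_replicates` are hypothetical, and the tree-building step is left out, since any of the methods on the later slides could be applied to each replicate.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-alignment by sampling columns with replacement."""
    n_cols = len(alignment[0])
    # Draw as many column indices as the original has columns;
    # the same column may be drawn several times, others not at all.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of resampled alignments, each the size of the original."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


aln = ["ACGTAC", "ACGTTC", "AGGTTC", "AGCTTA"]
replicates = bootstrap_replicates(aln, n_replicates=100)
print(len(replicates), replicates[0])
```

Each replicate would then be passed to the same tree-building method as the original alignment; the bootstrap value of a branch is the percentage of replicate trees in which the corresponding grouping appears.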

  14. Phylogenetic inference Showing bootstrap values on phylogenetic trees • build a majority-rule consensus tree from the bootstrap trees, or • map the bootstrap values onto the original tree

  15. Maximum parsimony Principle • Select the tree that minimizes the total tree length, i.e. the number of nucleotide substitutions or amino acid replacements required to explain a given set of data. Method • a particular topology is considered • for this topology, the ancestral sequences at each branching point are reconstructed • the minimum number of events needed to explain the sequence differences over the whole tree is computed: the minimum number of substitutions is computed for each nucleotide (or amino acid) site, and the numbers for all sites are added • another tree topology is chosen, and the procedure is repeated
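
Scoring one topology, as in the method above, can be sketched with the classic Fitch algorithm. The slides do not name an algorithm, so Fitch is a choice made here for illustration; the nested-tuple tree encoding and the names `fitch_site` and `tree_length` are likewise assumptions.

```python
def fitch_site(tree, states):
    """Fitch pass for one alignment column.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping leaf name -> character observed at this site.
    Returns (possible ancestral states, minimum number of changes below this node).
    """
    if isinstance(tree, str):  # leaf: its state is known, no changes needed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children can agree: no extra change at this node
        return common, left_cost + right_cost
    # children disagree: one substitution is required
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Total parsimony length: per-site minimum changes summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


seqs = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTA"}
print(tree_length((("A", "B"), ("C", "D")), seqs))  # 3 changes
print(tree_length((("A", "C"), ("B", "D")), seqs))  # 5 changes
```

The topology grouping A with B needs fewer changes, so it is the more parsimonious of the two. The trees are written as rooted pairs only for convenience; the parsimony length does not depend on where the root is placed.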

  16. Maximum parsimony

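  17. Maximum parsimony • Exhaustive search is infeasible for all but small numbers of sequences, because the number of possible tree topologies grows explosively with the number of sequences • Heuristics needed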
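
To see why an exhaustive search breaks down, it helps to count topologies. The sketch below uses the standard result that n sequences admit (2n-5)!! = 3 × 5 × ... × (2n-5) distinct unrooted bifurcating trees; the function name `n_unrooted_trees` is hypothetical.

```python
def n_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating topologies for n_taxa leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, n_unrooted_trees(n))
```

Four sequences give 3 topologies and five give 15, but ten already give 2,027,025, which is why heuristic searches are used in practice.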

  18. Maximum parsimony • The search may find several tree topologies that are 'equally parsimonious' • Represent the results as a consensus tree: • 'strict' consensus tree • 'majority-rule' consensus tree

  19. Maximum parsimony Only the informative sites of the alignment are used in the construction of the tree. A site is informative when it contains at least two different kinds of characters, each represented at least two times
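
The definition of an informative site translates directly into code. This is a minimal sketch; the function names are hypothetical, and gap characters are not treated specially (an assumption, so gapped columns should be removed first).

```python
from collections import Counter


def is_informative(column):
    """True if at least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of the parsimony-informative columns."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]


aln = ["AAGT",
       "AAGA",
       "CAGT",
       "CTGA"]
print(informative_sites(aln))  # [0, 3]
```

Column 1 has a character that occurs only once and column 2 is constant, so neither can favour one topology over another.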

  20. Maximum parsimony Parsimony trees are usually represented only as a tree topology (cladogram): sometimes the parsimony program cannot decide on which branches the substitutions have taken place, so it cannot calculate branch lengths.

  21. Maximum parsimony Assumptions • Equal rate of evolution in all branches Advantages • sequence information is not reduced to one number (as it is, for example, in pairwise distance methods) Disadvantages of maximum parsimony methods • can be slow for very large datasets • no correction for multiple mutations, i.e. no substitution model can be applied (see further) • sensitive to unequal rates of evolution in different lineages (see further) => long branch attraction (an example of this?)

  22. Pairwise distance methods • Distance calculation • Inferring the tree topology

  23. Pairwise distance methods Distance calculation Approach: • align pairs of sequences and count the number of differences (Hamming distance). • For an alignment of length N with n sites at which there are differences: D = (n/N) × 100. Problem: • observed differences ≠ actual genetic distances between the sequences. => The observed dissimilarity underestimates the true evolutionary distance, because some sequence positions are the result of multiple events Solution: • Use an evolutionary model that corrects for multiple mutations
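
The distance calculation and its correction can be sketched as below. The Jukes-Cantor model, d = -(3/4) ln(1 - (4/3) p), is used here as one well-known example of a correcting model for nucleotide sequences; this transcript does not say which models the following slides present, so that choice and the function names are assumptions.

```python
import math


def p_distance(a, b):
    """Fraction of aligned sites that differ (n / N); multiply by 100 for D."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    n_diff = sum(1 for x, y in zip(a, b) if x != y)
    return n_diff / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (substitutions per site) for nucleotides."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated, distance undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


seq1 = "ACGTACGTAC"
seq2 = "ACGTTCGTAA"
p = p_distance(seq1, seq2)
print(p * 100)          # observed dissimilarity D in percent
print(jukes_cantor(p))  # corrected distance, larger than p
```

The two sequences differ at 2 of 10 sites, so D = 20, while the corrected distance is about 0.23 substitutions per site. The gap between the two widens as sequences become more divergent.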

  24. Pairwise distance methods Distance calculation

  25. Pairwise distance methods Distance calculation

  26. Pairwise distance methods Distance calculation Other evolutionary models

  27. Pairwise distance methods Tree inference: UPGMA • Ultrametric trees are rooted trees in which all the end nodes are equidistant from the root of the tree • This assumes a molecular clock, i.e. that all sequences evolve at a similar rate

  28. Pairwise distance methods Tree inference: UPGMA • when two OTUs are grouped, we treat them as a new single OTU • when OTUs A, B (which have been grouped before) and C are grouped into a new node 'u', then the distance from node 'u' to any other node 'k' (e.g. grouping D and E) is simply the average of the pairwise distances between the members of 'u' and the members of 'k': d(u,k) = (d(A,D) + d(A,E) + d(B,D) + d(B,E) + d(C,D) + d(C,E)) / 6
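
The grouping procedure can be sketched as a short UPGMA implementation. This is an illustration under stated assumptions: the function name `upgma`, the nested-tuple tree encoding and the example distances are all hypothetical. The size-weighted update used in the loop is equivalent to averaging over all member pairs of the two clusters.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    dist:  dict mapping (a, b) tuples to distances, one entry per pair of names.
    Returns (tree, root_height), with the tree as nested tuples.
    """
    size = {name: 1 for name in names}  # cluster -> number of OTUs it contains
    d = {}
    for (a, b), value in dist.items():  # store both orientations for easy lookup
        d[a, b] = d[b, a] = value
    active = list(names)
    height = 0.0
    while len(active) > 1:
        # Group the two closest clusters into a new node u.
        a, b = min(combinations(active, 2), key=lambda pair: d[pair])
        u = (a, b)
        height = d[a, b] / 2  # node sits halfway (molecular clock assumption)
        size[u] = size[a] + size[b]
        for k in active:
            if k not in (a, b):
                # Size-weighted mean = average over all pairs of original OTUs.
                d[u, k] = d[k, u] = (size[a] * d[a, k] + size[b] * d[b, k]) / size[u]
        active = [k for k in active if k not in (a, b)] + [u]
    return active[0], height


dist = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
print(upgma(["A", "B", "C", "D"], dist))
```

With these ultrametric example distances, A and B are grouped first, then C joins them, and D is attached last, giving a root height of 5.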

  29. Pairwise distance methods Tree inference: UPGMA

  30. Pairwise distance methods Tree inference: UPGMA Advantages: • Fast • Allows incorporation of evolutionary models Disadvantages: • Assumption of a molecular clock

  31. Pairwise distance methods Tree inference: neighbor joining • Additive distances can be fitted to an unrooted tree such that the evolutionary distance between a pair of OTUs equals the sum of the lengths of the branches connecting them, rather than being an average as in the case of cluster analysis • Tree construction by minimum evolution (ME): the tree that minimizes the sum of the lengths of the branches is regarded as the best estimate of the phylogeny • Drawback of the ME method: in principle all different tree topologies have to be investigated in order to find the 'minimum' tree • The neighbour joining (NJ) method, developed by Saitou and Nei (1987), offers a heuristic approach to solve this problem
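
One joining step of the heuristic can be sketched as follows, using the commonly used Q-matrix formulation of neighbour joining: Q(i,j) = (n-2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i to all other OTUs. The function name `nj_join_step` and the example distance matrix are assumptions made for illustration.

```python
from itertools import combinations


def nj_join_step(names, d):
    """Pick the pair of OTUs to join next and compute their branch lengths.

    names: list of at least three OTU labels.
    d:     nested dict of symmetric distances, d[x][y] for all x != y.
    Returns (i, j, length_i, length_j), the lengths being to the new node.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    # r[i]: total distance from i to every other OTU.
    r = {i: sum(d[i][k] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - r[i] - r[j]

    i, j = min(combinations(names, 2), key=q)  # the pair with the lowest Q
    length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return i, j, length_i, length_j


names = ["a", "b", "c", "d", "e"]
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
d = {x: {} for x in names}
for (x, y), value in pairs.items():
    d[x][y] = d[y][x] = value
print(nj_join_step(names, d))
```

Note that the pair joined first (a and b) is not the pair with the smallest raw distance (d and e): the Q criterion also accounts for how far each OTU is from all the others, which is what frees the method from the molecular clock assumption. A full implementation would replace the joined pair by the new node, with d(u,k) = (d(i,k) + d(j,k) - d(i,j)) / 2, and repeat until the tree is resolved.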

  32. Pairwise distance methods Tree inference: neighbor joining

  33. Pairwise distance methods Tree inference: neighbor joining

  34. Pairwise distance methods Tree inference: neighbor joining

  35. Pairwise distance methods Tree inference: neighbor joining Advantages: • Fast • Allows incorporation of evolutionary models • No assumption of a molecular clock
