Molecular Phylogeny and Evolution CISC 4020 Bioinformatics Spring 2012 Department of Computer and Information Science
Outline • Introduction to Evolution and Phylogeny • Phylogenetic Tree • Five stages of phylogenetic analysis
Evolution • Charles Darwin’s 1859 book (On the Origin of Species By Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) introduced the theory of evolution. • Groups of organisms change over time so that descendants differ structurally and functionally from their ancestors.
Natural Selection • To Darwin, the struggle for existence drives natural selection. Offspring differ from their parents (that is, variability exists), and individuals that are more fit for a given environment are selected for. In this way, over long periods of time, species evolve.
Molecular Evolution • The study of changes in genes and proteins throughout different branches of the tree of life. • At the molecular level, evolution is a process of mutation with selection. • Data from present-day organisms are studied to reconstruct the evolutionary history of species.
Phylogeny • The inference of evolutionary relationships. • Traditionally, phylogeny relied on the comparison of morphological features between organisms. • Today, molecular sequence data are also used for phylogenetic analyses.
Molecular Phylogeny • The study of the evolutionary relationships among organisms or among molecules using the techniques of molecular biology. • A true tree depicts the actual, historical events that occurred in evolution – it is impossible to generate such a tree. • We generate inferred trees, which depict a hypothesized version of the historical events, with the help of Multiple Sequence Alignments (MSA) of protein or DNA/RNA.
Goals of molecular phylogeny • One goal of molecular phylogeny is to deduce the correct trees for all species of life, by analyzing molecular sequence data that define families of genes and proteins. • Another goal is to infer or estimate the time of divergence between organisms, that is, the time since they last shared a common ancestor.
Molecular clock hypothesis • The hypothesis of a molecular clock: • For any given gene or protein, the rate of molecular evolution is approximately constant in all evolutionary lineages. • The average rate of change differs distinctly from one protein family to another.
Molecular clock hypothesis • Implications: If protein sequences evolve at constant rates, they can be used to estimate the times at which sequences diverged. This is analogous to dating geological specimens by radioactive decay. • Examples of estimated divergence times: • Beta and delta globins: 44 MYA (million years ago). • Beta and gamma globins: 260 MYA. • Alpha and beta globins: 565 MYA.
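The clock assumption can be turned into simple arithmetic. The sketch below assumes the standard relation K = 2rT, which the slides do not state: two lineages each accumulate substitutions at a constant rate r for a time T since they diverged, so the distance between them is K. The function name and the example numbers are hypothetical.

```python
def divergence_time(distance, rate):
    """Estimate the time since divergence under a strict molecular clock.

    distance: substitutions per site between the two sequences (K)
    rate:     substitutions per site per year along ONE lineage (r),
              assumed constant, which is exactly the clock hypothesis
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate changes independently, hence the factor of 2.
    return distance / (2.0 * rate)


# Hypothetical inputs for illustration only (not taken from the slides).
t_years = divergence_time(distance=0.12, rate=1.0e-9)
print(f"{t_years / 1e6:.0f} MYA")
```

If the rate differs between lineages, the estimate is biased, which is why the clock is treated as a hypothesis to be tested rather than assumed.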
Positive and negative selection • Darwin’s theory of evolution suggests that, at the phenotypic level, traits in a population that enhance survival are selected for, while traits that reduce fitness are selected against. • For example, among a group of giraffes millions of years in the past, those giraffes that had longer necks were able to reach higher foliage and were more reproductively successful than their shorter-necked group members. That is, the taller giraffes were selected for.
Positive and negative selection • In the mid-20th century, a conventional view was that molecular sequences are routinely subject to positive (or negative) selection. • Positive selection occurs when a sequence undergoes significantly increased rates of substitution, while negative selection occurs when a sequence undergoes change slowly. Otherwise, selection is neutral.
Neutral theory of evolution • An often-held view of evolution is that just as organisms propagate through natural selection, so also DNA and protein molecules are selected for. • According to Motoo Kimura’s 1968 neutral theory of molecular evolution, the vast majority of DNA changes are not selected for in a Darwinian sense. The main cause of evolutionary change is random drift of mutant alleles that are selectively neutral (or nearly neutral). Positive Darwinian selection does occur, but it has a limited role.
Neutral theory of evolution • The existence of a molecular clock makes sense in the context of the neutral hypothesis because most amino acid substitutions are neutral. • Substitutions that are tolerated by natural selection accumulate in a manner that has clock-like properties. • If substitutions occurred primarily in the context of positive or negative selection, it is unlikely that they could account for clock-like evolution.
Outline • Introduction to Evolution and Phylogeny • Phylogenetic Tree • Five stages of phylogenetic analysis
Phylogenetic Tree • The technique of molecular biology for studying evolutionary relationships among organisms using molecular sequence data (DNA or protein). • A phylogenetic tree is a graph composed of branches and nodes. [Figure: example tree used on the following slides, with OTUs A to E, internal nodes F, G, H, and I, numeric branch lengths, a time axis, and a scale bar of one unit]
Tree nomenclature • Node: the intersection or terminating point of two or more branches. • Branch: also called an edge.
Node of Tree - Taxon • Taxon: a taxonomic category or group, such as a family or a species.
Leaf Node of Tree - OTU • Operational taxonomic unit (OTU): an extant taxon, such as a protein sequence that we analyze.
Internal Node of Tree • An inferred ancestor of the OTUs.
Branch of Tree • Branches are unscaled: OTUs are neatly aligned, and nodes reflect time. • Branches are scaled: branch lengths are proportional to the number of amino acid changes.
Branch of Tree • Bifurcating internal node • Multifurcating internal node
Examples of multifurcation: failure to resolve the branching order of some metazoans and protostomes Rokas A. et al., Animal Evolution and the Molecular Signature of Radiations Compressed in Time, Science 310:1933 (2005), Fig. 1.
Tree nomenclature: clades • Clade ABF (monophyletic group): the common ancestor F and all of its descendants, A and B.
Tree nomenclature: clades • Clade CDH: the common ancestor H and all of its descendants, C and D.
Tree nomenclature: clades • Clade ABF/CDH/G: the common ancestor G and all of its descendants, that is, clades ABF and CDH.
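The clade definitions can be checked mechanically. The sketch below assumes the topology implied by the clade labels on these slides: A and B join at F, C and D join at H, F and H join at G, and G joins E at the root I. Branch lengths are left out because the figure text does not assign them unambiguously. The dictionary layout and function names are illustrative choices, not part of the slides.

```python
# Parent -> children map for the assumed topology (((A,B)F,(C,D)H)G,E)I.
TREE = {
    "I": ["G", "E"],
    "G": ["F", "H"],
    "F": ["A", "B"],
    "H": ["C", "D"],
}


def otus(node, tree=TREE):
    """Return the sorted leaf nodes (OTUs) that descend from `node`."""
    children = tree.get(node)
    if not children:  # a leaf has no children and is its own OTU
        return [node]
    leaves = []
    for child in children:
        leaves.extend(otus(child, tree))
    return sorted(leaves)


def is_monophyletic(group, tree=TREE):
    """True if `group` is exactly the set of OTUs under some single node."""
    target = sorted(group)
    nodes = set(tree) | {c for kids in tree.values() for c in kids}
    return any(otus(n, tree) == target for n in nodes)


print(otus("F"))                    # OTUs in clade ABF
print(otus("G"))                    # OTUs in clade ABF/CDH/G
print(is_monophyletic(["B", "C"]))  # B and C alone do not form a clade
```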
Examples of clades Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10
Tree roots • The root of a phylogenetic tree represents the common ancestor of the sequences. • Some trees are unrooted, and thus do not specify the common ancestor. • A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other OTUs).
Tree nomenclature: roots • Rooted tree: specifies the evolutionary path, from past to present. • Unrooted tree: the direction of time is undetermined. [Figure: a rooted tree and an unrooted tree with numbered nodes]
Tree nomenclature: outgroup rooting [Figure: an unrooted tree and the corresponding rooted tree on a past-to-present time axis; labels: root; outgroup (used to place the root), a homologous bacterial protein; 5 human being myoglobin orthologs]
Numbers of possible trees: extremely large for >10 sequences
Number of OTUs    Number of rooted trees    Number of unrooted trees
2                 1                         1
3                 3                         1
4                 15                        3
5                 105                       15
10                34,459,425                2,027,025
20                8 x 10^21                 2 x 10^20
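The counts for strictly bifurcating trees follow closed forms that the slide does not state but that are standard results: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n OTUs, where !! is the double factorial over odd integers. A minimal sketch, with hypothetical function names:

```python
def odd_double_factorial(k):
    """Product of the odd integers 1 * 3 * ... * k (returns 1 when k < 3)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs: (2n - 3)!!"""
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 OTUs: (2n - 5)!!"""
    return odd_double_factorial(2 * n - 5)


for n in (2, 3, 4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The growth is faster than exponential, which is why exhaustive search over all trees is only practical for a handful of sequences.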
Outline • Introduction to Evolution and Phylogeny • Phylogenetic Tree • Five stages of phylogenetic analysis
Five stages of phylogenetic analysis [1] Selection of sequences for analysis [2] Multiple sequence alignment [3] Selection of a substitution model [4] Tree building [5] Tree evaluation
Stage 1: Use of DNA, RNA, or Protein • For phylogeny, DNA can be more informative. • The protein-coding portion of DNA has synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes. • A synonymous substitution does not result in a change in the amino acid that is specified.
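To make the synonymous/nonsynonymous distinction concrete, the sketch below builds the standard genetic code from the usual compact 64-letter string (bases ordered T, C, A, G) and classifies a single codon change. The function name is hypothetical, and the standard code is an assumption: some genomes, such as mitochondrial ones, use variant codes.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon_before, codon_after):
    """Label a codon change by whether the encoded amino acid changes."""
    aa_before = CODON_TABLE[codon_before.upper()]
    aa_after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if aa_before == aa_after else "nonsynonymous"


print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
print(classify_substitution("GAA", "GTA"))  # Glu -> Val
```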
Stage 1: Use of DNA, RNA, or protein • For phylogeny, DNA can be more informative. • Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.
Substitutions in a DNA sequence alignment can be directly observed, or inferred
Stage 1: Use of DNA, RNA, or protein • For phylogeny, DNA can be more informative. • Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny.
Stage 1: Use of DNA, RNA, or protein • For phylogeny, protein sequences are also often used. • Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal. • Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value. Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
Five stages of phylogenetic analysis [1] Selection of sequences for analysis [2] Multiple sequence alignment [3] Selection of a substitution model [4] Tree building [5] Tree evaluation
Stage 2: Multiple sequence alignment • Confirm that all sequences are homologous • Adjust gap creation and extension penalties as needed to optimize the alignment • Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data or gaps).
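The last step, deleting columns with incomplete data or gaps, can be sketched as follows. The alignment is assumed to be a dictionary of equal-length strings, with "-" marking a gap and "?" marking missing data. These conventions and the function name are illustrative assumptions, not something the slides specify.

```python
def drop_incomplete_columns(alignment, missing="-?"):
    """Keep only the alignment columns in which every taxon has a residue.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths)
    missing:   characters treated as gaps or missing data
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if not any(ch in missing for ch in col)]
    # Rebuild each sequence from the columns that survived.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


toy = {"seq1": "MK-LV", "seq2": "MKALV", "seq3": "MRA-V"}
print(drop_incomplete_columns(toy))
```

Dropping whole columns discards some signal, so on gappy alignments it is worth checking how many sites remain before building a tree.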
[Figure: multiple sequence alignment of globins. Open circles mark positions that distinguish myoglobins, alpha globins, and beta globins; other labels mark 100% conserved positions and gaps.]
Five stages of phylogenetic analysis [1] Selection of sequences for analysis [2] Multiple sequence alignment [3] Selection of a substitution model [4] Tree building [5] Tree evaluation
Stage 3: Models of DNA and Amino Acid Substitution • The simplest approach to defining the relatedness of a group of nucleotide (or amino acid) sequences is to align pairs of sequences, and then to count the number of differences. • The number of differing sites, n, is the Hamming distance. For an alignment of length N with n sites at which there are differences, the degree of divergence D is: D = n / N • But observed differences do not equal genetic distance! Genetic distance also involves mutations that are not observed directly, such as sequential substitutions at the same site.
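As a worked example of D = n / N, the sketch below counts mismatched sites between two aligned sequences of equal length. It assumes the sequences contain no gaps, and the function name is hypothetical.

```python
def degree_of_divergence(seq_a, seq_b):
    """Return D = n / N for two aligned, gap-free sequences.

    n is the number of sites that differ and N is the alignment length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    n = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return n / len(seq_a)


# Two of the ten sites differ (positions 5 and 10), so D = 2 / 10.
print(degree_of_divergence("ACGTACGTAC", "ACGTTCGTAA"))
```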
Stage 3: Models of DNA and Amino Acid Substitution Jukes and Cantor (1969) proposed a corrective formula: D = -(3/4) ln(1 - (4/3) p), where p is the proportion of residues that differ. This model describes the probability that one nucleotide will change into another. It assumes that each residue is equally likely to change into any other (i.e. the rate of transversions equals the rate of transitions). In practice, the transition rate is typically greater than the transversion rate.
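The Jukes and Cantor correction, D = -(3/4) ln(1 - (4/3) p), is easy to evaluate directly. The sketch below is a plain transcription of that formula for nucleotide data, with a hypothetical function name. The logarithm is undefined once p reaches 3/4, the value expected for unrelated random sequences, so the function rejects such inputs.

```python
import math


def jukes_cantor_distance(p):
    """Corrected distance D = -(3/4) * ln(1 - (4/3) * p).

    p: observed proportion of differing sites, with 0 <= p < 0.75
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as p, and the gap widens as p grows, reflecting the multiple substitutions that the raw count misses.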
Models of nucleotide substitution • Transitions: A ↔ G and T ↔ C. • Transversions: all other changes (A ↔ T, A ↔ C, G ↔ T, G ↔ C).
Jukes and Cantor one-parameter model of nucleotide substitution (a = b): every substitution among A, G, T, and C occurs at the same rate a.
Kimura model of nucleotide substitution (assumes a ≠ b): transitions (A ↔ G, T ↔ C) occur at rate a and transversions at rate b. More weight is given to transversions for causing nonsynonymous changes in protein-coding regions.
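A two-parameter distance can be computed by counting transitions and transversions separately. The sketch below uses the standard Kimura two-parameter formula, which the slides do not give: d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. It assumes aligned, gap-free sequences over A, C, G, and T. The function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance for two aligned, gap-free DNA sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n_sites = len(seq_a)
    p_ts = transitions / n_sites    # P: proportion of transitions
    q_tv = transversions / n_sites  # Q: proportion of transversions
    return -0.5 * math.log(1 - 2 * p_ts - q_tv) - 0.25 * math.log(1 - 2 * q_tv)


# One transition (A -> G) and one transversion (A -> C) in ten sites.
print(round(kimura_2p_distance("AAAAAAAAAA", "GAAAAAAAAC"), 4))
```

For very divergent sequences the arguments of the logarithms can become non-positive, in which case `math.log` raises a `ValueError` and the distance is undefined under this model.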