Clustering, Phylogenetic Trees, and Inferences about Evolution

BMMB597E Protein Evolution

Given a set of organisms: • Can we measure similarities, and cluster the organisms into subsets? • Can we form hierarchical clusterings (that is, clusters of clusters of clusters …) that correspond to an evolutionary tree? • Can we calibrate rates of divergence, and thereby date branching events during life history?

Note that • We can observe similarities among organisms, or species: among extant organisms and, with greater difficulty, among extinct ones • It is rare that we can observe evolutionary relationships directly. Generally, evolutionary relationship (homology) is an inference from similarities that we can observe • Some dating can be calibrated from geology. However, much dating depends on models and assumptions, and is therefore questionable

What is a cluster? • Given a set of objects (species, people, literary texts, protein structures, minerals …) • A cluster is a subset of these objects such that the similarity among the objects in the subset is generally higher than the similarity among the objects in the full set • Clustering depends on the property chosen to measure similarity • For instance, focussing on wings would cluster bats with birds, rather than separating mammals from birds

Linnaeus’ Systema Naturae • Linnaeus (1707-1778) developed a taxonomic system for species • Based on clustering together species with similarities • Main clusters called Kingdoms • Animal, vegetable, mineral • Within each main cluster are subclusters • Hierarchical clustering: clusters of clusters of clusters …

http://www.mun.ca/biology/scarr/139416_Natural_classification.jpg

Linnaean hierarchy • Kingdom • Class • Order • Genus • Species • Linnaeus introduced binomial nomenclature: genus/species. For example: Homo sapiens, Bos taurus • Higher levels are implied; that is: humans and cows are mammals

Titian: Rape of Europa

Taxonomy now has more levels of clustering • Kingdom • Phylum • Class • Order • Family • Genus • Species There are many intermediate levels also: superfamily, subfamily Below species: variety, strain

Objective and subjective aspects of clustering • We have already mentioned the problem of which characters to choose on which to base measurements of similarity. • Even if people agree on the degrees of similarity among elements of a set of objects, they may disagree on how finely to cluster them • People are called “lumpers” or “splitters” • To a music major, all chemistry courses form one cluster • To a chemistry major, there are important distinctions between physical, analytical, inorganic, organic, biochemical

Linnaeus and evolution • When Linnaeus created his taxonomy, it was based solely on his perceived similarities among species • It turned out that the hierarchy largely reflects evolutionary relationships • All the creatures within the same genus or family should be more closely related to each other than they are to creatures in different genera or families. • Usually true, although Linnaean hierarchy does not always correspond to modern taxonomy

Linnaeus v. Huxley • Linnaeus divided the animal kingdom into six classes: mammals, birds, amphibia (including reptiles), fishes, insects and worms. • Linnaeus therefore considered crocodiles more closely related to salamanders than to birds. • Thomas Huxley, in the 19th century, grouped reptiles and birds together. This is now believed to be correct. • There are, however, much more serious problems in the relation between taxonomy and evolution

Similarities are not relationships • Forming hierarchical clusters on the basis of similarities does not necessarily imply biological relationships • Choice of characters with which to measure similarities is often ambiguous • Classical methods: palaeontology, comparative anatomy, embryology, in hands of experts, did extremely well • Molecular methods (especially DNA and protein sequences) perhaps more reliable

Early attempts to use molecular properties in taxonomy • Especially important for prokaryotes, where standard properties such as skeletal anatomy not possible • Nature of biochemicals – chemotaxonomy • Immunological cross-reactivity • Electrophoretic ‘fingerprinting’ – spread proteome out on a gel • Hybridization of DNAs

Genotype and phenotype • Evolutionary relationships fundamentally based on genotype • Palaeontology, comparative anatomy, embryology attempt to reason from phenotype to genotype • So sequence-based methods more direct • However, sequences don’t always give unambiguous answer

The species as the ‘atom’ of taxonomy • Taxonomy has been fundamentally the classification of species. • Remember that before Darwin, it was believed that species were immutable • We are still interested in evolutionary trees of species • But it has become clear that it is more difficult to define the concept of species

Difficulty in defining species • At base of hierarchy is idea of species. • It is species that Linnaeus and subsequent taxonomists are trying to cluster. • Note: subspecies, varieties, strains … • Difficult to define species • Mutual fertility within group, and infertility outside group, is a major conceptual ingredient • But even for mammals this doesn’t quite work: there are mutually fertile species (tigers, lions) that do not mate in the wild A “tiglon”

Whole concept of hierarchy in question • Horizontal gene transfer: incorporation of genetic material from another organism that is not a parent • Example: plasmid exchange among bacteria to distribute antibiotic resistance • Known mechanisms: • Transformation (Avery, MacLeod and McCarty, 1944: proof that DNA is the genetic material) • Transduction: virus carries DNA from one organism to another (bacteria, human retroviruses) • Bacterial conjugation: DNA transfer by cell-cell contact

From: “Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types: Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type III.” J. Exp. Med., 79: 137-158. January 1944. • Bacterial conjugation (image by C. C. Brinton, Jr., http://biosciences-people.bham.ac.uk/About/staff_profiles_research.asp?ID=205) • Bacteriophage infecting E. coli (http://www.washington.edu/alumni/partnerships/biology/200710/kerr.html)

Bacterial transformation (http://slic2.wsu.edu) • Bacterial conjugation (knowledgerush.com) • Bacteriophage infecting cell (biology.about.com)

Horizontal gene transfer makes nonsense of the “tree of life” Picture by W.F. Doolittle

To summarize • We want to construct a “tree of life” stating the genealogy of all organisms (or at least all species) • Classical methods based on phenotype not bad • Molecular data – especially DNA sequences – are based directly on genotype • Work as well as anything could • Still problems with horizontal gene transfer • These problems are worse in prokaryotes, worst in earliest life forms (see Doolittle’s picture).

Remaining general problems • Once you choose measures of similarity you can derive a hierarchy • Still left with how you define clusters (lumper-splitter problem) • Whether this represents evolutionary relationships is a question – in view of HGT • Differential rates of change can complicate picture • Can we calibrate molecular similarities to date events in life history?

How do we represent hierarchies? • Idea of graph: nodes and edges • A tree is a special kind of graph, in which there is exactly one path from any node to any other node • Figure panels: “This graph is a tree”; “This graph is not a tree”
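The definition above can be checked mechanically. The following is a minimal sketch, assuming an undirected graph given as a list of node labels and a list of edges; the function name `is_tree` and the two example graphs are illustrative, not taken from the slides. It uses the fact that a connected graph on n nodes with n - 1 edges has exactly one path between any two nodes.

```python
from collections import defaultdict

def is_tree(nodes, edges):
    """Return True if the undirected graph is connected and has no cycles."""
    nodes = list(nodes)
    # A tree on n nodes has exactly n - 1 edges.
    if len(edges) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first search from an arbitrary node; a tree must reach every node.
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return len(seen) == len(nodes)

# Hypothetical example graphs on four nodes.
tree_edges = [("A", "B"), ("A", "C"), ("C", "D")]
cycle_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(is_tree("ABCD", tree_edges))   # True
print(is_tree("ABCD", cycle_edges))  # False
```

The second graph fails because A, B and C form a cycle, so there are two paths between, for example, A and C.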

Family ‘tree’ NOT a tree! (http://www.genealogyintime.com/NewsStories/2009/April/inbreeding_of_spanish_royalty_page2.html)

How to turn a set of pairwise similarities into a tree • UPGMA method (Unweighted Pair Group Method with Arithmetic Mean) • Start by taking each item as a separate subset • Take most closely related pair, form a new node that is their parent • The original pair becomes a two-element subset associated with the higher node • Then take next most closely-related pair of subsets • Similarity/difference between two subsets is the average of the similarities/differences between all pairs of elements from the subsets
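The steps above can be sketched in a few lines. This is an illustrative implementation, not the code behind the figures: the function name `upgma`, the nested-tuple tree representation, and the toy distance table are all assumptions, and the numbers are invented rather than taken from the cytochrome c data.

```python
from itertools import combinations

def upgma(labels, dist):
    """Cluster `labels` by UPGMA; `dist[(a, b)]` holds pairwise differences.

    Returns the tree as nested tuples of labels.
    """
    # Each cluster is (subtree, tuple_of_leaf_labels).
    clusters = [(label, (label,)) for label in labels]

    def leaf_distance(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def cluster_distance(c1, c2):
        # Arithmetic mean over all leaf pairs drawn from the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(leaf_distance(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Hypothetical differences among four items (not the cytochrome c data).
toy = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
       ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
print(upgma("ABCD", toy))  # ('D', ('C', ('A', 'B')))
```

A and B are joined first, then C is attached to the AB subset (average difference 6), and D joins last. This sketch returns only the branching order; branch heights could be recorded at each merge as half the cluster distance.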

Number of sequence differences between cytochromes c Data from W.M. Fitch & E. Margoliash, Science 155, 279-284 (1967)

UPGMA tree from cytochrome c sequences • First two steps: BF closest, join them. Then AD next closest, join them (http://www.nmsr.org/upgma.htm)

UPGMA tree from cytochrome c sequences, subsequent steps http://www.nmsr.org/upgma.htm

UPGMA method is suitable for deriving hierarchy from any set of objects given a measure of similarity • Problems arise when trying to infer evolutionary relationships, and dating divergences • Consider items sold in different sections of a department store: • Reasonable that men’s and women’s shoes have a common ancestor • Not reasonable that shoes and furniture have a common ancestor

Can we trust UPGMA tree as reflecting evolution? • If choice of different similarity measures gives inconsistent results, then there is a problem • Typical similarity measures for molecular biology are sequence similarities, either nucleic acid or protein • Many different measures suggested, basic idea is that the more substitutions in optimal alignment, the more distant the sequences • Quantitatively, correct for back mutation in highly diverged sequences
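One standard way to make the correction mentioned in the last bullet, for nucleotide sequences, is the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites. The slides do not name a particular model; Jukes-Cantor is chosen here only as the simplest illustration, and the function names are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Fraction of aligned positions that differ (gaps are not handled)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# The correction is small for similar sequences and large for diverged ones.
for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor(p), 3))
```

For p = 0.05 the corrected distance is barely above the observed one, whereas for p = 0.60 it is roughly double, reflecting the multiple and back substitutions hidden in highly diverged sequences.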

UPGMA assumes constant divergence rates • Suppose there are species that are indeed closely related, but suppose that the cytochrome c of one of them is changing much faster than the other • Then that pair will appear very dissimilar and be separated in the phylogenetic tree • This is an error if we wish to assume that the similarities in the cytochromes c indicate the closeness of the evolutionary relationships

Defenses against non-uniform rates of change • Sometimes unusually large rate of change the result of selective pressure • Choose third-base changes as non selective? • Detection of non-uniform rates of change: choose ‘outgroup’ • For instance, if we are dealing with sequences from primates, choose another mammal: cow • Similarity of cow sequence to all primate sequences should be approximately equal • If not, some primate species is changing faster than others
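The outgroup check in the last three bullets can be sketched as follows. The distances are invented for illustration, and the 20% tolerance is an arbitrary assumption; a real analysis would use a statistical relative-rate test rather than a fixed cutoff.

```python
from statistics import mean

def flag_unequal_rates(outgroup_dist, tolerance=0.2):
    """Return ingroup members whose distance to the outgroup deviates from
    the ingroup mean by more than `tolerance` (as a fraction of the mean)."""
    average = mean(outgroup_dist.values())
    return sorted(name for name, d in outgroup_dist.items()
                  if abs(d - average) > tolerance * average)

# Hypothetical numbers of differences between a cow sequence and primates.
cow_to_primate = {"human": 20, "chimpanzee": 21, "gorilla": 20, "lemur": 35}
print(flag_unequal_rates(cow_to_primate))  # ['lemur']
```

In this made-up data set the lemur sequence is much further from the cow sequence than the others are, which would suggest a faster rate of change along that lineage.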

Unrooted and rooted trees • A rooted tree contains one more bit of information: which node in the graph corresponds to the last common ancestor • Inclusion of an outgroup can allow ‘rooting’ of the tree

Cladistic methods • Explicitly assume evolutionary relationship and evolutionary model • Deal specifically with sequences • Start from multiple sequence alignment • Two classical methods: • Maximum parsimony • Maximum likelihood

Maximum parsimony • Find tree that postulates fewest mutations • Given sequences: ATCG, ATGG, TCCA, TTCA (these appear on the bottom line of the trees) • Tree on left postulates four mutations • Tree on right postulates eight mutations (T→A at position 1 occurs twice) • Ancestral sequence at each node shown
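Counting the mutations a given tree requires can be automated. The sketch below uses Fitch's small-parsimony algorithm, a standard technique that the slides do not name. The four leaf sequences are taken to be ATCG, ATGG, TCCA and TTCA; the nested-tuple tree encoding and the function names are illustrative assumptions.

```python
def fitch_changes(tree, site):
    """Return (state_set, changes) for one alignment column.

    `tree` is a leaf sequence (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_changes = fitch_changes(tree[0], site)
    right_states, right_changes = fitch_changes(tree[1], site)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one substitution is needed below this node.
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_score(tree, length):
    """Fewest substitutions the topology requires, summed over all sites."""
    return sum(fitch_changes(tree, site)[1] for site in range(length))

grouped = (("ATCG", "ATGG"), ("TCCA", "TTCA"))
mixed = (("ATCG", "TCCA"), ("ATGG", "TTCA"))
print(parsimony_score(grouped, 4))  # 4
print(parsimony_score(mixed, 4))    # 6
```

Fitch's method reports the minimum number of changes for a branching order, choosing the best ancestral sequences itself. A drawing that fixes particular ancestral sequences at the internal nodes can therefore postulate more mutations than this minimum for the same branching order.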

How could you test a method for determining a phylogenetic tree? • Use real data: sequence samples of virus taken from same patient at different times • Use simulated data: set up a model of sequence change, write computer program to implement it, producing a known tree with known final generation of sequences – see whether methods correctly reproduce the tree
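The simulated-data test can be set up along these lines. This is a minimal sketch under strong assumptions that are not in the slides: a balanced binary tree, the same per-site substitution probability on every branch, equal probability for each replacement base, and no insertions or deletions. All names and parameter values are hypothetical.

```python
import random

def mutate(seq, rate, rng):
    """Copy `seq`, replacing each base by a different one with probability `rate`."""
    bases = "ACGT"
    return "".join(
        rng.choice([b for b in bases if b != base]) if rng.random() < rate else base
        for base in seq
    )

def simulate(root, rate, generations, rng):
    """Evolve `root` down a balanced binary tree; return the final generation."""
    current = [root]
    for _ in range(generations):
        # Every sequence leaves two independently mutated descendants.
        current = [mutate(seq, rate, rng) for seq in current for _ in range(2)]
    return current

rng = random.Random(0)  # fixed seed so the run is repeatable
root = "".join(rng.choice("ACGT") for _ in range(60))
leaves = simulate(root, rate=0.05, generations=3, rng=rng)
print(len(leaves))  # 8
```

Because of the order in which descendants are generated, leaves 2k and 2k+1 are siblings, so the true tree is known. A tree-building method can then be run on `leaves` and its output compared with that known branching order.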

Time calibration of phylogenetic trees: the ‘Molecular Clock’ • ‘Molecular clock’ hypothesis (Pauling & Zuckerkandl, 1962): suggested that the rate of evolutionary change in the amino acid sequence of each protein family was approximately constant over time, independent of lineage • E. Margoliash (1963): ‘It appears that the number of residue differences between cytochrome C of any two species is mostly conditioned by the time elapsed since the lines of evolution leading to these two species originally diverged.’

Problems with molecular clock (F. Ayala) • Different generation times – should the ‘clock’ run at a constant rate per year or per generation? • Population size – genetic drift stronger in small populations, more of evolution is neutral in small populations • General species-specific differences • Functional change in protein studied – stick to non-coding (=??? non-functional??) DNA or silent mutations • Differential selective pressure

Calibration of molecular clock • Use dates of species divergence available from classical palaeontology • Dating by geological methods • If there are enough calibration points, then can interpolate • Some well known exceptions to constant rate of sequence divergence have arisen • For instance, the clock runs about 5 times as fast in rodents as in humans (generation time?)
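Calibration against dated divergences can be sketched as a straight-line fit. The calibration points below are invented for illustration, and the model (differences proportional to time, line through the origin, no saturation correction) is a simplifying assumption rather than the method used for any real data set.

```python
def fit_clock_rate(calibration):
    """Least-squares slope through the origin: differences per million years."""
    numerator = sum(t * d for t, d in calibration)
    denominator = sum(t * t for t, _ in calibration)
    return numerator / denominator

# Hypothetical calibration points: (divergence time in Myr, sequence differences).
points = [(50, 10), (100, 21), (200, 39)]
rate = fit_clock_rate(points)

# Date an undated split at which 30 differences are observed.
print(round(30 / rate, 1))  # 151.4
```

The estimate falls between the 100 Myr and 200 Myr calibration points, which is the interpolation case; dates that lie outside the calibrated range are extrapolations and correspondingly less secure.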

Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B) • The theorem is telling us how to calculate P(A|B) = the conditional probability of A, given the observation B, on which A may depend. • For example: if A = rain and B = cloudy or sunny, then • P(rain|cloudy) is the probability that it will rain on a cloudy day • P(rain|sunny) is the probability that it will rain on a sunny day (small but not zero) • P(rain|cloudy) is likely to be greater than P(rain|sunny) • P(A) = the prior probability of A. (Without checking the sky today, what is the probability that it will rain? The number of rainy days per year/365) • P(B) = the prior probability of B (number of cloudy days per year/365 or number of sunny days per year/365) • P(B|A) = the conditional probability of B, given A. In our example, P(cloudy|rain) = the probability that it is cloudy, if we know it is raining

P(A|B) = P(B|A) P(A) / P(B) • Suppose that in State College there are 66 rainy days per year and 299 (= 365 – 66) dry days • There are 100 cloudy days per year and 265 clear days • 95% of rainy days are cloudy; 5% of rainy days clear • We want to guess whether it will rain today • If we don’t look at the sky we can only estimate: 66/365 = 18% chance of rain • If we observe that it is cloudy, probability of rain is: P(rain|cloudy) = P(cloudy|rain)×P(rain)/P(cloudy) = 0.95 × (66/365) / (100/365) = 62.7% chance of rain • Observation of a contingent quantity (cloudy sky) allows us to correct our a priori probability, 18%, to 62.7%
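The rain calculation can be reproduced directly from the counts on the slide. The helper name `posterior` is illustrative. Working with unrounded fractions gives about 62.7%; rounding P(rain) to 0.18 before dividing gives 62.4%.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_rain = 66 / 365            # prior P(rain)
p_cloudy = 100 / 365         # P(cloudy)
p_cloudy_given_rain = 0.95   # P(cloudy | rain)

p_rain_given_cloudy = posterior(p_cloudy_given_rain, p_rain, p_cloudy)
print(f"{p_rain_given_cloudy:.3f}")  # 0.627
```

Equivalently, 95% of the 66 rainy days are cloudy, which is 62.7 of the 100 cloudy days.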

What if we don’t know the numbers? • In the simple example, we had a completely parameterized model and tried to predict an outcome • Alternatively we don’t have the statistics – they are ‘unknown parameters’ – and we observe the sky and the weather over many consecutive days. These are our data. • For any value of the parameters, we can calculate the probability of observing the data. • Those values of parameters that give the highest probability to the data actually observed are our estimate of their values
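The last two bullets describe maximum-likelihood estimation. A minimal sketch for a single unknown parameter, the daily probability of rain, follows. The observed record (18 rainy days out of 100) is invented, days are assumed independent, and a simple grid search stands in for a proper optimiser.

```python
import math

def log_likelihood(p, rainy, total):
    """Log-probability of the observed record if each day rains with probability p."""
    return rainy * math.log(p) + (total - rainy) * math.log(1.0 - p)

# Hypothetical record: 18 rainy days observed out of 100.
rainy, total = 18, 100
grid = [i / 1000 for i in range(1, 1000)]  # candidate values 0.001 .. 0.999
best = max(grid, key=lambda p: log_likelihood(p, rainy, total))
print(best)  # 0.18
```

The parameter value that gives the highest probability to the data is the observed fraction of rainy days, as expected for this simple model. Phylogenetic models have many more parameters, but the principle is the same.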

Bayesian inference of phylogenetic trees • Observable: multiple sequence alignment • What phylogenetic tree best accounts for this alignment • Trees depend on model of evolutionary change; a general model being specified by values of parameters such as mutation rate • For any model, can compute the probability of different observed sequence alignments • The likelihood of certain parameter values is the computed probability of observing the actual data, if the parameters have those values

More detailed description of parameters • Tree topology and branch lengths • nucleotide or amino acid frequencies • Substitution model parameters • transition/transversion ratio • substitution matrix such as BLOSUM62 • Ancestral sequences • We want to use the observed data to determine the parameters

Power of Bayesian methods • Allow for more complex models of evolutionary process • Avoid assumption of constancy of molecular clock along different branches • Allow determination of branching times and rates of evolution along different branches • Calculations can be done with a Markov chain Monte Carlo (MCMC) approach; this is an efficient way of sampling parameter space in proportion to posterior probability
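To give a flavour of the MCMC approach, here is a toy Metropolis sampler for a single parameter: the branch length d separating two aligned nucleotide sequences. Everything specific is an assumption made for illustration: the Jukes-Cantor substitution model, the exponential prior on d, the data (10 differences in 100 sites), and the tuning values. Real phylogenetic programs sample tree topologies and many parameters jointly.

```python
import math
import random

def log_posterior(d, diffs, sites, prior_rate=1.0):
    """Log posterior (up to a constant) of branch length d under Jukes-Cantor."""
    if d <= 0.0:
        return -math.inf
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # chance that a site differs
    if p <= 0.0:
        return -math.inf
    log_like = diffs * math.log(p) + (sites - diffs) * math.log(1.0 - p)
    return log_like - prior_rate * d              # exponential prior on d

def metropolis(diffs, sites, steps=20000, step_size=0.05, seed=1):
    """Sample d from its posterior with a symmetric random-walk proposal."""
    rng = random.Random(seed)
    d, samples = 0.1, []
    current = log_posterior(d, diffs, sites)
    for _ in range(steps):
        proposal = d + rng.uniform(-step_size, step_size)
        candidate = log_posterior(proposal, diffs, sites)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        samples.append(d)
    return samples[steps // 5:]  # discard the first fifth as burn-in

samples = metropolis(diffs=10, sites=100)
mean_d = sum(samples) / len(samples)
print(round(mean_d, 2))
```

The retained samples approximate the posterior distribution of d, so their mean and spread give both an estimate and a measure of its uncertainty. The posterior mean should lie close to the Jukes-Cantor point estimate of about 0.107 substitutions per site for these data.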

Suggested reading • Holder, M. & Lewis, P.O. (2003). Phylogeny estimation: traditional and Bayesian approaches. Nature Reviews Genetics 4, 275-284.
