COMPUTATIONAL MODELS FOR PHYLOGENETIC ANALYSIS K. R. PARDASANI DEPTT OF APPLIED MATHEMATICS MAULANA AZAD NATIONAL INSTITUTE OF TECHNOLOGY (MANIT) BHOPAL - 462007. Phylogenetic Analysis.
Phylogenetic Analysis • From a given set of sequences, it should be possible to reconstruct the evolutionary (i.e. ancestral) relationships among genes and among organisms. • Phylogenetic analysis involves creating a branching or tree structure, termed a phylogeny, which illustrates the relationships between sequences. • A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution.
Phylogenetic Trees • Sequence alignment methods identify similar sequences; multiple sequence alignment methods are then applied to a set of related sequences before a phylogenetic analysis can be performed. • It seems logical to reconstruct the evolutionary (ancestral) relationships among the genes and among the organisms from a given set of sequences. • This involves creating a branching structure, called a phylogeny or tree, that illustrates the relationships between the sequences.
Basics of Trees • A tree is a two-dimensional graph showing evolutionary relationships among organisms, or among certain genes from separate organisms. • These separate sources of sequences are referred to as taxa (singular: taxon), defined as phylogenetically distinct units on the tree. • A tree is composed of nodes representing the taxa and branches representing the relationships among the taxa.
Basic Properties of Trees • The root is the common ancestor of all taxa. • If we do not have taxa to define the root, we can still represent the relationships with an unrooted tree. • Leaves represent the things being compared, such as genes or species. • Paralogous genes are genes that diverged by duplication within the same species. • Orthologous genes are genes that diverged along with the species, i.e. through speciation.
Rooted & Unrooted Trees • In rooted trees a single node is designated as the common ancestor, and a unique path leads from it through evolutionary time to any other node. • In a rooted tree, the path from the root to a node represents an evolutionary path. • An unrooted tree specifies relationships among things, but not evolutionary paths. • Unrooted trees only specify the relationships between nodes and say nothing about the direction in which evolution occurred. • Roots can usually be assigned to unrooted trees through the use of an outgroup. • Outgroup: a species (or group of species) that unambiguously separated from the other species being studied before they diverged from one another.
Styles of Trees - I • Cladogram: Nodes are connected to other nodes and to tips by straight lines going directly from one to the other, giving a V-shaped appearance. • Curvogram: Nodes are connected to other nodes and to tips by a curve which is one fourth of an ellipse, starting out horizontally and then curving upwards to become vertical. • Phenogram: Nodes are connected to other nodes and to tips by a horizontal and then a vertical line. This gives a precise idea of horizontal levels.
Styles of Trees - II • Eurogram – So-called because it is a version of cladogram diagram popular in Europe. Nodes are connected to other nodes and to tips by a diagonal line that goes outward and goes at most one-third of the way up to the next node, then turns sharply upwards and is vertical. • Swoopogram – connects two nodes or a node and a tip using two curves that are actually each one-quarter of an ellipse. The first part starts out vertical and then bends over to become horizontal. The second part starts out horizontal and then bends up to become vertical.
Steps in Phylogenetic analysis • In general it is a four-step method: • Alignment strategy. • Determination of the substitution model. • Tree building. • Tree evaluation.
Methods of phylogenetic analysis • Distance Matrix Methods (MD) • Methods of calculation of distance matrices • The Neighbor-joining method (NJ) • The Fitch / Margoliash method • UPGMA • Character Based Methods • Maximum Parsimony (MP) • Maximum Likelihood (ML)
Distance Matrix Methods (MD) • Methods of calculation of distance matrices: • DNA distance matrices are calculated such that each mismatch between two sequences adds to the distance between them. • The simplest scoring method is that of Jukes and Cantor, in which all possible nucleotide substitutions are of equal value. • This model also assumes that each base will eventually have the same frequency in DNA sequences once equilibrium has been reached.
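A minimal sketch of the distance calculation described above, assuming the sequences are already aligned to equal length. The slide does not give a formula; the correction used here, d = -(3/4) ln(1 - 4p/3), is the standard Jukes-Cantor distance for an observed mismatch fraction p. The function names and the example sequences are hypothetical.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("p >= 0.75: distance is undefined under this model")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def distance_matrix(sequences):
    """Map each unordered pair of taxon names to its Jukes-Cantor distance."""
    return {
        frozenset((a, b)): jukes_cantor_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }


# Hypothetical aligned sequences, for illustration only.
example = {"A": "ACGTACGTAC", "B": "ACGTACGTTT", "C": "ACGAACGTTT"}
matrix = distance_matrix(example)
```

The corrected distance is always at least as large as the raw mismatch fraction, because it also accounts for multiple substitutions at the same site.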
2. Unweighted pair group method with arithmetic mean (UPGMA) • The oldest and simplest distance matrix method for tree reconstruction. • UPGMA is largely statistically based and, like all distance-based methods, requires data that can be condensed to a measure of genetic distance between all pairs of taxa being considered.
UPGMA The UPGMA method requires a distance matrix such as one that might be created for a group of four taxa called A, B, C and D. Assume that the pairwise distances between the taxa are given in the following matrix:
        A      B      C
  B    dAB
  C    dAC    dBC
  D    dAD    dBD    dCD
Here dAB represents the distance between taxa A and B, while dAC is the distance between taxa A and C, and so on.
UPGMA • UPGMA begins by clustering the two species with the smallest distance separating them into a single, composite group. Assume that the smallest value in the distance matrix corresponds to dAB, in which case species A and B are the first to be grouped (AB). • After the first clustering, a new distance matrix is computed, with the distances between the new group (AB) and species C and D calculated as: • d(AB)C = 1/2 (dAC + dBC) and • d(AB)D = 1/2 (dAD + dBD) • The process is repeated until all the species have been grouped.
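The clustering loop above can be sketched as follows. This is an illustrative sketch, not the presenter's code: the names `average_linkage` and `upgma`, the nested-tuple tree representation, and the example distances are all assumptions. When two single taxa are merged, the update reduces to the 1/2 (dAC + dBC) formula on the slide; for larger groups, UPGMA weights each group's distance by the number of taxa it contains, which is equivalent to averaging over all member pairs.

```python
from itertools import combinations


def average_linkage(d_ak, d_bk, n_a, n_b):
    """Distance from the merged cluster (a, b) to cluster k, weighted by size."""
    return (n_a * d_ak + n_b * d_bk) / (n_a + n_b)


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    sizes = {name: 1 for name in names}  # cluster -> number of taxa in it
    d = dict(dist)                       # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Compute distances from the new cluster to every remaining cluster.
        for k in sizes:
            if k == a or k == b:
                continue
            d[frozenset((merged, k))] = average_linkage(
                d[frozenset((a, k))], d[frozenset((b, k))], sizes[a], sizes[b]
            )
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes))


# Hypothetical distances for four taxa, for illustration only.
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 6,
         ("A", "D"): 8, ("B", "D"): 10, ("C", "D"): 9}
example_dist = {frozenset(p): v for p, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], example_dist)
```

With these example values A and B are grouped first, then C joins (AB) at distance 5, and D joins last.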
3. THE NEIGHBOR-JOINING METHOD The Neighbor-Joining method begins by choosing the two most closely related sequences, and then adding the next most distant sequence as a third branch to the tree. Consider a tree with 3 sequences A, B and C, in which x, y and z are the lengths of the branches leading from the internal node to A, B and C respectively. [Figure: three-taxon tree with branches x, y and z]
THE NEIGHBOR-JOINING METHOD Simultaneous linear equations can be used to calculate the branch lengths: • A to B: x + y = 24 • A to C: x + z = 28 • B to C: y + z = 32 • Thus, with 3 equations and 3 unknowns, we can calculate that x = 10, y = 14 and z = 18.
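The three equations above have a closed-form solution: adding the two equations that contain a branch and subtracting the third leaves twice that branch length. A small sketch, with a hypothetical function name, that reproduces the numbers on the slide:

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Solve x + y = d_ab, x + z = d_ac, y + z = d_bc for (x, y, z)."""
    x = (d_ab + d_ac - d_bc) / 2  # branch leading to A
    y = (d_ab + d_bc - d_ac) / 2  # branch leading to B
    z = (d_ac + d_bc - d_ab) / 2  # branch leading to C
    return x, y, z


x, y, z = three_taxon_branch_lengths(24, 28, 32)
print(x, y, z)  # 10.0 14.0 18.0
```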
4. The Fitch / Margoliash method • The Neighbor-Joining method attempts to build only one tree. However, the raw pairwise distances may not always be perfectly additive. • Fitch and Margoliash showed that different sets of internal branch lengths could be obtained by considering alternative trees in which one or more branches are moved to different parts of the tree. • Consider a distance matrix for 4 sequences with pairwise distances Dij:
The Fitch / Margoliash method • If we recalculate the pairwise distances dij from the tree, they differ from the original distances Dij. • Each tree considered generates a different matrix of tree-derived distances dij. • The best tree is defined as the one that minimizes the discrepancy between the original distances Dij and the tree-derived distances dij.
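The formula on the original slide was not preserved. As an assumption, the sketch below uses the weighted least-squares criterion usually attributed to Fitch and Margoliash, the sum over all pairs of (Dij - dij)^2 / Dij^2, where Dij is the original distance and dij the distance read off the tree. The function name and the example numbers are hypothetical.

```python
def fitch_margoliash_score(observed, tree_derived):
    """Sum over pairs of (D_ij - d_ij)^2 / D_ij^2; lower means a better fit.

    observed:     dict mapping a pair of taxa -> original distance D_ij
    tree_derived: dict mapping the same pairs -> distance d_ij along the tree
    """
    total = 0.0
    for pair, big_d in observed.items():
        small_d = tree_derived[pair]
        total += (big_d - small_d) ** 2 / big_d ** 2
    return total


# Hypothetical distances, for illustration only.
observed = {("A", "B"): 10.0, ("A", "C"): 20.0, ("B", "C"): 20.0}
candidate_1 = {("A", "B"): 10.0, ("A", "C"): 20.0, ("B", "C"): 20.0}
candidate_2 = {("A", "B"): 12.0, ("A", "C"): 20.0, ("B", "C"): 20.0}

scores = [fitch_margoliash_score(observed, t) for t in (candidate_1, candidate_2)]
```

Under this criterion the candidate tree whose distances reproduce the original matrix exactly scores 0, and any discrepancy raises the score in proportion to its relative size.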
Character Based Methods • Maximum Parsimony (MP): Character methods such as MP attempt to reconstruct the mutational events leading to the currently observed sequences. The most parsimonious tree is therefore the tree that requires the fewest mutational steps to account for the observed sequences.
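One common way to count the mutational steps a given tree requires is Fitch's small-parsimony algorithm, sketched below. This is an assumption for illustration; the slides do not say which counting procedure is used, and this is not a description of how DNAPARS works internally. The tree encoding (nested pairs of leaf names), the function names and the example alignment are hypothetical.

```python
def fitch_site_cost(tree, states):
    """Return (candidate_states, steps) for one alignment column.

    tree:   a leaf name (str) or a pair (left, right) of subtrees
    states: dict mapping leaf name -> character observed at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_site_cost(tree[0], states)
    right_set, right_steps = fitch_site_cost(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can share a state, so no extra change is needed here.
        return common, left_steps + right_steps
    # No shared state: one substitution is required at this node.
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_score(tree, alignment):
    """Total number of steps over all columns of an aligned set of sequences."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site_cost(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical alignment and two candidate rooted trees, for illustration only.
alignment = {"A": "AAG", "B": "AAG", "C": "ATC", "D": "ATC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
```

For this example, `tree_1` needs 2 steps and `tree_2` needs 4, so `tree_1` is the more parsimonious of the two.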
The output from the PHYLIP DNAPARS program lists the 3 most parsimonious trees; one such tree is shown here:
Maximum Likelihood (ML) • The term maximum likelihood does not refer to a single statistical method, but rather to a general approach. • ML methods in their simplest form begin by listing all possible models, and then calculating the probability that each model would generate the data actually observed. • The model with the highest probability of generating the observed data is chosen as the best model.
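As a toy illustration of this approach, the sketch below treats each candidate evolutionary distance between two aligned sequences as a "model", computes the log-probability of the observed data under each, and keeps the best one. The Jukes-Cantor substitution model is an assumption chosen for simplicity, and the function names and numbers are hypothetical; real ML phylogeny programs search over tree topologies and branch lengths, not a single distance.

```python
import math


def jc_log_likelihood(n_sites, n_mismatches, d):
    """Log-probability of the observed pair under Jukes-Cantor at distance d > 0."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # a site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay  # a site shows one specific different base
    n_matches = n_sites - n_mismatches
    return (
        n_sites * math.log(0.25)          # equal base frequencies at each site
        + n_matches * math.log(p_same)
        + n_mismatches * math.log(p_diff)
    )


def best_distance(n_sites, n_mismatches, candidates):
    """Return the candidate distance with the highest likelihood."""
    return max(candidates, key=lambda d: jc_log_likelihood(n_sites, n_mismatches, d))


# Candidate distances from 0.001 to 2.000 substitutions per site.
candidates = [i / 1000 for i in range(1, 2001)]
estimate = best_distance(100, 20, candidates)
```

For 20 mismatches in 100 sites, the grid search lands next to the analytic Jukes-Cantor distance of about 0.2326.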
Methods of Phylogenetic Evaluation All phylogenetic trees represent hypotheses regarding the evolutionary history of the sequences that make up a data set. As with any good hypothesis, it is reasonable to ask two questions about how well a tree describes the underlying data: • How much confidence can be attached to the overall tree and its component parts, i.e. its branches? • How much more likely is one tree to be correct than a particular or randomly chosen alternative tree?
Methods of Phylogenetic Evaluation It is important to remember that the output from phylogenetic analysis is one answer obtained using one set of conditions. The input data may simply not be robust, i.e. the data itself may contain more noise than evolutionary signal. Two methods of phylogenetic evaluation are: • Jumbling Sequence Addition Order • Bootstrapping
Jumbling Sequence Addition Order • The simplest way to test a phylogeny is to repeat the analysis several times with different addition orders. • Many PHYLIP programs, and most other phylogeny programs, have an option called JUMBLE that uses a random number generator to choose which sequence to add at each step, rather than adding them in the order in which they appear in the file. • It is important to keep in mind the order in which sequences appear in a file: a non-random sequence order might introduce a bias into the analysis. • Therefore, even when doing only one run on a phylogeny, it is probably a good idea to jumble the order of sequences.
Bootstrapping • When sequences are short or polymorphism is minimal, we can have little confidence that the tree inferred from that data is the correct one. • The more data there is, the less likely it is that an artifactual phylogeny will be produced. • This method is based on the assumption that the statistical properties of a sample should be similar to the statistical properties of the population from which that sample was drawn. • The larger the sample, the more representative it should be of the population.
Bootstrapping • In a physical sense the process is equivalent to taking the printout of a multiple alignment; cutting it up into pieces, each of which contains a different column from the alignment; placing all those pieces into a bag; randomly reaching into the bag and drawing out a piece; copying down the information from that piece before returning it to the bag; and then repeating the drawing step until an artificial data set has been created that is as long as the original alignment. • The whole process is repeated to create hundreds or thousands of resampled data sets, and portions of the inferred tree that have the same groupings in many of the repetitions are those that are especially well supported by the entire original data set.
Bootstrapping • Bootstrap resampling is sampling with replacement. In the case of an MSA, sites are sampled at random until the data set is equal in length to the original alignment. • In each of the bootstrapped replicates, most sites are sampled once, some are sampled twice and a small number of sites are sampled three times. Some sites are never sampled. • For bootstrap resampling of a sequence alignment, it is best to create at least 100 bootstrapped data sets, and redo the phylogeny for each one. • The one major disadvantage of bootstrap resampling is that it drastically increases the time required to construct a phylogeny.
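The resampling step can be sketched as below: columns of the alignment are drawn with replacement until each replicate is as long as the original. The function names, the fixed seed and the example alignment are hypothetical, and the tree building that would follow for each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw one replicate by sampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    # The same column indices are applied to every sequence, so sites stay aligned.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Create n_replicates resampled alignments from one seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical alignment, for illustration only.
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
replicates = bootstrap_replicates(alignment, n_replicates=100, seed=1)
```

Each replicate would then be passed to the same tree-building method as the original data, and the groupings that recur across replicates counted.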
Assumptions of multiple alignment process • All sequences are homologous. • No duplicate sequences are present. • In each column, amino acid residues are homologous. • The alignment is optimal, with minimal gaps.
Assumptions of phylogenetic analysis process • All sequences are homologous. • No duplicate sequences are present. • In each column, amino acid residues are homologous. • The alignment is optimal, with minimal gaps. • No back mutation has occurred. • All sequences are of the same length.