COMPUTATIONAL MODELS FOR PHYLOGENETIC ANALYSIS • K. R. PARDASANI • DEPTT OF APPLIED MATHEMATICS • MAULANA AZAD NATIONAL INSTITUTE OF TECHNOLOGY (MANIT) • BHOPAL - 462007
Phylogenetic Analysis • From a given set of sequences, it should be possible to reconstruct the evolutionary relationships i.e. ancestral relationships, among genes and among organisms. • Phylogenetic analysis involves creating a branching or tree structure, termed as phylogeny, which illustrates the relationship between sequences. • A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution.
Phylogenetic Trees • Sequence alignment methods identify similar sequences; multiple sequence alignment is then applied to a set of related sequences before a phylogenetic analysis can be performed. • It seems logical to reconstruct the evolutionary/ancestral relationships among the genes and among the organisms from a given set of sequences. • This involves creating a branching structure, called a phylogeny or tree, that illustrates the relationships between the sequences.
Basics of Trees • A tree is a two-dimensional graph showing evolutionary relationships among organisms, or among certain genes from separate organisms. • These separate sources of sequences are referred to as taxa (singular: taxon), defined as phylogenetically distinct units on the tree. • A tree is composed of nodes representing the taxa and branches representing the relationships among the taxa.
Basic Properties of Trees • The root is the common ancestor of all taxa. • If we do not have taxa that define the root, relationships can still be represented by an unrooted tree. • Leaves represent the things being compared, such as genes or species. • Paralogs are genes that diverged by duplication within the same species. • Orthologs are genes that diverged through speciation.
Rooted & Unrooted Trees • In rooted trees a single node is designated as a common ancestor, and a unique path leads from it through evolutionary time to any other node. • In a rooted tree, the path from the root to a node represents an evolutionary path. • An unrooted tree specifies relationships among things, but not evolutionary paths. • Unrooted trees only specify the relationship between nodes and say nothing about the direction in which evolution occurred. • Roots can usually be assigned to unrooted trees through the use of an outgroup. • Outgroup – a species (or group of species) that unambiguously separated earliest from the other species being studied.
Styles of Trees - I • Cladogram – Nodes are connected to other nodes and to tips by straight lines going directly from one to the other, and gives a V-shaped appearance. • Curvogram – Nodes are connected to other nodes and to tips by a curve which is one fourth of an ellipse, starting out horizontally and then curving upwards to become vertical. • Phenogram - Nodes are connected to other nodes and to other tips by a horizontal and then by a vertical line. This gives a precise idea of horizontal levels.
Styles of Trees - II • Eurogram – So-called because it is a version of cladogram diagram popular in Europe. Nodes are connected to other nodes and to tips by a diagonal line that goes outward and goes at most one-third of the way up to the next node, then turns sharply upwards and is vertical. • Swoopogram – connects two nodes or a node and a tip using two curves that are actually each one-quarter of an ellipse. The first part starts out vertical and then bends over to become horizontal. The second part starts out horizontal and then bends up to become vertical.
Steps in Phylogenetic analysis • In general it is a four step method – • Alignment strategy. • Determination of the substitution model. • Tree building. • Tree evaluation.
Methods of phylogenetic analysis • Distance Matrix Methods (MD) • Methods of calculation of distance matrices • The Neighbor-joining method (NJ) • The Fitch / Margoliash method • UPGMA • Character Based Methods • Maximum Parsimony (MP) • Maximum Likelihood (ML)
Distance Matrix Methods (MD) • Methods of calculation of distance matrices – • DNA distance matrices are calculated such that each mismatch between two sequences adds to the distances. • The simplest scoring method is of Jukes and Cantor, in which all possible nucleotide substitutions are of equal value. • This model also assumes that each base will eventually have the same frequency in DNA sequences once equilibrium has been reached.
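As an illustration of the Jukes and Cantor scoring described on this slide, here is a minimal Python sketch. It assumes two pre-aligned DNA sequences of equal length and skips gapped columns; the function names are hypothetical and are not taken from any particular package. The correction used is the standard Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3), where p is the fraction of mismatched sites.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, ungapped sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # At p >= 3/4 the logarithm is undefined: the sequences are saturated.
        raise ValueError("p >= 0.75: Jukes-Cantor distance is undefined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# One mismatch in ten sites: p = 0.1, corrected distance slightly above 0.1.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is always at least as large as the raw mismatch fraction, because the model allows for multiple substitutions at the same site.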
2. Un-weighted-pair-group method with Arithmetic mean (UPGMA) • The oldest and simplest distance matrix method for tree reconstruction. • The un-weighted-pair-group method with arithmetic mean is largely statistically based and like all distance-based methods requires data that can be condensed to a measure of genetic distance between all pairs of taxa being considered.
UPGMA The UPGMA method requires a distance matrix such as one that might be created for a group of four taxa called A, B, C, D. Assume that the pairwise distances between each of the taxa are given in the following matrix –
      A     B     C
B    dAB
C    dAC   dBC
D    dAD   dBD   dCD
Here dAB represents the distance between taxa A and B, while dAC is the distance between taxa A and C, and so on.
UPGMA • UPGMA begins by clustering the two species with the smallest distance separating them into a single, composite group. Assume that the smallest value in the distance matrix corresponds to dAB in which case species A and B are the first to be grouped (AB). • After the first clustering, a new distance matrix is computed with the distance between the new group (AB) and species C and D being calculated as – • d(AB)C =1/2(dAC + dBC) and • d(AB)D =1/2(dAD + dBD) • The process is repeated until all the species have been grouped.
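The clustering loop above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name and the input format (a dict keyed by taxon pairs) are assumptions. When two single taxa are merged, the update is exactly the 1/2(dAC + dBC) average given above; for later merges the sketch weights each cluster by the number of taxa it contains, which is the usual UPGMA rule.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    Returns (tree, merges): the tree as nested 2-tuples, and a list of
    (cluster, height) entries in the order the clusters were formed.
    """
    sizes = {name: 1 for name in names}            # cluster label -> taxa in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Pick the pair of current clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        d_ab = d[frozenset((a, b))]
        new = (a, b)
        # Distance from the merged cluster to every other cluster:
        # the average over all taxon pairs, i.e. a size-weighted mean.
        for c in sizes:
            if c in (a, b):
                continue
            d[frozenset((new, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((new, d_ab / 2))             # node height = half the distance
    return next(iter(sizes)), merges


example = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
           ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
tree, merges = upgma(["A", "B", "C", "D"], example)
print(tree)
print([height for _, height in merges])
```

In this made-up example A and B are grouped first, C joins (AB) next, and D joins last.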
3. THE NEIGHBOR-JOINING METHOD The Neighbor-Joining method begins by choosing the two most closely related sequences, and then adds the next most distant sequence as a third branch to the tree. Consider a tree with three sequences A, B, and C joined at a single internal node, where x, y, and z are the lengths of the branches leading to A, B, and C respectively.
THE NEIGHBOR-JOINING METHOD Simultaneous linear equations can be used to calculate the branch lengths – A to B: x + y = 24; A to C: x + z = 28; B to C: y + z = 32. Thus, with 3 equations and 3 unknowns, we can calculate that x = 10, y = 14, and z = 18.
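The three equations have a simple closed-form solution: adding two of them and subtracting the third isolates one unknown, for example x = (dAB + dAC - dBC) / 2. A small Python sketch (the function name is hypothetical) reproduces the numbers on this slide:

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths (x, y, z) leading to A, B and C in a three-taxon star tree.

    Solves x + y = d_ab, x + z = d_ac, y + z = d_bc.
    """
    x = (d_ab + d_ac - d_bc) / 2
    y = (d_ab + d_bc - d_ac) / 2
    z = (d_ac + d_bc - d_ab) / 2
    return x, y, z


print(three_taxon_branch_lengths(24, 28, 32))  # (10.0, 14.0, 18.0)
```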
4. The Fitch / Margoliash method • The Neighbor-Joining method attempts to build only one tree. However, the raw pairwise distances may not always be perfectly additive. • Fitch and Margoliash showed that different sets of internal branch lengths could be obtained by considering alternate trees in which one or more branches are moved to different parts of the tree. • Consider a distance matrix for 4 sequences with pairwise distances Dij :
The Fitch / Margoliash method • If we recalculate the pairwise distances dij from the tree, they are different from the original distances: For each tree considered, a different matrix of distances will be generated (dij). The best tree is defined as that tree which minimizes:
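The criterion itself is not legible in this transcript. A commonly cited form of the Fitch-Margoliash criterion is the weighted least-squares sum $\sum_{i<j} (D_{ij} - d_{ij})^2 / D_{ij}^2$, where $D_{ij}$ are the original distances and $d_{ij}$ the distances recalculated from the tree. Assuming that form, a candidate tree could be scored with the Python sketch below; the function name and the dict-based input are illustrative assumptions.

```python
def fitch_margoliash_score(observed, tree_distances):
    """Weighted least-squares misfit between observed and tree-derived distances.

    observed:       dict mapping a taxon pair (i, j) to the original distance D_ij.
    tree_distances: dict mapping the same pairs to the path length d_ij on the tree.
    Assumes the 1 / D_ij**2 weighting; a lower score means a better fit.
    """
    total = 0.0
    for pair, big_d in observed.items():
        small_d = tree_distances[pair]
        total += (big_d - small_d) ** 2 / big_d ** 2
    return total


# Made-up distances for two pairs, only to show the arithmetic.
observed = {("A", "B"): 10, ("A", "C"): 20}
from_tree = {("A", "B"): 9, ("A", "C"): 22}
print(fitch_margoliash_score(observed, from_tree))  # 1/100 + 4/400 = 0.02
```

Scoring each candidate tree this way and keeping the tree with the smallest value corresponds to the selection step described on this slide.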
Character Based Methods • Maximum Parsimony (MP) – Character methods such as MP attempt to reconstruct the mutational events leading to the currently observed sequences. The most parsimonious tree is therefore the tree that requires the fewest mutational steps to explain the sequences observed at its tips.
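To make "fewest mutational steps" concrete, the sketch below counts the minimum number of changes a given tree requires, using Fitch's set-based counting rule. It is only an illustration under stated assumptions: the tree is a rooted binary tree written as nested 2-tuples of leaf names, the sequences are already aligned, and the function names are hypothetical. It scores one tree at a time; it does not search over trees the way a parsimony program does.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree:   a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping each leaf name to its character at this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))  # 2
print(parsimony_score((("A", "C"), ("B", "D")), alignment))  # 4
```

For this toy alignment the tree grouping A with B and C with D needs two changes, while the alternative grouping needs four, so the first tree is the more parsimonious of the two.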
The output from the PHYLIP DNAPARS program lists 3 most parsimonious trees; one such tree was shown on the original slide (figure not reproduced).
Maximum Likelihood (ML) • The term maximum likelihood does not refer to a single statistical method, but rather to a general approach. • ML methods in their simplest form begin by listing all possible models, and then calculating the probability that each model would generate the data actually observed. • The model with the highest probability of generating the observed data is chosen as the best model.
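A toy Python sketch of this "list the models, score each, keep the best" idea follows. The assumptions are loud ones: the only thing being estimated is the distance between two aligned sequences, the substitution model is Jukes-Cantor, and the candidate "models" are simply a grid of distance values. Real ML phylogeny programs optimise tree topology and branch lengths together; the function names here are hypothetical.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood (up to an additive constant) of seeing n_diff mismatches
    in n_sites aligned sites, if the true Jukes-Cantor distance is d."""
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))   # chance that a site differs
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)


def ml_distance(n_sites, n_diff, candidates):
    """Pick the candidate distance with the highest likelihood."""
    return max(candidates, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))


candidates = [i / 1000 for i in range(1, 2001)]   # 0.001 ... 2.000
best = ml_distance(n_sites=100, n_diff=10, candidates=candidates)
print(best)
```

With 10 mismatches in 100 sites the best grid value lies next to the closed-form Jukes-Cantor distance of about 0.107, which is the analytical maximum of the same likelihood.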
Methods of Phylogenetic Evaluation All phylogenetic trees represent hypotheses regarding the evolutionary history of the sequences that make up a data set. As with any good hypothesis, it is reasonable to ask two questions about how well a tree describes the underlying data – • How much confidence can be attached to the overall tree and its component parts, i.e. branches? • How much more likely is one tree to be correct than a particular or randomly chosen alternative tree?
Methods of Phylogenetic Evaluation It is important to remember that the output from Phylogenetic analysis is one answer obtained using one set of conditions. The input data may simply not be robust i.e. data itself may contain more noise than evolutionary signal. Two methods of Phylogenetic evaluations are – • Jumbling Sequence Addition Order • Bootstrapping
Jumbling Sequence Addition Order • The simplest way to test a phylogeny is to repeat the analysis several times with different addition orders. • Many PHYLIP tree-building programs, and most other phylogeny programs, have an option called JUMBLE that uses a random number generator to choose which sequence to add at each step, rather than adding them in the order in which they appear in the file. • The order in which sequences appear in a file matters: a non-random sequence order might introduce a bias into the result. • Therefore, even when doing only one run on a phylogeny, it is probably a good idea to jumble the order of sequences.
Bootstrapping • When sequences are short or polymorphism is minimal, we can have little confidence that the tree inferred from that data is the correct one. • The more data there is, the less likely it is that an artifactual phylogeny will be produced. • This method is based on the assumption that the statistical properties of a sample should be similar to the statistical properties of the population from which that sample was drawn. • The larger the sample, the more representative it should be of the population.
Bootstrapping • In a physical sense the process is equivalent to taking the printout of a multiple alignment; cutting it up into pieces, each of which contains a different column from the alignment; placing all those pieces into a bag; randomly reaching into the bag and drawing out a piece; copying down the information from that piece before returning it to the bag; and then repeating the drawing step until an artificial data set has been created that is as long as the original alignment. • The whole process is repeated to create hundreds or thousands of resampled data sets, and portions of the inferred tree that have the same groupings in many of the repetitions are those that are especially well supported by the entire original data set.
Bootstrapping • Bootstrap resampling is sampling with replacement. In the case of an MSA, sites (columns) are sampled at random until the data set is equal in length to the original alignment. • In each bootstrapped replicate, some sites are sampled once, some twice, and a small number three or more times, while others are never sampled; for a long alignment, about 37% of the sites are expected to be absent from any given replicate. • For bootstrap resampling of a sequence alignment, it is best to create at least 100 bootstrapped data sets, and redo the phylogeny for each one. • The one major disadvantage of bootstrap resampling is that it drastically increases the time required to construct a phylogeny.
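The resampling step can be sketched in a few lines of Python. This is an illustrative sketch only: the alignment is assumed to be a dict of equal-length strings, the function names are hypothetical, and the tree-building step that would be run on each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one replicate by drawing alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so each drawn
    # column stays intact across taxa.
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of resampled alignments, each as long as the original."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "AGGTAC"}
replicates = bootstrap_replicates(alignment, n_replicates=100, seed=1)
print(replicates[0])
```

Each replicate would then be passed to the same tree-building method as the original alignment, and the fraction of replicates that recover a given grouping is reported as its bootstrap support.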
Assumptions of multiple alignment process • All sequences are homologous. • No duplicate sequences are present. • In each column, amino acid residues are homologous. • The alignment is optimal, with minimal gaps.
Assumptions of phylogenetic analysis process • All sequences are homologous. • No duplicate sequences are present. • In each column, amino acid residues are homologous. • The alignment is optimal, with minimal gaps. • No back mutation has occurred. • All sequences are of the same length.
GENE PREDICTION • The objective of gene prediction is to identify regions of genomic DNA that encode proteins. • Gene prediction programs are used to search through new sequences and then annotate the sequence database entry with this information. • The annotation includes gene structure, gene location and any matches of the translated exons with protein sequence databases.
Basis of Gene Prediction • Introns are flanked by the donor and acceptor sites GT and AG – however, each such dinucleotide should occur by chance about once every 4^2 = 16 bases. • Genes start with ATG and end with a stop codon (TAA, TAG, or TGA) – however, such codons also occur frequently by chance. • The length of all coding regions must be a multiple of three – however, coding regions can be split over multiple exons. • The distribution of base triplets and hexamers differs between coding and non-coding regions.
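The start-codon, stop-codon and multiple-of-three signals can be combined into a naive open reading frame (ORF) scan, sketched below in Python. The sketch makes simplifying assumptions: it reads only the forward strand, ignores introns entirely, and uses a hypothetical function name and minimum-length parameter. Under a uniform random-sequence model a given codon such as ATG is expected about once per 64 codons and a stop codon about 3 times per 64 codons, which is why such a scan reports many false candidates on real genomic DNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, so dna[start:end] begins with
    ATG, ends with a stop codon, and has a length that is a multiple of three.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return orfs


sequence = "CCATGAAATAGCC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])     # 2 11 ATGAAATAG
```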
Problems which complicate gene prediction • Gene transfer mechanisms often introduce extra copies of genes into genomes. • Sequencing errors can obscure donor/acceptor sites and cause apparent frame shifts. • Genes can overlap each other. • Exons can be separated by several thousand bases. • Exons can be assembled in multiple ways through alternative splicing.
METHODS OF GENE PREDICTION • Laboratory-based approaches • Feature-based approach • Homology-based approach • Statistical and HMM-based approaches
Laboratory-based approaches • The traditional way to find a gene was to do it in the laboratory. Experimental procedures for locating genes in new DNA are basically of three types: • Identification via hybridization to mRNA or cDNA. • Identification of the 5’-end and intron-exon junctions of the gene. • Exon trapping.
Identification via hybridization to mRNA or cDNA • Northern Blots • Here mRNA is run out on a gel. • Transcripts resulting from expression of a gene can be detected and isolated by hybridization to any given new DNA sequence. • This methodology can also be used to distinguish exons from introns by appropriate probe construction. • Zoo Blots • The DNA probe comes from an intragenic region. • If the probe hybridizes to DNA from another organism, both organisms probably encode homologous proteins that execute similar functions. Thus Zoo Blots provide both gene location information and predictive gene function information.
Identification of the 5’-end and intron-exon junctions of the gene Here S1 nuclease mapping and primer extension are applied: • In S1 nuclease mapping a DNA probe labeled at its 5’-end or the 5’-end of an exon is hybridized to the gene DNA. • S1 nuclease is used to digest the single-stranded DNA. • In primer extension, a DNA probe labeled at its 5’-end and contained within the gene is hybridized to mRNA from the gene. • The probe DNA is used as a primer for a reverse transcriptase, which extends this primer using the mRNA as template.
Exon Trapping • This method is used to isolate exons from new DNA, rather than simply to identify exon-intron boundaries. • Here a restriction fragment from a new DNA sequence is cloned into a cognate restriction site in an intron of a cloned gene. • This DNA is introduced into an appropriate eukaryotic host, usually yeast, and the cloned gene is expressed.
Statistical and HMM-based approaches • GCG provides several tools that help to identify protein coding sequences by statistics that measure codon usage (CODONPREFERENCE) and the non-random use of particular nucleotides in the third position of each codon (TESTCODE). • Some learning-based approaches, like Neural Networks and Hidden Markov Models (HMMs), can work surprisingly well, often better than handcrafted programs if the task is sufficiently fuzzy. • HMMgene and GeneMark are popular gene recognition programs based on HMMs.
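The two kinds of statistic mentioned above, codon usage and third-position base composition, can be tabulated with the short Python sketch below. It only illustrates the raw counts; it is not the algorithm used by the GCG programs, and the function names are hypothetical. The input is assumed to be a coding sequence that is already in frame.

```python
from collections import Counter


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any trailing bases."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(cds):
    """Count how often each codon occurs."""
    return Counter(split_codons(cds))


def third_position_frequencies(cds):
    """Frequency of each base at the third position of the codons."""
    codons = split_codons(cds)
    if not codons:
        raise ValueError("sequence is shorter than one codon")
    counts = Counter(codon[2] for codon in codons)
    return {base: counts[base] / len(codons) for base in "ACGT"}


cds = "ATGGCCGCGTAA"                      # codons: ATG GCC GCG TAA
print(codon_usage(cds))
print(third_position_frequencies(cds))    # A 0.25, C 0.25, G 0.5, T 0.0
```

A strongly skewed third-position composition, compared with the surrounding sequence, is one hint that a region may be protein coding.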
GENE PREDICTION TOOLS There are many gene prediction tools available – • GRAIL • GenLang • BCM GeneFinder • FGENES • GENSCAN • Procrustes • GeneParser
GRAIL (Gene Recognition and Analysis Internet Link) • It is perhaps the most widely used ORF identification tool. It was also one of the first to be made available. • It provides analysis of the protein coding potential of a DNA sequence. • GRAIL finds about 91% of all coding regions with an apparent false positive rate of 8.6%. • Three further refinements of the program – GRAIL 1a, GRAIL II and GRAIL-EXP – have appeared since the original program.
Gene Mapping • Like geographical maps, maps of target DNA molecules give locations of ‘landmarks’. There are two main classes of mapping techniques: Genetic Mapping and Physical Mapping. • Genetic Mapping – it is based on statistical analysis of inheritance patterns, and gives a relative ordering of the markers used. • Physical Mapping – it is based on physical and biochemical experiments on DNA sequences.
Applications of Mapping 1. DNA fingerprinting – a fingerprint of a DNA sequence is a pattern of size-fractionated segments. Such a pattern can be compared with fingerprints of different DNA molecules to determine sequence similarities. The steps to fingerprinting a given DNA segment include: • Clone the segment. • Completely digest it with a chosen restriction enzyme. • Separate and measure the resulting fragments using gel electrophoresis.
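The digestion step can be simulated in silico to show what a fingerprint looks like as data. The Python sketch below assumes a linear DNA molecule, a single enzyme and an exact recognition site; the function name is hypothetical. The example uses the EcoRI site GAATTC, which is cut after the first base (G^AATTC).

```python
def digest(sequence, site, cut_offset=0):
    """Fragment lengths from complete digestion of a linear sequence.

    site:       the enzyme's recognition sequence.
    cut_offset: position of the cut within the site (1 means after the first base).
    Returns the fragment lengths sorted from largest to smallest, mimicking
    the size-ordered bands read from a gel.
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    fragments = [end - start for start, end in zip(bounds, bounds[1:]) if end > start]
    return sorted(fragments, reverse=True)


dna = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
print(digest(dna, "GAATTC", cut_offset=1))  # [10, 7, 4]
```

The sorted list of fragment lengths plays the role of the fingerprint that is compared between DNA molecules.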
Applications of Mapping 2. Clone mapping – sequencing extremely large chunks of DNA is a laborious and impractical task. A divide-and-conquer strategy is employed to break the problem into pieces of manageable size using a clone library. The steps to creating a map of DNA clones include: • Construct a clone library. • Fingerprint each clone. • Computationally infer the arrangement of clones, based on the fact that clones that overlap significantly should have features in common.
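The inference step relies on overlapping clones sharing features. As a rough Python sketch, assuming each fingerprint is a list of fragment lengths and that lengths match exactly (real gel measurements would need an error tolerance), candidate overlaps can be flagged by counting shared fragments. The function names and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_fragment_count(fingerprint_a, fingerprint_b):
    """Number of fragment lengths the two fingerprints have in common."""
    common = Counter(fingerprint_a) & Counter(fingerprint_b)   # multiset intersection
    return sum(common.values())


def likely_overlaps(fingerprints, min_shared=2):
    """List clone pairs that share at least min_shared fragments."""
    pairs = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        shared = shared_fragment_count(fp_a, fp_b)
        if shared >= min_shared:
            pairs.append((name_a, name_b, shared))
    return pairs


fingerprints = {
    "clone1": [10, 7, 4, 3],
    "clone2": [7, 4, 9],
    "clone3": [12, 5],
}
print(likely_overlaps(fingerprints))  # [('clone1', 'clone2', 2)]
```

Pairs flagged this way are only candidates; ordering the clones into a map requires combining the pairwise evidence.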
Applications of Mapping 3. STS Mapping – • A sequence-tagged site (STS) is a short reference sequence, about 200 nucleotides in length. • An STS is expected to occur exactly once in the entire genome. • STSs are used as markers on physical maps. • They are considered landmarks for locating other interesting sites.