Genome Trees and the Nature of Genome Evolution • Berend Snel, Martijn A. Huynen and Bas E. Dutilh • Presented by Audrey Noël
Introduction - Multiple alignments • Most existing approaches to phylogenetic inference rely on multiple alignment • They assume some evolutionary model and run into problems of computational complexity • They become misleading in the presence of gene rearrangements, inversions, transpositions and translocations • They do not apply directly to complete genomes, where events such as rearrangements make traditional full-length alignments impossible • They are therefore insufficient for phylogenies built from complete genomes
Introduction • Archaeal organisms appear close to Eukarya when the protein synthesis machinery is considered, but close to Bacteria when metabolic genes are compared. These differences reflect problems in phylogenetic reconstruction due to • Horizontal transfer • Unequal rates of nucleotide substitution • Gene displacement • Scientists today approach the classification of organisms differently than they did just a few decades ago • Molecular technologies such as PCR and sequencing allow more precise genetic observations • The availability of multiple complete genome sequences requires the development of new phylogenetic approaches
Introduction • How can genomic information be used to obtain useful information concerning genome evolution? • Complete genome trees are less affected than phylogenies based on single genes by • Horizontal gene transfer • Paralogy • Highly variable rates of gene evolution • Misalignment
Horizontal gene transfer (HGT), also called lateral gene transfer • Refers to the transfer of genes or genetic material directly from one individual to another by processes similar to infection • Implies that genes can be transferred between distant species that would never interbreed in nature • Leads to gene phylogenies that are inconsistent with the species phylogeny
Horizontal gene transfer • Problems: • If a plant acquires a gene from an archaeon, a tree built from that gene places the plant next to the Archaea rather than with the other plants • Produces complex trees with criss-crossing branches rather than fan-shaped trees • However, HGT accounts for a minority of the anomalous phylogenetic events observed in fully sequenced genomes • Gene loss and gene duplication pose more frequent challenges to genome phylogeny • For archaeal and bacterial genomes, <15% of phylogenetically troublesome events are due to HGT
Definitions • Phylogeny: the origin and evolution of a set of organisms. It uses evolutionary distance as the main criterion for taxonomy • Phylogenetic/genome tree: a graphical representation showing the evolutionary relationships among taxonomic units. Taxonomic units: species, populations, individuals or genes • Branches of the tree are connected at ancestral taxonomic units (nodes) • Living units are at the ends of the branches • Branch length represents the number of changes that have occurred
Rooted vs Unrooted • Rooted tree: • Directed tree with a unique node corresponding to the most recent common ancestor of all the entities at the leaves of the tree • Unrooted tree: • Tree derived from a rooted phylogenetic tree by omitting the root • It stands for the whole set of rooted phylogenetic trees obtained by placing the root on any one of its branches
Definitions • Dichotomous tree: each internal node has exactly two descendants • Polytomous tree: at least one internal node has three or more descendants
Taxon • [Figure: tree with leaves A, B, C, D, E, F] • Taxon: group with common attributes • Monophyletic taxon: includes all the evolutionary descendants of the taxon's common ancestor and only those descendants • Ex: mammals, birds, insects • Paraphyletic taxon: includes descendants of a single common ancestor, but not all of them • Ex: fish, invertebrates • Polyphyletic taxon: descended from more than one ancestor • Ex: marine mammals, bipedal mammals, flying vertebrates, algae
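The monophyly definition can be checked mechanically. The following is a minimal sketch, assuming a tree written as nested Python tuples with leaf names as strings; the topology over the leaves A to F is hypothetical and is not taken from the slide's figure. A group is monophyletic exactly when it equals the full leaf set of some subtree.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Yield the leaf set of every subtree, including the whole tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if `group` is exactly the set of descendants of one node."""
    return frozenset(group) in set(clades(tree))


# Hypothetical topology: ((A,B),C) and ((D,E),F) joined at the root.
example_tree = ((("A", "B"), "C"), (("D", "E"), "F"))

print(is_monophyletic(example_tree, {"A", "B", "C"}))  # complete clade
print(is_monophyletic(example_tree, {"B", "C"}))       # leaves out A, so not a clade
```

The check only separates monophyletic groups from the rest; telling paraphyletic from polyphyletic groups would need additional information about the ancestors.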
Definitions • Homologs: similar sequences that derive from a common ancestral sequence • Orthologs: homologous sequences in two different organisms that arose through a speciation event • Paralogs: homologous sequences, typically within a single organism, that arose through a gene duplication event
5 classes of genome trees based on different aspects of the genome • Alignment-free trees • Gene content trees • Chromosomal gene order trees • Average sequence similarity trees • Phylogenomics trees
Alignment-free trees • Based on statistical properties of the genome • Two categories of methods: • Based on statistics of word frequency (DNA string) • Shared information
DNA string • Does not rely on homology • Counts the frequency of oligopeptide strings of a fixed length in the collection of protein sequences • The counts are combined into a word-frequency vector and the distance is defined in a Cartesian space • Angle between two vectors = distance between two genomes • Trees are constructed using standard distance-based algorithms
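The word-frequency idea can be sketched in a few lines. This is an illustrative sketch only, assuming each genome is given as a list of protein sequences; the function names and the toy sequences are hypothetical, and any normalization or background correction applied by published tools is omitted.

```python
import math
from collections import Counter


def kmer_counts(proteins, k):
    """Count every overlapping string of length k across all protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def angle_distance(counts_a, counts_b):
    """Angle (in radians) between two word-frequency vectors."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each genome needs at least one string of length k")
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Toy "proteomes" (made-up sequences, for illustration only).
genome_1 = ["MKVLAAGIV", "MKTLLAGIV"]
genome_2 = ["MKVLAAGLV", "MRTLLAGIV"]
print(angle_distance(kmer_counts(genome_1, 3), kmer_counts(genome_2, 3)))
```

The only knob is `k`, the string length. The resulting pairwise angles would fill the distance matrix handed to a distance-based tree builder.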
DNA string : advantages • Early analyses of biological sequences compared G+C content or amino acid composition • Extending single-nucleotide counting to longer strings increases the resolving power of the analysis • No choice of genes (hence no ambiguity) and no multiple alignment of sequences is needed • Essentially free of parameters: the only one is the length of the oligopeptides
DNA string : disadvantages • Placement problems related to small genome size • But when applied to small chloroplast genomes alone, it gives good results • The approach needs more justification and further study • It should be tested by including new complete genomes, especially those of eukaryotes
Shared information • Algorithmic compression • Lempel-Ziv complexity • Identifies the regularities in a given DNA sequence • These regularities may have biological implications • Distance between two genomes a and b = length of the shortest computer program that outputs a when given b as input
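The "shortest program" length cannot be computed exactly, so in practice a real compressor stands in for it. The sketch below is an assumption-laden illustration, not the measure used in the work presented: it uses Python's `zlib` (an LZ77-based compressor) and the normalized compression distance formula, and the two random test sequences are made up.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a proxy for its complexity)."""
    return len(zlib.compress(data, 9))


def compression_distance(a: bytes, b: bytes) -> float:
    """Normalized compression distance: small when a and b share regularities."""
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


# Two unrelated random DNA strings (hypothetical data, fixed seed).
rng = random.Random(0)
genome_x = bytes(rng.choice(b"ACGT") for _ in range(2000))
genome_y = bytes(rng.choice(b"ACGT") for _ in range(2000))

same = compression_distance(genome_x, genome_x)
different = compression_distance(genome_x, genome_y)
print(same, different)  # the self-comparison should give the smaller value
```

Note that zlib only looks back over a 32 KB window, so this sketch would not capture shared content across real genome-sized inputs; a compressor designed for long sequences would be needed there.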
LZ : advantages • Able to perform comparisons at the whole-genome level, where multiple alignment methods fail • Uses all the information contained in the sequences and requires no human intervention • Unequal sequence lengths are not a problem
LZ : disadvantage • LZ compression replaces the repeated patterns it detects with references to a dictionary • The larger the dictionary, the more bits are needed for each reference
Alignment-free trees : applications • Phylogeny of the eutherian (placental mammal) orders built from complete unaligned mitochondrial genomes • Consistent with the commonly accepted phylogeny • 109 organisms: 16 Archaea, 87 Bacteria and 6 Eukarya • Unrooted tree that agrees with the biologists' "tree of life"
Genome trees based on shared gene content • Distances represent the fraction of orthologous genes shared between genomes • Distance algorithms are used to construct the tree • neighbor joining • minimum evolution • Horizontal transfer events are few, or occur mainly between closely related species
Gene content : Genome size effect • Problem: a large genome can share more genes with other large genomes than it does with its more closely related but smaller cousins • There are two ways to correct for this effect • Divide the number of shared genes by the number of genes in the smaller genome, the latter being the maximum number of genes the two genomes can share • Leave out the small genomes
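The first correction can be written down directly. This is a minimal sketch, assuming each genome has already been reduced to a set of ortholog-group identifiers (the identifiers below are hypothetical) and that the distance is taken as one minus the normalized shared fraction.

```python
def gene_content_distance(genes_a, genes_b):
    """1 - (shared genes / size of the smaller genome)."""
    if not genes_a or not genes_b:
        raise ValueError("both genomes must contain at least one gene")
    shared = len(genes_a & genes_b)
    # The smaller genome bounds how many genes the pair can share.
    return 1.0 - shared / min(len(genes_a), len(genes_b))


# Hypothetical ortholog-group identifiers.
large_genome = {"og1", "og2", "og3", "og4", "og5", "og6", "og7", "og8"}
small_relative = {"og1", "og2", "og3"}          # reduced genome, all genes shared
other_large = {"og1", "og2", "og5", "og6", "og9", "og10", "og11", "og12"}

print(gene_content_distance(large_genome, small_relative))
print(gene_content_distance(large_genome, other_large))
```

With this normalization the small relative is not penalized for its size: all of its genes are found in the large genome, so the distance is zero, whereas the raw count of shared genes (3 versus 4) would have favored the other large genome.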
Gene content : applications • Divides 174 taxa into Archaea, Bacteria and Eukarya • Sorts most of the major groups within these superkingdoms • Not every organism appears exactly at its classical phylogenetic position in these trees • Applied to 11 complete genomes of free-living microorganisms • Additional phylogenetic relationships appear to be resolved • Clusters of orthologous groups (COG) data were used to construct a tree of herpesviruses • The tree agrees well with those based on other methods • The tree is robust when tested by bootstrap analysis
Genome trees based on shared gene content : disadvantages • Factors that indirectly influence the position of an organism • Genome size • Loss or acquisition of genes • Including small genomes, which may have undergone massive gene loss, can distort the genome tree because it limits the proportion of genes they can share through common ancestry with other genomes
Trees based on gene order • Based on the position of genes in the chromosome or chromosomes that compose the genome of the analyzed species • Estimate evolutionary distance from the number of rearrangements necessary to transform one genome into another
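Computing the exact number of rearrangements is involved, so the sketch below swaps in a simpler and commonly used proxy, the breakpoint count: the number of gene adjacencies present in one genome but absent in the other. It assumes two linear gene orders over the same set of genes, each gene appearing once, with orientation ignored; the gene names are hypothetical.

```python
def adjacencies(order):
    """Set of unordered neighboring gene pairs in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoint_distance(order_a, order_b):
    """Number of adjacencies of order_a that are broken in order_b."""
    if set(order_a) != set(order_b) or len(set(order_a)) != len(order_a):
        raise ValueError("both orders must contain the same genes, each once")
    adj_b = adjacencies(order_b)
    return sum(
        1 for pair in zip(order_a, order_a[1:]) if frozenset(pair) not in adj_b
    )


reference = ["g1", "g2", "g3", "g4", "g5"]
inverted = ["g1", "g4", "g3", "g2", "g5"]   # segment g2..g4 reversed

print(breakpoint_distance(reference, reference))
print(breakpoint_distance(reference, inverted))
```

A single inversion breaks the two adjacencies at its ends (g1-g2 and g4-g5 here), which is why the breakpoint count tracks the number of rearrangements for closely related genomes and saturates once most adjacencies are lost.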
Trees based on gene order : disadvantages • Gene order is well conserved only between closely related species, in both prokaryotes and eukaryotes • Because transcription in prokaryotes is organized in operons, some genes must stay together, so gene order is more conserved in prokaryotes than in eukaryotes
Gene order : application • 11 genomes of species belonging to the lactic acid bacteria • The tree does not provide much additional information about relationships among bacterial taxa compared to more traditional alignment-based methods • The study can provide other kinds of information, such as which genes are shared, when genes were lost in evolutionary history, and whether HGT has occurred • BUT the lack of conservation of gene order across species makes this approach less suitable for comparing distantly related organisms • [Image caption: By converting malic acid into lactic acid, Oenococcus oeni plays a critical role in de-acidifying wine]
Differences between gene content and gene order • Although both correlate with evolutionary distance, gene order evolves faster • E. coli and H. influenzae (a Gram-negative bacterium) share 78% of their genes, while only 36% of their gene order is conserved • The gene order tree showed some improbable higher-order affiliations, reflecting a lack of resolution at these longer evolutionary distances, over which too many gene rearrangements have occurred, whereas the gene content tree behaved normally at these distances • [Image caption: Haemophilus influenzae]
Genome trees based on average sequence similarity • Run BLAST comparisons between the DNA sequences of each pair of complete genomes • BLAST (Basic Local Alignment Search Tool) • Program that takes a sequence as input and finds all similar sequences in a database • Build a similarity matrix in which each cell holds the BLAST score (a measure of similarity) between two genomes • The matrix is passed to the neighbor-joining method to build a tree: roughly speaking, the most similar species are grouped first, and so on • In contrast to the other methods, this one ignores any knowledge of orthology
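The grouping step of neighbor joining is slightly more subtle than "closest pair first": the standard criterion also accounts for how far each candidate is from all the other species. The sketch below shows only that first pair-selection step, assuming the similarity scores have already been converted to distances (the conversion is not specified in the slides); the four-species matrix is a made-up example.

```python
from itertools import combinations


def nj_first_pair(dist):
    """Return the pair of taxa that neighbor joining would join first.

    `dist` is a symmetric dict of dicts: dist[a][b] is the distance a to b.
    """
    taxa = sorted(dist)
    n = len(taxa)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    # Total distance from each taxon to all the others.
    totals = {t: sum(dist[t][u] for u in taxa if u != t) for t in taxa}

    def q(pair):
        i, j = pair
        return (n - 2) * dist[i][j] - totals[i] - totals[j]

    # The pair with the lowest Q value is joined (first one wins on ties).
    return min(combinations(taxa, 2), key=q)


# Hypothetical distances: B and C are the closest pair (3), but A and D
# sit on long branches.
toy = {
    "A": {"B": 6, "C": 7, "D": 11},
    "B": {"A": 6, "C": 3, "D": 7},
    "C": {"A": 7, "B": 3, "D": 6},
    "D": {"A": 11, "B": 7, "C": 6},
}

print(nj_first_pair(toy))
```

In this toy matrix the closest pair is B and C, yet the criterion selects A and B, which illustrates why the method is less easily misled by unequal rates than simply clustering the most similar genomes.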
Genome trees based on average sequence similarity : advantages • Straightforward to implement • Intermediate between gene content approach and sequence based approach
Genome trees based on average sequence similarity : disadvantages • They compare homologous genes rather than orthologous genes, which introduces noise • A filter should be applied to reduce the impact of non-orthologous homologs • Researchers are reluctant to adopt the method because it appears to combine the problems of trees based on gene content and of trees based on sequence
Genome trees based on average sequence similarity : applications • Construct trees for completely sequenced bacterial and archaeal genomes. The resulting tree supports: • The separation of bacteria and archaea • Some terminal bifurcations within the bacterial and archaeal domains
Genome trees based on gene trees • Supertrees • Concatenated sequences
Supertrees • Currently the only phylogenetic method that can build complete phylogenies of very large clades (hundreds of species)
Conclusion • Reliable phylogenies help us understand: • The sequence of evolutionary events that generated present-day diversity • The mechanisms of evolution as well as the history of organisms • Whole-genome data already support good applications, but the approaches still need to be improved
Comparison between gene order and alignment-free trees • Gene order • Time-consuming because it requires gene identification • Compares genomes using only part of the genome information • Alignment-free • Uses all of the genome information