Genomics of bacteria and archaea : the emerging dynamic view of the prokaryotic world

Genomics of bacteria and archaea : the emerging dynamic view of the prokaryotic world. E. V. Koonin and Y. I. Wolf Nucleic Acids Research 36:6688-6719 October 2008. Extent of Prokaryotic Diversity. Only about 0.1% of bacteria can be cultured in the laboratory!

Presentation Transcript


  1. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world E. V. Koonin and Y. I. Wolf Nucleic Acids Research 36:6688-6719 October 2008

  2. Extent of Prokaryotic Diversity • Only about 0.1% of bacteria can be cultured in the laboratory! • Currently about 1200 sequenced prokaryotic genomes • Metagenomics = sequencing DNA found in the environment without growing or purifying the organisms. • Large-scale metagenomic surveys have not revealed abundant bacteria outside of already known phyla • Biggest survey: Craig Venter's seawater survey. • Only about 10% of metagenomic sequences have no discernible homologs. • Possibly many new species exist in some unusual habitats?

  3. Genome Size • Current smallest: Carsonella ruddii = 180 kbp • Current largest: Sorangium cellulosum = 13 Mbp • Genomes smaller than 1 Mbp all belong to parasites or intracellular symbionts, which don’t need to make all compounds from scratch. • 1 Mbp seems to be about the minimum size for a free-living bacterium • The largest viruses (mimiviruses) are 1 Mbp or so; such viruses are common in marine habitats. • The smallest eukaryotic genome (that of the obligate intracellular parasite Encephalitozoon intestinalis) is 2.3 Mbp

  4. Gene Density • Roughly 1 gene per 1000 bp in both bacteria and archaea • Intergenic spaces are either almost 0 bp (within operons) or average about 100 bp. • Longer intergenic spaces probably contain RNA-only genes or pseudogenes • Nearly all prokaryotic genes are a single open reading frame, with very few introns or split genes. • Gene overlaps are no more than a few base pairs: no documented cases of long overlaps.
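
The rule of thumb of roughly one gene per 1000 bp can be turned into a quick estimate of gene count from genome size. The following is a minimal sketch only; the function name and the `bp_per_gene` parameter are illustrative, and the genome sizes are the round figures quoted on the Genome Size slide.

```python
def estimated_gene_count(genome_bp, bp_per_gene=1000):
    """Rough gene count, assuming about one gene per `bp_per_gene` base pairs."""
    return round(genome_bp / bp_per_gene)


# Round genome sizes taken from the Genome Size slide (in base pairs).
genome_sizes = {
    "smallest known (about 180 kbp)": 180_000,
    "free-living minimum (about 1 Mbp)": 1_000_000,
    "largest known (about 13 Mbp)": 13_000_000,
}

estimates = {label: estimated_gene_count(size) for label, size in genome_sizes.items()}
for label, count in estimates.items():
    print(f"{label}: roughly {count} genes")
```

The estimate for the smallest genome (about 180 genes) is in line with the 170 genes quoted for Carsonella later in the deck, which suggests the density rule holds even for highly reduced genomes.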

  5. Clusters of Orthologous Groups of Genes • “Orthologs” are genes that descend from the same gene in an ancestral species. • The definition needs to be a bit looser in prokaryotes, where horizontal gene transfer is common • Often defined by “bidirectional best hits” (BBH): two genes (in different genomes) are each other’s best BLAST hit in those genomes. • Problem of gene duplication: paralogs. Paralogs also derive from a common ancestral gene, but by duplication within a genome, and they have often evolved different functions. • COGs are based on identifying orthologous genes, even if there is more than one in a given genome. • New derivative of COGs: eggNOG (yuck). The database includes genes from 312 bacterial species and 26 archaea.
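
The bidirectional-best-hit rule can be sketched in a few lines. This is an illustration under stated assumptions, not the COG or eggNOG pipeline: the gene names and similarity scores are made up, and the scores stand in for something like BLAST bit scores that would normally be parsed from search output.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome.

    `scores` maps (query_gene, subject_gene) to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_ab.items()
        if best_ba.get(gene_b) == gene_a
    )


# Hypothetical scores: genome A has a1 and a2; genome B has b1 and b2.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 120.0, ("a2", "b2"): 90.0}
b_vs_a = {("b1", "a1"): 248.0, ("b2", "a1"): 118.0, ("b2", "a2"): 85.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data a2's best hit is b2, but b2's best hit is a1, so only the a1/b1 pair is reciprocal. That asymmetry is the kind of situation paralogs create, and it is why the COG approach goes beyond simple pairwise BBH.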

  6. COG results • Figure: for each COG, how many of the 338 genomes in eggNOG lack it. Both plots show the same data; the bottom one is semi-log. • How widespread are different orthologous gene families? • In most sequenced genomes, about 80% of genes can be assigned to a COG. • The rest of the genes have no detectable homology with any other protein; they are often called “ORFans” • There are very few COGs found in most or all organisms (“core” genes: about 70 gene clusters) • A larger, but still small, number of COGs is moderately conserved, found in many genomes (“shell” genes: 5700 gene clusters) • The large majority of COGs are found in only a few genomes (“cloud” genes: 24,000 clusters)
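
The core/shell/cloud split can be expressed as a simple threshold on the fraction of genomes that contain a COG. The slide gives no numeric cut-offs, so the 90% and 10% thresholds below are assumptions chosen only for illustration, as are the cluster names and counts.

```python
def classify_cog(n_genomes_with_cog, n_genomes_total, core_min=0.9, shell_min=0.1):
    """Label a COG as core, shell, or cloud from the fraction of genomes containing it.

    The default thresholds are illustrative assumptions, not values from the slides.
    """
    fraction = n_genomes_with_cog / n_genomes_total
    if fraction >= core_min:
        return "core"
    if fraction >= shell_min:
        return "shell"
    return "cloud"


TOTAL_GENOMES = 338  # number of genomes in the eggNOG set described on the slide

# Hypothetical clusters: name -> number of genomes that contain a member.
genomes_per_cog = {"cluster_a": 338, "cluster_b": 120, "cluster_c": 3, "cluster_d": 1}

labels = {
    name: classify_cog(count, TOTAL_GENOMES)
    for name, count in genomes_per_cog.items()
}
print(labels)
```

A cluster found in exactly one genome (like `cluster_d`) sits at the very tail of the cloud and is effectively an ORFan.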

  7. Percentage of Genes in eggNOG COGs

  8. ORFans vs. ELFs • A gene with no detectable homology to any other protein in another species is an ORFan • What are ORFans? • Some are ELFs = Evil Little Fellows: falsely predicted genes; hypothetical genes that aren’t real. (BTW--I don’t think ELF is going to make it into standard genomics jargon, but ORFan might). • Some are real genes derived from bacteriophages. • Metagenomic studies suggest that the world of bacteriophage genomes is vast and very under-explored. This is a very important concept that we will explore later. • In genome annotation, it is common to find prophages, which look like regions of the genome with many hypothetical genes mixed with a few genes labeled “phage protein” or “integrase/recombinase” • Some are just the tail end of the distribution of the “cloud” genes that are found in only a few genomes (in this case, just 1 genome). • How big is “gene space”, the totality of all genes? Could be several orders of magnitude larger than we know now, and almost all of it will be genes found in only 1 or a few genomes, or perhaps only in bacteriophage.

  9. COGs in Phylogenetic Groups • The presence or absence of members of each of the 30,000+ COGs in all 338 eggNOG genomes can be used for cluster analysis (here, a self-organizing map, or SOM). • On the SOM, genomes close to each other share more COGs than genomes far apart on the map. • There is a strong correlation between COG presence/absence and known phylogenetic groups (based on 16S rRNA): different members of the same phylum group together. • There are a few exceptions: gamma-proteobacteria are split, possibly due to a diversity of lifestyles.
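
The idea that genomes sharing more COGs sit closer together can be illustrated with a much simpler measure than a SOM. The sketch below uses Jaccard distance on presence/absence sets as a stand-in; it is not the method used for the figure, and the genome and cluster names are invented.

```python
def jaccard_distance(cogs_a, cogs_b):
    """Return 1 - |shared COGs| / |all COGs in either genome| (0.0 means identical sets)."""
    set_a, set_b = set(cogs_a), set(cogs_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical COG content of three genomes.
cog_content = {
    "genome_1": {"c1", "c2", "c3", "c4"},
    "genome_2": {"c1", "c2", "c3", "c5"},
    "genome_3": {"c1", "c6", "c7", "c8"},
}

d12 = jaccard_distance(cog_content["genome_1"], cog_content["genome_2"])
d13 = jaccard_distance(cog_content["genome_1"], cog_content["genome_3"])
print(f"genome_1 vs genome_2: {d12:.3f}")
print(f"genome_1 vs genome_3: {d13:.3f}")
```

A pairwise distance matrix built this way could feed any standard clustering method; the SOM on the slide additionally arranges the genomes on a two-dimensional grid.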

  10. COGs vs. Gene Function • Genome annotation is based on the principle that if someone experimentally determines a gene’s function, then other genes with similar protein sequences are assumed to perform the same function. • Annotation also uses information about the gene’s chromosomal neighborhood: genes that are part of the same subsystem are often found grouped together. • We are not likely to be able to predict a protein’s function directly from its amino acid sequence anytime soon. • Non-orthologous gene displacement is common: when two organisms are compared, the same gene function is often performed by two entirely different, non-homologous proteins. • This happens even in very fundamental processes like DNA replication: the primary enzymes for replication are entirely different between the bacteria and the archaea. • Because of non-orthologous gene displacement, the “gene sequence homology space” in the previous slide is not identical to a SOM of “gene function space” • This SOM is set up the same way: a vector of presence/absence of different gene functions (functional roles) in different species • Here, the phyla are less well grouped, perhaps because even closely related bacteria often have very different lifestyles.

  11. Genome Architecture • Most prokaryotes have a single DNA origin of replication (ori), which is used to define base 1 in a genomic sequence, as well as the orientation of the sequence. • DNA polymerase starts replication at ori and proceeds in both directions, which divides the chromosome into two halves: the right and left replichores. • Within each replichore, one strand is copied continuously (the “leading strand”) and the other discontinuously (the “lagging strand”). • The leading and lagging strands often have noticeably different base compositions (GC skew, etc.). • Most genes, especially highly transcribed genes, are oriented in the same direction as replication.
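
Strand-specific base composition is often summarized as GC skew, (G - C) / (G + C), and the running total of that skew along the chromosome is a common way to guess where the origin lies. The sketch below assumes the leading strand is G-enriched, which holds for many but not all bacteria; the toy sequence, window size, and function names are illustrative.

```python
def gc_skew(window):
    """Return (G - C) / (G + C) for one window, or 0.0 if it has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def cumulative_gc_skew(sequence, window=100):
    """Return (window_end, running_skew_total) for consecutive non-overlapping windows."""
    sequence = sequence.upper()
    running_total = 0.0
    profile = []
    for start in range(0, len(sequence) - window + 1, window):
        running_total += gc_skew(sequence[start:start + window])
        profile.append((start + window, running_total))
    return profile


def predicted_origin(sequence, window=100):
    """Guess the origin as the window boundary where the cumulative skew is lowest."""
    window_end, _ = min(cumulative_gc_skew(sequence, window), key=lambda point: point[1])
    return window_end


# Toy chromosome: C-rich up to position 300, G-rich from 300 to 900, C-rich after.
c_rich = "GCCAT" * 60   # 300 bp, skew -1/3 per window
g_rich = "GGCAT" * 120  # 600 bp, skew +1/3 per window
toy_genome = c_rich + g_rich + c_rich

origin = predicted_origin(toy_genome)
print(origin)
```

In the toy sequence the cumulative skew falls until position 300 and rises until position 900, so the minimum marks the assumed origin and the maximum the terminus. Real genomes are noisier, and larger windows are usually needed.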

  12. Dotplots to Compare Genome Structure • Compare positions of orthologous genes between 2 genomes, then plot one genome's positions against the other's. • A. Closely related genomes are mostly collinear, or syntenic (here, two Geobacillus species) • Syntenic means that neighboring genes in one species are also neighbors in another species • B. Moderately related bacteria show an X-shaped pattern due to multiple inversions across the origin, which preserve the direction of transcription relative to replication (here, Shewanella) • C. X-pattern in 2 Archaea (Pyrococcus) • D. Distantly related species show a random distribution of orthologs: the genome is well-scrambled. • In general, only closely related species show any common genome architecture. The overall arrangement of genes on the chromosome is not well preserved.
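
A dotplot is just a list of (position in genome A, position in genome B) points, one per ortholog pair. The sketch below builds those points as fractions of genome length and labels each as lying near the main diagonal (collinear), near the anti-diagonal (the second arm of the X pattern), or elsewhere. All names, positions, and the tolerance are hypothetical, and wrap-around on a circular chromosome is ignored for simplicity.

```python
def dotplot_points(positions_a, positions_b, ortholog_pairs, length_a, length_b):
    """Return (x, y) points with positions scaled to fractions of each genome's length."""
    return [
        (positions_a[gene_a] / length_a, positions_b[gene_b] / length_b)
        for gene_a, gene_b in ortholog_pairs
    ]


def classify_point(x, y, tolerance=0.05):
    """Label a point as near the diagonal, near the anti-diagonal, or off both."""
    if abs(y - x) <= tolerance:
        return "diagonal"
    if abs(y - (1.0 - x)) <= tolerance:
        return "antidiagonal"
    return "off"


# Hypothetical gene start positions (bp) in two 1 Mbp genomes.
positions_a = {"a1": 100_000, "a2": 400_000, "a3": 700_000, "a4": 900_000}
positions_b = {"b1": 110_000, "b2": 620_000, "b3": 690_000, "b4": 500_000}
ortholog_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]

points = dotplot_points(positions_a, positions_b, ortholog_pairs, 1_000_000, 1_000_000)
labels = [classify_point(x, y) for x, y in points]
print(labels)
```

A plot dominated by "diagonal" points corresponds to panel A, a mix of "diagonal" and "antidiagonal" to the X pattern in panels B and C, and mostly "off" points to the scrambled case in panel D.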

  13. Bacillus Dotplot: B. megaterium vs. B. cereus. Organization is conserved in the vicinity of the replication origin, but not in other regions.

  14. Operons • The classic operon is the E. coli lac operon. • Jacob and Monod, 1961 • Three genes involved in lactose utilization are transcribed into a single messenger RNA • Transcription is under the control of a single transcription factor, the lac repressor. • When the lac repressor detects lactose, it releases the DNA and allows the operon to be transcribed. • Most prokaryotes have numerous operons of many types

  15. Operons Across Species • Operon structure is conserved much better than overall chromosomal synteny • especially for genes whose proteins physically interact, such as the ribosomal proteins. • Interpretable as selection for having a balanced number of all subunits. • The 50+ ribosomal proteins are found grouped in different patterns across all prokaryotes: the ribosomal “superoperon” • Other groups of partially conserved operons also exist, giving the general concept of the conserved gene neighborhood: even when they are not part of the same operon, genes involved in the same subsystem tend to stay near each other. • However, most operons are not part of superoperons, but rather just 2-4 genes that are oriented in the same direction and are co-transcribed and co-regulated. • Conservation is moderate: operon membership tends to change over phylogenetic distance. • Even so, most groups of adjacent genes in the same orientation are actually co-regulated as operons. • The percentage of genes in operons varies: very high in Thermotoga, very low in Cyanobacteria.
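
Because most runs of adjacent, same-orientation genes are co-regulated, and intergenic gaps inside operons are close to 0 bp, a crude operon predictor needs only gene coordinates and strands. The sketch below is a simple heuristic, not a published method: the 50 bp gap threshold, the `Gene` record, and the toy coordinates are all assumptions.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def predict_operons(genes, max_gap=50):
    """Group neighboring genes that share a strand and lie at most `max_gap` bp apart."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            previous = operons[-1][-1]
            gap = gene.start - previous.end - 1  # negative when genes overlap slightly
            if gene.strand == previous.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g.name for g in operon] for operon in operons]


# Hypothetical annotation of five genes.
toy_genes = [
    Gene("g1", 1, 900, "+"),
    Gene("g2", 905, 1800, "+"),
    Gene("g3", 1810, 2700, "+"),
    Gene("g4", 2850, 3600, "+"),
    Gene("g5", 3610, 4500, "-"),
]

operons = predict_operons(toy_genes)
print(operons)
```

Here g4 is split off because its 149 bp gap exceeds the threshold, and g5 because it lies on the opposite strand. Given the variation between Thermotoga and Cyanobacteria noted above, a single threshold is unlikely to suit every genome.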

  16. Gene Regulation and Signal Transduction • Lac operon model: a single protein senses something in the environment (lactose) and directly alters transcription. • Some variants: genes transcribed from a common regulatory region in opposite directions (a divergon), and genes in multiple locations affected by the same regulatory protein (a regulon) • The transcription factors (DNA binding proteins that affect transcription) are well conserved, but which genes are affected varies widely. • Transcription factors generally consist of a ligand-binding domain (e.g., the part that binds to lactose) and a DNA binding domain. • Two-component histidine kinase systems: • One protein is a membrane-bound histidine kinase that senses something in the extracellular environment. • The histidine kinase phosphorylates another protein, the response regulator, which is soluble and binds to the DNA to affect transcription. • Many other systems, often originally found in eukaryotes: cyclic AMP, cyclic di-GMP, programmed cell death systems, and more.

  17. Genome Size • Minimal number of genes: • for growth on rich medium, where there are very few biosynthetic requirements: maybe 250 genes. • Carsonella ruddii, an obligate intracellular symbiont, has only 170 genes. It even lacks some aminoacyl tRNA synthetases, and probably uses host enzymes for this function. Perhaps it is being converted into an organelle? (that’s just speculation, however) • for a free-living heterotroph, maybe 1000 genes are needed • Pelagibacter ubique has about 1100 genes • given the presence of non-orthologous gene displacement and different lifestyles, there are undoubtedly many more-or-less minimal genomes that survive.

  18. Gene Class vs. Genome Size • Some gene classes are found in about the same numbers in all genomes: translation machinery, cell division machinery. • Other classes grow in proportion to genome size: metabolic genes, transporters, DNA replication and repair • Still others grow in proportion to the square of genome size: regulatory proteins. • Small genomes have very few regulatory proteins, while large genomes have lots. The fraction of regulatory genes increases as the total number of genes increases. • Note the exponents on the equations in the figure. • This leads to a proposed maximum genome size of about 20,000 genes, the point at which each non-regulatory gene has its own regulatory gene
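
The link between quadratic scaling and a maximum genome size can be made explicit with a little arithmetic. If the number of regulatory genes is R = c·N² for N total genes, then "each non-regulatory gene has its own regulatory gene" means R = N − R, so c·N² = N/2 and N = 1/(2c). The sketch below is only a back-of-the-envelope model: the pure quadratic form is a simplification, and the coefficient is back-calculated from the 20,000-gene figure on the slide rather than taken from the paper's fitted equations.

```python
def regulatory_genes(total_genes, coefficient):
    """Number of regulatory genes under the simplified model R = coefficient * N**2."""
    return coefficient * total_genes ** 2


def max_genome_size(coefficient):
    """Gene count at which regulators equal non-regulators: c * N**2 = N / 2."""
    return 1.0 / (2.0 * coefficient)


# Assumption: coefficient chosen so that the ceiling matches the 20,000 genes on the slide.
COEFFICIENT = 1.0 / (2.0 * 20_000)

for total in (1_000, 5_000, 20_000):
    regulators = regulatory_genes(total, COEFFICIENT)
    print(f"{total} genes: about {regulators:.0f} regulators ({regulators / total:.1%})")
```

Under these assumptions the regulatory fraction climbs from a few percent in a 1000-gene genome to half of all genes at the ceiling, which is the trend the slide describes.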

  19. Horizontal Gene Transfer • Defined as DNA transfer across species lines • As opposed to vertical gene transfer: genes transmitted from parent to offspring through chromosome replication and cell division. • Once considered unusual or controversial, it is now obvious that HGT is a frequent event with major effects on all prokaryotic genomes. • HGT has made the definition of “species” difficult in prokaryotes. • Pathogenicity islands: regions of up to 100 kbp, often near tRNA genes and often containing multiple prophage insertions. They contain genes needed for pathogenic behavior, such as toxins and type III secretion systems. • The three classical sexual processes in prokaryotes: • Conjugation: direct transfer of DNA between two cells. Certain plasmids have genes that cause conjugation. • Transduction: transfer of DNA through a bacteriophage intermediate • Gene transfer agents (GTAs) are defective bacteriophages that package and transfer random pieces of the bacterial genome, without killing the host cells. • Transformation: uptake of naked DNA from the environment.

  20. More HGT • In the absence of a direct genome comparison, horizontal gene transfer can be detected by differences in DNA composition: GC content, codon usage, oligonucleotide frequency, etc. • However, acquired genes undergo a process of “amelioration”, where selectively neutral mutations shift the DNA composition to match the host’s DNA. • Organisms that share a common environment often transfer genes, even across the bacteria-archaea divide. • Hyperthermophilic bacteria have up to 20% of their genes with better matches in the archaea than in other bacteria. • Similarly, mesophilic archaea share more genes with mesophilic bacteria • Figure: top = bacteria, bottom = archaea. In both cases, a mesophile is on the left and a hyperthermophile is on the right.
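
Composition-based detection can be sketched with GC content alone: flag genes whose GC fraction differs from the genome-wide value by more than some threshold. This is a deliberately minimal illustration. The 0.2 threshold, the gene names, and the short toy sequences are assumptions, and real analyses typically combine GC content with codon usage and oligonucleotide frequencies.

```python
def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def flag_atypical_genes(genes, max_deviation=0.2):
    """Return sorted names of genes whose GC content is far from the genome-wide value.

    `genes` maps gene name to sequence; the genome-wide value is taken over all genes.
    """
    genome_gc = gc_content("".join(genes.values()))
    return sorted(
        name
        for name, sequence in genes.items()
        if abs(gc_content(sequence) - genome_gc) > max_deviation
    )


# Hypothetical genes: five with similar composition and one AT-rich outlier.
toy_genes = {
    "gene_1": "ATGCGCGGCTAA",
    "gene_2": "ATGGCCGCGTGA",
    "gene_3": "ATGCCGGGCTAG",
    "gene_4": "ATGGCGCACTAA",
    "gene_5": "ATGCGGCCGTGA",
    "gene_6": "ATGAAATTTTAA",
}

candidates = flag_atypical_genes(toy_genes)
print(candidates)
```

Because of amelioration, a screen like this can only catch relatively recent acquisitions; older transfers drift toward the host composition and stop standing out.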

  21. HGT and Gene Loss • Genes are gained by horizontal gene transfer as well as by internal processes like duplication, and genes are also lost. The relative rates of gain and loss must be balanced to keep the genome reasonably constant in size. • Probably all COG groups have had at least one horizontal transfer. • But still, most genes are transferred vertically most of the time. • Several studies have shown that most genes within a group of organisms have a common phylogeny that matches the expectation of vertical descent. • Are genes involved in replication, transcription, and translation less prone to HGT? This is based on the idea that these genes interact so intimately that they can’t be easily replaced. However, many cases of HGT in these genes have been seen, and there probably isn’t a big difference in rates. • The problem is, much HGT in informational genes is not easily detected because the COG families for these genes are “core”: nearly all species use slight variations on the same gene. By contrast, genes involved in the same metabolic function tend to fall into several different, non-homologous COGs.

  22. Selfish Operons • An operon can be thought of as a group of genes that act together to perform a single metabolic function. • Operons often even come with their own regulatory protein. • An operon thus provides a phenotype that natural selection can act on. • You can think of operons as selfish: travelling between genomes, conferring a useful trait that increases the number of copies of that operon in the world. • An example: membrane-bound ATP synthases, the primary way most species generate energy. There is an archaeal version and a bacterial version. Both are encoded by a single operon, which has been transferred many times across the domain boundary.

  23. The Prokaryotic Mobilome • “-ome” means the set of all things with this function. I will admit to feeling that this suffix is over-used these days. • The mobilome is the set of bacteriophages, plasmids, transposable elements, and associated genes that frequently travel between the genomes of cellular life. • All sequenced genomes show signs of multiple integrated phages and plasmids • Bacteriophages are everywhere: it has been estimated that there are 10 times as many phage particles as cells in sea water. • Plasmids are replicons independent of the chromosome. • Usually circular, but sometimes linear. • Usually not necessary for life, but some are very integrated into the life of the cell. • Some integrate into the chromosome. • Plasmid addiction: seen in restriction/modification systems and toxin/antitoxin systems • The plasmid encodes both a long-lived toxin and a short-lived antitoxin. If the plasmid is lost, the antitoxin decays first and the cell dies because it still has the toxin but not its antidote. • This is a selfish process: the plasmid benefits but the cell is forced to keep replicating a useless plasmid.

  24. The Principal Processes of Prokaryotic Evolution • Three basic processes: • 1. vertical transfer from parent to offspring • 2. horizontal transfer between species • 3. mobilome genes that are occasionally recruited to perform useful functions for the cells. • Another important process: gene loss • Sometimes under strong selection, as in the development of parasitism, where many genes are no longer needed. • More generally, there is a weak selection pressure to remove unneeded genes. • The gene-centric perspective, as opposed to the classical genome-centric viewpoint: individual genes can be considered distinct evolutionary units that are subject to selection across species and compete with other genes. • In the gene-centric view, a genome is a community of genes that have some degree of selfishness. • Both views have validity.
