Pangenomes How Many Microbial Genes Are There In the World?

Nicholas P. Celms, James D. Nulton, Dr. Rob Edwards, Dr. Peter Salamon


The Concept

Imagine a shopper in a store, where each item is an independent probability test of whether the shopper buys it. Now imagine that the store organizes its aisles according to the purchase probabilities of its items. The Pangenome concept is closely analogous: instead of items in a store, the focus is proteins in a genome, and instead of aisles, the proteins are grouped into pools. Figures 1 and 2 illustrate this distribution of proteins into pools.

METHODS

The analysis has two major parts, both of which rely on the Pangenome Matrix (see glossary).

Pangenome Methods
• The observed share spectrum is calculated (blue bars in Figure 2).
• The optimum number of pools is selected by Akaike Information Criterion scoring.
• The predicted share spectrum is calculated (red stars in Figure 2).
• Pool distribution predictions (see Figure 1) and probabilities result.
• The sum of the # Genes column gives the predicted total Pangenome size.

Clique and Clan Methods
• A protein's column in the Pangenome Matrix forms a binary string.
• The binary string has ones at the indices of the strains that have this protein.
• Proteins with identical binary strings form a clique.
• Cliques are identified and annotated using Perl and the NMPDR database.

Figure 5: The number of cliques identified at each size of clan.

Glossary
• Pangenome: the set of all distinct proteins found across the strains of an organism.
• ESS: the combined dataset of Escherichia (22 strains), Shigella (8 strains), and Salmonella (15 strains).
• Pangenome Matrix: a matrix with the proteins of the Pangenome as columns and the strains of the organism as rows. Entry (i, j) is 1 if strain i has protein j, and 0 if it does not.
• The Pangenome Matrix for ESS is 45 strains x 12410 proteins.
• Clique: a set of proteins that occur in exactly the same strains.
• Clan: the set of strains in which a given clique appears.
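The Pangenome Methods steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: it assumes the binomial expression of Figure 3 is the standard binomial probability, and that the expected share spectrum is the sum over pools of (genes in pool) x (binomial probability). The function names, the toy matrix, and the pool values are all invented for the example, and the fitting of pools by Akaike Information Criterion is not shown.

```python
from math import comb


def observed_share_spectrum(matrix):
    """Count proteins by the number of strains that carry them.

    matrix[i][j] is 1 if strain i has protein j, else 0.
    Entry k of the result is the number of proteins found in exactly k strains.
    """
    n_strains = len(matrix)
    spectrum = [0] * (n_strains + 1)
    for column in zip(*matrix):  # one column per protein
        spectrum[sum(column)] += 1
    return spectrum


def predicted_share_spectrum(pools, n_strains):
    """Expected number of genes found in exactly k strains, for k = 0..n_strains.

    pools is a list of (gene_count, probability) pairs. Assumed model form:
    each gene in a pool is present in each strain independently with the
    pool's probability, so its strain count is binomial.
    """
    return [
        sum(g * comb(n_strains, k) * p**k * (1 - p) ** (n_strains - k)
            for g, p in pools)
        for k in range(n_strains + 1)
    ]


def completeness(observed_genes, predicted_total):
    """Fraction of the predicted Pangenome that has already been observed."""
    return observed_genes / predicted_total


# Toy data, invented for illustration: 3 strains (rows) x 5 proteins (columns).
toy_matrix = [
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
]
# Hypothetical pools: 4 genes always present, 10 genes present half the time.
toy_pools = [(4, 1.0), (10, 0.5)]

observed = observed_share_spectrum(toy_matrix)
predicted = predicted_share_spectrum(toy_pools, n_strains=3)
predicted_total = sum(genes for genes, _ in toy_pools)
```

Under this assumed model, entry 0 of the predicted spectrum is the expected number of genes not yet seen in any sequenced strain, which is why the predicted total can exceed the number of proteins in the matrix.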
Figure 1: The number of genes and the probability for each of the 9 pools for Escherichia, Shigella, and Salmonella.
Figure 2: The share spectrum of ESS, that is, the number of genes found in each number of strains.
Figure 3: For any given protein, the probability that it will be chosen by exactly k out of the n strains is given by this binomial expression.
Figure 4: This equation defines the expectations, based on the parameters of the model.
Figure 6: This table shows one of the cliques found in Escherichia, Shigella, and Salmonella. Its binary signature is: 0110011111010111011110110100100101101110110011

Pangenome Analysis
• Pangenome analysis yields some obvious conclusions and others that are more subtle.
• Figure 2 shows that roughly 3000 proteins are found in only one of the 45 strains of ESS. These are called "distinctive features". Almost 2000 proteins were found in all 45 strains; these form the "conserved" set.
• Summing the # Genes column gives the predicted Pangenome size.
• With 12410 proteins in the Pangenome Matrix, the predicted Pangenome size for ESS is approximately 25006 genes.
• If 12410 genes have already been sequenced and the predicted total is 25006, the predicted completeness of sequencing for ESS is 12410 / 25006 = 49.6%.

Cliques and Clans

See the glossary for term definitions. Since the proteins of a clique appear in unison across their clan, it is very likely that their genes are functionally related. One example of a functionally related clique is listed in Figure 6. All of its proteins are phage-related, suggesting that these genes entered the strains by horizontal gene transfer. Analysis of clique results can help determine phylogenetic relationships, divergence events, and horizontally transferred genes. Some cliques are not statistically unlikely, while others show highly improbable clustering.
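The clique and clan definitions translate directly into a grouping of matrix columns by their binary signature. The poster states that the original work used Perl and the NMPDR database; the Python below is only a sketch of the grouping step, with hypothetical function names and invented toy data, and it does no annotation. Whether single-protein groups should count as cliques is not stated in the poster, so the `min_members` parameter is an assumption.

```python
from collections import defaultdict


def find_cliques(matrix, protein_ids):
    """Group proteins whose columns in the Pangenome Matrix are identical.

    matrix[i][j] is 1 if strain i has protein j, else 0.
    Returns a dict mapping binary signature -> list of protein ids.
    """
    cliques = defaultdict(list)
    for j, protein in enumerate(protein_ids):
        signature = "".join(str(row[j]) for row in matrix)
        cliques[signature].append(protein)
    return dict(cliques)


def clan_of(signature, strain_names):
    """Strains marked with a 1 in the signature, i.e. the clique's clan."""
    return [name for name, bit in zip(strain_names, signature) if bit == "1"]


def cliques_per_clan_size(cliques, min_members=1):
    """Number of cliques at each clan size (the quantity Figure 5 describes).

    min_members is a hypothetical filter: set it to 2 to ignore
    single-protein groups.
    """
    counts = defaultdict(int)
    for signature, members in cliques.items():
        if len(members) >= min_members:
            counts[signature.count("1")] += 1
    return dict(counts)


# Toy data, invented for illustration: 3 strains x 5 proteins.
strains = ["A", "B", "C"]
proteins = ["p1", "p2", "p3", "p4", "p5"]
toy_matrix = [
    [1, 1, 0, 1, 0],  # strain A
    [1, 0, 1, 0, 1],  # strain B
    [1, 1, 0, 1, 0],  # strain C
]

cliques = find_cliques(toy_matrix, proteins)
by_clan_size = cliques_per_clan_size(cliques, min_members=2)
```

In the toy data, p2 and p4 share the signature 101, so they form one clique whose clan is strains A and C.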
A metric for determining statistical unlikelihood is currently under development, and will make it faster to decide whether a clique merits further investigation.

Contact Information
Nicholas Celms: nick.celms@gmail.com
Contact me if you'd like a copy of the paper or further information about the Pangenomes Project.

Funding
Thanks to the National Science Foundation for funding the Undergraduate Bio Math Program at San Diego State University.
