Subsystem Approach to Genome Annotation

National Microbial Pathogen Data Resource (www.nmpdr.org). Claudia Reich, NCSA, University of Illinois, Urbana. Complete microbial genomes: 464 complete microbial genomes in NCBI as of 3-1-07; 691 microbial genomes in progress as of 3-1-07.

Presentation Transcript


  1. Subsystem Approach to Genome Annotation
     National Microbial Pathogen Data Resource, www.nmpdr.org
     Claudia Reich, NCSA, University of Illinois, Urbana

  2. Complete Microbial Genomes
     • 464 complete microbial genomes in NCBI as of 3-1-07
     • 691 microbial genomes in progress as of 3-1-07

  3. Making Sense of Genome Data
     • Locate Genes: identify ORFs automatically
       • GeneMark
       • NCBI’s ORF Finder
       • Glimmer
       • Critica
     • Assign Function: by sequence similarity to experimentally characterized proteins
       • BLAST family of sequence comparison tools
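
To make "identify ORFs automatically" concrete, here is a minimal sketch of a naive six-frame scan for ATG-to-stop runs. It is an illustration only: the function names, the `min_codons` threshold, and the toy sequence are assumptions, and dedicated gene finders such as Glimmer and GeneMark use statistical models of coding sequence instead of this simple scan.

```python
# Naive ORF scan (illustrative sketch; not the method of any tool named on the slide).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end) tuples for ATG...stop runs.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand. Length is counted in codons, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ATG in this frame
    return orfs


# Toy sequence (made up): one short ORF, ATG AAA TTT TAG, on the forward strand.
toy = "CCATGAAATTTTAGCC"
short_orfs = find_orfs(toy, min_codons=3)
```

A real pipeline would also have to handle alternative start codons, overlapping candidates, and ORFs that run off the end of a contig, none of which this sketch attempts.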

  4. Problems with Assignments by Similarity
     • When ORF is a member of a protein family
       • Paralogous genes
       • ORFs encoding similar proteins acting on different substrates
     • Assignments can be transitive, and many times removed from experimental data

  5. Other Factors Can Aid in Function Assignments
     • Molecular phylogeny
     • Paralogous and orthologous families
     • Conserved gene neighborhood
     • Metabolic context
     • Bidirectional best hit matches across multiple genomes
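
The bidirectional best hit criterion in the last bullet can be sketched as follows. The input format (nested dictionaries of similarity scores), the gene identifiers, and the scores are all hypothetical; in practice the scores would come from a sequence comparison tool such as BLAST.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: {query_gene: {subject_gene: score}}
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores for two genomes A and B.
a_vs_b = {"a1": {"b1": 250, "b2": 90}, "a2": {"b1": 120}}
b_vs_a = {"b1": {"a1": 245, "a2": 118}, "b2": {"a1": 88}}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Here `a2` also hits `b1` best, but `b1` prefers `a1`, so only the `a1`/`b1` pair is kept. This is the sense in which the criterion helps separate likely orthologs from paralogs.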

  6. Incorporating Information Other Than Similarity
     • KEGG: manually curated pathway and metabolic maps
     • GO: vocabularies that describe ORFs as associated with
       • biological processes
       • cellular components
       • molecular functions
     • MetaCyc: experimentally elucidated metabolic pathways

  7. What is Needed:
     • A system that:
       • integrates all the above concepts
       • organizes genomic data in structured idioms
       • allows high-throughput annotation of newly sequenced genomes
       • resolves discrepancies in different annotation tools
       • informs experimental research

  8. Enter the SEED*
     • Database and annotation environment
     • Underlies, and is accessible through, NMPDR (www.nmpdr.org)
     • Expert annotation via subsystem building
     • Provides the most accurate genome annotations available
     *Argonne National Lab, University of Chicago, UIUC, FIG

  9. What is a Subsystem?
     • Any organizing biological principle:
       • metabolic pathway
         • amino acid biosynthesis, nitrogen fixation, glycolysis
       • complex structure
         • ribosome, flagellum
       • set of defining features
         • virulome, pathogenicity islands
       • functional concept
         • bacterial sigma factors, DNA binding proteins

  10. Subsystems are:
     • Sets of functional roles, which are functions, or abstractions of functions (such as an EC number), that together implement a specific biological process or concept
     • Created manually by expert curators
     • Experts annotate single subsystems over the complete collection of genomes, thus contributing and sharing their expertise with the scientific community

  11. How Subsystems are Built
     • Create a subsystem for the biological concept, and define the functional roles
     • In one (or a few) key organisms that include the subsystem, find the genes and assign meaningful functional names
     • Project the annotations to orthologous genes
     • Expand to more genomes, creating a Populated Subsystem
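
The build steps above can be sketched in a few lines. This is a deliberately simplified illustration, not the SEED's actual projection procedure, which the slides do not describe. The function name, the gene and genome identifiers, and the ortholog pairs are hypothetical; the role names borrow the RNA polymerase subunits mentioned later in the talk.

```python
def populate_subsystem(roles, reference_roles, orthologs_by_genome):
    """Project reference annotations onto orthologs in other genomes.

    roles: list of functional role names defined for the subsystem
    reference_roles: {reference_gene: role} from the key organism
    orthologs_by_genome: {genome: [(reference_gene, gene_in_genome), ...]}
    Returns {genome: {role: [genes]}}.
    """
    sheet = {}
    for genome, pairs in orthologs_by_genome.items():
        row = {role: [] for role in roles}
        for ref_gene, gene in pairs:
            role = reference_roles.get(ref_gene)
            if role in row:  # skip genes with no role in this subsystem
                row[role].append(gene)
        sheet[genome] = row
    return sheet


# Hypothetical data: roles defined first, then annotated in a key organism.
roles = ["RNAP alpha", "RNAP beta", "RNAP beta'"]
reference_roles = {"ref_1": "RNAP alpha", "ref_2": "RNAP beta", "ref_3": "RNAP beta'"}
orthologs_by_genome = {
    "genome_X": [("ref_1", "gX_10"), ("ref_2", "gX_11"), ("ref_3", "gX_12")],
    "genome_Y": [("ref_1", "gY_4"), ("ref_9", "gY_7")],
}

populated = populate_subsystem(roles, reference_roles, orthologs_by_genome)
```

In `genome_Y` only the alpha subunit is projected. The other two roles stay empty, and the gene paired with `ref_9` is ignored because `ref_9` has no role in this subsystem.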

  12. Populated Subsystems
     • Are Spreadsheets where:
       • Columns: functional roles
       • Rows: specific genomes
       • Cells: genes in the organism that implement the functional role
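
One way to picture this layout is as a nested mapping rendered as a tab-separated table. The sketch below is only an illustration of the rows/columns/cells idea: the function name, genome names, role names, and gene identifiers are invented and do not reflect how NMPDR stores or displays its spreadsheets.

```python
def format_spreadsheet(roles, sheet):
    """Render {genome: {role: [genes]}} as tab-separated text.

    One row per genome, one column per functional role; a cell lists the
    genes implementing that role, or "-" when none is assigned.
    """
    lines = ["\t".join(["genome"] + roles)]
    for genome in sorted(sheet):
        cells = [", ".join(sheet[genome].get(role, [])) or "-" for role in roles]
        lines.append("\t".join([genome] + cells))
    return "\n".join(lines)


# Hypothetical populated subsystem with two roles and two genomes.
roles = ["role_1", "role_2"]
sheet = {
    "genome_X": {"role_1": ["gX_1"], "role_2": ["gX_2", "gX_9"]},
    "genome_Y": {"role_1": ["gY_5"], "role_2": []},
}

table = format_spreadsheet(roles, sheet)
print(table)
```

Keeping a list in every cell matters: a cell can be empty (no gene found for the role) or hold several genes (more than one candidate), and both cases are informative.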

  13. How to Access Subsystems
     • From Home page (left navigation bar): Subsystem Summaries: select organism
     • From Organism pages
     • From Subsystem Search
     • From protein pages: to specific subsystems

  14. Subsystem Pages in NMPDR
     • Table of Functional Roles
     • Subsystem diagram (if appropriate)
     • Populated subsystem spreadsheet
     • Customizable spreadsheet viewing options
     • Functional variants and subsets of roles
     • Curator’s notes

  15. Benefits of Subsystems
     • More accurate annotations
     • Annotation of protein families
     • Analysis of sets of functionally related proteins
     • Less error-prone automatic projection of annotations to novel genomes

  16. Subsystems Reveal Interesting
     • Pathway variants:
       • Are they clustered by phylogeny?
         • Delta subunit of RNA polymerase found only in Bacillales
       • Are they clustered by functional niche?
       • Horizontal gene transfer?
     • Fused genes:
       • β and β′ subunits of RNA polymerase are fused in Helicobacter
     • Fissioned genes:
       • β′ subunit of RNA polymerase is fissioned in Cyanobacteria

  17. Subsystems Reveal Interesting
     • Duplicate assignments
       • More than one gene for one functional role?
         • Alpha subunit of RNA polymerase in Magnetococcus and Francisella
       • Same sequenced region in more than one contig in partially assembled genomes?
       • Frameshifts or other sequencing errors?
       • Annotation errors?

  18. Subsystems Reveal Interesting
     • Missing genes:
       • Is the function essential?
       • Is the function conserved?
       • Does the missing gene cluster with homologs in other organisms?
       • Is the function performed by a newly recruited gene?
       • Has a gene been acquired by horizontal gene transfer and now performs that function?
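
Duplicate assignments and missing genes both show up as unusual cells in a populated subsystem spreadsheet, so a first pass at finding them can be automated. The sketch below assumes the spreadsheet is held as `{genome: {role: [genes]}}`; that structure, the function name, and all identifiers are hypothetical. It only flags candidates. Deciding whether a flagged cell reflects biology, an assembly artifact, or an annotation error remains the curator's job, as the questions on the slides make clear.

```python
def audit_subsystem(roles, sheet):
    """Flag empty and multiply-filled cells in a populated subsystem.

    Returns (missing, duplicated) where
      missing    = [(genome, role), ...] for cells with no gene
      duplicated = [(genome, role, (genes...)), ...] for cells with >1 gene
    """
    missing, duplicated = [], []
    for genome in sorted(sheet):
        for role in roles:
            genes = sheet[genome].get(role, [])
            if not genes:
                missing.append((genome, role))
            elif len(genes) > 1:
                duplicated.append((genome, role, tuple(genes)))
    return missing, duplicated


# Hypothetical spreadsheet: genome_Y lacks role_2, genome_X has two candidates.
roles = ["role_1", "role_2"]
sheet = {
    "genome_X": {"role_1": ["gX_1"], "role_2": ["gX_2", "gX_9"]},
    "genome_Y": {"role_1": ["gY_5"], "role_2": []},
}

missing, duplicated = audit_subsystem(roles, sheet)
```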

  19. Synthesis of Selenocysteinyl-tRNA
     • Two known pathway variants
       • One step in Bacteria
         • SelA is annotated
       • Two steps in Archaea and Eucarya
         • PSTK was missing until very recently

  20. Explore Selenocysteine Usage
     • Start by searching for gene name, selA, in an organism known to use Sec, E. coli K12
     • Start from subsystem tree; expand category of "Protein metabolism," expand subcategory of "Selenoproteins"
     • Open "Selenocysteine metabolism" subsystem from protein page or SS tree
     • Genomes arranged phylogenetically
     • Roles defined on mouse-over
     • What genes are missing in which organisms?
     • Are there Sec metabolism genes present in any organisms that do not have proteins that need Sec?
     • Are there organisms known to need Sec for certain proteins, but that do not have a complete Sec biosynthesis pathway?
     • Why is there a hypothetical protein included in this subsystem?
