
Wade M. Sheldon Mary Ann Moran James T. Hollibaugh

Efforts to Link Ecological Metadata with Bacterial Gene Sequences at the Sapelo Island Microbial Observatory. Wade M. Sheldon, Mary Ann Moran, James T. Hollibaugh.


Presentation Transcript


  1. Efforts to Link Ecological Metadata with Bacterial Gene Sequences at the Sapelo Island Microbial Observatory • Wade M. Sheldon, Mary Ann Moran, James T. Hollibaugh

  2. Genetic Sequence Databases • Major informatics success story • Large repositories for nucleotide sequences (e.g., GenBank/EMBL/DDBJ, ~16M) • Automated and web-based data submission, required as part of the publication process • Standardized alignment/search tools support use for classification • Numerous ‘environmental sequences’: ecologists now use them to study biogeography, community structure, eco-physiology

  3. Problems with GenBank • Metadata voluntary – limited in scope • Title (definition), authors, key words, comments, literature citation • Many sequences unpublished, undescribed • Quality control standards poorly enforced • No direct way to provide links to ancillary data (URLs not officially supported, often removed) • Very inefficient and often impossible for investigators to obtain ecological context information, even from journals • Comparisons of matched taxa by traits not possible

  4. Consequence • Tremendous amount of bacterial sequence data relevant to microbial ecologists • No established interface

  5. Example – Insufficient Metadata

  6. Sapelo Island Microbial Observatory (http://simo.marsci.uga.edu) • MObs – NSF-funded network of sites or "microbial observatories" established to discover novel microorganisms, microbial consortia, communities, activities and other novel properties, and to study their roles in diverse environments • Projects supported are expected to establish or participate in an established, Internet-accessible knowledge network to disseminate the information resulting from these activities • SIMO - Investigating the diversity of prokaryotes, their physiological and genetic characteristics, and their biogeochemical activities in a salt marsh/estuarine ecosystem in the southeastern U.S. • Knowledge networks: • GenBank • GCE-LTER IS • SIMO 16S rRNA Database

  7. SIMO 16S rRNA Database • Purpose: LIMS, research tool, data dissemination • Designed to store sequence data and all supporting SIMO research information • Hierarchical structure modeled after research workflow • Metadata on site geography, sample collection, all methodology, personnel, ancillary measurements • Extensive content control, error checking • Links to information in external databases (RDP II, GenBank, GCE-LTER) • Queries by phylogenetic and/or ecological characteristics

  8. Conceptual Diagram of the SIMO Database

  9. List-based data entry linked to metadata tables

  10. Controlled vocabulary supports finely targeted queries • Automatic hyperlinks provide links to tasks

  11. List-based queries also simplify public interface

  12. Phylogenetic and ecological characteristics combined dynamically to create overview and query interface

  13. SIMO Metadata • Metadata primarily stored in managed lists, linked to records by foreign key fields • Scalable design: details can be added independently without altering data records • Complete metadata for sequences generated by relational joins • Links to external metadata in the GCE-LTER database add site geography, research history, long-term environmental characteristics
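
The managed-list design on slide 13 can be illustrated with a minimal sketch. The slides do not show the actual SIMO schema, so every table name, column name, and value below is a hypothetical stand-in, and SQLite is used only because it ships with Python. The sketch stores site and method details once in list tables, links each sequence record to them by foreign keys, and assembles the complete metadata with a relational join.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the actual SIMO database design.
SCHEMA = """
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    habitat TEXT
);
CREATE TABLE method (
    method_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE sequence (
    seq_id    INTEGER PRIMARY KEY,
    label     TEXT NOT NULL,
    site_id   INTEGER NOT NULL REFERENCES site(site_id),
    method_id INTEGER NOT NULL REFERENCES method(method_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up record per table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO site VALUES (1, 'Demo Site A', 'salt marsh')")
    conn.execute("INSERT INTO method VALUES (1, 'demo extraction protocol')")
    conn.execute("INSERT INTO sequence VALUES (1, 'demo-seq-001', 1, 1)")
    conn.commit()
    return conn


def full_metadata(conn, seq_id):
    """Join a sequence record to its managed-list entries."""
    row = conn.execute(
        """
        SELECT s.label, site.name AS site, site.habitat, m.name AS method
        FROM sequence AS s
        JOIN site ON site.site_id = s.site_id
        JOIN method AS m ON m.method_id = s.method_id
        WHERE s.seq_id = ?
        """,
        (seq_id,),
    ).fetchone()
    return dict(row) if row is not None else None


if __name__ == "__main__":
    conn = build_demo_db()
    print(full_metadata(conn, 1))
    # Editing the list entry changes what the join reports for every
    # linked sequence, while the sequence row itself is left untouched.
    conn.execute("UPDATE site SET habitat = 'tidal creek' WHERE site_id = 1")
    print(full_metadata(conn, 1))
```

Because the descriptive details live only in the list tables, a correction or addition there reaches every linked sequence through the join, which is one way to read the "scalable design" point on the slide.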

  14. Metadata Standards • No existing standard for environmental sequence metadata • Sequence formats (FASTA, BIOML, BSML) designed for data parsing, sequence annotation • SIMO metadata currently displayed in summary form on sequence detail pages • Exploring adoption of emerging standards such as EML

  15. Sequence Details

  16. Future Directions • Incorporating batch upload features for library submissions • Integrating database with ‘RDP SeqMatch Agent’ programs for automatic phylogenetic analysis, sequence annotation • Providing full metadata in formatted/printable and parsable ASCII formats (XML) • Participating in Entrez Link-Out to provide links to SIMO sequence entries from GenBank
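
As a rough illustration of the parsable XML output mentioned on slide 16, the sketch below serializes one metadata record with Python's standard `xml.etree.ElementTree` module. The element names, the record layout, and the `metadata_to_xml` helper are assumptions made for this example; they do not follow EML or any format the SIMO project may have adopted.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(record):
    """Serialize one record to an XML string.

    `record` is a hypothetical layout: {"id": ..., "metadata": {field: value}}.
    Field names are used as element names, so they must be valid XML names.
    """
    root = ET.Element("sequence", attrib={"id": str(record["id"])})
    for field, value in record["metadata"].items():
        child = ET.SubElement(root, field)
        child.text = str(value)
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    demo = {
        "id": 1,
        "metadata": {"site": "Demo Site A", "habitat": "marsh & creek"},
    }
    print(metadata_to_xml(demo))
```

Special characters such as `&` are escaped by the serializer, so the output can be read back with any standard XML parser.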
