Standardized Metadata for Human Pathogen/Vector Genomic Sequences
Richard H. Scheuermann, Ph.D.
Director of Informatics, J. Craig Venter Institute
On behalf of the GSC-BRC Metadata Working Group
• Genome Sequencing Centers for Infectious Disease (GSCID)
• Bioinformatics Resource Centers (BRC)
• www.viprbrc.org
• www.fludb.org
High Throughput Sequencing
• Enabling technology
  • Epidemiology of outbreaks
  • Pathogen evolution
  • Host range restriction
  • Genetic determinants of virulence and pathogenicity
• Metadata requirements
  • Temporal-spatial information about isolates
  • Selective pressures
  • Host species of specimen source
  • Disease severity and clinical manifestations
Metadata Submission Spreadsheets
[Figure: example metadata submission spreadsheets]
Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
• Required extensive custom bioinformatics system development
GSC-BRC Metadata Standards Working Group
• NIAID assembled representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
• Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
• Bottom-up approach to capture data considered important by users
• Compatible with data standards and submission requirements
Metadata Standardization Process
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen- or project-specific
• For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
• Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)
• Compare, map, and harmonize to other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProject/BioSample
• Draft data submission spreadsheets
• Beta test the version 1.0 standard with new GSCID white paper projects, collecting feedback
• Adopt the version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
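The per-field attributes listed above (preferred term, definition, synonyms, allowed values, expected syntax) can be sketched as a small data structure with a validation step. This is only an illustration: the class `FieldSpec`, the function `harmonize`, the two example fields, and their value sets are hypothetical and are not taken from the working group's actual standard.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One metadata field with the attributes named in the process above."""
    preferred_term: str
    definition: str
    synonyms: tuple = ()
    allowed_values: Optional[frozenset] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                # regex for the expected syntax

    def matches_name(self, name):
        # A submitted column matches on the preferred term or any synonym.
        names = {self.preferred_term, *self.synonyms}
        return name.strip().lower() in {n.lower() for n in names}

    def check(self, value):
        # Return an error message, or None when the value is acceptable.
        if self.allowed_values is not None and value not in self.allowed_values:
            return f"{self.preferred_term}: {value!r} not in allowed values"
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return f"{self.preferred_term}: {value!r} does not match expected syntax"
        return None


# Hypothetical example fields (assumptions, not the real standard).
SPECS = [
    FieldSpec(
        preferred_term="specimen collection date",
        definition="Date on which the specimen was collected.",
        synonyms=("collection date",),
        syntax=r"\d{4}-\d{2}-\d{2}",
    ),
    FieldSpec(
        preferred_term="host sex",
        definition="Sex of the host organism.",
        synonyms=("gender",),
        allowed_values=frozenset({"male", "female", "unknown"}),
    ),
]


def harmonize(record, specs):
    """Rename submitted fields to preferred terms and collect validation errors."""
    clean, errors = {}, []
    for name, value in record.items():
        spec = next((s for s in specs if s.matches_name(name)), None)
        if spec is None:
            errors.append(f"unknown field: {name}")
            continue
        problem = spec.check(value)
        if problem:
            errors.append(problem)
        else:
            clean[spec.preferred_term] = value
    return clean, errors
```

For instance, `harmonize({"Collection Date": "2011-03-14", "gender": "F"}, SPECS)` would keep the date under its preferred term and report the value `"F"` as outside the allowed value set.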
Metadata Processes
[Figure: semantic network of the metadata processes. Panels: Investigation, Host Characterization, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing. Entities shown include specimen source (organism or environmental), specimen isolation process, specimen, specimen collector, isolation protocol, sample processing, enriched NA sample, microorganism, microorganism genomic NA, quality assessment assay, input sample, reagents, equipment, technician, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, and temporal-spatial regions. Relations shown: has_input, has_output, has_part, has_quality, has_specification, located_in, instance_of, denotes, is_about.]
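One way to read the network is as a set of subject/relation/object triples in which each process is linked to its inputs and outputs. The sketch below is an assumption-laden illustration: the node and relation names are taken from the figure labels, but the specific edges in `TRIPLES` are a plausible reconstruction rather than a transcription of the diagram, and `trace_back` is a hypothetical helper that follows only the first listed input of each process.

```python
from collections import defaultdict

# (subject, relation, object) triples. Edge choices are assumed for illustration.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("assembly", "has_input", "primary data"),
    ("assembly", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
]


def index_processes(triples):
    """Map each entity to the process that outputs it, and each process to its inputs."""
    produced_by = {}
    inputs = defaultdict(list)
    for subject, relation, obj in triples:
        if relation == "has_output":
            produced_by[obj] = subject
        elif relation == "has_input":
            inputs[subject].append(obj)
    return produced_by, inputs


def trace_back(entity, triples):
    """Walk from an entity back toward its origin, alternating entity and process."""
    produced_by, inputs = index_processes(triples)
    chain = [entity]
    seen = {entity}
    while chain[-1] in produced_by:
        process = produced_by[chain[-1]]
        chain.append(process)
        upstream = inputs.get(process, [])
        if not upstream or upstream[0] in seen:
            break  # no recorded input, or a cycle: stop here
        chain.append(upstream[0])
        seen.add(upstream[0])
    return chain
```

Calling `trace_back("sequence data record", TRIPLES)` walks from the archived record back through archiving, assembly, sequencing, and sample processing to the specimen source, which is the kind of provenance question a shared semantic structure is meant to make answerable.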
Specimen Isolation
[Figure: semantic network for the specimen isolation procedure, annotated with field codes CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, and CS18. Labels shown include organism, organism part, ID, gender, age, health status, pathogenic disposition, environmental material, environment, specimen source role, specimen capture role, specimen collector role, person, affiliation, name, equipment, hypothesis, isolation protocol, IRB/IACUC approval, specimen isolation procedure X, specimen isolation procedure type, specimen X, specimen type, temporal interval (date/time), spatial region (GPS location, geographic location), and temporal-spatial region. Relations shown: has_input, has_output, has_part, has_quality, has_disposition, has_specification, has_authorization, has_affiliation, located_in, plays, instance_of, denotes, is_about.]
Outcome of Metadata Standards WG
• Consistent metadata captured across GSCID
• Bottom-up approach focuses the standard on important features
• Support more standardized BRC interface development
• Harmonization with related stakeholders: Genomic Standards Consortium MIxS, OBO Foundry OBI, and NCBI BioProject/BioSample
• Represented in the context of an extensible semantic framework
Utility of semantic representation
• Identified gaps in data field list (e.g. temporal components)
• Includes logical structure for other, project-specific data fields (extensible)
• Identified gaps in ontology data standards (use case-driven standard development)
• Identified commonalities in data structures (reusable)
• Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
  • Sequencing => “omics”
Acknowledgements
Bruce Birren [2,b], Lauren Brinkac [1,a], Vincent Bruno [3,c], Elizabeth Caler [1,a], Ishwar Chandramouliswaran [1,a], Sinéad Chapman [2,b], Frank Collins [8,h], Christina Cuomo [2,b], Joana Carneiro Da Silva [3,c], Valentina Di Francesco [4], Vivien Dugan [1,a], Scott Emrich [8,h], Mark Eppinger [3,c], Michael Feldgarden [2,b], Claire Fraser [3,c], W. Florian Fricke [3,c], Maria Giovanni [4], Gloria Giraldo-Calderon [8,h], Omar S. Harb [5,g], Matt Henn [2,b], Erin Hine [3,c], Julie Dunning Hotopp [3,c], Jessica C. Kissinger [6,g], EunMi Lee [4], Punam Mathur [4], Garry Myers [3,c], Emmanuel Mongodin [3,c], Cheryl Murphy [2,b], Dan Neafsey [2,b], Karen Nelson [1,a], Ruchi Newman [2,b], William Nierman [1,a], Brett E. Pickett [1,d,e], Julia Puzak [4], David Rasko [3,c], David S. Roos [5,g], Lisa Sadzewica [3,c], Richard H. Scheuermann [1,d,e], Lynn M. Schriml [3,c], Bruno Sobral [7,f], Tim Stockwell [1,a], Chris Stoeckert [5,g], Dan Sullivan [7,f], Luke Tallon [3,c], Herve Tettelin [3,c], Doyle V. Ward [2,b], David Wentworth [1,a], Owen White [3,c], Rebecca Will [7,f], Jennifer Wortman [2,b], Alison Yao [4], Jie Zheng [5,g]

[1] J. Craig Venter Institute, Rockville, MD and San Diego, CA
[2] Broad Institute, Cambridge, MA
[3] Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD
[4] National Institute of Allergy and Infectious Diseases, Rockville, MD
[5] University of Pennsylvania, Philadelphia, PA
[6] University of Georgia, Athens, GA
[7] Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA
[8] University of Notre Dame, South Bend, IN

[a] J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases
[b] Broad Institute Genome Sequencing Center for Infectious Diseases
[c] Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases
[d] Influenza Research Database Bioinformatics Resource Center
[e] Virus Pathogen Resource Bioinformatics Resource Center
[f] PATRIC Bioinformatics Resource Center
[g] EuPathDB Bioinformatics Resource Center
[h] VectorBase Bioinformatics Resource Center

Tanya Barrett – NCBI
Pelin Yilmaz – Genomic Standards Consortium

N01AI2008038 / N01AI40041