Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics

Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics U.T. Southwestern Medical Center N01AI2008038 N01AI40041

Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers Richard H. Scheuermann, Ph.D. Director of Informatics J. Craig Venter Institute N01AI2008038 N01AI40041

Genome Sequencing Centers for Infectious Disease (GSCID)
Bioinformatics Resource Centers (BRC)
www.viprbrc.org
www.fludb.org

High Throughput Sequencing
• Enabling technology
  – Epidemiology of outbreaks
  – Pathogen evolution
  – Host range restriction
  – Genetic determinants of virulence and pathogenicity
• Metadata requirements
  – Temporal-spatial information about isolates
  – Selective pressures
  – Host species of specimen source
  – Disease severity and clinical manifestations

Metadata Submission Spreadsheets

Complex Query Interface

Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
• Required extensive custom bioinformatics system development

GSC-BRC Metadata Standards Working Group
• NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
• Develop metadata standards for pathogen isolate sequencing projects
• Bottom-up approach
• Assemble into a semantic framework

GSC-BRC Metadata Working Groups

Metadata Standards Process
• Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project-specific
• For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), and expected syntax
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble the set of pathogen-specific and project-specific metadata fields to be used in conjunction with the core fields
• Compare, harmonize, and map to other relevant initiatives, including OBI, MIGS, MIxS, BioProjects, BioSamples (ongoing)
• Assemble all metadata fields into a semantic network (ongoing)
• Harmonize the semantic network with the Ontology for Biomedical Investigations (OBI)
• Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
• Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet
• Beta test version 1.0 standard with new white paper projects, collecting feedback
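The step of separating core fields from project-specific fields can be sketched with simple set operations. This is only an illustration, not the working group's actual procedure: the project names, field names, and the `min_fraction` threshold below are all hypothetical.

```python
from collections import Counter


def split_core_and_specific(project_fields, min_fraction=1.0):
    """Partition field names into a shared core and per-project extras.

    project_fields maps a project name to the set of field names it reports.
    A field counts as core when at least min_fraction of the projects
    report it (1.0 means every project).
    """
    n_projects = len(project_fields)
    # Count in how many projects each field name appears.
    counts = Counter(f for fields in project_fields.values() for f in set(fields))
    core = {f for f, c in counts.items() if c / n_projects >= min_fraction}
    # Whatever a project reports beyond the core is project-specific.
    specific = {p: set(fields) - core for p, fields in project_fields.items()}
    return core, specific


# Hypothetical example metadata sets for one pathogen subgroup.
virus_projects = {
    "project_a": {"specimen_id", "collection_date", "host_species", "passage_history"},
    "project_b": {"specimen_id", "collection_date", "host_species", "vaccination_status"},
    "project_c": {"specimen_id", "collection_date", "host_species"},
}

core, specific = split_core_and_specific(virus_projects)
print(sorted(core))
print(specific["project_a"])
```

Lowering `min_fraction` would treat fields reported by most, rather than all, projects as core candidates, which may be closer to how a manual review would flag fields that "appear to be common".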

Data Fields: Core Project Core Sample Attributes
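As a minimal sketch of how a data field and its attributes (definition, synonyms, allowed value set, expected syntax) might be represented and checked, consider the following. The field names, definitions, allowed values, and date pattern are assumptions made for illustration and are not taken from the standard itself.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class FieldSpec:
    """One metadata field with the attributes listed for each data field."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None               # regular expression, if any

    def validate(self, value: str) -> List[str]:
        """Return a list of problems found with value (empty if none)."""
        errors = []
        if self.allowed_values is not None and value not in self.allowed_values:
            errors.append(f"{self.name}: {value!r} not in allowed value set")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            errors.append(f"{self.name}: {value!r} does not match expected syntax")
        return errors


# Hypothetical field specifications (not the published standard).
SPECS = [
    FieldSpec(
        name="collection_date",
        definition="Date on which the specimen was collected",
        synonyms=["isolation date"],
        syntax=r"\d{4}-\d{2}-\d{2}",
    ),
    FieldSpec(
        name="gender",
        definition="Gender of the host organism",
        allowed_values={"male", "female", "unknown"},
    ),
]


def validate_row(specs: List[FieldSpec], row: Dict[str, str]) -> List[str]:
    """Check one spreadsheet row (field name -> value) against the specs."""
    errors = []
    for spec in specs:
        if spec.name not in row:
            errors.append(f"{spec.name}: missing")
        else:
            errors.extend(spec.validate(row[spec.name]))
    return errors


good = {"collection_date": "2011-03-30", "gender": "female"}
bad = {"collection_date": "30/03/2011", "gender": "F"}
print(validate_row(SPECS, good))
print(validate_row(SPECS, bad))
```

A check of this kind could run when a submission spreadsheet is loaded, so that nomenclature problems are reported per field rather than discovered during downstream analysis.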

Specimen Isolation
[Figure: semantic network for specimen isolation. Node labels: temporal interval, date/time, ID, gender, age, health status, spatial region, GPS location, geographic location, temporal-spatial region, organism, organism part, specimen source role, environment, environmental material, pathogenic disposition, specimen isolation procedure X, specimen isolation procedure type, specimen X, specimen type, specimen capture role, equipment, hypothesis, specimen collector role, person, name, affiliation, isolation protocol, IRB/IACUC approval. Relation labels: denotes, has_part, has_quality, has_disposition, located_in, plays, has_input, has_output, instance_of, is_about, has_specification, has_authorization, has_affiliation. Core sample field annotations: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]

Metadata Processes
[Figure: semantic network of the processes in an investigation. Panels: Investigation, Specimen Isolation, Material Processing, Quality Assessment, Sequencing Assay, Data Processing. Node labels: specimen source (organism or environmental), specimen isolation process, specimen, specimen collector, specimen type, ID, isolation protocol, sample processing, enriched NA sample, microorganism, microorganism genomic NA, quality assessment assay, qualities, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly), sequence data, data transformations (variant detection, serotype marker detection, gene detection), genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, temporal-spatial region. Relation labels: has_input, has_output, located_in, has_quality, has_part, has_specification, instance_of, denotes, is_about.]
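The relation names on this slide (has_input, has_output, located_in) suggest a network of subject-relation-object statements. A minimal sketch of such a network, with a pattern query and a provenance trace from a sequence back to its specimen source, might look like this. The instance identifiers (`host_1`, `isolation_1`, and so on) are hypothetical, and plain tuples stand in for whatever ontology tooling a real implementation would use.

```python
# Each statement is a (subject, relation, object) triple.
TRIPLES = {
    ("isolation_1", "has_input", "host_1"),
    ("isolation_1", "has_output", "specimen_1"),
    ("isolation_1", "located_in", "region_1"),
    ("processing_1", "has_input", "specimen_1"),
    ("processing_1", "has_output", "na_sample_1"),
    ("sequencing_1", "has_input", "na_sample_1"),
    ("sequencing_1", "has_output", "reads_1"),
    ("assembly_1", "has_input", "reads_1"),
    ("assembly_1", "has_output", "sequence_1"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def upstream(triples, entity):
    """Collect every entity that entity was derived from.

    Walks backwards: find the process whose output is the current entity,
    then follow that process's inputs, and repeat until nothing new appears.
    """
    found = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for process, _, _ in match(triples, p="has_output", o=current):
            for _, _, source in match(triples, s=process, p="has_input"):
                if source not in found:
                    found.add(source)
                    frontier.append(source)
    return found


print(match(TRIPLES, s="isolation_1", p="located_in"))
print(sorted(upstream(TRIPLES, "sequence_1")))
```

Because each process links its inputs to its outputs, a question such as "which host did this sequence come from" becomes a graph traversal rather than a project-specific join.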

Outcome of Metadata Standards WG
• Consistent metadata captured across GSCID
• Guidance to collaborators regarding metadata expectations for sequencing and analysis services
• Support more standardized BRC interface development
• Harmonization with related stakeholders: Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioSample
• Represented in the context of an extensible semantic framework

Conclusions
• Metadata standards for microorganism sequencing projects
  – Bottom-up approach focuses standard on important features
  – Harmonizing with related standards from the Genomic Standards Consortium, OBO Foundry and NCBI
  – Being beta-tested by GSCIDs for adoption by all NIAID-sponsored sequencing projects
• Utility of semantic representation
  – Identified gaps in data field list (e.g. temporal components)
  – Includes logical structure for other, project-specific, data fields (extensible)
  – Identified gaps in ontology data standards (use case-driven standard development)
  – Identified commonalities in data structures (reusable)
  – Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
  – Sequencing => “omics”
