Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects
Richard H. Scheuermann, Ph.D.
Department of Pathology, Division of Biomedical Informatics
U.T. Southwestern Medical Center
NIAID Bioinformatics Resource Centers www.pathogenportal.net
Influenza Research Database www.fludb.org
Metadata Inconsistencies • Each project was providing different types of metadata • No consistent nomenclature being used • Impossible to perform reliable comparative genomics analysis
GSC-BRC Metadata Standards Working Group • NIAID assembled representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and its five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) • Goal: develop metadata standards for pathogen isolate sequencing projects
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  • definitions
  • synonyms
  • allowed value sets, preferably using controlled vocabularies
  • expected syntax
  • examples
  • data categories
  • data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network
• Harmonize semantic network with the Ontology of Biomedical Investigation (OBI)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
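The per-field attributes listed above can be captured in a small record type. The following is a minimal sketch, not the working group's actual schema: the class name `FieldSpec`, the two example fields, their vocabularies, and the date pattern are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FieldSpec:
    """One metadata field, described by the attributes named on the slide."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None               # expected syntax, as a regex
    examples: List[str] = field(default_factory=list)
    category: str = "core"                     # "core" or "project-specific"
    provider: str = ""                         # who supplies the value

    def validate(self, value: str) -> List[str]:
        """Return a list of problems; an empty list means the value passes."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} not in allowed value set")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: {value!r} does not match expected syntax")
        return problems


# Hypothetical fields (assumptions, not taken from the core field lists).
collection_date = FieldSpec(
    name="specimen_collection_date",
    definition="Date on which the specimen was collected.",
    synonyms=["collection date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    examples=["2011-03-15"],
    provider="specimen collector",
)
specimen_type = FieldSpec(
    name="specimen_type",
    definition="Kind of material the specimen consists of.",
    allowed_values={"blood", "nasal swab"},  # placeholder vocabulary
    examples=["nasal swab"],
)

print(collection_date.validate("15/03/2011"))
print(specimen_type.validate("nasal swab"))
```

Keeping the allowed value set and the syntax pattern on the field record means a submission spreadsheet can be checked row by row against the same definitions that document the field.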
Core Sample Metadata 30 Core Sample Metadata Fields
Core Project Metadata 16 Core Project Metadata Fields
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  • definitions
  • synonyms
  • allowed value sets, preferably using controlled vocabularies
  • expected syntax
  • examples
  • data categories
  • data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network (Scheuermann)
• Harmonize semantic network with the Ontology of Biomedical Investigation (OBI) (Stoeckert, Zheng)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
Specimen Isolation [Figure: semantic network for the specimen isolation procedure. Entities: specimen isolation procedure X, specimen isolation procedure type, isolation protocol, specimen X, specimen type, organism, organism part, organism ID, environment, environmental material, equipment, person, affiliation, affiliation name, hypothesis, IRB/IACUC approval, comments, and a temporal-spatial region (temporal interval, date/time, spatial region, GPS location, geographic location). Roles: specimen source role, specimen capture role, specimen collector role; one node is marked "????" in the original. Relations: has_input, has_output, has_part, has_quality, has_specification, has_authorization, has_affiliation, instance_of, is_about, located_in, denotes, plays. Nodes carry labels v2 through v19 and b18 through b30.]
Metadata Processes [Figure: overview semantic network linking five stages: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, and Data Processing. Entities: specimen source (organism or environmental), specimen isolation process, isolation protocol, specimen collector, specimen, sample processing, enriched NA sample, microorganism, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly; variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, type, ID, qualities, and temporal-spatial regions. Relations: has_input, has_output, has_part, has_quality, has_specification, instance_of, is_about, located_in, denotes.]
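One way to see what a semantic network of processes offers is to store it as subject-relation-object triples and walk it. The sketch below is an illustration only: the triples are a simplified reading of the Metadata Processes slide, the pairing of each process with its inputs and outputs is an assumption (the exact edges cannot be recovered from the slide text), and the function name `upstream` is hypothetical.

```python
from collections import defaultdict

# Simplified, assumed reading of the process chain; relation names
# (has_input, has_output, denotes) are the ones used on the slide.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("image processing and assembly", "has_input", "primary data"),
    ("image processing and assembly", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def upstream(entity, triples):
    """List every entity that 'entity' was derived from, nearest first."""
    produced_by = defaultdict(list)  # output entity -> processes that made it
    inputs_of = defaultdict(list)    # process -> its input entities
    for subject, relation, obj in triples:
        if relation == "has_output":
            produced_by[obj].append(subject)
        elif relation == "has_input":
            inputs_of[subject].append(obj)

    lineage, stack, seen = [], [entity], {entity}
    while stack:
        current = stack.pop()
        for process in produced_by[current]:
            for source in inputs_of[process]:
                if source not in seen:
                    seen.add(source)
                    lineage.append(source)
                    stack.append(source)
    return lineage


print(upstream("sequence data record", TRIPLES))
```

Because every stage uses the same has_input and has_output relations, a single traversal can trace an archived sequence record back to the specimen source without stage-specific code.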
Generic Assay [Figure: generic semantic network for an assay. Entities: assay X, assay type, assay protocol, objectives, run ID, input sample material X, sample type, sample ID, analyte X, species, quality x, reagent material X, reagent type, lot #, person X, name, equipment X, equipment type, serial #, primary data, and a temporal-spatial region (temporal interval, date/time, spatial region, GPS location, geographic location). Roles: target role, reagent role, technician role, signal detection role. Relations: has_input, has_output, has_part, has_quality, has_specification, instance_of, is_about, located_in, denotes, plays.]
Generic Material Transformation [Figure: generic semantic network for a material transformation. Entities: material transformation X, material transformation type, material transformation protocol, objectives, run ID, sample material X, sample type, sample ID, output material X, material type, species, quality x, reagent material X, reagent type, lot #, person X, name, equipment X, equipment type, serial #, and a temporal-spatial region (temporal interval, date/time, spatial region, GPS location, geographic location). Roles: target role, reagent role, technician role, signal detection role. Relations: has_input, has_output, has_part, has_quality, has_specification, instance_of, located_in, denotes, plays.]
Generic Data Transformation [Figure: generic semantic network for a data transformation. Entities: data transformation X, data transformation type, algorithm, software, run ID, input data, output data, material X, person X, name, and a temporal-spatial region (temporal interval, date/time, spatial region, GPS location, geographic location). Role: data analyst role. Relations: has_input, has_output, has_part, has_specification, instance_of, is_about, located_in, denotes, plays.]
Generic Material (IC) [Figure: generic semantic network for a material entity. Entities: material X, material type, ID, quality x, quality y, parts material Y and material Z, and two temporal-spatial regions (each with temporal interval, date/time, spatial region, GPS location, geographic location). Relations: instance_of, has_quality, has_part, located_in, denotes.]
Conclusions
• Utility of semantic representation
  • Identified gaps in data field list (e.g. temporal components)
  • Identified gaps in ontology data standards (use case-driven standard development)
  • Identified commonalities in data structures (reusable)
  • Support for semantic queries and inferential analysis in future
• Two flavors of MIBBI
  • Distinguish between the minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
• OBI-based framework is re-usable
  • Sequencing => “omics”
• Practical issues about implementation strategies
• Challenge of using ontologies for preferred value sets
  • Can be large
  • May not directly match common language