
BRC2011 Session #5 – Data Standards and Metadata

Session chair: Richard Scheuermann (ViPR & IRD). Session outline: motivation; opportunities, challenges and talking points; minimum information checklists; ontology-based value sets; use cases for metadata.


Presentation Transcript


  1. BRC2011 Session #5 – Data Standards and Metadata
     Session chair: Richard Scheuermann (ViPR & IRD)

  2. Session #5 - Outline
     • Motivation
     • Opportunities, Challenges and Talking Points
        • minimum information checklists
        • ontology-based value sets
        • use cases for metadata
        • SOPs for data & metadata acquisition
     • Ontology for Biomedical Investigations – Bjoern Peters
     • Infectious Disease Ontology and extensions – Lindsay Cowell
     • GSCID-BRC Metadata Working Group efforts
     • Open discussion

  3. Why Data Standards
     • Interoperability - the ability to exchange information between people, organizations, machines
     • Comparability - the ability to ascertain the equivalence of data from different sources
     • Data Quality – assess the completeness, accuracy and precision of the data
     • Dependability – ensures that you get what you expect from a database query
     • Accurate Statistical Analysis
     • Inference

  4. What Data Standards
     • Minimum Information Sets – what needs to be described
     • Structured Vocabulary/Ontology – how to describe them
        • Term strings – unique identifiers
        • Definitions - what terms mean
        • Syntax - how terms are used
        • Semantics - how the components relate to each other

  5. Session #5 – Challenges
     • Status of relevant data standards
        • Few data standards have been widely adopted by the infectious diseases community
        • Some standards are being developed without engagement of all relevant stakeholders
        • If we drive standards development, how do we get broad adoption?
     • Adoption of data standards by data providers
        • Even if vocabulary standards are available, how do we get the broader community to use them?
        • How do we educate them to use the data standards accurately?
        • How do we keep the barrier low for getting required metadata in a standard format?
     • Technical challenges
        • Usability is constrained by the spreadsheet interface
        • Ontology-based controlled vocabularies are sometimes too large for a spreadsheet-like interface or drop-down lists
        • While web-based GUI smart forms are good for a single submission, they are difficult to design to scale
     • Need for quality control and curation
        • If data standards are not enforced, mapping to standards may be required
        • Problems with homonyms (Turkey vs turkey) and synonyms (Puerto Rico and PR)
        • Not all tasks in metadata collection lend themselves to automation
        • Data entry quality control mechanisms are especially limited because of spreadsheet functionality
        • Could be 1-2 FTEs; not budgeted
     • Compliance with HIPAA and other privacy regulations
        • PATRIC does not anticipate working with identifying data, but GSCIDs and investigators could be delayed by compliance issues
     • Special cases
        • Metadata for genomes for NCBI bulk submission and non-unique taxon ids
        • Metadata for growth conditions to be used with transcript datasets
        • Metadata for metagenomes to correlate genomes and proteins with useful info about sites and conditions
     • How do we effectively exploit standardized data and metadata?
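The homonym and synonym problem on this slide can be illustrated with a minimal sketch. It assumes that the metadata field a value appears in is enough to disambiguate it; the field names `geographic_location` and `host` and the lookup tables are hypothetical, and a real value set would come from a controlled vocabulary or ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym tables, keyed by metadata field (assumption: the field
# supplies the context that separates "Turkey" the place from "turkey" the host).
SYNONYMS = {
    "geographic_location": {
        "puerto rico": "Puerto Rico",
        "pr": "Puerto Rico",
        "turkey": "Turkey",
    },
    "host": {
        "turkey": "turkey (Meleagris gallopavo)",
        "meleagris gallopavo": "turkey (Meleagris gallopavo)",
    },
}


def normalize(field_name, raw_value):
    """Return the preferred term for raw_value in field_name, or None if unmapped."""
    table = SYNONYMS.get(field_name, {})
    # Case and surrounding whitespace are ignored when looking up a value.
    return table.get(raw_value.strip().lower())


print(normalize("geographic_location", "PR"))
print(normalize("host", "Turkey"))
print(normalize("geographic_location", "Turkey"))
```

Values that come back as `None` are the ones that would still need manual curation, which is the effort the slide flags as unbudgeted.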

  6. Session #5 – Opportunities
     • Existing relevant ontologies are in decent shape – GO, IDO, OBI
     • Ontology for Biomedical Investigations (OBI) can provide a common framework for describing and exchanging datasets
     • GSCID-BRC Metadata Working Group
        • Leverage and harmonize with MIGS/MIMS
        • We have the opportunity to establish policies for metadata collection, exchange, and release that would be broadly applicable
     • We are in the position to drive standards adoption
     • The BRCs support many pathogens that infect the same host(s) … can we exploit this fact to create specialized views and tools for interacting with the host resources from both pathogen and host perspectives?
     • Ontology-driven integration (GMOD, Population biology)
     • Small sequencing centers
        • Offer community a standard metadata template for isolates
        • Bring your own data and metadata to PATRIC for annotation, analysis, long term metadata storage and dissemination
     • Develop additional metadata standards and collect, store, and share additional metadata
     • More efficient encoding of things like alignments

  7. Presentations
     • Ontology for Biomedical Investigations (OBI) – Bjoern Peters
     • Infectious Disease Ontology (IDO) and extensions – Lindsay Cowell
     • GSCID-BRC Metadata Working Group

  8. GSCID-BRC Metadata Working Group
     • Working group established to define common metadata standard for pathogen isolate sequencing projects
     • Collaboration between BRCs, GSCIDs and NIAID
     • Process
        • Collect spreadsheets, metadata examples, previous submissions from sequencing projects
        • Core metadata fields collected by virus, bacteria and eukaryote subgroups
        • For each metadata field, propose:
           • preferred term
           • definition
           • synonyms
           • allowed values based on controlled vocabularies
           • preferred syntax
           • responsible provider
           • data category
           • examples
        • Merge recommendations from subgroups into a common core metadata using an OBI-based semantic framework
        • Develop recommendations for project-specific and pathogen-specific metadata fields
        • Harmonize with other relevant standards (MIGS/MIMS, IDO)
        • Establish policies and procedures for metadata submission workflows and GenBank linkage
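The per-field attributes proposed by the working group could be captured in a small record type. The sketch below is only an illustration under stated assumptions: the class name `MetadataField`, the two example fields, their value sets, and the date pattern are all hypothetical and are not taken from the working group's actual core metadata.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class MetadataField:
    """One metadata field, with the attributes listed on the slide."""
    preferred_term: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # None means free text
    preferred_syntax: Optional[str] = None     # regular expression, if any
    responsible_provider: str = ""
    data_category: str = ""
    examples: List[str] = field(default_factory=list)

    def validate(self, value):
        """Return a list of problems found with value (empty if it is acceptable)."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{value!r} is not an allowed value for {self.preferred_term!r}")
        if self.preferred_syntax is not None and re.fullmatch(self.preferred_syntax, value) is None:
            problems.append(f"{value!r} does not match the syntax of {self.preferred_term!r}")
        return problems


# Hypothetical example fields; terms, value sets and syntax are assumptions.
collection_date = MetadataField(
    preferred_term="specimen collection date",
    definition="The date on which the specimen was collected.",
    synonyms=["collection date", "date collected"],
    preferred_syntax=r"\d{4}-\d{2}-\d{2}",
    responsible_provider="specimen collector",
    data_category="Specimen Isolation",
    examples=["2011-03-15"],
)
specimen_type = MetadataField(
    preferred_term="specimen type",
    definition="The kind of material collected from the specimen source.",
    allowed_values={"blood", "nasal swab", "serum"},
    data_category="Specimen Isolation",
    examples=["nasal swab"],
)

print(collection_date.validate("15/03/2011"))
print(specimen_type.validate("blood"))
```

Keeping allowed values and syntax on the field definition means the same record can drive both a submission template and the quality-control step.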

  9. Core Metadata Examples

  10. Network Overview
      [Diagram: semantic network of the isolate sequencing workflow. Nodes include specimen source (organism or environmental), specimen collector, isolation protocol, specimen isolation process, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, reagents, input sample, technician, equipment, sequencing assay, primary data, data transformations (image processing and assembly; variant detection, serotype marker detection, gene detection), genotype/serotype/gene data, sequence data, data archiving process, sequence data record, GenBank ID, and temporal-spatial region with type, ID and qualities. Relations: located_in, instance_of, has_quality, has_input, has_output, has_part, has_specification, is_about, denotes. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations.]

  11. [Diagram: the same network as slide 10, grouped into five panels: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing.]

  12. [Diagram: expanded network annotated with virus-sheet rows, where vX means row X in the virus sheet (v2 to v46). Additional nodes include temporal interval (date/time), spatial region (GPS location, geographic location), organism (common name, species/strain, ID, age, gender, symptom), environmental material, specimen source role, specimen capture role, specimen collector role, person, affiliation name, NA enrichment process and protocol, cDNA synthesis process and protocol, cDNA sample, template role, reagent role, sequencing tech. role, signal detection role, sequencing protocol, data transfer protocol, software and algorithm. Additional relations: plays, has_affiliation. Legend as on slide 10.]
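One way to read the network slides is as a set of subject, relation, object triples. The sketch below encodes a simplified, linear slice of the workflow using the `has_input` and `has_output` relations named on the slides. The exact wiring of the original diagram cannot be recovered from the transcript, so the triples and the helper `provenance` are assumptions made for illustration.

```python
# Assumed, simplified triples (process, relation, entity); not the full diagram.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
]


def provenance(entity, triples=TRIPLES):
    """List the upstream entities of `entity`, nearest first.

    Walks backward: find the process whose output is the current entity,
    then step to that process's input. Only the first producer and the first
    unseen input are followed, which is enough for a linear chain.
    """
    chain = []
    seen = {entity}
    current = entity
    while True:
        producers = [s for s, p, o in triples if p == "has_output" and o == current]
        if not producers:
            break
        inputs = [o for s, p, o in triples
                  if s == producers[0] and p == "has_input" and o not in seen]
        if not inputs:
            break
        current = inputs[0]
        seen.add(current)
        chain.append(current)
    return chain


print(provenance("sequence data record"))
```

A query like this is only possible because the relations between fields are kept, rather than storing each metadata value as an isolated column.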

  13. Metadata Categories
      • Investigation
      • Specimen Isolation
      • Specimen Processing
      • Sample Shipment
      • Pathogen Detection & Isolation
      • Sequencing Sample Preparation
      • Sequencing Assay
      • Data Transformation

  14. Specimen Isolation
      [Diagram, annotated with virus-sheet rows v2 to v27 (vX means row X in the virus sheet). Central node: specimen isolation procedure X, with input organism or environmental material (specimen source role) and output specimen X (specimen type, ID). Other nodes: organism (common name, species/strain, ID, age, gender, symptom), organism part, microorganism, hypothesis, equipment (specimen capture role), person (specimen collector role, affiliation name), isolation protocol, IRB/IACUC approval, specimen isolation procedure type, temporal interval (date/time), spatial region (GPS location, geographic location), temporal-spatial region, comments. Relations: has_input, has_output, has_part, has_specification, has_authorization, has_affiliation, has_quality, instance_of, is_about, located_in, plays, denotes.]

  15. Specimen Processing
      [Diagram, rows v15 to v27. Two processes: aliquoting process X (input specimen X, output aliquot Y, specified by aliquoting protocol) and sample set assembly process X (inputs specimens and aliquots, output sample set X, specified by sample set assembly protocol). Each material has ID, specimen type and amount; each process is located in a temporal-spatial region (date/time, GPS location, geographic location). Microorganism X is an instance of species/strain.]

  16. Sample Shipment
      [Diagram, rows v21 to v25. Two processes: sample shipment process X (input sample set X, output sample set X in transit, specified by sample shipment protocol) and sample receipt process X (input sample set X in transit, output sample set X at GSC, which has part sample X at GSC, specified by sample receipt protocol). Each sample set has ID, sample set type and amount; each process is located in a temporal-spatial region.]

  17. Pathogen Detection & Isolation
      [Diagram, rows v15 to v34. Two processes: pathogen detection process X (input specimen X, output data about pathogen presence, specified by pathogen detection protocol and pathogen detection method) and pathogen isolation process X (input specimen X, output pathogen isolate X, specified by pathogen isolation method). Pathogen isolate X has ID, pathogen type and amount; microorganism X is an instance of species/strain; each process is located in a temporal-spatial region.]

  18. Sequencing Sample Preparation
      [Diagram, rows v15 to v39. Three processes: aliquoting process X (input specimen X, output specimen aliquot X, aliquoting protocol), NA enrichment process X (output enriched NA sample X, NA enrichment protocol) and cDNA synthesis process X (output cDNA sample X, cDNA synthesis protocol). Materials have ID, specimen type and amount; microorganism genomic NA and microorganism X (species/strain) are parts of the samples; each process is located in a temporal-spatial region.]

  19. Sequencing Assay
      [Diagram, rows v14, v40, v41. Central node: sequencing assay X (sequencing assay type, run ID), specified by sequencing protocol, with objectives of coverage and genome type targeted. Inputs: sample material X (sample type, sample ID, template role), material X (reagent type, lot #, reagent role), person X (name, sequencing tech. role), equipment X (equipment type, serial #, signal detection role). Output: primary data. The assay is located in a temporal-spatial region.]

  20. Data Transformations
      [Diagram, rows v29 to v47. Processes: data transformations for image processing and assembly X (input primary data, output sequence data, specified by software and algorithm, run ID), data archiving process (input sequence data, output sequence data record denoted by GenBank ID, specified by data transfer protocol), and data transformations for variant detection, serotype marker detection and gene detection (outputs genotype data, serotype data, gene data about microorganism genomic NA). Person X (name) plays the bioinformatics tech. role; microorganism X is an instance of species/strain; each process is located in a temporal-spatial region.]

  21. Investigation
      [Diagram: an investigation has parts study design, documenting and study design execution. Study design has part objective specification; study design execution has parts specimen creation, specimen preparation for assay, sequencing assay and data transformation; an information content entity is linked by has_specified_input. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations.]

  22. Generic Assay
      [Diagram: assay X (assay type, run ID), specified by assay protocol, with objectives. Inputs: input sample material X (sample type, sample ID, target role) containing analyte X with quality x, material X (reagent type, lot #, reagent role), person X (name, technician role), equipment X (equipment type, serial #, signal detection role). Output: primary data, which is about the analyte. The assay is located in a temporal-spatial region (date/time, GPS location, geographic location).]

  23. Generic Material Transformation
      [Diagram: material transformation X (material transformation type, run ID), specified by material transformation protocol, with objectives. Inputs: sample material X (sample type, sample ID, target role, quality x), material X (reagent type, lot #, reagent role), person X (name, technician role), equipment X (equipment type, serial #, signal detection role). Output: output material X (material type, sample ID, quality x). The transformation is located in a temporal-spatial region.]

  24. Generic Data Transformation
      [Diagram: data transformation X (data transformation type, run ID), specified by software and algorithm. Input: input data; output: output data, which is about material X. Person X (name) plays the data analyst role. The transformation is located in a temporal-spatial region.]

  25. Generic Material (IC)
      [Diagram: material X (material type, ID) has qualities x and y and parts material Y and material Z, and is located in temporal-spatial regions (date/time, GPS location, geographic location).]

  26. Discussion Points
      • MIBBI may not be sufficient
         • Don’t distinguish between the minimum information to reproduce an experiment and the minimum information to structure in a database
         • Lack a semantic framework
      • OBI-based framework is re-usable
         • Sequencing => “omics”
      • Challenge of using ontologies for preferred value sets
         • Can be large
         • May not directly match common language
      • Value of defining the semantic framework
         • Appropriate relations are retained
         • How can we take advantage of the framework for semantic query and inferential analysis?
      • Practical issues about implementation strategies
