EVS Data Curation
The processing and publication of data for web browsing and programmatic access
Gene Ontology and Zebrafish
• Downloaded as OBO from web sites
• Processed with C++ program OBO2TDE.exe into Ontylog XML
• Processed with C++ program ontyxToOWL.exe into OWL
• Loaded using LoadNCIThesOWL.sh
• Metadata loaded using LoadMetadata
• Hierarchy and Sources manually edited
HL7 and VA_NDFRT
• Retrieved from sources
• Processed by Apelon into Ontylog XML
• Loaded into LexBIG using LoadNCIThesOwl and manifest
• Metadata loaded using LoadMetadata
MGED
• OWL file downloaded from source web site
• Loaded into Protégé
• Classified
• Inferred version exported as OWL file
• Loaded into LexBIG using LoadNCIThesOwl
• Metadata loaded using LoadMetadata
• Hierarchy and Sources manually edited
SNOMED, MedDRA and LOINC
• Extracted from the UMLS into RRF files
• Loaded into LexBIG using LoadUMLSFiles
• Metadata loaded using LoadMetadata
UMLS Semnet
• Downloaded from UMLS Semnet web site
• Loaded using LoadUMLSSemnet
• Metadata loaded using LoadMetadata
Metathesaurus
• Loaded from UMLS into MEME
• NCI Thesaurus imported monthly
• Other vocabularies added or removed
• NCI-specific edits made to data and relations
• Exported as RRF
• Imported to LexBIG using LoadNCIMeta
• Metadata loaded using LoadMetadata
Preparing TDE Thesaurus for MEME
• Thesaurus Ontylog XML baseline is processed through the C++ application publishMEME.exe
• Current baseline is compared to the previous baseline to get a summary of new properties or roles
• Summary is used to create the import configuration file
• Baseline is imported into MEME
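The baseline comparison step can be pictured as a set difference over property and role names. The following is only a minimal sketch under stated assumptions: it is not the actual publishMEME.exe logic, the function name `summarize_new_names` is hypothetical, and the property and role names are made-up placeholders rather than real NCI Thesaurus names.

```python
def summarize_new_names(previous, current):
    """Return the names present in the current baseline but not in the previous one."""
    return sorted(set(current) - set(previous))


# Hypothetical placeholder names; a real run would read them from the two baselines.
previous_baseline = {
    "properties": ["Property_A", "Property_B"],
    "roles": ["Role_A"],
}
current_baseline = {
    "properties": ["Property_A", "Property_B", "Property_C"],
    "roles": ["Role_A", "Role_B"],
}

# One summary entry per kind of name, listing only what is new in the current baseline.
summary = {
    kind: summarize_new_names(previous_baseline[kind], current_baseline[kind])
    for kind in ("properties", "roles")
}
print(summary)
```

A summary of this shape lists only the additions, which is the information the slide says feeds the import configuration file.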
NCI Thesaurus from TDE
• Edited in TDE and exported to Ontylog XML by name
• Run through publishTDE to remove unpublishable properties
• Run through OntyxToOwl.exe to create OWL file by code
• Loaded into LexBIG using LoadNCIThesOWL
• Metadata loaded using LoadMetadata
• History generated from TDE baseline
• History loaded using LoadNCIHistory
NCI Thesaurus from Protégé
• Run OWL through application to get Ontylog XML by name
• Run Ontylog XML through publishTDE to remove unpublishable properties
• Run through OntylogtoOWL to get OWL by code
• Generate history from the Ontylog XML
NCI Thesaurus History Processing
• evs_history records concept modifications made in the editor
• These records are extracted monthly to consolidate them and to remove identifying information
• Cleaned records are loaded into concept_history
• Full concept_history is loaded into LexBIG for NCI Thesaurus
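The "remove identifying information" step can be sketched as dropping the user and host fields from a pipe-delimited history record. This is an illustration, not the actual EVS tooling: the function name `clean_history_record` is hypothetical, the field order (record ID, code, name, action, timestamp, user, host, reference, trailing flag) is inferred from the sample records in this deck, and the output layout is assumed to match the DTS_history_out.txt lines.

```python
from datetime import datetime

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def clean_history_record(raw):
    """Drop the (assumed) name, user and host fields and normalize action and date."""
    fields = raw.split("|")
    # Assumed field order, inferred from the sample records in the deck.
    record_id, code, _name, action, timestamp, _user, _host, reference = fields[:8]
    when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
    # DD-MON-YY, built by hand so the result does not depend on the locale.
    edit_date = f"{when.day:02d}-{MONTHS[when.month - 1]}-{when.year % 100:02d}"
    return "|".join([record_id, code, action.lower(), edit_date, reference])


raw = ("668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|"
       "remennik|6116otsaremennl.nci.nih.gov|(null)|0")
cleaned = clean_history_record(raw)
print(cleaned)
```

The cleaned record keeps the concept code, action and date but no longer names the modeler or the workstation.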
log.out
New concepts created through Create or Split actions:
C72675|Feet_First
.
Concepts merged into other concepts:
C17841|Oncologic_Surgeon
.
Retired concepts (including merged):
C17841|Oncologic_Surgeon
.
New concepts not found in BSLN2:
C73140|Ethaverine_
.
Retired concepts not found in BSLN2
C73401|Maqui_Berry_Flavor
.
Modify records correponding to Retired_Kind are discarded:
667487|C62920|Medical_Device_Unsafe_to_Use|Modify|2008-03-05 …
.
Modify records correponding to new codes are discarded:
666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 …
.
Modify records correponding to merged codes are discarded:
668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0
.
Records correponding to codes not found in BSLN2 are discarded:
671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
.
WARNING: New codes created, then retired, but still found in BSLN2: (to be edited manually)
C72675|Feet_First
.
List of all remaining records
.
List of all discarded records:
666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 09:02:56.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
.
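The discard rules reported in log.out can be read as a filter over the raw history records. The sketch below is an interpretation of those log headings, not the real program: the function name `split_records`, the field positions, and the contents of the code sets are assumptions, and the record for `C00001` is a made-up example of a record that survives the filter.

```python
def split_records(records, new_codes, merged_codes, retired_codes, baseline_codes):
    """Separate history records into kept and discarded lists.

    A record is discarded when its code is missing from the baseline, or when it
    is a Modify record for a code that is new, merged or retired.
    """
    kept, discarded = [], []
    excluded_for_modify = new_codes | merged_codes | retired_codes
    for record in records:
        fields = record.split("|")
        code, action = fields[1], fields[3]  # assumed field positions
        if code not in baseline_codes:
            discarded.append(record)
        elif action == "Modify" and code in excluded_for_modify:
            discarded.append(record)
        else:
            kept.append(record)
    return kept, discarded


records = [
    "666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 09:02:56.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0",
    "668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0",
    "671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0",
    # Hypothetical record, added only to show a record that is kept.
    "1|C00001|Hypothetical_Concept|Modify|2008-03-01 00:00:00.0|user|host|(null)|0",
]

kept, discarded = split_records(
    records,
    new_codes={"C72831"},
    merged_codes={"C3824"},
    retired_codes={"C62920"},
    baseline_codes={"C72831", "C3824", "C62920", "C00001"},  # C73140 is absent
)
print(len(kept), "kept;", len(discarded), "discarded")
```

Keeping both lists mirrors the log, which ends with a list of remaining records and a list of discarded records.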
tde_history_report.txt
Spilanthes_oleracea (Code: C72446)
Number of modelers: 3
Modeler: shaiu
Modeler: thomas
Modeler: creech
Modeler: shaiu
Action: modify time: 2008-03-05 05:03:58.0
Modeler: thomas
Action: modify time: 2008-03-06 02:03:05.0
Action: modify time: 2008-03-14 10:03:06.0
Modeler: creech
Action: modify time: 2008-03-06 02:03:06.0
------------------------------------------------------------------
.
Edited actions for the following concepts are discarded:
Concept codes requiring manual review:
DTS_history
• DTS_history_script.sql
insert into concept_history(concept, editaction, editdate, reference) values ('C72675', 'create', '28-MAR-08', null);
insert into concept_history(concept, editaction, editdate, reference) values ('C72676', 'create', '28-MAR-08', null);
.
.
• DTS_history_out.txt
666540|C72675|create|28-MAR-08|(null)
666541|C72676|create|28-MAR-08|(null)
666542|C62171|modify|28-MAR-08|(null)
.
.
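The two files shown on this slide appear to carry the same records in two forms. As a sketch of how one could be derived from the other, the hypothetical helper `to_insert_statement` below turns a DTS_history_out.txt line into the matching insert statement. It assumes the first field is a record ID that is not inserted and that `(null)` maps to SQL `null`; the actual generator may work differently. Building SQL by string formatting is used here only to reproduce the script text, and a real loader would normally use bound parameters.

```python
def to_insert_statement(line):
    """Build a concept_history insert statement from one pipe-delimited line."""
    _record_id, concept, action, edit_date, reference = line.strip().split("|")
    # "(null)" in the text file is assumed to mean a SQL null reference.
    ref_sql = "null" if reference == "(null)" else f"'{reference}'"
    return (
        "insert into concept_history(concept, editaction, editdate, reference) "
        f"values ('{concept}', '{action}', '{edit_date}', {ref_sql});"
    )


statement = to_insert_statement("666540|C72675|create|28-MAR-08|(null)")
print(statement)
```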
DTS_history_out.out
Lists complete contents of both baselines
.
Number of codes in {baseline A} : 65265
Number of codes in {baseline B} : 66022
Concepts found in {baseline B}: but not in {baseline A}
C72675
C72676
.
Concepts found in {baseline A}: but not in {baseline B} (should be empty)
.
Verify DTS_history_out.txt against baseline data.
New Concepts: 757
(1) C72675
(2) C72676
.
Concepts created through Split: 0
Split Concepts: 0
Retired Concepts: 4
(1) C20920
(2) C62920
Concepts retired through Merge: 5
(1) C14142
Merge Concepts: 5
(1) C1363
Modified Concepts: 1364
Invalid actions: 0
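The verification described in this report can be sketched as a cross-check between the create actions in DTS_history_out.txt and the codes that differ between the two baselines. The function `verify_creates` and its result keys are hypothetical, and the baselines below are tiny stand-ins for the real code lists.

```python
def verify_creates(history_lines, baseline_a, baseline_b):
    """Compare 'create' actions in the history against codes new in baseline B."""
    created = set()
    for line in history_lines:
        fields = line.split("|")
        if fields[2] == "create":  # assumed position of the action field
            created.add(fields[1])
    expected = set(baseline_b) - set(baseline_a)
    return {
        "missing_from_history": sorted(expected - created),
        "unexpected_in_history": sorted(created - expected),
        # Codes in A but not in B; the report says this list should be empty.
        "dropped_from_baseline": sorted(set(baseline_a) - set(baseline_b)),
    }


history_lines = [
    "666540|C72675|create|28-MAR-08|(null)",
    "666541|C72676|create|28-MAR-08|(null)",
    "666542|C62171|modify|28-MAR-08|(null)",
]
baseline_a = {"C62171"}                      # stand-in for the earlier baseline
baseline_b = {"C62171", "C72675", "C72676"}  # stand-in for the later baseline

report = verify_creates(history_lines, baseline_a, baseline_b)
print(report)
```

When all three lists come back empty, the history and the baselines agree on which concepts are new.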
Tiered Deployments
• NCICB uses 4-tiered deployments
• Dev tier: used internally by EVS team to test software and data
• QA tier: used by QA and other software teams to test against new EVS software or data
• Stage tier: used to test software deployments in a near-production environment
• Production: available to outside users