Development of infrastructure, data standards, and best practices for the curation of pharmacogenomics data
Qian Zhu, PhD; Robert R. Freimuth, PhD; Jyotishman Pathak, PhD; Matthew J. Durski, MA; Zonghui Lian, MS; H. Scott Bauer; Christopher G. Chute, MD, DrPH
Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN

References
1. http://pgrn.org/display/pgrnwebsite/PGRN+Home
2. http://www.ihtsdo.org/snomed-ct/
3. Brown SH, Elkin PL, Rosenbloom ST, Husser C, Bauer BA, Lincoln MJ, et al. VA National Drug File Reference Terminology: a cross-institutional content coverage study. Medinfo. 2004:477–481.
4. http://ncit.nci.nih.gov/
5. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–33.
6. http://www.nlm.nih.gov/research/umls/rxnorm/

• Semantic Annotation Reviewer
  • To facilitate review of the annotation results, a web application was developed that allows the curator to select the best term(s) for annotation.

Limitation and Future work
• Harmonization of value sets was out of scope for this initial study. Future work will include value set harmonization, and we anticipate adding support for this function to the infrastructure and workflows that are being developed.
• Six categories were identified in this pilot study. Additional categories are expected as data dictionaries from other PGRN sites are included in the harmonization effort.
• The harmonized data elements from this work will be mapped to existing standards when possible, or new standards will be proposed if necessary.
• This study focused on clinical data related to lab tests, medications, and disease diagnoses. None of the dictionaries included in this pilot contained elements to represent genomic data, which is clearly an area for future work.
• The standardization of genomics data is an active topic in many organizations. We are working with the HL7 Clinical Genomics work group, CDISC, and the NCI Information Representation work group to leverage existing efforts.
• Future releases of the harmonization infrastructure will include role-based functions and support for curation workflows.

Abstract
• The Pharmacogenomics Research Network (PGRN) [1] is a collaborative partnership of research groups funded by the U.S. National Institutes of Health to discover and understand how the genome contributes to an individual's response to medication. It is contributing significantly to the scientific base of knowledge in pharmacogenomics, a trend that is expected to continue. However, because traditional biomedical research studies and clinical trials are conducted independently, common and standardized representations for data are seldom used. This leads to heterogeneity in the collected data and hinders data reuse, integration, and meta-analyses across multiple datasets.
• Curation of pharmacogenomics data sets requires that standards exist for a given terminology, ontology, or other standard representation of data. This study addresses that need by:
  • identifying core entities common to pharmacogenomics studies
  • proposing standards for representing those data
  • developing a workflow and supporting infrastructure to enable the efficient curation of metadata related to pharmacogenomics studies.

Tooling
• To support the curation workflow and assist the PGRN community in managing their data and related standards, a new software application has been developed. This tooling will enable the standardization of PGRN data dictionaries to be completed more efficiently than with the more labor-intensive approaches in use today.
Methods
• The curation workflow was composed of the following steps:
  • Data pre-processing: each dictionary was reformatted, missing data were filled in, and the result was loaded into a MySQL database for harmonization.
  • Decomposition and normalization
    • Variable descriptions were split into single words, which were then reassembled into phrases.
    • Phrases were normalized using a stop word list and the UMLS Specialist Lexicon [7].
  • Semantic annotation using controlled terminologies
  • Manual review of the annotations using the Semantic Annotation Reviewer
  • Categorization based on UMLS semantic types and domain knowledge
• Decomposition and Normalization results

Conclusions
Data and metadata standards help to mitigate the problems that arise from semantic and syntactic differences between research groups. These differences are major barriers that hinder effective communication among scientists and slow the pace of advancement and discovery. The standard representations for pharmacogenomics data proposed by this study will enable the PGRN community to share and reuse their data more effectively. Furthermore, the workflow and best practices for the curation of pharmacogenomics data that are being developed for this project are generalizable to other curation efforts that require the harmonization of disparate data elements from a broad community of investigators.
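The decomposition step in the workflow (splitting a variable description into single words and reassembling them into phrases) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `candidate_phrases`, the regular-expression tokenizer, and the choice to build only contiguous phrases are all assumptions. The default limit of six words follows the maximum phrase length stated on the poster.

```python
import re
from typing import List


def candidate_phrases(description: str, max_words: int = 6) -> List[str]:
    """Split a variable description into words, then reassemble
    every contiguous phrase of up to max_words words.

    Hypothetical helper: the tokenizer and the contiguity rule are
    assumptions made for illustration only.
    """
    # Lower-case and keep alphanumeric runs as single words.
    words = re.findall(r"[a-z0-9]+", description.lower())
    phrases = []
    for start in range(len(words)):
        # The end index is capped so that no phrase exceeds max_words.
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            phrases.append(" ".join(words[start:end]))
    return phrases


if __name__ == "__main__":
    for phrase in candidate_phrases("Serum creatinine level"):
        print(phrase)
```

Each candidate phrase could then be looked up in the controlled terminologies; generating every contiguous phrase lets a curator choose between a specific multi-word match and a more general single-word match.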
• Semantic Annotation Results
• Variable description decomposition and normalization pipeline
  • The length of each phrase reassembled from the mapping components (MCs) was limited to a maximum of six single words.
  • All words contained in the stop word list were removed, and MCs that included more than 50% stop words were removed.
  • Verb tenses were converted to a common base form, plural nouns to singular form, and possessive nouns to base forms using the LRAGR lexicon.
  • Verbs, adjectives, and adverbs were converted to nouns using the LRNOM lexicon.
• PGRN Data Dictionaries
  • Data dictionaries were collected from PGRN research sites.
  • Multiple formats: xls, pdf, txt, html, doc
  • Dictionaries from three sites were chosen for this pilot study.
• Categorization Results for 797 variables with complete annotations
• UMLS Semantic Types used for categorization
  • Natural Phenomenon or Process
  • Biologic Function
  • Physiologic Function
  • Organism Function
  • Mental Process
  • Organ or Tissue Function
  • Cell Function
  • Molecular Function
  • Genetic Function
  • Pathologic Function
  • Disease or Syndrome
  • Mental or Behavioral Dysfunction
  • Neoplastic Process
  • Cell or Molecular Dysfunction
  • Experimental Model of Disease
  • Injury or Poisoning
• Terminology Standards and Formalized Metadata
  • To maximize the utility of the PGRN data dictionary harmonization effort and the potential reuse of data elements, we leveraged several existing terminologies: SNOMED-CT [2], NDF-RT [3], NCI Thesaurus [4], LOINC [5], and RxNorm [6].

7. http://www.ncbi.nlm.nih.gov/books/NBK9680/

Acknowledgements
This work was supported by the NIH/NIGMS (U19 GM61388; the Pharmacogenomic Research Network).
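The stop-word rules listed under the decomposition and normalization pipeline (at most six words per phrase, removal of stop words, and removal of mapping components that are more than 50% stop words) can be sketched as below. This is a hedged illustration only: `STOP_WORDS` is a small hypothetical list rather than the one used in the study, `filter_mapping_component` is an invented name, and the order in which the rules are applied is an assumption. The lexical normalization with the LRAGR and LRNOM lexicons is not reproduced here.

```python
from typing import FrozenSet, Optional

# Hypothetical stop word list; the study's actual list is not given.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "of", "in", "at", "for", "to"}
)


def filter_mapping_component(
    phrase: str,
    stop_words: FrozenSet[str] = STOP_WORDS,
    max_words: int = 6,
) -> Optional[str]:
    """Apply the length and stop-word rules to one candidate phrase.

    Returns the phrase with stop words removed, or None if the phrase
    is rejected. The rule order is an assumption for illustration.
    """
    words = phrase.lower().split()
    # Reject empty phrases and phrases longer than the word limit.
    if not words or len(words) > max_words:
        return None
    # Reject phrases in which more than half of the words are stop words.
    stop_count = sum(1 for w in words if w in stop_words)
    if stop_count / len(words) > 0.5:
        return None
    # Strip the remaining stop words from the surviving phrase.
    kept = [w for w in words if w not in stop_words]
    return " ".join(kept) if kept else None


if __name__ == "__main__":
    for candidate in ["history of diabetes", "of the patient", "date of"]:
        print(candidate, "->", filter_mapping_component(candidate))
```

Checking the stop-word proportion before stripping stop words is a design choice in this sketch: it lets the 50% threshold act on the phrase as it was assembled, so that phrases made mostly of function words are discarded rather than reduced to a single leftover word.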