Facilitating the development of controlled vocabularies for metabolomics with text mining. I. Spasić, 1 D. Schober, 2 S. Sansone, 2 D. Rebholz-Schuhmann, 2 D. Kell, 1 N. Paton 1 and the MSI Ontology Working Group Members 3
Facilitating the development of controlled vocabularies for metabolomics with text mining • I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3 • 1 MCISB http://www.mcisb.org • 2 EBI http://www.ebi.ac.uk • 3 MSI http://msi-workgroups.sf.net
Motivation • experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics • controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources • hence the pressing need for vocabularies and ontologies for metabolomics
Metabolomics Society • http://www.metabolomicssociety.org • the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments • five working groups: • biological sample context • chemical analysis • data analysis • ontology • data exchange
MSI OWG • Metabolomics Standardisation Initiative Ontology WG • http://msi-ontology.sourceforge.net • msi-workgroups-ontology@lists.sourceforge.net • coordinated by Dr Susanna-Assunta Sansone • develop a common semantic framework for metabolomics studies by means of: • controlled vocabularies • ontologies • so as to be able to: • describe the experimental process consistently • ensure meaningful and unambiguous data exchange
Scope • the coverage of the domain reflects the typical structure of metabolomics investigations: • general components (investigation design; sample source, characteristics, treatments and collection; computational analysis) • technology-specific components (sample preparation; instrumental analysis; data pre-processing) • analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…
Terms • terms: • linguistic representations of domain-specific concepts • means of conveying scientific and technical information • CV terms: • used to tag units of information so that they can be more easily retrieved by a search • improve technical communication by ensuring that everyone is using the same term to mean the same thing
Term acquisition • CV terms are chosen and organised by trained professionals who possess expertise in the subject area • in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms • problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone • solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature
Strategy • each CV is compiled in an iterative process consisting of the following steps: • create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions • expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications • circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness
A text mining workflow • information retrieval: gather a technology-specific corpus of documents • search terms: MeSH terms & CV terms • documents: abstracts & full papers • resources: Entrez (MEDLINE and PubMed Central, PMC) • term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus • method: C-value provided by NaCTeM • term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc. • resources: UMLS (Metathesaurus & Semantic Network)
Information retrieval using MeSH terms • MeSH = Medical Subject Headings • http://www.nlm.nih.gov/mesh/ • MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed • MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts
IR using MeSH terms • finding the relevant MeSH terms using the MeSH browser • http://www.nlm.nih.gov/mesh/MBrowser.html • look up: NMR • resulting MeSH term(s): Magnetic Resonance Spectroscopy • PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
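The PubMed query above can also be prepared programmatically. The following is a minimal sketch, assuming NCBI's public E-utilities `esearch` endpoint with its `db`, `term` and `retmax` parameters (the slides mention only Entrez, not this endpoint). The helper name `build_esearch_url` is hypothetical, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities search service (not named on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_term, db="pubmed", retmax=100):
    """Return an esearch URL that restricts the search to a MeSH heading.

    `build_esearch_url` is a hypothetical helper written for illustration.
    """
    # Same query syntax as on the slide: <heading> [MeSH Terms]
    query = f"{mesh_term} [MeSH Terms]"
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("Magnetic Resonance Spectroscopy")
print(url)
```

Setting `db="pmc"` would direct the same query at PubMed Central instead of MEDLINE/PubMed, which matters for the full-text retrieval discussed on the next slide.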
Beyond MeSH terms • NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract can be expected to report only the results discovered and not the experimental conditions leading to these results • the experimental conditions are typically reported within the “Materials & Methods” section or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only • as a consequence, an IR approach based on MeSH terms, or on search terms limited to abstracts, will result in low recall (i.e. many of the relevant articles will be overlooked) • [Figure labels: biomedical literature; MEDLINE (abstracts); PubMed Central (full papers); NMR]
Selecting documents • [Figure labels: doc ID; number of matching terms > threshold = 3; local corpus]
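The selection rule on this slide (keep a document when its number of matching terms exceeds the threshold of 3) can be sketched as follows. This is only an illustration under stated assumptions: matching is done by naive case-insensitive substring search over distinct terms, the function names are hypothetical, and the documents and terms are toy data, not taken from the actual corpus.

```python
def count_matching_terms(text, terms):
    """Count how many distinct terms occur in the text.

    Assumption: case-insensitive substring matching, which is cruder than
    whatever matching the original workflow used.
    """
    lowered = text.lower()
    distinct_terms = {term.lower() for term in terms}
    return sum(1 for term in distinct_terms if term in lowered)


def select_documents(docs, terms, threshold=3):
    """Return IDs of documents matching more than `threshold` distinct terms."""
    return [doc_id for doc_id, text in docs.items()
            if count_matching_terms(text, terms) > threshold]


# Toy CV terms and documents (hypothetical, for illustration only).
cv_terms = ["chemical shift", "pulse sequence", "free induction decay",
            "NMR spectrometer", "spin-lattice relaxation"]
docs = {
    "doc-A": ("Spectra were acquired on an NMR spectrometer using a standard "
              "pulse sequence; the free induction decay was Fourier "
              "transformed and each chemical shift was referenced."),
    "doc-B": "The NMR spectrometer was recently serviced.",
}

local_corpus = select_documents(docs, cv_terms)
print(local_corpus)  # ['doc-A']
```

Counting distinct terms rather than total occurrences prevents a document that repeats a single term many times from passing the threshold.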
Term recognition: C-value • http://www.nactem.ac.uk/batch.php
C-value • syntactic pattern matching used to select term candidates: (ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N • termhood of each candidate term t is calculated from: • |t|, its length as the number of words • f(t), its frequency of occurrence • S(t), the set of other candidate terms containing it as a subphrase
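The scoring formula itself is not reproduced in the slide text. As a hedged illustration, the sketch below uses the commonly published form of the C-value measure (Frantzi, Ananiadou and Mima): for a candidate t that is not nested in any other candidate, C-value(t) = log2|t| · f(t); otherwise C-value(t) = log2|t| · (f(t) − (1/|S(t)|) · Σ f(s) over s in S(t)). It is not necessarily identical to the NaCTeM service used here. The function names are hypothetical, raw frequencies are used throughout, and the frequencies are toy values.

```python
import math


def contains_subphrase(longer, shorter):
    """True if word tuple `shorter` occurs contiguously inside the longer one."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(frequencies):
    """Map each candidate term to its C-value.

    `frequencies` maps a candidate term (a string of space-separated words)
    to f(t), its raw frequency in the corpus.
    """
    tokens = {term: tuple(term.split()) for term in frequencies}
    scores = {}
    for t, f_t in frequencies.items():
        # S(t): the other candidate terms that contain t as a subphrase
        s_t = [u for u in frequencies
               if contains_subphrase(tokens[u], tokens[t])]
        length_weight = math.log2(len(tokens[t]))  # log2 |t|
        if s_t:
            # Discount by the mean frequency of the longer terms containing t
            mean_nested = sum(frequencies[u] for u in s_t) / len(s_t)
            scores[t] = length_weight * (f_t - mean_nested)
        else:
            scores[t] = length_weight * f_t
    return scores


# Toy frequencies (hypothetical, for illustration only).
toy = {
    "magnetic resonance": 10,
    "nuclear magnetic resonance": 6,
    "nuclear magnetic resonance spectroscopy": 4,
}
for term, score in sorted(c_value(toy).items(), key=lambda kv: -kv[1]):
    print(f"{score:6.2f}  {term}")
```

With this form, a single-word candidate always scores 0 because log2(1) = 0, which is consistent with a candidate pattern that targets multi-word terms. The discount term lowers the score of a phrase that occurs mostly as part of longer candidates.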
Unified Medical Language System (UMLS) • UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies • http://umlsks.nlm.nih.gov • UMLS contains the following semantic classes relevant to our problem: • Organism (A.1.1) • Anatomical Structure (A.1.2) • Substance (A.1.4) • Biological Function (B.2.2.1) • Injury or Poisoning (B.2.3) • we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
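The filtering step can be sketched as a lookup against terms extracted for the excluded semantic classes. This is a minimal illustration, not the actual implementation: the `UMLS_LOOKUP` dictionary is a hypothetical stand-in for the terms extracted from the UMLS, the function name is invented, and the candidate list is toy data.

```python
# Semantic classes listed on the slide as denoting non-technology terms.
EXCLUDED_CLASSES = {
    "Organism",
    "Anatomical Structure",
    "Substance",
    "Biological Function",
    "Injury or Poisoning",
}

# Hypothetical stand-in for term-to-class pairs extracted from the UMLS.
UMLS_LOOKUP = {
    "rat": "Organism",
    "liver": "Anatomical Structure",
    "glucose": "Substance",
}


def filter_terms(candidates, lookup, excluded=EXCLUDED_CLASSES):
    """Drop candidates whose known semantic class is in the excluded set.

    Candidates absent from the lookup are kept, since nothing marks them
    as unrelated to the technology.
    """
    return [term for term in candidates
            if lookup.get(term.lower()) not in excluded]


candidates = ["chemical shift", "glucose", "rat", "pulse sequence", "Liver"]
print(filter_terms(candidates, UMLS_LOOKUP))
# ['chemical shift', 'pulse sequence']
```

Keeping unknown terms favours recall over precision, which seems reasonable given that the resulting CV is later circulated to practitioners for validation.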
Summary • [Figure: summary diagram; the only surviving label is UMLS]
Results • input: 243 NMR terms & 152 GC terms • output: 5,699 NMR terms & 2,612 GC terms • [unlabelled values from the slide: 0.13; 16.25; 2%]