
Facilitating the development of controlled vocabularies for metabolomics with text mining

Facilitating the development of controlled vocabularies for metabolomics with text mining. I. Spasić, 1 D. Schober, 2 S. Sansone, 2 D. Rebholz-Schuhmann, 2 D. Kell, 1 N. Paton 1 and the MSI Ontology Working Group Members 3


Presentation Transcript


  1. Facilitating the development of controlled vocabularies for metabolomics with text mining • I. Spasić (1), D. Schober (2), S. Sansone (2), D. Rebholz-Schuhmann (2), D. Kell (1), N. Paton (1) and the MSI Ontology Working Group Members (3) • (1) MCISB http://www.mcisb.org • (2) EBI http://www.ebi.ac.uk • (3) MSI http://msi-workgroups.sf.net

  2. Motivation • experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics • controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources • hence the pressing need for vocabularies and ontologies for metabolomics

  3. Metabolomics Society • http://www.metabolomicssociety.org • the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments • five working groups: • biological sample context • chemical analysis • data analysis • ontology • data exchange

  4. MSI OWG • Metabolomics Standardisation Initiative Ontology WG • http://msi-ontology.sourceforge.net • msi-workgroups-ontology@lists.sourceforge.net • coordinated by Dr Susanna-Assunta Sansone • develop a common semantic framework for metabolomics studies by means of • controlled vocabularies • ontologies • so as to be able to: • describe the experimental process consistently • ensure meaningful and unambiguous data exchange

  5. Scope • the coverage of the domain reflects the typical structure of metabolomics investigations: • general components (investigation design; sample source, characteristics, treatments and collection; computational analysis) • technology-specific components (sample preparation; instrumental analysis; data pre-processing) • analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…

  6. Terms • terms: • linguistic representations of domain-specific concepts • means of conveying scientific and technical information • CV terms: • used to tag units of information so that they can be more easily retrieved by a search • improve technical communication by ensuring that everyone is using the same term to mean the same thing

  7. Term acquisition • CV terms are chosen and organised by trained professionals who possess expertise in the subject area • in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms • problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone • solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature

  8. Strategy • each CV is compiled in an iterative process consisting of the following steps: • create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions • expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications • circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness

  9. A text mining workflow • information retrieval: gather a technology-specific corpus of documents • search terms: MeSH terms & CV terms • documents: abstracts & full papers • resources: Entrez (MEDLINE & PubMed Central (PMC)) • term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus • method: C-value provided by NaCTeM • term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc. • resources: UMLS (Metathesaurus & Semantic Network)

  10. Information retrieval using MeSH terms • MeSH = Medical Subject Headings • http://www.nlm.nih.gov/mesh/ • MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed • MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

  11. IR using MeSH terms • finding the relevant MeSH terms using the MeSH browser • http://www.nlm.nih.gov/mesh/MBrowser.html • look up: NMR • resulting MeSH term(s): Magnetic Resonance Spectroscopy • PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
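
The PubMed query on this slide can also be issued programmatically. The sketch below only builds the request URL for the NCBI E-utilities ESearch endpoint and does not contact the server; the function name `build_esearch_url` and the `retmax` default are illustrative assumptions, not part of the presentation.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_term, db="pubmed", retmax=100):
    """Return an ESearch URL restricted to a MeSH heading (hypothetical helper)."""
    # Same field tag as in the slide's PubMed query.
    query = f"{mesh_term}[MeSH Terms]"
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


url = build_esearch_url("Magnetic Resonance Spectroscopy")
print(url)
```

Passing `db="pmc"` would target PubMed Central instead, which matters for the full-text retrieval discussed on the next slide.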

  12. Beyond MeSH terms • NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract is expected to report only the results discovered and not the experimental conditions leading to these results • the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process the full-text articles as opposed to abstracts only • as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked) • [Figure: MEDLINE (abstracts) and PubMed Central (full papers) within the biomedical literature, with mentions of NMR marked]

  13. Selecting search terms 2400

  14. Selecting documents • for each doc ID, count the number of matching search terms • documents with number of matching terms > threshold (= 3) are added to the local corpus
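
The selection rule on this slide (number of matching terms > threshold = 3) can be sketched as follows. This is a minimal illustration that assumes case-insensitive substring matching; the function names, the sample documents and the search terms are all hypothetical.

```python
def count_matching_terms(text, search_terms):
    """Count how many distinct search terms occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in search_terms if term.lower() in lowered)


def select_documents(docs, search_terms, threshold=3):
    """Return the IDs of documents matching more than `threshold` search terms."""
    return [doc_id for doc_id, text in docs.items()
            if count_matching_terms(text, search_terms) > threshold]


# Hypothetical search terms and documents, for illustration only.
terms = ["pulse sequence", "chemical shift", "free induction decay",
         "Fourier transformation", "relaxation delay"]
docs = {
    "doc1": "Spectra were acquired with a standard pulse sequence; chemical shift "
            "values were referenced to TSP, and the free induction decay was "
            "zero-filled before Fourier transformation.",
    "doc2": "Patients were followed for two years; chemical shift imaging "
            "was not performed.",
}

selected = select_documents(docs, terms)
print(selected)
```

Plain substring matching is the simplest possible choice here; a real pipeline would more likely match on token boundaries or on normalised term variants.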

  15. Term recognition: C-value • http://www.nactem.ac.uk/batch.php

  16. C-value • syntactic pattern matching used to select term candidates: (ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N • termhood of each candidate term t is calculated using: • |t| its length as the number of words • f(t) its frequency of occurrence • S(t) the set of other candidate terms containing it as a subphrase
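
The formula itself did not survive in this transcript. As a hedged sketch, the code below follows the commonly cited C-value formulation of Frantzi et al., written with the slide's symbols |t|, f(t) and S(t): a candidate that is not nested scores log2|t| · f(t), while a nested candidate has the mean frequency of the terms in S(t) subtracted from f(t) first. The base of the logarithm and the example frequencies are assumptions and may differ from what the authors used.

```python
import math


def contains_subphrase(longer, shorter):
    """True if word tuple `shorter` occurs contiguously inside the longer tuple."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score candidates; `freq` maps each candidate term t to its frequency f(t)."""
    words = {t: tuple(t.split()) for t in freq}
    scores = {}
    for t, f_t in freq.items():
        # S(t): the other candidate terms that contain t as a subphrase.
        nesting = [s for s in freq if contains_subphrase(words[s], words[t])]
        adjusted = f_t
        if nesting:
            adjusted -= sum(freq[s] for s in nesting) / len(nesting)
        # |t| is the length in words; single-word candidates therefore score 0.
        scores[t] = math.log2(len(words[t])) * adjusted
    return scores


# Hypothetical frequencies, for illustration only.
freq = {
    "magnetic resonance": 10,
    "nuclear magnetic resonance": 6,
    "nuclear magnetic resonance spectroscopy": 4,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:6.2f}  {term}")
```

The subtraction is what penalises fragments: "magnetic resonance" occurs often, but mostly as part of longer candidates, so its score drops relative to the longest term.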

  17. C-value results

  18. Unified Medical Language System (UMLS) • UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies • http://umlsks.nlm.nih.gov • UMLS contains the following semantic classes relevant to our problem: • Organism (A.1.1) • Anatomical Structure (A.1.2) • Substance (A.1.4) • Biological Function (B.2.2.1) • Injury or Poisoning (B.2.3) • we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
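
The filtering step can be illustrated with a small sketch. It assumes that a lookup table from terms to one of the five semantic classes above has already been extracted from UMLS; the table contents, the candidate list and the function name `filter_terms` are hypothetical, and no UMLS API is called.

```python
# Semantic classes listed on the slide; terms in these classes are filtered out.
EXCLUDED_CLASSES = {
    "Organism",
    "Anatomical Structure",
    "Substance",
    "Biological Function",
    "Injury or Poisoning",
}


def filter_terms(candidates, semantic_class_of):
    """Keep candidates whose semantic class is unknown or not excluded."""
    return [t for t in candidates
            if semantic_class_of.get(t.lower()) not in EXCLUDED_CLASSES]


# Hypothetical lookup table standing in for terms extracted from UMLS.
semantic_class_of = {
    "glucose": "Substance",
    "liver": "Anatomical Structure",
    "escherichia coli": "Organism",
}

candidates = ["pulse sequence", "Glucose", "liver", "chemical shift"]
kept = filter_terms(candidates, semantic_class_of)
print(kept)
```

Terms absent from the lookup are kept, on the assumption that technology-specific vocabulary is less likely to be covered by these classes.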

  19. Summary UMLS

  20. Results • input: 243 NMR terms & 152 GC terms • output: 5,699 NMR terms & 2,612 GC terms 0.13 16.25 2%

  21. The End
