
UniProt to MeSH: mapping proteins to disease terminologies

ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20. Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey.


Presentation Transcript


  1. ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20. UniProt to MeSH: mapping proteins to disease terminologies. Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey

  2. The role of bioinformatics in biomedical research and future clinical patient care • Basic research: what is the mechanism? Epidemiological studies • Drug development • Clinical trials • Up-to-date knowledge and large-scale results: research direction, new hypothesis • Basic research results stored in databases • Health problem in a patient • Bioinformatics: data storage and representation, large-scale data generation, large-scale data analysis • Clinical patient care: doctor prescribes an individualized treatment plan • Treatment outcome • Molecular-level decision-support tools: structured knowledge representations, ‘filtered’ information on fundamental biological mechanisms and significant Bio-Ontologies – ISMB 2007

  3. Biomedical knowledge: a protein-centric view • Disease: pathology, diagnosis/prognosis, treatment, risk factor • Biological processes: biological pathway/network, protein-protein interaction • Proteins: sequence, function, structure, modifications • Genes: sequence, chromosomal location, regulation, expression

  4. Biomedical knowledge: a protein-centric view • Disease (pathology, diagnosis/prognosis, treatment, risk factor): disease annotation with links to 12,603 OMIM entries and to other specialized databases; 32,921 variants (or polymorphisms); >3,000 associated diseases • Biological processes (biological pathway/network, protein-protein interaction): pathway annotation; protein-protein interaction (DIP, IntAct); protein 2D gel (Swiss-2DPAGE) • Proteins (sequence, function, structure, modifications): high-quality manual annotation of protein name, sequence, function, domain, features and references; 16,702 human proteins • Genes (sequence, chromosomal location, regulation, expression): genomic data (Genew, GeneCards, GenAtlas); expression data (e.g. CleanEx); genome details (Ensembl) • References: links to >100 other databases; over 82,420 journal references

  5. Objective: increase the accessibility of molecular biology resources to clinical researchers by indexing UniProtKB/Swiss-Prot with the MeSH terminology.

  6. Why UniProtKB/Swiss-Prot? • Most comprehensive warehouse of protein sequences, with a high level of annotation and highly cross-linked with other biological databases • Includes data on more than 30,000 variants, mostly c-SNPs (coding SNPs) or SAPs (single amino-acid polymorphisms) • More than 3,000 diseases associated with a protein are also described (mostly genetic diseases associated with SAPs) http://beta.uniprot.org/

  7. Disease annotation: UniProtKB/Swiss-Prot entry P35240

  8. Why MeSH? • Controlled vocabulary thesaurus structured in a hierarchy of concepts • Each concept includes a set of terms (synonyms and lexical variants) • MeSH is part of the UMLS and is thus linked to other medical terminologies • MeSH is used to index the biomedical literature

  9. The structure of MeSH

  10. Mapping procedure • UniProtKB/Swiss-Prot entry → disease comment line → extracted disease name → exact match / partial match → MeSH • OMIM: title/alternative titles → exact match / partial match → MeSH • Same descriptor

  11. Disease extraction • Extraction using regular expressions: ‘are the cause of’, ‘involved in’, etc. • MeSH: ‘Neurofibromatosis 2’
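The extraction step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' actual patterns: only the two trigger phrases come from the slide, while the stop characters, the function name `extract_disease_names`, and the sample comment line are assumptions made for the example.

```python
import re
from typing import List

# Trigger phrases quoted on the slide; the slide's "etc." is left unfilled.
TRIGGERS = ["are the cause of", "involved in"]

# Assumption: the disease name runs from the trigger phrase up to the next
# period, semicolon, or opening bracket.
PATTERN = re.compile(
    r"(?:%s)\s+(?P<disease>[^.;(\[]+)" % "|".join(re.escape(t) for t in TRIGGERS),
    re.IGNORECASE,
)


def extract_disease_names(comment: str) -> List[str]:
    """Return lower-cased candidate disease names found in a comment line."""
    return [m.group("disease").strip().lower() for m in PATTERN.finditer(comment)]


# Illustrative comment line (hypothetical wording, not copied from the entry).
sample = "Defects in NF2 are the cause of neurofibromatosis 2 (NF2)."
names = extract_disease_names(sample)
```

A pattern this loose also fires on non-disease phrases such as "involved in tumor suppression", which is one reason the extracted names still have to be matched against MeSH afterwards.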

  12. Term matching procedure • Exact matches: same length, same word order, case insensitive • Partial matches: a similarity score between terms is calculated, based on the IDF (inverse document frequency) used in information retrieval; the term with the highest score was chosen.
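The two matching modes can be sketched as below. The score formula itself is not preserved in this transcript, so the IDF-weighted word overlap used here is an assumption standing in for it, and the names `exact_match`, `build_idf`, `similarity`, and `best_match` are hypothetical. The small vocabulary is illustrative only.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokens(term: str) -> List[str]:
    """Lower-case a term and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", term.lower())


def exact_match(a: str, b: str) -> bool:
    """Same number of words, same word order, case insensitive."""
    return a.lower().split() == b.lower().split()


def build_idf(terms: List[str]) -> Dict[str, float]:
    """IDF per word, treating each vocabulary term as one document."""
    n = len(terms)
    df = Counter(w for t in terms for w in set(tokens(t)))
    return {w: math.log(n / df[w]) + 1.0 for w in df}  # +1 keeps weights positive


def similarity(query: str, candidate: str, idf: Dict[str, float]) -> float:
    """ASSUMED score: IDF weight of shared words over IDF weight of all words."""
    q, c = set(tokens(query)), set(tokens(candidate))
    union = q | c
    if not union:
        return 0.0
    default = max(idf.values(), default=1.0)  # unseen words count as rare
    weight = lambda w: idf.get(w, default)
    return sum(weight(w) for w in q & c) / sum(weight(w) for w in union)


def best_match(query: str, terms: List[str], idf: Dict[str, float]) -> Tuple[str, float]:
    """Try an exact match first, otherwise return the highest-scoring term."""
    for t in terms:
        if exact_match(query, t):
            return t, 1.0
    score, term = max((similarity(query, t, idf), t) for t in terms)
    return term, score


vocabulary = [
    "Neurofibromatosis 1",
    "Neurofibromatosis 2",
    "Epidermolysis Bullosa Simplex",
    "Epidermolysis Bullosa Dystrophica",
]
idf = build_idf(vocabulary)
exact = best_match("neurofibromatosis 2", vocabulary, idf)
partial = best_match("epidermolysis bullosa simplex, Weber-Cockayne type", vocabulary, idf)
```

Weighting by IDF means that rare, discriminating words such as "simplex" count for more than words shared by many terms, which is the usual motivation for borrowing it from information retrieval.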

  13. Benchmark • 92 disease names from 43 Swiss-Prot entries, manually mapped to MeSH terms • Used to evaluate the procedure in terms of recall and precision • Used to set up a score threshold
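How a benchmark of this kind yields precision, recall, and a score threshold can be sketched as follows. The definitions are the standard ones (precision is correct mappings over proposed mappings, recall is correct mappings over benchmark size); the function name `precision_recall` and the toy data, including every score, are invented for illustration and are not the benchmark results.

```python
from typing import Dict, Tuple


def precision_recall(
    gold: Dict[str, str],
    predicted: Dict[str, Tuple[str, float]],
    threshold: float,
) -> Tuple[float, float]:
    """Score automatic mappings against manual ones at a given score threshold."""
    # Keep only the automatic mappings whose score reaches the threshold.
    kept = {d: term for d, (term, score) in predicted.items() if score >= threshold}
    correct = sum(1 for d, term in kept.items() if gold.get(d) == term)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data with placeholder names and invented scores.
gold = {"disease a": "term x", "disease b": "term y", "disease c": "term z"}
predicted = {
    "disease a": ("term x", 1.0),  # correct, high score
    "disease b": ("term q", 0.4),  # wrong, low score
    "disease c": ("term r", 0.5),  # wrong, low score
}

loose = precision_recall(gold, predicted, threshold=0.0)
strict = precision_recall(gold, predicted, threshold=0.8)
```

In this toy case raising the threshold discards the two wrong low-scoring mappings, so precision rises while recall stays the same; sweeping the threshold over the benchmark is one way to pick a cut-off that favours precision.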

  14. Results on the benchmark

  15. Analysis of the results (1/3) • Problems in granularity difference • Disease: ‘muscle-eye-brain disease’ • Manual mapping (MeSH term): ‘abnormalities, multiple’ • Automatic mapping (MeSH term): ‘muscle liver brain eye nanism’

  16. Analysis of the results (2/3) • Problems in disease name extraction • Disease (extracted): ‘hematopoietic tumors such as b-cell lymphomas’ • Manual mapping (MeSH term): ‘hematologic neoplasms’ • Automatic mapping (MeSH term): ‘b-cell lymphoma’

  17. Analysis of the results (3/3) • Problems inherent to the resources • Disease (Swiss-Prot): ‘epidermolysis bullosa simplex, Weber-Cockayne type’ • Disease (OMIM alternative title): ‘epidermolysis bullosa dystrophica, Cockayne-Touraine type’ • Manual mapping / automatic mapping (MeSH terms): ‘epidermolysis bullosa dystrophica’ / ‘epidermolysis bullosa simplex’

  18. Results on all Swiss-Prot

  19. Discussion • The mapping system was tuned for high precision to provide a fully automated procedure. • But we need to improve the recall by: • Including NLP techniques in the disease extraction and matching procedures; • Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH); • Permitting a mapping to several MeSH terms; • Trying to map to other terminologies such as ICD-10 and SNOMED CT; • Using information from the literature, which is indexed with MeSH terms.

  20. Work in progress • Benchmark extended to 200 diseases

  21. Work in progress • Extract MeSH terms using full text from disease comment lines + references in Swiss-Prot + references in OMIM → calculate frequency • This frequency is used to refine the score for partial matches • Preliminary results: recall was increased to 62% without losing precision.

  22. Conclusion • We developed a generic terminology mapping procedure which can be used to link various biomedical resources. • Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research. • These results will help improve the interoperability between medical informatics and bioinformatics.
