ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20. UniProt to MeSH: mapping proteins to disease terminologies. Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey.
The role of bioinformatics in biomedical research and future clinical patient care
Health problem in a patient
Basic research: • What is the mechanism? • Epidemiological studies
Drug development; clinical trials
Basic research results stored in databases
Bioinformatics: • Data storage and representation • Large-scale data generation • Large-scale data analysis
Up-to-date knowledge and large-scale results: • Research direction • New hypothesis
Molecular-level decision-support tools: • Structured knowledge representations • ‘Filtered’ information on fundamental biological mechanisms and significant
Clinical patient care: doctor prescribes an individualized treatment plan; treatment outcome
Bio-Ontologies – ISMB 2007
Biomedical knowledge: a protein-centric view • Disease: pathology, diagnosis/prognosis, treatment, risk factor • Biological processes: biological pathway/network, protein-protein interaction • Proteins: sequence, function, structure, modifications • Genes: sequence, chromosomal location, regulation, expression
Biomedical knowledge: a protein-centric view
High-quality manual annotation: protein name, sequence, function, domain, features and references. 16,702 human proteins.
Disease (pathology, diagnosis/prognosis, treatment, risk factor). Disease annotation: • Link to 12,603 OMIM entries • Link to other specialized databases • 32,921 variants (or polymorphisms) • >3’000 associated diseases
Biological processes (biological pathway/network, protein-protein interaction). Biological process/proteomic: • Pathway annotation • Protein-protein interaction (DIP, IntAct) • Protein 2D gel (Swiss-2DPAGE)
Proteins: sequence, function, structure, modifications
Genes (sequence, chromosomal location, regulation, expression). Genomic data: • Genew, GeneCards, GenAtlas • Expression data (e.g. CleanEx) • Genome details: Ensembl
References: • Links to >100 other databases • Over 82’420 journal references
Objective: Increase the accessibility of molecular biology resources to clinical researchers by indexing UniProtKB/Swiss-Prot with the MeSH terminology.
Why UniProtKB/Swiss-Prot? • Most comprehensive warehouse of protein sequences • High level of annotation and highly cross-linked with other biological databases • Includes data on more than 30’000 variants, mostly cSNPs (coding SNPs) or SAPs (single amino-acid polymorphisms) • More than 3’000 diseases associated with a protein are also described (mostly genetic diseases associated with SAPs) http://beta.uniprot.org/
Disease annotation: UniProtKB/Swiss-Prot entry P35240
Why MeSH? • Controlled vocabulary thesaurus structured as a hierarchy of concepts • Each concept includes a set of terms (synonyms and lexical variants) • MeSH is part of the UMLS and is thus linked to other medical terminologies • MeSH is used to index the biomedical literature
The structure of MeSH
Mapping procedure • UniProtKB/Swiss-Prot entry → disease comment line → extracted disease name → exact match or partial match → MeSH • OMIM: title/alternative titles → exact match or partial match → MeSH • Same descriptor
Disease extraction • Extraction using regular expressions built on phrases such as ‘are the cause of’, ‘involved in’, etc. • Example: extracted disease name mapped to the MeSH term ‘Neurofibromatosis 2’
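The extraction step can be illustrated with a minimal Python sketch. It is not the authors' implementation: only the two trigger phrases quoted on the slide are used, while the rule "capture everything up to the next period or semicolon", the removal of a trailing parenthesised abbreviation, the function name `extract_disease_names` and the sample comment text are all assumptions made for illustration.

```python
import re
from typing import List

# Trigger phrases quoted on the slide; the slide's "etc." implies there were more.
TRIGGERS = ("are the cause of", "involved in")

# Assumption: the disease name runs from the trigger phrase to the next '.' or ';'.
_PATTERN = re.compile(
    r"(?:%s)\s+(?P<disease>[^.;]+)" % "|".join(re.escape(t) for t in TRIGGERS),
    re.IGNORECASE,
)


def extract_disease_names(comment: str) -> List[str]:
    """Return lower-cased candidate disease names found in a disease comment."""
    names = []
    for match in _PATTERN.finditer(comment):
        name = match.group("disease").strip()
        # Assumption: a trailing "(ABBR)" is an abbreviation, not part of the name.
        name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
        names.append(name.lower())
    return names


# Hypothetical comment text, written for this example only.
comment = "Defects in NF2 are the cause of neurofibromatosis 2 (NF2)."
print(extract_disease_names(comment))  # ['neurofibromatosis 2']
```

A pattern this simple also captures trailing context such as "such as ..." clauses, which is the kind of extraction problem discussed later in the analysis of the results.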
Term matching procedure • Exact matches: same length, same word order, case insensitive • Partial matches: calculation of a similarity score between terms based on the IDF (inverse document frequency) used in information retrieval (score formula shown on the slide). The term with the highest score was chosen.
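A minimal Python sketch of the two matching modes follows. The slide's own score formula is not reproduced here, so the partial-match score is a stand-in: an IDF-weighted word overlap (weighted Jaccard), which is an assumption. The function names, the `+ 1.0` smoothing, the `0.5` threshold and the four-term vocabulary are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokens(term):
    """Lower-case word tokens, so comparisons are case insensitive."""
    return re.findall(r"[a-z0-9]+", term.lower())


def build_idf(vocabulary):
    """IDF of each word over the vocabulary terms (the +1.0 smoothing is an assumption)."""
    n = len(vocabulary)
    doc_freq = Counter(w for term in vocabulary for w in set(tokens(term)))
    return {w: math.log(n / c) + 1.0 for w, c in doc_freq.items()}


def similarity(query, candidate, idf):
    """Assumed score: IDF weight of shared words / IDF weight of all words."""
    q, c = set(tokens(query)), set(tokens(candidate))
    default = max(idf.values())  # words unseen in the vocabulary count as rare
    shared = sum(idf.get(w, default) for w in q & c)
    total = sum(idf.get(w, default) for w in q | c)
    return shared / total if total else 0.0


def best_match(query, vocabulary, idf, threshold=0.5):
    """Exact match first; otherwise the highest-scoring partial match above the threshold."""
    for term in vocabulary:
        if tokens(term) == tokens(query):  # same length, same word order
            return term, 1.0, "exact"
    score, term = max((similarity(query, t, idf), t) for t in vocabulary)
    return (term, score, "partial") if score >= threshold else None


# Toy vocabulary for illustration; a real run would use the full set of MeSH terms.
vocabulary = ["Neurofibromatosis 2", "Neurofibromatosis 1",
              "Lymphoma, B-Cell", "Hematologic Neoplasms"]
idf = build_idf(vocabulary)
print(best_match("neurofibromatosis 2", vocabulary, idf))
print(best_match("B-cell lymphoma", vocabulary, idf))
```

In this sketch the first query is an exact match, while the second has the same words in a different order and is therefore resolved by the partial-match score.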
Benchmark: 92 disease names from 43 Swiss-Prot entries, manually mapped to MeSH terms • Used to evaluate the procedure in terms of recall and precision • Used to set a score threshold
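How a benchmark of this kind yields precision, recall and a score threshold can be sketched in Python. The definitions used are the standard ones (precision = correct mappings / proposed mappings, recall = correct mappings / benchmark size); the function name `evaluate`, the data layout and the four toy entries are hypothetical and do not come from the actual benchmark.

```python
def evaluate(predictions, gold, threshold):
    """Precision and recall of the mappings whose score reaches the threshold.

    predictions: disease name -> (proposed MeSH term, score)
    gold:        disease name -> manually assigned MeSH term
    """
    proposed = {d: term for d, (term, score) in predictions.items()
                if score >= threshold}
    correct = sum(1 for d, term in proposed.items() if gold.get(d) == term)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {"disease a": "Term A", "disease b": "Term B",
        "disease c": "Term C", "disease d": "Term D"}
predictions = {"disease a": ("Term A", 1.0), "disease b": ("Term B", 0.7),
               "disease c": ("Term X", 0.4), "disease d": ("Term D", 0.3)}

# Sweeping the threshold shows the precision/recall trade-off.
for threshold in (0.0, 0.5):
    print(threshold, evaluate(predictions, gold, threshold))
```

On the toy data, raising the threshold from 0.0 to 0.5 lifts precision from 0.75 to 1.0 while recall drops from 0.75 to 0.5, which is the trade-off a threshold tuned for high precision accepts.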
Results on the benchmark
Analysis of the results (1/3) • Problems due to differences in granularity • Disease: ‘muscle-eye-brain disease’ • Manual mapping / automatic mapping (MeSH terms): ‘abnormalities, multiple’ / ‘muscle liver brain eye nanism’
Analysis of the results (2/3) • Problems in disease name extraction • Disease (extracted): ‘hematopoietic tumors such as b-cell lymphomas’ • Manual mapping / automatic mapping (MeSH terms): ‘hematologic neoplasms’ / ‘b-cell lymphoma’
Analysis of the results (3/3) • Problems inherent to the resources • Disease (Swiss-Prot): ‘epidermolysis bullosa simplex, Weber-Cockayne type’ • Disease (OMIM alternative title): ‘epidermolysis bullosa dystrophica, Cockayne-Touraine type’ • Manual mapping / automatic mapping (MeSH terms): ‘epidermolysis bullosa dystrophica’ / ‘epidermolysis bullosa simplex’
Results on all Swiss-Prot
Discussion • The mapping system was tuned for high precision to provide a fully automated procedure. • But we need to improve the recall by: • Including NLP techniques in the disease extraction and matching procedures; • Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH); • Permitting a mapping to several MeSH terms; • Trying to map to other terminologies such as ICD-10 and SNOMED CT; • Using information from the literature, which is indexed with MeSH terms.
Work in progress • Benchmark extended to 200 diseases
Work in progress • Extract MeSH terms using the full text of disease comment lines + references in Swiss-Prot + references in OMIM, then calculate their frequency • This frequency is used to refine the score for partial matches • Preliminary results: recall was increased to 62% without losing precision.
Conclusion • We developed a generic terminology mapping procedure which can be used to link various biomedical resources. • Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research. • These results will help improve the interoperability between medical informatics and bioinformatics.