
TM Web services: Whatizit, CiteXplore

Explore EBI's text mining services for linking targets to diseases through named entity recognition, relation identification, and semantic integration of literature. Discover how EBI extracts knowledge from text to identify gene-disease associations and generate hypotheses in bioinformatics research.

rmanson




Presentation Transcript


  1. EBI TM services: mapping targets to diseases. September 3rd, 2009. Dietrich Rebholz-Schuhmann, MD, PhD, Group Leader, Rebholz Group, EBI, WT Genome Campus, Hinxton, Cambridge, U.K.

  2. TM at the EBI: current developments • TM Web services: Whatizit, CiteXplore • One of EBI’s major services: 11,000 hits per day, 400 MB data transfer • Ongoing integration into public services (UKPMC) • Research around new developments and Quality assurance • Working towards a knowledge infrastructure from literature • Named entity recognition: most progress • Relation / event identification • Repository of inferred knowledge: functional annotation of genes, diseases, gene-disease associations, relation identification • Exploitation of semantic resources (ontologies)

  3. The magic transformation from text to semantics Concepts Ideas Facts Relationships Events “Knowledge” ?

  4. How far can we go? Automatic + full integration with database resources? => Mainly entities + concepts ? Automatic generation of paper summaries => Extraction of facts + events Extraction of new knowledge => Generate hypotheses first Let the authors do it all => Do not use papers anymore

  5. Idealized R&D stages (overview) • Genes/Proteins, Chemical entities, Diseases, GO/MeSH terms, BioLexicon • Gene regulation ontology • Ternary relations • Functions of proteins, Gene-disease associations • Whatizit, IeXML • Integration of literature into bioinformatics IT services • Time: 2006, 2008, 2009 • Semantics support • Named entity recognition / grounding • Identification of relations • Interoperability of literature and text mining

  6. Document Entities Concepts Tokens Facts

  7. “The function of OmpR appears to be the enhancement of a basal level of ompC expression” • basal level of ompC expression • OmpR, ompC • … the of appear … • OmpR increases ompC expression

  8. Gene normalisation • SwissProt Biolexicon, human • Best performance => 100% Precision => 100% Recall • Performance is state of the art • Results are not tuned to the BioCreAtIve II corpus • Pezik et al., Proc. LREC Workshop, 2008

  9. “The function of OmpR appears to be the enhancement of a basal level of ompC expression” • basal level of ompC expression • OmpR, ompC • … the of appear … • OmpR increases ompC expression

  10. Entities + concepts • SwissProt Biolexicon, human • Chemical entities • Disease NER • MeSH terms • GO terms (@ Rank 1) • All solutions are state of the art • Jimeno et al., BMC Bioinformatics, 2008

  11. “The function of OmpR appears to be the enhancement of a basal level of ompC expression” • basal level of ompC expression • OmpR, ompC • … the of appear … • OmpR increases ompC expression

  12. Protein-protein interaction identification • GREs with Inference • Performance not adequate, improvements required • “Associate” • MI-PPI • All • NMI-PPI • GREs w/o Inference • Rebholz-Schuhmann et al., SMBM 2008

  13. How do we find knowledge?

  14. Gene-disease associations Motivation • Some diseases have a mono-genetic cause: • For example Cystic fibrosis, sickle cell anemia, F8/F9-defects, deafness • Other diseases have a pluri-genetic cause: • Schizophrenia, stomach cancer, hypertension • Question: • Can we find molecular functions that are shared between genes and diseases?

  15. Gene-disease association pairs from the literature

  16. Candidate genes: Approach • Complete Medline analysis • Identify all genes/proteins (80% F-measure) • Identify all gene ontological terms (35% F-measure) • Identify all diseases (70% F-measure) • Generation of concept profiles for genes and diseases • Each vector contains the TF-IDF value of all relevant GO concepts • A GO concept is relevant if found in the context of a gene or disease • Pivoted cosine similarity • Selection of gene profiles that are most similar to disease profiles • Prioritization of gene-disease associations • Evaluation • Alternative methods: MeSH annotations and tokens • “Gold standard” data resources: OMIM, GAD, GOA • Assessment by curators
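
The profile-matching step on this slide can be illustrated with a minimal sketch. It assumes plain cosine similarity (the slide names pivoted cosine similarity, which is not reproduced here), one common smoothed TF-IDF variant, and a toy input with made-up concept labels; none of the function names or data come from the presentation.

```python
import math
from collections import Counter


def tfidf_profiles(contexts):
    """Build one TF-IDF vector per entity.

    `contexts` maps an entity id (gene or disease) to the list of GO concept
    labels found in its textual context; repeats count as term frequency.
    """
    n_entities = len(contexts)
    doc_freq = Counter()
    for concepts in contexts.values():
        doc_freq.update(set(concepts))  # count each concept once per entity
    profiles = {}
    for entity, concepts in contexts.items():
        term_freq = Counter(concepts)
        profiles[entity] = {
            # Smoothed IDF (assumption): log((1 + N) / (1 + df)) + 1
            concept: count
            * (math.log((1 + n_entities) / (1 + doc_freq[concept])) + 1.0)
            for concept, count in term_freq.items()
        }
    return profiles


def cosine(a, b):
    """Plain cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[concept] for concept, weight in a.items() if concept in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_genes(disease, profiles):
    """Return (gene, score) pairs, most similar gene profile first."""
    genes = [entity for entity in profiles if entity.startswith("gene:")]
    scored = [(gene, cosine(profiles[disease], profiles[gene])) for gene in genes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: the concept labels are placeholders, not real GO ids.
contexts = {
    "gene:CFTR": ["chloride_transport", "chloride_transport", "ion_channel"],
    "gene:HBB": ["oxygen_transport", "heme_binding"],
    "disease:cystic_fibrosis": ["chloride_transport", "ion_channel", "mucus_secretion"],
}
profiles = tfidf_profiles(contexts)
ranking = rank_genes("disease:cystic_fibrosis", profiles)
```

In this toy input the gene whose context shares GO concepts with the disease is ranked first, and a gene with no shared concepts scores zero. A real run over Medline would also have to cope with the recognition error rates listed on the slide.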

  17. Candidate genes: Evaluation, OMIM/GAD • Limited performance due to: • term variability • not all G2D associations are relevant to OMIM

  18. Candidate genes: Validation by curators • Neither OMIM nor GAD is complete • Curators are better able to verify putative novel knowledge • Evaluation: • Random sample of 30 novel gene-disease association pairs • At least 2 out of 3 curators have to agree, use of literature resources • Verify the direct mention of the gene-disease association • Identify indirect evidence for the gene-disease pair • Verify the assignment of GO concepts
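
The "at least 2 out of 3 curators" rule amounts to a simple majority vote per association pair. A minimal sketch, assuming each curator gives a yes/no judgement; the pair names and votes below are invented for illustration and are not data from the study:

```python
def is_confirmed(votes, min_agree=2):
    """True if at least `min_agree` curators confirmed the association."""
    return sum(bool(vote) for vote in votes) >= min_agree


def confirmation_rate(judgements, min_agree=2):
    """Fraction of gene-disease pairs confirmed under the agreement rule."""
    if not judgements:
        return 0.0
    confirmed = sum(is_confirmed(votes, min_agree) for votes in judgements.values())
    return confirmed / len(judgements)


# Hypothetical judgements: one tuple of three curator votes per pair.
judgements = {
    ("geneA", "disease1"): (True, True, False),
    ("geneB", "disease1"): (True, False, False),
    ("geneC", "disease2"): (True, True, True),
    ("geneD", "disease3"): (False, True, True),
}
rate = confirmation_rate(judgements)
```

Raising `min_agree` to 3 would require unanimous agreement and can only lower the confirmed fraction.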

  19. Candidate genes: curator assessment • 63% of gene-disease associations can be confirmed by at least 2 curators • 57% of GO assignments describe the disease and the gene

  20. P-values of GDAPs (based on cosine scores) • No clear confirmation of gene-disease associations • Clear confirmation of most GDAPs

  21. Candidate genes: Outcome • Identification of 1,154 putative novel gene-disease associations from the literature • 63% (in total 727) should be reliable => to be confirmed • 672 distinct candidate genes linked to the associations • 340 genes are also covered in GOA linked to 545 gene-disease associations • 57% of the assigned GO concepts are reliable • Interpretation of the gene-disease association • 10% of the GO concept annotations are shared with GOA

  22. Gene-disease association pairs from the literature

  23. Where do we move in the future?

  24. How far can we go? ? Automatic generation of paper summaries => Extraction of facts + events Let the authors do it all => Do not use papers anymore

  25. Research to drive standards • Standardization of Document Formats: • IeXML • SciXML • Standardization of Content: • Genes • Chemical Entities • Medical terms • MeSH, GO terms • PaperMaker: Support to authors • Performance assessment on a very large corpus (FP07, support action) • Bioinformatics user: Analytical pipelines

  26. UKPMC: Prospect

  27. The process • Collaborative annotation of a large-scale biomedical corpus • Five project partners annotated the first corpus (150,000 documents, different semantic types) • Reconciliation, syntax + semantics => generate the pilot corpus • Make part of the pilot corpus available => challenge: reproduce the annotations • Close the challenge, harmonise the annotations again => next corpus • Reopen the challenge with the second harmonised corpus

  28. The challenge • 150,000 documents or more ... • Test set for all systems • Assessment, benchmarking

  29. Support to authors / readers • FEBS Letter experiment • Authors contribute to the curation work • They identify the correct entity in the DBs (gene/protein) • Curators add the protein-protein interaction to the DB (MINT) • BioCreative Meta-Server => BioCreative II.5 • BioLit (P. Bourne et al.) • adding semantic data to the literature => keep it in a DB • Word plug-in to annotate ontological terms • PaperMaker (Rebholz group) • Consistency analysis of manuscripts • Reflect, OnTheFly (Schneider group) • Annotation of documents + interlinking with DBs • Royal Society of Chemistry • Markup of text (Oscar + editors) => interlinked chemistry

  30. PaperMaker • PaperMaker - a tool to support authors writing biomedical papers: • Interactive feedback on the contents of papers (related work and concept annotations) • Formal consistency criteria checking (spelling, terminology, acronyms, references)

  31. Consistency parameters • Domain-independent: • General spelling and grammar • General readability • Appropriate use of references • Finding and acknowledging related work • Domain-specific use of terminology: • Should be consistent with domain-specific naming guidelines • Should not be ambiguous • Should conform to the conventional usage (possible clashes between naming guidelines and common-sense convention) • Useful to resolve terminology to reference databases (e.g. UniProt for protein names, ChEBI for chemical entities, etc.) • The special case of acronyms

  32. Content feedback • Resolving the contents to literature repositories • Finding related work (document retrieval) • Finding related ideas (passage retrieval) • Resolving the contents to ontological reference databases • MeSH descriptors have been demonstrated to improve biomedical information retrieval. Can we suggest MeSH terms directly to the authors? • Gene Ontology (GO) terms are increasingly used in information extraction systems.

  33. PaperMaker workflow • Original manuscript text • Module 1 Spell Checker • Module 2 Acronym Resolution • Module 3 NER • Module 4 GO Recognition • Module 5 MeSH Annotation • Module 6 Reference Check • Module 7 Related Work • Module 8 Summary • Modified manuscript text
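
A modular workflow of this kind can be sketched as a list of independent checks applied to the manuscript text in turn. The two checks below are simplified stand-ins written for illustration; they are not PaperMaker's modules, and every function name, regular expression, and sample sentence is an assumption. Unlike the workflow on the slide, this sketch only reports findings and does not modify the text.

```python
import re
from functools import partial


def check_acronyms(text):
    """Flag all-caps tokens that are never introduced in parentheses, e.g. '(NER)'."""
    used = set(re.findall(r"\b[A-Z]{2,}\b", text))
    defined = set(re.findall(r"\(([A-Z]{2,})\)", text))
    return [f"acronym '{a}' is never defined" for a in sorted(used - defined)]


def check_references(text, n_refs):
    """Flag numeric citations such as [3] that exceed the reference list length."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", text)}
    return [f"citation [{n}] has no reference entry" for n in sorted(cited) if n > n_refs]


def run_pipeline(text, stages):
    """Apply each (name, check) stage to the text and collect its feedback."""
    return {name: check(text) for name, check in stages}


# Hypothetical manuscript fragment with a two-entry reference list.
manuscript = "Named entity recognition (NER) is used [1]. GO terms are added [3]."
stages = [
    ("acronyms", check_acronyms),
    ("references", partial(check_references, n_refs=2)),
]
report = run_pipeline(manuscript, stages)
```

Keeping each check as a plain function of the text means further stages can be appended to `stages` without touching the runner.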

  34. PaperMaker, Conclusions • PaperMaker can help the author conform to the formal requirements of paper writing with special emphasis on the domain • It also provides feedback on the contents by relating it to reference resources and literature repositories • It may improve the indexing of a paper in literature repositories (less ambiguous terminology) • http://www.ebi.ac.uk/Rebholz-srv/PaperMaker • Work in progress

  35. TM services at the EBI: Conclusions • Standardised TM solutions available, free use • Quality assurance is ongoing work, integration with EBI’s data resources • About 500 to 2,500 users, 50 GB annual data transfer • Knowledge infrastructure is work in progress • Annotations of genes, diseases • Extraction of different types of relations • Collaborations between publishers and pharmaceutical industry (SESL project)

  36. Editorial Board: Christopher Baker, Olivier Bodenreider, Philip Bourne, Anita Burgun-Parenthoine, Carol Friedman, Carole Goble, Udo Hahn, Lynette Hirschman, Jung-Jae Kim, Patrick Lambrix, Ulf Leser, Susanna Lewis, Jong C. Park, Alan Ruttenberg, Tapio Salakoski, Susanna Assunto-Sansone, Michael Schroeder, Stefan Schulz, Amnon Shabo, Barry Smith, Robert Stevens, Toshihisa Takagi, Alfonso Valencia, Mark Wilkinson, Limsoon Wong

  37. Acknowledgements … • IeXML: G. Nenadic, Uo.Manchester • CALBC: J. v.d.Lei, Rotterdam; E. v.Mulligen, Rotterdam; O. Bodenreider, NLM • Other: M. Ashburner, Uo.Cambridge; U. Leser, HUo.Berlin; D. Trieschnigg, Uo.Twente; F. Couto, Uo.Lisbon; A. Waagmeester, Uo.Maastricht; S. Jaeger, HUo.Berlin; T. Grego, Uo.Lisbon; A. Baillif, Uo.Clermont-Ferrand • BootStrep: Udo Hahn, Uo.Jena; E. Beisswanger, Uo.Jena; K. Tomanek, Uo.Jena; K. Buyko, Uo.Jena; S. Ananiadou, Uo.Manchester; N. Calzolari, CNRS Pisa; A. Burgun, Uo.Rennes • EBI: P. Stoehr, E. Dimmer, E. Camron, M. Kapushevski, H. Hermjakob, N. Luscombe, D. Clark, P. Flicek,
