
Answering Gene Ontology terms to proteomics questions by supervised macro reading in MEDLINE

Julien Gobeill (1), Emilie Pasche (2), Douglas Teodoro (2), Anne-Lise Veuthey (3), Patrick Ruch (1)
(1) University of Applied Sciences, Information Sciences, Geneva
(2) Hospitals and University of Geneva, Geneva
(3) Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva


Presentation Transcript


  1. Julien Gobeill (1), Emilie Pasche (2), Douglas Teodoro (2), Anne-Lise Veuthey (3), Patrick Ruch (1). (1) University of Applied Sciences, Information Sciences, Geneva; (2) Hospitals and University of Geneva, Geneva; (3) Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva. Answering Gene Ontology terms to proteomics questions by supervised macro reading in MEDLINE

  2. Data deluge… “What is the subcellular location of protein MEN1?” “What molecular functions are affected by Ryanodine?”

  3. Ontology-based search engines

  4. Question Answering (EAGLi system) Redundancy hypothesis: the number of associated/co-occurring answers dominates other dimensions
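The redundancy hypothesis on this slide can be illustrated with a minimal sketch: candidate answers are ranked by how many retrieved documents mention them. The function name `rank_by_redundancy` and the toy answer sets are assumptions made for illustration, not part of the EAGLi system.

```python
from collections import Counter

def rank_by_redundancy(doc_answers):
    """Rank candidate answers by the number of documents they occur in.

    doc_answers: one collection of candidate answers per retrieved document
    (hypothetical input format).
    """
    counts = Counter()
    for answers in doc_answers:
        # Count each answer at most once per document.
        counts.update(set(answers))
    # Most frequent first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy example: three retrieved abstracts and the answers found in each.
ranking = rank_by_redundancy([
    {"nucleus", "cytoplasm"},
    {"nucleus"},
    {"membrane", "nucleus"},
])
print(ranking)
```

In this toy run the answer mentioned in all three documents comes first, which is the behaviour the hypothesis relies on.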

  5. Best way for extracting GO terms from a set of abstracts? (1/3) • Comparison based on two categorizers: • Thesaurus-based (EAGL) • Competitive with MetaMap (Trieschnigg et al., 2009) • Computes lexical similarity between text and GO terms • Machine learning (GOCat) • k-NN • Similarity between input text and already curated abstracts • KB derived from GOA: ~90’000 instances
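As a rough sketch of the k-NN idea on this slide, the code below compares an input text with already curated abstracts and lets the k most similar ones vote for their GO terms. Bag-of-words cosine similarity, the function names, and the three-entry knowledge base are assumptions for illustration; the slides do not specify GOCat's actual similarity measure or voting scheme.

```python
import math
from collections import Counter, defaultdict

def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_go_terms(text, knowledge_base, k=3):
    """Score GO terms by summed similarity over the k nearest curated abstracts.

    knowledge_base: list of (abstract_text, [GO terms]) pairs (hypothetical format).
    """
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(abstract)), terms)
              for abstract, terms in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for similarity, terms in scored[:k]:
        for term in terms:
            votes[term] += similarity
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy knowledge base standing in for the ~90'000 GOA-derived instances.
toy_kb = [
    ("calcium channel activity in muscle", ["GO:0005262"]),
    ("nuclear localization of the protein", ["GO:0005634"]),
    ("calcium release channel ryanodine receptor", ["GO:0005262", "GO:0005219"]),
]
print(knn_go_terms("ryanodine affects calcium channel", toy_kb, k=2))
```

The thesaurus-based alternative would instead compare the input text directly with the GO term labels, without using curated examples.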

  6. Best way for extracting GO terms from a set of abstracts? (2/3) • Two tasks: • Classical categorization (micro reading ~ biocuration): one abstract/paper → GO terms • Redundancy-based QA (macro reading): a set of n (=100) abstracts → Σ GO terms
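The difference between the two tasks can be sketched as follows: micro reading categorizes one abstract, while macro reading sums the per-abstract scores over the whole retrieved set. The `macro_read` function and the `categorize` callback are hypothetical names; any per-abstract categorizer returning (GO term, score) pairs could be plugged in.

```python
from collections import defaultdict

def macro_read(abstracts, categorize, top=5):
    """Aggregate per-abstract GO term scores over a set of abstracts.

    categorize: function mapping one abstract to a list of (go_term, score)
    pairs, i.e. the micro-reading step (supplied by the caller).
    """
    totals = defaultdict(float)
    for abstract in abstracts:
        for go_term, score in categorize(abstract):
            totals[go_term] += score
    # Highest summed score first; ties broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]

def toy_categorizer(abstract):
    """Stand-in micro reader used only for this illustration."""
    if "nuclear" in abstract:
        return [("nucleus", 1.0)]
    return [("cytoplasm", 0.5)]

print(macro_read(["nuclear x", "nuclear y", "other"], toy_categorizer))
```

With n = 100 abstracts, as on the slide, terms that recur across many abstracts accumulate the highest totals.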

  7. Best way for extracting GO terms from a set of abstracts? (3/3) • One benchmark for micro reading evaluation • 1’000 abstracts and GO descriptors from GOA • Two benchmarks for macro reading evaluation • 50 questions derived from a set of biological databases: What molecular functions are affected by [chemical]? What cellular component is the location of [protein]?

  8. Results + 75/120% for k-NN (sup. learning) • Redundancy hypothesis insufficient Why/Where is the power? Size does or does not matter?

  9. Deluge is self-compensated GOCat EAGL

  10. Deluge is self-compensated GOCat EAGL

  11. Magic! The automatic categorization of a PMID from 2007 performed in 2011 is of higher quality than a categorization of the same PMID performed in 2007. No concept drift at all, and even some improvement!

  12. Example in toxicogenomics: CTD vs. GOCat “What molecular functions are affected by Ryanodine?” GOCat

  13. Example in UniProt “What is the subcellular location of protein MEN1?” GOCat

  14. Qualitative evaluation Relevant vs. irrelevant: 82% vs. 18%. Guha R., Gobeill J. and Ruch P. Automatic Functional Annotation of PubChem BioAssays

  15. Conclusion and future work • Automatic assignment of GO categories ~ 43% • [Camon et al. 2003: GO kappa ~ 40%] • Classification model improves faster than drift • [Consistency of annotation guidelines] • Next: Effective integration into the EAGLi question-answering platform

  16. Collaborations • Automatic Functional Annotation of PubChem BioAssays • Generates semantic similarity clusters • Automatically populating large protein datasets • Genes with unvalidated predicted functions

  17. Please visit EAGLi, the Bio-medical question answering engine http://eagl.unige.ch/EAGLi/ !

  18. The Gene Ontology Categorizer: http://eagl.unige.ch/GOCat/ Other resources… TWINC (patent retrieval…) http://bitem.hesge.ch

  19. Acknowledgments • Swiss-Prot group (SIB): Anne-Lise Veuthey, Yoannis Yenarios • U. Indiana/SCRIPPS: Rajarshi Guha / Stephan Schurer • The COMBREX project: Martin Steffen • NextProt: Pascale Gaudet • SNF Grant: EAGL # 120758 • EU FP7: www.KHRESMOI.eu # 257528
