Answering Gene Ontology terms to proteomics questions by supervised macro reading in MEDLINE

Julien Gobeill (1), Emilie Pasche (2), Douglas Teodoro (2), Anne-Lise Veuthey (3), Patrick Ruch (1) • (1) University of Applied Sciences, Information Sciences, Geneva • (2) Hospitals and University of Geneva, Geneva • (3) Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva

Data deluge… “What is the subcellular location of protein MEN1?” “What molecular functions are affected by Ryanodine?”

Ontology-based search engines

Question Answering (EAGLi system) Redundancy hypothesis: the number of associated/co-occurring answers dominates other dimensions
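The redundancy hypothesis can be illustrated with a minimal sketch: candidate answers are ranked by the number of retrieved documents in which they occur. This is not the EAGLi implementation; the function name `rank_by_redundancy` and the toy data are assumptions made for illustration only.

```python
from collections import Counter

def rank_by_redundancy(docs_candidates):
    """Rank candidate answers by the number of documents mentioning them.

    docs_candidates: one list of candidate answers per retrieved document
    (hypothetical input format, assumed for this sketch).
    """
    counts = Counter()
    for candidates in docs_candidates:
        # Count each answer at most once per document.
        counts.update(set(candidates))
    return counts.most_common()

# Toy candidates extracted from four retrieved abstracts (invented data).
retrieved = [
    ["nucleus", "cytoplasm"],
    ["nucleus"],
    ["nucleus", "membrane", "nucleus"],
    ["cytoplasm"],
]
ranking = rank_by_redundancy(retrieved)
print(ranking)
```

Counting per document rather than per mention keeps a single abstract that repeats a term from dominating the ranking.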

Best way for extracting GO terms from a set of abstracts? (1/3) • Comparison based on two categorizers: • Thesaurus-based (EAGL) • Competitive with MetaMap (Trieschnigg et al., 2009) • Computes lexical similarity between text and GO terms • Machine learning (GOCat) • k-NN • Similarity between input text and already curated abstracts • KB derived from GOA: ~90’000 instances
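The k-NN idea above can be sketched as follows: the input text is compared with already curated abstracts, and the GO terms of the k most similar ones are proposed, weighted by similarity. This is only a sketch under stated assumptions, not GOCat itself: the bag-of-words cosine similarity, the function names, and the three-entry knowledge base are all invented for illustration (the real KB is derived from GOA).

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased whitespace tokens with their frequencies."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_go_terms(text, knowledge_base, k=2):
    """Propose GO terms from the k curated abstracts most similar to text."""
    query = bag_of_words(text)
    scored = sorted(
        ((cosine(query, bag_of_words(abstract)), terms)
         for abstract, terms in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, terms in scored[:k]:
        for term in terms:
            votes[term] += similarity  # similarity-weighted vote
    return votes.most_common()

# Toy knowledge base of (curated abstract, GO term names); invented data.
kb = [
    ("menin localizes to the nucleus of the cell", ["nucleus"]),
    ("ryanodine modulates calcium channel activity in muscle",
     ["calcium channel activity"]),
    ("nuclear protein menin regulates transcription",
     ["nucleus", "regulation of transcription"]),
]
predictions = knn_go_terms("menin is found in the nucleus", kb, k=2)
print(predictions)
```

A thesaurus-based categorizer would instead compare the input text directly with the GO term labels, so it cannot benefit from the growing set of curated examples.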

Best way for extracting GO terms from a set of abstracts? (2/3) • Two tasks: • Classical categorization (micro reading ~ biocuration): one abstract/paper → GO terms • Redundancy-based QA (macro reading): a set of n (=100) abstracts → Σ GO terms
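The difference between the two tasks can be sketched in a few lines: micro reading keeps the best GO terms for a single abstract, while macro reading sums the categorizer scores over the whole set of abstracts before ranking. The per-abstract scores, the function names, and the plain summation are assumptions for illustration; the slides only indicate a sum (Σ) over the n abstracts.

```python
from collections import defaultdict

def micro_reading(scored_terms, top=1):
    """Return the top-scoring GO terms for one abstract."""
    return sorted(scored_terms, key=scored_terms.get, reverse=True)[:top]

def macro_reading(per_abstract_scores):
    """Sum GO term scores over a set of abstracts and rank the totals."""
    totals = defaultdict(float)
    for scored_terms in per_abstract_scores:
        for term, score in scored_terms.items():
            totals[term] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical categorizer scores for three abstracts (n = 3, not 100).
abstracts = [
    {"nucleus": 0.6, "cytoplasm": 0.3},
    {"cytoplasm": 0.5, "nucleus": 0.4},
    {"nucleus": 0.7},
]
micro = micro_reading(abstracts[1])
macro = macro_reading(abstracts)
print(micro)
print(macro)
```

In this toy case the second abstract alone points to "cytoplasm", but the aggregated ranking puts "nucleus" first, which is the effect macro reading relies on.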

Best way for extracting GO terms from a set of abstracts? (3/3) • One benchmark for micro reading evaluation • 1’000 abstracts and GO descriptors from GOA • Two benchmarks for macro reading evaluation • 50 questions derived from a set of biological databases: What molecular functions are affected by [chemical]? What cellular component is the location of [protein]?

Results +75/120% for k-NN (sup. learning) • Redundancy hypothesis insufficient Why/Where is the power? Size does or does not matter?

Deluge is self-compensated [Figure: GOCat vs. EAGL]

Magic! The automatic categorization based on a PMID2007 performed in 2011 is of higher quality than a categorization on the same PMID2007 performed in 2007. No concept drift at all, and even some improvement!

Example in toxicogenomics: CTD vs. GOCat “What molecular functions are affected by Ryanodine?” GOCat

Example in UniProt “What is the subcellular location of protein MEN1?” GOCat

Qualitative evaluation • Relevant vs. irrelevant: 82% vs. 18% • Guha R., Gobeill J. and Ruch P. Automatic Functional Annotation of PubChem BioAssays

Conclusion and future work • Automatic assignment of GO categories ~ 43% • [Camon et al 2003: GO kappa ~ 40%] • Classification model improves faster than drift • [Consistency of annotation guidelines] • Next: Effective integration into the EAGLi question-answering platform

Collaborations • Automatic Functional Annotation of PubChem BioAssays • Generates semantic similarity clusters • Automatically populating large protein datasets • Genes with unvalidated predicted functions

Please visit EAGLi, the Bio-medical question answering engine http://eagl.unige.ch/EAGLi/ !

The Gene Ontology Categorizer: http://eagl.unige.ch/GOCat/ Other resources… TWINC (patent retrieval…) http://bitem.hesge.ch

Acknowledgments • Swiss-Prot group (SIB): Anne-Lise Veuthey, Yoannis Yenarios • U. Indiana/SCRIPPS: Rajarshi Guha / Stephan Schurer • The COMBREX project: Martin Steffen • NextProt: Pascale Gaudet • SNF Grant: EAGL # 120758 • EU FP7: www.KHRESMOI.eu # 257528
