GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts Robert Gaizauskas Natural Language Processing Group Department of Computer Science

M. Ghanem, Tom Barnwell, Y. Guo Department of Computing GO Tag: Assigning Gene Ontology Labels to Medline Abstracts Robert Gaizauskas N. Davis, Y.K. Guo, H. Harkema Natural Language Processing Group Department of Computer Science J. Ratcliffe

Outline • Context • Project Background • The Gene Ontology • Go Annotation in Model Organism Databases • Medline • Go Tagging Tasks • User types/scenarios • Possible tasks • Related Work • Data sets/Gold Standards • Approaches and Results to Date • Lexical lookup • Vector Space Similarity • Machine Learning • Exploiting the Results in Search Tools NaCTeM Seminar

Project Background • Work is funded by the EPSRC as a Best Practice Project for collaboration between DiscoveryNet and myGrid -- E-Science Pilot Projects (2001-5) • Both projects • have developed text mining and data analysis components -- complementary approaches NLP vs. datamining/statistical analysis • workflow models for co-ordinating distributed services • working on life science applications • Aim: to develop a unified real-time e-Science text-mining infrastructure that builds upon and extends the technologies and methods developed by both Discovery Net and myGrid • Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework • Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology NaCTeM Seminar

The Gene Ontology • “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism”http://www.geneontology.org/ • Consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated: • biological processes • cellular components • molecular functions in a species-independent manner • E.g. gene product cytochrome c can be described by • the molecular function term electron transporter activity • the biological process terms oxidative phosphorylation and induction of cell death • the cellular component terms mitochondrial matrix and mitochondrial inner membrane NaCTeM Seminar

Gene Ontology (cont) From: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29. NaCTeM Seminar

The Gene Ontology (cont) • Started as a joint effort between three model organism databases (FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD)) • GO now (08/11/05) contains 19022 terms • GO Slim(s) are reduced versions of GO ontologies containing a subset of GO terms • Aim to give a broad overview of ontology content • GO Slim Generic currrently contains 127 terms • A typical GO Term Term name: isotropic cell growth Accession: GO:0051210 Ontology: biological_process Synonyms: related: uniform cell growth Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.” … NaCTeM Seminar

Evidence code Gene Go Annotation Reference TAS : Traceable Author Statement structural constituent of cytoskeleton Botstein D, et al. (1997) The yeast cytoskeleton. Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. I. Establishment and maintenance Botstein D, et al. (1997) The yeast cytoskeleton. TAS : Traceable Author Statement ACT1 exocytosis Galarneau L, et al. (2000) Multiple links between the NuA4 histone acetyltransferase complex and epigenetic control of transcription IDA : Inferred from Direct Assay histone acetyltransferase complex GO Annotation in Model Organism DB’s • Model organism db’s typically record for each entry (gene) one or more GO codes + links to the literature supporting the assignment of the GO code • E.g. from the Saccharomyces Genome Database (SGD) IC: Inferred by Curator IDA: Inferred from Direct Assay IEA: Inferred from Electronic Annotation IEP: Inferred from Expression Pattern IGI: Inferred from Genetic Interaction IMP: Inferred from Mutant Phenotype IPI: Inferred from Physical Interaction ISS: Inferred from Sequence or Structural Similarity NAS: Non-traceable Author Statement ND: No biological Data available RCA: inferred from Reviewed Computational Analysis TAS: Traceable Author Statement NR: Not Recorded NaCTeM Seminar

PubMed • PubMed • on-line bibliographic database designed to provide access to citations from biomedical literature • developed by the US NCBI at the NLM • Contains Medline, OldMedline, various other sources • Medline • Over 12 million citations dating back to 1960’s • Author abstracts and citations from > 4800 biomedical journals NaCTeM Seminar

PubMed • Entrez is NCBI’s integrated, text-based search and retrieval system for the major databases it maintains NaCTeM Seminar

User Types • Research Geneticists • Narrow information interest • Particular gene • Particular activity/functionality • Model Organism Genome DB Curators • Broader information interest • Typically track a number of publications, seeking to enhance information stored in the model organism genome DB at the locus level NaCTeM Seminar

User Scenarios: Research Geneticist • Possible scenarios using GO tagging to support a research geneticist include: • Search result presentation: • Tag abstracts returned from a PubMed search with GO codes • Use GO codes to cluster/structure search results to support more effective information access • Structuring of related literature as workflow side-effect • Many typical researcher workflows involve BLAST searches yielding BLAST/Swissprot reports • Workflow can automatically assemble a set of “related” papers by extracting PMIDs of homologous genes/proteins from reports and collecting these abstracts plus, optionally, others closely related by text similarity • Resulting abstract set can be clustered/structured by GO terms and presented to researcher (Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. Gaizauskas, Davis, Demetriou, Guo and Roberts, In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), 2004.) NaCTeM Seminar

Search Result Presentation: Motivating Example • One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1) • Putting LIM Kinase into Entrez gives 146 possible papers of interest. NaCTeM Seminar

Search Result Presentation: Motivating Example • One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1) • Putting LIM Kinase into Entrez gives 146 possible papers of interest. • However search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers): • Suggests even a single gene may be involved in numerous roles and that clustering according to GO codes may give a more focused method of searching rather than simply supplying more and more keywords which may remove useful and important papers from the result set. GO:0006468 : biological_process : protein amino acid phosphorylation GO:0004674 : molecular_function : protein serine/threonine kinase activity GO:0004672 : molecular_function : protein kinase activity GO:0007283 : biological_process : spermatogenesis GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization GO:0005515 : molecular_function : protein binding GO:0005634 : cellular_component : nucleus GO:0005925 : cellular_component : focal adhesion GO:0005515 : molecular_function : protein binding NaCTeM Seminar

Search Result Presentation: Motivating Example • However search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers): • Suggests even a single gene may be involved in numerous roles and that clustering according to GO codes may give a more focused method of searching rather than simply supplying more and more keywords which may remove useful and important papers from the result set. GO:0006468 : biological_process : protein amino acid phosphorylation GO:0004674 : molecular_function : protein serine/threonine kinase activity GO:0004672 : molecular_function : protein kinase activity GO:0007283 : biological_process : spermatogenesis GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization GO:0005515 : molecular_function : protein binding GO:0005634 : cellular_component : nucleus GO:0005925 : cellular_component : focal adhesion GO:0005515 : molecular_function : protein binding NaCTeM Seminar

User Scenarios: Model Organism DB Curator • Possible scenarios using GO tagging/text mining to support DB curators include: • Help assemble texts that may support GO code assignment • GO tag texts in curator’s watching brief • Automated tagging could act as prompt for/check on curator’s judgement • Help to determine gene-GO term pairs for annotation • Perform GO tagging/ gene name identification at text level and suggest all pairs as candidates • Perform GO tagging/gene name identification at sentence level and suggest candidates • Attempt to assign GO evidence codes • To text segments providing evidence for GO code assignment without identifying GO code/gene pair to which the evidenced pertains • To text segments providing evidence plus the GO code/gene pair to which the evidenced pertains NaCTeM Seminar

Possible Tasks (1) Assigning GO codes to abstracts/full papers • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned) • Note in this task there is no association of GO code with any specific gene/gene product NaCTeM Seminar

Possible Tasks (2) Assigning GO codes to genes/gene products in abstracts/full papers • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: If the text supports the assignment of one or more GO codes to a gene/gene product, identify gene/gene product-GO code pairs and the text supporting the assignments • This capability would support additional tasks • Given a particular gene/gene product and a text collection, find all GO codes for the gene/gene product across the collection • Given a GO code and a text collection, find all genes/gene products tagged with the code across the collection NaCTeM Seminar

Possible Tasks (3) Assigning evidence codes to genes/gene products-GO code pairings in abstracts/full papers • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: As in Task 2. but additionally supply the evidence codes • A weaker variant of this is just to suggest evidence text that may assist in the assignment of GO code NaCTeM Seminar

Related Work • Raychaudri, Chang, Sutphin & Altman (2002) • Task: associate GO codes with genes by • Associating GO codes with papers • Associating a specific GO code with a gene if sufficient number of papers mentioning the gene have the GO associated with them • Method: Treat 1. as a document classification task and evaluate maximum entropy, Naïve Bayes and Nearest Neighbours approaches • Evaluation: corpus of 20,000 Medline abstracts assigned one or more of 21 GO terms/categories • Results: maximum entropy best -- 72.8% classification accuracy over 21 categories NaCTeM Seminar

Related Work • Go-KDS (Smith & Cleary, 2003) • Product of Reel Two • Task: assign arbitrary GO terms to PubMed articles • Method: • Proprietary Weighted Confidence learner (similar to Naïve Bayes), using only words as features • trained on gene/protein DB’s which use GO codes AND have links to Medline • Evaluated on approx. same data/task as Raychaudri et al. -- 70.5 % accuracy NaCTeM Seminar

Related Work • GoPubMed • On-going work at Dresden University (www.gopubmed.org) • Task: Annotate PubMed abstracts with GO terms • Method: Use a local sequence alignment algorithm with weighted term matching (to overcome limits of strict matching) between GO terms and strings in texts • Evaluation: None reported • Kiritchenko et al. (U. of Ottawa) • Task: assign arbitrary GO terms to biomedical texs • Method: • Treat task as hierarchical text classification • use AdaBoost.MH • Evaluation: • introduce hierarchical evaluation measure • Results unclear NaCTeM Seminar

Related Work (cont) • Biocreative challenge -- task 2 contained three related subtasks • Given an article, a protein and a GO code, where the article justifies the assignment of the GO code to the protein, find evidence text in the article supporting the assignment • Given protein-article pairs plus the number of GO code assignments supported by the article, find the GO code(s) that should be assigned to the protein based on the article • Given a set of proteins, retrieve a set of papers relevant to assigning codes to the proteins plus the GO code annotations and the supporting passages (not evaluated) • Results indicated no systems ready for practical use • Issues: lack of training data; complexity of tasks NaCTeM Seminar

Related Work (cont) • TREC Genomics Track 2004 -- three tasks related to GO code assignment • Triage -- given a set of articles find those that contain some evidence for the assignment of a GO code, i.e. warrant being curated • Given an article and names of genes occurring in the article assign one or more of the top three GO ontologies from which human curators had assigned codes • Task 2 plus provide evidence code supporting each gene-GO hierarchy label association • Results for all three tasks poor NaCTeM Seminar

Data Sets and Evaluation • In order to assess performance of GO tag assignment, a gold standard manually annotated/verified corpus is needed • However, no such corpus exists … NaCTeM Seminar

Data Sets and Evaluation • Solution 1: SGD Gold Standard • Derive a corpus from SGD model organism database (yeast) • Assemble all Medline abstracts cited as evidence supporting assignment of GO terms • Associate with each abstract the GO term whose assignment it is cited as supporting • I.e. given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T • SGD Gold Standard • 4922 PMIDS • 2455 GO terms • 10485 PMID-GO term pairs NaCTeM Seminar

Data Sets and Evaluation: SGD “Gold” Standard • Advantages: • Data already exists -- no extra annotation work required • Can assemble similar corpora for each model organism DB • Disadvantages: • Each abstract has associated with it GO terms whose assignment to specific genes it supports, but may be missing other GO terms which can also be legitimately attached to it • Not every paper supporting a GO term assignment will be cited • Consequence: • SGD gold standard is “GO term incomplete” • Weak measure of recall • Precision figures difficult to interpret NaCTeM Seminar

Data Sets and Evaluation: SGD “Gold” Standard • Further issue: • SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts • Consequence: • Limit on maximum Recall obtainable by system NaCTeM Seminar

Data Sets and Evaluation (cont) • Solution 2: IC Gold Standard • Manually extend the GO annotation of abstracts derivable from the SGD • Goal: GO term complete gold standard • Selected a subset (~800) for which support for all the assigned GO codes is found in the abstract (rather than the full paper) • Manually added additional GO annotations using a combination of fuzzy maching against GO and some manual addition of synonyms during checking • For included terms, include lowest within each ontology ‘cell wall biosynthesis’ => ‘cell wall biosynthesis’‘cell wall’ • Also applied same methodology to evidence paragraphs -- brief summaries written by curators deliberately using GO vocabulary • IC Gold Standard • 785 PMIDS • 1006 GO terms • 5170 PMID-GO term pairs NaCTeM Seminar

Data Sets and Evaluation (cont) • Advantages: • Much closer to a GO-term complete gold standard • Disadvantages • Still not GO-term complete • Method of creation suggests there may still be many unannotated GO terms that ought to be marked (direct mentions of GO terms vs. semantically entailed GO terms) • Gold Standard creation method favors lexical look-up approach to GO-tagging • Dataset is small NaCTeM Seminar

The Go Tagging Task Addressed • The approaches we investigated all considered Task 1, as defined earlier: • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology • Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned) NaCTeM Seminar

Approach 1: Lexical Lookup Using Termino • Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation • Termino contains a database holding large numbers of terms imported from various existing terminological resources, e.g., UMLS, GO • Efficient recognition of terms in text is achieved through use of finite state recognizers compiled from contents of database • The results of lexical look-up in Termino can feed into further term processing components, e.g., term parser • Available as a Web Service (see http://nlp.shef.ac.uk) NaCTeM Seminar

Existing Terminological Resources Raw Texts Neurofibromin GO annotations: - 0008181: tumor supressor - 0005737: cytoplasma - … Medline Abstracts UMLS GO … Mastectomy UMLS data: - CUI: C0024881 - semantic type: therapeutic or preventive procedure - synonyms: mammectomy - … Source-Specific Loaders TermDB Term Induction Peptidyl-prolyl isomerase - type: protein term - source: induced from Medline - … Finite State Look-Up Term Parser Termino Text out Text in Termino Terminology Engine NaCTeM Seminar

Lexical Look-Up for GO Tag • Termino • Imported names of all terms in GO, plus their GO ids and namespace attributes (18270 names in total) • Go term synonyms • SGD yeast gene names • Recognition of terms in text • Case-insensitive • “Simple” morphological variants are recognized • Cells mapped onto cell • Mitochondrial, mitochondria not mapped onto mitochondrion NaCTeM Seminar

Lexical Look-Up for GO Tag (cont) • GO code assignment • GO term T is assigned to text iff name of T is recognized in text • Extensions: • GO term T is assigned to a paper if synonym ofterm T occurs in the abstract of the paper • GO term T is assigned to a paper if yeast gene nameassociated with term T occurs in the abstract of the paper NaCTeM Seminar

Lexical Lookup Results for GO Slim NaCTeM Seminar

Lexical Lookup Results for GO Full NaCTeM Seminar

Lexical Lookup Approach: Discussion • Recall • Effect of curators using full text (SGD) vs. abstracts only (IC) • Inherent drawbacks of lexical look-up: term variation, literal mentions • Effects of Gold Standard creation method (IC) • Precision • Effects of Gold Standard creation method (IC) • GO vs. GO Slim • Recognizing GO Slim terms is easier than recognizing GO terms • Effects of extensions (synonyms/gene names) on performance • Adding synonyms: variable decrease in Precision, substantial increase in Recall • Adding yeast terms: substantial decrease in Precision, substantial increase in Recall NaCTeM Seminar

Error Analysis • False negatives for abstracts: • Abbreviation: mismatch repair(GO name) vs. MMR (in text) • Permutation, derivation: regulation of translation vs. regulated translation, sporulationvs.sporulate • Truncation: galactokinase activity vs. galactokinase • Alternative descriptions: protein catabolism vs. proteins for degradation, autophagic vacuolevs. autophagosomal NaCTeM Seminar

Approach 2: IR-based Vector Space Similarity • Document Collection • Build a collection of “GO documents” where each GO document consists of GO term, its synonyms and its definition sentence • Query • Treat each abstract to which GO codes are to be assigned as a query against the GO document collection • Retrieval • Given a query (i.e abstract) retrieve relevant “GO documents” (i.e. GO terms) • assign top 1, 5, 10 … GO terms to an abstract which are most “similar” as measured by Vector Space Model(VSM) NaCTeM Seminar

IR-based Approach • indexed the GO documents using Lucene search engine • Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming • 4 Indices were built according varying as to whether they used • Standard GO or GO Slim • A GO document consisting of the GO term text (name + definition) or itself plus its ancestor GO terms; • Used standard weighting scheme included in Lucene • Postprocessing: • Re-weighting: give credit to duplicated GO documents (found on more than one path back to root) • Threshold: the number of relevant GOIDs to return NaCTeM Seminar

IR-Based Results • Better performance on IC abstracts than on SGD abstracts • Hierarchical documents do slightly worse than flat documents • Discriminatory effect of specific GO terms may be reducedby occurrence of general terms such as cell and protein NaCTeM Seminar

Approach 3: Machine Learning • Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, … • Naïve Bayes predicts only one GO term per abstract • SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract • Features: words, frequent phrases • Preprocessing steps: tokenization, removal ofstop words, stemming • Training on 66% of annotated data, evaluation on remainder of data • GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems NaCTeM Seminar

Machine Learning Results • One GO term vs. multiple GO terms per abstract makes a difference • Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text not be assigned if GO terms not present in training set • Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms NaCTeM Seminar

Comparison of Approaches • Best F scores for GO Slim • SGD Gold Standard • IC Gold Standard NaCTeM Seminar

Input keywords here Upload a file containing a list of Medline abstracts Type/paste free texts to get results NaCTeM Seminar

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts