Task 2: Functional annotation of gene products

BioCreAtIvE: Critical Assessment for Information Extraction in Biology • Granada, Spain, March 28 - March 31, 2004

Task description • The assignment of GO annotations to human proteins • This is currently done by curators at Swiss-Prot • The full text of journal articles was used (636 training documents from the Journal of Biological Chemistry) • Three subtasks

Subtasks • “Recover” text that provides evidence for the GO annotation: Given a (doc, protein, GO term) triplet, find the segment of text supporting this annotation • Provide GO annotation for human proteins: Given a (doc, protein) pair, return all GO terms that could be associated with this pair • Selection of relevant papers: detect which papers are relevant for a protein in the sense that they contain information that would be suitable to derive a GO annotation and provide the evidence text

Evaluation • The predictions were made in the form of triplets (protein, paper, GO) plus a piece of evidence text • More than 30,000 of these individual results were submitted and had to be reviewed by the GO curators • The evaluation scheme for both GO terms and proteins was • “high”: the GO term or the protein was correct • “generally”: for proteins, this means that the specific protein is not there but a homologue from another organism or a reference to the protein family is • “low”: the prediction was wrong

Results – Task 2.1


Results – Task 2.2


Summary of approaches adopted by some participants • User17: Soumya Ray and Mark Craven (University of Wisconsin) • User20: Francisco M. Couto et al. (from Portugal and France) • User4: Frédéric Ehrler and Patrick Ruch (University (Hospital) of Geneva)

Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text (User17)

Informative Term Model • Identify terms that are characteristic of a given GO term • Collect training data from other organism databases – SGD, MGI, RGD, TAIR • Perform a chi-squared test to identify the informative terms • Null hypothesis: the distributions of a term in the two classes (support and background) are identical

Cont’d • Support set: a set of articles and abstracts associated with the GO term • Background set: the remaining set of articles and abstracts
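The chi-squared selection of informative terms can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each document is reduced to a set of tokens, uses a 2x2 presence/absence contingency table per term, and keeps only terms that are over-represented in the support set. The function names, the toy data, and the significance level `alpha` are all hypothetical.

```python
import math

def chi_squared_2x2(support_with, support_total, background_with, background_total):
    """Pearson chi-squared statistic for term presence/absence in two classes."""
    a = support_with                          # support docs containing the term
    b = support_total - support_with          # support docs without it
    c = background_with                       # background docs containing the term
    d = background_total - background_with    # background docs without it
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: the term cannot separate the classes
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def informative_terms(support_docs, background_docs, alpha=0.01):
    """Return (term, chi2, p) for terms whose distribution differs between classes.

    Assumption: only terms more frequent in the support set are kept.
    """
    results = []
    for term in set().union(*support_docs):
        a = sum(term in doc for doc in support_docs)
        c = sum(term in doc for doc in background_docs)
        chi2 = chi_squared_2x2(a, len(support_docs), c, len(background_docs))
        p = p_value_df1(chi2)
        over_represented = a / len(support_docs) > c / max(len(background_docs), 1)
        if p < alpha and over_represented:
            results.append((term, chi2, p))
    return sorted(results, key=lambda r: -r[1])

# Toy data (hypothetical): 10 support documents, 30 background documents.
support = [{"kinase", "protein"} for _ in range(10)]
background = [{"protein", "membrane"} for _ in range(30)]
selected = informative_terms(support, background)
```

In this toy example "kinase" rejects the null hypothesis while "protein", which occurs in every document of both sets, does not.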

FiGO: Finding GO Terms in Unstructured Text (User20) • Calculate the information content of each word w occurring in the GO terms • $IC(w) = -\log(\#w / \#max)$, where #w is the number of GO terms whose name contains w, and #max is the maximum number of GO terms whose name contains a common word • The information content of a term’s name n is therefore: $IC(n) = \sum_{w \in n} IC(w)$ • A GO term t may have multiple names (synonyms): $IC(t) = \max_{n \in names(t)} IC(n)$

Annotation with a piece of text • Given a piece of text p, the local information content of each term t is defined as follows: $LIC(t, p) = \max_{n \in names(t)} \sum_{w \in n \cap p} IC(w)$ • FiGO identifies a term t in a piece of text p when its local information content is sufficiently close to its information content: $LIC(t, p) / IC(t) \geq \alpha$, where $\alpha \in [0,1]$ specifies how close LIC must be to IC to decide that t is referred to in p. Thus the parameter $\alpha$ controls the trade-off between recall and precision of FiGO.
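The matching rule can be sketched in a few lines. This is an illustrative reading of the slides, not the FiGO code itself: it assumes IC(w) = -log(#w / #max), that a name's information content is the sum over its words, that a term takes the maximum over its names, and that a term matches when LIC / IC reaches a threshold `alpha`. The tokenizer, function names, and the three-term excerpt of GO are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercased set of alphanumeric tokens (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def word_information_content(term_names):
    """IC(w) = -log(#w / #max), with #w = number of GO terms whose names contain w."""
    counts = Counter()
    for names in term_names.values():
        counts.update(set().union(*(words(n) for n in names)))
    max_count = max(counts.values())
    return {w: -math.log(c / max_count) for w, c in counts.items()}

def find_terms(text, term_names, alpha=0.8):
    """Return the GO ids whose LIC / IC ratio in `text` is at least alpha."""
    ic = word_information_content(term_names)
    text_words = words(text)
    matched = []
    for go_id, names in term_names.items():
        # IC(t): best name when every word of the name is counted.
        term_ic = max(sum(ic[w] for w in words(n)) for n in names)
        # LIC(t, p): best name when only words present in the text are counted.
        local_ic = max(sum(ic[w] for w in words(n) & text_words) for n in names)
        if term_ic > 0 and local_ic / term_ic >= alpha:
            matched.append(go_id)
    return sorted(matched)

# Toy excerpt of GO (names only, no synonyms), used purely for illustration.
toy_go = {
    "GO:0004672": ["protein kinase activity"],
    "GO:0005515": ["protein binding"],
    "GO:0005215": ["transporter activity"],
}
hits = find_terms("The kinase phosphorylates its substrate", toy_go, alpha=0.8)
```

Because "protein" and "activity" are the most frequent words in this tiny vocabulary, they carry zero information content, so "kinase" alone is enough to match the first term. Lowering `alpha` admits partial matches and raises recall at the cost of precision.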

Preliminary Report on the BioCreative Experiment: Task Presentation, System Description and Preliminary Results (User4) • An IR approach • Index the collection of GO terms as if they were documents • Treat each document (MEDLINE abstract) as a query to be categorized into GO categories • Combine two retrieval engines: a vector space model (TF-IDF) and a pattern matcher • Two types of indexing unit: stems (Porter-like) and linguistically motivated phrases (noun phrases) • The UMLS is also used for string normalization
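The idea of indexing GO terms as documents and using an abstract as the query can be sketched with a small vector space model. This covers only the TF-IDF engine under simplifying assumptions: plain lowercased tokens stand in for stems and noun phrases, the pattern matcher and UMLS normalization are omitted, and the smoothed IDF formula, function names, and toy GO excerpt are choices made for the example rather than details from the system.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercased alphanumeric tokens (stand-in for Porter-like stems)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(go_terms):
    """Index each GO term name as a small TF-IDF weighted document."""
    n_docs = len(go_terms)
    doc_freq = Counter()
    for name in go_terms.values():
        doc_freq.update(set(tokenize(name)))
    # Smoothed IDF (an assumption; any standard variant would do).
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}
    vectors = {
        go_id: {w: f * idf[w] for w, f in Counter(tokenize(name)).items()}
        for go_id, name in go_terms.items()
    }
    return idf, vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_go_terms(abstract, idf, vectors, top_k=3):
    """Use the abstract as a query and return the best-scoring GO ids."""
    tf = Counter(w for w in tokenize(abstract) if w in idf)
    query = {w: f * idf[w] for w, f in tf.items()}
    scored = [(cosine(query, vec), go_id) for go_id, vec in vectors.items()]
    scored = [(s, g) for s, g in scored if s > 0]
    scored.sort(key=lambda sg: (-sg[0], sg[1]))
    return [(g, s) for s, g in scored[:top_k]]

# Toy excerpt of GO, used purely for illustration.
toy_go = {
    "GO:0004672": "protein kinase activity",
    "GO:0005515": "protein binding",
    "GO:0005215": "transporter activity",
}
idf, vectors = build_index(toy_go)
ranking = rank_go_terms("We show that the transporter activity depends on ATP.", idf, vectors)
```

Returning a ranked list rather than a thresholded set reflects the retrieval framing: every GO term sharing vocabulary with the abstract receives a score, which favours recall.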

Summary • IR-like approaches generate higher recall • Almost all approaches depend on the collection of GO terms • GO term expansion (synonyms, related terms/phrases) seems important
