BioCreAtIvE: Critical Assessment for Information Extraction in Biology • Granada, Spain, March 28-31, 2004 • Task 2: Functional annotation of gene products
Task description • The assignment of GO annotations to human proteins • This is currently done by curators at Swiss-Prot • The full text of journal articles was used (636 training documents from the Journal of Biological Chemistry) • Three subtasks
Subtasks • “Recover” text that provides evidence for the GO annotation: Given a (doc, protein, GO term) triplet, find the segment of text supporting this annotation • Provide GO annotation for human proteins: Given a (doc, protein) pair, return all GO terms that could be associated with this pair • Selection of relevant papers: detect which papers are relevant for a protein in the sense that they contain information that would be suitable to derive a GO annotation and provide the evidence text
Evaluation • Predictions were made in the form of triplets (protein, paper, GO term) plus a piece of evidence text • More than 30,000 of these individual results were submitted and had to be reviewed by the GO curators • The rating scheme for both GO terms and proteins was: • “high”: the GO term or the protein was correct • “generally”: for proteins, the specific protein is not there, but a homologue from another organism or a reference to the protein family is • “low”: the prediction was wrong
Summary of approaches adopted by some participants • User17: Soumya Ray and Mark Craven (University of Wisconsin) • User20: Francisco M. Couto et al. (from Portugal and France) • User4: Frédéric Ehrler and Patrick Ruch (University (Hospital) of Geneva)
Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text (User17)
Informative Term Model • Identify terms that are characteristic of a given GO term • Collect training data from other organism databases – SGD, MGI, RGD, TAIR • Perform a chi-squared test to identify the informative terms • Null hypothesis: the distributions of a term in the two classes (support and background) are identical
Informative Term Model (cont’d) • Support set: the set of articles and abstracts associated with the GO term • Background set: the remaining articles and abstracts
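The chi-squared step of the Informative Term Model can be illustrated with a small sketch. It assumes each article is reduced to a set of tokens, tests each term with a 2x2 presence/absence table (support vs. background), and uses the standard 5% critical value for one degree of freedom (3.841). The function names, the presence/absence encoding, and the extra filter that keeps only terms over-represented in the support set are assumptions for illustration, not details given on the slides.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    a: support docs containing the term     b: support docs without it
    c: background docs containing the term  d: background docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty: the term cannot discriminate the classes.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def informative_terms(support_docs, background_docs, critical_value=3.841):
    """Return {term: statistic} for terms that reject the null hypothesis.

    Each document is a set of tokens. Only terms that are more frequent in
    the support set than in the background set are kept (an assumption).
    """
    vocab = set().union(*support_docs, *background_docs)
    scores = {}
    for term in vocab:
        a = sum(term in doc for doc in support_docs)
        c = sum(term in doc for doc in background_docs)
        b = len(support_docs) - a
        d = len(background_docs) - c
        stat = chi_squared_2x2(a, b, c, d)
        if stat > critical_value and a * d > b * c:
            scores[term] = stat
    return scores


# Toy data: "kinase" appears only in the support set.
support = [{"kinase", "protein"} for _ in range(5)]
background = [{"membrane", "protein"} for _ in range(5)]
print(informative_terms(support, background))
```

In this toy data "protein" occurs everywhere and so carries no signal, while "membrane" is significant but characteristic of the background set, so only "kinase" is returned.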
FiGO: Finding GO Terms in Unstructured Text (User20) • Calculate the information content IC(w) of each word w occurring in the GO terms • IC(w) is computed from #w, the number of GO terms whose name contains w, and #max, the maximum number of GO terms whose name contains a common word • The information content of a term’s name n then follows from the information content of its words • A GO term may have multiple names (synonyms)
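One way these quantities could be written is sketched below. The negative-log form for words, the sum over the words of a name, and the maximum over a term's names are all assumptions based on the usual definition of information content; the text above only fixes the meaning of #w and #max.

```latex
\[
\mathrm{IC}(w) = -\log\frac{\#w}{\#\max}
\qquad
\mathrm{IC}(n) = \sum_{w \in n} \mathrm{IC}(w)
\qquad
\mathrm{IC}(t) = \max_{n \in \mathrm{names}(t)} \mathrm{IC}(n)
\]
```

Under this reading, a word that occurs in as many GO term names as the most common word gets IC(w) = 0, and rarer words contribute more to the information content of a name.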
Annotation with a piece of text • Given a piece of text p, the local information content (LIC) of each term t is defined with respect to p • FiGO identifies a term in a piece of text when its local information content is sufficiently close to its information content (IC); a threshold parameter in [0,1] sets how close LIC must be to IC to decide that t is referred to in p • This parameter thus controls the trade-off between recall and precision in FiGO
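The matching rule can be sketched in a few lines of Python. This is not the FiGO implementation: it assumes whitespace tokenization, a negative-log word information content, a local information content that sums only the words of a name that also occur in the text, a per-name comparison for synonyms, and a threshold called `alpha` here. The term identifiers `T1` to `T3` are hypothetical.

```python
import math


def word_ic(term_names):
    """Information content per word, from {term_id: [name, ...]}.

    Assumed form: log(#max / #w), where #w counts the terms whose names
    contain w and #max is the largest such count.
    """
    counts = {}
    for names in term_names.values():
        words = {w for name in names for w in name.lower().split()}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    max_count = max(counts.values())
    return {w: math.log(max_count / c) for w, c in counts.items()}


def name_ic(name, ic):
    """Information content of a name: sum over its distinct words."""
    return sum(ic.get(w, 0.0) for w in set(name.lower().split()))


def local_ic(name, text_words, ic):
    """Local information content: only words of the name found in the text."""
    return sum(ic.get(w, 0.0) for w in set(name.lower().split()) if w in text_words)


def find_terms(text, term_names, ic, alpha=0.8):
    """Return the ids of terms with some name whose LIC >= alpha * IC."""
    text_words = set(text.lower().split())
    hits = []
    for term_id, names in term_names.items():
        for name in names:
            total = name_ic(name, ic)
            if total > 0 and local_ic(name, text_words, ic) >= alpha * total:
                hits.append(term_id)
                break
    return hits


# Hypothetical identifiers and a toy vocabulary.
names = {
    "T1": ["protein kinase activity"],
    "T2": ["transporter activity"],
    "T3": ["dna binding"],
}
ic = word_ic(names)
print(find_terms("the enzyme shows kinase activity on this protein substrate", names, ic))
print(find_terms("kinase activity", names, ic, alpha=0.4))
```

Lowering `alpha` accepts partial matches such as "kinase activity" for `T1`, which raises recall at the cost of precision; raising it has the opposite effect.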
Preliminary Report on the BioCreative Experiment: Task Presentation, System Description and Preliminary Results (User4) • An IR approach • Index the collection of GO terms as if they were documents • Treat each document (MEDLINE abstract) as a query to be categorized into GO categories • Combine two retrieval engines: a vector space model (TF-IDF) and a pattern matcher • Two types of indexing unit: stems (Porter-like) and linguistically motivated phrases (noun phrases) • The UMLS is also used for string normalization
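The core idea of indexing GO terms as documents and using an abstract as the query can be sketched with a plain TF-IDF vector space model. This covers only the vector space engine: the pattern matcher, stemming, noun-phrase indexing, and UMLS normalization are left out. The weighting formula, the function names, and the identifiers `T1` to `T3` are assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only; a real system would stem and normalize.
    return text.lower().split()


def build_index(go_terms):
    """Index {term_id: name} as TF-IDF vectors. Returns (idf, vectors)."""
    n = len(go_terms)
    tokens = {tid: tokenize(name) for tid, name in go_terms.items()}
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    idf = {w: math.log(n / c) + 1.0 for w, c in df.items()}
    vectors = {
        tid: {w: tf * idf[w] for w, tf in Counter(toks).items()}
        for tid, toks in tokens.items()
    }
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_go_terms(abstract, idf, vectors, top_k=3):
    """Use the abstract as a query; return the best-matching term ids."""
    tf = Counter(w for w in tokenize(abstract) if w in idf)
    query = {w: c * idf[w] for w, c in tf.items()}
    scored = [(cosine(query, vec), tid) for tid, vec in vectors.items()]
    scored = [(s, tid) for s, tid in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tid for _, tid in scored[:top_k]]


# Hypothetical identifiers standing in for GO terms.
go_terms = {
    "T1": "protein kinase activity",
    "T2": "dna binding",
    "T3": "transporter activity",
}
idf, vectors = build_index(go_terms)
print(rank_go_terms("we show that the protein has kinase activity in vitro", idf, vectors))
```

Because every GO term sharing at least one word with the abstract receives a nonzero score, a ranking of this kind tends to return many candidates per abstract.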
Summary • IR-like approaches achieve higher recall • Almost all approaches depend on the collection of GO terms • GO term expansion (synonyms, related terms/phrases) seems important