Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE). Marc Colosimo Lynette Hirschman Alexander Morgan Alexander Yeh http://www.mitre.org/public/biocreative. Outline. Past evaluation KDD Cup 2002 Current evaluation BioCreAtIvE Summary.
Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE) Marc Colosimo Lynette Hirschman Alexander Morgan Alexander Yeh http://www.mitre.org/public/biocreative
Outline • Past evaluation • KDD Cup 2002 • Current evaluation • BioCreAtIvE • Summary
Past Evaluation: KDD 2002 Challenge Cup Evaluation • We were invited to run a task for KDD Cup 2002* • We ran one of two tasks for 2002 • Alexander Yeh was the chair for Task 1 (fly genes) • Mark Craven (U. Wisc.) was the chair for Task 2 (yeast genes) • Data-mining conf: NOT biology nor text processing *http://www.biostat.wisc.edu/~craven/kddcup/tasks.html
Task 1: For a Set of Papers on Genetics or Molecular Biology • We provided for each paper • The full text of the paper • A list of the genes mentioned in that paper • The task was to • Rank the curatable papers before the non-curatable papers • Does each paper contain any curatable gene product information (Yes/No)? • For each curatable gene mentioned in the paper, does that paper have experimental results for • Transcript(s) of that gene (Yes/No)? • Protein(s) of that gene (Yes/No)?
Results • The winner and honorable mentions were all combined teams from 2 or 3 organizations • Winner: a team from ClearForest and Celera • Used manually generated rules and patterns to perform information extraction • Also had the best score in each of the 3 sub-tasks • 18 teams submitted test results
Sub-task                 Best   Median
Ranked-list              84%    69%
Yes/No curate paper      78%    58%
Yes/No gene products     67%    35%
Outline • Past evaluation • KDD Cup 2002 • Current evaluation • BioCreAtIvE • Summary
Current Evaluation: BioCreAtIvE • Organized by MITRE, CNB (Madrid) and others • Under the umbrella of the ISCB BioLINK Special Interest Group for Text Data Mining* • Two tasks • Entity extraction (MITRE) • Gene name mentions (NCBI) • Gene list (MITRE) • Functional curation (CNB-Madrid) • Automatically map text to GO (Gene Ontology) terms for proteins described in text *http://www.pdg.cnb.uam.es/BioLINK
Schedule • July 2003: initial training data & guidelines • Nov-Dec. 2003: test data released, results due Participants may choose which tasks and which sub-tasks they want to participate in. You may enter any combination of tasks, not just one or all of them.
Why Evaluate Entity Extraction for Molecular Biology? • Entity extraction is a basic text mining operation • It indicates the items discussed in a document • Variations in nomenclature constitute a major stumbling block to accessing the biomedical literature • Many groups are working on entity extraction • But there is no way to compare the systems • Different data sets • Different tasks • Challenge evaluations have been successful at making such comparisons • This work should also lead to resources and standards for handling nomenclature
Progress in Speech Recognition • Results show a decrease in error rate over time, measured by results from the best system each year • Note that the research community selected new, harder problems over time • Can we expect the same progress for accessing biological literature? Source: Pallett, D., Garofolo, J. and Fiscus, J. (NIST). Measurements in Support of Research Accomplishments. Communications of the ACM: Special Section on Broadcast News Understanding. Feb 2000.
Some Challenges of Extracting Entities in Molecular Biology Texts • Entity mentions are often common nouns (as opposed to proper nouns) • In fact, many entities are named with ordinary words • E.g., some Drosophila gene names: by, for, if, blue, saw, period, white, midget • Also, new entities are constantly being discovered and/or renamed
“Complete” Entity Extraction is More Than Finding Mentions in the Text • For each mention, it is important to determine which entity is being discussed • This is non-trivial in molecular biology • An entity can have synonyms • The same word(s) can refer to different entities • E.g., Sek1 refers to two different proteins in mice (Map2k4 and Epha4) • Mentions can share text: e.g., “MEK1/2” is about both MEK1 and MEK2
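The two ambiguities on this slide can be sketched in Python. This is a minimal illustration, not part of the BioCreAtIvE materials: the `SYNONYMS` table holds only the Sek1 example given above, and `candidate_entities` and `expand_shared_mention` are hypothetical helpers; the latter handles only the "stem plus slash-separated numbers" pattern seen in "MEK1/2".

```python
import re

# Toy synonym table built from the slide's single example (an assumption;
# a real lexicon would come from a model organism database).
SYNONYMS = {
    "sek1": {"Map2k4", "Epha4"},
}

def candidate_entities(mention):
    """Return every entity a mention may refer to (empty set if unknown)."""
    return SYNONYMS.get(mention.lower(), set())

def expand_shared_mention(mention):
    """Split a mention such as 'MEK1/2' into ['MEK1', 'MEK2'].

    Hypothetical helper: it only handles a stem followed by numbers
    separated by slashes; any other mention is returned unchanged.
    """
    match = re.fullmatch(r"([A-Za-z\-]+)(\d+(?:/\d+)+)", mention)
    if not match:
        return [mention]
    stem, numbers = match.groups()
    return [stem + n for n in numbers.split("/")]

print(sorted(candidate_entities("Sek1")))  # two candidates: still ambiguous
print(expand_shared_mention("MEK1/2"))
```

Returning a set of candidates rather than a single entity keeps the ambiguity visible; choosing among the candidates would need context from the surrounding text, which this sketch does not attempt.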
Entity Extraction Task 1A: Gene Name Mention • Data provided by Lorrie Tanabe & John Wilbur, NCBI • 15,000 sentences manually annotated for genes • 7,500 sentences for training • 2,500 sentences for development test • 5,000 sentences for testing • Example (transformed for display purposes) • Data are marked for occurrences of gene-related mentions (underlined), including binding sites, motifs, domains, proteins, promoters, etc. Structure and expression of a gene from Arabidopsis thaliana encoding a protein related to SNF1 protein kinase.
Entity Extraction Task 1B: Gene List Annotation • Given a set of abstracts We have screened the Drosophila X chromosome for genes whose dosage affects the function of the homeotic gene Deformed. One of these genes, extradenticle, encodes a homeodomain transcription factor that heterodimerizes with Deformed and other homeotic Hox proteins. Mutations in the nejire gene, which encodes a transcriptional adaptor protein belonging to the CBP/p300 family, also interact with Deformed. The other previously characterized gene identified as a Deformed interactor is Notch, which encodes a transmembrane receptor. These three genes underscore the importance of transcriptional regulation and cell-cell signaling in Hox function. Four novel genes were also identified in the screen. One of these, rancor, is required for appropriate embryonic expression of Deformed and another homeotic gene, labial. Both Notch and nejire affect the function of another Hox gene, Ultrabithorax, indicating they may be required for homeotic activity in general.
Entity Extraction Task 1B: What a Contestant’s System Should Return • Return a list of the standardized names of the genes mentioned in each abstract: 0004656, 0002522, 0015624, 0000439, 0012384, 0004647, 0000611 • Also return 1 text mention for each gene in the list We have screened the Drosophila X chromosome for genes whose dosage affects the function of the homeotic gene Deformed. One of these genes, extradenticle, encodes a homeodomain transcription factor that heterodimerizes with Deformed and other homeotic Hox proteins. Mutations in the nejire gene, which encodes a transcriptional adaptor protein belonging to the CBP/p300 family, also interact with Deformed. The other previously characterized gene identified as a Deformed interactor is Notch, which encodes a transmembrane receptor. These three genes underscore the importance of transcriptional regulation and cell-cell signaling in Hox function. Four novel genes were also identified in the screen. One of these, rancor, is required for appropriate embryonic expression of Deformed and another homeotic gene, labial. Both Notch and nejire affect the function of another Hox gene, Ultrabithorax, indicating they may be required for homeotic activity in general.
Task 1B: Data Availability • Abstracts from PubMed/Medline • Training • Development test • Test • Gene lists for papers from model organism databases (Drosophila, mouse, yeast) • A list of genes (standardized names) for each paper is available • Note that gene list is for full paper, but the text we can get is just the abstract • Synonym lists provided by each database to map alternate gene names, as mentioned in text, to their unique database identifier
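A synonym list of the kind described above can drive a simple dictionary lookup. The sketch below is one naive baseline, not the organizers' method: `SYNONYM_TO_ID` is a hypothetical three-gene excerpt, its identifiers (`ID_DFD`, `ID_EXD`, `ID_NOTCH`) are placeholders rather than real database IDs, and `gene_list` is an illustrative name.

```python
import re

# Hypothetical synonym list: lower-cased name -> database identifier.
# The identifiers are placeholders, not real database IDs.
SYNONYM_TO_ID = {
    "deformed": "ID_DFD",
    "dfd": "ID_DFD",
    "extradenticle": "ID_EXD",
    "exd": "ID_EXD",
    "notch": "ID_NOTCH",
}

def gene_list(abstract, synonym_to_id=SYNONYM_TO_ID):
    """Map each identifier found in the abstract to one text mention."""
    found = {}
    for token in re.finditer(r"[A-Za-z0-9\-]+", abstract):
        gene_id = synonym_to_id.get(token.group().lower())
        if gene_id is not None and gene_id not in found:
            found[gene_id] = token.group()  # keep only the first mention
    return found

abstract = ("One of these genes, extradenticle, encodes a homeodomain "
            "transcription factor that heterodimerizes with Deformed and "
            "other homeotic Hox proteins.")
print(gene_list(abstract))
```

A lookup this simple matches single tokens only, so it misses multi-word names, and it would produce false positives for gene names that are ordinary words (by, for, white); it is a starting point for the task, not a solution.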
Task 1B: Data Set Size (in Abstracts*)
                   Fly     Mouse   Yeast
Training           5000    5000    5000
Development Test   150     250     (150)
Test               (250)   (250)   (250)
*Each abstract is around 250 words
Task 2: Functional Annotation • Data provided by Swiss-Prot (Rolf Apweiler) and being run by Christian Blaschke (CNB-Madrid) • Task: • Automatically generate evidence for Gene Ontology annotations for a set of proteins from the text of an article • Gold standard: • SWISS PROT Human Genome Annotations • SWISS PROT curators will also check the correctness and utility of the pointers to the evidence
Functional Annotation: Sub-tasks 1. Return text evidence for GO annotations found in a paper • Given a full text paper, protein(s) and associated GO term(s) 2. Generate GO term(s) and evidence for a protein • Given a paper and protein(s) in the paper • Note that more than one GO term might be associated with a protein 3. Exploratory. Given a set of proteins, find relevant GO annotations and evidence from full text articles
Task 2: Find Text Evidence Supporting SWISS PROT GO Annotation SWISS PROT entry for: Small inducible cytokine A8 precursor; Synonyms: CCL8; Monocyte chemotactic protein 2 ; MCP-2 Monocyte chemoattractant protein 2; HC14 GO Annotation: 0006816 Calcium ion transport
Task 2: Find Text Evidence (cont.) Protein: Small inducible cytokine A8 precursor Synonym: MCP-2 GO Annotation: 0006816 Calcium ion transport Evidence: Full text article… …cpt-cAMP (1 mM) pretreatment of the cells completely inhibited RANTES-, MIP-1-a, and MCP-2-induced Ca2+ mobilization …
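The evidence-finding step in this example can be approximated by a keyword filter over sentences. The sketch below is a naive baseline under stated assumptions, not the method used by any participant: `find_evidence` is a hypothetical function, the cue words ("calcium", "Ca2+") are hand-picked for this one GO term, and the second sentence of `text` is invented filler added so the filter has something to reject.

```python
import re

def find_evidence(text, protein_synonyms, cue_words):
    """Return sentences that name the protein and contain a cue word."""
    # Naive sentence split on end punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        has_protein = any(s.lower() in lowered for s in protein_synonyms)
        has_cue = any(c.lower() in lowered for c in cue_words)
        if has_protein and has_cue:
            hits.append(sentence)
    return hits

# Synonyms taken from the SWISS PROT entry shown on the slide.
synonyms = ["CCL8", "MCP-2", "Monocyte chemotactic protein 2", "HC14"]
# Hand-picked cue words for "Calcium ion transport" (an assumption).
cues = ["calcium", "Ca2+"]

text = ("cpt-cAMP (1 mM) pretreatment of the cells completely inhibited "
        "RANTES-, MIP-1-a, and MCP-2-induced Ca2+ mobilization. "
        "Cells were cultured overnight.")  # second sentence is filler

for sentence in find_evidence(text, synonyms, cues):
    print(sentence)
```

Note that the evidence sentence says "Ca2+ mobilization" rather than "calcium ion transport", so matching the GO term string literally would find nothing; bridging that vocabulary gap is the hard part of the task.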
Outline • Past evaluation • KDD Cup 2002 • Current evaluation • BioCreAtIvE • Summary
Summary • We are trying to help the curators by providing common challenge evaluations based on relevant problems faced by curators • Common evaluations provide a means to directly compare different methods and help to advance research in the area • There is still time to compete in the current challenge http://www.mitre.org/public/biocreative
Linking Literature, Databases, Ontologies, Data [Figure; labels in the diagram: Experimental Data, Data Interpretation, Databases (SwissProt, Genbank), DB update, DB curation, Literature Collections (MEDLINE), Improved search and indexing, Ontologies, Data integration via metaschemas, Pathway Discovery]