Grounding Gene Mentions with Respect to Gene Database Identifiers
Overview of BioCreAtIvE Task 1B
Ben Hachey, BioNLP Reading Group, 18.07.2005
Outline
• BioCreAtIvE Task 1B
• Approaches to Task 1B
• Related Work
• Conclusions
BioCreAtIvE
• Critical Assessment of Information Extraction systems in Biology
• Task 1A: Named Entity Recognition
  • Given a single sentence from an abstract, identify all mentions of genes "(or proteins where there is ambiguity)"
• Task 1B: Entity Normalisation
  • Given an NER'd abstract, associate a list of unique gene identifiers
• Task 2: Automatic GO code annotation
Example Abstract from Fly
Since Dpp and Gbb levels are not detectably higher in the early phases of cross vein development, other factors apparently account for this localized activity. Our evidence suggests that the product of the crossveinless 2 gene is a novel member of the BMP-like signaling pathway required to potentiate Gbb or Dpp signaling in the cross veins. crossveinless 2 is expressed at higher levels in the developing cross veins and is necessary for local BMP-like activity.
Input: the abstract text → Output: a list of unique gene identifiers
Analogous Tasks
Can be seen as:
• Grounding: tying a textual mention of an entity to its identifier in a gene database/ontology; provides a list, without repetition, of the entities referred to (Information Extraction)
• Coreference: identifying which textual mentions refer to the same entity
• Lexical entailment: whether a term is substitutable in a given context
Resources
• A synonym database is provided for each organism:
  • Fly: Drosophila melanogaster
  • Yeast: Saccharomyces cerevisiae
  • Mouse: Mus musculus
• These list a number of different textual realisations for each unique gene identifier
Fly Synonym DB Examples
Data Preparation
• Start with documents whose full text has been manually curated
• Noisy training data
  • Automatically eliminate gene IDs not found in the abstract (a minimal sketch follows below)
  • Quality: Fly 0.83, Mouse 0.71, Yeast 0.92
• Testing gold standard
  • Hand-check for over-zealous elimination
  • Add genes mentioned "in passing" (so the task is the same across organisms)
  • Agreement: Fly 0.93, Mouse 0.87, Yeast 0.96
• 250 abstracts per organism
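A minimal sketch of the noisy-training-data step, assuming the curated full-text gene list and a synonym database per gene ID; the IDs and synonyms below are illustrative stand-ins, not the organisers' actual data or code.

```python
def filter_gene_ids(gene_ids, abstract, synonym_db):
    """Keep only gene IDs with at least one synonym found in the abstract."""
    text = abstract.lower()
    return [gid for gid in gene_ids
            if any(syn.lower() in text for syn in synonym_db.get(gid, []))]

# Illustrative IDs and synonyms (not real curated entries).
synonym_db = {"FBgn0000001": ["dpp", "decapentaplegic"],
              "FBgn0000002": ["gbb", "glass bottom boat"]}
abstract = "Dpp signaling is required in the cross veins."
print(filter_gene_ids(["FBgn0000001", "FBgn0000002"], abstract, synonym_db))
# -> ['FBgn0000001']: the second ID is eliminated, which introduces noise
#    when a gene is mentioned only by an unlisted synonym.
```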
Evaluation
• Precision, recall, and balanced F-score automatically calculated with respect to the gold-standard gene ID lists (see the sketch below)
• 8 teams total, each making 0-3 submissions per organism
• Number of submissions: Fly 11, Mouse 16, Yeast 15
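The metric itself is simple set comparison per abstract. A sketch, assuming each gene ID list is treated as a set without repetition:

```python
def prf(gold, predicted):
    """Precision, recall, and balanced F-score over gene ID sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # correctly returned IDs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(prf(["FBgn0000001", "FBgn0000002"], ["FBgn0000001", "FBgn0000003"]))
# -> (0.5, 0.5, 0.5)
```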
Top Systems Performance (P/R/F)
Outline
• BioCreAtIvE Task 1B
• Approaches to Task 1B
• Related Work
• Conclusions
Approaches
• Use synonyms for simple matching against text
  • Difficult to identify false positives, especially for mouse and fly, where synonym lists include common words (with, at, yellow, …)
• Identify gene text, then ground
  • Leverage NER systems from Task 1A
  • Limited by NER performance: 78.8% precision, 73.5% recall, 76.1 balanced F
  • 37% of false positives and 39% of false negatives due to boundary problems
• Synonym lists are not exhaustive
Information Sources
• Edit the synonym list?
  • Add other specific and frequently used synonyms
  • Remove problematic synonyms
• String similarity
  • Matching against the synonym list
  • Fuzzy matching (spelling variations, abbreviations, …)
• Coreference
  • Synonym in the same text
• Other contextual evidence
  • Gene co-occurrence in the same text
  • Word context around the entity
• Probabilistic/statistical models (see the sketch after this list)
  • Pr(geneID), Pr(geneID | synonym)
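As an example of the probabilistic information source, here is a sketch of estimating Pr(geneID | synonym) from (synonym, gene ID) co-occurrences in the noisy training data; the counts and data structures are illustrative assumptions, not any team's implementation.

```python
from collections import Counter, defaultdict

def estimate_conditional(pairs):
    """pairs: list of (synonym, gene_id) observations from training data."""
    syn_counts = Counter(syn for syn, _ in pairs)   # per-synonym totals
    joint = Counter(pairs)                          # joint (synonym, gene ID)
    prob = defaultdict(dict)
    for (syn, gid), n in joint.items():
        prob[syn][gid] = n / syn_counts[syn]        # relative frequency
    return prob

# An ambiguous synonym seen with two different gene IDs in training.
pairs = [("dpp", "FBgn0000001")] * 9 + [("dpp", "FBgn0000009")]
print(estimate_conditional(pairs)["dpp"])
# -> {'FBgn0000001': 0.9, 'FBgn0000009': 0.1}
```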
Top Systems Overview
Systems
• Team: user24 (mouse, yeast)
  • Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis
  • Ludwig-Maximilians-Universität München
• Approach: no NER; match synonyms to text
• Rule-based generation and curation of synonym lists
  • Remove unspecific and inappropriate synonyms
  • Expand to include additional, frequently used synonyms
  • Automatic rule-based edit system, with human curation to assure quality
  • Tuned using training data
• Select all matches
• Post-filter: remove matches with non-gene context, e.g. 'cells', 'domains', 'cell type', 'DNA binding site' (a sketch follows below)
• Semi-automatic synonym list curation, could be used for gazetteers!
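A minimal sketch of the post-filtering idea: reject a synonym match when its immediate right context is a non-gene cue word. The single-token window and the simplified cue list are assumptions; the team's actual filter is described in their paper.

```python
import re

NON_GENE_CUES = {"cells", "domains", "cell", "dna"}  # simplified cue list

def post_filter(text, matches):
    """matches: list of (start, end, gene_id) character spans in text."""
    kept = []
    for start, end, gid in matches:
        following = re.findall(r"\w+", text[end:].lower())[:1]
        if not (following and following[0] in NON_GENE_CUES):
            kept.append((start, end, gid))
    return kept

text = "The yellow cells were stained, but yellow flies were not."
matches = [(4, 10, "FBgn0000010"), (35, 41, "FBgn0000010")]
print(post_filter(text, matches))
# keeps only the second mention; the first is followed by the cue 'cells'
```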
Systems
• Team: user16 (fly, mouse, yeast)
  • Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck
  • Fraunhofer Institute & Ludwig-Maximilians-Universität München
• Approach: no NER; match synonyms to text
• Synonym list expanded (offline)
  • Automatic rule-based edit system with human curation to assure quality
• Rule-based classification of synonyms (see the matching sketch below)
  • Class I: case-insensitive near-synonyms
  • Class II: case-sensitive near-synonyms
  • Class III: questionable synonyms (high frequency, inexact match)
• Select the n highest-scoring matches (Hanisch et al., 2003)
• Focus on matching multi-word terms
• Synonym list curation, multi-word term matching!
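A sketch of class-sensitive matching in the spirit of this classification: Class I synonyms match case-insensitively, Class II only with exact case. The class assignments below are illustrative, not the actual ProMiner rules.

```python
import re

def find_matches(text, synonyms):
    """synonyms: list of (surface_form, gene_id, cls) with cls in {1, 2}."""
    hits = []
    for form, gid, cls in synonyms:
        flags = re.IGNORECASE if cls == 1 else 0    # Class I ignores case
        if re.search(r"\b" + re.escape(form) + r"\b", text, flags):
            hits.append((form, gid))
    return hits

synonyms = [("crossveinless 2", "FBgn0000020", 1),  # Class I: safe either way
            ("Dpp", "FBgn0000001", 2)]              # Class II: case matters
print(find_matches("dpp and crossveinless 2 interact", synonyms))
# -> [('crossveinless 2', 'FBgn0000020')]; lowercase 'dpp' fails Class II
```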
Systems
• Team: user8 (fly, mouse, yeast)
  • Jeremiah Crim, Ryan McDonald, and Fernando Pereira
  • University of Pennsylvania
• Approach: no NER; match synonyms to text
• Pattern matching
  • Synonym list pruned by a threshold on the conditional probability that a gene ID (g) labels a document given that a synonym (s) matches
  • List of candidate gene IDs compiled by selecting the 1000 training documents with the highest token-level cosine similarity
• Match classification (sketched below)
  • Binary maximum entropy classifier trained to predict whether gene IDs selected by pattern matching should be kept
  • Effect on F-score: Fly +7.7, Mouse +1.5, Yeast -0.4
• Probabilistic models (pruning, disambiguation), no human curation!
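A toy sketch of the match-classification step, using scikit-learn's logistic regression as the maximum entropy model. The two features and the training labels are invented for illustration; the team's real feature set is richer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per candidate: [Pr(geneID | synonym), match count].
X_train = np.array([[0.90, 3], [0.80, 2], [0.10, 1], [0.20, 1]])
y_train = np.array([1, 1, 0, 0])        # keep / discard decisions

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[0.85, 2], [0.15, 1]]))
# -> [1 0] on this toy data: the confident match is kept, the other dropped
```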
Systems
• Team: user5 (fly, mouse, yeast)
  • Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover
  • University of Edinburgh & Stanford
• Approach: NER, then match entities to synonyms
• Build organism-specific named entity recognition
  • Noisy training data obtained from Task 1B materials
• Match gene entities to synonym lists (fuzzy matching; sketched below)
  • Incorporates various edit operations (e.g. case folding, optional dashes and other punctuation, British/American spellings)
  • Tuned per organism to select and order edit operations
• Disambiguate each entity to a single gene ID
  • Various heuristic and statistical approaches (e.g. gene ID co-occurrence, IR query term weighting, repetition in the synonym list)
  • Again, optimised per organism
• Bootstrapping NE data, IR term weighting, no human curation!
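A sketch of the edit-operation style of fuzzy matching: normalise both the extracted entity and each synonym with an ordered sequence of rewrites before exact comparison. The operations mirror the slide; the specific rules and their ordering (tuned per organism in the actual system) are illustrative.

```python
import re

EDIT_OPS = [
    lambda s: s.lower(),                       # case folding
    lambda s: re.sub(r"[-_/]", " ", s),        # optional dashes, punctuation
    lambda s: re.sub(r"\s+", " ", s).strip(),  # whitespace normalisation
    lambda s: s.replace("aemia", "emia"),      # Brit/Am spelling (illustrative)
]

def normalise(term, ops=EDIT_OPS):
    for op in ops:
        term = op(term)
    return term

print(normalise("Crossveinless-2") == normalise("crossveinless 2"))  # True
```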
Systems
• Team: user6 (fly, mouse, yeast)
  • Javier Tamames
  • BioAlma SL
• Approach: NER, then match entities to synonyms
• NER for various biological entities (e.g. genes, proteins, compounds)
  • Also biomedical semantic tagging of words, e.g. core terms (receptor, kinase, …) and types (alpha, a1, …); a sketch follows below
• Match gene entities to synonym lists (fuzzy)
  • Use BioCreAtIvE lists and other relevant databases
  • Match and weighting based on semantic labels
• Disambiguate each entity to a single gene ID
  • Use of key words extracted from databases (e.g. HUGO, MGI, SGD)
• Semantic tagging module, key-word context from organism DBs!
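A minimal sketch of the semantic tagging idea: label tokens with coarse biomedical roles before matching and weighting. The tiny lexicons are illustrative stand-ins for the system's real resources.

```python
CORE_TERMS = {"receptor", "kinase", "factor"}   # gene/protein head words
TYPE_MARKERS = {"alpha", "beta", "a1", "ii"}    # subtype markers

def tag_tokens(tokens):
    """Assign each token a coarse semantic label: CORE, TYPE, or OTHER."""
    tags = []
    for tok in tokens:
        low = tok.lower()
        label = ("CORE" if low in CORE_TERMS
                 else "TYPE" if low in TYPE_MARKERS
                 else "OTHER")
        tags.append((tok, label))
    return tags

print(tag_tokens("interleukin 2 receptor alpha".split()))
# -> [('interleukin', 'OTHER'), ('2', 'OTHER'),
#     ('receptor', 'CORE'), ('alpha', 'TYPE')]
```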
Outline
• BioCreAtIvE Task 1B
• Approaches to Task 1B
• Related Work
• Conclusions
Related Work
• Ben Wellner (2005). Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. In: Proceedings of BioLink-2005.
• …
Outline
• BioCreAtIvE Task 1B
• Approaches to Task 1B
• Related Work
• Conclusions
Conclusions
• Model that can be automatically tuned, e.g. to domain or organism
• Proper modelling of:
  • Abbreviations
  • Spelling variants
  • Coreference in abstracts
  • Textual context, key words
  • Entity co-occurrence
  • Entity and term distributions
  • Token semantic roles
Thank you
References
• Lynette Hirschman, Marc Colosimo, Alexander Morgan, Jeffrey Colombe, and Alexander Yeh (2004). Task 1B: Gene list task. In: Proceedings of the BioCreAtIvE Workshop.
• Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck (2004). ProMiner: Organism-specific protein name detection using approximate string matching. In: Proceedings of the BioCreAtIvE Workshop. [user16]
• Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis (2004). Exact versus approximate string matching for protein name identification. In: Proceedings of the BioCreAtIvE Workshop. [user24]
• Jeremiah Crim, Ryan McDonald, and Fernando Pereira (2004). Automatically annotating documents with normalized gene lists. In: Proceedings of the BioCreAtIvE Workshop. [user8]
• Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover (2004). Grounding gene mentions with respect to gene database identifiers. In: Proceedings of the BioCreAtIvE Workshop. [user5]
• Javier Tamames (2004). Text detective: BioAlma's gene annotation tool. In: Proceedings of the BioCreAtIvE Workshop. [user6]
• Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, and Ralf Zimmer (2003). Playing biology's name game: Identifying protein names in scientific text. In: Proceedings of the Pacific Symposium on Biocomputing.
The SEER Project Team
• Edinburgh: Bea Alex, Shipra Dingare, Claire Grover, Ben Hachey, Ewan Klein, Yuval Krymolowski, Malvina Nissim
• Stanford: Jenny Finkel, Chris Manning, Huy Nguyen
Top Systems Rank