A Cascaded Approach to Normalising Gene Mentions in Biomedical Literature

Hui Yang, Goran Nenadic, John Keane • School of Computer Science, Manchester Interdisciplinary BioCentre • G.Nenadic@manchester.ac.uk

Identification of gene names • Gene/protein names are essential for integrating and exploring bio-literature • e.g. building/browsing regulatory networks • Two step process • recognise gene mentions in text • map these to a referent database • Gene name variability and ambiguity “biologists would rather share a toothbrush than a gene name”

Outline • Overview – why a cascaded approach • Dictionary for matching • Exact-like matching • Approximate matching • Experiments • Summary and conclusions

Why a cascaded approach? • Overall aim: improve recall, but with a controlled loss of precision • Give your best shot first • intuitively: try “exact” and exact-like matches first, and then try with approximations • experimentally: find an optimal sequence • Apply further (less-reliable) steps only on (still) unmatched gene mentions
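The "apply further steps only on still-unmatched mentions" idea can be sketched as a small driver loop. This is a minimal illustration, not the authors' implementation: the name `run_cascade`, the matcher signature and the identifiers in the usage lines are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Assumption: a matcher maps a gene mention to a database identifier, or None.
Matcher = Callable[[str], Optional[str]]

def run_cascade(mentions: Iterable[str], steps: List[Matcher]) -> Dict[str, Optional[str]]:
    """Apply matchers in order; each later step sees only unmatched mentions."""
    results: Dict[str, Optional[str]] = {m: None for m in mentions}
    for step in steps:                      # most reliable step first
        for mention in list(results):
            if results[mention] is None:    # skip mentions already matched
                results[mention] = step(mention)
    return results

# Toy usage with made-up identifiers: an exact lookup, then a case-insensitive one.
exact = {"IL-17E": "GENE:1"}.get
relaxed = lambda m: {"il-17e": "GENE:1", "orp3": "GENE:2"}.get(m.lower())
print(run_cascade(["IL-17E", "ORP3", "xyz"], [exact, relaxed]))
```

Because a matched mention is never revisited, a less reliable step cannot override an earlier, more reliable decision, and individual steps can be switched on or off by editing the list.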

Dictionary re-engineering • Automatically generate gene name synonyms from existing DBs (e.g. Entrez Gene, UniProt) • use set of (generic, non-organism specific) rules to generate canonical representations of synonyms: alphaCP-4 protein → alphaCP4_protein; RP13-16H11.4 → RP13_6H11.4; Rev-ErbAalpha → Rev_ErbAalpha; ST3GALVI → ST3GALVI • Two versions: preserve original synonyms as well as normalised canonical forms
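A rough sketch of how rule-based canonical forms and the two dictionary versions might be built. The slide does not list the actual rules, so the two regular expressions below are assumptions inferred from a couple of the examples only; `canonical_form`, `build_dictionaries` and the identifiers in the usage lines are hypothetical.

```python
import re

def canonical_form(synonym: str) -> str:
    # Assumed rule 1: drop a hyphen that joins a letter to a following digit.
    s = re.sub(r"(?<=[A-Za-z])-(?=\d)", "", synonym.strip())
    # Assumed rule 2: turn remaining hyphens and whitespace runs into "_".
    return re.sub(r"[-\s]+", "_", s)

def build_dictionaries(records):
    """Return (original, normalised) maps from synonym to a set of gene ids."""
    original, normalised = {}, {}
    for gene_id, synonyms in records.items():
        for syn in synonyms:
            original.setdefault(syn, set()).add(gene_id)
            normalised.setdefault(canonical_form(syn), set()).add(gene_id)
    return original, normalised

# Toy usage with made-up identifiers.
orig_dict, norm_dict = build_dictionaries({"GENE:1": ["alphaCP-4 protein"],
                                           "GENE:2": ["Rev-ErbAalpha"]})
print(sorted(norm_dict))
```

Mapping each key to a set of identifiers keeps ambiguous synonyms visible instead of silently picking one gene.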

Pre-processing gene mentions • Generate a set of canonical representations of a gene mention • analogous to dictionary re-engineering • but with some add-ons and differences • resolving potential acronyms: interleukin (IL)-17E → interleukin -17E, IL-17E • resolving gene name coordinations: ORP 3 to 6, ORP 3 and 6 • token-based normalisation • Roman numbers, acronyms, Greek letters
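Two of the pre-processing steps above, sketched under stated assumptions. `resolve_acronym` reproduces the interleukin example by splitting a parenthesised acronym into a long-form and a short-form variant; `normalise_tokens` shows one possible token-based normalisation. The function names, the tiny lookup tables and the choice of target forms (Roman numeral to Arabic digit, Greek name to a Latin letter) are illustrative guesses, not the rules used in the talk.

```python
import re

def resolve_acronym(mention):
    """Split 'long (ACR)rest' into a long-form and a short-form variant."""
    m = re.match(r"^(.*?)\s*\(([^()]+)\)(.*)$", mention)
    if not m:
        return [mention]                     # nothing to resolve
    long_part, acronym, rest = m.groups()
    return [f"{long_part} {rest}", f"{acronym}{rest}"]

# Deliberately small, hypothetical tables; a real system would need full ones.
ROMAN = {"II": "2", "III": "3", "IV": "4", "VI": "6"}
GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}

def normalise_tokens(mention):
    """Map Roman-numeral and Greek-letter tokens to an assumed canonical form."""
    out = []
    for tok in re.findall(r"[A-Za-z]+|\d+", mention):
        if tok in ROMAN:
            out.append(ROMAN[tok])
        elif tok.lower() in GREEK:
            out.append(GREEK[tok.lower()])
        else:
            out.append(tok.lower())
    return "_".join(out)

print(resolve_acronym("interleukin (IL)-17E"))  # ['interleukin -17E', 'IL-17E']
```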

1st stage: exact matching • Step E1: Match original dictionary and original mentions • Step E2: Match normalised dictionary and normalised mentions • Step E3: Match normalised dictionary and token-based normalised mentions
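Steps E1 to E3 amount to three dictionary lookups tried in order. The sketch below assumes both dictionaries map a synonym string to a set of gene identifiers and that the two normalisers are supplied by the caller; all names, stand-in normalisers and identifiers are hypothetical.

```python
import re

def exact_like_match(mention, original, normalised, normalise, token_normalise):
    """Try E1, E2, E3 in order; return (step label, matching gene ids)."""
    attempts = [
        ("E1", original, mention),                      # original vs. original
        ("E2", normalised, normalise(mention)),         # normalised vs. normalised
        ("E3", normalised, token_normalise(mention)),   # normalised vs. token-normalised
    ]
    for step, dictionary, key in attempts:
        if key in dictionary:
            return step, dictionary[key]
    return None, set()                                  # left for the approximate stage

# Toy stand-ins for the real normalisers and dictionaries.
normalise = lambda m: re.sub(r"[-\s]+", "_", m)
token_normalise = lambda m: normalise(m).replace("_VI", "_6")
original = {"Rev-ErbAalpha": {"GENE:1"}}
normalised = {"Rev_ErbAalpha": {"GENE:1"}, "ORP_6": {"GENE:2"}}
print(exact_like_match("ORP VI", original, normalised, normalise, token_normalise))
```

Returning the step label alongside the identifiers makes it easy to report precision and recall per step.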

2nd stage: approximate matching • component-based comparisons • relevant (specific) component classes (Digit, Greek-Letter, Roman-Number, Chemical etc. tokens) • Step A1a: Component permutation (order) is ignored • Step A1b: Non-relevant components missing from a synonym are ignored
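One way to read steps A1a and A1b is as multiset comparisons over name components. The sketch below is an interpretation, not the authors' code: the tokeniser, the short Greek and Roman lists (chemical tokens are omitted) and the function names are assumptions, comparisons are case-sensitive for brevity, and the example names are toy inputs.

```python
import re
from collections import Counter

GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}

def components(name):
    """Split a name into alphabetic and numeric components."""
    return re.findall(r"[A-Za-z]+|\d+", name)

def is_relevant(token):
    """Assumed 'relevant' classes: digits, Greek-letter names, Roman numerals."""
    return (token.isdigit()
            or token.lower() in GREEK
            or re.fullmatch(r"I{1,3}|IV|VI{0,3}|IX|X", token) is not None)

def match_a1a(mention, synonym):
    """A1a: identical components, order ignored."""
    return Counter(components(mention)) == Counter(components(synonym))

def match_a1b(mention, synonym):
    """A1b: mention components absent from the synonym are tolerated
    only if none of them is relevant."""
    m, s = Counter(components(mention)), Counter(components(synonym))
    if s - m:                      # synonym has a component the mention lacks
        return False
    return all(not is_relevant(t) for t in (m - s))

print(match_a1a("receptor kinase 2", "kinase receptor 2"))  # True
print(match_a1b("alphaCP4 protein", "alphaCP4"))            # True
```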

2nd stage: approximate matching • Step A2a: One non-relevant extra component in a synonym is ignored • Step A2b: One non-relevant extra component in a synonym is ignored if all relevant components are matched
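The slide does not spell out how A2a and A2b differ in practice, so the sketch below encodes one possible reading, labelled as an assumption: A2a requires the synonym to equal the mention plus exactly one non-relevant component, while A2b only requires the relevant components to coincide and the synonym to carry exactly one non-relevant component the mention lacks. The helper names and the reduced notion of "relevant" (digits and Greek-letter names only) are hypothetical.

```python
import re
from collections import Counter

GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}

def is_relevant(token):
    # Reduced for brevity: Roman numerals and chemical tokens are not handled.
    return token.isdigit() or token.lower() in GREEK

def split_components(name):
    """Return (relevant, non-relevant) component multisets."""
    tokens = re.findall(r"[A-Za-z]+|\d+", name)
    rel = Counter(t for t in tokens if is_relevant(t))
    non = Counter(t for t in tokens if not is_relevant(t))
    return rel, non

def match_a2a(mention, synonym):
    """Assumed A2a: synonym == mention plus exactly one non-relevant component."""
    m_rel, m_non = split_components(mention)
    s_rel, s_non = split_components(synonym)
    extra = s_non - m_non
    return m_rel == s_rel and not (m_non - s_non) and sum(extra.values()) == 1

def match_a2b(mention, synonym):
    """Assumed A2b: relevant components coincide; synonym has one extra
    non-relevant component; other non-relevant differences are tolerated."""
    m_rel, m_non = split_components(mention)
    s_rel, s_non = split_components(synonym)
    return m_rel == s_rel and sum((s_non - m_non).values()) == 1

print(match_a2a("kinase 2", "kinase 2 protein"))        # True
print(match_a2b("human kinase 2", "kinase 2 protein"))  # True
```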

[Figure: overview of the cascade. Inputs: original mentions, normalised mentions, token-normalised mentions; original synonyms, normalised synonyms. Steps in order: Original vs. Original • Normalised vs. Normalised • Normalised vs. Token-normalised • Ignore word permutations • Ignore one missing non-relevant component • Ignore one extra non-relevant component • Ignore one extra non-relevant component if all relevant components are matched]

Experiments • Experimental context • BioCreative II data set • Map human genes to Entrez Gene

Results: exact-like matching

Results: approximate matching

Cumulative performance • Precision: 0.93 • Recall: 0.69 • F-measure: 0.79 • For comparison (BioCreative II test data) • Precision: 0.94 • Recall: 0.72 • F-measure: 0.81
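For reference, the F-measure can be recomputed from precision and recall. The sketch assumes the balanced F-measure (the harmonic mean of the two), which the slide does not state explicitly; the function name is illustrative. Recomputing from the rounded figures on the slides can differ from the reported value in the last digit, since the originals were presumably computed from unrounded counts.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Cumulative performance from the slide: P = 0.93, R = 0.69.
print(round(f_measure(0.93, 0.69), 2))  # 0.79
```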

Some conclusions • Exact-like matching achieves 0.76 F-measure (0.96 P, 0.64 R) • Approximate matching improves recall by only 10-15% • ignoring word order is effective (for both recall and precision), as is ignoring one extra non-relevant component (for recall) • Some approaches are consistent across different test sets, some are not • e.g. precision of approximate matching: 0.63 – 0.78; recall of exact matching: 0.59 – 0.68

Summary • Simple yet effective approach • cascaded approach with reliable matching strategies which can be switched on and off • some are good for precision, some for recall • can be easily used for other species • More work needed on • gene name coordination and enumerations • acronyms/symbols embedded in mentions • species identification

Acknowledgements • Partially funded by UK BBSRC (Project “Mining Term Associations from Literature to Support Knowledge Discovery in Biology”) • Manchester Interdisciplinary Biocentre (Irena Spasic) • Faculty of Life Sciences (Casey Bergman) • National Centre for Text Mining (NaCTeM)
