A Cascaded Approach to Normalising Gene Mentions in Biomedical Literature Hui Yang, Goran Nenadic, John Keane School of Computer Science, Manchester Interdisciplinary BioCentre G.Nenadic@manchester.ac.uk
Identification of gene names • Gene/protein names are essential for integrating and exploring bio-literature • e.g. building/browsing regulatory networks • Two-step process • recognise gene mentions in text • map these to a referent database • Gene name variability and ambiguity: “biologists would rather share a toothbrush than a gene name”
Outline • Overview – why a cascaded approach • Dictionary for matching • Exact-like matching • Approximate matching • Experiments • Summary and conclusions
Why a cascaded approach? • Overall aim: improve recall, but with a controlled loss of precision • Give your best shot first • intuitively: try “exact” and exact-like matches first, and then try with approximations • experimentally: find an optimal sequence • Apply further (less-reliable) steps only on (still) unmatched gene mentions
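The cascade idea can be sketched as a small driver loop: matchers are tried in order of reliability, and each later matcher only sees mentions that earlier ones left unmatched. This is a minimal illustration, not the authors' implementation; `run_cascade`, the `Matcher` type and the toy matchers in the usage note are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A matcher takes a gene mention and returns a gene identifier or None.
# (Hypothetical interface, assumed for this sketch.)
Matcher = Callable[[str], Optional[str]]


def run_cascade(
    mentions: Iterable[str], matchers: List[Matcher]
) -> Tuple[Dict[str, str], List[str]]:
    """Apply matchers in order; later matchers only see unmatched mentions."""
    resolved: Dict[str, str] = {}
    unmatched = list(mentions)
    for matcher in matchers:
        still_unmatched = []
        for mention in unmatched:
            gene_id = matcher(mention)
            if gene_id is None:
                still_unmatched.append(mention)
            else:
                resolved[mention] = gene_id
        unmatched = still_unmatched
    return resolved, unmatched
```

For instance, with an exact lookup followed by a case-insensitive lookup (both toy matchers with made-up identifiers), a mention resolved by the first step is never passed to the second, so a less reliable step cannot override a more reliable one. Switching a step on or off amounts to adding or removing it from the list.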
Dictionary re-engineering • Automatically generate gene name synonyms from existing DBs (e.g. Entrez Gene, UniProt) • use a set of (generic, non-organism-specific) rules to generate canonical representations of synonyms: alphaCP-4 protein → alphaCP4_protein; RP13-16H11.4 → RP13_6H11.4; Rev-ErbAalpha → Rev_ErbAalpha; ST3GALVI → ST3GALVI • Two versions: preserve original synonyms as well as normalised canonical forms
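A minimal sketch of how canonical forms and the two dictionary versions might be produced. The two rules used here (drop a hyphen between a letter and a digit; turn remaining hyphens and spaces into underscores) are assumptions inferred from three of the examples on the slide, not the actual rule set; `canonical_form`, `reengineer` and the identifier `GENE_A` are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def canonical_form(synonym: str) -> str:
    """Assumed rules: remove letter-digit hyphens, then unify separators."""
    s = synonym.strip()
    s = re.sub(r"(?<=[A-Za-z])-(?=\d)", "", s)  # alphaCP-4 -> alphaCP4
    s = re.sub(r"[\s\-]+", "_", s)              # spaces/hyphens -> underscore
    return s


def reengineer(
    dictionary: Dict[str, List[str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build two lookup tables: original synonyms and canonical forms.

    `dictionary` maps a gene identifier to its list of synonyms.
    Values are sets because one name can belong to several genes.
    """
    original: Dict[str, Set[str]] = {}
    normalised: Dict[str, Set[str]] = {}
    for gene_id, synonyms in dictionary.items():
        for syn in synonyms:
            original.setdefault(syn, set()).add(gene_id)
            normalised.setdefault(canonical_form(syn), set()).add(gene_id)
    return original, normalised
```

Keeping both tables lets an exact match on the original spelling be tried before the normalised one.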
Pre-processing gene mentions • Generate a set of canonical representations of a gene mention • analogous to dictionary re-engineering • but with some add-ons and differences • resolving potential acronyms: interleukin (IL)-17E → interleukin-17E, IL-17E • resolving gene name coordinations (e.g. ORP 3 to 6; ORP 3 and 6) • token-based normalisation • Roman numbers, acronyms, Greek letters
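Two of these pre-processing steps can be illustrated with a short sketch: splitting a mention with a bracketed acronym into a long and a short variant, and token-based normalisation of Roman numbers and Greek letters. The regular expression, the lookup tables and the choice of target forms (Arabic digits, single Latin letters) are assumptions made for illustration; coordination resolution is not covered.

```python
import re
from typing import List

# Small illustrative tables; a real system would need fuller ones.
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5", "vi": "6"}
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d"}

ACRONYM = re.compile(r"^(.*?)\s*\(([^()]+)\)(.*)$")


def expand_acronym(mention: str) -> List[str]:
    """'long (SHORT)rest' -> ['long' + rest, 'SHORT' + rest]."""
    m = ACRONYM.match(mention)
    if m is None:
        return [mention]
    long_form, short_form, rest = m.groups()
    return [long_form + rest, short_form + rest]


def token_normalise(mention: str) -> str:
    """Lowercase tokens and map Roman numbers / Greek letters."""
    tokens = []
    for tok in re.split(r"[\s_\-]+", mention.strip()):
        key = tok.lower()
        tokens.append(ROMAN.get(key) or GREEK.get(key) or key)
    return "_".join(tokens)
```

With the slide's example, `expand_acronym("interleukin (IL)-17E")` yields `["interleukin-17E", "IL-17E"]`. Note that a naive Roman-number table will also rewrite genuine one-letter tokens such as "I" or "V", which is one reason this kind of normalisation is less reliable than exact matching.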
1st stage: exact matching • Step E1: Match original dictionary and original mentions • Step E2: Match normalised dictionary and normalised mentions • Step E3: Match normalised dictionary and token-based normalised mentions
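The three exact-matching steps can be sketched as successive dictionary lookups. Everything below is illustrative: the normalisation rules are assumed, `receptor 2` is a made-up synonym, and `GENE_A` / `GENE_B` are placeholder identifiers rather than real Entrez Gene entries.

```python
import re
from typing import Optional, Tuple

ROMAN = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5", "VI": "6"}


def normalise(name: str) -> str:
    """Assumed canonical form: drop letter-digit hyphens, unify separators."""
    name = re.sub(r"(?<=[A-Za-z])-(?=\d)", "", name.strip())
    return re.sub(r"[\s\-]+", "_", name)


def token_normalise(name: str) -> str:
    """Canonical form with Roman-number tokens replaced by digits."""
    return "_".join(ROMAN.get(tok, tok) for tok in normalise(name).split("_"))


# Toy dictionary (placeholder identifiers).
ORIGINAL = {"alphaCP-4 protein": "GENE_A", "receptor 2": "GENE_B"}
NORMALISED = {normalise(syn): gid for syn, gid in ORIGINAL.items()}


def exact_stage(mention: str) -> Optional[Tuple[str, str]]:
    """Return (step, gene_id) for the first step that matches, else None."""
    if mention in ORIGINAL:                      # E1: original vs. original
        return "E1", ORIGINAL[mention]
    key = normalise(mention)
    if key in NORMALISED:                        # E2: normalised vs. normalised
        return "E2", NORMALISED[key]
    key = token_normalise(mention)
    if key in NORMALISED:                        # E3: normalised vs. token-normalised
        return "E3", NORMALISED[key]
    return None
```

Here `alphaCP4 protein` is only found at step E2 and `receptor II` only at step E3; a mention that returns `None` would be handed on to the approximate stage.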
2nd stage: approximate matching • component-based comparisons • relevant (specific) component classes (Digit, Greek-Letter, Roman-Number, Chemical etc. tokens) • Step A1a: Component permutation (order) is ignored • Step A1b: Non-relevant components missing from a synonym are ignored
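One possible reading of steps A1a and A1b, sketched with multisets of components. The tokeniser, the (very small) Greek and Roman tables and the example names are assumptions; chemical tokens are left out, and the slides do not say how components are actually extracted.

```python
import re
from collections import Counter
from typing import List

GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
ROMAN = {"i", "ii", "iii", "iv", "v", "vi"}


def components(name: str) -> List[str]:
    """Split a name into lowercase alphabetic and numeric components."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+|\d+", name)]


def is_relevant(component: str) -> bool:
    """Digits, Greek letters and Roman numbers count as relevant here."""
    return component.isdigit() or component in GREEK or component in ROMAN


def match_a1a(mention: str, synonym: str) -> bool:
    """A1a: same components, order ignored."""
    return Counter(components(mention)) == Counter(components(synonym))


def match_a1b(mention: str, synonym: str) -> bool:
    """A1b: relevant components must agree; non-relevant components that
    the mention has but the synonym lacks are ignored."""
    m, s = components(mention), components(synonym)
    m_rel = Counter(c for c in m if is_relevant(c))
    s_rel = Counter(c for c in s if is_relevant(c))
    m_other = Counter(c for c in m if not is_relevant(c))
    s_other = Counter(c for c in s if not is_relevant(c))
    return m_rel == s_rel and not (s_other - m_other)
```

With made-up names, `kinase 2 alpha` matches `alpha kinase 2` under A1a, while `protein kinase 2 alpha` matches it only under A1b because `protein` is missing from the synonym. A mention with a different number (`kinase 3 alpha`) matches under neither, since relevant components are never relaxed.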
2nd stage: approximate matching • Step A2a: One non-relevant extra component in a synonym is ignored • Step A2b: One non-relevant extra component in a synonym is ignored if all relevant components are matched
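A sketch of the "one extra non-relevant component" idea. The slide text does not make the exact difference between A2a and A2b recoverable, so the function below implements only the stricter condition as worded for A2b: the synonym may carry exactly one extra non-relevant component, provided all relevant components agree. The tokeniser, tables and example names are assumptions.

```python
import re
from collections import Counter
from typing import Tuple

GREEK = {"alpha", "beta", "gamma", "delta"}
ROMAN = {"i", "ii", "iii", "iv", "v", "vi"}


def split_components(name: str) -> Tuple[Counter, Counter]:
    """Return (relevant, non_relevant) component multisets for a name."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+|\d+", name)]
    relevant, other = Counter(), Counter()
    for tok in tokens:
        if tok.isdigit() or tok in GREEK or tok in ROMAN:
            relevant[tok] += 1
        else:
            other[tok] += 1
    return relevant, other


def match_one_extra(mention: str, synonym: str) -> bool:
    """True if the synonym equals the mention plus exactly one extra
    non-relevant component, with all relevant components matched."""
    m_rel, m_other = split_components(mention)
    s_rel, s_other = split_components(synonym)
    if m_rel != s_rel:          # relevant components must all match
        return False
    if m_other - s_other:       # mention has something the synonym lacks
        return False
    extra = s_other - m_other   # what the synonym has on top
    return sum(extra.values()) == 1
```

With made-up names, `kinase 2` matches the synonym `kinase 2 protein`, but not `kinase 2 alpha` (the extra component is relevant) and not `kinase 2 binding protein` (two extra components).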
[Diagram: the matching cascade. Exact-like stage: original mentions vs. original synonyms (Original vs. Original); normalised mentions vs. normalised synonyms (Normalised vs. Normalised); token-normalised mentions vs. normalised synonyms (Normalised vs. Token-normalised). Approximate stage: ignore word permutations; ignore one missing non-relevant component; ignore one extra non-relevant component; ignore one extra non-relevant component if all relevant components are matched.]
Experiments Experimental context • BioCreative II data set • Map human genes to Entrez Gene
Cumulative performance • Precision: 0.93 • Recall: 0.69 • F-measure: 0.79 • For comparison (BioCreative II test data) • Precision: 0.94 • Recall: 0.72 • F-measure: 0.81
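For reference, the F-measure is presumably the balanced F1 score, the harmonic mean of precision and recall; the slides do not state the weighting, so this is an assumption. A minimal helper:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First set of figures on the slide: P = 0.93, R = 0.69
print(round(f_measure(0.93, 0.69), 2))  # → 0.79
```

Because the precision and recall on the slides are already rounded to two decimals, recomputing F from them can differ from a reported F-measure in the last decimal place.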
Some conclusions • Exact-like matching achieves 0.76 F-measure (0.96 P, 0.64 R) • Approximate matching improves recall by only 10-15% • ignoring word order is effective (both recall- and precision-wise), as is ignoring one extra non-relevant component (recall) • Some approaches are consistent across different test sets, some are not • e.g. precision of approximate matching: 0.63 – 0.78; recall of exact matching: 0.59 – 0.68
Summary • Simple yet effective approach • cascaded approach with reliable matching strategies which can be switched on and off • some are good for precision, some for recall • can be easily used for other species • More work needed on • gene name coordination and enumerations • acronyms/symbols embedded in mentions • species identification
Acknowledgements • Partially funded by UK BBSRC (Project “Mining Term Associations from Literature to Support Knowledge Discovery in Biology”) • Manchester Interdisciplinary Biocentre (Irena Spasic) • Faculty of Life Sciences (Casey Bergman) • National Centre for Text Mining (NaCTeM)