Combining terminology resources and statistical methods for entity recognition: an evaluation • Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo • presented by George Demetriou • Natural Language Processing Group, University of Sheffield, UK
Introduction • Combining techniques for entity recognition: • Dictionary based term recognition • Filtering of ambiguous terms • Statistical entity recognition • How do the techniques compare: separately and in combination? • When combined, can we retain the advantages of both?
Semantic annotation of clinical text • [Example annotation: "Punch biopsy of skin. No lesion on the skin surface following fixation." with Punch biopsy labelled Investigation, skin labelled Locus, lesion labelled Condition, and skin surface labelled Locus] • Our basic task is semantic annotation of clinical text • For the purposes of this paper, we ignore: • Modifiers such as negation • Relations and coreference • These are the subject of other papers
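One way to picture the target annotations is as typed character spans over the text. A minimal sketch in Python, assuming a simple Entity record; the class, field names, and span offsets are illustrative, not the actual CLEF/GATE annotation model:

```python
# Entities as typed character spans (illustrative representation only).
from dataclasses import dataclass

@dataclass
class Entity:
    start: int   # offset of the first character
    end: int     # offset one past the last character
    type: str    # e.g. "Investigation", "Locus", "Condition"

text = "Punch biopsy of skin. No lesion on the skin surface following fixation."
entities = [
    Entity(0, 12, "Investigation"),   # "Punch biopsy"
    Entity(16, 20, "Locus"),          # "skin"
    Entity(25, 31, "Condition"),      # "lesion"
    Entity(39, 51, "Locus"),          # "skin surface"
]
for e in entities:
    print(f"{text[e.start:e.end]!r} -> {e.type}")
```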
Entity recognition in specialist domains • Specialist domains, e.g. medicine, are rich in: • Complex terminology • Terminology resources and ontologies • We might expect these resources to be of use in entity recognition • We might expect annotation using these resources to add value to the text, providing additional information to applications
Ambiguity in term resources • Most term resources have not been designed with NLP applications in mind • When used for dictionary lookup, many suffer from problems of ambiguity • I: Iodine, an Iodine test or the personal pronoun • be: bacterial endocarditis or the root of a verb • Various techniques can overcome this: • Filtering or elimination of problematic terms • Use of context: in our case, statistical models
Corpus: the CLEF gold standard • For experiments, we used a manually annotated gold standard • Careful construction of a schema and guidelines • Double annotation with a consensus step • Measurement of Inter Annotator Agreement (IAA) • (Roberts et al 2008 LREC bio text mining workshop) • For the experiments reported, we use 77 gold standard documents
Dictionary lookup: Termino • [Architecture diagram: external terminologies, ontologies and databases are loaded into the Termino database; Termino matchers and annotators are compiled from it, with links back to the source resources] • Termino is loaded from external resources • FSM matchers are compiled out of Termino
Finding entities with Termino • [Pipeline diagram: application texts pass through a GATE application pipeline of linguistic pre-processing followed by Termino term recognition, producing annotated texts] • Termino loaded with selected terms from UMLS (600K terms) • Pre-processing includes tokenisation and morphological analysis • Lookup is against the roots of tokens
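To make the lookup step concrete, here is a hedged sketch of greedy longest-match lookup over token roots; the dictionary entries and matching strategy are illustrative simplifications of Termino's compiled FSM matchers:

```python
# Greedy longest-match dictionary lookup over morphological roots.
dictionary = {
    ("lesion",): "Condition",
    ("be",): "Condition",   # "be" = bacterial endocarditis (ambiguous!)
}

def lookup(roots, dictionary, max_len=5):
    """Return (start, end, type) matches over a sequence of token roots."""
    matches, i = [], 0
    while i < len(roots):
        for n in range(min(max_len, len(roots) - i), 0, -1):
            key = tuple(roots[i:i + n])
            if key in dictionary:
                matches.append((i, i + n, dictionary[key]))
                i += n
                break
        else:
            i += 1   # no match starting here: move on
    return matches

# Roots of "No lesions were seen" after morphological analysis
print(lookup(["no", "lesion", "be", "see"], dictionary))
# [(1, 2, 'Condition'), (2, 3, 'Condition')] <- second is a false positive
```

The root "be" (from "were") fires the bacterial endocarditis entry: exactly the kind of general-language ambiguity the filter list below is built to catch.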
Filtering problematic terms • Many UMLS terms are not suitable for NLP • Ambiguity with common general language words • To identify the most problematic of these, we ran Termino over a separate development corpus, and manually inspected the results • A supplementary list of missing terms was compiled by domain experts (6 terms) • Creation of these lists took a couple of hours
Creating the filter list • Add all unique terms of 1 character to the list • For all unique terms of <= 6 characters: • Add to the list if it matches a common general language word or abbreviation • Add to the list if it has a numeric component • Reject from the list if it is an obvious technical term • Reject from the list if none of the above apply • Filter list size: 232 terms
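The list-building rules above translate almost directly into code. A sketch, where COMMON_WORDS stands in for a real common-word/abbreviation resource; the manual "obvious technical term" judgement cannot be automated and falls through to the default "keep":

```python
# Hedged sketch of the filter-list heuristics described above.
import re

COMMON_WORDS = {"i", "be", "on", "all", "aid", "lab"}  # illustrative only

def should_filter(term: str) -> bool:
    term = term.lower()
    if len(term) == 1:                    # all 1-character terms are filtered
        return True
    if len(term) <= 6:
        if term in COMMON_WORDS:          # matches a common word or abbreviation
            return True
        if re.search(r"\d", term):        # has a numeric component
            return True
    return False                          # otherwise the term is kept

for t in ["I", "be", "T4", "aorta", "lesion"]:
    print(t, "filtered" if should_filter(t) else "kept")
```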
Entities found by Termino • UMLS alone gives poor precision, due to term ambiguity with general language words • Adding in the filter list improves precision with little loss in recall
Statistical entity recognition • Statistical entity recognition allows us to model context • We use an SVM implementation provided with GATE • Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE
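As a sketch of the multi-class-to-binary mapping that GATE handles internally, here is a one-vs-rest arrangement using scikit-learn purely for illustration; scikit-learn is not GATE's SVM engine, and the toy features and labels are assumptions:

```python
# One-vs-rest mapping of a multi-class entity task onto binary SVMs.
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X_dicts = [
    {"root": "biopsy", "pos": "NN"},
    {"root": "skin", "pos": "NN"},
    {"root": "lesion", "pos": "NN"},
    {"root": "the", "pos": "DT"},
]
y = ["Investigation", "Locus", "Condition", "O"]  # "O" = not an entity

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

# One binary SVM per class; the highest-scoring class wins at prediction time
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict(vec.transform([{"root": "lesion", "pos": "NN"}])))
```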
Features for machine learning • Token kind (e.g. number, word) • Orthographic type (e.g. lower case, upper case) • Morphological root • Affix • Generalised part of speech: the first two characters of the Penn Treebank tag • Termino recognised terms
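A hedged sketch of extracting these features for a single token; the token fields and the suffix-as-affix heuristic are illustrative assumptions, not GATE's actual feature extraction API:

```python
# Illustrative extraction of the listed features for one token.
def token_features(tok):
    return {
        "kind": "number" if tok["text"].isdigit() else "word",
        "orth": "upper" if tok["text"].isupper() else "lower",
        "root": tok["root"],                 # morphological root
        "affix": tok["text"][-3:],           # crude affix: last three characters
        "pos": tok["pos"][:2],               # generalised POS: first 2 chars of Penn tag
        "termino": tok.get("termino", "O"),  # Termino-recognised term type, if any
    }

tok = {"text": "biopsy", "root": "biopsy", "pos": "NN", "termino": "Investigation"}
print(token_features(tok))
```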
Finding entities: ML • [Pipeline diagram: in the GATE training pipeline, gold standard annotated texts (human annotated) pass through linguistic processing and term model learning to produce a statistical model of text; in the GATE application pipeline, application texts pass through linguistic processing and term model application, using that model, to produce annotated texts]
Finding entities: ML + Termino • [Pipeline diagram: as above, with Termino term recognition inserted after linguistic processing in both the training and application pipelines, so that Termino's matches are available to the term model]
Entities found by SVM • Statistical entity recognition alone gives higher precision than dictionary lookup, but lower recall • The combined system gains the higher recall of dictionary lookup, with no loss in precision
Linkage to external resources • [Example: "The peritoneum contains deposits of tumour... the tumour cells are negative for desmin."] • Semantic annotation allows us to link texts to existing domain resources • Giving more intelligent indexing and making additional information available to applications
Linkage to external resources • UMLS links terms to Concept Unique Identifiers (CUIs) • Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI • If the SVM finds an entity where Termino has found nothing, the entity cannot be linked to a CUI
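A minimal sketch of this linkage logic, with placeholder CUI values rather than real UMLS identifiers:

```python
# Entities backed by a Termino match inherit the CUI(s) stored for that
# term; SVM-only entities stay unlinked. CUI values are placeholders.
term_to_cuis = {
    "lesion": ["C0000000"],   # placeholder CUI
    "desmin": ["C1111111"],   # placeholder CUI
}

def link_entity(entity_text, from_termino):
    if not from_termino:
        return None           # found by the SVM alone: no CUI available
    return term_to_cuis.get(entity_text.lower())

print(link_entity("lesion", from_termino=True))     # ['C0000000']
print(link_entity("swelling", from_termino=False))  # None
```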
CUIs assigned • At least one CUI can be automatically assigned to 83% of the terms in the gold standard • Some are ambiguous, and resolution is needed
Availability • Most of the software is open source and can be downloaded as part of GATE • We are currently packaging Termino for public release • We are currently preparing a UK research ethics committee application for release of the annotated gold standard
Conclusions • Dictionary lookup gives good recall but poor precision, due to term ambiguity • Much of the ambiguity is due to a small number of terms, which can be filtered with little loss in recall • Combining dictionary lookup with statistical models of context improves precision • A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system
Questions? • http://www.clinical-escience.org • http://www.clef-user.com