Towards a semantic extraction of named entities Diana Maynard, Kalina Bontcheva, Hamish Cunningham University of Sheffield, UK
Introduction • Challenges posed by progression from traditional IE to a more semantic representation of NEs • What techniques are best for the deeper level of analysis necessary? • Can traditional rule-based methods cope with such a transition, or does the future lie solely with machine learning?
The ACE program “A program to develop technology to extract and characterise meaning from human language” Aims: • produce structured information about entities, events and the relations that hold between them • promote design of more generic systems rather than those tuned to a very specific domain and text type (as with MUC)
The ACE tasks • Identification of entities and classification into semantic types (Person, Organisation, Location, GPE, Facility) • Identification and coreference of all mentions of each entity in the text (name, pronominal, nominal) • Identification of relations holding between such entities
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION"> <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention> <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention> <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention> <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention> </entity>
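An entity record in this format can be read with any standard XML parser. The following is a minimal sketch, using Python's `xml.etree.ElementTree`, that parses the entity shown above and groups its mention strings by their `TYPE` attribute. The names `ACE_ENTITY` and `mentions_by_type` are assumptions made for this illustration and are not part of any ACE or GATE tooling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The entity from the slide, laid out over several lines for readability.
ACE_ENTITY = """
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"></entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"></entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"></entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"></entity_mention>
</entity>
"""


def mentions_by_type(xml_text):
    """Return (entity_type, {mention TYPE: [mention strings]}) for one entity."""
    entity = ET.fromstring(xml_text.strip())
    grouped = defaultdict(list)
    for mention in entity.iter("entity_mention"):
        grouped[mention.get("TYPE")].append(mention.get("string"))
    return entity.get("entity_type"), dict(grouped)


entity_type, grouped = mentions_by_type(ACE_ENTITY)
print(entity_type, grouped)
```

All four mentions, including the pronoun "its", belong to the same entity, which is what the coreference part of the task requires.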
The MACE System • Rule-based NE system developed within GATE, adapted from ANNIE • Processing resources (PRs): tokeniser, sentence splitter, POS tagger, gazetteer, semantic tagger, orthomatcher, pronominal and nominal coreferencer • Also: genre identification, and a switching controller that selects different PRs automatically
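The idea of a controller that picks a different chain of processing resources per document can be sketched in a few lines. This is a toy illustration only: the functions below are simplified stand-ins, not GATE APIs, and the genre heuristic, the `lowercase_normaliser` step and the `PIPELINES` table are all hypothetical.

```python
# Toy stand-ins for processing resources (PRs); these are NOT GATE APIs.
def tokeniser(doc):
    doc["tokens"] = doc["text"].split()
    return doc


def sentence_splitter(doc):
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc


def lowercase_normaliser(doc):
    # Hypothetical extra step for texts that arrive in a single case.
    doc["tokens"] = [t.lower() for t in doc["tokens"]]
    return doc


def genre_id(doc):
    # Hypothetical heuristic: all-uppercase text is treated as a broadcast transcript.
    doc["genre"] = "broadcast" if doc["text"].isupper() else "newswire"
    return doc


# The "switching controller": one ordered list of PRs per genre.
PIPELINES = {
    "newswire": [tokeniser, sentence_splitter],
    "broadcast": [tokeniser, lowercase_normaliser, sentence_splitter],
}


def run(text):
    """Identify the genre, then apply the PRs selected for that genre in order."""
    doc = genre_id({"text": text})
    for pr in PIPELINES[doc["genre"]]:
        doc = pr(doc)
    return doc


print(run("NATS SAID SO.")["tokens"])
print(run("Nats said so.")["genre"])
```

Keeping the genre decision separate from the individual PRs means a new text type only needs a new entry in the table, not changes to the PRs themselves.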
Differences between ANNIE and MACE • Locations → Location / GPE • GPEs have roles (GPE, Per, Org, Loc) • New type Facility (subsumes some Orgs) • Metonymy means context is necessary for disambiguation (e.g. England the cricket team vs England the country) • No Date, Time, Money, Percent, Address, Identifier
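The metonymy point can be made concrete with a toy context rule. The sketch below is not a MACE rule and is far simpler than a JAPE grammar; the cue words in `ORG_CUES`, the window size and the function name `gpe_role` are assumptions chosen only to show how the words around a name can decide its role.

```python
# Hypothetical cue words; a real grammar would use gazetteer lookups and richer context.
ORG_CUES = {"team", "squad"}


def gpe_role(tokens, index, window=3):
    """Pick a role for the GPE name at tokens[index] from the words that follow it."""
    following = {t.lower() for t in tokens[index + 1 : index + 1 + window]}
    return "Org" if following & ORG_CUES else "GPE"


print(gpe_role("England cricket team won the match".split(), 0))
print(gpe_role("He lives in England".split(), 3))
```

The same string "England" receives a different role in each sentence, which is why a gazetteer lookup alone is not enough for this task.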
What does this mean in practical terms? • Separation of specific from general information makes adaptation easier • Reclassification of gazetteers unnecessary • Changes mainly to semantic grammars, in order to: - use different gazetteer lookups - use more contextual information - group rules together differently
Semantic Grammars • ANNIE uses 21 phases, 187 rules, 9 entity types (av. 20.8 rules per entity type) • MACE uses 15 phases, 180 rules, 5 entity types (av. 36 rules per entity type) • The important factor is the increased complexity of new rules, rather than the number • Rules may be hand-crafted, but an experienced JAPE user can write several rules per minute • 6 weeks for adaptation
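The per-type averages quoted above follow directly from the rule and entity-type counts. A short check, with the counts taken from the slide and the variable names chosen for illustration:

```python
# (number of rules, number of entity types), as given on the slide.
systems = {"ANNIE": (187, 9), "MACE": (180, 5)}

averages = {
    name: round(rules / types, 1)
    for name, (rules, types) in systems.items()
}
print(averages)
```

MACE has slightly fewer rules overall but spreads them over fewer entity types, so each type is covered by noticeably more rules.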
Evaluation (2) • NEWS – 92 articles (business news) • ACE – 86 broadcast news texts from the September 2002 evaluation • Difference on ACE task • MACE on MUC-style annotations, under two conditions: - GPEs are left as GPE (so count as errors) - GPEs are mapped to Locations
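The effect of the two scoring conditions can be illustrated with a small sketch. The annotations below are invented toy data, not taken from the NEWS or ACE corpora, and the function names are assumptions; the point is only that a GPE label scored against MUC-style Location annotations counts as an error unless it is mapped first.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over (text, label) pairs."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def map_gpe_to_location(annotations):
    """Relabel every GPE as Location, leaving other labels unchanged."""
    return {
        (text, "Location" if label == "GPE" else label)
        for text, label in annotations
    }


# Toy data (hypothetical): MUC-style gold labels vs. ACE-style system output.
gold = {("England", "Location"), ("NATS", "Organization"), ("London", "Location")}
predicted = {("England", "GPE"), ("NATS", "Organization"), ("London", "Location")}

strict = precision_recall(gold, predicted)                        # GPE left as GPE
mapped = precision_recall(gold, map_gpe_to_location(predicted))   # GPE mapped to Location
print(strict, mapped)
```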
Comparison of ANNIE vs MACE • 72% Precision, 84% Recall if GPEs are mapped to Locations
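Precision and recall are often combined into a single F-measure. The slide does not report one, so the value computed below is derived only from the two figures quoted above, using the standard balanced F-measure (beta = 1); the function name is an assumption for this sketch.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_measure(0.72, 0.84)
print(round(f1, 3))
```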
Conclusions • MACE is a rule-based NE system, in contrast with most systems, which use ML • Advantages: it does not require much training data, and it is fast to adapt because of its robust design • If large amounts of training data are available, HMM-based systems tend to perform slightly better • Rule-based systems tend to be good at recall but sometimes low on precision unless additionally supported by ML methods