Named-Entity Recognition with Character-Level Models
Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning
Stanford University
CoNLL-2003: Seventh Conference on Natural Language Learning
Unknown Words are a Central Challenge for NER
• Recognizing known named entities (NEs) is relatively simple and accurate
• Recognizing novel NEs requires recognizing context and/or word-internal features
• External context and frequent internal words (e.g. "Inc.") are the most commonly used features
• The internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002):
  • Staffordshire
  • Abdul-Karim al-Kabariti
  • CentrInvest
Are Names Self-Describing?
• NO: names can be opaque/ambiguous
  • Word-level: "Washington" occurs as LOC, PER, and ORG
  • Char-level: "-ville" suggests LOC, but there are exceptions like "Neville"
• YES: names can be highly distinctive/descriptive
  • Word-level: "National Bank" is a bank (i.e. ORG)
  • Char-level: "Cotramoxazole" is clearly a drug name
• Question: overall, how informative are names alone?
How Internally Descriptive are Isolated Named Entities?
• Classification accuracy of pre-segmented CoNLL NEs without context is ~90%
• Using character n-grams as features instead of words yields a 25% error reduction
• On single-word unknown NEs, the word model is at chance; the char n-gram model fixes 38% of its errors
[Chart: NE classification accuracy (%); not the CoNLL task]
Exploiting Word-Internal Features
• Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)
  • e.g. Mikheev 97, Wacholder et al. 97, Bikel et al. 97
  • Features are usually language-dependent (e.g. morphology)
• Our approach: use char n-grams as the primary representation
• Use all substrings as classification features, e.g. for #Tom#:
  • #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
• Char n-grams subsume word features
• Features are language-independent (assuming the language is alphabetic)
• Similar in spirit to Cucerzan and Yarowsky (99), but uses ALL char n-grams vs. just prefix/suffix
(see the extraction sketch below)
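The slides contain no code; the following minimal Python sketch (an illustration, not the authors' implementation) extracts all substrings of a boundary-marked word as features, reproducing the #Tom# example above. The function name char_ngram_features and the choice to skip the bare "#" marker are assumptions.

```python
def char_ngram_features(word, marker="#"):
    """All contiguous substrings of the boundary-marked word,
    skipping the bare marker itself (an assumption, chosen to match
    the #Tom# example, which lists neither lone '#')."""
    s = marker + word + marker
    feats = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if s[i:j] != marker:
                feats.add(s[i:j])
    return feats

# Reproduces the slide's example feature set for "Tom"
assert char_ngram_features("Tom") == {
    "#Tom#", "#Tom", "Tom#", "#To", "Tom", "om#",
    "#T", "To", "om", "m#", "T", "o", "m",
}
```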
Character-Feature Based Classifier
• Model I: independent classification at each word
  • Maxent classifiers, trained using conjugate gradient
  • Equal-scale Gaussian priors for smoothing
  • Trained models with >800K features in ~2 hrs
  • POS tags and contextual features complement the n-grams
(a training sketch follows below)
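As a rough illustration of Model I (the slides give no code), here is a sketch using scikit-learn's DictVectorizer and L2-regularized LogisticRegression as a stand-in for the paper's conjugate-gradient-trained maxent: an L2 penalty plays the role of the equal-scale Gaussian prior. The toy training examples and feature names are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(word, prev_word, next_word):
    # All substrings of the boundary-marked word, as in the sketch above
    s = "#" + word + "#"
    feats = {"sub=" + s[i:j]: 1.0
             for i in range(len(s))
             for j in range(i + 1, len(s) + 1)
             if s[i:j] != "#"}
    # Contextual features complement the character n-grams
    feats["prev=" + prev_word.lower()] = 1.0
    feats["next=" + next_word.lower()] = 1.0
    return feats

# Toy training set (invented): (word, previous word, next word, label)
train = [
    ("Staffordshire", "in", ",", "LOC"),
    ("Inc.", "Acme", "said", "ORG"),
    ("Jones", "Doug", "said", "PER"),
    ("said", "Jones", "that", "O"),
]
X = [features(w, p, n) for w, p, n, _ in train]
y = [label for *_, label in train]

vec = DictVectorizer()
# C is the inverse regularization strength; the L2 penalty corresponds
# to a Gaussian prior on the feature weights.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(vec.fit_transform(X), y)

# Classify a previously unseen single-word NE from its internal features
print(clf.predict(vec.transform([features("CentrInvest", "with", ",")])))
```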
Character-Based CMM
• Model II: joint classification along the sequence
• Previous classification decisions are clearly relevant:
  • "Grace Road" is a single location, not a person + location
• Include neighboring classification decisions as features
• Perform joint inference across the chain of classifiers
• Conditional Markov Model (CMM, a.k.a. maximum-entropy Markov model)
  • Borthwick 1999, McCallum et al. 2000
(a decoding sketch follows below)
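The slides do not spell out the inference procedure; below is a hypothetical Python sketch of left-to-right CMM decoding with a small beam, where each position's score conditions on the previous label. local_score is a toy stand-in with invented features, not the trained maxent model.

```python
LABELS = ["O", "PER", "LOC", "ORG"]

def local_score(words, i, prev_label, label):
    """Toy scorer (hypothetical): a real CMM would score char n-gram,
    context, and previous-label features under the trained maxent."""
    w, s = words[i], 0.0
    if w[0].isupper():
        s += 1.0 if label != "O" else 0.0
    else:
        s += 1.0 if label == "O" else 0.0
    if w in {"Road", "Street", "Avenue"}:
        s += 1.0 if label == "LOC" else 0.0
    if w in {"John", "Doug"}:
        s += 1.0 if label == "PER" else 0.0
    if label == prev_label and label != "O":
        s += 0.5  # continuity bonus keeps "Grace Road" a single LOC span
    return s      # unnormalized; fine for argmax decoding

def beam_decode(words, beam_size=5):
    beam = [(0.0, [])]  # (cumulative score, label sequence)
    for i in range(len(words)):
        expanded = [
            (score + local_score(words, i, seq[-1] if seq else "<START>", lab),
             seq + [lab])
            for score, seq in beam
            for lab in LABELS
        ]
        expanded.sort(key=lambda c: c[0], reverse=True)
        beam = expanded[:beam_size]
    return beam[0][1]

print(beam_decode("John visited Grace Road today".split()))
# -> ['PER', 'O', 'LOC', 'LOC', 'O'] under this toy scorer
```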
Character-Based CMM
• Final extra features:
  • Letter-type patterns for each word
    • United → Xx, 12-month → d-x, etc.
  • Conjunction features
    • e.g. previous state and current signature
  • Repeated last words of multi-word names
    • e.g. Jones after having seen Doug Jones
  • … and a few more
(a signature sketch follows below)
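As an illustrative sketch of the letter-type pattern feature (the exact mapping is only implied by the slide's examples), the following assumed implementation maps characters to classes and collapses adjacent repeats, so that "United" becomes "Xx" and "12-month" becomes "d-x".

```python
def signature(word):
    """Letter-type signature (assumed mapping): uppercase -> X,
    lowercase -> x, digit -> d, other characters kept as-is,
    with runs of the same class collapsed to one symbol."""
    out = []
    for ch in word:
        if ch.isdigit():
            c = "d"
        elif ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        else:
            c = ch
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)

# Matches the slide's examples
assert signature("United") == "Xx"
assert signature("12-month") == "d-x"
```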
Final Results
• The drop from English dev to test is largely due to inconsistent labeling
• The lack of capitalization cues in German hurts recall more, because the maxent classifier is precision-biased when faced with weak evidence
Conclusions
• Character substrings are valuable and underexploited model features
  • Named entities are internally quite descriptive
  • 25-30% error reduction vs. word-level models
• Discriminative maxent models allow productive feature engineering
  • 30% error reduction vs. the basic model
• What distinguishes our approach?
  • More and better features
  • Regularization is crucial for preventing overfitting