This article explores various models for named entity recognition (NER) that can handle long-range dependencies, including HMMs, MEMMs, linear-chain CRFs, and more. It also discusses the challenges of inference and learning in general CRFs and proposes solutions like belief propagation and skip-chain CRFs. Additionally, it introduces stacked CRFs with special features for improving recall in NER tasks.
NER with Models Allowing Long-Range Dependencies William W. Cohen 2/27
Announcements • Ian’s talk moved to Thursday
Some models we’ve looked at • HMMs • generative sequential model • MEMMs, aka maxent tagging; stacked learning • Cascaded sequences of “ordinary” classifiers (for stacking, also sequential classifiers) • Linear-chain CRFs • Similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning] • An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture] • Dependency nets, aka MRFs learned w/ pseudo-likelihood • Local conditional probabilities + Gibbs sampling (or something like it) for inference. • Easy to use a network that is not a linear chain
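As a minimal sketch (not from the slides) of the "potentials defined via features" idea for a linear-chain CRF: each node/edge potential is the exponential of a weighted sum of feature functions of X and Y. The feature functions and weights below are made-up illustrations.

```python
import math

# Hypothetical indicator features over (previous label, current label, words, position).
def features(y_prev, y, x, j):
    return {
        "cur_word=" + x[j] + "_label=" + y: 1.0,
        "cur_is_cap_label=" + y: 1.0 if x[j][:1].isupper() else 0.0,
        "trans=" + str(y_prev) + "->" + y: 1.0,
    }

def potential(weights, y_prev, y, x, j):
    """Potential psi_j(y_prev, y, x) = exp(sum_i lambda_i * f_i(x, y_prev, y, j))."""
    score = sum(weights.get(name, 0.0) * value
                for name, value in features(y_prev, y, x, j).items())
    return math.exp(score)

# Example with made-up weights favoring label B on capitalized words.
w = {"cur_is_cap_label=B": 1.5, "trans=B->I": 0.8}
x = "When will Dr Cohen post the notes".split()
print(potential(w, "O", "B", x, 3))   # potential for labeling "Cohen" as B after O
```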
Example DNs – bidirectional chains — [Figure: a dependency net with label nodes Y1, Y2, …, Yi, one per word of “When will Dr Cohen post the notes”, connected in both directions along the chain.]
DN examples — [Figure: the same dependency net over “When will Dr Cohen post the notes”.] • How do we do inference? Iteratively: • (1) Pick values for Y1, Y2, … at random • (2) Pick some j, and compute Pr(Yj | the current values of the other Y’s, x) • (3) Set the new value of Yj according to this distribution • (4) Go back to (2)
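A rough sketch of this iterative (Gibbs-style) inference loop, assuming some local conditional model local_prob(j, y, ys, x) that returns a score proportional to Pr(Yj = y | current values of the other Y’s, x); that function and the label set are placeholders, not part of the slides.

```python
import random

LABELS = ["B", "I", "O"]

def gibbs_inference(x, local_prob, n_iters=100):
    # (1) Pick values for Y1, Y2, ... at random
    ys = [random.choice(LABELS) for _ in x]
    for _ in range(n_iters):
        # (2) Pick some position j and compute Pr(Yj | other Y's, x) for each label
        j = random.randrange(len(x))
        probs = [local_prob(j, y, ys, x) for y in LABELS]
        total = sum(probs)
        # (3) Set the new value of Yj by sampling from this distribution
        r, acc = random.random() * total, 0.0
        for y, p in zip(LABELS, probs):
            acc += p
            if r <= acc:
                ys[j] = y
                break
        # (4) Go back to (2)
    return ys
```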
DN Examples — [Figure: a two-level dependency net with POS nodes Z1, Z2, …, Zi stacked above BIO/NER nodes Y1, Y2, …, Yi, over the words of “When will Dr Cohen post the notes”.]
Example DNs – “skip” chains — [Figure: label nodes Y1, …, Y7 over “Dr Yu and his wife Mi N. Yu”, with extra “skip” edges connecting the y for the next/previous position where x = xj, i.e., repeated occurrences of the same word.]
Some models we’ve looked at • … • Linear-chain CRFs • Similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning] • An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture] • Dependency nets, aka MRFs learned w/ pseudo-likelihood • Local conditional probabilities + Gibbs sampling (or something like it) for inference. • Easy to use a network that is not a linear chain • Question: why can’t we use general MRFs for CRFs as well?
MRFs for NER — [Figure: a trellis over “When will prof Cohen post the notes …”, with candidate labels B, I, O at every position and a variable (Y1, Y2, Y3, Y4, …) for each word’s label.] Assign the Y’s to maximize the “black ink” (total potential) using something like forward-backward.
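For the chain-shaped trellis above, the max-product analogue of forward-backward (Viterbi) finds the single highest-potential labeling. A minimal sketch follows; node_pot and edge_pot are assumed to be supplied as callables and are not defined in the slides.

```python
def viterbi(x, labels, node_pot, edge_pot):
    """Pick the label sequence maximizing the product of node and edge
    potentials ("black ink") along one path through the trellis."""
    n = len(x)
    # best[j][y] = best score of any path ending at position j with label y
    best = [{y: node_pot(x, 0, y) for y in labels}]
    back = [{}]
    for j in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            score, prev = max(
                (best[j - 1][yp] * edge_pot(yp, y) * node_pot(x, j, y), yp)
                for yp in labels
            )
            best[j][y] = score
            back[j][y] = prev
    # Trace back the best path from the highest-scoring final label
    y_last = max(labels, key=lambda y: best[n - 1][y])
    path = [y_last]
    for j in range(n - 1, 0, -1):
        path.append(back[j][path[-1]])
    return list(reversed(path))
```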
Another visualization of the MRF — [Figure: each node is colored black (B) or white (W); ink = potential. In the first example, all black / all white are the only assignments.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.
CRF learning – from Sha & Pereira: Pr(y|x) = exp(Σi λi Σj fi(x,yj,yj+1)) / Zλ(x), where Zλ(x) is the partition function – the “total flow” through the MRF graph. The gradient of the log-likelihood with respect to λi is the observed count of fi minus the expected value, under λ, of fi(x,yj,yj+1). In general, computing these expectations is not tractable.
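To make the “observed minus expected” structure concrete, here is a brute-force sketch that computes this gradient term for one training pair by enumerating every label sequence. It is exponential in sequence length and only illustrates the quantity that forward-backward computes efficiently for chains; the feature function f is a placeholder.

```python
import itertools, math
from collections import defaultdict

LABELS = ["B", "I", "O"]

def seq_features(x, y, f):
    """Sum f(x, y_j, y_{j+1}, j) over positions; f returns a dict of feature counts."""
    counts = defaultdict(float)
    for j in range(len(y) - 1):
        for name, v in f(x, y[j], y[j + 1], j).items():
            counts[name] += v
    return counts

def gradient_for_example(x, y_obs, weights, f):
    """Observed feature counts minus expected counts under the current weights."""
    # Enumerate all label sequences (intractable in general -- illustration only).
    scores, feats = [], []
    for y in itertools.product(LABELS, repeat=len(x)):
        c = seq_features(x, y, f)
        feats.append(c)
        scores.append(math.exp(sum(weights.get(k, 0.0) * v for k, v in c.items())))
    z = sum(scores)                      # partition function Z_lambda(x)
    expected = defaultdict(float)
    for s, c in zip(scores, feats):
        for k, v in c.items():
            expected[k] += (s / z) * v   # E_lambda[f_i]
    observed = seq_features(x, y_obs, f)
    return {k: observed.get(k, 0.0) - expected.get(k, 0.0)
            for k in set(observed) | set(expected)}
```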
Learning general CRFs • For gradient ascent, you need to compute expectations of each feature with the current parameters • To compute expectations you need to do “forward-backward”-style inference on every example. • For the general case this is NP-hard. • Solutions: • If the graph is a tree you can use belief propagation (somewhat like forward-backward) • If the graph is not a tree you can use belief propagation anyway (and hope for convergence) – “loopy” BP • MCMC-like methods like Gibbs are possible • But expensive, since they’re in the inner loop of the gradient ascent
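A compact sum-product (loopy) belief propagation sketch on a pairwise MRF, of the kind referred to above: exact on trees, an approximation (with no convergence guarantee) on loopy graphs. The potential arrays are assumed inputs, not something defined in the slides.

```python
import numpy as np

def loopy_bp(n_nodes, edges, node_pot, edge_pot, iters=50):
    """Sum-product belief propagation on a pairwise MRF.

    node_pot[u]: length-K numpy array of node potentials for node u.
    edge_pot[(u, v)]: K x K array with entry [x_u, x_v].
    Returns approximate marginals (beliefs) for each node."""
    K = len(node_pot[0])
    # Directed messages, initialized uniformly.
    msgs = {}
    nbrs = {u: [] for u in range(n_nodes)}
    for (u, v) in edges:
        msgs[(u, v)] = np.ones(K) / K
        msgs[(v, u)] = np.ones(K) / K
        nbrs[u].append(v)
        nbrs[v].append(u)

    def pot(u, v):  # edge potential oriented as [x_u, x_v]
        return edge_pot[(u, v)] if (u, v) in edge_pot else edge_pot[(v, u)].T

    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            # Product of u's node potential and incoming messages, excluding v's.
            prod = node_pot[u].copy()
            for w in nbrs[u]:
                if w != v:
                    prod *= msgs[(w, u)]
            m = pot(u, v).T @ prod        # sum over x_u
            new[(u, v)] = m / m.sum()
        msgs = new

    beliefs = []
    for u in range(n_nodes):
        b = node_pot[u].copy()
        for w in nbrs[u]:
            b *= msgs[(w, u)]
        beliefs.append(b / b.sum())
    return beliefs
```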
Skip-chain CRFs: Sutton & McCallum • Connect adjacent words with edges • Connect pairs of identical capitalized words • We don’t want too many “skip” edges
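A small sketch of how the edge set described above might be constructed: linear-chain edges between adjacent words, plus skip edges between pairs of identical capitalized words, with a simple cap so we don’t add too many skip edges. The capitalization test and the cap are illustrative assumptions, not the paper’s exact recipe.

```python
from itertools import combinations

def build_skip_chain_edges(tokens, max_skip_pairs_per_word=5):
    # Linear-chain edges between adjacent positions.
    edges = [(j, j + 1) for j in range(len(tokens) - 1)]
    # Skip edges between pairs of identical capitalized words.
    positions = {}
    for j, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions.setdefault(tok, []).append(j)
    for tok, occs in positions.items():
        pairs = list(combinations(occs, 2))[:max_skip_pairs_per_word]  # keep it sparse
        edges.extend(pairs)
    return edges

tokens = "Speaker : John Smith . John will talk about CRFs".split()
print(build_skip_chain_edges(tokens))   # adjacent edges plus a (2, 5) skip edge for "John"
```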
Skip-chain CRFs: Sutton & McCallum Inference: loopy belief propagation
Repetition of names across the corpus is even more important in other domains…
How to use these regularities • Stacked CRFs with special features: • Token-majority: majority label assigned to a token (e.g., token “Melinda” → person) • Entity-majority: majority label assigned to an entity (e.g., tokens inside “Bill & Melinda Gates Foundation” → organization) • Super-entity-majority: majority label assigned to entities that are super-strings of an entity (e.g., tokens inside “Melinda Gates” → organization) • Computed within the document and across the corpus
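A sketch of the token-majority feature for a stacked CRF: given first-round predictions, each position gets the majority label assigned to its token across the document (or corpus). The parallel token/label lists and the feature string format are assumptions; entity- and super-entity-majority would follow the same pattern over predicted spans.

```python
from collections import Counter, defaultdict

def token_majority_features(tokens, predicted_labels):
    """Majority label assigned to each token string by a first-round tagger."""
    votes = defaultdict(Counter)
    for tok, lab in zip(tokens, predicted_labels):
        votes[tok.lower()][lab] += 1
    majority = {tok: counts.most_common(1)[0][0] for tok, counts in votes.items()}
    # One feature per position: "token-majority=<label>"
    return ["token-majority=" + majority[tok.lower()] for tok in tokens]

tokens = "Melinda Gates met Melinda in Seattle".split()
labels = ["B-PER", "I-PER", "O", "B-PER", "O", "B-LOC"]
print(token_majority_features(tokens, labels))
```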
Recall-improving rule: mark every token that appears >1 time in a document as a name [Minkov, Wang, Cohen 2004 unpub]
Recall-improving rule: mark every token that appears >1 time in the corpus as a name [Minkov, Wang, Cohen 2004 unpub]
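A tiny sketch of the two recall-improving rules above: mark any token occurring more than once in a document (or, running the same code over all corpus tokens, more than once in the corpus) as a name. The output label scheme is an illustrative choice.

```python
from collections import Counter

def mark_repeated_tokens(tokens):
    counts = Counter(t.lower() for t in tokens)
    # Mark every token that appears more than once as a name.
    return ["NAME" if counts[t.lower()] > 1 else "O" for t in tokens]

doc = "Dr Yu and his wife Mi N. Yu".split()
print(mark_repeated_tokens(doc))   # both occurrences of "Yu" get marked
```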
Candidate phrase classification with general CRFs; local templates control overlap; global templates are like “skip” edges • CRF + hand-coded external classifier (with Gibbs sampling) to handle long-range edges
Summary/conclusions • Linear-chain CRFs can efficiently compute expectations, so gradient search is ok; these are preferred in probabilistic settings • Incorporating long-range dependencies is still “cutting edge” • Stacking • More powerful graphical models (Bunescu & Mooney; Sutton & McCallum; …) • Possibly using the pseudo-likelihood/dependency-net extension • Separately learned or constructed long-range models that are integrated only at test time • Finkel et al (cited in Manning paper) • Roth, UIUC: work with Integer Linear Programming + CRFs • Semi-Markov models (a modest but tractable extension that allows runs of identical labels but makes Viterbi O(n^2))