This article explores various models for named entity recognition (NER) that can handle long-range dependencies, including HMMs, MEMMs, linear-chain CRFs, and more. It also discusses the challenges of inference and learning in general CRFs and proposes solutions like belief propagation and skip-chain CRFs. Additionally, it introduces stacked CRFs with special features for improving recall in NER tasks.
NER with Models Allowing Long-Range Dependencies William W. Cohen 2/27
Announcements • Ian’s talk moved to Thursday
Some models we’ve looked at • HMMs • generative sequential model • MEMMs, aka maxent tagging; stacked learning • Cascaded sequences of “ordinary” classifiers (for stacking, also sequential classifiers) • Linear-chain CRFs • Similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning] • An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture] • Dependency nets, aka MRFs learned w/ pseudo-likelihood • Local conditional probabilities + Gibbs sampling (or something like it) for inference. • Easy to use a network that is not a linear chain
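As a minimal sketch (not from the slides) of the "potentials defined via features" idea for a linear-chain CRF: each node/edge potential is the exponential of a weighted sum of feature functions of X and Y. The feature functions and weights below are made-up illustrations.

```python
import math

# Hypothetical indicator features over (previous label, current label, words, position).
def features(y_prev, y, x, j):
    return {
        "cur_word=" + x[j] + "_label=" + y: 1.0,
        "cur_is_cap_label=" + y: 1.0 if x[j][:1].isupper() else 0.0,
        "trans=" + str(y_prev) + "->" + y: 1.0,
    }

def potential(weights, y_prev, y, x, j):
    """Potential psi_j(y_prev, y, x) = exp(sum_i lambda_i * f_i(x, y_prev, y, j))."""
    score = sum(weights.get(name, 0.0) * value
                for name, value in features(y_prev, y, x, j).items())
    return math.exp(score)

# Example with made-up weights favoring label B on capitalized words.
w = {"cur_is_cap_label=B": 1.5, "trans=B->I": 0.8}
x = "When will Dr Cohen post the notes".split()
print(potential(w, "O", "B", x, 3))   # potential for labeling "Cohen" as B after O
```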
Example DNs – bidirectional chains — [Figure: a dependency net with label nodes Y1, Y2, …, Yi, one per word of “When will Dr Cohen post the notes”, connected in both directions along the chain.]
DN examples — [Figure: the same dependency net over “When will Dr Cohen post the notes”.] • How do we do inference? Iteratively: • (1) Pick values for Y1, Y2, … at random • (2) Pick some j, and compute Pr(Yj | the current values of the other Y’s, x) • (3) Set the new value of Yj according to this distribution • (4) Go back to (2)
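A rough sketch of this iterative (Gibbs-style) inference loop, assuming some local conditional model local_prob(j, y, ys, x) that returns a score proportional to Pr(Yj = y | current values of the other Y’s, x); that function and the label set are placeholders, not part of the slides.

```python
import random

LABELS = ["B", "I", "O"]

def gibbs_inference(x, local_prob, n_iters=100):
    # (1) Pick values for Y1, Y2, ... at random
    ys = [random.choice(LABELS) for _ in x]
    for _ in range(n_iters):
        # (2) Pick some position j and compute Pr(Yj | other Y's, x) for each label
        j = random.randrange(len(x))
        probs = [local_prob(j, y, ys, x) for y in LABELS]
        total = sum(probs)
        # (3) Set the new value of Yj by sampling from this distribution
        r, acc = random.random() * total, 0.0
        for y, p in zip(LABELS, probs):
            acc += p
            if r <= acc:
                ys[j] = y
                break
        # (4) Go back to (2)
    return ys
```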
DN Examples — [Figure: a two-level dependency net with POS nodes Z1, Z2, …, Zi stacked above BIO/NER nodes Y1, Y2, …, Yi, over the words of “When will Dr Cohen post the notes”.]
Example DNs – “skip” chains — [Figure: label nodes Y1, …, Y7 over “Dr Yu and his wife Mi N. Yu”, with extra “skip” edges connecting the y for the next/previous position where x = xj, i.e., repeated occurrences of the same word.]
Some models we’ve looked at • … • Linear-chain CRFs • Similar functional form to an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning] • An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture] • Dependency nets, aka MRFs learned w/ pseudo-likelihood • Local conditional probabilities + Gibbs sampling (or something like it) for inference. • Easy to use a network that is not a linear chain • Question: why can’t we use general MRFs for CRFs as well?
MRFs for NER — [Figure: a trellis over “When will prof Cohen post the notes …”, with candidate labels B, I, O at every position and a variable (Y1, Y2, Y3, Y4, …) for each word’s label.] Assign the Y’s to maximize the “black ink” (total potential) using something like forward-backward.
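For the chain-shaped trellis above, the max-product analogue of forward-backward (Viterbi) finds the single highest-potential labeling. A minimal sketch follows; node_pot and edge_pot are assumed to be supplied as callables and are not defined in the slides.

```python
def viterbi(x, labels, node_pot, edge_pot):
    """Pick the label sequence maximizing the product of node and edge
    potentials ("black ink") along one path through the trellis."""
    n = len(x)
    # best[j][y] = best score of any path ending at position j with label y
    best = [{y: node_pot(x, 0, y) for y in labels}]
    back = [{}]
    for j in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            score, prev = max(
                (best[j - 1][yp] * edge_pot(yp, y) * node_pot(x, j, y), yp)
                for yp in labels
            )
            best[j][y] = score
            back[j][y] = prev
    # Trace back the best path from the highest-scoring final label
    y_last = max(labels, key=lambda y: best[n - 1][y])
    path = [y_last]
    for j in range(n - 1, 0, -1):
        path.append(back[j][path[-1]])
    return list(reversed(path))
```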
Another visualization of the MRF — [Figure: each node is colored black (B) or white (W); ink = potential. In the first example, all black / all white are the only assignments.] The best assignment to X_S maximizes the black ink (potential) on the chosen nodes plus edges.
CRF learning – from Sha & Pereira: Pr(y|x) = exp(Σi λi Σj fi(x,yj,yj+1)) / Zλ(x), where Zλ(x) is the partition function – the “total flow” through the MRF graph. The gradient of the log-likelihood with respect to λi is the observed count of fi minus the expected value, under λ, of fi(x,yj,yj+1). In general, computing these expectations is not tractable.
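To make the “observed minus expected” structure concrete, here is a brute-force sketch that computes this gradient term for one training pair by enumerating every label sequence. It is exponential in sequence length and only illustrates the quantity that forward-backward computes efficiently for chains; the feature function f is a placeholder.

```python
import itertools, math
from collections import defaultdict

LABELS = ["B", "I", "O"]

def seq_features(x, y, f):
    """Sum f(x, y_j, y_{j+1}, j) over positions; f returns a dict of feature counts."""
    counts = defaultdict(float)
    for j in range(len(y) - 1):
        for name, v in f(x, y[j], y[j + 1], j).items():
            counts[name] += v
    return counts

def gradient_for_example(x, y_obs, weights, f):
    """Observed feature counts minus expected counts under the current weights."""
    # Enumerate all label sequences (intractable in general -- illustration only).
    scores, feats = [], []
    for y in itertools.product(LABELS, repeat=len(x)):
        c = seq_features(x, y, f)
        feats.append(c)
        scores.append(math.exp(sum(weights.get(k, 0.0) * v for k, v in c.items())))
    z = sum(scores)                      # partition function Z_lambda(x)
    expected = defaultdict(float)
    for s, c in zip(scores, feats):
        for k, v in c.items():
            expected[k] += (s / z) * v   # E_lambda[f_i]
    observed = seq_features(x, y_obs, f)
    return {k: observed.get(k, 0.0) - expected.get(k, 0.0)
            for k in set(observed) | set(expected)}
```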
Learning general CRFs • For gradient ascent, you need to compute expectations of each feature with the current parameters • To compute expectations you need to do “forward-backward”-style inference on every example. • For the general case this is NP-hard. • Solutions: • If the graph is a tree you can use belief propagation (somewhat like forward-backward) • If the graph is not a tree you can use belief propagation anyway (and hope for convergence) – “loopy” BP • MCMC-like methods like Gibbs are possible • But expensive, since they’re in the inner loop of the gradient ascent
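A compact sum-product (loopy) belief propagation sketch on a pairwise MRF, of the kind referred to above: exact on trees, an approximation (with no convergence guarantee) on loopy graphs. The potential arrays are assumed inputs, not something defined in the slides.

```python
import numpy as np

def loopy_bp(n_nodes, edges, node_pot, edge_pot, iters=50):
    """Sum-product belief propagation on a pairwise MRF.

    node_pot[u]: length-K numpy array of node potentials for node u.
    edge_pot[(u, v)]: K x K array with entry [x_u, x_v].
    Returns approximate marginals (beliefs) for each node."""
    K = len(node_pot[0])
    # Directed messages, initialized uniformly.
    msgs = {}
    nbrs = {u: [] for u in range(n_nodes)}
    for (u, v) in edges:
        msgs[(u, v)] = np.ones(K) / K
        msgs[(v, u)] = np.ones(K) / K
        nbrs[u].append(v)
        nbrs[v].append(u)

    def pot(u, v):  # edge potential oriented as [x_u, x_v]
        return edge_pot[(u, v)] if (u, v) in edge_pot else edge_pot[(v, u)].T

    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            # Product of u's node potential and incoming messages, excluding v's.
            prod = node_pot[u].copy()
            for w in nbrs[u]:
                if w != v:
                    prod *= msgs[(w, u)]
            m = pot(u, v).T @ prod        # sum over x_u
            new[(u, v)] = m / m.sum()
        msgs = new

    beliefs = []
    for u in range(n_nodes):
        b = node_pot[u].copy()
        for w in nbrs[u]:
            b *= msgs[(w, u)]
        beliefs.append(b / b.sum())
    return beliefs
```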
Skip-chain CRFs: Sutton & McCallum • Connect adjacent words with edges • Connect pairs of identical capitalized words • We don’t want too many “skip” edges
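A small sketch of how the edge set described above might be constructed: linear-chain edges between adjacent words, plus skip edges between pairs of identical capitalized words, with a simple cap so we don’t add too many skip edges. The capitalization test and the cap are illustrative assumptions, not the paper’s exact recipe.

```python
from itertools import combinations

def build_skip_chain_edges(tokens, max_skip_pairs_per_word=5):
    # Linear-chain edges between adjacent positions.
    edges = [(j, j + 1) for j in range(len(tokens) - 1)]
    # Skip edges between pairs of identical capitalized words.
    positions = {}
    for j, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions.setdefault(tok, []).append(j)
    for tok, occs in positions.items():
        pairs = list(combinations(occs, 2))[:max_skip_pairs_per_word]  # keep it sparse
        edges.extend(pairs)
    return edges

tokens = "Speaker : John Smith . John will talk about CRFs".split()
print(build_skip_chain_edges(tokens))   # adjacent edges plus a (2, 5) skip edge for "John"
```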
Skip-chain CRFs: Sutton & McCallum Inference: loopy belief propagation
Repetition of names across the corpus is even more important in other domains…
How to use these regularities • Stacked CRFs with special features: • Token-majority: majority label assigned to a token (e.g., token “Melinda” → person) • Entity-majority: majority label assigned to an entity (e.g., tokens inside “Bill & Melinda Gates Foundation” → organization) • Super-entity-majority: majority label assigned to entities that are super-strings of an entity (e.g., tokens inside “Melinda Gates” → organization) • Computed within the document and across the corpus
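A sketch of the token-majority feature for a stacked CRF: given first-round predictions, each position gets the majority label assigned to its token across the document (or corpus). The parallel token/label lists and the feature string format are assumptions; entity- and super-entity-majority would follow the same pattern over predicted spans.

```python
from collections import Counter, defaultdict

def token_majority_features(tokens, predicted_labels):
    """Majority label assigned to each token string by a first-round tagger."""
    votes = defaultdict(Counter)
    for tok, lab in zip(tokens, predicted_labels):
        votes[tok.lower()][lab] += 1
    majority = {tok: counts.most_common(1)[0][0] for tok, counts in votes.items()}
    # One feature per position: "token-majority=<label>"
    return ["token-majority=" + majority[tok.lower()] for tok in tokens]

tokens = "Melinda Gates met Melinda in Seattle".split()
labels = ["B-PER", "I-PER", "O", "B-PER", "O", "B-LOC"]
print(token_majority_features(tokens, labels))
```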
Recall-improving rule: mark every token that appears >1 time in a document as a name [Minkov, Wang, Cohen 2004 unpub]
Recall-improving rule: mark every token that appears >1 time in the corpus as a name [Minkov, Wang, Cohen 2004 unpub]
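A tiny sketch of the two recall-improving rules above: mark any token occurring more than once in a document (or, running the same code over all corpus tokens, more than once in the corpus) as a name. The output label scheme is an illustrative choice.

```python
from collections import Counter

def mark_repeated_tokens(tokens):
    counts = Counter(t.lower() for t in tokens)
    # Mark every token that appears more than once as a name.
    return ["NAME" if counts[t.lower()] > 1 else "O" for t in tokens]

doc = "Dr Yu and his wife Mi N. Yu".split()
print(mark_repeated_tokens(doc))   # both occurrences of "Yu" get marked
```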
Candidate phrase classification with general CRFs; local templates control overlap; global templates are like “skip” edges • CRF + hand-coded external classifier (with Gibbs sampling) to handle long-range edges
Summary/conclusions • Linear-chain CRFs can efficiently compute expectations, so gradient search is ok; these are preferred in probabilistic settings • Incorporating long-range dependencies is still “cutting edge” • Stacking • More powerful graphical models (Bunescu & Mooney; Sutton & McCallum; …) • Possibly using the pseudo-likelihood/dependency-net extension • Separately learned or constructed long-range models that are integrated only at test time • Finkel et al (cited in Manning paper) • Roth, UIUC: work with Integer Linear Programming + CRFs • Semi-Markov models (a modest but tractable extension that allows runs of identical labels but makes Viterbi O(n^2))