NER with Models Allowing Long-Range Dependencies

This article explores various models for named entity recognition (NER) that can handle long-range dependencies, including HMMs, MEMMs, linear-chain CRFs, and more. It also discusses the challenges of inference and learning in general CRFs and proposes solutions like belief propagation and skip-chain CRFs. Additionally, it introduces stacked CRFs with special features for improving recall in NER tasks.

Presentation Transcript


  1. NER with Models Allowing Long-Range Dependencies William W. Cohen 2/27

  2. Announcements • Ian’s talk moved to Thurs

  3. Some models we’ve looked at
     • HMMs
       • Generative sequential model
     • MEMMs, aka maxent tagging; stacked learning
       • Cascaded sequences of “ordinary” classifiers (for stacking, also sequential classifiers)
     • Linear-chain CRFs
       • Similar functional form as an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning]
       • An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture]
     • Dependency nets, aka MRFs learned w/ pseudo-likelihood
       • Local conditional probabilities + Gibbs sampling (or something) for inference
       • Easy to use a network that is not a linear chain

  4. Example DNs – bidirectional chains [Figure: label nodes Y1, Y2, …, Yi, … over the tokens “When will dr Cohen post the notes”]

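  5. DN examples [Figure: label nodes Y1, Y2, …, Yi, … over the tokens “When will dr Cohen post the notes”]
     • How do we do inference? Iteratively:
       (1) Pick values for Y1, Y2, … at random
       (2) Pick some j, and compute the local conditional distribution of Yj given the current values of the other variables
       (3) Set new value of Yj according to this
       (4) Go back to (2)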
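The iterative procedure on this slide is Gibbs sampling. A minimal sketch follows, assuming BIO labels and a hand-set local scoring function (`toy_local_score` is hypothetical and stands in for the learned local conditional model; the sweep counts and seed are arbitrary choices, not from the slides).

```python
import math
import random

LABELS = ["B", "I", "O"]

def toy_local_score(j, y, labels, tokens):
    """Hypothetical hand-set score for Y_j = y given the neighbours' current values.
    A real dependency net would use a learned local classifier here."""
    capitalized = j > 0 and tokens[j][0].isupper()
    if y == "B":
        return 4.0 if capitalized else 0.0
    if y == "I":
        follows_name = j > 0 and labels[j - 1] in ("B", "I")
        return 2.0 if (capitalized and follows_name) else -4.0
    return 0.0 if capitalized else 4.0  # y == "O"

def local_conditional(j, labels, tokens, score):
    """Normalize the local scores into Pr(Y_j = y | current values)."""
    logits = [score(j, y, labels, tokens) for y in LABELS]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_inference(tokens, score, sweeps=200, burn_in=50, seed=0):
    rng = random.Random(seed)
    n = len(tokens)
    # (1) Pick values for Y1, Y2, ... at random.
    labels = [rng.choice(LABELS) for _ in range(n)]
    counts = [{y: 0 for y in LABELS} for _ in range(n)]
    for sweep in range(sweeps):
        for _ in range(n):
            # (2) Pick some j and compute its local conditional.
            j = rng.randrange(n)
            probs = local_conditional(j, labels, tokens, score)
            # (3) Set the new value of Yj by sampling from it; (4) repeat.
            labels[j] = rng.choices(LABELS, weights=probs)[0]
        if sweep >= burn_in:
            for i, y in enumerate(labels):
                counts[i][y] += 1
    # Report the most frequently sampled label at each position.
    return [max(c, key=c.get) for c in counts]

tokens = "When will dr Cohen post the notes".split()
predicted = gibbs_inference(tokens, toy_local_score)
```

Taking the most frequent sampled label per position is one simple way to turn samples into a prediction; other read-outs are possible.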

  6. DN Examples [Figure: label nodes Y1, Y2, …, Yi, … over the tokens “When will dr Cohen post the notes”]

  7. DN Examples [Figure: two label layers, POS tags Z1, Z2, …, Zi, … and BIO/NER tags Y1, Y2, …, Yi, …, over the tokens “When will dr Cohen post the notes”]

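  8. Example DNs – “skip” chains [Figure: label nodes Y1, Y2, …, Y7 over tokens reading roughly “Dr Yu and his wife Mi N. Yu”; each node also sees the label y of the next/previous position where x = xj]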

  9. Some models we’ve looked at
     • …
     • Linear-chain CRFs
       • Similar functional form as an HMM, but optimized for Pr(Y|X) instead of Pr(X,Y) [Klein and Manning]
       • An MRF (undirected graphical model) with edge and node potentials defined via features that depend on X,Y [my lecture]
     • Dependency nets, aka MRFs learned w/ pseudo-likelihood
       • Local conditional probabilities + Gibbs sampling (or something) for inference
       • Easy to use a network that is not a linear chain
     • Question: why can’t we use general MRFs for CRFs as well?

  10. MRFs for NER [Figure: lattice with one column of candidate labels B, I, O per token of “When will prof Cohen post the notes …”; columns Y3 and Y4 are labeled]

  11. MRFs for NER [Figure: lattice with one column of candidate labels B, I, O per token of “When will prof Cohen post the notes …”; columns Y3 and Y4 are labeled] Assign Y’s to maximize “black ink” using something like forward-backward
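On a linear chain, the max version of forward-backward is usually called Viterbi. A minimal sketch, assuming the “ink” is given as additive node and edge scores (all numbers below are made up for illustration):

```python
def viterbi(node_scores, edge_scores, labels):
    """Return the label sequence with the highest total node + edge score.

    node_scores: one dict per position, label -> score
    edge_scores: dict (previous_label, label) -> score
    """
    best = [{y: node_scores[0][y] for y in labels}]
    back = []
    for t in range(1, len(node_scores)):
        current, pointers = {}, {}
        for y in labels:
            # Best previous label for ending at y in position t.
            prev = max(labels, key=lambda p: best[-1][p] + edge_scores[(p, y)])
            current[y] = best[-1][prev] + edge_scores[(prev, y)] + node_scores[t][y]
            pointers[y] = prev
        best.append(current)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

labels = ["B", "I", "O"]
# Illustrative potentials: "I" directly after "O" is heavily penalized.
edge_scores = {(p, y): 0.0 for p in labels for y in labels}
edge_scores[("O", "I")] = -10.0
node_scores = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 2.0, "I": 0.0, "O": 0.0},
    {"B": 0.0, "I": 1.5, "O": 1.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
path, score = viterbi(node_scores, edge_scores, labels)
```

The cost is linear in the number of positions and quadratic in the number of labels.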

  12. Another visualization of the MRF • Ink = potential [Figure: MRF nodes, each with a black (B) and a white (W) value] • All black/all white are only assignments

  13. [Figure: MRF nodes, each with a black (B) and a white (W) value] Best assignment to X_S maximizes black ink (potential) on chosen nodes plus edges.

  14. [Figure: MRF nodes, each with a black (B) and a white (W) value] Best assignment to X_S maximizes black ink (potential) on chosen nodes plus edges.

  15. CRF learning – from Sha & Pereira

  16. CRF learning – from Sha & Pereira

  17. CRF learning – from Sha & Pereira [Annotations on the slide’s formulas] • i.e., expected value, under λ, of fi(x,yj,yj+1) • partition function Pr(x) = Zλ(x): “total flow” through MRF graph • In general, this is not tractable
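The formula images for these slides did not survive in the transcript. As a reminder of what the annotations refer to, here is the standard conditional log-likelihood gradient for a linear-chain CRF, written with the slide’s symbols (fi, λ, Zλ); treat it as a reconstruction under that assumption, not a copy of the slide.

```latex
\begin{aligned}
\Pr_\lambda(y \mid x) &= \frac{1}{Z_\lambda(x)}
  \exp\Big(\sum_i \lambda_i \sum_j f_i(x, y_j, y_{j+1})\Big), \\
Z_\lambda(x) &= \sum_{y'}
  \exp\Big(\sum_i \lambda_i \sum_j f_i(x, y'_j, y'_{j+1})\Big), \\
\frac{\partial}{\partial \lambda_i} \log \Pr_\lambda(y \mid x)
  &= \sum_j f_i(x, y_j, y_{j+1})
   - \mathbb{E}_{y' \sim \Pr_\lambda(\cdot \mid x)}
     \Big[\sum_j f_i(x, y'_j, y'_{j+1})\Big].
\end{aligned}
```

The second term of the gradient is the expected feature value under λ, and the sum over all label sequences y' in Zλ(x) is the part that is intractable for general graphs.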

  18. CRF Learning

  19. CRF Learning

  20. CRF Learning

  21. CRF Learning

  22. CRF learning – from Sha & Pereira [Annotations on the slide’s formulas] • i.e., expected value, under λ, of fi(x,yj,yj+1) • partition function Pr(x) = Zλ(x): “total flow” through MRF graph • In general, this is not tractable

  23. Learning general CRFs
     • For gradient ascent, you need to compute expectations of each feature with the current parameters
     • To compute expectations you need to do “forward-backward” style inference on every example
     • For the general case this is NP-hard
     • Solutions:
       • If the graph is a tree you can use belief propagation (somewhat like forward-backward)
       • If the graph is not a tree you can use belief propagation anyway (and hope for convergence): “loopy” BP
       • MCMC-like methods like Gibbs are possible
       • But expensive, since they’re in the inner loop of the gradient ascent
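For the linear-chain case, the expectations can be computed exactly. A minimal sketch of forward-backward in log space, assuming position-specific node log-potentials and one shared edge log-potential table (this layout and the function names are illustrative assumptions), with a brute-force partition function included only to show what the general case costs:

```python
import math
from itertools import product

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_backward(node, edge):
    """node[t][a], edge[a][b] are log-potentials. Returns (log Z, edge marginals),
    where marginals[t][a][b] = Pr(y_t = a, y_{t+1} = b)."""
    n, k = len(node), len(node[0])
    alpha = [list(node[0])]
    for t in range(1, n):
        alpha.append([node[t][b] + log_sum_exp([alpha[t - 1][a] + edge[a][b] for a in range(k)])
                      for b in range(k)])
    beta = [[0.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [log_sum_exp([edge[a][b] + node[t + 1][b] + beta[t + 1][b] for b in range(k)])
                   for a in range(k)]
    log_z = log_sum_exp(alpha[-1])
    marginals = [[[math.exp(alpha[t][a] + edge[a][b] + node[t + 1][b] + beta[t + 1][b] - log_z)
                   for b in range(k)] for a in range(k)] for t in range(n - 1)]
    return log_z, marginals

def expected_transition_count(marginals, a, b):
    """Expected number of (a, b) label transitions, i.e. the expectation of an
    indicator feature on that transition."""
    return sum(m[a][b] for m in marginals)

def brute_force_log_z(node, edge):
    """Exponential-time check: sums over every label sequence."""
    n, k = len(node), len(node[0])
    scores = []
    for y in product(range(k), repeat=n):
        s = sum(node[t][y[t]] for t in range(n))
        s += sum(edge[y[t]][y[t + 1]] for t in range(n - 1))
        scores.append(s)
    return log_sum_exp(scores)

node = [[0.5, 0.0], [0.0, 1.0], [0.2, 0.3]]
edge = [[1.0, -1.0], [0.0, 0.5]]
log_z, marginals = forward_backward(node, edge)
```

On a chain the two computations agree, but forward-backward is polynomial while the brute-force sum grows exponentially with sequence length.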

  24. Skip-chain CRFs: Sutton & McCallum • Connect adjacent words with edges • Connect pairs of identical capitalized words • We don’t want too many “skip” edges
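The graph construction described on this slide can be sketched as follows. This is an illustration only: the `max_skip_edges` cap is a hypothetical way to keep the number of skip edges down, not something taken from Sutton & McCallum.

```python
def skip_chain_edges(tokens, max_skip_edges=None):
    """Return (chain_edges, skip_edges) as lists of position pairs."""
    # Connect adjacent words with edges.
    chain = [(i, i + 1) for i in range(len(tokens) - 1)]
    # Group positions of identical capitalized words.
    positions = {}
    for i, token in enumerate(tokens):
        if token[:1].isupper():
            positions.setdefault(token, []).append(i)
    # Connect pairs of identical capitalized words that are not already adjacent.
    skip = [(i, j)
            for occurrences in positions.values()
            for n, i in enumerate(occurrences)
            for j in occurrences[n + 1:]
            if j - i > 1]
    skip.sort()
    if max_skip_edges is not None:
        skip = skip[:max_skip_edges]  # hypothetical cap on skip edges
    return chain, skip

tokens = "Dr Yu and his wife Mi N. Yu".split()
chain, skip = skip_chain_edges(tokens)
```

Here the only skip edge links the two occurrences of “Yu”. Because pairs are quadratic in the number of repeats, a frequent capitalized word can add many edges, which is one reason to limit them.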

  25. Skip-chain CRFs: Sutton & McCallum Inference: loopy belief propagation

  26. Skip-chain CRF results

  27. Krishnan & Manning: “An effective two-stage model…”

  28. Repetition of names across the corpus is even more important in other domains…

  29. How to use these regularities
     • Stacked CRFs with special features:
       • Token-majority: majority label assigned to a token (e.g., token “Melinda” → person)
       • Entity-majority: majority label assigned to an entity (e.g., tokens inside “Bill & Melinda Gates Foundation” → organization)
       • Super-entity-majority: majority label assigned to entities that are super-strings of an entity (e.g., tokens inside “Melinda Gates” → organization)
     • Compute within document and across corpus
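As an illustration of the token-majority feature, the sketch below counts the labels a first-stage tagger assigned to each token and keeps the most frequent one. The input format (lists of `(token, label)` pairs) and the example labels are assumptions made for the example.

```python
from collections import Counter, defaultdict

def token_majority(tagged_docs):
    """Map each token to the label it received most often in the first-stage output.

    tagged_docs: iterable of documents, each a list of (token, label) pairs.
    Pass a single document for the within-document feature, or the whole
    corpus for the across-corpus feature.
    """
    counts = defaultdict(Counter)
    for doc in tagged_docs:
        for token, label in doc:
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

# Hypothetical first-stage predictions.
doc1 = [("Melinda", "person"), ("spoke", "O"), ("Melinda", "person")]
doc2 = [("Melinda", "organization"), ("Gates", "organization")]
corpus_feature = token_majority([doc1, doc2])
doc2_feature = token_majority([doc2])
```

The second-stage CRF would then see, for each token, the majority label as an extra input feature alongside its ordinary local features.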

  30. [Minkov, Wang, Cohen 2004 unpub]

  31. Recall-improving rule: mark every token that appears >1 time in a document as a name [Minkov, Wang, Cohen 2004 unpub]
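The slide does not spell out the rule’s details. One possible reading, shown here purely as an assumption, is that a token the extractor has labeled as a name at least once is marked as a name at every occurrence, provided it appears more than once in the chosen scope (`propagate_names` and the `NAME` label are hypothetical).

```python
from collections import Counter

def propagate_names(tokens, labels, name_label="NAME"):
    """Mark repeated tokens as names wherever they occur (assumed reading of the rule).

    tokens, labels: parallel lists covering the scope of interest,
    either one document or the concatenated corpus.
    """
    counts = Counter(tokens)
    predicted_names = {t for t, l in zip(tokens, labels) if l == name_label}
    return [name_label if (t in predicted_names and counts[t] > 1) else l
            for t, l in zip(tokens, labels)]

tokens = ["Cohen", "said", "that", "Cohen", "will", "post"]
labels = ["NAME", "O", "O", "O", "O", "O"]
improved = propagate_names(tokens, labels)
```

A rule of this kind can only add name labels, so it trades precision for recall.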

  32. Recall-improving rule: mark every token that appears >1 time in the corpus as a name [Minkov, Wang, Cohen 2004 unpub]

  33. • Candidate phrase classification with general CRFs; local templates control overlap; global templates are like ‘skip’ edges • CRF + hand-coded external classifier (with Gibbs sampling) to handle long-range edges

  34. [Kou & Cohen, SDM-2007]

  35. Summary/conclusions
     • Linear-chain CRFs can efficiently compute expectations → gradient search is ok; these are preferred in probabilistic settings
     • Incorporating long-range dependencies is still “cutting edge”
       • Stacking
       • More powerful graphical models (Bunescu & Mooney; Sutton & McCallum; …)
         • Possibly using pseudo-likelihood/dependency net extension
       • Separately learned or constructed long-range models that are integrated only at test time
         • Finkel et al (cited in Manning paper)
         • Roth, UIUC, work with Integer Linear Programming + CRFs
       • Semi-Markov models (modest but tractable extension; allows runs of identical labels but makes Viterbi O(n^2))
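To see where the O(n^2) comes from in the semi-Markov case, here is a minimal segment-level Viterbi sketch. The scoring function is hypothetical and hand-set; a real semi-Markov model would learn it. For each end position the search tries every start position, which gives the quadratic dependence on sequence length.

```python
def semi_markov_viterbi(n, labels, segment_score, max_len=None):
    """Best segmentation of positions 0..n-1 into labeled segments.

    segment_score(start, end, label, prev_label) scores tokens[start:end]
    as one segment; prev_label is None for the first segment.
    """
    max_len = max_len or n
    best = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for y in labels:
            for start in range(max(0, end - max_len), end):  # every possible segment start
                if start == 0:
                    candidates = [(segment_score(0, end, y, None), None)]
                else:
                    candidates = [(best[start][p] + segment_score(start, end, y, p), p)
                                  for p in labels]
                for score, prev in candidates:
                    if score > best[end][y]:
                        best[end][y] = score
                        back[end][y] = (start, prev)
    # Recover the segments by following back-pointers from the end.
    y = max(labels, key=lambda label: best[n][label])
    total, segments, end = best[n][y], [], n
    while end > 0:
        start, prev = back[end][y]
        segments.append((start, end, y))
        end, y = start, prev
    segments.reverse()
    return segments, total

tokens = ["said", "William", "W.", "Cohen", "today"]

def toy_segment_score(start, end, label, prev_label):
    """Hypothetical scores: names are runs of capitalized tokens, O segments have length 1."""
    span = tokens[start:end]
    if label == "NAME":
        return len(span) - 0.5 if all(t[0].isupper() for t in span) else -10.0 * len(span)
    if len(span) > 1:
        return -100.0
    return -1.0 if span[0][0].isupper() else 0.0

segments, total = semi_markov_viterbi(len(tokens), ["NAME", "O"], toy_segment_score)
```

Passing a small `max_len` bounds the segment length and brings the cost back toward linear in n.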
