
CRFs

米海涛, 2007.03.22


Presentation Transcript


  1. CRFs 米海涛 2007.03.22

  2. Outline: • 1 Directed Graph: HMM, MEMM • 2 Undirected Graph: Conditional Random Fields • 3 Comparison: HMM, MEMM, CRF

  3. 1 Directed Graph • A directed graph G = (V, E) consists of a set of nodes V and a set of directed edges E between those nodes. Each node Vi in V represents a random variable. [Figure: directed graph over nodes V1 to V5]

  4. 1.1 Hidden Markov Model • Generative model: Hidden Markov Model (L. R. Rabiner, 1989) • Disadvantages: strict independence assumptions on the observations; cannot handle interacting features or long-distance dependencies [Figure: label chain y1 to y5 with observations x1 to x5]
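For reference, the independence assumptions named on this slide correspond to the standard first-order HMM factorization of the joint probability. This is the textbook form rather than something shown in the transcript; the sequence length T and the convention that P(y_1 | y_0) stands for the initial-state distribution are notational assumptions.

```latex
P(x_{1:T}, y_{1:T}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)
```

Each observation x_t depends only on its own label y_t, which is why features spanning several observations cannot be expressed directly.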

  5. 1.2 Maximum Entropy Markov Model • Discriminative model: MEMM (Andrew McCallum, 2000) • Given a training set X with label sequences Y, train a model θ that maximizes P(Y|X, θ) • For a new data sequence x, the predicted label sequence y maximizes P(y|x, θ) • Note the per-state normalization: the probability of moving from state s to s′ on observation x is normalized over the successors of s [Figure: label chain y1 to y5 over the observation sequence X]
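The per-state normalization can be written out as follows. This is a sketch in the usual maximum-entropy form, not a formula recovered from the slide; the feature functions f_k, weights λ_k, and normalizer Z(s, x) are assumed notation.

```latex
P(y_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}, x_t),
\qquad
P(s' \mid s, x) = \frac{\exp\left( \sum_k \lambda_k f_k(s, s', x) \right)}{Z(s, x)},
\qquad
Z(s, x) = \sum_{s''} \exp\left( \sum_k \lambda_k f_k(s, s'', x) \right)
```

Because Z(s, x) sums only over the successors s'' of the current state s, every state hands on exactly its incoming probability mass, whatever the observation.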

  6. 1.2 Maximum Entropy Markov Model • Decoding is biased toward paths that pass through low-entropy nodes (nodes with few outgoing transitions) • Each of the many paths leaving a high-entropy node receives only a small probability

  7. 2 CRF

  8. 2.1 Random Field

  9. 2.2 CRF • X is a random variable over data sequences to be labeled • Y is a random variable over corresponding label sequences

  10. 2.2 CRF • X is a random variable over data sequences to be labeled • Y is a random variable over corresponding label sequences

  11. 2.2 Example • First-order CRFs [Figure: label chain y1 to y5 connected to the observation sequence X]
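The formulas on the CRF slides did not survive in the transcript. As a hedged reference, the commonly used first-order (linear-chain) form is given below; the feature functions f_k, weights λ_k, and position index t are assumed notation, and state features are folded into f_k for brevity.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x, t) \right)
```

Here Z(x) sums over entire label sequences y', so normalization is global rather than per state.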

  12. 2.2 CRF

  13. 2.3 Parameter Estimation for CRFs • Maximum likelihood parameter estimation
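The objective itself is missing from the transcript. A sketch of the usual conditional log-likelihood and its gradient follows, assuming N training pairs (x^{(i)}, y^{(i)}), weights λ_k, and F_k(y, x) defined as the sum of feature f_k over all positions of a sequence (all assumed notation).

```latex
\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p\left( y^{(i)} \mid x^{(i)}; \lambda \right),
\qquad
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{i=1}^{N} \left( F_k\left( y^{(i)}, x^{(i)} \right)
  - \mathbb{E}_{p(y \mid x^{(i)}; \lambda)} \left[ F_k\left( y, x^{(i)} \right) \right] \right)
```

The gradient vanishes when the empirical feature counts match the counts expected under the model.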

  14. 2.4 Parameter Estimation for CRFs • Learning algorithms: GIS (generalized iterative scaling), IIS (improved iterative scaling), quasi-Newton methods (e.g. L-BFGS)

  15. 2.5 Example: Base Noun Phrase Recognition [Figure: input and output of the recognizer]

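  16. 3 Comparison • Graphical comparison of HMM, MEMM, and CRF [Figure: three panels. HMM: labels y1 to y3 with observations x1 to x3. MEMM: labels y1 to y3 with observation sequence X. CRF: labels y1 to y3 with observation sequence X]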

  17. HMM • Disadvantages: • Not practical to represent multiple interacting features or long-range dependencies of the observations • Very strict independence assumptions on the observations [Figure: labels y1 to y3 with observations x1 to x3]

  18. MEMM • Models the conditional probability P(S | O) of the label sequence S given the observation sequence O, rather than the joint probability P(O, S) • Allows arbitrary, non-independent features of the observation sequence • The probability of a transition between labels may depend on past and future observations • Relaxes the strong independence assumptions of generative models [Figure: labels y1 to y3 with observation sequence X]

  19. MEMM • Because of the per-state normalization, MEMMs are subject to the label bias problem: a bias toward states with fewer outgoing transitions [Figure: labels y1 to y3 with observation sequence X]

  20. MEMM • Label bias example
      • P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
      • P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
      • Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri). However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
      • Per-state normalization makes it impossible to satisfy this expectation

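  21. Label Bias Example • Given: “rib” 3 times, “rob” 1 time • Training: p(1|0, “r”) = 0.75, p(4|0, “r”) = 0.25 • Inference: P = 0.25 (the path through state 4 keeps probability 0.25 × 1 × 1 whatever the input, so it loses to the path through state 1 even on “rob”) [Figure: state diagram with transition probabilities P=0.75, P=1, P=1]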
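The arithmetic behind this slide can be sketched in a few lines of Python. The state numbering and the 0.75 / 0.25 split come from the slide; the table layout and the names `TRANSITIONS`, `memm_path_probability`, and `best_path` are illustrative assumptions, as is the reading that states 1, 2, 4, and 5 each have a single successor with probability 1.

```python
# Per-state normalized transition table for the "rib"/"rob" example.
# Assumption: states 1, 2, 4 and 5 each have exactly one successor, so
# per-state normalization forces that transition to probability 1
# regardless of the observed character.
TRANSITIONS = {
    0: {1: 0.75, 4: 0.25},  # p(1|0,"r") = 0.75, p(4|0,"r") = 0.25
    1: {2: 1.0},
    2: {3: 1.0},
    4: {5: 1.0},
    5: {3: 1.0},
}

RIB_PATH = (0, 1, 2, 3)
ROB_PATH = (0, 4, 5, 3)


def memm_path_probability(path):
    """Multiply the per-state transition probabilities along a path."""
    prob = 1.0
    for prev, nxt in zip(path, path[1:]):
        prob *= TRANSITIONS[prev].get(nxt, 0.0)
    return prob


def best_path(paths):
    """Return the path with the highest probability."""
    return max(paths, key=memm_path_probability)


if __name__ == "__main__":
    for path in (RIB_PATH, ROB_PATH):
        print(path, memm_path_probability(path))
    print("chosen for any input:", best_path([RIB_PATH, ROB_PATH]))
```

Note that the observed word never enters the computation after the first step: the path through state 1 wins for both “rib” and “rob”, which is the label bias the slide describes.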

  22. CRF: • Given: “rib” 3 times, “rob” 1 time • Training (constraints):
      • Init()
        • w(1|0, “r”)=3, w(2|1, “i”)=3, w(3|2, “b”)=3
        • w(4|0, “r”)=1, w(5|4, “o”)=1, w(3|5, “b”)=1
      • Iterator()
        • L(0453|rib) = w(4|0, “r”) + w(5|4, “i”) + w(3|5, “b”) = 1+0+1 = 2
        • L(0123|rib) = 3+3+3 = 9 > L(0453|rib) = 1+0+1 = 2 (right)
        • L(0453|rob) = 1+1+1 = 3 < L(0123|rob) = 3+0+3 = 6 (wrong)
        • ………
      • Final:
        • w(1|0, “r”)=3, w(2|1, “i”)=3, w(3|2, “b”)=2
        • w(4|0, “r”)=2, w(5|4, “o”)=2, w(3|5, “b”)=2
        • L(0123|rib) = 3+3+2 = 8 > L(0453|rib) = 2+0+2 = 4 (right)
        • L(0453|rob) = 2+2+2 = 6 > L(0123|rob) = 3+0+2 = 5 (right)

  23. CRF: • Given: “rib” 3 times, “rob” 1 time • Training (constraints):
      • Iterator()
      • Final:
        • w(1|0, “r”)=3, w(2|1, “i”)=3, w(3|2, “b”)=2
        • w(4|0, “r”)=2, w(5|4, “o”)=2, w(3|5, “b”)=2
        • L(0123|rib) = 3+3+2 = 8 > L(0453|rib) = 2+0+2 = 4 (right)
        • L(0453|rob) = 2+2+2 = 6 > L(0123|rob) = 3+0+2 = 5 (right)
        • P(0123|rob) = 5/(6+5)
        • P(0453|rob) = 6/(6+5)
      • Inference:
        • L(0453|rob) = 2+2+2 = 6 > L(0123|rob) = 3+0+2 = 5
        • P(0453|rob) = 6/(6+5)
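The path scores in this walk-through can be reproduced with a short Python sketch. The weights are copied from the slides; the dictionary layout and the names `INIT_WEIGHTS`, `FINAL_WEIGHTS`, `path_score`, and `slide_probability` are illustrative assumptions. The probability follows the slide's simplified normalization over raw scores, 6/(6+5); a standard CRF would normalize exponentiated scores instead.

```python
# Weights keyed by (previous state, next state, observed character).
# Any transition/character pair not listed has weight 0.
INIT_WEIGHTS = {
    (0, 1, "r"): 3, (1, 2, "i"): 3, (2, 3, "b"): 3,
    (0, 4, "r"): 1, (4, 5, "o"): 1, (5, 3, "b"): 1,
}
FINAL_WEIGHTS = {
    (0, 1, "r"): 3, (1, 2, "i"): 3, (2, 3, "b"): 2,
    (0, 4, "r"): 2, (4, 5, "o"): 2, (5, 3, "b"): 2,
}

RIB_PATH = (0, 1, 2, 3)
ROB_PATH = (0, 4, 5, 3)
PATHS = [RIB_PATH, ROB_PATH]


def path_score(path, word, weights):
    """Sum the weights of each transition paired with its observed character."""
    transitions = zip(path, path[1:])
    return sum(
        weights.get((prev, nxt, ch), 0)
        for (prev, nxt), ch in zip(transitions, word)
    )


def slide_probability(path, word, weights, candidates=PATHS):
    """Normalize raw scores over whole candidate paths, as the slide does."""
    total = sum(path_score(p, word, weights) for p in candidates)
    return path_score(path, word, weights) / total


if __name__ == "__main__":
    for name, weights in (("init", INIT_WEIGHTS), ("final", FINAL_WEIGHTS)):
        for word in ("rib", "rob"):
            scores = {p: path_score(p, word, weights) for p in PATHS}
            print(name, word, scores)
    print(slide_probability(ROB_PATH, "rob", FINAL_WEIGHTS))
```

With the initial weights the path through state 1 wins on “rob” (6 against 3), while the final weights rank the path through state 4 first (6 against 5). The score is computed and normalized over the whole path, so the observed “o” can change the outcome.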

  24. Example • Second-order CRFs [Figure: label chain y1 to y6 connected to the observation sequence X]

  25. Example • Second-order CRFs • Features: • Edge feature type 1: e1 • State feature type 1: s1 • Edge feature type 2: e2 • State feature type 2: s2

  26. CRF • CRFs have all the advantages of MEMMs without the label bias problem • An MEMM uses a per-state exponential model for the conditional probabilities of the next states given the current state • A CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence [Figure: labels y1 to y3 with observation sequence X]

  27. Thanks!
