
Hidden Markov Models & Conditional Random Fields



Presentation Transcript


  1. Hidden Markov Models & Conditional Random Fields Eran Segal Weizmann Institute

  2. Representing Time • Add the time dimension to variables: X → X(t) • Assumptions • Time can be discretized into interesting points t1,t2,...,tn • Markov assumption holds • Distribution is stationary (time invariant or homogeneous)
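A minimal sketch of these assumptions in code, using a hypothetical two-state weather process with made-up transition probabilities: the same P(X(t+1)|X(t)) table is reused at every step (stationarity), and each new state is sampled from the current state alone (Markov assumption).

```python
import random

# Hypothetical two-state process X(t) in {"sun", "rain"} with a single,
# time-invariant transition table P(X(t+1) | X(t)).
P_next = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}

def sample_chain(x0, T, seed=0):
    """Sample X(0..T); each step depends only on the previous state (Markov assumption)."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(T):
        probs = P_next[xs[-1]]   # same table at every t (stationarity)
        xs.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return xs

print(sample_chain("sun", 10))
```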

  3. Hidden Markov Model • Single (hidden) state variable • Single (observed) observation variable • Transition probability P(Y’|Y) assumed to be sparse • Usually encoded by a state transition graph • Observation probability P(X|Y) [Figure: template network G0/G over Y, Y’, X’ and the unrolled network over Y0..Y3 and X0..X3]

  4. Hidden Markov Model • Single (hidden) state variable • Single (observed) observation variable • Transition probability P(Y’|Y) assumed to be sparse • Usually encoded by a state transition graph • Observation probability P(X|Y) [Figure: state transition representation of P(Y’|Y) over states Y1..Y4]
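As a concrete (hypothetical) illustration of the two ingredients P(Y’|Y) and P(X|Y), the sketch below builds a toy two-state HMM with made-up numbers and runs the standard forward recursion to compute the likelihood of an observation sequence.

```python
import numpy as np

# Hypothetical toy HMM: 2 hidden states, 3 possible observation symbols.
pi = np.array([0.6, 0.4])            # initial distribution P(Y0)
A = np.array([[0.7, 0.3],            # transition matrix, A[i, j] = P(Y' = j | Y = i)
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],       # emission matrix, B[i, k] = P(X = k | Y = i)
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Forward algorithm: returns P(x_1..x_T) by summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = P(x_1, Y_1 = i)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]      # alpha_t(j) = sum_i alpha_{t-1}(i) A[i,j] B[j,x_t]
    return alpha.sum()

print(forward([0, 2, 1]))   # likelihood of the observation sequence [0, 2, 1]
```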

  5. Hidden Markov Models • Generative models • Assign a joint probability to paired label and observation sequences • Parameters trained to maximize the joint likelihood of the training data • To make inference tractable, there are typically no long-range dependencies [Figure: unrolled network over Y0..Y3 and X1..X3]
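A small sketch of what "generative" and "maximize the joint likelihood" mean in practice, assuming a tiny hypothetical labeled corpus: the maximum-likelihood estimates are just normalized counts, and the joint probability of a paired label/observation sequence factors into initial, transition, and emission terms.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data: each item pairs a state sequence y with
# an observation sequence x of the same length.
train = [
    (["N", "V"], ["dog", "runs"]),
    (["N", "N"], ["dog", "park"]),
]

# Maximum-likelihood estimation for a generative HMM is counting and normalizing,
# because the joint likelihood factors into local terms.
trans, emit, init = Counter(), Counter(), Counter()
for y, x in train:
    init[y[0]] += 1
    for t in range(len(y)):
        emit[(y[t], x[t])] += 1
        if t > 0:
            trans[(y[t-1], y[t])] += 1

def normalize(counts, key_fn):
    totals = defaultdict(int)
    for k, c in counts.items():
        totals[key_fn(k)] += c
    return {k: c / totals[key_fn(k)] for k, c in counts.items()}

P_init = {y: c / sum(init.values()) for y, c in init.items()}
P_trans = normalize(trans, lambda k: k[0])   # P(y_t | y_{t-1})
P_emit = normalize(emit, lambda k: k[0])     # P(x_t | y_t)

def joint_log_prob(y, x):
    """log P(x, y) = log P(y_1) + sum_t log P(y_t|y_{t-1}) + sum_t log P(x_t|y_t)."""
    lp = math.log(P_init[y[0]])
    for t in range(len(y)):
        lp += math.log(P_emit[(y[t], x[t])])
        if t > 0:
            lp += math.log(P_trans[(y[t-1], y[t])])
    return lp

print(joint_log_prob(["N", "V"], ["dog", "runs"]))
```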

  6. Conditional Models • Specify the probability of possible label sequences given the observations, P(Y|X) • Do not “waste” parameters on modeling P(X|Y) • Key advantage: • Distribution over Y can depend on non-independent features without modeling feature dependencies • Transition probabilities can depend on past and future observations • Two representations • Maximum Entropy Markov Models (MEMMs) • Conditional Random Fields (CRFs)

  7. Max Entropy Markov Models • Model the probability of the next state given the previous state and the observations • Training is by iterative scaling in the maximum entropy framework • Weakness: the label bias problem [Figure: graphical structures of an HMM and an MEMM over Y0..Y3 and X1..X3]
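A sketch of the per-state conditional model an MEMM uses, with hypothetical indicator features and weights: P(y_t | y_{t-1}, x_t) is a softmax normalized separately for each previous state, which is exactly the property the label-bias discussion on the next slides turns on.

```python
import math

# Hypothetical feature function and weights for an MEMM's per-state classifier.
STATES = ["A", "B"]

def features(y_prev, y_next, x):
    # Indicator features: one transition feature and one state-observation feature.
    return {f"trans:{y_prev}->{y_next}": 1.0, f"obs:{y_next}:{x}": 1.0}

weights = {  # hypothetical learned weights
    "trans:A->A": 0.5, "trans:A->B": 0.1,
    "trans:B->A": -0.3, "trans:B->B": 0.8,
    "obs:A:x1": 1.2, "obs:B:x1": -0.4,
}

def score(y_prev, y_next, x):
    return sum(weights.get(k, 0.0) * v for k, v in features(y_prev, y_next, x).items())

def memm_local(y_prev, x):
    """P(y_t | y_{t-1} = y_prev, x_t = x): softmax normalized over this state's successors only."""
    scores = {y: math.exp(score(y_prev, y, x)) for y in STATES}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(memm_local("A", "x1"))   # distribution over next states given previous state A
```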

  8. Label-Bias Problem of MEMMs • Transitions from a state compete only with each other • Transition scores are conditional probabilities of next states given the current state and observation • This implies a “conservation of score mass” whereby the mass of a state is distributed among its next states • Observations do not affect the total mass passed on to the next states • This biases the model toward states with fewer outgoing transitions • States with a single outgoing transition ignore their observations entirely • Unlike in HMMs, observations after a branch point in a path are effectively ignored

  9. Label-Bias Problem: Example • A model for distinguishing ‘rob’ from ‘rib’ • Suppose we get the input sequence ‘rib’ • In the first step, ‘r’ matches the transitions into both branches, so both are equally likely • Next, ‘i’ is observed, but since y1 and y4 each have a single outgoing transition, they both give probability 1 to their next state • This does not happen in HMMs • Note: if one word is more frequent in the training data, it will always win [Figure: state transition representation P(Y’|Y) of the ‘rib’/‘rob’ automaton over states y0..y5]
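A numeric sketch of this example (the automaton structure follows the slide; the match score is a hypothetical choice): even with a strong feature rewarding arcs whose label matches the observation, per-state normalization forces states with a single outgoing transition to pass on probability 1, so the input ‘rib’ cannot down-weight the ‘rob’ path.

```python
import math

# The automaton from the slide: one branch spells 'rib', the other spells 'rob'.
# arcs[state] = list of (next_state, arc_label)
arcs = {
    "y0": [("y1", "r"), ("y4", "r")],   # branch point: both arcs match 'r'
    "y1": [("y2", "i")],                # 'rib' branch
    "y2": [("y5", "b")],
    "y4": [("y3", "o")],                # 'rob' branch
    "y3": [("y5", "b")],
}

def local_prob(state, obs):
    """MEMM-style per-state conditional: score each outgoing arc by whether its
    label matches the observation, then normalize over this state's arcs only."""
    scored = [(nxt, math.exp(5.0 if label == obs else 0.0)) for nxt, label in arcs[state]]
    z = sum(s for _, s in scored)
    return {nxt: s / z for nxt, s in scored}

def path_prob(path, obs_seq):
    p = 1.0
    for (state, nxt), obs in zip(zip(path, path[1:]), obs_seq):
        p *= local_prob(state, obs)[nxt]
    return p

# The observation sequence 'rib' should favor the rib branch, but it does not:
print(path_prob(["y0", "y1", "y2", "y5"], "rib"))  # 0.5 (rib branch)
print(path_prob(["y0", "y4", "y3", "y5"], "rib"))  # 0.5 (rob branch, despite the 'i' mismatch)
```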

  10. Conditional Random Fields • Advantages of MEMMs without the label bias problem • Key difference • MEMMs use per-state model for conditional probabilities of next state given current state • CRFs have a single model for the joint probability of the entire sequence of labels given the observations • Thus, weights of different features at different states can trade off against each other • CRF training • Maximum likelihood or MAP • Loss function is convex, guaranteeing convergence to global optimum

  11. Conditional Random Fields • Let G=(V,E) be a graph with vertices V and edges E, such that Y = (Yv)v∈V, i.e., Y is indexed by the vertices of G • Then (X,Y) is a CRF if the random variables Yv obey the Markov property with respect to the graph: P(Yv | X, Yw, w≠v) = P(Yv | X, Yw, w~v) • where w~v means that w and v are neighbors in G • Model only P(Y|X) [Figure: linear-chain CRF over Y0..Y3 conditioned on X1..X3]

  12. Conditional Random Fields • Joint probability distribution for trees over Y • Cliques (and thus potentials) are the edges and vertices: p(y|x) ∝ exp( Σe∈E,k λk fk(e, y[e], x) + Σv∈V,k μk gk(v, y[v], x) ) • x are the observed variables • y are the state variables • y[S] is the set of components of y associated with the vertices in S • fk is an edge feature with weight λk • gk is a vertex feature with weight μk • Note that features can depend on all of the variables in x
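A brute-force sketch of this distribution for a short chain, with hypothetical edge features fk (weights λk) and vertex features gk (weights μk); the normalizer Z(x) is computed by enumerating all label sequences, which is only feasible at toy sizes but makes the single global normalization explicit.

```python
import math
from itertools import product

# Hypothetical linear-chain CRF with two labels and toy feature functions.
LABELS = ["A", "B"]

def edge_features(y_prev, y_cur, x, t):
    # f_k: edge features may look at both labels of the edge and all of x.
    return {f"edge:{y_prev}->{y_cur}": 1.0}

def vertex_features(y_cur, x, t):
    # g_k: vertex features may look at one label and all of x.
    return {f"vertex:{y_cur}:{x[t]}": 1.0}

lambdas = {"edge:A->A": 1.0, "edge:A->B": 0.2, "edge:B->A": 0.2, "edge:B->B": 1.0}
mus = {"vertex:A:x1": 0.7, "vertex:B:x1": -0.1, "vertex:A:x2": -0.5, "vertex:B:x2": 0.9}

def score(y, x):
    """Unnormalized log score: sum of lambda_k * f_k over edges plus mu_k * g_k over vertices."""
    s = 0.0
    for t in range(len(y)):
        for k, v in vertex_features(y[t], x, t).items():
            s += mus.get(k, 0.0) * v
        if t > 0:
            for k, v in edge_features(y[t-1], y[t], x, t).items():
                s += lambdas.get(k, 0.0) * v
    return s

def crf_prob(y, x):
    """p(y|x) = exp(score(y,x)) / Z(x), with Z(x) enumerated by brute force."""
    Z = sum(math.exp(score(list(yy), x)) for yy in product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / Z

x = ["x1", "x2", "x1"]
print(crf_prob(["A", "A", "A"], x))
```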

  13. Representing HMMs with CRFs • An HMM can be represented by the following CRF: p(y|x) ∝ exp( Σi Σy’,y λy’,y fy’,y(yi-1, yi) + Σi Σy,x μy,x gy,x(yi, xi) ) • where fy’,y(yi-1, yi) = 1 if yi-1 = y’ and yi = y (and 0 otherwise), with weight λy’,y = log P(y|y’), and gy,x(yi, xi) = 1 if yi = y and xi = x (and 0 otherwise), with weight μy,x = log P(x|y) • Note that we defined one feature for each state pair (y’,y) and one feature for each state-observation pair (y,x)
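A sketch checking this correspondence on a toy HMM with made-up parameters (and, to keep the mapping simple, a uniform initial state distribution): setting λ(y’,y) = log P(y|y’) and μ(y,x) = log P(x|y) makes the CRF conditional p(y|x) agree with the HMM posterior computed by brute force.

```python
import math
from itertools import product

# Hypothetical toy HMM (uniform initial distribution so the correspondence stays simple).
STATES = ["A", "B"]
P_trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
P_emit = {("A", "u"): 0.9, ("A", "v"): 0.1, ("B", "u"): 0.2, ("B", "v"): 0.8}

# CRF weights for the indicator features of slide 13:
# lambda_{y',y} = log P(y|y') for the state-pair features,
# mu_{y,x}      = log P(x|y)  for the state-observation features.
lam = {k: math.log(p) for k, p in P_trans.items()}
mu = {k: math.log(p) for k, p in P_emit.items()}

def crf_score(y, x):
    s = sum(mu[(y[t], x[t])] for t in range(len(y)))
    s += sum(lam[(y[t-1], y[t])] for t in range(1, len(y)))
    return s

def crf_cond(y, x):
    Z = sum(math.exp(crf_score(list(yy), x)) for yy in product(STATES, repeat=len(x)))
    return math.exp(crf_score(y, x)) / Z

def hmm_joint(y, x):
    p = 1.0 / len(STATES)                      # uniform P(y_1)
    p *= P_emit[(y[0], x[0])]
    for t in range(1, len(y)):
        p *= P_trans[(y[t-1], y[t])] * P_emit[(y[t], x[t])]
    return p

def hmm_cond(y, x):
    Z = sum(hmm_joint(list(yy), x) for yy in product(STATES, repeat=len(x)))
    return hmm_joint(y, x) / Z

x = ["u", "v", "u"]
print(crf_cond(["A", "B", "A"], x), hmm_cond(["A", "B", "A"], x))  # the two values should match
```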

  14. Parameter Estimation

  15. Applications

  16. Conditional Random Fields • Joint probability distribution for trees over Y • Cliques (and thus potentials) are the edges and vertices
