
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms

This paper explores discriminative training methods for Hidden Markov Models (HMMs) using Perceptron algorithms. It covers the tagging problem in NLP, modelling HMMs, training methods, Viterbi algorithm, proposed Perceptron Algorithm, and more.

bondm

Presentation Transcript


  1. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Group 9: Jaspreet Arora, Vaishnavi Ravindran, Mathanky Sankaranarayanan, Shadi Shahsavari

  2. Outline • Tagging problem in NLP • Modelling the tagging problem as Hidden Markov Models • HMM example • Training HMMs - MLE • Viterbi algorithm + example • Proposed method: Perceptron Algorithm • Proposed method: Example trigram • Theorems • Results

  3. Parts of Speech • Word classes, Lexical Categories, or “tags” • Noun, Verb, Adverb, Adjective, Preposition, etc. • Open & Closed classes • E.g. Pronouns, articles are limited

  4. POS Tagging Problem • The POS tagging problem is to determine the POS tag for a particular instance of a word • Why is it a problem? • Tags depend on context • New contexts and new words keep appearing in every language • Manual POS tagging is not scalable

  5. Example

  6. Example • E.g. Mrs. Stark never got around to joining the team • All you gotta do is go around the corner • The entry fee costs around 250.

  7. Example - ambiguity • E.g. Mrs. Stark never got around/RP to joining the team • All you gotta do is go around/IN the corner • The entry fee costs around/RB 250 RP – Particle (a particle is a function word that must be associated with another word or phrase to impart meaning) IN – Preposition or subordinating conjunction RB - Adverb (modifying 250)

  8. Example

  9. Set of possibilities

  10. Supervised Learning Problem • Input a sequence of observations x = x1...xn – e.g. x = the men saw the dog • Output a sequence of labels y = y1....yn – e.g. y = D N V D N

  11. Supervised Learning Problem • Learn a function h that maps input x to labels y • Y(x) is the set of possible labels for x • In a structured problem Y(x) is very large • Each member y has a structure • In tagging, we are trying to predict the tags for the entire sentence at once • Structured prediction as density estimation: p(y|x)

  12. Supervised Learning Problem • Input: a sequence of observations x = x1...xn – e.g. x = the men saw the dog • Output: a sequence of labels y = y1...yn – e.g. y = D N V D N • In a probabilistic model, we want: – argmaxy P(y|x) = argmaxy P(x,y)/P(x) = argmaxy P(x,y), since P(x) does not depend on y

  13. Generative Probabilistic Model P(x,y) = P(x|y)P(y) E.g. The can is in the garage: x = {the, can, is, in, the, garage}, y = {DT, N, ... etc.} • Local: "can" is more likely a modal verb than a noun • Context: a noun is much more likely than a verb to follow a determiner

  14. Generative Probabilistic Model P(x,y) = P(x|y)P(y) E.g. The can is in the garage • Local, P(x|y): "can" is more likely a modal verb than a noun • Context, P(y): a noun is much more likely than a verb to follow a determiner

  15. Markov Chains We first need to understand what the Markov property is and what it is used for. • Objective of a Markov chain model: find the probability of observing a sequence of ordered events, where the events depend on each other. Find: P(y1, y2, ..., yn) • What is the Markov property? By the chain rule, the probability of any sequence can be written as: P(y1, y2, ..., yn) = ∏t P(yt | y1, ..., yt-1)

  16. What is the Markov property? • According to the Markov property, an event at time t depends only on the event at time t-1. So the probability of a sequence becomes: P(y1, y2, ..., yn) = P(y1) ∏t=2..n P(yt | yt-1) • We need to know two types of probabilities to compute this: 1. At time t = 1: P(y1 = i) = πi, called the initial distribution. Since we have N states, we will have N probabilities: π1, ..., πN

  17. What is the Markov property? 2. At time t > 1: P(yt = j | yt-1 = i) = aij. These N × N values form what we call the transition probability matrix A.

  18. Example 1 : Predicting weather • Let's say we have three states : Sunny, Windy, Foggy • Transition probability matrix:

  19. Example 1 : Predicting weather (Continued) What’s the probability that tomorrow is sunny and the day after is rainy, given today is sunny?
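
Under the Markov property, the probability of a path given today's state is just a product of transition probabilities. The transition matrix from the slide did not survive in this transcript, so the sketch below uses made-up numbers over the three states listed on slide 18; the function name `path_probability` and the queried path are likewise only illustrative.

```python
# Hypothetical transition matrix: these numbers are NOT from the slides.
# transitions[i][j] = P(next state = j | current state = i); each row sums to 1.
transitions = {
    "sunny": {"sunny": 0.8, "windy": 0.05, "foggy": 0.15},
    "windy": {"sunny": 0.2, "windy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "windy": 0.3, "foggy": 0.5},
}


def path_probability(start, path, transitions):
    """P(path | start) for a first-order Markov chain."""
    prob = 1.0
    prev = start
    for state in path:
        # Each step depends only on the previous state.
        prob *= transitions[prev][state]
        prev = state
    return prob


# P(tomorrow = sunny, day after = foggy | today = sunny)
# = P(sunny | sunny) * P(foggy | sunny)
p = path_probability("sunny", ["sunny", "foggy"], transitions)
```

With these assumed numbers the result is 0.8 × 0.15; any other question of this kind is answered by changing the `path` argument.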

  20. Hidden Markov Model We can now use the Markov property to build what are known as Hidden Markov Models. What is the objective of HMMs? • Input: an ordered sequence of events (observed sequence) • Output: also an ordered sequence of events (hidden sequence) • Example: • Find: P(<y1,y2,y3,...,yn> | <x1,x2,x3,...,xn>), where <y1,y2,y3,...,yn> is a sequence of ordered hidden events and <x1,x2,x3,...,xn> is the input. • The tagging problem is a perfect example of this. The input is a sequence of words (order matters) and the output is a sequence of tags for these words.

  21. The decoding problem in HMM • Given an HMM 𝛌 = (A,B,π) and an input observation sequence, we want to find the most probable state sequence: y* = argmaxy P(y|x,𝛌) = argmaxy P(x|y,𝛌) P(y|𝛌) • We will refer to the observation sequence as x and the state sequence as y • In the above equation, we have two probabilities: • P(x|y,𝛌): the conditional probability of the observation sequence when the state sequence is known; by the Markov assumption it is a product of emission probabilities • P(y|𝛌): the prior probability of the state sequence y; by the Markov assumption it is a product of transition probabilities

  22. Learning the parameters of an HMM • Given an observation sequence x (training data) and the set of possible states in the HMM, learn the HMM parameters 𝛌 = (A,B,π) • Find 𝛌 = (A,B,π) that locally maximizes the likelihood of the training data (Maximum Likelihood Estimation, MLE) • Find A,B using counts from the training data: aij = C(i->j) / ∑k C(i->k) and bi(S) = C(i->S) / ∑S' C(i->S') • where C(i->j) is the number of times state i transitions to state j and C(i->S) is the number of times state i was seen with symbol S
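
The count-based estimates can be sketched in a few lines of Python. This is a minimal illustration, assuming fully tagged training sentences and no smoothing; the function name `estimate_hmm` and the two-sentence corpus are hypothetical and only reuse the example sentence from slide 10.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Relative-frequency (MLE) estimates of pi, A and B from tagged data."""
    init_counts = Counter()              # C(sentence starts in state i)
    trans_counts = defaultdict(Counter)  # C(i -> j)
    emit_counts = defaultdict(Counter)   # C(i -> S)

    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                init_counts[tag] += 1
            else:
                trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalize(init_counts)
    A = {i: normalize(c) for i, c in trans_counts.items()}
    B = {i: normalize(c) for i, c in emit_counts.items()}
    return pi, A, B


# Toy corpus (hypothetical), built from the slide's example sentence.
corpus = [
    [("the", "D"), ("men", "N"), ("saw", "V"), ("the", "D"), ("dog", "N")],
    [("the", "D"), ("dog", "N"), ("saw", "V"), ("the", "D"), ("men", "N")],
]
pi, A, B = estimate_hmm(corpus)
```

Unseen transitions and emissions simply get no entry here, which is why practical taggers add some form of smoothing on top of these raw counts.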

  23. Finding output sequence - Brute Force • We would need to enumerate all the possible state sequences and pick the one with the maximum likelihood • How many such sequences would we have? • N^T, for N states and a sequence of length T ----> Why?

  24. Finding output sequence - Brute Force |Total no. of states| ^ (length of sequence) Exponential growth
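
A quick way to see the exponential growth is to enumerate the candidate tag sequences directly. The three-tag set below is an assumption made for illustration; real tagsets are much larger, which makes the problem far worse.

```python
from itertools import product


def count_state_sequences(num_states, length):
    """Each of the `length` positions can take any of `num_states` tags."""
    return num_states ** length


tags = ["D", "N", "V"]  # hypothetical tiny tagset
sentence_length = 5

# Brute force: materialise every possible tag sequence for a 5-word sentence.
all_sequences = list(product(tags, repeat=sentence_length))
num_sequences = len(all_sequences)

# The closed form agrees with the enumeration: 3^5 sequences.
matches_formula = num_sequences == count_state_sequences(len(tags), sentence_length)

# With, say, 45 tags and a 20-word sentence, enumeration is hopeless.
large_case = count_state_sequences(45, 20)
```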

  25. Viterbi Algorithm Dynamic Programming to the rescue

  26. So far what we have learnt: • POS tagging as Markov chains • Using probabilistic methods, we find the most probable sequence • Question: how do we find these probabilities for long sequences? By using the Viterbi algorithm

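  27. Viterbi Algorithm • Main idea: reuse previous calculations to get new results • Uses a table to store intermediate values • Approach: • Compute the probability of the most likely hidden state sequence for the observation sequence • By maximizing over all possible hidden state sequences (summing instead of maximizing would give the likelihood of the observations, which is the forward algorithm) • But doing this efficiently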

  28. Viterbi Algorithm: An example How many distinct paths exist from A to B, using only right and down movements? (Figure: grid with points A and B.)

  29. Viterbi Algorithm: An example Wouldn't it be easier to first know the number of ways from A to C and from A to D? (Figure: grid with points A, B, C and D.)
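
The grid puzzle is a compact illustration of the table-filling idea behind Viterbi: the count for a cell is obtained from the counts of its upper and left neighbours. The grid size in the original figure is not recoverable, so the sizes below are arbitrary, and `count_paths` is just an illustrative name.

```python
def count_paths(rows, cols):
    """Count right/down paths from the top-left to the bottom-right point
    of a grid with `rows` x `cols` points, using dynamic programming."""
    # There is exactly one way to reach any point in the first row or column.
    ways = [[1] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            # Arrive either from above (a down move) or from the left (a right move).
            ways[r][c] = ways[r - 1][c] + ways[r][c - 1]
    return ways[-1][-1]


paths_3x3 = count_paths(3, 3)
paths_4x4 = count_paths(4, 4)
```

Each table entry is computed once and reused, so the work grows with the number of grid points rather than with the number of paths.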

  30. Viterbi Algorithm in POS tagging: • Dynamic programming algorithm that allows us to compute the most probable path. • Here, P(x,y) = P(x|y)P(y) = ∏i P(xi|yi) · ∏i P(yi|yi-1) • Therefore a recursive algorithm can be proposed

  31. How To be Recursive?

  32. How To be Recursive?

  33. How To be Recursive? • Assume a sequence of words x1,x2,x3,...,xt with corresponding tags y1,y2,y3,...,yt (states). • Let's define the final state as j, i.e., yt = j • We would like to calculate P(x,y) = P(y1,y2,y3,...,yt=j, x1,x2,x3,...,xt) • Let's define vt(j) = max over y1,...,yt-1 of P(x,y) • vt(j) is the probability of the most probable path accounting for the first t observations and ending in state j

  34. How To be Recursive? (max{1..k} denotes the maximum over y1,...,yk; the emission term is P(xt | yt = j))
     vt(j) = max{1..t-1} P(y1,y2,y3,...,yt=j, x1,x2,x3,...,xt)
     = max{1..t-2} maxi P(y1,y2,y3,...,yt-1=i, yt=j, x1,x2,x3,...,xt)
     = max{1..t-2} maxi P(y1,y2,y3,...,yt-1=i, x1,x2,x3,...,xt-1) P(yt=j | yt-1=i) P(xt | yt=j)
     = maxi [ max{1..t-2} P(y1,y2,y3,...,yt-1=i, x1,x2,x3,...,xt-1) ] P(yt=j | yt-1=i) P(xt | yt=j)
     = maxi vt-1(i) P(yt=j | yt-1=i) P(xt | yt=j)

  35. How To be Recursive? Therefore: vt(j) = maxi vt-1(i) P(yt=j | yt-1=i) P(xt | yt=j). If we use the HMM parameters: vt(j) = maxi vt-1(i) aij bj(xt)

  36. Viterbi steps
     Step 1: Initialization, when t = 1: • v1(j) = πj bj(x1), for 1 ≤ j ≤ N
     Step 2: Recursion, when 1 < t ≤ T: • vt(j) = maxi vt-1(i) aij bj(xt), for 1 ≤ j ≤ N
     Step 3: Termination: • P* = maxy P(x,y|λ) = maxj vT(j) • Backtracking from argmaxj vT(j)
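
The three steps can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it works with raw probabilities (a practical tagger would use log probabilities to avoid underflow), and the toy parameters `pi`, `A` and `B` are invented for the example sentence from slide 10.

```python
def viterbi(observations, states, pi, A, B):
    """Return the most probable state path and its joint probability."""
    # Step 1: initialization, v1(j) = pi_j * b_j(x1).
    v = [{j: pi[j] * B[j].get(observations[0], 0.0) for j in states}]
    backptr = [{}]

    # Step 2: recursion, vt(j) = max_i v(t-1)(i) * a_ij * b_j(xt).
    for t in range(1, len(observations)):
        v.append({})
        backptr.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            emission = B[j].get(observations[t], 0.0)
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * emission
            backptr[t][j] = best_i

    # Step 3: termination and backtracking from argmax_j vT(j).
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, v[-1][last]


# Hypothetical toy parameters (not estimated from any real corpus).
states = ["D", "N", "V"]
pi = {"D": 0.8, "N": 0.1, "V": 0.1}
A = {
    "D": {"D": 0.0, "N": 0.9, "V": 0.1},
    "N": {"D": 0.1, "N": 0.2, "V": 0.7},
    "V": {"D": 0.6, "N": 0.3, "V": 0.1},
}
B = {
    "D": {"the": 1.0},
    "N": {"men": 0.4, "dog": 0.4, "saw": 0.2},
    "V": {"saw": 1.0},
}

best_path, best_prob = viterbi("the men saw the dog".split(), states, pi, A, B)
```

With these assumed parameters the decoder recovers the tag sequence D N V D N from the slide, even though "saw" can also be emitted by the noun state. The table has T × N entries and each entry looks at N predecessors, so the cost is on the order of T·N² instead of N^T.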

  37. How To be Recursive?

  38. How to calculate these probabilities? • A probabilistic model • Using Maximum entropy methods: • logP(x,y)=∑ilogP(xi|yi)+∑ilogP(yi|yi-1) • A non-probabilistic model: • Perceptron method

  39. Perceptron Algorithm • Rosenblatt's perceptron is a binary single-neuron model. • It was the first algorithmically described neural network; it consists of a linear combiner followed by a hard limiter. • The inputs are combined as a weighted sum, with the weights learned during the training stage. • If this sum is larger than a given threshold θ, the neuron fires. • When the neuron fires, its output is set to 1; otherwise it is set to -1.

  40. Perceptron Algorithm Finds a vector w such that the corresponding hyperplane separates + from -.
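
A minimal sketch of the classic mistake-driven perceptron follows. It assumes labels in {+1, -1} and folds the threshold θ into a bias term b = -θ; the four training points are a made-up, linearly separable toy set, and the function names are only illustrative.

```python
def predict(w, b, x):
    """Hard limiter on the weighted sum: +1 if the neuron fires, else -1."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else -1


def train_perceptron(samples, labels, epochs=100):
    """Update the weights only when a training example is misclassified."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if predict(w, b, x) != y:
                # Move the hyperplane towards the misclassified example.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return w, b


# Toy separable data (hypothetical): only the point (1, 1) is positive.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

For separable data such as this, the loop stops after finitely many updates with a separating hyperplane; for non-separable data it would simply run out of epochs.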

  41. Perceptron Algorithm

  42. Voted Perceptron

  43. Voted Perceptron

  44. Averaged Perceptron

  45. POS Tagging features for perceptron

  46. POS Tagging features for perceptron • For every word/tag pair in the training data we create multiple features that are functions of the "history" at that point and the tag. • Example: • Global features are defined for a whole sequence. They are the sums of the above local features over every word/tag pair in that sequence.
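
The relation between local and global features can be sketched as below. The three feature templates (word/tag, tag bigram, tag trigram) and the "*" padding symbol are assumptions chosen for illustration; the slide's own example did not survive in this transcript.

```python
from collections import Counter


def local_features(words, i, tag, prev_tag, prev_prev_tag):
    """Indicator features of the history at position i and the proposed tag."""
    return Counter({
        ("word_tag", words[i], tag): 1,
        ("bigram", prev_tag, tag): 1,
        ("trigram", prev_prev_tag, prev_tag, tag): 1,
    })


def global_features(words, tags):
    """Sum the local feature vectors over every word/tag pair in the sequence."""
    total = Counter()
    padded = ["*", "*"] + list(tags)  # "*" marks positions before the sentence
    for i in range(len(words)):
        total.update(
            local_features(words, i, padded[i + 2], padded[i + 1], padded[i])
        )
    return total


words = "the men saw the dog".split()
tags = ["D", "N", "V", "D", "N"]
phi = global_features(words, tags)
```

Here `phi[("word_tag", "the", "D")]` counts how often "the" is tagged D in the whole sentence, which is exactly the "sum of local features" view of a global feature.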

  47. Example of Feature vector

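  48. Generalised Parameter Estimation Feature vectors 𝝓 together with a parameter vector α ∈ R^d are used to define a conditional probability distribution over tags given a history as p(t | h; α) = exp(∑s αs 𝝓s(h, t)) / Z(h, α), where Z(h, α) = ∑l∈T exp(∑s αs 𝝓s(h, l)) and T is the set of possible tags.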

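  49. Generalised Parameter Estimation The log of the probability has the form log p(t | h; α) = ∑s αs 𝝓s(h, t) − log Z(h, α), where Z(h, α) is the normalization term over all tags. The log probability for a sequence (w[1:n]; t[1:n]) pair will be ∑i ∑s αs 𝝓s(hi, ti) − ∑i log Z(hi, α), where hi = <ti-1; ti-2; w[1:n]; i>.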
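
The log-linear form, a weighted feature score minus the log of a normalizer over all tags, can be sketched as follows. The feature function, weights and tagset are hypothetical, and the history is simplified to a (previous tag, current word) pair rather than the full trigram history defined above.

```python
import math


def log_prob(tag, history, tagset, weights, feature_fn):
    """log p(tag | history) = score(history, tag) - log Z(history)."""
    def score(t):
        # Inner product between the weight vector and the feature vector.
        return sum(weights.get(f, 0.0) * v for f, v in feature_fn(history, t).items())

    log_z = math.log(sum(math.exp(score(t)) for t in tagset))
    return score(tag) - log_z


def toy_features(history, tag):
    """Hypothetical features: word/tag pair and tag bigram."""
    prev_tag, word = history
    return {("word_tag", word, tag): 1.0, ("bigram", prev_tag, tag): 1.0}


# Made-up weights echoing "The can is in the garage": the word alone
# favours a modal reading, but the preceding determiner favours a noun.
weights = {("word_tag", "can", "MD"): 1.5, ("bigram", "DT", "NN"): 2.0}
tagset = ["DT", "NN", "MD"]
history = ("DT", "can")

probs = {t: math.exp(log_prob(t, history, tagset, weights, toy_features)) for t in tagset}
best_tag = max(probs, key=probs.get)
```

The exponentiated values form a proper distribution over the tagset, and with these assumed weights the context feature outweighs the local one.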
