Corpora and Statistical Methods Lecture 9

Hidden Markov Models & POS Tagging. Acknowledgement: some of the diagrams are from slides by David Bley (available on the companion website to Manning and Schütze 1999).


  1. Corpora and Statistical Methods, Lecture 9: Hidden Markov Models & POS Tagging

  2. Acknowledgement
     • Some of the diagrams are from slides by David Bley (available on the companion website to Manning and Schütze 1999)

  3. Part 1: Formalisation of a Hidden Markov model

  4. Crucial ingredients (familiar)
     • Underlying states: S = {s1,…,sN}
     • Output alphabet (observations): K = {k1,…,kM}
     • State transition probabilities: A = {aij}, i,j ∈ S
     • State sequence: X = (X1,…,XT+1), plus a function mapping each Xt to a state s ∈ S
     • Output sequence: O = (o1,…,oT), where each ot ∈ K

  5. Crucial ingredients (additional)
     • Initial state probabilities: Π = {πi}, i ∈ S (the probability of starting in each state)
     • Symbol emission probabilities: B = {bijk}, i,j ∈ S, k ∈ K (the probability bijk of seeing observation ot = k at time t, given that Xt = si and Xt+1 = sj)
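
As a concrete illustration of a model μ = (A, B, Π), the sketch below stores a two-state, two-symbol arc-emission HMM in plain Python lists and checks that every distribution sums to one. The numbers, the names `PI`, `A`, `B`, and the helper `is_valid_hmm` are invented for this example and do not come from the lecture.

```python
# Toy arc-emission HMM mu = (A, B, PI); all numbers are invented for illustration.
STATES = ["s1", "s2"]            # S
ALPHABET = ["k1", "k2"]          # K

PI = [0.6, 0.4]                  # PI[i] = P(X_1 = s_i)
A = [[0.7, 0.3],                 # A[i][j] = P(X_{t+1} = s_j | X_t = s_i)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(o_t = k | X_t = s_i, X_{t+1} = s_j)
     [[0.5, 0.5], [0.2, 0.8]]]


def is_distribution(p, tol=1e-9):
    """True if p is non-negative and sums to 1 (within tol)."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) < tol


def is_valid_hmm(pi, a, b):
    """Check that PI, every row of A, and every B[i][j] are distributions."""
    n = len(pi)
    return (is_distribution(pi)
            and len(a) == n
            and all(is_distribution(row) for row in a)
            and all(is_distribution(b[i][j]) for i in range(n) for j in range(n)))
```

Indexing B by both the source and the target state mirrors the arc-emission definition on this slide; a state-emission HMM would need only B[j][k].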

  6. Trellis diagram of an HMM
     [Diagram: state s1 with transitions a1,1, a1,2 and a1,3 to states s1, s2 and s3]

  7. Trellis diagram of an HMM
     [Diagram: as slide 6, with the observation sequence o1, o2, o3 aligned to times t1, t2, t3]

  8. Trellis diagram of an HMM
     [Diagram: as slide 7, with the emission probabilities b1,1,k=o2, b1,1,k=o3, b1,2,k=o2 and b1,3,k=o2 added to the transitions]

  9. The fundamental questions for HMMs
     • Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O|μ)?
     • Given an observation sequence O and a model μ, which state sequence (X1,…,XT+1) best explains the observations?
        • This is the decoding problem
     • Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?

  10. Application of question 1 (ASR)
     • Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O|μ)?
     • Input of an ASR system: a continuous stream of sound waves, which is ambiguous
     • Need to decode it into a sequence of phones:
        • is the input the sequence [n iy d] or [n iy]?
        • which sequence is the most probable?

  11. Application of question 2 (POS Tagging)
     • Given an observation sequence O and a model μ, which state sequence (X1,…,XT+1) best explains the observations?
        • This is the decoding problem
     • Consider a POS tagger
        • Input observation sequence: I can read
        • We need to find the most likely sequence of underlying POS tags:
           • e.g. is can a modal verb or a noun?
           • how likely is it that can is a noun, given that the previous word is a pronoun?

  12. Finding the probability of an observation sequence

  13. Example problem: ASR
     • Assume that the input contains the word need
        • the input stream is ambiguous (there is noise, individual variation in speech, etc.)
     • Possible sequences of observations:
        • [n iy] (knee)
        • [n iy d] (need)
        • [n iy t] (neat)
        • …
     • States:
        • underlying sequences of phones giving rise to the input observations, with transition probabilities
        • assume we have state sequences for need, knee, new, neat, …

  14. Formulating the problem
     • The probability of an observation sequence is logically an OR problem:
        • the model gives us the state transitions underlying several possible words (knee, need, neat…)
     • How likely is the word need? We have:
        • all possible state sequences X
        • each sequence can give rise to the received signal with a certain probability (possibly zero)
        • the probability of the word need is the sum, over these state sequences, of the probability with which each could have given rise to the received signal
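
The OR formulation can be sketched directly: enumerate every state sequence, compute the probability that it produced the observations, and add the results up. The toy model below (two states, two symbols, arc emissions) and the function names are assumptions made for illustration; a real ASR model would be far larger.

```python
from itertools import product

# Toy arc-emission HMM; all numbers are invented for illustration.
PI = [0.6, 0.4]                  # PI[i] = P(X_1 = i)
A = [[0.7, 0.3],                 # A[i][j] = P(X_{t+1} = j | X_t = i)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(o_t = k | X_t = i, X_{t+1} = j)
     [[0.5, 0.5], [0.2, 0.8]]]


def path_probability(path, obs, pi, a, b):
    """P(O, X | mu) for one state sequence X of length T+1."""
    p = pi[path[0]]
    for t, o in enumerate(obs):
        i, j = path[t], path[t + 1]
        p *= a[i][j] * b[i][j][o]
    return p


def brute_force_likelihood(obs, pi, a, b):
    """P(O | mu): the sum (logical OR) over every possible state sequence."""
    n = len(pi)
    return sum(path_probability(path, obs, pi, a, b)
               for path in product(range(n), repeat=len(obs) + 1))


likelihood = brute_force_likelihood([0, 1], PI, A, B)
```

With these toy numbers the eight state sequences of length 3 sum to a likelihood of about 0.1946 for the observations [0, 1]. The number of sequences grows as N^(T+1), which is why this direct sum is only workable for very short inputs.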

  15. Simplified trellis diagram representation
     [Diagram: hidden states start, n, iy, d, end over the observations o1, …, ot-1, ot, ot+1, …, oT]
     • Hidden layer: transitions between sounds forming the words need, knee…
     • This is our model

  16. Simplified trellis diagram representation
     [Diagram: as slide 15]
     • Visible layer is what ASR is given as input

  17. Computing the probability of an observation
     [Diagram: as slide 15]

  18-22. Computing the probability of an observation
     [Diagram, repeated on each slide: state sequence x1, …, xt-1, xt, xt+1, …, xT over the observations o1, …, ot-1, ot, ot+1, …, oT]
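
The computation these slides step through can be written out in the arc-emission notation of slide 5. The block below is a reconstruction following the standard derivation in Manning and Schütze (1999), offered as a sketch and not as a transcription of the slides; it assumes the state sequence X = (X_1, …, X_{T+1}) and the observations O = (o_1, …, o_T).

```latex
\begin{align*}
P(O \mid X, \mu) &= \prod_{t=1}^{T} b_{X_t X_{t+1} o_t} \\
P(X \mid \mu)    &= \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} \\
P(O, X \mid \mu) &= P(O \mid X, \mu)\, P(X \mid \mu) \\
P(O \mid \mu)    &= \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)
                  = \sum_{X_1 \dots X_{T+1}} \pi_{X_1}
                    \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}
\end{align*}
```

The final sum ranges over N^{T+1} state sequences, so evaluating it term by term quickly becomes impractical as T grows.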

  23. A final word on observation probabilities
     • Since we're computing the probability of an observation given a model, we can use these methods to compare different models
        • if we take the observations in our corpus as given, then the best model is the one which maximises the probability of these observations
        • (useful for training/parameter setting)
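
A minimal sketch of model comparison, assuming two hand-made candidate models and a likelihood computed by summing over all state sequences. The model names `sticky` and `uniform`, their parameters, and the helper `best_model` are hypothetical.

```python
from itertools import product


def likelihood(obs, pi, a, b):
    """P(O | mu) for an arc-emission HMM, summed over all state sequences."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs) + 1):
        p = pi[path[0]]
        for t, o in enumerate(obs):
            i, j = path[t], path[t + 1]
            p *= a[i][j] * b[i][j][o]
        total += p
    return total


# Two candidate models mu = (PI, A, B); all numbers are invented for illustration.
HALF = [0.5, 0.5]
MODELS = {
    "sticky": ([0.6, 0.4],
               [[0.7, 0.3], [0.4, 0.6]],
               [[[0.9, 0.1], [0.5, 0.5]], [[0.5, 0.5], [0.2, 0.8]]]),
    "uniform": (HALF, [HALF, HALF], [[HALF, HALF], [HALF, HALF]]),
}


def best_model(obs, models):
    """Name of the model under which the observations are most probable."""
    return max(models, key=lambda name: likelihood(obs, *models[name]))
```

For example, `best_model([0, 0], MODELS)` selects `"sticky"`, while `best_model([0, 1], MODELS)` selects `"uniform"`. Training goes further than picking from a fixed list, since it searches the parameter space itself.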

  24. The forward procedure

  25. Forward Procedure
     • Given our phone input, how do we decide whether the actual word is need, knee, …?
     • Could compute P(O|μ) for every single word
     • Computationally very expensive

  26. Forward procedure
     • An efficient solution to the problem
        • based on dynamic programming (memoisation)
        • rather than performing separate computations for all possible sequences X, keep partial solutions in memory

  27. Forward procedure
     • Network representation of all sequences (X) of states that could generate the observations
        • sum of probabilities for those sequences
     • E.g. O = [n iy] could be generated by
        • X1 = [n iy d] (need)
        • X2 = [n iy t] (neat)
        • shared histories can help us save on memory
     • Fundamental assumption:
        • Given several state sequences of length t+1 with a shared history up to t
        • the probability of the first t observations is the same in all of them

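  28. Forward Procedure
     [Diagram: state sequence x1, …, xt-1, xt, xt+1, …, xT over the observations o1, …, ot-1, ot, ot+1, …, oT]
     • Probability of the first t observations is the same for all possible t+1 length state sequences.
     • Define a forward variable: $\alpha_i(t) = P(o_1 \dots o_{t-1}, X_t = i \mid \mu)$
        • the probability of ending up in state si at time t after observations 1 to t-1

  29. Forward Procedure: initialisation
     [Diagram: as slide 28]
     • Probability of the first t observations is the same for all possible t+1 length state sequences.
     • Define: $\alpha_i(1) = \pi_i, \quad 1 \le i \le N$
        • the probability of being in state si first is just equal to the initialisation probability

  30. Forward Procedure (inductive step)
     [Diagram: as slide 28]
     • Inductive step: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le t \le T,\ 1 \le j \le N$
     • Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$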
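
The forward procedure can be sketched in a few lines of Python, assuming the arc-emission model of slide 5, in which the observation at time t is emitted on the transition from Xt to Xt+1. The toy parameters and the function name `forward` are invented for illustration.

```python
# Toy arc-emission HMM; all numbers are invented for illustration.
PI = [0.6, 0.4]                  # PI[i] = P(X_1 = i)
A = [[0.7, 0.3],                 # A[i][j] = P(X_{t+1} = j | X_t = i)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(o_t = k | X_t = i, X_{t+1} = j)
     [[0.5, 0.5], [0.2, 0.8]]]


def forward(obs, pi, a, b):
    """Return the forward trellis; column t (0-based) holds alpha_i(t+1)."""
    n = len(pi)
    alpha = [list(pi)]                       # initialisation: alpha_i(1) = pi_i
    for o in obs:                            # inductive step, one column per observation
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] * b[i][j][o] for i in range(n))
                      for j in range(n)])
    return alpha


alpha = forward([0, 1], PI, A, B)
likelihood = sum(alpha[-1])                  # total: sum over i of alpha_i(T+1)
```

Each column is computed once from the previous one, so the cost is on the order of N²·T operations instead of one product per state sequence. For the observations [0, 1] this toy model gives a likelihood of about 0.1946.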

  31. Looking backward
     • The forward procedure caches the probability of sequences of states leading up to an observation (left to right).
     • The backward procedure works the other way:
        • the probability of seeing the rest of the observation sequence, given that we were in some state at some time

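  32. Backward procedure: basic structure
     • Define: $\beta_i(t) = P(o_t \dots o_T \mid X_t = i, \mu)$
        • the probability of the remaining observations, given that the current observation is emitted by state i
     • Initialise: $\beta_i(T+1) = 1, \quad 1 \le i \le N$
        • the probability at the final state
     • Inductive step: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1), \quad 1 \le t \le T,\ 1 \le i \le N$
     • Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$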
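
A matching sketch of the backward procedure, again assuming arc emissions and a toy two-state model; the parameters and the function name `backward` are invented for illustration.

```python
# Toy arc-emission HMM; all numbers are invented for illustration.
PI = [0.6, 0.4]                  # PI[i] = P(X_1 = i)
A = [[0.7, 0.3],                 # A[i][j] = P(X_{t+1} = j | X_t = i)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(o_t = k | X_t = i, X_{t+1} = j)
     [[0.5, 0.5], [0.2, 0.8]]]


def backward(obs, a, b):
    """Return the backward trellis; column t (0-based) holds beta_i(t+1)."""
    n = len(a)
    beta = [[1.0] * n]                       # initialisation: beta_i(T+1) = 1
    for o in reversed(obs):                  # inductive step, right to left
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[i][j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


beta = backward([0, 1], A, B)
# Total: weight each beta_i(1) by the probability of starting in state i.
likelihood = sum(PI[i] * beta[0][i] for i in range(len(PI)))
```

The total agrees with the value obtained from the forward direction for the same toy model and observations (about 0.1946), which is a useful sanity check when implementing both.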

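  33. Combining forward & backward variables
     • Our two variables can be combined:
        • the likelihood of being in state i at time t with our sequence of observations is a function of:
           • the probability of ending up in i at t given what came previously
           • the probability of the rest of the observations given that we are in i at t
     • Therefore: $P(O, X_t = i \mid \mu) = \alpha_i(t)\, \beta_i(t)$, and so $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$ for any $1 \le t \le T+1$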
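
The combination can be checked numerically: for every time step, summing the product of the forward and backward variables over states should give the same likelihood. The sketch below assumes arc emissions and a toy model; all names and numbers are illustrative.

```python
# Toy arc-emission HMM; all numbers are invented for illustration.
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.5, 0.5], [0.2, 0.8]]]
OBS = [0, 1]
N = len(PI)

# Forward trellis: alpha[t][i] holds alpha_i(t+1).
alpha = [list(PI)]
for o in OBS:
    prev = alpha[-1]
    alpha.append([sum(prev[i] * A[i][j] * B[i][j][o] for i in range(N))
                  for j in range(N)])

# Backward trellis: beta[t][i] holds beta_i(t+1).
beta = [[1.0] * N]
for o in reversed(OBS):
    nxt = beta[0]
    beta.insert(0, [sum(A[i][j] * B[i][j][o] * nxt[j] for j in range(N))
                    for i in range(N)])

# P(O | mu) recovered at every time step t = 1 .. T+1.
totals = [sum(al * be for al, be in zip(alpha_t, beta_t))
          for alpha_t, beta_t in zip(alpha, beta)]

# Posterior probability of each state at each time, given all observations.
gamma = [[al * be / totals[t] for al, be in zip(alpha[t], beta[t])]
         for t in range(len(alpha))]
```

Every entry of `totals` comes out at about 0.1946 for this toy model, and each row of `gamma` sums to one.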

  34. Decoding: Finding the best state sequence

  35. Best state sequence: example
     • Consider the ASR problem again
     • Input observation sequence:
        • [aa n iy dh ax]
        • (corresponds to I need the…)
     • Possible solutions:
        • I need a…
        • I need the…
        • I kneed a…
        • …
     • NB: each possible solution corresponds to a state sequence. The problem is to find the best word segmentation and the most likely underlying phonetic input.

  36. Some difficulties…
     • If we focus on the likelihood of each individual state, we run into problems
        • context effects mean that states which are individually likely may together yield an unlikely sequence
        • the ASR program needs to look at the probability of entire sequences

  37. Viterbi algorithm
     • Given an observation sequence O and a model μ, find:
        • argmaxX P(X,O|μ)
        • the sequence of states X such that P(X,O|μ) is highest
     • Basic idea:
        • run a type of forward procedure (computes probability of all possible paths)
        • store partial solutions
        • at the end, look back to find the best path

  38. Illustration: path through the trellis
     [Diagram: trellis of states S1 to S4 against times t = 1 to 7]
     • At every node (state) and time, we store:
        • the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ)
        • the preceding state leading to the current state (denoted ψ)

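  39. Viterbi Algorithm: definitions
     [Diagram: state sequence x1, …, xt-1, j over the observations o1, …, ot-1, ot, ot+1, …, oT]
     • $\delta_j(t) = \max_{X_1 \dots X_{t-1}} P(X_1 \dots X_{t-1}, o_1 \dots o_{t-1}, X_t = j \mid \mu)$
        • the probability of the most probable path from observation 1 to t-1, landing us in state j at t

  40. Viterbi Algorithm: initialisation
     [Diagram: as slide 39]
     • $\delta_j(1) = \pi_j, \quad 1 \le j \le N$
        • the probability of being in state j at the beginning is just the initialisation probability of state j

  41. Viterbi Algorithm: inductive step
     [Diagram: state sequence x1, …, xt-1, xt, xt+1 over the observations o1, …, ot-1, ot, ot+1, …, oT]
     • $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$
     • Probability of being in j at t+1 depends on
        • the state i for which the best path into i, extended by the transition aij, is most probable
        • the probability bijot of emitting the observed symbol ot on that transition

  42. Viterbi Algorithm: inductive step
     [Diagram: as slide 41]
     • Backtrace store: $\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$
        • the most probable state from which state j can be reached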

  43. Illustration
     [Diagram: trellis of states S1 to S4 against times t = 1 to 7, with the most probable path marked]
     • δ2(t=6) = probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6
     • ψ2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6

  44. Viterbi Algorithm: backtrace
     [Diagram: state sequence x1, …, xt-1, xt, xt+1, …, xT over the observations o1, …, ot-1, ot, ot+1, …, oT]
     • The best final state is the state i for which the probability δi(T+1) is highest: $\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$

  45. Viterbi Algorithm: backtrace
     [Diagram: as slide 44]
     • Work backwards to the most likely preceding state: $\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$

  46. Viterbi Algorithm: backtrace
     [Diagram: as slide 44]
     • The probability of the best state sequence is the maximum value stored at the final time step: $P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$
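
The whole Viterbi procedure (initialisation, inductive step with a backtrace store, and backtrace) can be sketched as follows. It assumes the arc-emission model of slide 5, where the state sequence has length T+1; the toy parameters and the function name `viterbi` are invented for illustration.

```python
# Toy arc-emission HMM; all numbers are invented for illustration.
PI = [0.6, 0.4]                  # PI[i] = P(X_1 = i)
A = [[0.7, 0.3],                 # A[i][j] = P(X_{t+1} = j | X_t = i)
     [0.4, 0.6]]
B = [[[0.9, 0.1], [0.5, 0.5]],   # B[i][j][k] = P(o_t = k | X_t = i, X_{t+1} = j)
     [[0.5, 0.5], [0.2, 0.8]]]


def viterbi(obs, pi, a, b):
    """Return (best state sequence of length T+1, its joint probability with obs)."""
    n = len(pi)
    delta = list(pi)                         # initialisation: delta_j(1) = pi_j
    psi = []                                 # psi[t][j]: best predecessor of state j
    for o in obs:
        # scores[j][i]: best path into i, extended by the arc i -> j emitting o
        scores = [[delta[i] * a[i][j] * b[i][j][o] for i in range(n)]
                  for j in range(n)]
        psi.append([max(range(n), key=row.__getitem__) for row in scores])
        delta = [max(row) for row in scores]

    # Backtrace: start from the best final state and follow the stored pointers.
    state = max(range(n), key=delta.__getitem__)
    best_prob = delta[state]
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    path.reverse()
    return path, best_prob


best_path, best_prob = viterbi([0, 1], PI, A, B)
```

For the observations [0, 1] this toy model yields the state sequence [0, 0, 1] with probability 0.0567. The only change from the forward recursion is that the sum over predecessor states is replaced by a max, plus the pointer table needed to recover the path.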

  47. Summary
     • We've looked at two algorithms for solving two of the fundamental problems of HMMs:
        • likelihood of an observation sequence given a model (Forward/Backward Procedure)
        • most likely underlying state sequence, given an observation sequence (Viterbi Algorithm)
     • Next up:
        • we look at POS tagging
