Corpora and Statistical Methods, Lecture 9: Hidden Markov Models & POS Tagging
Acknowledgement • Some of the diagrams are from slides by David Bley (available on the companion website to Manning and Schütze 1999)
Part 1 Formalisation of a Hidden Markov model
Crucial ingredients (familiar) • Underlying states: S = {s1,…,sN} • Output alphabet (observations): K = {k1,…,kM} • State transition probabilities: A = {aij}, i,j ∈ S • State sequence: X = (X1,…,XT+1), plus a function mapping each Xt to a state s • Output sequence: O = (O1,…,OT), where each Ot ∈ K
Crucial ingredients (additional) • Initial state probabilities: Π = {πi}, i ∈ S (tell us the initial probability of each state) • Symbol emission probabilities: B = {bijk}, i,j ∈ S, k ∈ K (tell us the probability bijk of seeing observation Ot = k at time t, given that Xt = si and Xt+1 = sj)
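As a concrete illustration (not part of the original lecture), the parameters μ = (A, B, Π) can be stored as plain NumPy arrays; the three-dimensional B reflects the arc-emission probabilities bijk defined above. All names and numbers below are invented for the example.

```python
import numpy as np

# A toy HMM mu = (A, B, Pi) with N = 2 states and M = 3 output symbols.
# All values here are illustrative, not taken from the lecture.
states = ["s1", "s2"]                 # S
alphabet = ["k1", "k2", "k3"]         # K

Pi = np.array([0.6, 0.4])             # initial state probabilities, sums to 1

A = np.array([[0.7, 0.3],             # a_ij = P(X_{t+1}=s_j | X_t=s_i)
              [0.4, 0.6]])            # each row sums to 1

# Arc-emission probabilities b_ijk = P(O_t=k | X_t=s_i, X_{t+1}=s_j):
# B[i, j] is a distribution over the output alphabet, so it sums to 1.
B = np.array([[[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
              [[0.3, 0.3, 0.4], [0.2, 0.2, 0.6]]])

assert np.isclose(Pi.sum(), 1) and np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=2), 1)
```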
Trellis diagram of an HMM • [Diagram, built up over three slides: states s1, s2, s3; transition probabilities a1,1, a1,2, a1,3 out of s1; the observation sequence o1, o2, o3 at times t1, t2, t3; and arc-emission probabilities such as b1,1,k=O2 and b1,2,k=O2 labelling the transitions]
The fundamental questions for HMMs • Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O| μ)? • Given an observation sequence O, and model μ, which is the state sequence (X1,…,Xt+1) that best explains the observations? • This is the decoding problem • Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
Application of question 1 (ASR) • Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O| μ)? • Input of an ASR system: a continuous stream of sound waves, which is ambiguous • Need to decode it into a sequence of phones. • is the input the sequence [n iy d] or [n iy]? • which sequence is the most probable?
Application of question 2 (POS Tagging) • Given an observation sequence O, and model μ, which is the state sequence (X1,…,Xt+1) that best explains the observations? • this is the decoding problem • Consider a POS tagger • Input observation sequence: • I can read • we need to find the most likely sequence of underlying POS tags: • e.g. is can a modal verb or a noun? • how likely is it that can is a noun, given that the previous word is a pronoun?
Example problem: ASR • Assume that the input contains the word need • the input stream is ambiguous (there is noise, individual variation in speech, etc.) • Possible sequences of observations: • [n iy] (knee) • [n iy dh] (need) • [n iy t] (neat) • … • States: • underlying sequences of phones, linked by transition probabilities, that give rise to the input observations • assume we have state sequences for need, knee, new, neat, …
Formulating the problem • The probability of an observation sequence is logically an OR problem: • the model gives us state transitions underlying several possible words (knee, need, neat…) • How likely is the word need? We have: • all possible state sequences X • each sequence can give rise to the signal received with a certain probability (possibly zero) • the probability of the word need is the sum, over all these state sequences, of the probability that each one gave rise to the observed signal (see the formula below)
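Written out, the OR over state sequences becomes a sum. This is a reconstruction of the standard formulation, consistent with the definitions above rather than the wording of the original slide:

```latex
P(O \mid \mu) \;=\; \sum_{X} P(O, X \mid \mu) \;=\; \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)
```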
Simplified trellis diagram representation • [Diagram: observations o1 … oT below a hidden chain of states start → n → iy → dh → end] • Hidden layer: transitions between sounds forming the words need, knee… • This is our model • Visible layer is what the ASR system is given as input
Computing the probability of an observation • [Series of slides: a trellis with hidden states x1, …, xt-1, xt, xt+1, …, xT above the observations o1, …, ot-1, ot, ot+1, …, oT; the accompanying equations, developed step by step on the original slides, are reconstructed below]
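The equations on these slides did not survive extraction. The standard expansion, consistent with the arc-emission parameters defined earlier (and with Manning and Schütze's treatment), is reconstructed here as a sketch:

```latex
P(X \mid \mu) = \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}},
\qquad
P(O \mid X, \mu) = \prod_{t=1}^{T} b_{X_t X_{t+1} O_t}

P(O \mid \mu) = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} \, b_{X_t X_{t+1} O_t}
```

Evaluated directly, this sum runs over all N^(T+1) state sequences, which is exactly the cost the forward procedure below avoids.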
A final word on observation probabilities • Since we’re computing the probability of an observation given a model, we can use these methods to compare different models • if we take observations in our corpus as given, then the best model is the one which maximises the probability of these observations • (useful for training/parameter setting)
Forward Procedure • Given our phone input, how do we decide whether the actual word is need, knee, …? • Could compute P(O|μ) for every single word • Highly expensive in terms of computation
Forward procedure • An efficient solution to this problem • based on dynamic programming (memoisation) • rather than performing separate computations for all possible sequences X, we keep partial solutions in memory
Forward procedure • Network representation of all sequences (X) of states that could generate the observations • P(O|μ) is the sum of probabilities for those sequences • E.g. O = [n iy] could be generated by • X1 = [n iy d] (need) • X2 = [n iy t] (neat) • shared histories can help us save on computation • Fundamental assumption: • given several state sequences of length t+1 with a shared history up to t • the probability of the first t observations is the same in all of them
Forward Procedure • [Trellis diagram: hidden states x1 … xT over observations o1 … oT] • Probability of the first t observations is the same for all possible t+1 length state sequences. • Define a forward variable: the probability of ending up in state si at time t after observations 1 to t-1
Forward Procedure: initialisation • Define: the probability of being in state si first is just equal to the initialisation probability
Forward Procedure (inductive step) • [Formula on slide; see the reconstruction below]
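The forward-variable formulas themselves were lost with the slide graphics; restated in the usual notation (a reconstruction matching the arc-emission parameters above):

```latex
\alpha_i(t) = P(o_1 \cdots o_{t-1},\, X_t = s_i \mid \mu), \qquad \alpha_i(1) = \pi_i

\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{i j o_t}, \qquad
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)
```

A minimal NumPy sketch of the same recursion, assuming the array layout suggested earlier (Pi, A, B; obs as a list of symbol indices); this is illustrative, not the lecture's own code:

```python
import numpy as np

def forward(Pi, A, B, obs):
    """P(O | mu) for an arc-emission HMM (illustrative sketch)."""
    N, T = len(Pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = Pi                              # alpha_i(1) = pi_i
    for t, k in enumerate(obs):
        # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{ij, o_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, k])
    return alpha[T].sum()                      # sum_i alpha_i(T+1)
```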
Looking backward • The forward procedure caches the probability of sequences of states leading up to an observation (left to right). • The backward procedure works the other way: • the probability of seeing the rest of the observation sequence, given that we were in some state at some time
Backward procedure: basic structure • Define: • the probability of the remaining observations, given that we are in state i when the current observation is emitted • Initialise: • the probability at the final state • Inductive step: • Total: • [formulas reconstructed below]
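The four missing formulas can be filled in as follows; this is the standard backward recursion for the arc-emission model, reconstructed rather than copied from the slide:

```latex
\beta_i(t) = P(o_t \cdots o_T \mid X_t = s_i, \mu), \qquad \beta_i(T+1) = 1

\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{i j o_t}\, \beta_j(t+1), \qquad
P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)
```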
Combining forward & backward variables • Our two variables can be combined: • the likelihood of being in state i at time t with our sequence of observations is a function of: • the probability of ending up in i at t given what came previously • the probability of being in i at t given the rest • Therefore:
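Concretely, the missing "Therefore" can be reconstructed as the standard result (stated here as a sketch):

```latex
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t) \quad \text{for any } t, \qquad
\gamma_i(t) = P(X_t = s_i \mid O, \mu) = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}
```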
Best state sequence: example • Consider the ASR problem again • Input observation sequence: • [aa n iy dh ax] • (corresponds to I need the…) • Possible solutions: • I need a… • I need the… • I kneed a… • … • NB: each possible solution corresponds to a state sequence. The problem is to find the best word segmentation and the most likely underlying phonetic input.
Some difficulties… • If we focus on the likelihood of each individual state, we run into problems • context effects mean that what is individually likely may together yield an unlikely sequence • the ASR program needs to look at the probability of entire sequences
Viterbi algorithm • Given an observation sequence O and a model μ, find: • argmaxX P(X,O|μ) • the sequence of states X such that P(X,O|μ) is highest • Basic idea: • run a type of forward procedure (computes probability of all possible paths) • store partial solutions • at the end, look back to find the best path
Illustration: path through the trellis • [Diagram: trellis with states S1–S4 over times t = 1 … 7, with a path marked] • At every node (state) and time, we store: • the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ) • the preceding state leading to the current state (denoted ψ)
Viterbi Algorithm: definitions • [Trellis diagram] • δj(t): the probability of the most probable path from observation 1 to t-1, landing us in state j at time t
Viterbi Algorithm: initialisation • [Trellis diagram] • The probability of being in state j at the beginning is just the initialisation probability of state j: δj(1) = πj
Viterbi Algorithm: inductive step • [Trellis diagram] • The probability of being in j at t+1 is found by maximising, over predecessor states i, the product of: • δi(t), the best path probability into i at time t • the transition probability aij • the probability of emitting the observation on that transition
Viterbi Algorithm: inductive step • [Trellis diagram] • Backtrace store ψ: the most probable state from which state j can be reached
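In symbols (a reconstruction: the same recursion as the forward procedure, with max in place of sum, plus a backpointer):

```latex
\delta_j(1) = \pi_j, \qquad
\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{i j o_t}, \qquad
\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{i j o_t}
```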
Illustration • [Diagram: trellis with states S1–S4 over times t = 1 … 7, with the most probable path into state 2 at t=6 marked] • δ2(t=6) = probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6 • ψ2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6
Viterbi Algorithm: backtrace • [Trellis diagram] • The best state at the final time step is the state i for which δi is highest • Work backwards to the most likely preceding state at each step • The probability of the best state sequence is the maximum δ value stored for the final time step
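A compact NumPy sketch of the whole algorithm, under the same assumed array layout as the earlier forward sketch (Pi, A, B; obs as integer indices). It is an illustration of the technique, not the lecture's own implementation:

```python
import numpy as np

def viterbi(Pi, A, B, obs):
    """Most probable state sequence X_1..X_{T+1} and its probability (illustrative sketch)."""
    N, T = len(Pi), len(obs)
    delta = np.zeros((T + 1, N))
    psi = np.zeros((T + 1, N), dtype=int)
    delta[0] = Pi                                     # delta_j(1) = pi_j
    for t, k in enumerate(obs):
        scores = delta[t][:, None] * A * B[:, :, k]   # scores[i, j] = delta_i(t) * a_ij * b_{ij, o_t}
        psi[t + 1] = scores.argmax(axis=0)            # best predecessor of each state j
        delta[t + 1] = scores.max(axis=0)             # delta_j(t+1)
    # Backtrace: start from the best final state and follow the stored predecessors.
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(delta[T].max())
```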
Summary • We’ve looked at algorithms for solving two of the fundamental problems of HMMs: • the likelihood of an observation sequence given a model (Forward/Backward Procedure) • the most likely underlying state sequence, given an observation sequence (Viterbi Algorithm) • Next up: • we look at POS tagging