HMM (I) LING 570 Fei Xia Week 7: 11/5-11/7/07
HMM • Definition and properties of HMM • Two types of HMM • Three basic questions in HMM
Hidden Markov Models • There are n states s1, …, sn in an HMM, and the states are connected. • The output symbols are produced by the states or edges in an HMM. • An observation O=(o1, …, oT) is a sequence of output symbols. • Given an observation, we want to recover the hidden state sequence. • An example: POS tagging • States are POS tags • Output symbols are words • Given an observation (i.e., a sentence), we want to discover the tag sequence.
Same observation, different state sequences:
time/N flies/V like/P an/DT arrow/N
time/N flies/N like/V an/DT arrow/N
Two types of HMMs • State-emission HMM (Moore machine): • The output symbol is produced by states: • By the from-state • By the to-state • Arc-emission HMM (Mealy machine): • The output symbol is produced by the edges; i.e., by the (from-state, to-state) pairs.
Formal definition of PFA A PFA is a tuple (Q, Σ, I, F, δ, P): • Q: a finite set of N states • Σ: a finite set of input symbols • I: Q → R+ (initial-state probabilities) • F: Q → R+ (final-state probabilities) • δ ⊆ Q × Σ* × Q: the transition relation between states • P: δ → R+ (transition probabilities)
Constraints on the functions: • Σ_q I(q) = 1 • for each state q: F(q) + Σ over outgoing transitions (q, x, q') of P(q, x, q') = 1 Probability of a string x: • P(x) = Σ over all paths that read x of I(first state) * (product of the transition probabilities along the path) * F(last state)
An example of PFA (diagram: two states q0 and q1, an arc q0 → q1 labeled a:1.0, and a self-loop on q1 labeled b:0.8) • I(q0)=1.0, I(q1)=0.0 • F(q0)=0, F(q1)=0.2 • P(ab^n) = I(q0) * P(q0, ab^n, q1) * F(q1) = 1.0 * 1.0 * 0.8^n * 0.2
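A quick illustrative sketch in Python of the string-probability computation for this example PFA (not part of the original slides; the function name is made up):

def prob_ab_n(n):
    # P(ab^n) = I(q0) * P(q0 -a-> q1) * P(q1 -b-> q1)^n * F(q1)
    I_q0 = 1.0    # initial-state probability of q0
    p_a = 1.0     # transition q0 -> q1 reading 'a'
    p_b = 0.8     # self-loop q1 -> q1 reading 'b'
    F_q1 = 0.2    # final-state probability of q1
    return I_q0 * p_a * (p_b ** n) * F_q1

# e.g., prob_ab_n(0) == 0.2 and prob_ab_n(1) == 0.16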
Definition of arc-emission HMM • An HMM is a tuple (S, Σ, π, A, B): • A set of states S={s1, s2, …, sN} • A set of output symbols Σ={w1, …, wM} • Initial state probabilities π={πi} • Transition prob: A={aij} • Emission prob: B={bijk}, where bijk is the probability of emitting wk on the arc from si to sj
Constraints in an arc-emission HMM • Σ_i πi = 1 • for each state si: Σ_j aij = 1 • for each state pair (si, sj): Σ_k bijk = 1 • As a result, for any integer n and any HMM, the probabilities of all output sequences of length n sum to 1.
An example: HMM structure (diagram: states s1, s2, …, sN, with output symbols such as w1, w2, w3, w4, w5 labeling the arcs between states) • Same kinds of parameters, but the emission probabilities depend on both states: P(wk | si, sj) • # of parameters: O(N²M + N²)
A path in an arc-emission HMM (diagram: states X1 → X2 → … → Xn → Xn+1, with output symbol oi emitted on the arc from Xi to Xi+1) • State sequence: X1,n+1 • Output sequence: O1,n
PFA vs. arc-emission HMM • A PFA is a tuple (Q, Σ, I, F, δ, P): Q: a finite set of N states; Σ: a finite set of input symbols; I: Q → R+ (initial-state probabilities); F: Q → R+ (final-state probabilities); δ ⊆ Q × Σ* × Q: the transition relation between states; P: δ → R+ (transition probabilities) • An HMM is a tuple (S, Σ, π, A, B): a set of states S={s1, s2, …, sN}; a set of output symbols Σ={w1, …, wM}; initial state probabilities π={πi}; transition prob A={aij}; emission prob B={bijk}
Definition of state-emission HMM • An HMM is a tuple (S, Σ, π, A, B): • A set of states S={s1, s2, …, sN} • A set of output symbols Σ={w1, …, wM} • Initial state probabilities π={πi} • Transition prob: A={aij} • Emission prob: B={bjk} • We use si and wk to refer to what is in an HMM structure. • We use Xi and Oi to refer to what is in a particular HMM path and its output.
Constraints in a state-emission HMM • Σ_i πi = 1 • for each state si: Σ_j aij = 1 • for each state sj: Σ_k bjk = 1 • As a result, for any integer n and any HMM, the probabilities of all output sequences of length n sum to 1.
An example: the HMM structure (diagram: states s1, s2, …, sN, each emitting output symbols such as w1, w2, w3, w5) • Two kinds of parameters: • Transition probability: P(sj | si) • Emission probability: P(wk | si) • # of parameters: O(NM + N²)
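For a concrete sense of scale (an illustrative calculation, not from the original slides): with N = 50 tags and M = 10,000 words, a state-emission HMM has on the order of NM + N² = 500,000 + 2,500 parameters, while an arc-emission HMM has on the order of N²M + N² = 25,000,000 + 2,500.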
Output symbols are generated by the from-states (diagram: states X1 → X2 → … → Xn, where Xi emits oi) • State sequence: X1,n • Output sequence: O1,n
Output symbols are generated by the to-states (diagram: states X1 → X2 → … → Xn+1, where Xi+1 emits oi) • State sequence: X1,n+1 • Output sequence: O1,n
A path in a state-emission HMM • Output symbols are produced by the from-states: X1 → X2 → … → Xn, with Xi emitting oi • Output symbols are produced by the to-states: X1 → X2 → … → Xn+1, with Xi+1 emitting oi
Arc-emission vs. state-emission (diagrams: in the arc-emission HMM the symbol oi labels the arc from Xi to Xi+1; in the state-emission HMM the symbol oi is attached to a state)
Properties of HMM • Markov assumption (limited horizon): P(Xt+1 = sj | X1, …, Xt) = P(Xt+1 = sj | Xt) • Stationary distribution (time invariance): the probabilities do not change over time: P(Xt+1 = sj | Xt = si) is the same for every t • The states are hidden because we know the structure of the machine (i.e., S and Σ), but we don’t know which state sequences generate a particular output.
Are the two types of HMMs equivalent? • For each state-emission HMM1, there is an arc-emission HMM2, such that for any sequence O, P(O|HMM1)=P(O|HMM2). • The reverse is also true. • How to prove that?
Applications of HMM • N-gram POS tagging • Bigram tagger: oi is a word, and si is a POS tag. • Other tagging problems: • Word segmentation • Chunking • NE tagging • Punctuation prediction • … • Other applications: ASR, …
Three fundamental questions for HMMs • Training an HMM: given a set of observation sequences, learn its distribution, i.e., estimate the transition and emission probabilities • HMM as a parser: find the best state sequence for a given observation • HMM as an LM: compute the probability of a given observation
Training an HMM: estimating the probabilities • Supervised learning: • The state sequences in the training data are known • ML estimation • Unsupervised learning: • The state sequences in the training data are unknown • forward-backward algorithm
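A minimal Python sketch of the supervised case (my illustration, not from the slides): estimate a bigram tagger's transition and emission probabilities by relative-frequency counts over tagged sentences. The data format, the BOS start state, and the function name are assumptions.

from collections import defaultdict

def mle_estimate(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    trans_count = defaultdict(lambda: defaultdict(int))   # count(tag_i -> tag_j)
    emit_count = defaultdict(lambda: defaultdict(int))    # count(tag_j emits word)
    for sent in tagged_sentences:
        prev = "BOS"                                      # assumed start state
        for word, tag in sent:
            trans_count[prev][tag] += 1
            emit_count[tag][word] += 1
            prev = tag
    # relative-frequency (ML) estimates
    A = {i: {j: c / sum(js.values()) for j, c in js.items()}
         for i, js in trans_count.items()}
    B = {j: {w: c / sum(ws.values()) for w, c in ws.items()}
         for j, ws in emit_count.items()}
    return A, B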
HMM as a parser: Finding the best state sequence (diagram: a path X1 → X2 → … → XT+1 producing outputs o1 … oT) • Given the observation O1,T = o1…oT, find the state sequence X1,T+1 = X1…XT+1 that maximizes P(X1,T+1 | O1,T). • Viterbi algorithm
“time flies like an arrow”

\emission
N time 0.1
V time 0.1
N flies 0.1
V flies 0.2
V like 0.2
P like 0.1
DT an 0.3
N arrow 0.1

\init
BOS 1.0

\transition
BOS N 0.5
BOS DT 0.4
BOS V 0.1
DT N 1.0
N N 0.2
N V 0.7
N P 0.1
V DT 0.4
V N 0.4
V P 0.1
V V 0.1
P DT 0.6
P N 0.4
Finding all the paths: to build the trellis (diagram: the trellis for “time flies like an arrow”, with BOS as the start state and a column of states N, V, P, DT for each word)
Finding all the paths (cont) (diagram: the full trellis, with the states N, V, P, DT repeated in every column and edges between adjacent columns)
Viterbi algorithm • The probability of the best path that produces O1,t-1 while ending up in state sj: δj(t) = max over X1,t-1 of P(X1,t-1, O1,t-1, Xt = sj) • Initialization: δj(1) = πj • Induction: δj(t+1) = max_i δi(t) * aij * bjk, where wk = ot • Modify it to allow ε-emission
Viterbi algorithm: calculating δj(t)

# N is the number of states in the HMM structure
# observ is the observation O, and leng is the length of observ
initialize viterbi[0..leng][0..N-1] to 0
for each state j
    viterbi[0][j] = π[j]
    back-pointer[0][j] = -1            # dummy
for (t=0; t<leng; t++)
    for (j=0; j<N; j++)
        k = observ[t]                  # the symbol at time t
        viterbi[t+1][j] = max_i viterbi[t][i] * a[i][j] * b[j][k]
        back-pointer[t+1][j] = arg max_i viterbi[t][i] * a[i][j] * b[j][k]
Viterbi algorithm: retrieving the best path

# find the best final state
best_final_state = arg max_j viterbi[leng][j]
# start with the last state in the sequence and follow the back-pointers
j = best_final_state
push(arr, j)
for (t=leng; t>0; t--)
    i = back-pointer[t][j]
    push(arr, i)
    j = i
return reverse(arr)
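Putting the two pieces of pseudocode together, here is a compact Python sketch (my illustration; variable names and table layout are assumptions, mirroring the index-based storage described on the implementation slides below). It returns the best state sequence, including the start state at time 0.

def viterbi(observ, pi, a, b):
    # observ: list of symbol indices; pi[j]: initial prob of state j;
    # a[i][j]: transition prob; b[j][k]: prob that state j emits symbol k.
    N = len(pi)
    leng = len(observ)
    V = [[0.0] * N for _ in range(leng + 1)]       # V[t][j] = best prob ending in j at time t
    back = [[-1] * N for _ in range(leng + 1)]     # back-pointers
    for j in range(N):
        V[0][j] = pi[j]
    for t in range(leng):
        k = observ[t]                              # the symbol at time t
        for j in range(N):
            best_i = max(range(N), key=lambda i: V[t][i] * a[i][j])
            V[t + 1][j] = V[t][best_i] * a[best_i][j] * b[j][k]
            back[t + 1][j] = best_i
    # retrieve the best path by following back-pointers from the best final state
    j = max(range(N), key=lambda jj: V[leng][jj])
    path = [j]
    for t in range(leng, 0, -1):
        j = back[t][j]
        path.append(j)
    return list(reversed(path))

For the toy tagger above, the states would be BOS, N, V, P, DT, with pi putting all its mass on BOS.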
Hw7 and Hw8 • Hw7: write an HMM “class”: • Read HMM input file • Output HMM • Hw8: implement the algorithms for two HMM tasks: • HMM as parser: Viterbi algorithm • HMM as LM: the prob of an observation
Implementation issue: storing HMM Approach #1 (string-keyed): • πi: pi{state_str} • aij: a{from_state_str}{to_state_str} • bjk: b{state_str}{symbol} Approach #2 (index-based): • state2idx{state_str} = state_idx • symbol2idx{symbol_str} = symbol_idx • πi: pi[state_idx] = prob • aij: a[from_state_idx][to_state_idx] = prob • bjk: b[state_idx][symbol_idx] = prob • idx2state[state_idx] = state_str • idx2symbol[symbol_idx] = symbol_str
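A possible Python sketch of Approach #2 (my illustration; the \init, \transition, and \emission section markers follow the toy HMM file shown earlier, and the function name is an assumption):

def read_hmm(filename):
    # Read an HMM file with \init, \transition, and \emission sections
    # into index-based tables (Approach #2).
    state2idx, symbol2idx = {}, {}
    init_lines, trans_lines, emit_lines = [], [], []
    section = None
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("\\"):
                section = line                    # "\init", "\transition", or "\emission"
            elif section == "\\init":
                init_lines.append(line.split())   # e.g., ["BOS", "1.0"]
            elif section == "\\transition":
                trans_lines.append(line.split())  # e.g., ["BOS", "N", "0.5"]
            elif section == "\\emission":
                emit_lines.append(line.split())   # e.g., ["N", "time", "0.1"]
    # assign indices to states and symbols as they are seen
    def state_idx(s):
        return state2idx.setdefault(s, len(state2idx))
    def symbol_idx(w):
        return symbol2idx.setdefault(w, len(symbol2idx))
    for s, _p in init_lines:
        state_idx(s)
    for i, j, _p in trans_lines:
        state_idx(i); state_idx(j)
    for j, w, _p in emit_lines:
        state_idx(j); symbol_idx(w)
    N, M = len(state2idx), len(symbol2idx)
    pi = [0.0] * N
    a = [[0.0] * N for _ in range(N)]
    b = [[0.0] * M for _ in range(N)]
    for s, p in init_lines:
        pi[state2idx[s]] = float(p)
    for i, j, p in trans_lines:
        a[state2idx[i]][state2idx[j]] = float(p)
    for j, w, p in emit_lines:
        b[state2idx[j]][symbol2idx[w]] = float(p)
    idx2state = {v: k for k, v in state2idx.items()}
    idx2symbol = {v: k for k, v in symbol2idx.items()}
    return pi, a, b, state2idx, symbol2idx, idx2state, idx2symbol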
Storing HMM: sparse matrix • Full matrix: aij: a[i][j] = prob; bjk: b[j][k] = prob • Sparse representation, indexed by the first dimension: a[i] = “j1 p1 j2 p2 …”; b[j] = “k1 p1 k2 p2 …” • Sparse representation, indexed by the second dimension: a[j] = “i1 p1 i2 p2 …”; b[k] = “j1 p1 j2 p2 …”
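In Python, nested dicts are a natural alternative to packed strings for sparse storage (an illustrative sketch, not from the slides): only nonzero probabilities are stored, and missing entries default to 0.

from collections import defaultdict

# Sparse transition and emission tables: only nonzero entries are stored.
a = defaultdict(dict)   # a[i][j] = prob, keyed by from-state index
b = defaultdict(dict)   # b[j][k] = prob, keyed by state index

a[0][1] = 0.5           # e.g., BOS -> N with prob 0.5 (indices are illustrative)
b[1][0] = 0.1           # e.g., N emits "time" with prob 0.1

def trans_prob(i, j):
    return a[i].get(j, 0.0)   # 0.0 for transitions not listed in the file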
Other implementation issues • Indices start from 0 in programming languages, but often start from 1 in algorithm descriptions. • In practice, sums of log probabilities replace products of probabilities (to avoid underflow). • Check the constraints and print a warning if they are not met.
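A small illustration of the log-probability point (mine, with made-up numbers): multiplying many small probabilities underflows to 0.0, while summing their logs stays finite.

import math

probs = [1e-5] * 100                 # e.g., 100 transition/emission probabilities
product = 1.0
for p in probs:
    product *= p                     # underflows to 0.0 long before the end
logprob = sum(math.log(p) for p in probs)

print(product)          # 0.0
print(logprob)          # about -1151.29, i.e., 100 * log(1e-5)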
HMM as an LM: computing P(o1, …, oT) 1st try: - enumerate all possible paths - add the probabilities of all paths This is infeasible in general: there are on the order of N^T possible state sequences, so we compute the same sum with forward probabilities instead.
Forward probabilities • Forward probability: the probability of producing O1,t-1 while ending up in state si: αi(t) = P(O1,t-1, Xt = si)
Calculating forward probability • Initialization: αj(1) = πj • Induction: αj(t+1) = Σ_i αi(t) * aij * bjk, where wk = ot • Finally: P(O1,T) = Σ_j αj(T+1)
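A minimal Python sketch of the forward computation (my illustration), using the same table layout as the Viterbi sketch above:

def forward_prob(observ, pi, a, b):
    # Compute P(observ) with the forward algorithm.
    # observ: list of symbol indices; pi, a, b as in the Viterbi sketch.
    N = len(pi)
    leng = len(observ)
    alpha = [[0.0] * N for _ in range(leng + 1)]   # alpha[t][j]
    for j in range(N):
        alpha[0][j] = pi[j]
    for t in range(leng):
        k = observ[t]                              # the symbol at time t
        for j in range(N):
            # summing over all from-states i replaces Viterbi's max
            alpha[t + 1][j] = sum(alpha[t][i] * a[i][j] for i in range(N)) * b[j][k]
    return sum(alpha[leng][j] for j in range(N))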
Summary • Definition: hidden states, output symbols • Properties: Markov assumption • Applications: POS-tagging, etc. • Three basic questions in HMM • Find the probability of an observation: forward probability • Find the best sequence: Viterbi algorithm • Estimate probability: MLE • Bigram POS tagger: decoding with Viterbi algorithm