Hidden Markov Model • In some Markov processes, we may not be able to observe the states directly.
An HMM is a quintuple (S, E, P, A, B).
• S: {s1…sN} are the values for the hidden states
• E: {e1…eT} are the values for the observations
• P: probability distribution of the initial state
• A: transition probability matrix
• B: emission probability matrix
Hidden Markov Model • [Diagram: chain of hidden states X1 … Xt-1, Xt, Xt+1 … XT, each emitting an observation e1 … et-1, et, et+1 … eT]
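To make the quintuple concrete, here is a minimal sketch in Python; the two-state "Rainy/Sunny" model and its numbers are invented for illustration and are not from the slides. The later sketches reuse these arrays.

```python
import numpy as np

# Hypothetical two-state HMM used throughout these sketches.
S = ["Rainy", "Sunny"]          # hidden state values s1..sN
E = ["Umbrella", "NoUmbrella"]  # observation symbols

P = np.array([0.5, 0.5])        # initial state distribution P(X1)
A = np.array([[0.7, 0.3],       # A[i, j] = P(X_{t+1} = S[j] | X_t = S[i])
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # B[i, k] = P(e_t = E[k] | X_t = S[i])
              [0.2, 0.8]])
```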
Inferences with HMM • Filtering: P(xt|e1:t) • Given an observation sequence, compute the probability of the last state. • Decoding: argmaxx1:t P(x1:t|e1:t) • Given an observation sequence, compute the most likely hidden state sequence. • Learning: argmaxθ Pθ(e1:t), where θ = (P, A, B) are the parameters of the HMM • Given an observation sequence, find the transition probability and emission probability tables that assign the observations the highest probability. • Unsupervised learning
Filtering
P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)
= P(et+1|Xt+1, e1:t) P(Xt+1|e1:t) / P(et+1|e1:t)
= P(et+1|Xt+1) P(Xt+1|e1:t) / P(et+1|e1:t)
P(Xt+1|e1:t) = Σxt P(Xt+1|xt, e1:t) P(xt|e1:t) = Σxt P(Xt+1|xt) P(xt|e1:t)
Same form, so use recursion.
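A minimal sketch of this recursion, assuming observations are given as indices into the emission matrix B from the earlier toy model; the division by P(et+1|e1:t) is just the normalization by the sum.

```python
import numpy as np

def filter_forward(P, A, B, obs):
    """Return P(X_t | e_1:t) for each t, following the recursion above.

    obs is a list of observation indices into the emission matrix B."""
    f = P * B[:, obs[0]]            # P(X1, e1), up to normalization
    f = f / f.sum()                 # P(X1 | e1)
    beliefs = [f]
    for o in obs[1:]:
        predict = A.T @ f           # sum_{x_t} P(X_{t+1}|x_t) P(x_t|e_1:t)
        f = B[:, o] * predict       # multiply in P(e_{t+1}|X_{t+1})
        f = f / f.sum()             # divide by P(e_{t+1}|e_1:t)
        beliefs.append(f)
    return beliefs
```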
Viterbi Algorithm • Compute argmaxx1:t P(x1:t|e1:t) • Since P(x1:t|e1:t) = P(x1:t, e1:t)/P(e1:t), and P(e1:t) remains constant when we consider different x1:t, argmaxx1:t P(x1:t|e1:t) = argmaxx1:t P(x1:t, e1:t) • Since the Markov chain is a Bayes net, P(x1:t, e1:t) = P(x0) ∏i=1..t P(xi|xi-1) P(ei|xi) • Minimize –log P(x1:t, e1:t) = –log P(x0) + Σi=1..t (–log P(xi|xi-1) – log P(ei|xi))
Viterbi Algorithm • Given an HMM (S, E, P, A, B) and observations e1:t, construct a graph that consists of 1 + tN nodes: • One initial node • N nodes at each time i; the jth node at time i represents Xi = sj • The link between the nodes Xi-1 = sj and Xi = sk is associated with the length –log [P(Xi=sk|Xi-1=sj) P(ei|Xi=sk)]
The problem of finding argmaxx1:t P(x1:t|e1:t) becomes that of finding the shortest path from the initial node to one of the N nodes at time t, as in the sketch below.
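A sketch of Viterbi in this shortest-path (negative-log) formulation, under the same assumptions as the filtering sketch; zero probabilities would simply show up as infinite edge lengths here, which is fine for illustration.

```python
import numpy as np

def viterbi(P, A, B, obs):
    """Return the most likely state sequence argmax P(x_1:t | e_1:t)."""
    n_states, T = A.shape[0], len(obs)
    # cost[i] = shortest-path length (-log prob) of reaching state i at the current time
    cost = -np.log(P) - np.log(B[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # edge length from state j to state k: -log a_jk - log b_k(e_t)
        step = cost[:, None] - np.log(A) - np.log(B[:, obs[t]])[None, :]
        back[t] = step.argmin(axis=0)   # best predecessor j for each state k
        cost = step.min(axis=0)
    # follow back pointers from the best final state
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```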
Baum-Welch Algorithm • The previous two kinds of computation need the parameters θ = (P, A, B). Where do the probabilities come from? • Relative frequency? • But the states are not observable! • Solution: Baum-Welch Algorithm • Unsupervised learning from observations • Find argmaxθ Pθ(e1:t)
Baum-Welch Algorithm • Start with an initial set of parameters θ0 • Possibly arbitrary • Compute pseudo counts • How many times did the transition from Xi-1=sj to Xi=sk occur? • Use the pseudo counts to obtain another (better) set of parameters θ1 • Iterate until Pθ1(e1:t) is not bigger than Pθ0(e1:t) • A special case of EM (Expectation-Maximization)
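The outer loop might look like the following skeleton; `estimate_parameters` is a placeholder for the pseudo-count computation spelled out on the next slides (a sketch of it follows the derivation of P(Xt=si|e1:T) below), and the convergence test on the likelihood is the stopping rule described above.

```python
import numpy as np

def baum_welch(P, A, B, obs, max_iters=100, tol=1e-6):
    """Iteratively re-estimate theta = (P, A, B) until P_theta(e_1:T) stops improving."""
    prev_ll = -np.inf
    for _ in range(max_iters):
        # one E + M step: pseudo counts -> updated parameters (sketched later)
        P, A, B, ll = estimate_parameters(P, A, B, obs)
        if ll - prev_ll <= tol:   # new likelihood is not (meaningfully) bigger
            break
        prev_ll = ll
    return P, A, B
```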
Pseudo Counts • Given the observation sequence e1:T, the pseudo count of the link from Xt=si to Xt+1=sj is the probability P(Xt=si, Xt+1=sj | e1:T)
Update HMM Parameters • Add P(Xt=si, Xt+1=sj | e1:T) to count(i,j) • Add P(Xt=si | e1:T) to count(i) • Add P(Xt=si | e1:T) to count(i, et) • Updated aij = count(i,j)/count(i) • Updated bi(et) = count(i,et)/count(i)
P(Xt=si, Xt+1=sj | e1:T)
= P(Xt=si, Xt+1=sj, e1:t, et+1, et+2:T) / P(e1:T)
= P(Xt=si, e1:t) P(Xt+1=sj|Xt=si) P(et+1|Xt+1=sj) P(et+2:T|Xt+1=sj) / P(e1:T)
= P(Xt=si, e1:t) aij bj(et+1) P(et+2:T|Xt+1=sj) / P(e1:T)
= αi(t) aij bj(et+1) βj(t+1) / P(e1:T)
[Diagram: trellis segment over times t-1, t, t+1, t+2: forward value αi(t) at node Xt=si, link weight aij bj(et+1) to node Xt+1=sj, backward value βj(t+1)]
P(Xt=si | e1:T)
= P(Xt=si, e1:t, et+1:T) / P(e1:T)
= P(et+1:T | Xt=si, e1:t) P(Xt=si, e1:t) / P(e1:T)
= P(et+1:T | Xt=si) P(Xt=si, e1:t) / P(e1:T)
= αi(t) βi(t) / P(e1:T)
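Putting the two expressions together, here is a sketch of the `estimate_parameters` step referenced in the earlier skeleton; α is taken as P(Xt=si, e1:t), as in the ξ derivation, and no numerical scaling is applied, so this is only suitable for short observation sequences.

```python
import numpy as np

def estimate_parameters(P, A, B, obs):
    """One Baum-Welch iteration: pseudo counts from alpha/beta, then updated P, A, B."""
    n, T = A.shape[0], len(obs)
    # forward: alpha[t, i] = P(X_t = s_i, e_1:t)
    alpha = np.zeros((T, n))
    alpha[0] = P * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # backward: beta[t, i] = P(e_{t+1:T} | X_t = s_i)
    beta = np.ones((T, n))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                     # P(e_1:T)
    # gamma[t, i] = P(X_t=s_i | e_1:T);  xi[t, i, j] = P(X_t=s_i, X_{t+1}=s_j | e_1:T)
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    # pseudo counts -> updated parameters, as on the "Update HMM Parameters" slide
    new_P = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t in range(T):
        new_B[:, obs[t]] += gamma[t]                 # count(i, e_t)
    new_B /= gamma.sum(axis=0)[:, None]              # divide by count(i)
    return new_P, new_A, new_B, likelihood
```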
Speech Signal • Waveform • Spectrogram
Feature Extraction • [Diagram: the waveform is divided into frames; Frame 1 → feature vector X1, Frame 2 → feature vector X2, …]
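As a rough illustration of framing (a sketch only: real recognizers use features such as MFCCs, and the 25 ms frame length, 10 ms step, and simple log-spectrum binning below are assumptions, not the slides' method):

```python
import numpy as np

def extract_features(signal, sample_rate, frame_ms=25, step_ms=10, n_bins=13):
    """Split a waveform into overlapping frames and compute a simple
    log-magnitude-spectrum feature vector per frame (a crude stand-in for MFCCs)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # bin the spectrum into a fixed-size vector and take logs
        bins = np.array_split(spectrum, n_bins)
        features.append(np.log([b.sum() + 1e-10 for b in bins]))
    return np.array(features)   # one feature vector X_t per frame
```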
Speech System Architecture • [Diagram: speech input → acoustic analysis → feature sequence x1…xT → global search, which maximizes P(x1…xT | w1…wk) · P(w1…wk) over word sequences, using the phoneme inventory and pronunciation lexicon (acoustic model P(x1…xT | w1…wk)) and the language model P(w1…wk) → recognized word sequence w1…wk]
HMM for Speech Recognition • [Diagram: word model for "need" with states start0 → n1 → iy2 → d3 → end4, self-loop probabilities a11, a22, a33, transitions a01, a12, a23, a34 and skip a24, and emission probabilities bj(ot) generating the observation sequence o1 o2 o3 o4 o5 o6]
Language Modeling • Goal: determine which sequence of words is more likely: • I went to a party • Eye went two a bar tea • Rudolph the red nose reindeer. • Rudolph the Red knows rain, dear. • Rudolph the Red Nose reigned here.
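A minimal sketch of how an n-gram language model could rank such alternatives; the bigram model, add-alpha smoothing, and the idea of training on ordinary text are illustrative assumptions, not a method given in the slides.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram counts from a list of training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Log P(w_1..w_k) under a bigram model with add-alpha smoothing."""
    words = ["<s>"] + sentence.lower().split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + alpha) /
                         (unigrams[prev] + alpha * vocab_size))
    return logp

# With training text containing phrases like "went to a party",
# "I went to a party" should outscore "Eye went two a bar tea".
```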
Summary • HMM • Filtering • Decoding • Learning • Speech Recognition • Feature extraction from signal • HMM for speech recognition