Hidden Markov Model • In some Markov processes, we may not be able to observe the states directly.
An HMM is a quintuple (S, E, P, A, B).
• S: {s1…sN} are the values for the hidden states
• E: {e1…eT} are the values for the observations
• P: probability distribution of the initial state
• A: transition probability matrix
• B: emission probability matrix
Hidden Markov Model • [Diagram: chain of hidden states X1 … Xt-1, Xt, Xt+1 … XT, each emitting an observation e1 … et-1, et, et+1 … eT]
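To make the quintuple concrete, here is a minimal sketch in Python; the two-state "Rainy/Sunny" model and its numbers are invented for illustration and are not from the slides. The later sketches reuse these arrays.

```python
import numpy as np

# Hypothetical two-state HMM used throughout these sketches.
S = ["Rainy", "Sunny"]          # hidden state values s1..sN
E = ["Umbrella", "NoUmbrella"]  # observation symbols

P = np.array([0.5, 0.5])        # initial state distribution P(X1)
A = np.array([[0.7, 0.3],       # A[i, j] = P(X_{t+1} = S[j] | X_t = S[i])
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # B[i, k] = P(e_t = E[k] | X_t = S[i])
              [0.2, 0.8]])
```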
Inferences with HMM • Filtering: P(xt|e1:t) • Given an observation sequence, compute the probability of the last state. • Decoding: argmaxx1:t P(x1:t|e1:t) • Given an observation sequence, compute the most likely hidden state sequence. • Learning: argmaxθ Pθ(e1:t), where θ = (P, A, B) are the parameters of the HMM • Given an observation sequence, find the transition probability and emission probability tables that assign the observations the highest probability. • Unsupervised learning
Filtering
P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)
= P(et+1|Xt+1, e1:t) P(Xt+1|e1:t) / P(et+1|e1:t)
= P(et+1|Xt+1) P(Xt+1|e1:t) / P(et+1|e1:t)
P(Xt+1|e1:t) = Σxt P(Xt+1|xt, e1:t) P(xt|e1:t) = Σxt P(Xt+1|xt) P(xt|e1:t)
Same form, so use recursion.
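A minimal sketch of this recursion, assuming observations are given as indices into the emission matrix B from the earlier toy model; the division by P(et+1|e1:t) is just the normalization by the sum.

```python
import numpy as np

def filter_forward(P, A, B, obs):
    """Return P(X_t | e_1:t) for each t, following the recursion above.

    obs is a list of observation indices into the emission matrix B."""
    f = P * B[:, obs[0]]            # P(X1, e1), up to normalization
    f = f / f.sum()                 # P(X1 | e1)
    beliefs = [f]
    for o in obs[1:]:
        predict = A.T @ f           # sum_{x_t} P(X_{t+1}|x_t) P(x_t|e_1:t)
        f = B[:, o] * predict       # multiply in P(e_{t+1}|X_{t+1})
        f = f / f.sum()             # divide by P(e_{t+1}|e_1:t)
        beliefs.append(f)
    return beliefs
```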
Viterbi Algorithm • Compute argmaxx1:t P(x1:t|e1:t) • Since P(x1:t|e1:t) = P(x1:t, e1:t)/P(e1:t), and P(e1:t) remains constant when we consider different x1:t, argmaxx1:t P(x1:t|e1:t) = argmaxx1:t P(x1:t, e1:t) • Since the Markov chain is a Bayes net, P(x1:t, e1:t) = P(x0) ∏i=1..t P(xi|xi-1) P(ei|xi) • Minimize –log P(x1:t, e1:t) = –log P(x0) + Σi=1..t (–log P(xi|xi-1) – log P(ei|xi))
Viterbi Algorithm • Given an HMM (S, E, P, A, B) and observations e1:t, construct a graph that consists of 1 + tN nodes: • One initial node • N nodes at each time i; the jth node at time i represents Xi = sj • The link between the nodes Xi-1 = sj and Xi = sk is associated with the length –log [P(Xi=sk|Xi-1=sj) P(ei|Xi=sk)]
The problem of finding argmaxx1:t P(x1:t|e1:t) becomes that of finding the shortest path from the initial node to one of the N nodes at time t, as in the sketch below.
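A sketch of Viterbi in this shortest-path (negative-log) formulation, under the same assumptions as the filtering sketch; zero probabilities would simply show up as infinite edge lengths here, which is fine for illustration.

```python
import numpy as np

def viterbi(P, A, B, obs):
    """Return the most likely state sequence argmax P(x_1:t | e_1:t)."""
    n_states, T = A.shape[0], len(obs)
    # cost[i] = shortest-path length (-log prob) of reaching state i at the current time
    cost = -np.log(P) - np.log(B[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # edge length from state j to state k: -log a_jk - log b_k(e_t)
        step = cost[:, None] - np.log(A) - np.log(B[:, obs[t]])[None, :]
        back[t] = step.argmin(axis=0)   # best predecessor j for each state k
        cost = step.min(axis=0)
    # follow back pointers from the best final state
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```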
Baum-Welch Algorithm • The previous two kinds of computation need the parameters θ = (P, A, B). Where do the probabilities come from? • Relative frequency? • But the states are not observable! • Solution: Baum-Welch Algorithm • Unsupervised learning from observations • Find argmaxθ Pθ(e1:t)
Baum-Welch Algorithm • Start with an initial set of parameters θ0 • Possibly arbitrary • Compute pseudo counts • How many times did the transition from Xi-1=sj to Xi=sk occur? • Use the pseudo counts to obtain another (better) set of parameters θ1 • Iterate until Pθ1(e1:t) is not bigger than Pθ0(e1:t) • A special case of EM (Expectation-Maximization)
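The outer loop might look like the following skeleton; `estimate_parameters` is a placeholder for the pseudo-count computation spelled out on the next slides (a sketch of it follows the derivation of P(Xt=si|e1:T) below), and the convergence test on the likelihood is the stopping rule described above.

```python
import numpy as np

def baum_welch(P, A, B, obs, max_iters=100, tol=1e-6):
    """Iteratively re-estimate theta = (P, A, B) until P_theta(e_1:T) stops improving."""
    prev_ll = -np.inf
    for _ in range(max_iters):
        # one E + M step: pseudo counts -> updated parameters (sketched later)
        P, A, B, ll = estimate_parameters(P, A, B, obs)
        if ll - prev_ll <= tol:   # new likelihood is not (meaningfully) bigger
            break
        prev_ll = ll
    return P, A, B
```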
Pseudo Counts • Given the observation sequence e1:T, the pseudo count of the link from Xt=si to Xt+1=sj is the probability P(Xt=si, Xt+1=sj | e1:T)
Update HMM Parameters • Add P(Xt=si, Xt+1=sj | e1:T) to count(i,j) • Add P(Xt=si | e1:T) to count(i) • Add P(Xt=si | e1:T) to count(i, et) • Updated aij = count(i,j)/count(i) • Updated bi(et) = count(i,et)/count(i)
P(Xt=si, Xt+1=sj | e1:T)
= P(Xt=si, Xt+1=sj, e1:t, et+1, et+2:T) / P(e1:T)
= P(Xt=si, e1:t) P(Xt+1=sj|Xt=si) P(et+1|Xt+1=sj) P(et+2:T|Xt+1=sj) / P(e1:T)
= P(Xt=si, e1:t) aij bj(et+1) P(et+2:T|Xt+1=sj) / P(e1:T)
= αi(t) aij bj(et+1) βj(t+1) / P(e1:T)
[Diagram: trellis segment over times t-1, t, t+1, t+2: forward value αi(t) at node Xt=si, link weight aij bj(et+1) to node Xt+1=sj, backward value βj(t+1)]
P(Xt=si | e1:T)
= P(Xt=si, e1:t, et+1:T) / P(e1:T)
= P(et+1:T | Xt=si, e1:t) P(Xt=si, e1:t) / P(e1:T)
= P(et+1:T | Xt=si) P(Xt=si, e1:t) / P(e1:T)
= αi(t) βi(t) / P(e1:T)
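Putting the two expressions together, here is a sketch of the `estimate_parameters` step referenced in the earlier skeleton; α is taken as P(Xt=si, e1:t), as in the ξ derivation, and no numerical scaling is applied, so this is only suitable for short observation sequences.

```python
import numpy as np

def estimate_parameters(P, A, B, obs):
    """One Baum-Welch iteration: pseudo counts from alpha/beta, then updated P, A, B."""
    n, T = A.shape[0], len(obs)
    # forward: alpha[t, i] = P(X_t = s_i, e_1:t)
    alpha = np.zeros((T, n))
    alpha[0] = P * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # backward: beta[t, i] = P(e_{t+1:T} | X_t = s_i)
    beta = np.ones((T, n))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                     # P(e_1:T)
    # gamma[t, i] = P(X_t=s_i | e_1:T);  xi[t, i, j] = P(X_t=s_i, X_{t+1}=s_j | e_1:T)
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    # pseudo counts -> updated parameters, as on the "Update HMM Parameters" slide
    new_P = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t in range(T):
        new_B[:, obs[t]] += gamma[t]                 # count(i, e_t)
    new_B /= gamma.sum(axis=0)[:, None]              # divide by count(i)
    return new_P, new_A, new_B, likelihood
```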
Speech Signal • Waveform • Spectrogram
Feature Extraction • [Diagram: the waveform is divided into frames; Frame 1 → feature vector X1, Frame 2 → feature vector X2, …]
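As a rough illustration of framing (a sketch only: real recognizers use features such as MFCCs, and the 25 ms frame length, 10 ms step, and simple log-spectrum binning below are assumptions, not the slides' method):

```python
import numpy as np

def extract_features(signal, sample_rate, frame_ms=25, step_ms=10, n_bins=13):
    """Split a waveform into overlapping frames and compute a simple
    log-magnitude-spectrum feature vector per frame (a crude stand-in for MFCCs)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # bin the spectrum into a fixed-size vector and take logs
        bins = np.array_split(spectrum, n_bins)
        features.append(np.log([b.sum() + 1e-10 for b in bins]))
    return np.array(features)   # one feature vector X_t per frame
```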
Speech System Architecture • [Diagram: speech input → acoustic analysis → feature sequence x1…xT → global search, which maximizes P(x1…xT | w1…wk) · P(w1…wk) over word sequences, using the phoneme inventory and pronunciation lexicon (acoustic model P(x1…xT | w1…wk)) and the language model P(w1…wk) → recognized word sequence w1…wk]
HMM for Speech Recognition • [Diagram: word model for "need" with states start0 → n1 → iy2 → d3 → end4, self-loop probabilities a11, a22, a33, transitions a01, a12, a23, a34 and skip a24, and emission probabilities bj(ot) generating the observation sequence o1 o2 o3 o4 o5 o6]
Language Modeling • Goal: determine which sequence of words is more likely: • I went to a party • Eye went two a bar tea • Rudolph the red nose reindeer. • Rudolph the Red knows rain, dear. • Rudolph the Red Nose reigned here.
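A minimal sketch of how an n-gram language model could rank such alternatives; the bigram model, add-alpha smoothing, and the idea of training on ordinary text are illustrative assumptions, not a method given in the slides.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram counts from a list of training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Log P(w_1..w_k) under a bigram model with add-alpha smoothing."""
    words = ["<s>"] + sentence.lower().split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + alpha) /
                         (unigrams[prev] + alpha * vocab_size))
    return logp

# With training text containing phrases like "went to a party",
# "I went to a party" should outscore "Eye went two a bar tea".
```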
Summary • HMM • Filtering • Decoding • Learning • Speech Recognition • Feature extraction from signal • HMM for speech recognition