Hidden Markov Models
Hsin-min Wang (whm@iis.sinica.edu.tw)
References:
• L. R. Rabiner and B. H. Juang (1993), Fundamentals of Speech Recognition, Chapter 6
• X. Huang et al. (2001), Spoken Language Processing, Chapter 8
• L. R. Rabiner (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989
Hidden Markov Model (HMM)
• History
  • Published in Baum's papers in the late 1960s and early 1970s
  • Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s
  • Introduced to DNA sequencing in the 1990s
• Assumptions
  • The speech signal (or DNA sequence) can be characterized as a parametric random process
  • The parameters can be estimated in a precise, well-defined manner
• Three fundamental problems
  • Evaluation of the probability (likelihood) of a sequence of observations given a specific HMM
  • Determination of the best sequence of model states
  • Adjustment of the model parameters so as to best account for the observed signal/sequence
Hidden Markov Model (HMM) – a classification example
[Figure: a 3-state HMM with near-uniform initial parameters; self-transitions 0.34, the other transitions 0.33; emission distributions S1: {A:.34, B:.33, C:.33}, S2: {A:.33, B:.34, C:.33}, S3: {A:.33, B:.33, C:.34}]
Given this initial model, we can train one HMM per class using its training data:
Training set for class 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA
Training set for class 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCAABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA
We can then decide which class the following test sequence belongs to: ABCABCCABAABABCCCCBBB
The Markov Chain
• An Observable Markov Model
• The parameters of a Markov chain with N states labeled {1,…,N}, where qt denotes the state at time t, can be described as
  aij = P(qt = j | qt-1 = i), 1 ≤ i, j ≤ N
  πi = P(q1 = i), 1 ≤ i ≤ N
• The output of the process is the set of states at each time instant t, where each state corresponds to an observable event Xi
• There is a one-to-one correspondence between the observation sequence and the Markov-chain state sequence (the observation is deterministic!)
First-order Markov chain (Rabiner 1989)
The Markov Chain – Ex 1
• Example 1: a 3-state Markov chain
[Figure: state S1 emits symbol A, S2 emits B, S3 emits C; transition probabilities include a11 = 0.6, a22 = 0.7, and the values used below]
• State 1 generates symbol A only, State 2 generates symbol B only, State 3 generates symbol C only
• Given a sequence of observed symbols O = {CABBCABC}, the only corresponding state sequence is Q = {S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is
  P(O|λ) = P(CABBCABC|λ) = P(Q|λ) = P(S3 S1 S2 S2 S3 S1 S2 S3 | λ)
  = π(S3)P(S1|S3)P(S2|S1)P(S2|S2)P(S3|S2)P(S1|S3)P(S2|S1)P(S3|S2)
  = 0.1×0.3×0.3×0.7×0.2×0.3×0.3×0.2 = 0.00002268
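The chain-rule computation above is easy to check in code. A minimal Python sketch follows; only the parameter values the example actually uses are encoded, and the variable names are mine:

```python
# Parameters read off Example 1 (only the entries the computation uses).
pi = {"S3": 0.1}                                  # initial probability of S3
a = {("S3", "S1"): 0.3, ("S1", "S2"): 0.3,        # a[(i, j)] = P(j | i)
     ("S2", "S2"): 0.7, ("S2", "S3"): 0.2}

# O = CABBCABC maps one-to-one to states (A <-> S1, B <-> S2, C <-> S3).
path = ["S3", "S1", "S2", "S2", "S3", "S1", "S2", "S3"]

p = pi[path[0]]
for prev, cur in zip(path, path[1:]):
    p *= a[(prev, cur)]
print(p)  # ≈ 2.268e-05, i.e. the slide's 0.00002268
```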
The Markov Chain – Ex 2
• Example 2: a three-state Markov chain for the Dow Jones Industrial average (Huang et al., 2001)
• The probability of 5 consecutive up days:
  P(5 consecutive up days) = π1·a11·a11·a11·a11 = 0.5×(0.6)⁴ = 0.0648
Extension to Hidden Markov Models
• HMM: an extended version of the Observable Markov Model
• The observation is a probabilistic function (discrete or continuous) of a state, instead of being in one-to-one correspondence with a state
• The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)
• What is hidden? The state sequence! Given the observation sequence, we are not sure which state sequence generated it!
Hidden Markov Models – Ex 1
• Example: a 3-state discrete HMM
[Figure: initial model; emission distributions S1: {A:.3, B:.2, C:.5}, S2: {A:.7, B:.1, C:.2}, S3: {A:.3, B:.6, C:.1}; transition structure as in Ex 1 of the Markov chain, e.g. a11 = 0.6, a22 = 0.7]
• Given a sequence of observations O = {ABC}, there are 27 (= 3³) possible corresponding state sequences, and therefore the probability P(O|λ) is the sum of P(O, Q|λ) over all of them
Hidden Markov Models – Ex 2
Given a three-state Hidden Markov Model for the Dow Jones Industrial average as follows (Huang et al., 2001):
• How to find the probability P(up, up, up, up, up | λ)?
• How to find the optimal state sequence of the model that generates the observation sequence "up, up, up, up, up"?
Elements of an HMM
• An HMM is characterized by the following:
  • N, the number of states in the model
  • M, the number of distinct observation symbols per state
  • The state-transition probability distribution A = {aij}, where aij = P[qt+1 = j | qt = i], 1 ≤ i, j ≤ N
  • The observation symbol probability distribution in state j, B = {bj(vk)}, where bj(vk) = P[ot = vk | qt = j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
  • The initial state distribution π = {πi}, where πi = P[q1 = i], 1 ≤ i ≤ N
• For convenience, we usually use the compact notation λ = (A, B, π) to indicate the complete parameter set of an HMM
• This requires specification of two model parameters (N and M)
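As a concrete sketch, λ = (A, B, π) for the three-state Dow Jones model used in the following slides can be stored in plain Python lists. The slides show π, b(up), and the transitions into state 1; the remaining entries below are filled in from Huang et al.'s published example, so treat them as assumed:

```python
# lambda = (A, B, pi); states 0=up, 1=down, 2=unchanged; symbols likewise.
N, M = 3, 3
A = [[0.6, 0.2, 0.2],     # A[i][j] = P(q_{t+1} = j | q_t = i)
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],     # B[j][k] = P(o_t = v_k | q_t = j)
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]      # pi[i] = P(q_1 = i)

# Sanity check: every row is a probability distribution.
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
print("all distributions sum to 1")
```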
Two Major Assumptions for HMMs
• First-order Markov assumption
  • The state transition depends only on the origin and destination states:
    aij = P(qt+1 = j | qt = i), 1 ≤ i, j ≤ N
  • The state-transition probability is time invariant
• Output-independence assumption
  • The observation depends only on the state that generates it, not on its neighboring observations
Three Basic Problems for HMMs
• Given an observation sequence O = (o1, o2, …, oT) and an HMM λ = (A, B, π):
• Problem 1: How to efficiently compute P(O|λ)? → the Evaluation problem
• Problem 2: How to choose an optimal state sequence Q = (q1, q2, …, qT) that best explains the observations? → the Decoding problem
• Problem 3: How to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)? → the Learning/Training problem
Solution to Problem 1 - Direct Evaluation
Given O and λ, find P(O|λ) = Pr{observing O given λ}
• Evaluate all possible state sequences Q of length T that could generate the observation sequence O:
  P(O|λ) = ΣQ P(O, Q|λ) = ΣQ P(Q|λ) P(O|Q, λ)
• P(Q|λ): the probability of the path Q
  • By the first-order Markov assumption:
    P(Q|λ) = πq1 aq1q2 aq2q3 … aq(T-1)qT
• P(O|Q, λ): the joint output probability along the path Q
  • By the output-independence assumption:
    P(O|Q, λ) = bq1(o1) bq2(o2) … bqT(oT)
Solution to Problem 1 - Direct Evaluation (cont.)
[Trellis: states S1, S2, S3 on the vertical axis; time t = 1, 2, 3, …, T-1, T on the horizontal axis with observations o1, o2, o3, …, oT-1, oT; a node Si at time t means bi(ot) has been computed, and an arc means aij has been computed]
Solution to Problem 1 - Direct Evaluation (cont.)
• Huge computational requirements: on the order of 2T·Nᵀ multiplications (there are Nᵀ possible state sequences)
• Exponential computational complexity
• A more efficient algorithm can be used to evaluate P(O|λ)
  • The Forward Procedure/Algorithm
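For a toy model the exponential enumeration is still feasible, and it makes a handy reference implementation to verify the forward algorithm against. A sketch using the Dow Jones parameters (entries not shown on these slides are taken from Huang et al.'s example):

```python
from itertools import product

A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]   # states/symbols: 0=up, 1=down, 2=unchanged

def direct_evaluation(O):
    """P(O|lambda) by summing P(O, Q|lambda) over all N^T state sequences Q."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in product(range(N), repeat=T):           # N^T sequences
        p = pi[Q[0]] * B[Q[0]][O[0]]                # pi_q1 * b_q1(o1)
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]  # a_{q(t-1)qt} * b_qt(ot)
        total += p
    return total

print(direct_evaluation([0, 0]))   # P(up, up | lambda) ≈ 0.2234
```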
Solution to Problem 1 - The Forward Procedure
• Based on the HMM assumptions, the calculation at time t involves only qt-1, qt, and ot, so it is possible to compute the likelihood with a recursion on t
• Forward variable: αt(i) = P(o1, o2, …, ot, qt = i | λ)
  • The probability of the joint event that o1, o2, …, ot are observed and the state at time t is i, given the model λ
Solution to Problem 1 - The Forward Procedure (cont.)
• Derivation of the recursion:
  αt(j) = P(o1, …, ot, qt = j | λ)
  = Σi=1..N P(o1, …, ot-1, qt-1 = i | λ) P(qt = j | qt-1 = i) P(ot | qt = j)   ← first-order Markov assumption and output-independence assumption
  = [Σi=1..N αt-1(i) aij] bj(ot)
Solution to Problem 1 - The Forward Procedure (cont.)
• α3(2) = P(o1, o2, o3, q3 = 2 | λ) = [α2(1)·a12 + α2(2)·a22 + α2(3)·a32]·b2(o3)
[Trellis: at time 3, state S2 collects α2(1)a12, α2(2)a22, and α2(3)a32 from the states at time 2, then multiplies by b2(o3); a node Si means bj(ot) has been computed, an arc means aij has been computed]
Solution to Problem 1 - The Forward Procedure (cont.)
• Algorithm
  • Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
  • Induction: αt+1(j) = [Σi=1..N αt(i) aij] bj(ot+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
  • Termination: P(O|λ) = Σi=1..N αT(i)
• Complexity: O(N²T)
  • Based on the lattice (trellis) structure
  • Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
  • All state sequences, regardless of how long they were previously, merge to N nodes (states) at each time instant t
Solution to Problem 1 - The Forward Procedure (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001), with π1 = 0.5, π2 = 0.2, π3 = 0.3; b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3; a11 = 0.6, a21 = 0.5, a31 = 0.4
• α1(1) = 0.5×0.7 = 0.35, α1(2) = 0.2×0.1 = 0.02, α1(3) = 0.3×0.3 = 0.09
• α2(1) = (0.35×0.6 + 0.02×0.5 + 0.09×0.4)×0.7 = 0.256×0.7 = 0.1792
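The three steps of the forward procedure translate directly into Python. A minimal sketch on the Dow Jones model (entries not shown on these slides follow Huang et al.'s example); it reproduces the α values computed above:

```python
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]   # states/symbols: 0=up, 1=down, 2=unchanged

def forward(O):
    """Return alpha, where alpha[t][i] = P(o_1..o_{t+1}, q_{t+1} = i | lambda), t 0-indexed."""
    N, T = len(pi), len(O)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]              # initialization
    for t in range(1, T):                                         # induction
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha                                # termination: P(O|lambda) = sum(alpha[-1])

alpha = forward([0, 0])                 # O = (up, up)
print([round(x, 4) for x in alpha[0]])  # [0.35, 0.02, 0.09]  (alpha_1, as on the slide)
print(round(alpha[1][0], 4))            # 0.1792               (alpha_2(1))
```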
Solution to Problem 2 - The Viterbi Algorithm
• The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
• Instead of summing probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
• It finds a single optimal state sequence Q = (q1, q2, …, qT)
• The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
Solution to Problem 2 - The Viterbi Algorithm (cont.)
[Trellis: states S1, S2, S3 on the vertical axis; time t = 1, 2, 3, …, T-1, T on the horizontal axis with observations o1, o2, o3, …, oT-1, oT]
Solution to Problem 2 - The Viterbi Algorithm (cont.)
• Initialization: δ1(i) = πi bi(o1), Ψ1(i) = 0, 1 ≤ i ≤ N
• Induction: δt(j) = maxi [δt-1(i) aij]·bj(ot), Ψt(j) = argmaxi [δt-1(i) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination: P* = maxi δT(i), qT* = argmaxi δT(i)
• Backtracking: qt* = Ψt+1(qt+1*), t = T-1, T-2, …, 1
  Q* = (q1*, q2*, …, qT*) is the best state sequence
Complexity: O(N²T)
Solution to Problem 2 - The Viterbi Algorithm (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial average (Huang et al., 2001), with π1 = 0.5, π2 = 0.2, π3 = 0.3; b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3; a11 = 0.6, a21 = 0.5, a31 = 0.4
• δ1(1) = 0.5×0.7 = 0.35, δ1(2) = 0.2×0.1 = 0.02, δ1(3) = 0.3×0.3 = 0.09
• δ2(1) = max(0.35×0.6, 0.02×0.5, 0.09×0.4)×0.7 = 0.21×0.7 = 0.147, Ψ2(1) = 1
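The Viterbi recursion can be sketched as follows (Dow Jones parameters; entries not on these slides follow Huang et al.'s example; states are 0-indexed, so the slide's Ψ2(1) = 1 appears here as psi[1][0] = 0):

```python
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]   # states/symbols: 0=up, 1=down, 2=unchanged

def viterbi(O):
    """Return (delta, best_path) for observation sequence O."""
    N, T = len(pi), len(O)
    delta = [[pi[i] * B[i][O[0]] for i in range(N)]]     # initialization
    psi = [[0] * N]
    for t in range(1, T):                                # induction
        d_t, psi_t = [], []
        for j in range(N):
            scores = [delta[-1][i] * A[i][j] for i in range(N)]
            best = max(range(N), key=scores.__getitem__)
            d_t.append(scores[best] * B[j][O[t]])        # max, not sum
            psi_t.append(best)                           # remember the best predecessor
        delta.append(d_t)
        psi.append(psi_t)
    q = [max(range(N), key=delta[-1].__getitem__)]       # termination
    for t in range(T - 1, 0, -1):                        # backtracking
        q.append(psi[t][q[-1]])
    return delta, q[::-1]

delta, path = viterbi([0, 0, 0, 0, 0])   # O = (up, up, up, up, up)
print(round(delta[1][0], 4))             # 0.147  (the slide's delta_2(1))
print(path)                              # [0, 0, 0, 0, 0]: stay in the "up" state
```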
Solution to Problem 3 – The Baum-Welch Algorithm
• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O|λ)?
• The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in closed form
  • The data is incomplete because of the hidden state sequence
• The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm
• The EM (Expectation Maximization) algorithm is well suited to this problem
Solution to Problem 3 – The Backward Procedure
• Backward variable: βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
  • The probability of the partial observation sequence ot+1, ot+2, …, oT, given state i at time t and the model λ
• β2(3) = P(o3, o4, …, oT | q2 = 3, λ) = a31·b1(o3)·β3(1) + a32·b2(o3)·β3(2) + a33·b3(o3)·β3(3)
[Trellis: at time 2, state S3 fans out to the three states at time 3 via a31b1(o3), a32b2(o3), and a33b3(o3)]
Solution to Problem 3 – The Backward Procedure (cont.)
• Algorithm
  • Initialization: βT(i) = 1, 1 ≤ i ≤ N
  • Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), t = T-1, T-2, …, 1, 1 ≤ i ≤ N
• Complexity: O(N²T) (cf. the forward procedure)
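A matching Python sketch of the backward recursion (same Dow Jones parameters; unshown entries follow Huang et al.'s example). As a consistency check, Σi πi bi(o1) β1(i) must equal the P(O|λ) computed by the forward procedure:

```python
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]   # states/symbols: 0=up, 1=down, 2=unchanged

def backward(O):
    """Return beta, where beta[t][i] = P(o_{t+2}..o_T | q_{t+1} = i, lambda), t 0-indexed."""
    N, T = len(pi), len(O)
    beta = [[1.0] * N]                       # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):           # induction, right to left
        nxt = beta[0]                        # beta_{t+1}
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

O = [0, 0]                                   # O = (up, up)
beta = backward(O)
p = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(3))
print(round(p, 4))   # 0.2234 -- agrees with the forward procedure
```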
Solution to Problem 3 – The Forward-Backward Algorithm
• Relation between the forward and backward variables (Huang et al., 2001):
  αt(i)·βt(i) = P(o1, …, oT, qt = i | λ) = P(O, qt = i | λ)
  ⇒ P(O|λ) = Σi=1..N αt(i) βt(i), for any t
Solution to Problem 3 – The Intuitive View
• Define two new variables:
  γt(i) = P(qt = i | O, λ)
  • The probability of being in state i at time t, given O and λ
  ξt(i, j) = P(qt = i, qt+1 = j | O, λ)
  • The probability of being in state i at time t and state j at time t+1, given O and λ
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, O | λ) = α3(3)·β3(3)
[Trellis: all paths passing through state S3 at time 3; everything up to time 3 is summarized by α3(3), everything after by β3(3)]
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, q4 = 1, O | λ) = α3(3)·a31·b1(o4)·β4(1)
[Trellis: all paths passing through state S3 at time 3 and state S1 at time 4]
Solution to Problem 3 – The Intuitive View (cont.)
• ξt(i, j) = P(qt = i, qt+1 = j | O, λ) = αt(i) aij bj(ot+1) βt+1(j) / P(O|λ)
• γt(i) = P(qt = i | O, λ) = αt(i) βt(i) / P(O|λ) = Σj=1..N ξt(i, j)
Solution to Problem 3 – The Intuitive View (cont.)
• Re-estimation formulae for π, A, and B:
  π̄i = expected frequency (number of times) in state i at time t = 1 = γ1(i)
  āij = (expected number of transitions from state i to state j) / (expected number of transitions from state i) = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 γt(i)
  b̄j(vk) = (expected number of times in state j observing symbol vk) / (expected number of times in state j) = Σt: ot=vk γt(j) / Σt=1..T γt(j)
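One full re-estimation step can be sketched by combining the forward and backward passes with the γ/ξ formulas above. A minimal single-sequence Python sketch (Dow Jones parameters, unshown entries from Huang et al.'s example; the training sequence is made up purely for illustration):

```python
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
pi = [0.5, 0.2, 0.3]   # states/symbols: 0=up, 1=down, 2=unchanged
N, M = 3, 3

def forward(O):
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(O):
    beta = [[1.0] * N]
    for t in range(len(O) - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

def reestimate(O):
    """One Baum-Welch step: return (pi_bar, A_bar, B_bar)."""
    T = len(O)
    al, be = forward(O), backward(O)
    PO = sum(al[-1])                                       # P(O | lambda)
    gamma = [[al[t][i] * be[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[al[t][i] * A[i][j] * B[j][O[t + 1]] * be[t + 1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    pi_bar = gamma[0][:]                                   # gamma_1(i)
    A_bar = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    B_bar = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return pi_bar, A_bar, B_bar

O = [0, 0, 1, 1, 2]                  # up, up, down, down, unchanged (made-up data)
pi_bar, A_bar, B_bar = reestimate(O)
for row in A_bar + B_bar + [pi_bar]: # re-estimates are still valid distributions
    assert abs(sum(row) - 1.0) < 1e-9
print("re-estimation step OK")
```

In practice the step is iterated until P(O|λ) converges; the EM interpretation guarantees the likelihood never decreases from one iteration to the next.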