PatReco: Hidden Markov Models Alexandros Potamianos, Dept. of ECE, Tech. Univ. of Crete, Fall 2004-2005
Markov Models: Definition • Markov chains are Bayesian networks that model sequences of events (states) • Sequential events are dependent • Two non-sequential events are conditionally independent given the intermediate events (MM-1)
Markov chains [Figure: graphical models of Markov chains of order 0 to 3 (MM-0, MM-1, MM-2, MM-3) over states q0, q1, q2, q3, q4, …; the higher the order, the more preceding states each state depends on]
Markov Chains
MM-0: P(q1,q2,…,qN) = Πn=1..N P(qn)
MM-1: P(q1,q2,…,qN) = Πn=1..N P(qn|qn-1)
MM-2: P(q1,q2,…,qN) = Πn=1..N P(qn|qn-1,qn-2)
MM-3: P(q1,q2,…,qN) = Πn=1..N P(qn|qn-1,qn-2,qn-3)
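As a concrete illustration of the MM-1 factorization, here is a minimal Python sketch; the 3-state prior and transition matrix are made-up numbers, not from the slides:

```python
import numpy as np

# Hypothetical 3-state example: prior P(q0) and transition matrix P(qn | qn-1).
prior = np.array([0.5, 0.3, 0.2])        # P(q0)
A = np.array([[0.8, 0.1, 0.1],           # A[i, j] = P(qn = j | qn-1 = i)
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def mm1_sequence_prob(states):
    """P(q1, ..., qN) under an MM-1 chain: prior times product of transitions."""
    p = prior[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

print(mm1_sequence_prob([0, 0, 1, 2]))   # 0.5 * 0.8 * 0.1 * 0.2 = 0.008
```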
Hidden Markov Models • Hidden Markov chains model sequences of events and corresponding sequences of observations • Events form a Markov chain (MM-1) • Observations are conditionally independent given the sequence of events • Each observation is directly connected to a single event (and is conditionally independent of the rest of the events in the network)
Hidden Markov Models [Figure: HMM-1 graphical model: state chain q0 → q1 → q2 → q3 → q4 → … with one observation on attached to each state qn]
HMM-1: P(o0,o1,…,oN, q0,q1,…,qN) = Πn=0..N P(qn|qn-1) P(on|qn)
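A small sketch of the HMM-1 joint probability, assuming a hypothetical 2-state model with discrete observations (the next slide uses Gaussian observations; the factorization is the same with P(on|qn) replaced by a density):

```python
import numpy as np

# Hypothetical discrete-observation HMM: priors, transitions, and emissions.
prior = np.array([0.6, 0.4])              # P(q0)
A = np.array([[0.7, 0.3],                 # A[i, j] = P(qn = j | qn-1 = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                 # B[i, k] = P(on = k | qn = i)
              [0.2, 0.8]])

def hmm1_joint_prob(states, obs):
    """P(o0..oN, q0..qN) = P(q0) P(o0|q0) * prod_n P(qn|qn-1) P(on|qn)."""
    p = prior[states[0]] * B[states[0], obs[0]]
    for n in range(1, len(states)):
        p *= A[states[n - 1], states[n]] * B[states[n], obs[n]]
    return p

print(hmm1_joint_prob([0, 0, 1], [0, 0, 1]))
```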
Parameter Estimation • The parameters that have to be estimated are • the a-priori probabilities P(q0) • the transition probabilities P(qn|qn-1) • the observation probabilities P(on|qn) • For example, if there are 3 types of events and continuous 1-D observations that follow a Gaussian distribution, there are 18 parameters to estimate: • 3 a-priori probabilities • a 3x3 transition probability matrix • 3 means and 3 variances (observation probabilities)
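A quick sketch confirming the parameter count for such a 3-state, 1-D Gaussian HMM (all values below are placeholders; only the array shapes matter):

```python
import numpy as np

# Illustrative parameter set for a 3-state HMM with 1-D Gaussian observations.
prior = np.full(3, 1 / 3)              # 3 a-priori probabilities P(q0)
A = np.full((3, 3), 1 / 3)             # 3x3 transition matrix P(qn | qn-1)
means = np.array([-1.0, 0.0, 1.0])     # 3 Gaussian means (one per state)
variances = np.ones(3)                 # 3 Gaussian variances (one per state)

n_params = prior.size + A.size + means.size + variances.size
print(n_params)                        # 18, as counted on the slide
```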
Parameter Estimation • If both the sequence of events and the sequence of observations are fully observable, maximum likelihood (ML) estimation is used • Usually the sequence of events q0,q1,…,qN is not observable, in which case EM is used • The EM algorithm for HMMs is the Baum-Welch or forward-backward algorithm
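For the fully observed case, ML estimation reduces to counting and averaging. A rough sketch under stated assumptions (the function name and interface are illustrative, and it assumes every state actually occurs in the training data; with hidden states these counts would be replaced by the expected counts computed by Baum-Welch/EM):

```python
import numpy as np

def ml_estimate(state_seqs, obs_seqs, n_states):
    """ML estimates for an HMM when both states and 1-D observations are observed."""
    prior = np.zeros(n_states)
    trans = np.zeros((n_states, n_states))
    per_state_obs = [[] for _ in range(n_states)]
    for states, obs in zip(state_seqs, obs_seqs):
        prior[states[0]] += 1                       # count initial states
        for prev, cur in zip(states[:-1], states[1:]):
            trans[prev, cur] += 1                   # count transitions
        for q, o in zip(states, obs):
            per_state_obs[q].append(o)              # collect observations per state
    prior /= prior.sum()
    trans /= trans.sum(axis=1, keepdims=True)       # assumes every state is visited
    means = np.array([np.mean(x) for x in per_state_obs])
    variances = np.array([np.var(x) for x in per_state_obs])
    return prior, trans, means, variances
```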
Inference/Decoding • The main inference problem for HMMs is known as the decoding problem: given a sequence of observations O, find the best sequence of states: q* = argmaxq P(q|O) = argmaxq P(q,O) • An efficient decoding algorithm is the Viterbi algorithm
Viterbi algorithm
maxq P(q,O) = maxq P(o0,o1,…,oN, q0,q1,…,qN) = maxq Πn=0..N P(qn|qn-1) P(on|qn)
= maxqN {P(oN|qN) maxqN-1 {P(qN|qN-1) P(oN-1|qN-1) … maxq2 {P(q3|q2) P(o2|q2) maxq1 {P(q2|q1) P(o1|q1) maxq0 {P(q1|q0) P(o0|q0) P(q0)}}} … }}
Viterbi algorithm [Figure: trellis of K states unrolled over time] At each node, keep only the best (most probable) path among all paths passing through that node
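A compact Viterbi sketch for a discrete-observation HMM, implementing the trellis idea above: at each time step only the best-scoring path into each state is kept, with back-pointers to recover the state sequence. The 2-state model is hypothetical, and real implementations usually work with log probabilities to avoid underflow:

```python
import numpy as np

def viterbi(obs, prior, A, B):
    """Most probable state sequence argmax_q P(q, O) for a discrete-observation HMM."""
    N, K = len(obs), len(prior)
    delta = np.zeros((N, K))                 # best path score ending in each state
    psi = np.zeros((N, K), dtype=int)        # back-pointer to the best predecessor
    delta[0] = prior * B[:, obs[0]]
    for n in range(1, N):
        scores = delta[n - 1][:, None] * A   # scores[i, j]: best path via i into j
        psi[n] = scores.argmax(axis=0)       # keep only the best path per node
        delta[n] = scores.max(axis=0) * B[:, obs[n]]
    path = [int(delta[-1].argmax())]         # backtrack from the best final state
    for n in range(N - 1, 0, -1):
        path.append(int(psi[n, path[-1]]))
    return path[::-1], delta[-1].max()

# Example with a hypothetical 2-state model.
prior = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], prior, A, B))
```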
Deep Thoughts • HMM-0 (an HMM with an MM-0 event chain) is the Bayes classifier! • MMs and HMMs are poor models, but they are simple and computationally efficient • How do you fix this? (dependent observations?)
Some Applications • Speech Recognition • Optical Character Recognition • Part-of-Speech Tagging • …
Conclusions • HMMs and MMs are useful modeling tools for dependent sequences of events (states or classes) • Efficient algorithms exist for training HMM parameters (Baum-Welch) and for decoding the most probable sequence of states given an observation sequence (Viterbi) • HMMs have many applications