Hidden Markov Models (HMMs)
Chapter 3 (Duda et al.) – Section 3.10 (Warning: this section has lots of typos)
CS479/679 Pattern Recognition
Spring 2013 – Dr. George Bebis
Sequential vs Temporal Patterns
• Sequential patterns:
  • The order of data points is irrelevant.
• Temporal patterns:
  • The order of data points is important (i.e., time series).
  • Data can be represented by a number of states.
  • States at time $t$ are influenced directly by states in previous time steps (i.e., they are correlated).
Hidden Markov Models (HMMs)
• HMMs are appropriate for problems that have an inherent temporality:
  • Speech recognition
  • Gesture recognition
  • Human activity recognition
First-Order Markov Models
• Represented by a graph where every node corresponds to a state $\omega_i$.
• The graph can be fully connected, with self-loops.
First-Order Markov Models (cont’d)
• Links between nodes $\omega_i$ and $\omega_j$ are associated with a transition probability:
$$P(\omega(t+1)=\omega_j \mid \omega(t)=\omega_i) = a_{ij}$$
  which is the probability of going to state $\omega_j$ at time $t+1$ given that the state at time $t$ was $\omega_i$ (first-order model).
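First-Order Markov Models (cont’d)
• Markov models are fully described by their transition probabilities $a_{ij}$.
• The following constraints should be satisfied:
$$a_{ij} \ge 0 \quad \text{for all } i, j, \qquad \sum_{j} a_{ij} = 1 \quad \text{for all } i$$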
Example: Weather Prediction Model
• Assume three weather states:
  • $\omega_1$: Precipitation (rain, snow, hail, etc.)
  • $\omega_2$: Cloudy
  • $\omega_3$: Sunny
[Figure: state diagram over $\omega_1$, $\omega_2$, $\omega_3$ and the corresponding 3×3 transition matrix]
Computing the probability $P(\omega^T)$ of a sequence of states $\omega^T$
• Given a sequence of states $\omega^T = (\omega(1), \omega(2), \ldots, \omega(T))$, the probability that the model generated $\omega^T$ is equal to the product of the corresponding transition probabilities:
$$P(\omega^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))$$
  where $P(\omega(1) \mid \omega(0)) = P(\omega(1))$ is the prior probability of the first state.
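Example: Weather Prediction Model (cont’d)
• What is the probability that the weather for eight consecutive days is “sunny-sunny-sunny-rainy-rainy-sunny-cloudy-sunny”?
$$\omega^8 = (\omega_3, \omega_3, \omega_3, \omega_1, \omega_1, \omega_3, \omega_2, \omega_3)$$
$$P(\omega^8) = P(\omega_3)\,P(\omega_3 \mid \omega_3)\,P(\omega_3 \mid \omega_3)\,P(\omega_1 \mid \omega_3)\,P(\omega_1 \mid \omega_1)\,P(\omega_3 \mid \omega_1)\,P(\omega_2 \mid \omega_3)\,P(\omega_3 \mid \omega_2) = 1.536 \times 10^{-4}$$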
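
The product above can be checked numerically. The sketch below is only illustrative: the transition matrix `A` is an assumption (the slide's matrix is not present in the text), chosen because it reproduces the quoted value when the first day is taken to be sunny with probability 1. The helper name `sequence_probability` is hypothetical.

```python
# ASSUMED transition matrix (not recoverable from the slide text).
# Rows: current state, columns: next state, ordered
# (precipitation, cloudy, sunny) = (omega_1, omega_2, omega_3).
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]


def sequence_probability(states, A, prior_first):
    """Prior of the first state times the product of a_ij along the sequence.

    `states` holds 0-based state indices.
    """
    prob = prior_first
    for i, j in zip(states, states[1:]):
        prob *= A[i][j]
    return prob


# sunny-sunny-sunny-rainy-rainy-sunny-cloudy-sunny (sunny = index 2)
days = [2, 2, 2, 0, 0, 2, 1, 2]
# ASSUMPTION: P(omega(1) = sunny) = 1, i.e. the first day is known to be sunny.
p = sequence_probability(days, A, prior_first=1.0)
print(f"{p:.4e}")
```

With a different prior for the first day, the result simply scales by that prior.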
Limitations of Markov models
• In Markov models, each state is uniquely associated with an observable event.
• Once an observation is made, the state of the system is trivially retrieved.
• Such systems are not of practical use for most applications.
Hidden States and Observations
• Assume that each state can generate a number of outputs (i.e., observations) according to some probability distribution.
• Each observation can potentially be generated at any state.
• The state sequence is not directly observable (i.e., it is hidden) but can be estimated from the observation sequence.
First-order HMMs
• Augment the Markov model such that when it is in state $\omega(t)$ it also emits some symbol $v(t)$ (visible state) among a set of possible symbols.
• We have access to the visible states $v(t)$ only, while the $\omega(t)$ are unobservable.
Example: Weather Prediction Model (cont’d)
• Observations:
  • $v_1$: temperature
  • $v_2$: humidity
  • etc.
Observation Probabilities
• When the model is in state $\omega_j$ at time $t$, the probability of emitting a visible state $v_k$ at that time is denoted as:
$$P(v(t)=v_k \mid \omega(t)=\omega_j) = b_{jk}, \quad \text{where } \sum_{k} b_{jk} = 1 \text{ for all } j \quad \text{(observation probabilities)}$$
• For every sequence of hidden states, there is an associated sequence of visible states:
$$\omega^T = (\omega(1), \omega(2), \ldots, \omega(T)), \qquad V^T = (v(1), v(2), \ldots, v(T))$$
Absorbing State $\omega_0$
• Given a state sequence and its corresponding observation sequence:
$$\omega^T = (\omega(1), \omega(2), \ldots, \omega(T)), \qquad V^T = (v(1), v(2), \ldots, v(T))$$
  we assume that $\omega(T) = \omega_0$ is some absorbing state, which uniquely emits symbol $v(T) = v_0$.
• Once the system enters the absorbing state, it cannot escape from it.
HMM Formalism
• An HMM is defined by $\{\Omega, V, P, A, B\}$:
  • $\Omega = \{\omega_1, \ldots, \omega_n\}$ are the possible states
  • $V = \{v_1, \ldots, v_m\}$ are the possible observations
  • $P = \{p_i\}$ are the prior state probabilities
  • $A = \{a_{ij}\}$ are the state transition probabilities
  • $B = \{b_{jk}\}$ are the observation probabilities
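
The five-part definition maps naturally onto a small container type. The following is a minimal sketch, not a reference implementation: the class name `HMM`, its field names, the helper `_check_distribution`, and the example numbers are all hypothetical. It only checks that the prior, each row of A, and each row of B are probability distributions.

```python
from dataclasses import dataclass
import math


def _check_distribution(row, size, name):
    """Raise ValueError unless `row` is a probability distribution of length `size`."""
    if len(row) != size or any(p < 0 for p in row) or not math.isclose(sum(row), 1.0):
        raise ValueError(f"{name} is not a valid distribution over {size} outcomes: {row}")


@dataclass
class HMM:
    states: list   # Omega: the n possible hidden states
    symbols: list  # V: the m possible observations
    prior: list    # P: prior[i] = p_i
    trans: list    # A: trans[i][j] = a_ij
    emit: list     # B: emit[j][k] = b_jk

    def validate(self):
        n, m = len(self.states), len(self.symbols)
        _check_distribution(self.prior, n, "prior")
        if len(self.trans) != n or len(self.emit) != n:
            raise ValueError("A and B need one row per state")
        for i in range(n):
            _check_distribution(self.trans[i], n, f"row {i} of A")
            _check_distribution(self.emit[i], m, f"row {i} of B")
        return True


# Hypothetical two-state model with two symbols (illustrative values only).
hmm = HMM(
    states=["s1", "s2"],
    symbols=["H", "T"],
    prior=[0.5, 0.5],
    trans=[[0.7, 0.3], [0.4, 0.6]],
    emit=[[0.9, 0.1], [0.2, 0.8]],
)
print(hmm.validate())
```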
Some Terminology
• Causal: the probabilities depend only upon previous states.
• Ergodic: given some starting state, every one of the states has a non-zero probability of occurring.
[Figure: a “left-right” HMM]
Coin toss example
• You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening on the other side.
• On the other side of the barrier is another person who is performing a coin (or multiple coin) toss experiment.
• The other person will tell you only the result of the experiment, not how that result was obtained, e.g., $V^T = \text{H H T H T T H H} \ldots \text{T} = v(1), v(2), \ldots, v(T)$
Coin toss example (cont’d)
• Problem: derive an HMM to explain the observed sequence of heads and tails.
• The coins represent the hidden states, since we do not know which coin was tossed each time.
• The outcome of each toss represents an observation.
• A “likely” sequence of coins (state sequence) may be inferred from the observations.
• The state sequence might not be unique in general.
Coin toss example: 1-fair coin model
• There are 2 states, each associated with either heads (state 1) or tails (state 2).
• The observation sequence uniquely defines the states (i.e., the states are not hidden).
[Figure: state diagram with observation probabilities]
Coin toss example: 2-fair coins model
• There are 2 states, each associated with a coin; a third coin is used to decide which of the fair coins to flip.
• Neither state is uniquely associated with either heads or tails.
[Figure: state diagram with observation probabilities]
Coin toss example: 2-biased coins model
• There are 2 states, each associated with a biased coin; a third coin is used to decide which of the biased coins to flip.
• Neither state is uniquely associated with either heads or tails.
[Figure: state diagram with observation probabilities]
Coin toss example: 3-biased coins model
• There are 3 states, each associated with a biased coin; we decide which coin to flip by some other mechanism (e.g., other coins).
• No state is uniquely associated with either heads or tails.
[Figure: state diagram with observation probabilities]
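
To make the coin models concrete, the sketch below samples from a two-biased-coins HMM. All probabilities are hypothetical (the slide's values are not in the text), and `sample_hmm` is an illustrative helper name. The caller behind the curtain would only reveal `visible`; `hidden` is what the decoder later tries to recover.

```python
import random


def sample_hmm(prior, trans, emit, symbols, length, rng):
    """Draw (hidden_states, observations), each of the requested length."""
    n = len(prior)
    state = rng.choices(range(n), weights=prior)[0]   # pick the first coin
    hidden, visible = [], []
    for _ in range(length):
        hidden.append(state)
        k = rng.choices(range(len(symbols)), weights=emit[state])[0]  # toss it
        visible.append(symbols[k])
        state = rng.choices(range(n), weights=trans[state])[0]        # choose next coin
    return hidden, visible


# HYPOTHETICAL parameters for two biased coins.
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]   # which coin is flipped next
emit = [[0.9, 0.1], [0.2, 0.8]]    # P(H), P(T) for each coin

rng = random.Random(0)             # fixed seed so the run is repeatable
hidden, visible = sample_hmm(prior, trans, emit, ["H", "T"], 10, rng)
print("".join(visible))
```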
Which model is best?
• Since the states are not observable, the best we can do is to select the model $\theta$ that best explains the observations:
$$\max_{\theta} P(V^T \mid \theta)$$
• Longer observation sequences typically make it easier to select the best model.
Classification Using HMMs
• Given an observation sequence $V^T$ and a set of possible models $\theta$, choose the model with the highest probability $P(\theta \mid V^T)$.
• Bayes rule:
$$P(\theta \mid V^T) = \frac{P(V^T \mid \theta)\, P(\theta)}{P(V^T)}$$
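
As a small illustration of the Bayes-rule step, the sketch below turns per-model likelihoods into posteriors and picks the best model. The likelihood values, model names, and the function name `model_posteriors` are hypothetical; in practice each likelihood would come from evaluating the observation sequence under that model.

```python
def model_posteriors(likelihoods, priors):
    """Bayes rule: P(theta | V) = P(V | theta) P(theta) / P(V)."""
    joint = {name: likelihoods[name] * priors[name] for name in likelihoods}
    evidence = sum(joint.values())  # P(V), summed over the candidate models
    return {name: value / evidence for name, value in joint.items()}


# HYPOTHETICAL likelihoods P(V | theta) for two candidate models.
likelihoods = {"model_a": 2.0e-6, "model_b": 5.0e-7}
priors = {"model_a": 0.5, "model_b": 0.5}

post = model_posteriors(likelihoods, priors)
best = max(post, key=post.get)
print(best, post)
```

Since $P(V^T)$ is the same for every model, comparing $P(V^T \mid \theta)\,P(\theta)$ alone gives the same decision.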
Three basic HMM problems
• Evaluation
  • Determine the probability $P(V^T)$ that a particular sequence of visible states $V^T$ was generated by a given model (using the Forward/Backward algorithm).
• Decoding
  • Given a sequence of visible states $V^T$, determine the most likely sequence of hidden states $\omega^T$ that led to those observations (using the Viterbi algorithm).
• Learning
  • Given a set of visible observations, determine $a_{ij}$ and $b_{jk}$ (using the EM algorithm, known in this setting as the Baum-Welch algorithm).
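Evaluation
• The probability that a model produces $V^T$ can be computed using the theorem of total probability:
$$P(V^T) = \sum_{r=1}^{r_{max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T)$$
  where $\omega_r^T = (\omega(1), \omega(2), \ldots, \omega(T))$ is a possible state sequence and $r_{max}$ is the maximum number of state sequences.
• For a model with $c$ states $\omega_1, \omega_2, \ldots, \omega_c$: $r_{max} = c^T$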
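Evaluation (cont’d)
• We can rewrite each term as follows:
$$P(V^T \mid \omega_r^T) = \prod_{t=1}^{T} P(v(t) \mid \omega(t)), \qquad P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))$$
• Combining the two equations we have:
$$P(V^T) = \sum_{r=1}^{r_{max}} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))$$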
Evaluation (cont’d)
• Given $a_{ij}$ and $b_{jk}$, it is straightforward to compute $P(V^T)$.
• What is the computational complexity? $O(T\, r_{max}) = O(T\, c^T)$
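
The exponential cost is easy to see in code. The sketch below enumerates every hidden state sequence and sums the joint probabilities. It is a sketch under assumptions: the model values are hypothetical, the function name `brute_force_likelihood` is illustrative, and it uses a prior over the first state rather than an absorbing final state.

```python
from itertools import product


def brute_force_likelihood(obs, prior, trans, emit):
    """Sum P(V | w) P(w) over all c**T hidden state sequences w."""
    c = len(prior)
    total = 0.0
    for seq in product(range(c), repeat=len(obs)):  # c**T sequences
        p = prior[seq[0]] * emit[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[seq[t - 1]][seq[t]] * emit[seq[t]][obs[t]]
        total += p
    return total


# HYPOTHETICAL two-state, two-symbol model (symbol 0 = H, 1 = T).
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

obs = [0, 0, 1]  # H H T
print(brute_force_likelihood(obs, prior, trans, emit))
```

For c = 2 and T = 3 the loop visits 8 sequences; each extra time step multiplies the count by c.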
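Recursive computation of $P(V^T)$ (HMM Forward)
[Figure: trellis of hidden states $\omega(1), \ldots, \omega(t), \omega(t+1), \ldots, \omega(T)$ with the emitted visible states $v(1), \ldots, v(t), v(t+1), \ldots, v(T)$; state $\omega_i$ at time $t$ transitions to state $\omega_j$ at time $t+1$]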
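Recursive computation of $P(V^T)$ (HMM Forward) (cont’d)
• Define $\alpha_j(t) = P(v(1), \ldots, v(t), \omega(t)=\omega_j)$. Using marginalization:
$$\alpha_j(t+1) = \sum_{i=1}^{c} P(v(1), \ldots, v(t+1), \omega(t)=\omega_i, \omega(t+1)=\omega_j)$$
  or
$$\alpha_j(t+1) = \Big[\sum_{i=1}^{c} \alpha_i(t)\, a_{ij}\Big]\, b_{jk}, \quad \text{where } v(t+1) = v_k$$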
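Recursive computation of $P(V^T)$ (HMM Forward) (cont’d)
• Initialize $\alpha_j(0)$
• for $t = 1$ to $T$ do
    for $j = 1$ to $c$ do (if $t = T$, $j = 0$, i.e., corresponding to state $\omega(T) = \omega_0$)
      $\alpha_j(t) = b_{jk} \sum_{i} \alpha_i(t-1)\, a_{ij}$, where $v(t) = v_k$
• $P(V^T) = \alpha_0(T)$
• What is the computational complexity in this case? $O(T c^2)$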
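
A minimal sketch of the forward recursion follows. It is a common variant and rests on stated assumptions: it starts from a prior over the first state and sums over the final states, instead of reading off $\alpha_0(T)$ for an absorbing state; the model values are hypothetical and `forward` is an illustrative name.

```python
def forward(obs, prior, trans, emit):
    """Return (P(V), alpha) with alpha[t][j] = P(v(1..t+1), state j at step t+1).

    `t` here is a 0-based list index; obs holds 0-based symbol indices.
    """
    c = len(prior)
    alpha = [[prior[j] * emit[j][obs[0]] for j in range(c)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # Sum over predecessors i, then weight by the emission of obs[t] in state j.
        alpha.append([
            sum(prev[i] * trans[i][j] for i in range(c)) * emit[j][obs[t]]
            for j in range(c)
        ])
    return sum(alpha[-1]), alpha  # no absorbing state: sum over final states


# HYPOTHETICAL two-state, two-symbol model (symbol 0 = H, 1 = T).
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

likelihood, alpha = forward([0, 0, 1], prior, trans, emit)
print(likelihood)
```

Each time step costs c × c multiplications, which gives the $O(T c^2)$ total.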
Example
[Figure: model with states $\omega_0$, $\omega_1$, $\omega_2$, $\omega_3$ and its transition and observation probability tables]
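Example (cont’d)
• Observation sequence: $V^T = (v_1, v_3, v_2, v_0)$
[Figure: forward trellis starting from the initial state; edge labels 0.2, 0.2, 0.8]
• Similarly for $t = 2, 3, 4$
• Finally: $P(V^T) = \alpha_0(4)$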
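
The trellis for this example can be reproduced with a few lines of code. The matrices below are an assumption: they are the values of the textbook (Duda et al.) example that this slide appears to follow, and they are not present in the slide text itself. The initial state $\omega_1$ at $t = 0$ is likewise assumed.

```python
# ASSUMED transition matrix A[i][j] = a_ij over states w0..w3 (w0 is absorbing).
A = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.1, 0.4],
    [0.2, 0.5, 0.2, 0.1],
    [0.8, 0.1, 0.0, 0.1],
]
# ASSUMED emission matrix B[j][k] = b_jk over symbols v0..v4 (only w0 emits v0).
B = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.4, 0.1, 0.2],
    [0.0, 0.1, 0.1, 0.7, 0.1],
    [0.0, 0.5, 0.2, 0.1, 0.2],
]

obs = [1, 3, 2, 0]                  # v1 v3 v2 v0
trellis = [[0.0, 1.0, 0.0, 0.0]]    # ASSUMPTION: the system is in w1 at t = 0

for k in obs:
    prev = trellis[-1]
    # alpha_j(t) = b_jk * sum_i alpha_i(t-1) * a_ij
    trellis.append([
        sum(prev[i] * A[i][j] for i in range(4)) * B[j][k] for j in range(4)
    ])

p = trellis[-1][0]                  # alpha_0(T): probability of ending in w0
for t, row in enumerate(trellis):
    print(t, [round(x, 4) for x in row])
```

Under these assumed values the final entry $\alpha_0(4)$ comes out at about 0.0011.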
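Recursive computation of $P(V^T)$ (HMM backward)
• $\beta_i(t) = P(v(t+1), \ldots, v(T) \mid \omega(t)=\omega_i)$
• $\beta_j(t+1) = P(v(t+2), \ldots, v(T) \mid \omega(t+1)=\omega_j)$
[Figure: trellis of hidden states $\omega(1), \ldots, \omega(t), \omega(t+1), \ldots, \omega(T)$ with the emitted visible states $v(1), \ldots, v(T)$; state $\omega_i$ at time $t$ transitions to state $\omega_j$ at time $t+1$]
Recursive computation of $P(V^T)$ (HMM backward) (cont’d)
$$\beta_i(t) = \sum_{j=1}^{c} P(v(t+1), \ldots, v(T), \omega(t+1)=\omega_j \mid \omega(t)=\omega_i)$$
  or
$$\beta_i(t) = \sum_{j=1}^{c} \beta_j(t+1)\, a_{ij}\, b_{jk}, \quad \text{where } v(t+1) = v_k$$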
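
The backward recursion can be sketched in the same style. As an assumption, this version uses a prior over the first state and sets $\beta = 1$ at the final step (no absorbing state); the model values are hypothetical and `backward` is an illustrative name.

```python
def backward(obs, prior, trans, emit):
    """Return (P(V), beta) with beta[t][i] = P(later observations | state i at step t).

    `t` is a 0-based list index; obs holds 0-based symbol indices.
    """
    c, T = len(prior), len(obs)
    beta = [[1.0] * c for _ in range(T)]  # beta at the last step is 1
    for t in range(T - 2, -1, -1):
        # Move to state j, emit obs[t+1] there, then continue with beta[t+1][j].
        beta[t] = [
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(c))
            for i in range(c)
        ]
    likelihood = sum(prior[i] * emit[i][obs[0]] * beta[0][i] for i in range(c))
    return likelihood, beta


# HYPOTHETICAL two-state, two-symbol model (symbol 0 = H, 1 = T).
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

likelihood, beta = backward([0, 0, 1], prior, trans, emit)
print(likelihood)
```

The forward and backward passes give the same $P(V^T)$ for the same model and observations, which makes one a convenient check on the other.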
Decoding
• Find the most probable sequence of hidden states.
• Use an optimality criterion; different optimality criteria lead to different solutions.
• Algorithm 1: choose the states $\omega(t)$ which are individually most likely.
Decoding (cont’d)
• Algorithm 2: at each time step $t$, find the state that has the highest probability $\alpha_i(t)$ (i.e., use the forward algorithm with minor changes).
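Decoding – Algorithm 2 (cont’d)
• There is no guarantee that the path is a valid one.
• The path might imply a transition that is not allowed by the model.
• Example: [Figure: decoded path over $t = 0, 1, 2, 3, 4$ containing a transition from $\omega_3$ to $\omega_2$, which is not allowed since $a_{32} = 0$]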
Decoding (cont’d)
• Algorithm 3: find the single best sequence $\omega^T$ by maximizing $P(\omega^T \mid V^T)$.
• This is the most widely used approach, known as the Viterbi algorithm.
Decoding – Algorithm 3
• maximize: $P(\omega^T \mid V^T)$
Decoding – Algorithm 3 (cont’d)
• Recursion: similar to the Forward algorithm, except that it uses maximization over previous states instead of summation.
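
A minimal Viterbi sketch follows, under assumptions: a prior over the first state, no absorbing state, hypothetical model values, and the illustrative name `viterbi`. It maximizes the joint probability $P(\omega^T, V^T)$, which has the same maximizer as $P(\omega^T \mid V^T)$ because $P(V^T)$ does not depend on the state sequence.

```python
def viterbi(obs, prior, trans, emit):
    """Return (best_path, P(best_path, V)) using max instead of sum."""
    c = len(prior)
    delta = [prior[j] * emit[j][obs[0]] for j in range(c)]  # best score ending in j
    back = []  # back[t-1][j] = best predecessor of state j at step t
    for t in range(1, len(obs)):
        pointers, new_delta = [], []
        for j in range(c):
            i_best = max(range(c), key=lambda i: delta[i] * trans[i][j])
            pointers.append(i_best)
            new_delta.append(delta[i_best] * trans[i_best][j] * emit[j][obs[t]])
        back.append(pointers)
        delta = new_delta
    # Backtrack from the best final state.
    last = max(range(c), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(back):
        path.insert(0, pointers[path[0]])
    return path, delta[last]


# HYPOTHETICAL two-state, two-symbol model (symbol 0 = H, 1 = T).
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

path, score = viterbi([0, 0, 1], prior, trans, emit)
print(path, score)
```

Because every step of the path is chosen through an actual transition, a transition with zero probability can never appear in a path with non-zero score.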
Learning
• Determine the transition and emission probabilities $a_{ij}$ and $b_{jk}$ from a set of training examples (i.e., observation sequences $V_1^T, V_2^T, \ldots, V_n^T$).
• There is no known way to find the ML solution analytically.
• It would be easy if we knew the hidden states.
• This is a hidden-variable problem: use the EM algorithm!
Learning (cont’d)
• EM algorithm
  • Update $a_{ij}$ and $b_{jk}$ iteratively to better explain the observed training sequences $V$: $V_1^T, V_2^T, \ldots, V_n^T$.
  • Expectation step: compute $p(\omega^T \mid V, \theta)$
  • Maximization step: $\theta^{t+1} = \arg\max_{\theta} E[\log p(\omega^T, V^T \mid \theta) \mid V^T, \theta^t]$
Learning (cont’d)
• Updating transition/emission probabilities:
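
The slide's update equations are not present in the text, so the sketch below uses the standard Baum-Welch re-estimation for a single observation sequence as a stand-in: expected transition counts divided by expected visits for $a_{ij}$, and expected emissions of $v_k$ divided by expected visits for $b_{jk}$. It assumes a prior over the first state and no absorbing state. The function names, the model values, and the observation sequence are hypothetical.

```python
def forward_backward(obs, prior, trans, emit):
    """Return (alpha, beta, P(V)) for one observation sequence."""
    c, T = len(prior), len(obs)
    alpha = [[prior[j] * emit[j][obs[0]] for j in range(c)]]
    for t in range(1, T):
        alpha.append([
            sum(alpha[t - 1][i] * trans[i][j] for i in range(c)) * emit[j][obs[t]]
            for j in range(c)
        ])
    beta = [[1.0] * c for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(c))
            for i in range(c)
        ]
    return alpha, beta, sum(alpha[-1])


def baum_welch_step(obs, prior, trans, emit):
    """One EM update; returns the re-estimated (prior, trans, emit)."""
    c, m, T = len(prior), len(emit[0]), len(obs)
    alpha, beta, like = forward_backward(obs, prior, trans, emit)
    # E-step: gamma[t][i] = P(state i at t | V); xi[t][i][j] = P(i at t, j at t+1 | V)
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(c)] for t in range(T)]
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(c)] for i in range(c)] for t in range(T - 1)]
    # M-step: normalised expected counts.
    new_prior = gamma[0][:]
    new_trans = [[sum(xi[t][i][j] for t in range(T - 1))
                  / sum(gamma[t][i] for t in range(T - 1))
                  for j in range(c)] for i in range(c)]
    new_emit = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
                 / sum(gamma[t][j] for t in range(T))
                 for k in range(m)] for j in range(c)]
    return new_prior, new_trans, new_emit


# HYPOTHETICAL starting model and observation sequence (symbol 0 = H, 1 = T).
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 0, 1, 1, 0, 0]

before = forward_backward(obs, prior, trans, emit)[2]
new_prior, new_trans, new_emit = baum_welch_step(obs, prior, trans, emit)
after = forward_backward(obs, new_prior, new_trans, new_emit)[2]
print(before, after)
```

Each EM step should not decrease $P(V^T \mid \theta)$; in practice the step is repeated until the likelihood stops improving, and long sequences need scaling or log-space arithmetic to avoid underflow.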