Hidden Markov Models Part 2: Algorithms CSE 4309 – Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Hidden Markov Model • An HMM consists of: • A set of states $\{1, \dots, K\}$. • An initial state probability function $\pi_k = p(z_1 = k)$. • A state transition matrix $A$, of size $K \times K$, where $A_{j,k} = p(z_{n+1} = k \mid z_n = j)$ • Observation probability functions, also called emission probabilities, defined as: $\varphi_k(x) = p(x_n = x \mid z_n = k)$
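As a concrete illustration (not part of the original slides), the parameters of a small HMM with discrete observations can be stored as numpy arrays. The emission matrix B below is one possible way to represent the emission functions: B[k, v] plays the role of $\varphi_k(v)$; the sizes and uniform values are arbitrary placeholders.

```python
import numpy as np

# Hypothetical storage for HMM parameters, assuming K states and
# discrete observations drawn from V possible symbols:
#   pi[k]   = p(z_1 = k)                 (initial state probabilities)
#   A[j, k] = p(z_{n+1} = k | z_n = j)   (state transition matrix, K x K)
#   B[k, v] = p(x_n = v | z_n = k)       (emission probabilities, K x V)
K, V = 3, 4
pi = np.full(K, 1.0 / K)        # uniform initial distribution
A = np.full((K, K), 1.0 / K)    # uniform transitions
B = np.full((K, V), 1.0 / V)    # uniform emissions
# Each distribution must sum to 1.
assert np.allclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```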
The Basic HMM Problems • We will next see algorithms for the four problems that we usually want to solve when using HMMs. • Problem 1: learn an HMM using training data. This involves: • Learning the initial state probabilities $\pi_k$. • Learning the state transition matrix $A$. • Learning the observation probability functions $\varphi_k$. • Problems 2, 3, 4 assume we have already learned an HMM. • Problem 2: given a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \theta)$: the probability of $X$ given the model. • Problem 3: given an observation sequence $X$, find the hidden state sequence $Z$ maximizing $p(Z \mid X, \theta)$ • Problem 4: given an observation sequence $X$, compute, for any $n$ and $k$, the probability $p(z_n = k \mid X, \theta)$
The Basic HMM Problems • Problem 1: learn an HMM using training data. • Problem 2: given a sequence of observations $X$, compute $p(X \mid \theta)$: the probability of $X$ given the model. • Problem 3: given an observation sequence $X$, find the hidden state sequence $Z$ maximizing $p(Z \mid X, \theta)$ • Problem 4: given an observation sequence $X$, compute, for any $n$ and $k$, the probability $p(z_n = k \mid X, \theta)$ • We will first look at algorithms for problems 2, 3, 4. • Last, we will look at the standard algorithm for learning an HMM using training data.
Probability of Observations • Problem 2: given an HMM model $\theta = (\pi, A, \varphi)$, and a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \theta)$: the probability of $X$ given the model. • Inputs: • $\theta = (\pi, A, \varphi)$: a trained HMM, with probability functions $\pi$, $A$, and $\varphi$ specified. • A sequence of observations $X = (x_1, \dots, x_N)$. • Output: $p(X \mid \theta)$.
Probability of Observations • Problem 2: given a sequence of observations $X$, compute $p(X \mid \theta)$. • Why do we care about this problem? • What would be an example application?
Probability of Observations • Problem 2: given a sequence of observations $X$, compute $p(X \mid \theta)$. • Why do we care about this problem? • We need to compute $p(X \mid \theta)$ to classify $X$. • Suppose we have multiple models $\theta_1, \dots, \theta_C$. • For example, we can have one model for digit 2, one model for digit 3, one model for digit 4… • A Bayesian classifier classifies $X$ by finding the $\theta_c$ that maximizes $p(\theta_c \mid X)$. • Using Bayes rule, we must maximize $p(X \mid \theta_c)\, p(\theta_c)$. • Therefore, we need to compute $p(X \mid \theta_c)$ for each $\theta_c$.
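A minimal sketch of this Bayesian decision rule. The function and argument names are illustrative, not from the slides: `models` holds one parameter tuple per class, `priors` holds the class priors $p(\theta_c)$, and `likelihood_fn` is any function that computes $p(X \mid \theta_c)$, for example the forward-algorithm sketch shown later in this section.

```python
import numpy as np

def classify(X, models, priors, likelihood_fn):
    """Return the index of the class c maximizing p(X | theta_c) * p(theta_c).

    models:        list of HMM parameter tuples (pi, A, B), one per class
    priors:        list of class priors p(theta_c)
    likelihood_fn: any function computing p(X | theta), e.g. the forward
                   algorithm sketched later in these notes
    """
    scores = [likelihood_fn(X, *m) * p for m, p in zip(models, priors)]
    return int(np.argmax(scores))
```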
The Sum Rule • The sum rule, which we saw earlier in the course, states that: $p(X) = \sum_{Z} p(X, Z)$ • We can use the sum rule to compute $p(X \mid \theta)$ as follows: $p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$ • According to this formula, we can compute $p(X \mid \theta)$ by summing $p(X, Z \mid \theta)$ over all possible state sequences $Z = (z_1, \dots, z_N)$.
The Sum Rule • According to this formula, we can compute $p(X \mid \theta)$ by summing $p(X, Z \mid \theta)$ over all possible state sequences $Z$. • An HMM with parameters $\theta = (\pi, A, \varphi)$ specifies a joint distribution function $p(X, Z \mid \theta)$ as: $p(X, Z \mid \theta) = p(z_1) \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \prod_{n=1}^{N} p(x_n \mid z_n) = \pi_{z_1} \prod_{n=2}^{N} A_{z_{n-1}, z_n} \prod_{n=1}^{N} \varphi_{z_n}(x_n)$ • Therefore: $p(X \mid \theta) = \sum_{Z} \pi_{z_1} \prod_{n=2}^{N} A_{z_{n-1}, z_n} \prod_{n=1}^{N} \varphi_{z_n}(x_n)$
The Sum Rule • According to this formula, we can compute $p(X \mid \theta)$ by summing $p(X, Z \mid \theta)$ over all possible state sequences $Z$. • What would be the time complexity of doing this computation directly? • It would be linear in the number of all possible state sequences $Z$, which is typically exponential in $N$ (the length of $X$). • Luckily, there is a polynomial-time algorithm for computing $p(X \mid \theta)$ that uses dynamic programming.
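To make the cost concrete, here is a brute-force sketch of the summation above, assuming the numpy parameter arrays pi, A, B from the earlier sketch (discrete observations, 0-based indexing). It literally enumerates all $K^N$ state sequences, so it is only feasible for very small $N$ and $K$.

```python
import itertools
import numpy as np

def prob_obs_brute_force(X, pi, A, B):
    """p(X | theta) by summing p(X, Z | theta) over every state sequence Z.

    Exponential in the length N of X: the loop runs over K**N sequences.
    X is a sequence of discrete symbol indices; B[k, v] = p(x = v | z = k).
    """
    N, K = len(X), len(pi)
    total = 0.0
    for Z in itertools.product(range(K), repeat=N):   # all K**N state sequences
        p = pi[Z[0]] * B[Z[0], X[0]]                  # p(z_1) * p(x_1 | z_1)
        for n in range(1, N):
            p *= A[Z[n - 1], Z[n]] * B[Z[n], X[n]]    # p(z_n | z_{n-1}) * p(x_n | z_n)
        total += p
    return total
```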
Dynamic Programming for $p(X \mid \theta)$ • To compute $p(X \mid \theta)$, we use a dynamic programming algorithm that is called the forward algorithm. • We define a 2D array of problems, of size $N \times K$. • $N$: length of observation sequence $X$. • $K$: number of states in the HMM. • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$
The Forward Algorithm - Initialization • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ • Problem $(1, k)$:
The Forward Algorithm - Initialization • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ • Problem $(1, k)$: • Compute $\alpha[1][k] = p(x_1, z_1 = k \mid \theta)$ • Solution: • ???
The Forward Algorithm - Initialization • Problem $(1, k)$: • Compute $\alpha[1][k] = p(x_1, z_1 = k \mid \theta)$ • Solution: $\alpha[1][k] = p(z_1 = k)\, p(x_1 \mid z_1 = k) = \pi_k\, \varphi_k(x_1)$
The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n > 1$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ • Solution:
The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n > 1$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ • Solution: $\alpha[n][k] = \sum_{j=1}^{K} p(x_1, \dots, x_{n-1}, z_{n-1} = j, x_n, z_n = k \mid \theta)$ (by the sum rule)
The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n > 1$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ • Solution: $\alpha[n][k] = \sum_{j=1}^{K} p(x_1, \dots, x_{n-1}, z_{n-1} = j \mid \theta)\, p(z_n = k \mid z_{n-1} = j)\, p(x_n \mid z_n = k)$ (by the chain rule and the Markov assumptions)
The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n > 1$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ • Solution: $\alpha[n][k] = \sum_{j=1}^{K} \alpha[n-1][j]\, A_{j,k}\, \varphi_k(x_n)$ • Thus, $\alpha[n][k] = \varphi_k(x_n) \sum_{j=1}^{K} \alpha[n-1][j]\, A_{j,k}$. • $\alpha[n][k]$ is easy to compute once all $\alpha[n-1][j]$ values have been computed.
The Forward Algorithm • Our aim was to compute $p(X \mid \theta)$. • Using the dynamic programming algorithm we just described, we can compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$ for all $n$ and $k$. • How can we use those values to compute $p(X \mid \theta)$? • We use the sum rule: $p(X \mid \theta) = \sum_{k=1}^{K} p(x_1, \dots, x_N, z_N = k \mid \theta) = \sum_{k=1}^{K} \alpha[N][k]$
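A minimal sketch of the forward algorithm under the same assumptions as before (numpy arrays pi, A, B, discrete observations, 0-based indexing in code while the slides use 1-based indices).

```python
import numpy as np

def forward(X, pi, A, B):
    """Forward algorithm: returns (alpha, p(X | theta)).

    With 0-based n, alpha[n, k] = p(x_1, ..., x_{n+1}, z_{n+1} = k | theta).
    X is a sequence of discrete symbol indices; B[k, v] = p(x = v | z = k).
    """
    N, K = len(X), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * B[:, X[0]]              # initialization: pi_k * phi_k(x_1)
    for n in range(1, N):
        # recurrence: alpha[n][k] = phi_k(x_n) * sum_j alpha[n-1][j] * A[j, k]
        alpha[n] = B[:, X[n]] * (alpha[n - 1] @ A)
    return alpha, alpha[N - 1].sum()        # p(X | theta) = sum_k alpha[N][k]
```

For long sequences, a practical implementation would work with logarithms or rescaled alpha values to avoid numerical underflow; the sketch above keeps the plain probabilities to match the slides.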
Problem 3: Find the Best $Z$ • Problem 3: • Inputs: a trained HMM $\theta$, and an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid X, \theta)$. • Why do we care? • Sometimes, we want to use HMMs for classification. • In those cases, we do not really care about maximizing $p(Z \mid X, \theta)$, we just want to find the HMM parameters $\theta_c$ that maximize $p(X \mid \theta_c)\, p(\theta_c)$. • Sometimes, we want to use HMMs to figure out the most likely values of the hidden states given the observations. • In the tree ring example, our goal was to figure out the average temperature for each year. • Those average temperatures were the hidden states. • Solving problem 3 provides the most likely sequence of average temperatures.
Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z$ maximizing $p(Z \mid X, \theta)$. • We want to maximize $p(Z \mid X, \theta)$. • Using the definition of conditional probabilities: $p(Z \mid X, \theta) = \frac{p(X, Z \mid \theta)}{p(X \mid \theta)}$ • Since $X$ is known, $p(X \mid \theta)$ will be the same over all possible state sequences $Z$. • Therefore, to find the $Z$ that maximizes $p(Z \mid X, \theta)$, it suffices to find the $Z$ that maximizes $p(X, Z \mid \theta)$.
Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X$ • Output: the state sequence $Z$ maximizing $p(Z \mid X, \theta)$, which is the same as maximizing $p(X, Z \mid \theta)$. • Solution: (again) dynamic programming. • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k] = \operatorname{argmax}_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ • Note that: • We optimize over all possible values $z_1, \dots, z_{n-1}$. • However, $z_n$ is constrained to be equal to $k$. • Values $x_1, \dots, x_n$ are known, and thus they are fixed.
Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X$ • Output: the state sequence $Z$ maximizing $p(Z \mid X, \theta)$, which is the same as maximizing $p(X, Z \mid \theta)$. • Solution: (again) dynamic programming. • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • $V[n][k]$ is the joint probability of: • the first $n$ observations $x_1, \dots, x_n$ • the sequence we stored in $U[n][k]$ (followed by state $z_n = k$)
Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X$ • Output: the state sequence $Z$ maximizing $p(Z \mid X, \theta)$, which is the same as maximizing $p(X, Z \mid \theta)$. • Solution: (again) dynamic programming. • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • $V[n][k]$ is the joint probability of: • the first $n$ observations $x_1, \dots, x_n$ • the sequence we stored in $U[n][k]$ (followed by state $z_n = k$) • $U[n][k]$ stores a sequence of state values. • $V[n][k]$ stores a number (a probability).
The Viterbi Algorithm • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • We use a dynamic programming algorithm to compute $U[n][k]$ and $V[n][k]$ for all $n, k$ such that $1 \le n \le N$ and $1 \le k \le K$. • This dynamic programming algorithm is called the Viterbi algorithm.
The Viterbi Algorithm - Initialization • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • What is $U[1][k]$? • It is the empty sequence. • We should maximize over the previous states $z_1, \dots, z_{n-1}$, but $n = 1$, so there are no previous states. • What is $V[1][k]$? • $V[1][k] = p(x_1, z_1 = k \mid \theta) = \pi_k\, \varphi_k(x_1)$.
The Viterbi Algorithm – Main Loop • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • How do we find $U[n][k]$? • $U[n][k] = (z_1, \dots, z_{n-2}, j)$ for some $z_1, \dots, z_{n-2}$ and some $j$. • The last element of $U[n][k]$ is some state $j$. • Therefore, $U[n][k]$ is formed by appending some state $j$ to some sequence $(z_1, \dots, z_{n-2})$
The Viterbi Algorithm – Main Loop • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • $U[n][k] = (z_1, \dots, z_{n-2}, j)$ for some $z_1, \dots, z_{n-2}$ and some $j$. • We can prove that $(z_1, \dots, z_{n-2}) = U[n-1][j]$, for that $j$. • The proof is very similar to the proof that DTW finds the optimal alignment. • If $U[n-1][j]$ were better than $(z_1, \dots, z_{n-2})$, then the sequence formed by appending $j$ to $U[n-1][j]$ would be better than $U[n][k]$, which is a contradiction.
The Viterbi Algorithm – Main Loop • Problem $(n, k)$: Compute $U[n][k]$ and $V[n][k]$, where $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k \mid \theta)$ and $U[n][k]$ is the maximizing $(z_1, \dots, z_{n-1})$. • $U[n][k] = (U[n-1][j], j)$ for some $j$. • Let's define $j^*$ as: $j^* = \operatorname{argmax}_{j}\, V[n-1][j]\, A_{j,k}$ • Then: • $U[n][k] = (U[n-1][j^*], j^*)$ • $V[n][k] = V[n-1][j^*]\, A_{j^*,k}\, \varphi_k(x_n)$.
The Viterbi Algorithm – Output • Our goal is to find the $Z$ maximizing $p(Z \mid X, \theta)$, which is the same as finding the $Z$ maximizing $p(X, Z \mid \theta)$. • The Viterbi algorithm computes $U[N][k]$ and $V[N][k]$, where: $V[N][k] = \max_{z_1, \dots, z_{N-1}} p(x_1, \dots, x_N, z_1, \dots, z_{N-1}, z_N = k \mid \theta)$ • Let's define $k^*$ as: $k^* = \operatorname{argmax}_{k}\, V[N][k]$ • Then, the $Z$ maximizing $p(X, Z \mid \theta)$ is $Z = (U[N][k^*], k^*)$
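A sketch of the Viterbi algorithm under the same assumptions as the forward sketch. Instead of storing the whole sequences $U[n][k]$, it stores only the best previous state $j^*$ for each $(n, k)$ (a backpointer), which carries the same information and is recovered by backtracking at the end.

```python
import numpy as np

def viterbi(X, pi, A, B):
    """Viterbi algorithm: the state sequence Z maximizing p(X, Z | theta).

    V[n, k] holds the highest joint probability of x_1..x_{n+1} and any state
    sequence ending in state k (0-based n); back[n, k] remembers the best
    previous state, playing the role of the stored sequences U[n][k].
    """
    N, K = len(X), len(pi)
    V = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    V[0] = pi * B[:, X[0]]                        # V[1][k] = pi_k * phi_k(x_1)
    for n in range(1, N):
        for k in range(K):
            scores = V[n - 1] * A[:, k]           # V[n-1][j] * A[j, k] for all j
            back[n, k] = np.argmax(scores)        # the j* of the slides
            V[n, k] = scores[back[n, k]] * B[k, X[n]]
    # Trace the best final state k* back to the start.
    Z = [int(np.argmax(V[N - 1]))]
    for n in range(N - 1, 0, -1):
        Z.append(int(back[n, Z[-1]]))
    return Z[::-1], V[N - 1].max()                # best sequence and its probability
```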
State Probabilities at Specific Times • Problem 4: • Inputs: a trained HMM $\theta$, and an observation sequence $X = (x_1, \dots, x_N)$ • Output: an array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X, \theta)$. • In words: given an observation sequence $X$, we want to compute, for any moment in time $n$ and any state $k$, the probability that the hidden state at that moment is $k$.
State Probabilities at Specific Times • Problem 4: • Inputs: a trained HMM $\theta$, and an observation sequence $X$ • Output: an array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X, \theta)$. • We have seen, in our solution for problem 2, that the forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$. • We can also define another array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$. • In words, $\beta[n][k]$ is the probability of all observations after moment $n$, given that the state at moment $n$ is $k$.
State Probabilities at Specific Times • Problem 4: • Inputs: a trained HMM $\theta$, and an observation sequence $X$ • Output: an array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X, \theta)$. • We have seen, in our solution for problem 2, that the forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$. • We can also define another array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$. • Then, $\alpha[n][k]\, \beta[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)\, p(x_{n+1}, \dots, x_N \mid z_n = k, \theta) = p(X, z_n = k \mid \theta)$ • Therefore: $\gamma[n][k] = p(z_n = k \mid X, \theta) = \frac{\alpha[n][k]\, \beta[n][k]}{p(X \mid \theta)}$
State Probabilities at Specific Times • The forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k \mid \theta)$. • We define another array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$. • Then, $\gamma[n][k] = \frac{\alpha[n][k]\, \beta[n][k]}{p(X \mid \theta)}$ • In the above equation: • $\alpha[n][k]$ and $p(X \mid \theta)$ are computed by the forward algorithm. • We need to compute $\beta[n][k]$. • We compute $\beta$ in a way very similar to the forward algorithm, but going backwards this time. • The resulting combination of the forward and backward algorithms is called (not surprisingly) the forward-backward algorithm.
The Backward Algorithm • We want to compute the values of an array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$. • Again, we will use dynamic programming. • Problem $(n, k)$ is to compute value $\beta[n][k]$. • However, this time we start from the end of the observations. • First, compute $\beta[N][k]$. • In this case, the observation sequence $x_{N+1}, \dots, x_N$ is empty, because $n = N$. • What is the probability of an empty sequence?
Backward Algorithm - Initialization • We want to compute the values of an array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$. • Again, we will use dynamic programming. • Problem $(n, k)$ is to compute value $\beta[n][k]$. • However, this time we start from the end of the observations. • First, compute $\beta[N][k]$. • In this case, the observation sequence $x_{N+1}, \dots, x_N$ is empty, because $n = N$. • What is the probability of an empty sequence? • It is 1, so $\beta[N][k] = 1$ for every state $k$.
Backward Algorithm - Initialization • Next: compute $\beta[N-1][k] = p(x_N \mid z_{N-1} = k, \theta)$.
Backward Algorithm - Initialization • Next: compute $\beta[N-1][k] = p(x_N \mid z_{N-1} = k, \theta) = \sum_{j=1}^{K} p(z_N = j \mid z_{N-1} = k)\, p(x_N \mid z_N = j) = \sum_{j=1}^{K} A_{k,j}\, \varphi_j(x_N)$.
Backward Algorithm – Main Loop • Next: compute $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$, for $n < N$.
Backward Algorithm – Main Loop • Next: compute $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta) = \sum_{j=1}^{K} A_{k,j}\, \varphi_j(x_{n+1})\, \beta[n+1][j]$, for $n < N$. • We will take a closer look at the last step…
Backward Algorithm – Main Loop • $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k, \theta)$, by definition of $\beta[n][k]$.
Backward Algorithm – Main Loop • By application of the sum rule and the chain rule: $p(x_{n+1}, \dots, x_N \mid z_n = k, \theta) = \sum_{j=1}^{K} p(z_{n+1} = j \mid z_n = k)\, p(x_{n+1}, \dots, x_N \mid z_{n+1} = j, z_n = k, \theta)$, which can also be written as $\sum_{j=1}^{K} p(z_{n+1} = j \mid z_n = k)\, p(x_{n+1} \mid z_{n+1} = j, z_n = k)\, p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1} = j, z_n = k, \theta)$
Backward Algorithm – Main Loop • $p(z_{n+1} = j \mid z_n = k) = A_{k,j}$ and $p(x_{n+1} \mid z_{n+1} = j, z_n = k) = \varphi_j(x_{n+1})$ (by definition of $A$ and $\varphi$)
Backward Algorithm – Main Loop • $p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1} = j, z_n = k, \theta) = p(x_{n+2}, \dots, x_N \mid z_{n+1} = j, \theta)$ ($x_{n+2}, \dots, x_N$ are conditionally independent of $x_{n+1}$ and $z_n$ given $z_{n+1}$)
Backward Algorithm – Main Loop • In the previous slides, we showed that $\beta[n][k] = \sum_{j=1}^{K} A_{k,j}\, \varphi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid z_{n+1} = j, \theta)$ • Note that $p(x_{n+2}, \dots, x_N \mid z_{n+1} = j, \theta)$ is $\beta[n+1][j]$. • Consequently, we obtain the recurrence: $\beta[n][k] = \sum_{j=1}^{K} A_{k,j}\, \varphi_j(x_{n+1})\, \beta[n+1][j]$
Backward Algorithm – Main Loop • Thus, we can easily compute all values $\beta[n][k]$ using this recurrence. • The key thing is to proceed in decreasing order of $n$, since the values $\beta[n][k]$ depend on the values $\beta[n+1][j]$.
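A sketch of the backward algorithm under the same assumptions as the forward sketch (numpy arrays, discrete observations, 0-based indexing in code).

```python
import numpy as np

def backward(X, A, B):
    """Backward algorithm.

    With 0-based n, beta[n, k] = p(x_{n+2}, ..., x_N | z_{n+1} = k, theta),
    i.e. the beta[n][k] of the slides shifted to 0-based indexing.
    """
    N, K = len(X), A.shape[0]
    beta = np.zeros((N, K))
    beta[N - 1] = 1.0                        # probability of the empty sequence
    for n in range(N - 2, -1, -1):           # proceed in decreasing order of n
        # recurrence: beta[n][k] = sum_j A[k, j] * phi_j(x_{n+1}) * beta[n+1][j]
        beta[n] = A @ (B[:, X[n + 1]] * beta[n + 1])
    return beta
```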
The Forward-Backward Algorithm • Inputs: • A trained HMM $\theta = (\pi, A, \varphi)$. • An observation sequence $X = (x_1, \dots, x_N)$ • Output: • An array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X, \theta)$. • Algorithm: • Use the forward algorithm to compute $\alpha[n][k]$ and $p(X \mid \theta)$. • Use the backward algorithm to compute $\beta[n][k]$. • For $n = 1$ to $N$: • For $k = 1$ to $K$: $\gamma[n][k] = \frac{\alpha[n][k]\, \beta[n][k]}{p(X \mid \theta)}$
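Combining the two passes gives $\gamma$. This sketch assumes the `forward` and `backward` functions from the earlier sketches are in scope.

```python
def forward_backward(X, pi, A, B):
    """gamma[n, k] = p(z_{n+1} = k | X, theta), combining the two sketches above."""
    alpha, p_X = forward(X, pi, A, B)   # forward pass (sketch above)
    beta = backward(X, A, B)            # backward pass (sketch above)
    return alpha * beta / p_X           # gamma[n][k] = alpha[n][k] * beta[n][k] / p(X | theta)
```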
Problem 1: Training an HMM • Goal: Estimate parameters $\theta = (\pi, A, \varphi)$. • The training data can consist of multiple observation sequences $X_1, X_2, \dots, X_M$. • We denote the length of training sequence $X_m$ as $N_m$. • We denote the elements of each observation sequence $X_m$ as: $X_m = (x_{m,1}, x_{m,2}, \dots, x_{m,N_m})$ • Before we start training, we need to decide on: • The number of states. • The transitions that we will allow.
Problem 1: Training an HMM • While we are given the observation sequences $X_m$, we are not given the corresponding hidden state sequences $Z_m$. • We denote the elements of each hidden state sequence $Z_m$ as: $Z_m = (z_{m,1}, z_{m,2}, \dots, z_{m,N_m})$ • The training algorithm is called the Baum-Welch algorithm. • It is an Expectation-Maximization algorithm, similar to the algorithm we saw for learning Gaussian mixtures.
Expectation-Maximization • When we wanted to learn a mixture of Gaussians, we had the following problem: • If we knew the probability of each object belonging to each Gaussian, we could estimate the parameters of each Gaussian. • If we knew the parameters of each Gaussian, we could estimate the probability of each object belonging to each Gaussian. • However, we know neither of these pieces of information. • The EM algorithm resolved this problem using: • An initialization of Gaussian parameters to some random or non-random values. • A main loop where: • The current values of the Gaussian parameters are used to estimate new weights of membership of every training object to every Gaussian. • The current estimated membership weights are used to estimate new parameters (mean and covariance matrix) for each Gaussian.
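For reference, one Baum-Welch iteration follows exactly this EM pattern. The sketch below is a simplified, single-sequence version with discrete emissions stored in a hypothetical matrix B; it also uses the pairwise state posteriors $\xi[n][j][k] = p(z_n = j, z_{n+1} = k \mid X, \theta)$, which a full treatment of Baum-Welch defines alongside $\gamma$. It is a sketch under these assumptions, not the full algorithm from the slides.

```python
import numpy as np

def baum_welch_step(X, pi, A, B):
    """One EM (Baum-Welch) iteration for a single discrete-observation
    sequence X (a list of symbol indices). Returns updated (pi, A, B),
    where B[k, v] plays the role of the emission probability phi_k(v)."""
    N, K = len(X), len(pi)
    # E-step: forward (alpha) and backward (beta) passes, as in the sketches above.
    alpha = np.zeros((N, K)); beta = np.zeros((N, K))
    alpha[0] = pi * B[:, X[0]]
    for n in range(1, N):
        alpha[n] = B[:, X[n]] * (alpha[n - 1] @ A)
    beta[N - 1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, X[n + 1]] * beta[n + 1])
    p_X = alpha[N - 1].sum()
    # State and state-pair posteriors given the current parameters.
    gamma = alpha * beta / p_X                 # gamma[n, k] = p(z_n = k | X)
    xi = np.zeros((N - 1, K, K))               # xi[n, j, k] = p(z_n = j, z_{n+1} = k | X)
    for n in range(N - 1):
        xi[n] = alpha[n][:, None] * A * (B[:, X[n + 1]] * beta[n + 1])[None, :] / p_X
    # M-step: re-estimate parameters from the posteriors.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        new_B[:, v] = gamma[np.array(X) == v].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```

In practice this step would be repeated until the likelihood $p(X \mid \theta)$ stops improving, and the posteriors would be accumulated over all training sequences rather than a single one.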