Hidden Markov Models M. Vijay Venkatesh
Outline • Introduction • Graphical Model • Parameterization • Inference • Summary
Introduction • The Hidden Markov Model (HMM) is a graphical model for modeling sequential data. • The states are no longer independent: the state at any given step depends on the state chosen at the previous step. • HMMs are generalizations of mixture models, with a transition matrix linking the states at neighboring steps.
Introduction • Inference in an HMM takes the observed data as input and yields a probability distribution over the underlying hidden states. • Since the states are dependent across time, inference is somewhat more involved than for mixture models.
Graphical Model • [Figure: the HMM unrolled in time as a chain of state nodes Q0, Q1, Q2, …, QT linked by the transition matrix A, with the initial distribution π on Q0 and an output node Yt attached to each Qt.] • The top node in each slice represents the multinomial state variable Qt and the bottom node represents the observable output variable Yt.
Graphical Model • Conditioning on the state Qt renders Qt-1 and Qt+1 independent. • More generally, Qs is independent of Qu for s < t < u when we condition on Qt. • The same holds for the output nodes Ys and Yu when conditioned on the state node Qt. • Conditioning on an output node does not yield any conditional independencies. • Indeed, conditioning on all the output nodes fails to induce any independencies among the state nodes.
Parameterization • State transition matrix A, where the (i, j) entry of A is defined as the transition probability aij = P(qt+1 = j | qt = i). • Each output node has a single state node as its parent; therefore we require the emission probability P(yt | qt). • For a particular configuration (q, y), the joint probability is expressed as P(q, y) = P(q0) ∏t=0..T-1 P(qt+1 | qt) ∏t=0..T P(yt | qt).
Parameterization • To introduce the A and π parameters into the joint probability equation, we rewrite the transition probabilities and the unconditional initial node distribution as P(qt+1 | qt) = aqt,qt+1 and P(q0) = π(q0). • We then get the joint probability P(q, y) = π(q0) ∏t=0..T-1 aqt,qt+1 ∏t=0..T P(yt | qt), sketched in code below.
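To make the factorization concrete, here is a minimal sketch in NumPy, assuming a small discrete HMM with hypothetical parameters pi, A, and B (these names and values are illustrative, not taken from the slides):

```python
import numpy as np

# Hypothetical HMM: M hidden states, K output symbols.
M, K = 3, 2
rng = np.random.default_rng(0)
pi = np.full(M, 1.0 / M)                  # initial distribution pi(q0)
A = rng.dirichlet(np.ones(M), size=M)     # A[i, j] = P(q_{t+1} = j | q_t = i)
B = rng.dirichlet(np.ones(K), size=M)     # B[i, k] = P(y_t = k | q_t = i)

def joint_probability(q, y):
    """P(q, y) = pi(q0) * prod_t a_{q_t, q_{t+1}} * prod_t P(y_t | q_t)."""
    p = pi[q[0]]
    for t in range(len(q) - 1):
        p *= A[q[t], q[t + 1]]
    for t in range(len(q)):
        p *= B[q[t], y[t]]
    return p

# One particular configuration of states and outputs (T + 1 = 4 steps).
print(joint_probability([0, 1, 1, 2], [0, 1, 0, 1]))
```

The later sketches reuse pi, A, B, and M from this block.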
Inference • The general inference problem is to compute the probability of the hidden states q given an observable output sequence y. • Marginal probability of a particular hidden state qt given the output sequence. • Probabilities conditioned on partial output: • Filtering • Prediction • Smoothing, where we calculate a posterior probability for a state based on data up to and including a later time step.
Inference • Let's calculate P(qt | y), where y = (y0, …, yT) is the entire observable output sequence. • We can calculate P(qt | y) = P(qt, y) / P(y). • But to calculate P(y), we need to sum the joint probability over all possible values of the hidden states. • Each state can take M possible values and we have T state nodes, which implies that the naive computation requires on the order of M^T sums (see the brute-force sketch below).
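As an illustration of why the naive sum is infeasible, here is a brute-force sketch that enumerates every hidden-state sequence; it reuses the hypothetical joint_probability and M from the earlier block and is exponential in the sequence length:

```python
from itertools import product

def likelihood_bruteforce(y):
    """P(y) by summing the joint over every hidden-state sequence (M^len(y) terms)."""
    return sum(joint_probability(list(q), y)
               for q in product(range(M), repeat=len(y)))

print(likelihood_bruteforce([0, 1, 0, 1]))
```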
Inference • Each factor in the joint probability involves only one or two state variables. • It is therefore possible to move the sums inside the product and carry them out in a systematic way. • Moving the sums inside and organizing them into a recursion reduces the computation significantly.
Inference • Rather than computing P(q | y) for the entire state sequence, we focus on a particular state node qt and calculate P(qt | y). • We take advantage of the conditional independencies and Bayes' rule. • [Figure: the fragment of the chain containing Qt and Qt+1 linked by A, with their outputs Yt and Yt+1.]
Inference • P(qt | y) ∝ P(y0, …, yt, qt) P(yt+1, …, yT | qt) = α(qt) β(qt), • where α(qt) = P(y0, …, yt, qt) is the probability of emitting the partial sequence of outputs y0, …, yt and ending at state qt, • and β(qt) = P(yt+1, …, yT | qt) is the probability of emitting the partial sequence of outputs yt+1, …, yT starting from state qt.
Inference • The problem is reduced to finding α and β. • We obtain a recursive relation between α(qt) and α(qt+1): α(qt+1) = P(yt+1 | qt+1) Σqt α(qt) aqt,qt+1, initialized with α(q0) = π(q0) P(y0 | q0). • The required time is O(M²T) and the algorithm proceeds forward in time. • Similarly, we obtain a backward recursion between β(qt) and β(qt+1): β(qt) = Σqt+1 aqt,qt+1 P(yt+1 | qt+1) β(qt+1), with β(qT) = 1. • To compute the posterior probabilities for all state nodes qt, we must compute the alphas and betas for every time step (a sketch follows below).
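Here is a minimal sketch of the α-β (forward-backward) recursions, reusing the hypothetical pi, A, B from the first block; the scaling or log-space arithmetic a practical implementation would need is omitted for brevity:

```python
def forward_backward(y):
    """Return alpha, beta arrays of shape (len(y), M) and the likelihood P(y)."""
    T1 = len(y)
    alpha = np.zeros((T1, M))
    beta = np.ones((T1, M))

    # Forward pass: alpha[t, j] = P(y_0, ..., y_t, q_t = j).
    alpha[0] = pi * B[:, y[0]]
    for t in range(T1 - 1):
        alpha[t + 1] = B[:, y[t + 1]] * (alpha[t] @ A)

    # Backward pass: beta[t, i] = P(y_{t+1}, ..., y_T | q_t = i), beta at the final step is 1.
    for t in range(T1 - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()        # P(y) = sum over q_T of alpha(q_T)
    return alpha, beta, likelihood

alpha, beta, like = forward_backward([0, 1, 0, 1])
posterior = alpha * beta / like         # P(q_t | y) for every time step
```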
Alternate inference algorithm • An alternative approach in which the backward phase is a recursion defined on the variable γ(qt) = P(qt | y). • The backward phase does not use the data yt; only the forward phase does, so we can discard the data as we filter.
Alternate inference algorithm • The γ recursion makes use of the α variables, so the α recursion must be computed first: γ(qt) = Σqt+1 γ(qt+1) α(qt) aqt,qt+1 / Σq′t α(q′t) aq′t,qt+1. • The data yt are not used in the γ recursion; the α recursion has already absorbed all the necessary likelihoods (see the sketch below).
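A minimal sketch of that α-γ backward pass, reusing forward_backward and the hypothetical parameters from the earlier blocks (this particular form of the recursion is written here as an assumption consistent with the slides):

```python
def gamma_recursion(y):
    """Backward pass on gamma(q_t) = P(q_t | y), using only the alphas."""
    alpha, _, like = forward_backward(y)
    T1 = len(y)
    gamma = np.zeros((T1, M))
    gamma[-1] = alpha[-1] / like        # gamma(q_T) = alpha(q_T) / P(y)
    for t in range(T1 - 2, -1, -1):
        # cond[i, j] = P(q_t = i | q_{t+1} = j, y_0..y_t)
        #            = alpha(q_t) a_{q_t,q_{t+1}} / sum_{q_t'} alpha(q_t') a_{q_t',q_{t+1}}
        cond = alpha[t][:, None] * A
        cond /= cond.sum(axis=0, keepdims=True)
        gamma[t] = cond @ gamma[t + 1]
    return gamma

print(gamma_recursion([0, 1, 0, 1]))
```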
Transition matrix • The α-β or α-γ algorithm provides us with the posterior probability of each individual state. • To estimate the state transition matrix A, we also need the matrix of co-occurrence probabilities P(qt, qt+1 | y). • We calculate ξ(qt, qt+1) = P(qt, qt+1 | y) = α(qt) aqt,qt+1 P(yt+1 | qt+1) β(qt+1) / P(y) from the alphas and betas (a sketch follows below).
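A minimal sketch of that ξ computation, again reusing the hypothetical parameters and forward_backward from the earlier blocks:

```python
def pairwise_posteriors(y):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | y), built from alphas and betas."""
    alpha, beta, like = forward_backward(y)
    T1 = len(y)
    xi = np.zeros((T1 - 1, M, M))
    for t in range(T1 - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, y[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= like
    return xi

xi = pairwise_posteriors([0, 1, 0, 1])
# In EM, A would be re-estimated from xi summed over t (not shown here).
```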
Junction Tree Connection • We can calculate all the posterior probabilities for an HMM recursively. • Given an observed sequence y, we run the α recursion forward in time. • If we require the likelihood, we simply sum the alphas at the final time step. • If we require the posterior probabilities of the states, we use either the β or the γ recursion.
Junction tree connection • An HMM is represented by the multinomial state variable Qt and the observable output variable yt. • The HMM is parameterized by an initial probability π and, for each subsequent state node, a transition matrix A, where aij = P(qt+1 = j | qt = i). • The output nodes are assigned the local conditional probability P(yt | qt). We assume that yt is a multinomial node, so that P(yt | qt) can be viewed as a matrix B. • To convert the HMM to a junction tree, we moralize, triangulate, and form the clique graph; we then choose a maximal spanning tree, which forms our junction tree.
Junction tree connection • [Figure: the moralized and triangulated graph.] • [Figure: the junction tree for the HMM with its potentials labeled.]
Junction tree Connection • The initial probability π(q0) as well as the conditional probability P(y0 | q0) is assigned to the potential Ψ(q0, y0), which implies that this potential is initially set to Ψ(q0, y0) = π(q0) P(y0 | q0). • The state-to-state potentials are given the assignment Ψ(qt, qt+1) = aqt,qt+1, the output potentials are assigned Ψ(qt, yt) = P(yt | qt), and the separator potentials are initialized to one (a sketch of this initialization is given below).
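A minimal sketch of that initialization for a sequence of length T + 1, reusing the hypothetical pi, A, B; the potentials are stored as dense tables purely for illustration:

```python
def init_potentials(T1):
    """Initialize the HMM junction-tree potentials; separators start at one."""
    psi_q0_y0 = pi[:, None] * B                       # psi(q0, y0) = pi(q0) P(y0 | q0)
    psi_state = [A.copy() for _ in range(T1 - 1)]     # psi(q_t, q_{t+1}) = a_{q_t, q_{t+1}}
    psi_output = [B.copy() for _ in range(1, T1)]     # psi(q_t, y_t) = P(y_t | q_t), t >= 1
    separators = [np.ones(M) for _ in range(T1 - 1)]  # separator potentials set to one
    return psi_q0_y0, psi_state, psi_output, separators
```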
Unconditional Inference • Let's consider inference before any evidence is observed; we designate the clique (qT-1, qT) as the root and collect to the root. • Consider the first operation of passing a message upward from an output clique (qt, yt) for t ≥ 1. • The marginalization yields Σyt Ψ(qt, yt) = Σyt P(yt | qt) = 1. • Thus the separator potential remains set at one. • This implies that the update factor is one, and the clique potential remains unchanged. • In general, the messages passed upward from the leaves have no effect when no evidence is observed.
Unconditional inference • Now consider the message from the clique (q0, y0) to (q0, q1): marginalizing over y0 gives Φ*(q0) = Σy0 π(q0) P(y0 | q0) = P(q0). • This transformation propagates forward along the chain, changing the separator potentials on qt into the marginals P(qt) and the clique potentials Ψ(qt, qt+1) into the marginals P(qt, qt+1). • A subsequent DistributeEvidence pass has no effect on the potentials along the backbone of the chain, but it converts the separators on the output branches into the marginals P(qt) and the potentials Ψ(qt, yt) into the marginals P(qt, yt) (see the sketch below).
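A minimal sketch of that forward propagation with no evidence, showing the separator potentials turning into the prior marginals P(qt); it reuses the hypothetical pi and A:

```python
def prior_marginals(T1):
    """With no evidence, forward message passing turns the separators into P(q_t)."""
    phi = pi.copy()                  # phi(q0) = P(q0) after the first message
    marginals = [phi]
    for _ in range(T1 - 1):
        # psi*(q_t, q_{t+1}) = phi(q_t) a_{q_t, q_{t+1}} = P(q_t, q_{t+1});
        # marginalizing out q_t gives the next separator, P(q_{t+1}).
        phi = phi @ A
        marginals.append(phi)
    return np.array(marginals)

print(prior_marginals(4))            # each row sums to one
```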
Unconditional inference • Thus all the potentials throughout the junction tree become marginal probabilities. • This result helps to clarify the representation of the joint probability as the product of the clique potentials divided by the product of the separator potentials.
Junction Tree Algorithm • Moralize if needed • Triangulate using any triangulation algorithm • Form the clique graph (clique nodes and separator nodes) • Compute the junction tree • Initialize all separator potentials to 1 • Phase 1: Collect from children • Phase 2: Distribute to children • Message from child C: Φ*(XS) = ΣXC\XS Ψ(XC) • Update at parent P: Ψ*(XP) = Ψ(XP) ∏S Φ*(XS) / Φ(XS) • Message from parent P: Φ**(XS) = ΣXP\XS Ψ*(XP) • Update at child C: Ψ*(XC) = Ψ(XC) Φ**(XS) / Φ*(XS) (a sketch of one collect/distribute exchange is given below)
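A minimal sketch of one collect/distribute exchange between a child clique C = (y, s) and a parent clique P = (s, z) with separator s, using small hypothetical tables (not a full junction-tree engine):

```python
import numpy as np

def collect_distribute(psi_C, psi_P, phi_S):
    """One collect + distribute exchange; the separator s is axis 1 of C and axis 0 of P."""
    # Collect: message from child C, then update at parent P.
    phi_S_star = psi_C.sum(axis=0)                             # phi*(s)   = sum_y psi(y, s)
    psi_P_star = psi_P * (phi_S_star / phi_S)[:, None]         # psi*(s,z) = psi(s,z) phi*(s)/phi(s)

    # Distribute: message from parent P, then update at child C.
    phi_S_2star = psi_P_star.sum(axis=1)                       # phi**(s)  = sum_z psi*(s, z)
    psi_C_star = psi_C * (phi_S_2star / phi_S_star)[None, :]   # psi*(y,s) = psi(y,s) phi**(s)/phi*(s)
    return psi_C_star, psi_P_star, phi_S_2star

psi_C = np.array([[0.2, 0.1], [0.3, 0.4]])   # psi(y, s), arbitrary illustrative values
psi_P = np.array([[0.5, 0.5], [0.9, 0.1]])   # psi(s, z)
print(collect_distribute(psi_C, psi_P, np.ones(2)))
```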
Introducing evidence • We now suppose that the outputs y are observed, and we wish to compute P(y) as well as marginal posterior probabilities such as P(qt | y) and P(qt, qt+1 | y). • We initialize the separator potentials to unity and recall that Ψ(qt, yt) can be viewed as a matrix B, with columns labeled by the possible values of yt. • In practice, conditioning on the observed value of yt amounts to selecting the corresponding column of B as the potential on qt. • We designate (qT-1, qT) as the root of the junction tree and collect to the root.
Collecting to the root • Consider the update of the clique (qt, qt+1), assuming that the separator potential Φ*(qt) has already been updated, and consider the computation of Ψ*(qt, qt+1) and Φ*(qt+1). • Ψ*(qt, qt+1) = Ψ(qt, qt+1) Φ*(qt) ς*(qt+1) = aqt,qt+1 Φ*(qt) P(yt+1 | qt+1), where ς*(qt+1) = P(yt+1 | qt+1) is the message from the output clique (qt+1, yt+1). • Φ*(qt+1) = Σqt Ψ*(qt, qt+1).
Collecting to the root • Proceeding forward along the chain, we obtain Φ*(qt+1) = P(yt+1 | qt+1) Σqt aqt,qt+1 Φ*(qt). • Defining α(qt) = Φ*(qt), we have recovered the alpha algorithm. • The collect phase of the algorithm terminates with the update of Ψ(qT-1, qT). In general the updated potential Ψ*(qt, qt+1) equals p(y0, …, yt+1, qt, qt+1), so marginalizing the root potential over qT-1 and qT yields the likelihood P(y).
Collecting to the root • If, instead of designating (qT-1, qT) as the root, we utilize (q0, q1) as the root, we obtain the beta algorithm. • It is not necessary to change the root of the junction tree to derive the beta algorithm: it also arises during the DistributeEvidence pass when (qT-1, qT) is the root.
Distributing from the root • In the second phase we distribute evidence from the root (qT-1, qT). • This phase proceeds backwards along the chain, updating the state-state as well as the state-output cliques.
Distribute from the root • We suppose that the separator potential Φ**(qt+1) has already been updated, and consider the update of Ψ**(qt, qt+1) and Φ**(qt): • Ψ**(qt, qt+1) = Ψ*(qt, qt+1) Φ**(qt+1) / Φ*(qt+1) and Φ**(qt) = Σqt+1 Ψ**(qt, qt+1). • Simplifying, and noting that Φ**(qt) is proportional to P(qt | y), we obtain the gamma recursion γ(qt) = Σqt+1 γ(qt+1) α(qt) aqt,qt+1 / Σq′t α(q′t) aq′t,qt+1.
Distribution from the root • By rearranging and simplifying, we can also derive the relationship between the alpha-beta and alpha-gamma recursions, namely γ(qt) = α(qt) β(qt) / P(y).