Hidden Markov Models, Part 2 (CBB 231 / COMPSCI 261)
Recall: Training Unambiguous Models
transitions: Pt(qj | qi) = Ai,j / Σj' Ai,j'
emissions: Pe(sk | qi) = Ei,k / Σk' Ei,k'
where Ai,j is the number of observed transitions from qi to qj and Ei,k is the number of times state qi emits symbol sk in the labeled training data.
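As a concrete illustration of labeled-sequence training (a sketch, not code from the slides), the count-and-normalize step might look like this in Python/NumPy; sequences and state paths are assumed to be lists of integer indices, and the name train_labeled and the pseudocount argument are my own additions.

import numpy as np

def train_labeled(sequences, paths, n_states, n_symbols, pseudocount=0.0):
    """Labeled-sequence training: count the observed transitions Ai,j and
    emissions Ei,k along the known state paths, then normalize each row."""
    A = np.full((n_states, n_states), pseudocount, dtype=float)
    E = np.full((n_states, n_symbols), pseudocount, dtype=float)
    for x, path in zip(sequences, paths):
        for k, (state, symbol) in enumerate(zip(path, x)):
            E[state, symbol] += 1                      # this state emits this symbol
            if k + 1 < len(path):
                A[state, path[k + 1]] += 1             # transition to the next state
    trans = A / A.sum(axis=1, keepdims=True)           # Pt(qj | qi)
    emit = E / E.sum(axis=1, keepdims=True)            # Pe(sk | qi)
    return trans, emit

A positive pseudocount keeps the rows for rarely visited states from being all zero when normalizing.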
Training Ambiguous Models • The Problem: • We have training sequences, but not the associated paths (state labels) • Two Solutions: • 1. Viterbi Training: Start with random HMM parameters. Use Viterbi to find the most probable path for each training sequence, and then label the sequence with that path. Use labeled sequence training on the resulting set of sequences and paths. • 2. Baum-Welch Training: sum over all possible paths (rather than the single most probable one) to estimate expected counts Ai,j and Ei,k; then apply the same formulas as for labeled sequence training (previous slide) to these expected counts.
Viterbi Training (flowchart): starting from an initial submodel M, repeat n times: iterate over the training sequences, use the Viterbi Decoder to find the most probable path for each sequence, use the Sequence Labeler to label the sequence with that path, and pass the labeled features and paths to the Labeled Sequence Trainer to produce a new submodel M. The last iteration yields the final submodel M*.
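A minimal sketch of this loop, reusing the train_labeled sketch above; the Viterbi decoder included here is a generic log-space implementation, not the course's own code, and the initial-state distribution pi is assumed as an extra model parameter that is not re-estimated.

import numpy as np

def viterbi_decode(pi, trans, emit, x):
    """Most probable state path for sequence x (computed in log space)."""
    with np.errstate(divide='ignore'):                  # log(0) = -inf is acceptable here
        log_pi, log_t, log_e = np.log(pi), np.log(trans), np.log(emit)
    N, L = len(pi), len(x)
    V = np.full((N, L), -np.inf)                        # V[i, k]: best log-prob ending in qi at position k
    back = np.zeros((N, L), dtype=int)                  # backpointers
    V[:, 0] = log_pi + log_e[:, x[0]]
    for k in range(1, L):
        scores = V[:, k - 1][:, None] + log_t           # scores[j, i]: come from qj, move to qi
        back[:, k] = scores.argmax(axis=0)
        V[:, k] = scores.max(axis=0) + log_e[:, x[k]]
    path = [int(V[:, L - 1].argmax())]
    for k in range(L - 1, 0, -1):                       # trace the backpointers
        path.append(int(back[path[-1], k]))
    return path[::-1]

def viterbi_training(sequences, pi, trans, emit, n_iterations=10):
    """Viterbi training: relabel each sequence with its most probable path,
    then refit the parameters by labeled-sequence training."""
    n_states, n_symbols = emit.shape
    for _ in range(n_iterations):
        paths = [viterbi_decode(pi, trans, emit, x) for x in sequences]
        trans, emit = train_labeled(sequences, paths, n_states, n_symbols, pseudocount=1.0)
    return trans, emit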
Recall: The Forward Algorithm F(i,k) represents the probability P(x0...xk-1, qi) that the machine emits the subsequence x0...xk-1 by any path ending in state qi, i.e., so that symbol xk-1 is emitted by state qi. It is computed by the recurrence
F(i,k) = Pe(xk-1 | qi) Σj F(j,k-1) Pt(qi | qj),
and the probability of the full sequence is P(S) = Σi F(i,L) (times the termination probability, if the model has one).
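A sketch of this recursion in Python/NumPy, under the assumption that the model is represented by an initial-state distribution pi, a transition matrix trans with trans[i, j] = Pt(qj | qi), and an emission matrix emit with emit[i, k] = Pe(sk | qi); any silent start state is folded into pi here.

import numpy as np

def forward(pi, trans, emit, x):
    """F[i, k] = P(x0...xk-1, path ends in state qi), for k = 1..L,
    so column k covers the length-k prefix and xk-1 is emitted by qi."""
    N, L = len(pi), len(x)
    F = np.zeros((N, L + 1))                             # column 0 (empty prefix) is unused
    F[:, 1] = pi * emit[:, x[0]]                         # first symbol
    for k in range(2, L + 1):
        # sum over all predecessors qj: F[j, k-1] * Pt(qi|qj), then emit xk-1 from qi
        F[:, k] = (F[:, k - 1] @ trans) * emit[:, x[k - 1]]
    return F                                             # P(S) = F[:, L].sum(), ignoring termination probs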
The Backward Algorithm B(i,k) = probability that the machine M will emit the subsequence xk...xL-1 and then terminate, given that M is currently in state qi (which has already emitted xk-1). It is computed by the recurrence
B(i,k) = Σj Pt(qj | qi) Pe(xk | qj) B(j,k+1),
working right-to-left from the base case B(i,L) = 1 (or the termination probability of qi, if the model has one). THEREFORE: multiplying F(i,k) by B(i,k) gives the probability that M emits the entire sequence with xk-1 emitted by state qi (next slide).
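A matching sketch of the backward recursion, using the same assumed pi/trans/emit representation as the forward sketch; if the model has explicit termination probabilities, they would replace the all-ones base case.

import numpy as np

def backward(trans, emit, x, term=None):
    """B[i, k] = P(xk...xL-1 | current state is qi), for k = L down to 1;
    column L is the empty suffix (probability 1, or the termination prob if given)."""
    N, L = trans.shape[0], len(x)
    B = np.zeros((N, L + 1))
    B[:, L] = 1.0 if term is None else term              # base case: nothing left to emit
    for k in range(L - 1, 0, -1):
        # sum over all successors qj: Pt(qj|qi) * Pe(xk|qj) * B[j, k+1]
        B[:, k] = trans @ (emit[:, x[k]] * B[:, k + 1])
    return B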
Baum-Welch: Summing over All Paths
(figure: the Forward matrix sums over all left paths into position k-1; the Backward matrix sums over all right paths out of it)
F(i,k)B(i,k) = P(M emits x0...xk-1...xL-1, with xk-1 being emitted by state qi).
F(i,k) Pt(qj|qi) Pe(xk|qj) B(j,k+1) = P(M emits x0...xk-1xk...xL-1 and transitions from state qi to qj at time k-1→k).
Combining Forward & Backward
F(i,k) = P(x0...xk-1, qi) = P(M emits x0...xk-1 by any path ending in state qi, with xk-1 emitted by qi).
B(i,k) = P(xk...xL-1 | qi) = P(M emits xk...xL-1 and then terminates, given that M is in state qi, which has emitted xk-1).
F(i,k)B(i,k) = P(x0...xk-1, qi) P(xk...xL-1 | qi) = P(x0...xL-1, qi)*
F(i,k)B(i,k)/P(S) = P(qi emitted xk-1 | S)
These posteriors give the expected counts used in the re-estimation formulas:
E(q,s) = Σi Σk [F(i,k)B(i,k)/P(S)] C(qi,k), where C(qi,k) = 1 if qi = q and xk-1 = s; otherwise 0.
A(qm,qn) = Σi Σj Σk [F(i,k) Pt(qj|qi) Pe(xk|qj) B(j,k+1)/P(S)] C(qi,qj,k), where C(qi,qj,k) = 1 if qi = qm and qj = qn; otherwise 0.
*assuming P(xk...xL-1) is conditionally independent of P(x0...xk-1), given qi.
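A small sketch of how the identity F(i,k)B(i,k)/P(S) is used in practice to obtain per-symbol posterior state probabilities, reusing the forward and backward sketches above (the function name is mine).

def posterior_state_probs(pi, trans, emit, x):
    """gamma[i, k-1] = P(symbol xk-1 was emitted by state qi | S)
                     = F(i, k) * B(i, k) / P(S)."""
    F = forward(pi, trans, emit, x)                      # forward sketch above
    B = backward(trans, emit, x)                         # backward sketch above
    P_S = F[:, len(x)].sum()                             # P(S): total probability of the sequence
    return (F[:, 1:] * B[:, 1:]) / P_S                   # one column per symbol position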
Baum-Welch Training (outline): for each training sequence, compute the Forward & Backward DP matrices, then accumulate the expected counts for E & A.
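One Baum-Welch iteration might be sketched as follows, reusing the forward and backward sketches above; the expected counts E and A accumulated here are the quantities defined on the previous slide, and the initial-state distribution is left un-re-estimated for simplicity.

import numpy as np

def baum_welch_step(pi, trans, emit, sequences):
    """One EM iteration: accumulate expected transition counts A[i, j] and
    expected emission counts E[i, k] over all paths, then renormalize."""
    N, M = emit.shape
    A = np.zeros((N, N))
    E = np.zeros((N, M))
    for x in sequences:
        L = len(x)
        F = forward(pi, trans, emit, x)
        B = backward(trans, emit, x)
        P_S = F[:, L].sum()
        for k in range(1, L + 1):
            # expected emission counts: P(qi emitted xk-1 | S)
            E[:, x[k - 1]] += F[:, k] * B[:, k] / P_S
            if k < L:
                # expected transition counts: F(i,k) Pt(qj|qi) Pe(xk|qj) B(j,k+1) / P(S)
                A += np.outer(F[:, k], emit[:, x[k]] * B[:, k + 1]) * trans / P_S
    # re-estimate parameters with the same formulas as labeled-sequence training
    new_trans = A / A.sum(axis=1, keepdims=True)
    new_emit = E / E.sum(axis=1, keepdims=True)
    return new_trans, new_emit

In practice a small pseudocount would be added to A and E before normalizing, to avoid zero rows for states that receive negligible expected counts.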
Using Logarithms in Forward & Backward
log(p0 + p1 + ... + pn) = log p0 + log(1 + Σi≥1 e^(log pi - log p0))      (6.43)
In the log-space version of these algorithms, we can replace the raw probabilities pi with their logarithmic counterparts, log pi, and apply Equation (6.43) whenever probabilities are to be summed. Evaluation of the e^(log pi - log p0) term should generally not result in numerical underflow in practice, since this term evaluates to pi/p0, which for probabilities of similar events should not deviate too far from unity. (due to Kingsbury & Rayner, 1971)
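A sketch of the summation trick, with p0 taken to be the largest of the probabilities so that every exponent is at most zero:

import math

def log_sum(log_probs):
    """Compute log(p0 + p1 + ...) from the log pi values without underflow,
    using log Σ pi = log pmax + log(Σ e^(log pi - log pmax))."""
    log_p_max = max(log_probs)
    if log_p_max == float('-inf'):
        return float('-inf')                             # all probabilities are zero
    return log_p_max + math.log(sum(math.exp(lp - log_p_max) for lp in log_probs))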
Posterior Decoding
(figure) To score a putative feature such as an exon, sum over all left parses with the Forward algorithm, follow the fixed path through the feature itself, and sum over all right parses with the Backward algorithm.
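A sketch of this computation, reusing the forward and backward sketches above; the half-open [start, end) indexing and the assumption that path[m] emits symbol x[start + m] are my own conventions.

def feature_posterior(pi, trans, emit, x, path, start, end):
    """Posterior probability that the fixed state path generated x[start:end]:
    Forward sums over all left parses, Backward over all right parses, and
    the product of fixed-path terms covers the feature itself."""
    F = forward(pi, trans, emit, x)
    B = backward(trans, emit, x)
    P_S = F[:, len(x)].sum()                             # P(S)
    p = F[path[0], start + 1]                            # all left parses up to x[start]
    for m in range(1, end - start):                      # fixed path through the feature
        p *= trans[path[m - 1], path[m]] * emit[path[m], x[start + m]]
    p *= B[path[-1], end]                                # all right parses after x[end-1]
    return p / P_S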
Summary • Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm • Viterbi training performs labeling using the single most probable path • Baum-Welch training instead estimates expected transition & emission counts by computing expectations with Forward-Backward, summing over all paths containing a given event • Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state