Hidden Markov Models Yves Moreau Katholieke Universiteit Leuven
Regular expressions • Alignment: ACA---ATG / TCAACTATC / ACAC--AGC / AGA---ATC / ACCG--ATC • Regular expression: [AT][CG][AC][ACGT]*A[TG][GC] • Problem: the regular expression does not distinguish the exceptional sequence TGCTAGG from the consensus sequence ACACATC
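To make the problem concrete, a minimal Python check (using the standard re module, with the pattern and sequences from the slide) shows that the regular expression accepts the exceptional sequence just as readily as the consensus:

```python
import re

# Pattern from the slide: [AT][CG][AC][ACGT]*A[TG][GC]
pattern = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

consensus = "ACACATC"    # consensus sequence of the alignment
exceptional = "TGCTAGG"  # exceptional sequence

# Both match: the regular expression cannot rank one above the other.
print(bool(pattern.fullmatch(consensus)))    # True
print(bool(pattern.fullmatch(exceptional)))  # True
```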
Sequence score • [Figure: hidden Markov model built from the alignment, with transition probabilities on the edges (1.0, 0.6, 0.4) and emission probabilities for A, C, G, T in each state]
Log odds • Use logarithms for scaling and normalize by a random model • Log odds for sequence S: [Figure: the same model with log-odds scores in place of the emission and transition probabilities]
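In the usual form, assuming a uniform random model with base probability 0.25, the log-odds score of a sequence S of length L is

```latex
\text{log-odds}(S)
  = \log \frac{P(S \mid \text{model})}{P(S \mid \text{random})}
  = \log P(S \mid \text{model}) - L \log 0.25
```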
Markov chain • Sequence x1,...,xL of nucleotides • Example of a Markov chain • Probabilistic model of a DNA sequence through transition probabilities • [Figure: four states A, C, G, T with transition probabilities between every pair]
Markov property • Probability of a sequence through Bayes’ rule • Markov property: “the future is a function only of the present and not of the past”
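In the usual notation, for a sequence x = x1,...,xL:

```latex
P(x) = P(x_L \mid x_{L-1},\dots,x_1)\, P(x_{L-1} \mid x_{L-2},\dots,x_1) \cdots P(x_1)

P(x_i \mid x_{i-1},\dots,x_1) = P(x_i \mid x_{i-1})
  \;\Rightarrow\;
P(x) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1})
```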
Beginning and end of a sequence • Computation of the probability is not homogeneous • The length distribution is not modeled: P(length = L) is left unspecified • Solution: model the beginning and the end of the sequence explicitly • The probability of observing a sequence of a given length then decreases with the length of the sequence • [Figure: the A, C, G, T Markov chain extended with a begin state (α) and an end state (ω)]
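Writing the begin/end state as 0 (so a0b starts a sequence and ab0 ends it), the sequence probability becomes

```latex
P(x) = a_{0\,x_1} \left( \prod_{i=2}^{L} a_{x_{i-1}\,x_i} \right) a_{x_L\,0}
```

Every additional symbol multiplies in another factor smaller than one, which is why longer sequences become less probable.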
Hidden Markov Model • In a hidden Markov model, we observe the symbol sequence x but we want to reconstruct the hidden state sequence (path π) • Transition probabilities akl (begin state α: a0l, end state ω: ak0) • Emission probabilities ek(b) • Joint probability of the sequence α, x1,...,xL, ω and the path π
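In the usual notation, with 0 denoting the begin/end state:

```latex
a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k),
\qquad
e_k(b) = P(x_i = b \mid \pi_i = k)

P(x,\pi) = a_{0\,\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i\,\pi_{i+1}},
\qquad \pi_{L+1} := 0
```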
Casino (I) – problem setup • The casino mostly uses a fair die but sometimes switches to a loaded die • We observe the outcome x of the successive throws but want to know when the die was fair or loaded (path π) • [Figure: two-state HMM; the Fair state emits 1–6 each with probability 1/6, the Loaded state emits 1–5 with probability 1/10 each and 6 with probability 1/2; P(Fair→Loaded) = 0.05, P(Fair→Fair) = 0.95, P(Loaded→Fair) = 0.1, P(Loaded→Loaded) = 0.9]
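A minimal Python sketch of this model (the variable names STATES, START, TRANS and EMIT are mine, and the uniform start distribution is an assumption not shown on the slide); the algorithm sketches further below reuse these dictionaries:

```python
import random

STATES = ["F", "L"]                    # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}           # assumed uniform start (not given on the slide)
TRANS = {"F": {"F": 0.95, "L": 0.05},  # switch fair -> loaded with probability 0.05
         "L": {"F": 0.10, "L": 0.90}}  # switch loaded -> fair with probability 0.10
EMIT = {"F": {s: 1 / 6 for s in "123456"},
        "L": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2}}

def sample(length, rng=random.Random(0)):
    """Generate (throws, hidden path) from the casino HMM."""
    state = rng.choices(STATES, weights=[START[s] for s in STATES])[0]
    throws, path = [], []
    for _ in range(length):
        path.append(state)
        symbols = list(EMIT[state])
        throws.append(rng.choices(symbols, weights=[EMIT[state][s] for s in symbols])[0])
        state = rng.choices(STATES, weights=[TRANS[state][s] for s in STATES])[0]
    return "".join(throws), "".join(path)
```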
The Viterbi algorithm • We look for the most probable path π* • This problem can be tackled by dynamic programming • Let us define vk(i) as the probability of the most probable path that ends in state k with the emission of symbol xi • Then we can compute this probability recursively as shown below
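In the usual notation, the recursion reads

```latex
v_l(i+1) = e_l(x_{i+1}) \, \max_k \bigl( v_k(i)\, a_{kl} \bigr)
```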
The Viterbi algorithm • The Viterbi algorithm grows the best path dynamically • Initial condition: sequence in the beginning state • Traceback pointers to follow the best path (= decoding)
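A minimal Viterbi sketch for the casino model, assuming the STATES, START, TRANS and EMIT dictionaries from the casino sketch above; it works in log space to avoid underflow (see the numerical stability slide):

```python
import math

def viterbi(x, states, start, trans, emit):
    """Most probable state path for observation string x (log-space Viterbi)."""
    v = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in states:
            # Best predecessor state k for state l at position i.
            best_k = max(states, key=lambda k: v[i - 1][k] + math.log(trans[k][l]))
            ptr[i][l] = best_k
            v[i][l] = v[i - 1][best_k] + math.log(trans[best_k][l]) + math.log(emit[l][x[i]])
    # Traceback from the best final state.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return "".join(reversed(path))

# Example: viterbi("266666612345", STATES, START, TRANS, EMIT)
```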
The forward algorithm • The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM • This is important for the computation of posterior probabilities and for the comparison of HMMs • The sum over all paths (exponentially many) can be computed by dynamic programming • Let us define fk(i) as the probability of the sequence prefix x1,...,xi for the paths that end in state k with the emission of symbol xi • Then we can compute this probability as shown below
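With fk(i) = P(x1,...,xi, πi = k), the recursion reads

```latex
f_k(i) = P(x_1,\dots,x_i,\ \pi_i = k),
\qquad
f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}
```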
The forward algorithm • The forward algorithm grows the total probability dynamically from the beginning to the end of the sequence • Initial condition: sequence in the beginning state • Termination: all states converge to the end state
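A minimal forward sketch, again assuming the casino dictionaries defined earlier; it uses plain probabilities and omits an explicit end state, so P(x) is obtained by summing over the final states:

```python
def forward(x, states, start, trans, emit):
    """Forward algorithm: returns (P(x), table f) with f[i][k] = P(x[:i+1], pi_i = k)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    # Termination: no explicit end state here, so P(x) is the sum over final states.
    return sum(f[-1][k] for k in states), f
```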
The backward algorithm • The backward algorithm lets us compute the probability of the complete sequence together with the condition that symbol xi is emitted from state k • This is important for computing the probability of a given state at symbol xi • P(x1,...,xi, πi = k) can be computed by the forward algorithm as fk(i) • Let us define bk(i) as the probability of the rest of the sequence for the paths that pass through state k at symbol xi
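In symbols:

```latex
b_k(i) = P(x_{i+1},\dots,x_L \mid \pi_i = k),
\qquad
P(x,\ \pi_i = k) = f_k(i)\, b_k(i)
```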
The backward algorithm • The backward algorithm grows the probability bk(i) dynamically backwards (from the end to the beginning of the sequence) • Boundary condition: start in the end state • Once both forward and backward probabilities are available, we can compute the posterior probability of a state
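In the usual notation:

```latex
b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1),
\qquad
P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}
```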
Posterior decoding • Instead of using the most probable path for decoding (Viterbi), we can use the path of the most probable states • The resulting path π̂ can be “illegal” (P(π̂|x) = 0) • This approach can also be used when we are interested in a function g(k) of the state (e.g., labeling)
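A posterior-decoding sketch that pairs with the forward sketch above (same assumed dictionaries and forward function; without an explicit end state the backward variable is initialized to 1 at the last position):

```python
def backward(x, states, trans, emit):
    """Backward algorithm: b[i][k] = P(x[i+1:] | pi_i = k)."""
    b = [{k: 1.0 for k in states} for _ in x]
    for i in range(len(x) - 2, -1, -1):
        for k in states:
            b[i][k] = sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
    return b

def posterior_decode(x, states, start, trans, emit):
    """Path of individually most probable states (may be an 'illegal' path)."""
    px, f = forward(x, states, start, trans, emit)
    b = backward(x, states, trans, emit)
    return "".join(max(states, key=lambda k: f[i][k] * b[i][k] / px) for i in range(len(x)))
```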
Casino (III) – posterior decoding • Posterior probability of the state “fair” w.r.t. the die throws
Casino (IV) – posterior decoding • New situation: P(πi+1 = FAIR | πi = FAIR) = 0.99 (instead of 0.95) • Viterbi decoding cannot detect the cheating from 1000 throws, while posterior decoding does • [Figure: the casino HMM with P(Fair→Loaded) = 0.01, P(Fair→Fair) = 0.99, P(Loaded→Fair) = 0.1, P(Loaded→Loaded) = 0.9; emission probabilities unchanged]
Choice of the architecture • For the parameter estimation, we assume that the architecture of the HMM is known • Choice of architecture is an essential design choice • Duration modeling • “Silent states” for gaps
Parameter estimation with known paths • HMM with parameters θ (transition and emission probabilities) • Training set D of N sequences x1,...,xN • The score of the model is the likelihood of the parameters given the training data
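Assuming independent training sequences, the score is

```latex
\operatorname{score}(\theta \mid D)
  = \log P(x^1,\dots,x^N \mid \theta)
  = \sum_{j=1}^{N} \log P(x^j \mid \theta)
```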
Parameter estimation with known paths • If the state paths are known, the parameters are estimated through counts (how often is a transition used, how often is a symbol produced by a given state) • Use of ‘pseudocounts’ if necessary • Akl= number of transitions from k to l in training set + pseudocount rkl • Ek(b) = number of emissions of b from k in training set + pseudocount rk(b)
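The resulting estimates are then

```latex
a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}},
\qquad
e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
```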
Parameter estimation with unknown paths: Viterbi training • Strategy: iterative method • Suppose that the parameters are known and find the best paths with Viterbi decoding • Use these paths, as if they were the true paths, to re-estimate the parameters by counting • Iterate until convergence • Viterbi training does not maximize the likelihood of the parameters • Viterbi training converges exactly in a finite number of steps
Parameter estimation with unknown paths: Baum-Welch training • Strategy: parallel to Viterbi training, but we use the expected values of the transition and emission counts (instead of using only the best path) • For the transitions and for the emissions, the expected counts are given below
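In the usual notation, with superscript j indexing the training sequences:

```latex
A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^{\,j}(i)\, a_{kl}\, e_l(x_{i+1}^{\,j})\, b_l^{\,j}(i+1)

E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \,:\, x_i^{j} = b\}} f_k^{\,j}(i)\, b_k^{\,j}(i)
```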
Parameter estimation with unknown paths: Baum-Welch training • Initialization: choose arbitrary model parameters • Recursion: • Set all transition and emission variables to their pseudocounts • For all sequences j = 1,...,n • Compute fk(i) for sequence j with the forward algorithm • Compute bk(i) for sequence j with the backward algorithm • Add the contributions to A and E • Compute the new model parameters akl = Akl / Σl' Akl' and ek(b) = Ek(b) / Σb' Ek(b') • Compute the log-likelihood of the model • Termination: stop when the log-likelihood changes by less than some threshold or when the maximum number of iterations is exceeded
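A single Baum-Welch update for the casino model, as a minimal sketch: it reuses the forward and backward sketches above, handles one training sequence, omits pseudocounts and the begin/end state, and re-estimates only the transition and emission probabilities:

```python
import math

def baum_welch_step(x, states, start, trans, emit):
    """One Baum-Welch update: returns (new_trans, new_emit, log-likelihood)."""
    px, f = forward(x, states, start, trans, emit)
    b = backward(x, states, trans, emit)
    # Expected transition and emission counts (no pseudocounts in this sketch).
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in emit[k]} for k in states}
    for i in range(len(x)):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k] / px
            if i + 1 < len(x):
                for l in states:
                    A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit = {k: {s: E[k][s] / sum(E[k].values()) for s in emit[k]} for k in states}
    return new_trans, new_emit, math.log(px)
```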
Casino (V) – Baum-Welch training • [Figure: the original casino model next to the models re-estimated by Baum-Welch from 300 throws and from 30,000 throws, showing the estimated transition probabilities and the estimated emission probabilities of the Fair and Loaded states]
Numerical stability • Many expressions contain products of many probabilities • This causes underflow when we compute these expressions • For Viterbi, this can be solved by working with logarithms • For the forward and backward algorithms, we can work with an approximation to the logarithm of a sum or with rescaled variables
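For the forward and backward sums, one common device is to keep log values and use

```latex
\log(p + q) = \log p + \log\bigl(1 + e^{\log q - \log p}\bigr)
```

where the exponential term can be tabulated or approximated.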
Summary • Hidden Markov Models • Computation of sequence and state probabilities • Viterbi computation of the best state path • The forward algorithm for the computation of the probability of a sequence • The backward algorithm for the computation of state probabilities • Parameter estimation for HMMs • Parameter estimation with known paths • Parameter estimation with unknown paths • Viterbi training • Baum-Welch training