Hidden Markov Models
Two learning scenarios
• Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
• Estimation when the “right answer” is unknown
Examples:
GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x | θ)
1. When the right answer is known
Given x = x1…xN for which the true path π = π1…πN is known,
Define:
Akl = # times the k→l transition occurs in π
Ek(b) = # times state k in π emits b in x
We can show that the maximum likelihood parameters are:
akl = Akl / Σi Aki        ek(b) = Ek(b) / Σc Ek(c)
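As a concrete illustration of these counting formulas, here is a minimal Python sketch (not part of the slides; the dishonest-casino data below is made up) that derives akl and ek(b) from a labeled sequence:

```python
# Minimal sketch: maximum-likelihood HMM parameters from a labeled sequence.
# x is the list of emitted symbols, pi the known state path (same length).
from collections import Counter

def ml_parameters(x, pi):
    A = Counter(zip(pi, pi[1:]))      # Akl: number of k -> l transitions in pi
    E = Counter(zip(pi, x))           # Ek(b): number of times state k emits b
    states, symbols = sorted(set(pi)), sorted(set(x))
    a = {(k, l): A[k, l] / max(sum(A[k, m] for m in states), 1)
         for k in states for l in states}
    e = {(k, b): E[k, b] / max(sum(E[k, c] for c in symbols), 1)
         for k in states for b in symbols}
    return a, e

# Hypothetical casino data: 30 rolls, the first 14 with the fair die (F),
# the remaining 16 with the loaded die (L).
rolls = list("315116246446644245311321631164")
path  = list("F" * 14 + "L" * 16)
a, e = ml_parameters(rolls, path)
print(a[("F", "L")], e[("L", "6")])
```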
2. When the right answer is unknown
We don’t know the true Akl, Ek(b)
Idea:
• We estimate our “best guess” of what Akl, Ek(b) are
• We update the parameters of the model, based on our guess
• We repeat
2. When the right answer is unknown
Starting with our best guess of a model M with parameters θ:
Given x = x1…xN for which the true path π = π1…πN is unknown,
we can get to a provably more likely parameter set θ
Principle: EXPECTATION MAXIMIZATION
1. Estimate Akl, Ek(b) in the training data
2. Update θ according to Akl, Ek(b)
3. Repeat 1 & 2, until convergence
Estimating new parameters
To estimate Akl:
At each position i of sequence x, find the probability that the transition k→l is used:

P(πi = k, πi+1 = l | x) = [1/P(x)] P(πi = k, πi+1 = l, x1…xN) = Q / P(x)

where
Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
  = P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
  = P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
  = P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
  = bl(i+1) el(xi+1) akl fk(i)

So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
Estimating new parameters
• So,
Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
• Similarly,
Ek(b) = [1/P(x)] Σ{i | xi = b} fk(i) bk(i)
[Diagram: position i in state k followed by position i+1 in state l, combining the forward value fk(i) over x1…xi, the transition akl, the emission el(xi+1), and the backward value bl(i+1) over xi+2…xN.]
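A small code sketch of this update (an illustration, not from the slides): it assumes the forward and backward matrices and the total probability P(x | θ) have already been computed, with fk(i) accessed as f[k][i] and bk(i) as b[k][i].

```python
# Sketch: posterior expected counts Akl and Ek(b) for one training sequence x.
# Assumes precomputed forward f[k][i], backward b[k][i], and Px = P(x | theta).
def expected_counts(x, states, symbols, a, e, f, b, Px):
    N = len(x)
    A = {(k, l): 0.0 for k in states for l in states}
    E = {(k, c): 0.0 for k in states for c in symbols}
    for i in range(N):
        for k in states:
            # Ek(b): posterior probability that position i is in state k
            E[k, x[i]] += f[k][i] * b[k][i] / Px
            if i + 1 < N:
                for l in states:
                    # Akl term: fk(i) * akl * el(x_{i+1}) * bl(i+1) / P(x)
                    A[k, l] += f[k][i] * a[k, l] * e[l, x[i + 1]] * b[l][i + 1] / Px
    return A, E
```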
The Baum-Welch Algorithm
Initialization:
Pick the best-guess model parameters (or arbitrary ones)
Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b)
4. Calculate new model parameters akl, ek(b)
5. Calculate the new log-likelihood log P(x | θ)
   GUARANTEED TO BE HIGHER BY EXPECTATION MAXIMIZATION
Until P(x | θ) does not change much
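Putting the pieces together, here is a hedged sketch of the Baum-Welch loop; `forward`, `backward`, and `expected_counts` are assumed helpers (the last as sketched above), and a production version would work in log space or with scaling to avoid underflow.

```python
import math

# Sketch of the Baum-Welch outer loop (assumed helpers: forward, backward,
# expected_counts). Small pseudocounts keep the re-estimation well defined.
def baum_welch(x, states, symbols, a, e, tol=1e-6, max_iter=100, pseudo=1e-9):
    prev_ll = float("-inf")
    for _ in range(max_iter):
        f, Px = forward(x, states, a, e)        # f[k][i] and P(x | theta)
        b = backward(x, states, a, e)           # b[k][i]
        A, E = expected_counts(x, states, symbols, a, e, f, b, Px)
        # Re-estimate parameters exactly as in the "right answer known" case
        a = {(k, l): (A[k, l] + pseudo) / sum(A[k, m] + pseudo for m in states)
             for k in states for l in states}
        e = {(k, c): (E[k, c] + pseudo) / sum(E[k, c2] + pseudo for c2 in symbols)
             for k in states for c in symbols}
        ll = math.log(Px)                       # EM guarantees ll >= prev_ll
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return a, e
```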
The Baum-Welch Algorithm
Time complexity: (# iterations) × O(K²N)
• Guaranteed to increase the log-likelihood of the model
  P(θ | x) = P(x, θ) / P(x) = P(x | θ) P(θ) / P(x)
• Not guaranteed to find globally best parameters:
  converges to a local optimum, depending on initial conditions
• Too many parameters / too large a model: overtraining
Alternative: Viterbi Training
Initialization: Same
Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence
Notes:
• Convergence is guaranteed – Why?
• Does not maximize P(x | θ)
• In general, worse performance than Baum-Welch
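For comparison, a sketch of Viterbi training (the `viterbi` decoder is an assumed helper returning the best path π*). Because there are finitely many paths and each iteration is deterministic given the current path, the path must eventually stop changing, which is why convergence is guaranteed:

```python
# Sketch of Viterbi training: counts come from the single best path pi*
# (plus pseudocounts) rather than from posterior expectations.
def viterbi_training(x, states, symbols, a, e, pseudo=1.0, max_iter=50):
    prev_path = None
    for _ in range(max_iter):
        path = viterbi(x, states, a, e)          # assumed helper: best path pi*
        A = {(k, l): pseudo for k in states for l in states}
        E = {(k, c): pseudo for k in states for c in symbols}
        for i in range(len(x)):
            E[path[i], x[i]] += 1
            if i + 1 < len(x):
                A[path[i], path[i + 1]] += 1
        a = {(k, l): A[k, l] / sum(A[k, m] for m in states)
             for k in states for l in states}
        e = {(k, c): E[k, c] / sum(E[k, b] for b in symbols)
             for k in states for c in symbols}
        if path == prev_path:                    # the best path stopped changing
            break
        prev_path = path
    return a, e
```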
Higher-order HMMs
The Genetic Code: 3 nucleotides make 1 amino acid
Statistical dependencies in triplets
Question: Recognize protein-coding segments with an HMM
One way to model protein regions
Model P(xi xi+1 xi+2 | xi-1 xi xi+1)
Every state of the HMM emits 3 nucleotides
Transition probabilities: probability of one triplet, given the previous triplet: P(πi | πi-1)
Emission probabilities: deterministic, e.g. P(xi xi-1 xi-2 | πi) = 1/0 and P(xi-1 xi-2 xi-3 | πi-1) = 1/0
States: AAA, AAC, AAT, …, TTT (one state per triplet)
A more elegant way
Every state of the HMM emits 1 nucleotide
Transition probabilities: probability of one state, given the previous 3 states: P(πi | πi-1, πi-2, πi-3)
Emission probabilities: P(xi | πi)
Algorithms extend with small modifications
States: A, C, G, T
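The “small modifications” can be seen by expanding the state space; below is a sketch (an illustration, not from the slides) where each expanded state remembers the previous three nucleotide states, so the usual first-order algorithms apply unchanged.

```python
# Sketch: encode P(pi_i | pi_{i-1}, pi_{i-2}, pi_{i-3}) as a first-order HMM
# whose states are the 4^3 = 64 tuples of the last three nucleotide states.
from itertools import product

nucleotides = "ACGT"
expanded_states = list(product(nucleotides, repeat=3))   # e.g. ('A', 'C', 'G')

def expanded_a(a_high, prev, cur):
    """First-order transition probability between tuple states.

    prev = (k3, k2, k1) holds the last three states (oldest first) and
    cur = (k2, k1, k0) must overlap it; a_high maps (k3, k2, k1, k0) to
    P(k0 | k1, k2, k3). Non-overlapping tuples get probability 0.
    """
    if prev[1:] != cur[:-1]:
        return 0.0
    return a_high[prev + (cur[-1],)]
```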
Modeling the Duration of States
[Diagram: two states X and Y; X loops on itself with probability p and moves to Y with probability 1-p; Y loops with probability q and leaves with probability 1-q.]
Length distribution of region X: E[lX] = 1/(1-p)
• Geometric distribution, with mean 1/(1-p)
This is a significant disadvantage of HMMs
Several solutions exist for modeling different length distributions
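A quick simulation sketch (my own illustration, not from the slides) confirming the geometric behaviour: with self-loop probability p, the chance of staying in X for exactly d steps is p^(d-1)(1-p), so the average duration comes out to 1/(1-p).

```python
import random

# Sample how long the chain stays in X when X loops on itself with prob. p.
def sample_duration(p):
    d = 1
    while random.random() < p:   # stay in X with probability p
        d += 1
    return d

p = 0.9
durations = [sample_duration(p) for _ in range(100_000)]
print(sum(durations) / len(durations), "vs expected", 1 / (1 - p))   # ~10 vs 10
```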
Sol’n 1: Chain several states
[Diagram: several copies of state X chained in series, each with self-loop probability p and forward probability 1-p, before moving on to Y.]
Disadvantage: still very inflexible
lX = C + geometric with mean 1/(1-p)
Sol’n 2: Negative binomial distribution
[Diagram: a chain of n copies of state X, each with self-loop probability p and probability 1-p of moving to the next, followed by Y.]
Duration in X: m turns, where
• During the first m – 1 turns, exactly n – 1 arrows to the next state are followed
• During the m-th turn, an arrow to the next state is followed
P(lX = m) = (m–1 choose n–1) (1–p)^(n–1+1) p^((m–1)–(n–1)) = (m–1 choose n–1) (1–p)^n p^(m–n)
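A short numerical sketch of this formula (illustration only; the values of n and p are made up) showing it defines a proper length distribution with mean n/(1-p):

```python
from math import comb

# P(lX = m) for the n-state chain: choose which n-1 of the first m-1 turns
# follow a 1-p arrow, then the m-th turn also follows a 1-p arrow.
def neg_binom_duration(m, n, p):
    if m < n:
        return 0.0
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9
probs = [neg_binom_duration(m, n, p) for m in range(1, 1000)]
print(sum(probs))                                          # ~ 1.0
print(sum(m * q for m, q in zip(range(1, 1000), probs)))   # mean ~ n/(1-p) = 30
```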
Example: genes in prokaryotes
• EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A)
• Negative binomial with n = 3
Solution 3: Duration modeling
Upon entering a state:
1. Choose duration d, according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities
Disadvantage: increase in complexity
Time: O(D²)   Space: O(D)
where D = maximum duration of a state
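A generative sketch of one visit to an explicit-duration state (the dictionaries for duration, emission, and transition distributions are assumptions, not from the slides):

```python
import random

# One visit to an explicit-duration state: pick d, emit d letters, move on.
def visit_state(state, duration_probs, emission_probs, transition_probs):
    durations, dw = zip(*duration_probs[state].items())
    d = random.choices(durations, weights=dw)[0]          # 1. choose duration d
    letters, ew = zip(*emission_probs[state].items())
    emitted = random.choices(letters, weights=ew, k=d)    # 2. emit d letters
    targets, tw = zip(*transition_probs[state].items())
    nxt = random.choices(targets, weights=tw)[0]          # 3. next state
    return "".join(emitted), nxt
```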
A state model for alignment
[Diagram: three states, M (+1,+1), I (+1,0), and J (0,+1), with transitions among them.]
Alignments correspond 1-to-1 with sequences of states M, I, J
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
Let’s score the transitions
[Diagram: the same three states M (+1,+1), I (+1,0), J (0,+1); transitions into M score s(xi, yj), gap-opening transitions out of M score -d, and gap-extension transitions score -e.]
Alignments correspond 1-to-1 with sequences of states M, I, J
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
How do we find the optimal alignment according to this model?
Dynamic Programming:
M(i, j): Optimal alignment of x1…xi to y1…yj ending in M
I(i, j): Optimal alignment of x1…xi to y1…yj ending in I
J(i, j): Optimal alignment of x1…xi to y1…yj ending in J
The score is additive, therefore we can apply the DP recurrence formulas
Needleman-Wunsch with affine gaps – state version
Initialization:
M(0,0) = 0;  M(i,0) = M(0,j) = -∞, for i, j > 0
I(i,0) = d + i·e;  J(0,j) = d + j·e
Iteration:
M(i, j) = s(xi, yj) + max { M(i–1, j–1), I(i–1, j–1), J(i–1, j–1) }
I(i, j) = max { e + I(i–1, j), d + M(i–1, j) }
J(i, j) = max { e + J(i, j–1), d + M(i, j–1) }
Termination:
Optimal alignment given by max { M(m, n), I(m, n), J(m, n) }
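A runnable sketch of the recurrence above (my own conventions: d and e are positive penalties that are subtracted, the first gap position costs d and each extension costs e, and s is a simple match/mismatch score; only the optimal score is returned, not the traceback):

```python
NEG_INF = float("-inf")

def affine_gap_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score with affine gaps, state-version DP."""
    def s(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # gap in y (x_i emitted)
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # gap in x (y_j emitted)
    M[0][0] = 0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = s(x[i-1], y[j-1]) + max(M[i-1][j-1], I[i-1][j-1], J[i-1][j-1])
            I[i][j] = max(M[i-1][j] - d, I[i-1][j] - e)
            J[i][j] = max(M[i][j-1] - d, J[i][j-1] - e)
    return max(M[m][n], I[m][n], J[m][n])

print(affine_gap_score("AGGCTATCACCTG", "TAGCTATCACGAC"))
```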
Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution
Goals:
• Produce the most likely alignment
• Assert the likelihood that the sequences are indeed related